* [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration
@ 2010-09-11 14:04 Anthony Liguori
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files Anthony Liguori
                   ` (3 more replies)
  0 siblings, 4 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-11 14:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Stefan Hajnoczi, Juan Quintela

Today, live migration only works when using raw images on shared storage that
is fully cache coherent.

The failure case with weak coherence (e.g. NFS) is subtle but nonetheless still
exists.  NFS only guarantees close-to-open coherence, and when performing a live
migration, we do an open on the source and an open on the destination.  We
fsync() on the source before launching the destination, but since we have two
simultaneous opens, we're not guaranteed coherence.

This is not necessarily a problem except that we are a bit gratuitous in reading
from the disk before launching a guest.  This means that, as things stand today,
we're guaranteed to read the first 64k of the disk, and as such, if a client
writes to that region during live migration, corruption will result.

The second failure condition has to do with image files (such as qcow2).  Today,
we aggressively cache metadata in all image formats and that cache is definitely
not coherent even with fully coherent shared storage.

In all image formats, we prefetch at least the L1 table in open(), which means
that if there is a write operation that causes a modification to an L1 table,
corruption will ensue.

This series attempts to address both of these issues.  Technically, if an NFS
client prefetches aggressively, this solution is not enough, but in practice
Linux doesn't do that.

I need some help with the qcow2 metadata invalidation.  We need to delay the
loading of the L1 table and the refcount table, but today we only do this
synchronously.  I think we can just do this on demand, but I'd still like
a second opinion.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
  2010-09-11 14:04 [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Anthony Liguori
@ 2010-09-11 14:04 ` Anthony Liguori
  2010-09-12 10:37   ` Avi Kivity
                     ` (2 more replies)
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing Anthony Liguori
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-11 14:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, Juan Quintela

Image files have two types of data: immutable data that describes things like
image size, backing files, etc., and mutable data that includes offset and
reference count tables.

Today, image formats aggressively cache mutable data to improve performance.  In
some cases, this happens before a guest even starts.  When dealing with live
migration, since a file is open on two machines, the caching of metadata can
lead to data corruption.

This patch addresses this by introducing a mechanism to invalidate any cached
mutable data a block driver may have which is then used by the live migration
code.

NB, this still requires coherent shared storage.  Addressing migration without
coherent shared storage (e.g. NFS) requires additional work.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

diff --git a/block.c b/block.c
index ebbc376..cd2ee31 100644
--- a/block.c
+++ b/block.c
@@ -1453,6 +1453,22 @@ const char *bdrv_get_device_name(BlockDriverState *bs)
     return bs->device_name;
 }
 
+void bdrv_invalidate_cache(BlockDriverState *bs)
+{
+    if (bs->drv && bs->drv->bdrv_invalidate_cache) {
+        bs->drv->bdrv_invalidate_cache(bs);
+    }
+}
+
+void bdrv_invalidate_cache_all(void)
+{
+    BlockDriverState *bs;
+
+    QTAILQ_FOREACH(bs, &bdrv_states, list) {
+        bdrv_invalidate_cache(bs);
+    }
+}
+
 void bdrv_flush(BlockDriverState *bs)
 {
     if (bs->open_flags & BDRV_O_NO_FLUSH) {
diff --git a/block.h b/block.h
index 5f64380..387f6d3 100644
--- a/block.h
+++ b/block.h
@@ -141,6 +141,10 @@ BlockDriverAIOCB *bdrv_aio_ioctl(BlockDriverState *bs,
         unsigned long int req, void *buf,
         BlockDriverCompletionFunc *cb, void *opaque);
 
+/* Invalidate any cached metadata used by image formats */
+void bdrv_invalidate_cache(BlockDriverState *bs);
+void bdrv_invalidate_cache_all(void);
+
 /* Ensure contents are flushed to disk.  */
 void bdrv_flush(BlockDriverState *bs);
 void bdrv_flush_all(void);
diff --git a/block_int.h b/block_int.h
index e8e7156..bca99b2 100644
--- a/block_int.h
+++ b/block_int.h
@@ -60,6 +60,7 @@ struct BlockDriver {
     void (*bdrv_close)(BlockDriverState *bs);
     int (*bdrv_create)(const char *filename, QEMUOptionParameter *options);
     void (*bdrv_flush)(BlockDriverState *bs);
+    void (*bdrv_invalidate_cache)(BlockDriverState *bs);
     int (*bdrv_is_allocated)(BlockDriverState *bs, int64_t sector_num,
                              int nb_sectors, int *pnum);
     int (*bdrv_set_key)(BlockDriverState *bs, const char *key);
diff --git a/migration.c b/migration.c
index 468d517..64d3d15 100644
--- a/migration.c
+++ b/migration.c
@@ -69,6 +69,9 @@ void process_incoming_migration(QEMUFile *f)
 
     incoming_expected = false;
 
+    /* Make sure all file formats flush their mutable metadata */
+    bdrv_invalidate_cache_all();
+
     if (autostart)
         vm_start();
 }
@@ -370,6 +373,7 @@ void migrate_fd_put_ready(void *opaque)
 
         qemu_aio_flush();
         bdrv_flush_all();
+        bdrv_invalidate_cache_all();
         if ((qemu_savevm_state_complete(s->mon, s->file)) < 0) {
             if (old_vm_running) {
                 vm_start();
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-11 14:04 [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Anthony Liguori
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files Anthony Liguori
@ 2010-09-11 14:04 ` Anthony Liguori
  2010-09-11 16:53   ` Stefan Hajnoczi
                     ` (2 more replies)
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts Anthony Liguori
  2010-09-12 10:46 ` [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Avi Kivity
  3 siblings, 3 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-11 14:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, Juan Quintela

The use of protocols in backing_files is currently broken because of some
checks for adjusting relative pathnames.

Additionally, there's a spurious read when using the nbd protocol that can be
quite destructive when using copy-on-read.  Potentially, this can lead to
probing an image file on top of NBD, but this is completely wrong, as NBD
devices are not growable.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
---
NB: this is absolutely not ideal.  A more elegant suggestion would be
appreciated.  I don't think NBD cleanly fits the model of a protocol as it
stands today.

diff --git a/block.c b/block.c
index cd2ee31..a32d5dd 100644
--- a/block.c
+++ b/block.c
@@ -344,6 +344,12 @@ static int find_image_format(const char *filename, BlockDriver **pdrv)
         return ret;
     }
 
+    if (strcmp(bs->drv->protocol_name, "nbd") == 0) {
+        drv = bs->drv;
+        bdrv_delete(bs);
+        goto out;
+    }
+
     /* Return the raw BlockDriver * to scsi-generic devices or empty drives */
     if (bs->sg || !bdrv_is_inserted(bs)) {
         bdrv_delete(bs);
@@ -373,6 +379,7 @@ static int find_image_format(const char *filename, BlockDriver **pdrv)
             }
         }
     }
+out:
     if (!drv) {
         ret = -ENOENT;
     }
@@ -603,10 +610,16 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
         BlockDriver *back_drv = NULL;
 
         bs->backing_hd = bdrv_new("");
-        path_combine(backing_filename, sizeof(backing_filename),
-                     filename, bs->backing_file);
-        if (bs->backing_format[0] != '\0')
-            back_drv = bdrv_find_format(bs->backing_format);
+        back_drv = bdrv_find_protocol(bs->backing_file);
+        if (!back_drv) {
+            path_combine(backing_filename, sizeof(backing_filename),
+                         filename, bs->backing_file);
+            if (bs->backing_format[0] != '\0')
+                back_drv = bdrv_find_format(bs->backing_format);
+        } else {
+            pstrcpy(backing_filename, sizeof(backing_filename),
+                    bs->backing_file);
+        }
 
         /* backing files always opened read-only */
         back_flags =
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-11 14:04 [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Anthony Liguori
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files Anthony Liguori
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing Anthony Liguori
@ 2010-09-11 14:04 ` Anthony Liguori
  2010-09-11 17:24   ` Stefan Hajnoczi
                     ` (2 more replies)
  2010-09-12 10:46 ` [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Avi Kivity
  3 siblings, 3 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-11 14:04 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, Anthony Liguori, Stefan Hajnoczi, Juan Quintela

This fixes a couple nasty problems relating to live migration.

1) When dealing with shared storage with weak coherence (e.g. NFS), even if
   we re-read, we may end up with undesired caching.  By delaying any reads
   until we absolutely have to, we decrease the likelihood of any undesirable
   caching.

2) When dealing with copy-on-read, the local storage acts as a cache.  We need
   to make sure to avoid any reads to avoid polluting the local cache.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 1e466d1..57d8db3 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -94,6 +94,33 @@ static void put_le16(uint16_t *p, unsigned int v)
     *p = cpu_to_le16(v);
 }
 
+static int guess_geometry(IDEState *s)
+{
+    int cylinders, heads, secs;
+
+    if (s->cylinders != 0) {
+        return -1;
+    }
+
+    bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
+    if (cylinders < 1 || cylinders > 16383) {
+        error_report("cyls must be between 1 and 16383");
+        return -1;
+    }
+    if (heads < 1 || heads > 16) {
+        error_report("heads must be between 1 and 16");
+        return -1;
+    }
+    if (secs < 1 || secs > 63) {
+        error_report("secs must be between 1 and 63");
+        return -1;
+    }
+    s->cylinders = cylinders;
+    s->heads = heads;
+    s->sectors = secs;
+    return 0;
+}
+
 static void ide_identify(IDEState *s)
 {
     uint16_t *p;
@@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
 	return;
     }
 
+    guess_geometry(s);
+
     memset(s->io_buffer, 0, 512);
     p = (uint16_t *)s->io_buffer;
     put_le16(p + 0, 0x0040);
@@ -2614,27 +2643,13 @@ void ide_bus_reset(IDEBus *bus)
 int ide_init_drive(IDEState *s, BlockDriverState *bs,
                    const char *version, const char *serial)
 {
-    int cylinders, heads, secs;
     uint64_t nb_sectors;
 
     s->bs = bs;
-    bdrv_get_geometry(bs, &nb_sectors);
-    bdrv_guess_geometry(bs, &cylinders, &heads, &secs);
-    if (cylinders < 1 || cylinders > 16383) {
-        error_report("cyls must be between 1 and 16383");
-        return -1;
-    }
-    if (heads < 1 || heads > 16) {
-        error_report("heads must be between 1 and 16");
-        return -1;
-    }
-    if (secs < 1 || secs > 63) {
-        error_report("secs must be between 1 and 63");
-        return -1;
-    }
-    s->cylinders = cylinders;
-    s->heads = heads;
-    s->sectors = secs;
+    bdrv_get_geometry(s->bs, &nb_sectors);
+    s->cylinders = 0;
+    s->heads = 0;
+    s->sectors = 0;
     s->nb_sectors = nb_sectors;
     /* The SMART values should be preserved across power cycles
        but they aren't.  */
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index bd6bbe6..0bf17ec 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -427,6 +427,10 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
 
     bdrv_get_geometry(s->bs, &capacity);
     bdrv_get_geometry_hint(s->bs, &cylinders, &heads, &secs);
+    if (cylinders == 0) {
+        bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
+    }
+
     memset(&blkcfg, 0, sizeof(blkcfg));
     stq_raw(&blkcfg.capacity, capacity);
     stl_raw(&blkcfg.seg_max, 128 - 2);
@@ -501,7 +505,6 @@ static int virtio_blk_load(QEMUFile *f, void *opaque, int version_id)
 VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf)
 {
     VirtIOBlock *s;
-    int cylinders, heads, secs;
     static int virtio_blk_id;
     DriveInfo *dinfo;
 
@@ -525,7 +528,6 @@ VirtIODevice *virtio_blk_init(DeviceState *dev, BlockConf *conf)
     s->conf = conf;
     s->rq = NULL;
     s->sector_mask = (s->conf->logical_block_size / BDRV_SECTOR_SIZE) - 1;
-    bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
 
     /* NB: per existing s/n string convention the string is terminated
      * by '\0' only when less than sizeof (s->sn)
-- 
1.7.0.4

^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing Anthony Liguori
@ 2010-09-11 16:53   ` Stefan Hajnoczi
  2010-09-11 17:27     ` Anthony Liguori
  2010-09-15 16:06   ` [Qemu-devel] " Juan Quintela
  2010-09-16  8:08   ` Kevin Wolf
  2 siblings, 1 reply; 51+ messages in thread
From: Stefan Hajnoczi @ 2010-09-11 16:53 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Juan Quintela, qemu-devel, Stefan Hajnoczi

On Sat, Sep 11, 2010 at 3:04 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> Additionally, there's a spurious read when using an nbd protocol that can be
> quite destructive when using copy-on-read.  Potentially, this can lead to
> probing an image file over top of NBD but this is completely wrong as NBD
> devices are not growable.

Can you describe the copy-on-read scenario where the 2 KB probe read
is a problem?

Accessing a fixed size image file over NBD is probably uncommon, but
I'm not sure if there's a reason to forbid it.

Stefan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts Anthony Liguori
@ 2010-09-11 17:24   ` Stefan Hajnoczi
  2010-09-11 17:34     ` Anthony Liguori
  2010-09-12 10:42   ` Avi Kivity
  2010-09-13  8:32   ` Kevin Wolf
  2 siblings, 1 reply; 51+ messages in thread
From: Stefan Hajnoczi @ 2010-09-11 17:24 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Juan Quintela, qemu-devel, Stefan Hajnoczi

On Sat, Sep 11, 2010 at 3:04 PM, Anthony Liguori <aliguori@us.ibm.com> wrote:
> This fixes a couple nasty problems relating to live migration.
>
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>   we re-read, we may end up with undesired caching.  By delaying any reads
>   until we absolutely have to, we decrease the likelihood of any undesirable
>   caching.
>
> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>   to make sure to avoid any reads to avoid polluting the local cache.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index 1e466d1..57d8db3 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>        return;
>     }
>
> +    guess_geometry(s);
> +

Does the same change need to be made in ide_cfata_identify()?

I quickly checked the VMStateDescription and don't see cylinders,
heads, sectors being saved for migration.  I am concerned that IDE
will break after migration if the following happens:
1. Guest resumes and does not issue ATA IDENTIFY so cylinders, heads,
sectors are not initialized.
2. Normal I/O is performed, invoking ide_get_sector() which uses
geometry information that has not been initialized.

Did I miss something?

> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index bd6bbe6..0bf17ec 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -427,6 +427,10 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>
>     bdrv_get_geometry(s->bs, &capacity);
>     bdrv_get_geometry_hint(s->bs, &cylinders, &heads, &secs);
> +    if (cylinders == 0) {
> +        bdrv_guess_geometry(s->bs, &cylinders, &heads, &secs);
> +    }
> +

bdrv_guess_geometry() can be called unconditionally.  The call to
bdrv_get_geometry_hint() can be eliminated.  bdrv_guess_geometry()
updates the geometry hint and does not probe the boot sector after the
first time.

Stefan

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-11 16:53   ` Stefan Hajnoczi
@ 2010-09-11 17:27     ` Anthony Liguori
  2010-09-11 17:45       ` Anthony Liguori
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-11 17:27 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/11/2010 11:53 AM, Stefan Hajnoczi wrote:
> On Sat, Sep 11, 2010 at 3:04 PM, Anthony Liguori<aliguori@us.ibm.com>  wrote:
>    
>> Additionally, there's a spurious read when using an nbd protocol that can be
>> quite destructive when using copy-on-read.  Potentially, this can lead to
>> probing an image file over top of NBD but this is completely wrong as NBD
>> devices are not growable.
>>      
> Can you describe the copy-on-read scenario where the 2 KB probe read
> is a problem?
>
> Accessing a fixed size image file over NBD is probably uncommon, but
> I'm not sure if there's a reason to forbid it.
>    

I think the better solution is to explicitly specify raw with nbd.  
IOW, I think -drive file=nbd:localhost:1026,format=raw should work the 
same way.  I still feel slightly weird about probing happening with 
nbd.  It seems like it could only result in badness.

The specific scenario is migration.  I'm using a copy-on-read file on 
the destination and I want to be sure that I don't read any blocks 
(since they're copied) until the source stops execution.

Regards,

Anthony Liguori

> Stefan
>
>    

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-11 17:24   ` Stefan Hajnoczi
@ 2010-09-11 17:34     ` Anthony Liguori
  0 siblings, 0 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-11 17:34 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/11/2010 12:24 PM, Stefan Hajnoczi wrote:
> On Sat, Sep 11, 2010 at 3:04 PM, Anthony Liguori<aliguori@us.ibm.com>  wrote:
>    
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>>    we re-read, we may end up with undesired caching.  By delaying any reads
>>    until we absolutely have to, we decrease the likelihood of any undesirable
>>    caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>>    to make sure to avoid any reads to avoid polluting the local cache.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>
>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>> index 1e466d1..57d8db3 100644
>> --- a/hw/ide/core.c
>> +++ b/hw/ide/core.c
>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>         return;
>>      }
>>
>> +    guess_geometry(s);
>> +
>>      
> Does the same change need to be made in ide_cfata_identify()?
>
> I quickly checked the VMStateDescription and don't see cylinders,
> heads, sectors being saved for migration.  I am concerned that IDE
> will break after migration if the following happens:
> 1. Guest resumes and does not issue ATA IDENTIFY so cylinders, heads,
> sectors are not initialized.
> 2. Normal I/O is performed, invoking ide_get_sector() which uses
> geometry information that has not been initialized.
>
> Did I miss something?
>    

Nope, I just missed some places that need calls.  I've renamed 
guess_geometry() to init_geometry() and added calls to 
ide_cfata_identify(), ide_get_sector(), and ide_set_sector().

That should cover it all.

>> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
>> index bd6bbe6..0bf17ec 100644
>> --- a/hw/virtio-blk.c
>> +++ b/hw/virtio-blk.c
>> @@ -427,6 +427,10 @@ static void virtio_blk_update_config(VirtIODevice *vdev, uint8_t *config)
>>
>>      bdrv_get_geometry(s->bs,&capacity);
>>      bdrv_get_geometry_hint(s->bs,&cylinders,&heads,&secs);
>> +    if (cylinders == 0) {
>> +        bdrv_guess_geometry(s->bs,&cylinders,&heads,&secs);
>> +    }
>> +
>>      
> bdrv_guess_geometry() can be called unconditionally.  The call to
> bdrv_get_geometry_hint() can be eliminated.  bdrv_guess_geometry()
> updates the geometry hint and does not probe the boot sector after the
> first time.
>    

Yeah, that's a fair point.

Regards,

Anthony Liguori

> Stefan
>
>    

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-11 17:27     ` Anthony Liguori
@ 2010-09-11 17:45       ` Anthony Liguori
  0 siblings, 0 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-11 17:45 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/11/2010 12:27 PM, Anthony Liguori wrote:
> I think the better solution is to explicitly specify raw with nbd.  
> IOW, I think -drive file=nbd:localhost:1026,format=raw should work the 
> same way.  I still feel slightly weird about probing happening with 
> nbd.  It seems like it could only result in badness.

Yeah, w/o this patch, 'qemu-img create -f qcow2 -b nbd:localhost:1026 -o 
backing_fmt=raw foo.img' doesn't generate any reads, although if you drop 
-o backing_fmt, it will.

Not sure I agree it's the most useful thing, but I'm not going to claim 
it's never useful, so this patch is not a good idea.

Regards,

Anthony Liguori

>
> The specific scenario is migration.  I'm using a copy-on-read file on 
> the destination and I want to be sure that I don't read any blocks 
> (since they're copied) until the source stops execution.
>
> Regards,
>
> Anthony Liguori
>
>> Stefan
>>
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files Anthony Liguori
@ 2010-09-12 10:37   ` Avi Kivity
  2010-09-12 13:06     ` Anthony Liguori
  2010-09-13  8:21   ` Kevin Wolf
  2010-09-15 15:53   ` Juan Quintela
  2 siblings, 1 reply; 51+ messages in thread
From: Avi Kivity @ 2010-09-12 10:37 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Juan Quintela, qemu-devel, Stefan Hajnoczi

  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> Image files have two types of data: immutable data that describes things like
> image size, backing files, etc. and mutable data that includes offset and
> reference count tables.
>

Note: even the logical size is, in principle, mutable.  If we introduce 
external snapshots, then so are backing files?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts Anthony Liguori
  2010-09-11 17:24   ` Stefan Hajnoczi
@ 2010-09-12 10:42   ` Avi Kivity
  2010-09-12 13:08     ` Anthony Liguori
  2010-09-13  8:32   ` Kevin Wolf
  2 siblings, 1 reply; 51+ messages in thread
From: Avi Kivity @ 2010-09-12 10:42 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Juan Quintela, qemu-devel, Stefan Hajnoczi

  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> This fixes a couple nasty problems relating to live migration.
>
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>     we re-read, we may end up with undesired caching.  By delaying any reads
>     until we absolutely have to, we decrease the likelihood of any undesirable
>     caching.
>
> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>     to make sure to avoid any reads to avoid polluting the local cache.
>
> +
>   static void ide_identify(IDEState *s)
>   {
>       uint16_t *p;
> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>   	return;
>       }
>
> +    guess_geometry(s);
> +

This can cause a disk read, no?  Shouldn't it be made asynchronous?

Or just move it to just before the guest starts?

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration
  2010-09-11 14:04 [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Anthony Liguori
                   ` (2 preceding siblings ...)
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts Anthony Liguori
@ 2010-09-12 10:46 ` Avi Kivity
  2010-09-12 13:12   ` Anthony Liguori
  3 siblings, 1 reply; 51+ messages in thread
From: Avi Kivity @ 2010-09-12 10:46 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Juan Quintela, qemu-devel, Stefan Hajnoczi

  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
> Today, live migration only works when using shared storage that is fully
> cache coherent using raw images.
>
> The failure case with weak coherence (e.g. NFS) is subtle but nonetheless still
> exists.  NFS only guarantees close-to-open coherence and when performing a live
> migration, we do an open on the source and an open on the destination.  We
> fsync() on the source before launching the destination but since we have two
> simultaneous opens, we're not guaranteed coherence.
>
> This is not necessarily a problem except that we are a bit gratuitous in reading
> from the disk before launching a guest.  This means that as things stand today,
> we're guaranteed to read the first 64k of the disk and as such, if a client
> writes to that region during live migration, corruption will result.
>
> The second failure condition has to do with image files (such as qcow2).  Today,
> we aggressively cache metadata in all image formats and that cache is definitely
> not coherent even with fully coherent shared storage.
>
> In all image formats, we prefetch at least the L1 table in open() which means
> that if there is a write operation that causes a modification to an L1 table,
> corruption will ensue.
>
> This series attempts to address both of these issues.  Technically, if an NFS
> client aggressively prefetches this solution is not enough but in practice,
> Linux doesn't do that.

I think it is unlikely that it will, but I prefer to be on the right 
side of the standards.  Why not delay image open until after migration 
completes?  I know your concern about the image not being there, but we 
can verify that with access().  If the image is deleted between access() 
and open(), then the user has much bigger problems.

Note that on NFS, removing (and I think chmoding) a file after it is 
opened will cause subsequent data access to fail, unlike POSIX.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
  2010-09-12 10:37   ` Avi Kivity
@ 2010-09-12 13:06     ` Anthony Liguori
  2010-09-12 13:28       ` Avi Kivity
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-12 13:06 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/12/2010 05:37 AM, Avi Kivity wrote:
>  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>> Image files have two types of data: immutable data that describes 
>> things like
>> image size, backing files, etc. and mutable data that includes offset 
>> and
>> reference count tables.
>>
>
> Note: even the logical size is, in principle, mutable.  If we 
> introduce external snapshots, then so are backing files?

Maybe it should be, "data that can be changed during a live migration".

Backing files and logical size shouldn't change during live migration.

But even so, I think the interface makes sense.  It's basically: drop 
anything you have cached that may change during migration.  What needs 
to be read is dependent on features/format.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-12 10:42   ` Avi Kivity
@ 2010-09-12 13:08     ` Anthony Liguori
  2010-09-12 13:26       ` Avi Kivity
  2010-09-15 16:10       ` [Qemu-devel] " Juan Quintela
  0 siblings, 2 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-12 13:08 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/12/2010 05:42 AM, Avi Kivity wrote:
>  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), 
>> even if
>>     we re-read, we may end up with undesired caching.  By delaying 
>> any reads
>>     until we absolutely have to, we decrease the likelihood of any 
>> undesirable
>>     caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a 
>> cache.  We need
>>     to make sure to avoid any reads to avoid polluting the local cache.
>>
>> +
>>   static void ide_identify(IDEState *s)
>>   {
>>       uint16_t *p;
>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>       return;
>>       }
>>
>> +    guess_geometry(s);
>> +
>
> This can cause a disk read, no?  Shouldn't it be made asynchronous?

Yes, it should.  I'm not sure there's an easy way to make it 
asynchronous, though, not because of the block layer but because of how 
these functions are called.

>
> Or just move it to just before the guest starts?

We don't really have a notion of "guest starts" today although maybe we 
should.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration
  2010-09-12 10:46 ` [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Avi Kivity
@ 2010-09-12 13:12   ` Anthony Liguori
  0 siblings, 0 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-12 13:12 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/12/2010 05:46 AM, Avi Kivity wrote:
>  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>> Today, live migration only works when using shared storage that is fully
>> cache coherent using raw images.
>>
>> The failure case with weakly coherent storage (e.g. NFS) is subtle but 
>> nonetheless still
>> exists.  NFS only guarantees close-to-open coherence and when 
>> performing a live
>> migration, we do an open on the source and an open on the 
>> destination.  We
>> fsync() on the source before launching the destination but since we 
>> have two
>> simultaneous opens, we're not guaranteed coherence.
>>
>> This is not necessarily a problem except that we are a bit gratuitous 
>> in reading
>> from the disk before launching a guest.  This means that as things 
>> stand today,
>> we're guaranteed to read the first 64k of the disk and as such, if a 
>> client
>> writes to that region during live migration, corruption will result.
>>
>> The second failure condition has to do with image files (such as 
>> qcow2).  Today,
>> we aggressively cache metadata in all image formats and that cache is 
>> definitely
>> not coherent even with fully coherent shared storage.
>>
>> In all image formats, we prefetch at least the L1 table in open() 
>> which means
>> that if there is a write operation that causes a modification to an 
>> L1 table,
>> corruption will ensue.
>>
>> This series attempts to address both of these issues.  Technically, if 
>> an NFS
>> client aggressively prefetches this solution is not enough but in 
>> practice,
>> Linux doesn't do that.
>
> I think it is unlikely that it will, but I prefer to be on the right 
> side of the standards.

I've been asking around about this, and one thing that was suggested was 
acquiring a file lock, since NFS requires that a lock acquisition drop 
any client cache for a file.  I need to understand this a bit more, so 
it's step #2.

>   Why not delay image open until after migration completes?  I know 
> your concern about the image not being there, but we can verify that 
> with access().  If the image is deleted between access() and open() 
> then the user has much bigger problems.

3/3 would still be needed because if we delay the open, we obviously 
can't do a read until the open happens.

So it's only really a choice between invalidate_cache and delaying 
open.  It's a far less invasive change to just do invalidate_cache 
though and it has some nice properties.
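
A lock-based invalidation along those lines could look something like 
the following.  This is only a hypothetical sketch using POSIX fcntl() 
locks, not an actual QEMU patch; the function name is made up.  The idea 
is that acquiring (and immediately releasing) a whole-file lock gives 
the NFS client a point at which it must revalidate its cached data:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: one whole-file lock/unlock cycle.  On NFS,
 * lock acquisition forces the client to revalidate its cache for
 * the file, so this acts as a cache invalidation point. */
static int nfs_revalidate_cache(int fd)
{
    struct flock fl = {
        .l_type   = F_RDLCK,   /* a read lock suffices to force revalidation */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,         /* l_len == 0 locks the whole file */
    };

    if (fcntl(fd, F_SETLKW, &fl) < 0) {
        return -1;
    }
    fl.l_type = F_UNLCK;
    return fcntl(fd, F_SETLK, &fl);
}
```

On local file systems the lock cycle is a cheap no-op; the cache effect 
only matters on NFS mounts.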

Regards,

Anthony Liguori

> Note that on NFS, removing (and I think chmoding) a file after it is 
> opened will cause subsequent data access to fail, unlike POSIX.
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-12 13:08     ` Anthony Liguori
@ 2010-09-12 13:26       ` Avi Kivity
  2010-09-12 15:29         ` Anthony Liguori
  2010-09-15 16:10       ` [Qemu-devel] " Juan Quintela
  1 sibling, 1 reply; 51+ messages in thread
From: Avi Kivity @ 2010-09-12 13:26 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Stefan Hajnoczi, Juan Quintela

  On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>> This can cause a disk read, no?  Shouldn't it be made asynchronous?
>
>
> Yes, it should.  I'm not sure there's an easy way to make it 
> asynchronous though not because of the block layer but because of how 
> these functions are called.

Sorry to harp on the subject, but that's the standard problem with state 
machines.  Every time you want to do a blocking operation in a function, 
you have to put all its locals in some structure, split the function 
into two, do some scheduling, etc.
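
The transformation Avi is describing can be shown with a toy example 
(hypothetical code, not from QEMU): the blocking version keeps its 
locals on the stack, while the asynchronous split hoists them into a 
continuation structure and finishes in a callback.

```c
#include <assert.h>
#include <stdlib.h>

/* Blocking version: locals live on the stack, control flow is linear. */
static int read_sectors_blocking(int dev)
{
    int sectors = dev * 16;          /* stands in for a blocking disk read */
    return sectors;
}

/* Asynchronous split: the same logic cut in two, with the locals
 * hoisted into a heap-allocated continuation structure. */
struct geom_req {
    int dev;
    int sectors;
    void (*done)(struct geom_req *req);
};

static struct geom_req *read_sectors_start(int dev,
                                           void (*done)(struct geom_req *))
{
    struct geom_req *req = malloc(sizeof(*req));
    req->dev = dev;
    req->done = done;
    return req;                      /* the "read" is now pending */
}

static void read_sectors_complete(struct geom_req *req)
{
    req->sectors = req->dev * 16;    /* result of the simulated I/O */
    req->done(req);                  /* resume the caller's continuation */
}

/* Driver used for demonstration: collects the async result. */
static int async_result;
static void on_done(struct geom_req *req)
{
    async_result = req->sectors;
    free(req);
}

static int read_sectors_async_demo(int dev)
{
    read_sectors_complete(read_sectors_start(dev, on_done));
    return async_result;
}
```

Both paths compute the same value; the cost of the async form is 
exactly the boilerplate Avi points out: a struct, two functions, and 
explicit scheduling of the completion.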

>>
>> Or just move it to just before the guest starts?
>
> We don't really have a notion of "guest starts" today although maybe 
> we should.

Wasn't there some qdev callback that represents this?  Faint memory from 
the reset thread.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
  2010-09-12 13:06     ` Anthony Liguori
@ 2010-09-12 13:28       ` Avi Kivity
  2010-09-12 15:26         ` Anthony Liguori
  2010-09-15 15:57         ` Juan Quintela
  0 siblings, 2 replies; 51+ messages in thread
From: Avi Kivity @ 2010-09-12 13:28 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>
> Backing files and logical size shouldn't change during live migration.

Why not?

>
> But even so, I think the interface makes sense.  It's basically: drop 
> anything you have cached that may change during migration.  What needs 
> to be read is dependent on features/format.

Sure.  Certainly as an incremental step.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
  2010-09-12 13:28       ` Avi Kivity
@ 2010-09-12 15:26         ` Anthony Liguori
  2010-09-12 16:06           ` Avi Kivity
  2010-09-15 15:57         ` Juan Quintela
  1 sibling, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-12 15:26 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/12/2010 08:28 AM, Avi Kivity wrote:
>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>
>> Backing files and logical size shouldn't change during live migration.
>
> Why not?

To make our lives easier.

Regards,

Anthony Liguori

>>
>> But even so, I think the interface makes sense.  It's basically: drop 
>> anything you have cached that may change during migration.  What 
>> needs to be read is dependent on features/format.
>
> Sure.  Certainly as an incremental step.
>

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-12 13:26       ` Avi Kivity
@ 2010-09-12 15:29         ` Anthony Liguori
  2010-09-12 16:04           ` Avi Kivity
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-12 15:29 UTC (permalink / raw)
  To: Avi Kivity
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/12/2010 08:26 AM, Avi Kivity wrote:
>  On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>>> This can cause a disk read, no?  Shouldn't it be made asynchronous?
>>
>>
>> Yes, it should.  I'm not sure there's an easy way to make it 
>> asynchronous though not because of the block layer but because of how 
>> these functions are called.
>
> Sorry to harp on the subject, but that's the standard problem with 
> state machines.  Every time you want to do a blocking operation in a 
> function, you have to put all its locals in some structure, split the 
> function into two, do some scheduling, etc.

We can't block the VCPU thread for arbitrarily long periods of time.  If 
we get a PIO operation requesting information about geometry, we can't 
wait for a disk read in order to satisfy that request.

We need to kick off the I/O operations in the background such that the 
data is available before the PIO operation happens.  This isn't SM vs. 
threads at all, this is simply about the fact that we can't do block I/O 
during a PIO operation.

>>>
>>> Or just move it to just before the guest starts?
>>
>> We don't really have a notion of "guest starts" today although maybe 
>> we should.
>
> Wasn't there some qdev callback that represents this?  Faint memory 
> from the reset thread.

Yes, realize().  I guess that's a reasonable approach.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-12 15:29         ` Anthony Liguori
@ 2010-09-12 16:04           ` Avi Kivity
  0 siblings, 0 replies; 51+ messages in thread
From: Avi Kivity @ 2010-09-12 16:04 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Anthony Liguori, qemu-devel, Stefan Hajnoczi, Juan Quintela

  On 09/12/2010 05:29 PM, Anthony Liguori wrote:
> On 09/12/2010 08:26 AM, Avi Kivity wrote:
>>  On 09/12/2010 03:08 PM, Anthony Liguori wrote:
>>>> This can cause a disk read, no?  Shouldn't it be made asynchronous?
>>>
>>>
>>> Yes, it should.  I'm not sure there's an easy way to make it 
>>> asynchronous though not because of the block layer but because of 
>>> how these functions are called.
>>
>> Sorry to harp on the subject, but that's the standard problem with 
>> state machines.  Every time you want to do a blocking operation in a 
>> function, you have to put all its locals in some structure, split the 
>> function into two, do some scheduling, etc.
>
> We can't block the VCPU thread for arbitrarily long periods of time.  
> If we get a PIO operation requesting information about geometry, we 
> can't wait for a disk read in order to satisfy that request.
>
> We need to kick off the I/O operations in the background such that the 
> data is available before the PIO operation happens.  This isn't SM vs. 
> threads at all, this is simply about the fact that we can't do block 
> I/O during a PIO operation.

Isn't this an identify command, where the guest can only read the data 
after the interface indicates it is ready (or posts an interrupt)?  I 
thought the guest interface was already async.

The code appears to support this:

         switch(val) {
         case WIN_IDENTIFY:
             if (s->bs && s->drive_kind != IDE_CD) {
                 if (s->drive_kind != IDE_CFATA)
                     ide_identify(s);
                 else
                     ide_cfata_identify(s);
                 s->status = READY_STAT | SEEK_STAT;
                 ide_transfer_start(s, s->io_buffer, 512, 
ide_transfer_stop);
             } else {
                 if (s->drive_kind == IDE_CD) {
                     ide_set_signature(s);
                 }
                 ide_abort_command(s);
             }
             ide_set_irq(s->bus);
             break;

but I may be misinterpreting it.  If I'm right, all we need to do is 
push this whole block to a thread.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
  2010-09-12 15:26         ` Anthony Liguori
@ 2010-09-12 16:06           ` Avi Kivity
  2010-09-12 17:10             ` Anthony Liguori
  0 siblings, 1 reply; 51+ messages in thread
From: Avi Kivity @ 2010-09-12 16:06 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

  On 09/12/2010 05:26 PM, Anthony Liguori wrote:
> On 09/12/2010 08:28 AM, Avi Kivity wrote:
>>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>>
>>> Backing files and logical size shouldn't change during live migration.
>>
>> Why not?
>
> To make our lives easier.

It means management needs to block volume resize while a live migration 
takes place.  Since live migration is typically done by the system 
automatically, while volume resize happens in response to user request, 
this isn't a good idea.  Both in terms of user experience, and in terms 
of pushing more complexity to management.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
  2010-09-12 16:06           ` Avi Kivity
@ 2010-09-12 17:10             ` Anthony Liguori
  2010-09-12 17:51               ` Avi Kivity
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-12 17:10 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/12/2010 11:06 AM, Avi Kivity wrote:
>  On 09/12/2010 05:26 PM, Anthony Liguori wrote:
>> On 09/12/2010 08:28 AM, Avi Kivity wrote:
>>>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>>>
>>>> Backing files and logical size shouldn't change during live migration.
>>>
>>> Why not?
>>
>> To make our lives easier.
>
> It means management needs to block volume resize while a live 
> migration takes place.  Since live migration is typically done by the 
> system automatically, while volume resize happens in response to user 
> request, this isn't a good idea.  Both in terms of user experience, 
> and in terms of pushing more complexity to management.

We don't do volume resize today so it's a moot point.

Regards,

Anthony Liguori

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files
  2010-09-12 17:10             ` Anthony Liguori
@ 2010-09-12 17:51               ` Avi Kivity
  2010-09-15 16:00                 ` [Qemu-devel] " Juan Quintela
  0 siblings, 1 reply; 51+ messages in thread
From: Avi Kivity @ 2010-09-12 17:51 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Juan Quintela

  On 09/12/2010 07:10 PM, Anthony Liguori wrote:
> On 09/12/2010 11:06 AM, Avi Kivity wrote:
>>  On 09/12/2010 05:26 PM, Anthony Liguori wrote:
>>> On 09/12/2010 08:28 AM, Avi Kivity wrote:
>>>>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>>>>
>>>>> Backing files and logical size shouldn't change during live 
>>>>> migration.
>>>>
>>>> Why not?
>>>
>>> To make our lives easier.
>>
>> It means management needs to block volume resize while a live 
>> migration takes place.  Since live migration is typically done by the 
>> system automatically, while volume resize happens in response to user 
>> request, this isn't a good idea.  Both in terms of user experience, 
>> and in terms of pushing more complexity to management.
>
> We don't do volume resize today so it's a moot point.
>

Let's be prepared for the future.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Qemu-devel] Re: [PATCH 1/3] block: allow migration to work with image files
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files Anthony Liguori
  2010-09-12 10:37   ` Avi Kivity
@ 2010-09-13  8:21   ` Kevin Wolf
  2010-09-13 13:27     ` Anthony Liguori
  2010-09-15 16:03     ` Juan Quintela
  2010-09-15 15:53   ` Juan Quintela
  2 siblings, 2 replies; 51+ messages in thread
From: Kevin Wolf @ 2010-09-13  8:21 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 11.09.2010 16:04, schrieb Anthony Liguori:
> Image files have two types of data: immutable data that describes things like
> image size, backing files, etc. and mutable data that includes offset and
> reference count tables.
> 
> Today, image formats aggressively cache mutable data to improve performance.  In
> some cases, this happens before a guest even starts.  When dealing with live
> migration, since a file is open on two machines, the caching of meta data can
> lead to data corruption.
> 
> This patch addresses this by introducing a mechanism to invalidate any cached
> mutable data a block driver may have which is then used by the live migration
> code.
> 
> NB, this still requires coherent shared storage.  Addressing migration without
> coherent shared storage (i.e. NFS) requires additional work.
> 
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> 
> diff --git a/block.c b/block.c
> index ebbc376..cd2ee31 100644
> --- a/block.c
> +++ b/block.c
> @@ -1453,6 +1453,22 @@ const char *bdrv_get_device_name(BlockDriverState *bs)
>      return bs->device_name;
>  }
>  
> +void bdrv_invalidate_cache(BlockDriverState *bs)
> +{
> +    if (bs->drv && bs->drv->bdrv_invalidate_cache) {
> +        bs->drv->bdrv_invalidate_cache(bs);
> +    }
> +}

There is a simple generic implementation:

drv = bs->drv;
drv->close(bs);
drv->open(bs, bs->open_flags);

Note that this only reopens e.g. the qcow2 layer, but not the image
file, which is bs->file.
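
The fallback can be modelled with stand-in types (a hypothetical mock; 
the real BlockDriver and BlockDriverState live in QEMU's block layer, 
and the mock_* functions below are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>

struct BlockDriverState;

struct BlockDriver {
    void (*close)(struct BlockDriverState *bs);
    int (*open)(struct BlockDriverState *bs, int flags);
    void (*bdrv_invalidate_cache)(struct BlockDriverState *bs); /* optional */
};

struct BlockDriverState {
    struct BlockDriver *drv;
    int open_flags;
    int cached_metadata;   /* stand-in for cached L1/refcount tables */
};

/* Format-layer mock: close drops cached metadata, open re-reads it. */
static void mock_close(struct BlockDriverState *bs)
{
    bs->cached_metadata = 0;
}

static int mock_open(struct BlockDriverState *bs, int flags)
{
    (void)flags;
    bs->cached_metadata += 1;       /* freshly read from bs->file */
    return 0;
}

/* Generic fallback: if the driver has no invalidation hook, close and
 * reopen the format layer only; bs->file would stay open underneath. */
static void bdrv_invalidate_cache(struct BlockDriverState *bs)
{
    struct BlockDriver *drv = bs->drv;

    if (drv->bdrv_invalidate_cache) {
        drv->bdrv_invalidate_cache(bs);
    } else {
        drv->close(bs);
        drv->open(bs, bs->open_flags);
    }
}
```

After the close/open cycle, whatever metadata the format layer had 
cached has been discarded and re-read, which is exactly the semantics 
live migration needs on the destination.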

This works for all simple case, that is, one format on top of one or
more protocols, where the protocols don't need invalidation. I think
this includes everything that is possible today. With -blockdev we might
need to revise this to include the lower layers, too. (But only
sometimes, because we don't want to reopen the file)

Kevin

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts Anthony Liguori
  2010-09-11 17:24   ` Stefan Hajnoczi
  2010-09-12 10:42   ` Avi Kivity
@ 2010-09-13  8:32   ` Kevin Wolf
  2010-09-13 13:29     ` Anthony Liguori
  2010-09-15 16:16     ` Juan Quintela
  2 siblings, 2 replies; 51+ messages in thread
From: Kevin Wolf @ 2010-09-13  8:32 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 11.09.2010 16:04, schrieb Anthony Liguori:
> This fixes a couple nasty problems relating to live migration.
> 
> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>    we re-read, we may end up with undesired caching.  By delaying any reads
>    until we absolutely have to, we decrease the likelihood of any undesirable
>    caching.
> 
> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>    to make sure to avoid any reads to avoid polluting the local cache.
> 
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

I think we should also delay even opening the image file to the
latest possible point, to avoid introducing new problems of this
kind. Ideally, opening the image would be the very last thing we
do before telling the migration source that we're set and it should
close the images.

Even better would be to only open the image when the source has already
closed it (we could completely avoid the invalidation/reopen then), but
I think you were afraid that we might lose the image on both ends.

Kevin

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Qemu-devel] Re: [PATCH 1/3] block: allow migration to work with image files
  2010-09-13  8:21   ` Kevin Wolf
@ 2010-09-13 13:27     ` Anthony Liguori
  2010-09-15 16:03     ` Juan Quintela
  1 sibling, 0 replies; 51+ messages in thread
From: Anthony Liguori @ 2010-09-13 13:27 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Juan Quintela, qemu-devel, Stefan Hajnoczi

On 09/13/2010 03:21 AM, Kevin Wolf wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>    
>> Image files have two types of data: immutable data that describes things like
>> image size, backing files, etc. and mutable data that includes offset and
>> reference count tables.
>>
>> Today, image formats aggressively cache mutable data to improve performance.  In
>> some cases, this happens before a guest even starts.  When dealing with live
>> migration, since a file is open on two machines, the caching of meta data can
>> lead to data corruption.
>>
>> This patch addresses this by introducing a mechanism to invalidate any cached
>> mutable data a block driver may have which is then used by the live migration
>> code.
>>
>> NB, this still requires coherent shared storage.  Addressing migration without
>> coherent shared storage (i.e. NFS) requires additional work.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>
>> diff --git a/block.c b/block.c
>> index ebbc376..cd2ee31 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -1453,6 +1453,22 @@ const char *bdrv_get_device_name(BlockDriverState *bs)
>>       return bs->device_name;
>>   }
>>
>> +void bdrv_invalidate_cache(BlockDriverState *bs)
>> +{
>> +    if (bs->drv&&  bs->drv->bdrv_invalidate_cache) {
>> +        bs->drv->bdrv_invalidate_cache(bs);
>> +    }
>> +}
>>      
> There is a simple generic implementation:
>
> drv = bs->drv;
> drv->close(bs);
> drv->open(bs, bs->open_flags);
>
> Note that this only reopens e.g. the qcow2 layer, but not the image
> file, which is bs->file.
>    

That's a pretty good idea for a general implementation, I'll update the 
patches accordingly.

Regards,

Anthony Liguori

> This works for all simple case, that is, one format on top of one or
> more protocols, where the protocols don't need invalidation. I think
> this includes everything that is possible today. With -blockdev we might
> need to revise this to include the lower layers, too. (But only
> sometimes, because we don't want to reopen the file)
>
> Kevin
>    

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13  8:32   ` Kevin Wolf
@ 2010-09-13 13:29     ` Anthony Liguori
  2010-09-13 13:39       ` Kevin Wolf
  2010-09-15 16:16     ` Juan Quintela
  1 sibling, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-13 13:29 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Juan Quintela, qemu-devel, Stefan Hajnoczi

On 09/13/2010 03:32 AM, Kevin Wolf wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>    
>> This fixes a couple nasty problems relating to live migration.
>>
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>>     we re-read, we may end up with undesired caching.  By delaying any reads
>>     until we absolutely have to, we decrease the likelihood of any undesirable
>>     caching.
>>
>> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>>     to make sure to avoid any reads to avoid polluting the local cache.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>      
> I think we should also delay even opening the image file to the
> latest possible point, to avoid introducing new problems of this
> kind. Ideally, opening the image would be the very last thing we
> do before telling the migration source that we're set and it should
> close the images.
>    

There are a lot of possible failure scenarios that opening an image file 
can introduce.  Fortunately, I don't think we have a strict requirement 
for it provided that we make a couple of reasonable changes.

> Even better would be to only open the image when the source has already
> closed it (we could completely avoid the invalidation/reopen then), but
> I think you were afraid that we might lose the image on both ends.
>    

Yeah, one of the key design points of live migration is to minimize the 
number of failure scenarios where you lose a VM.  If someone typed the 
wrong command line or shared storage hasn't been mounted yet and we 
delay failure until live migration is in the critical path, that would 
be terribly unfortunate.

Regards,

Anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 13:29     ` Anthony Liguori
@ 2010-09-13 13:39       ` Kevin Wolf
  2010-09-13 13:42         ` Anthony Liguori
  0 siblings, 1 reply; 51+ messages in thread
From: Kevin Wolf @ 2010-09-13 13:39 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Anthony Liguori, Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 13.09.2010 15:29, schrieb Anthony Liguori:
> On 09/13/2010 03:32 AM, Kevin Wolf wrote:
>> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>>    
>>> This fixes a couple nasty problems relating to live migration.
>>>
>>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>>>     we re-read, we may end up with undesired caching.  By delaying any reads
>>>     until we absolutely have to, we decrease the likelihood of any undesirable
>>>     caching.
>>>
>>> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>>>     to make sure to avoid any reads to avoid polluting the local cache.
>>>
>>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>>      
>> I think we should also delay even opening the image file to the
>> latest possible point, to avoid introducing new problems of this
>> kind. Ideally, opening the image would be the very last thing we
>> do before telling the migration source that we're set and it should
>> close the images.
>>    
> 
> There's a lot of possible failure scenarios that opening an image file 
> can introduce.  Fortunately, I don't think we have a strict requirement 
> for it provided that we make a couple of reasonable changes.
>
>> Even better would be to only open the image when the source has already
>> closed it (we could completely avoid the invalidation/reopen then), but
>> I think you were afraid that we might lose the image on both ends.
>>    
> 
> Yeah, one of the key design points of live migration is to minimize the 
> number of failure scenarios where you lose a VM.  If someone typed the 
> wrong command line or shared storage hasn't been mounted yet and we 
> delay failure until live migration is in the critical path, that would 
> be terribly unfortunate.

We would catch most of them if we try to open the image when migration
starts and immediately close it again until migration is (almost)
completed, so that no other code can possibly use it before the source
has really closed it.

Kevin

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 13:39       ` Kevin Wolf
@ 2010-09-13 13:42         ` Anthony Liguori
  2010-09-13 14:13           ` Kevin Wolf
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-13 13:42 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Juan Quintela, qemu-devel, Stefan Hajnoczi

On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>> Yeah, one of the key design points of live migration is to minimize the
>> number of failure scenarios where you lose a VM.  If someone typed the
>> wrong command line or shared storage hasn't been mounted yet and we
>> delay failure until live migration is in the critical path, that would
>> be terribly unfortunate.
>>      
> We would catch most of them if we try to open the image when migration
> starts and immediately close it again until migration is (almost)
> completed, so that no other code can possibly use it before the source
> has really closed it.
>    

I think the only real advantage is that we fix NFS migration, right?

But if we do invalidate_cache() as you suggested with a close/open of 
the qcow2 layer, and also acquire and release a lock in the file layer 
by propagating the invalidate_cache(), that should work robustly with NFS.

I think that's a simpler change.  Do you see additional advantages to 
delaying the open?

Regards,

anthony Liguori

> Kevin
>    

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 13:42         ` Anthony Liguori
@ 2010-09-13 14:13           ` Kevin Wolf
  2010-09-13 14:34             ` Anthony Liguori
  2010-09-13 19:29             ` Stefan Hajnoczi
  0 siblings, 2 replies; 51+ messages in thread
From: Kevin Wolf @ 2010-09-13 14:13 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Anthony Liguori, Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 13.09.2010 15:42, schrieb Anthony Liguori:
> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>> Yeah, one of the key design points of live migration is to minimize the
>>> number of failure scenarios where you lose a VM.  If someone typed the
>>> wrong command line or shared storage hasn't been mounted yet and we
>>> delay failure until live migration is in the critical path, that would
>>> be terribly unfortunate.
>>>      
>> We would catch most of them if we try to open the image when migration
>> starts and immediately close it again until migration is (almost)
>> completed, so that no other code can possibly use it before the source
>> has really closed it.
>>    
> 
> I think the only real advantage is that we fix NFS migration, right?

That's the one that we know about, yes.

The rest is not a specific scenario, but a strong feeling that having an
image opened twice at the same time feels dangerous. As soon as an
open/close sequence writes to the image for some format, we probably
have a bug. For example, what about this mounted flag that you were
discussing for QED?

> But if we do invalidate_cache() as you suggested with a close/open of 
> the qcow2 layer, and also acquire and release a lock in the file layer 
> by propagating the invalidate_cache(), that should work robustly with NFS.
> 
> I think that's a simpler change.  Do you see additional advantages to 
> delaying the open?

Just that it makes it very obvious if a device model is doing bad things
and accessing the image before it should. The difference is a failed
request vs. silently corrupted data.

Kevin

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 14:13           ` Kevin Wolf
@ 2010-09-13 14:34             ` Anthony Liguori
  2010-09-14  9:47               ` Avi Kivity
  2010-09-13 19:29             ` Stefan Hajnoczi
  1 sibling, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-13 14:34 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>> I think the only real advantage is that we fix NFS migration, right?
>>      
> That's the one that we know about, yes.
>
> The rest is not a specific scenario, but a strong feeling that having an
> image opened twice at the same time feels dangerous.

We've never really had clear semantics about live migration and block 
driver's life cycles.  At a high level, for live migration to work, we 
need the following sequence:

1) src> flush all pending writes to disk
2) <barrier>
3) dst> invalidate any cached data
4) dst> start guest

We've gotten away ignoring (3) because raw disks never cache anything.  
But that assumes that we're on top of cache coherent storage.  If we 
don't have fully cache coherent storage, we need to do more.
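
The four steps can be modelled as a toy sketch (hypothetical code, not 
the real migration implementation, which is spread across QEMU's 
migration and block layers; all names here are invented):

```c
#include <assert.h>

/* Toy model of the required ordering between source and destination. */
struct side {
    int dirty_writes;    /* pending writes on the source */
    int stale_cache;     /* cached metadata on the destination */
    int guest_running;
};

static void src_flush(struct side *src)      { src->dirty_writes = 0; }  /* 1 */
static void handoff_barrier(void)            { /* migration handoff  */ }/* 2 */
static void dst_invalidate(struct side *dst) { dst->stale_cache = 0; }   /* 3 */
static void dst_start(struct side *dst)      { dst->guest_running = 1; } /* 4 */

static int migrate(struct side *src, struct side *dst)
{
    src_flush(src);
    handoff_barrier();
    dst_invalidate(dst);         /* must precede any destination read */
    dst_start(dst);
    return !src->dirty_writes && !dst->stale_cache && dst->guest_running;
}
```

The point of the model is only the ordering: step (3) has to happen 
after the barrier and before the destination issues any read.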

We need to extend (3) to also flush the cache of the underlying 
storage.  There are two ways we can solve this, we can either ensure 
that (3) is a nop by not having any operations that would cause caching 
until after (3), or we can find a way to inject a flush into the 
underlying cache.

Since the latter approach requires storage-specific knowledge, I went with 
the former approach.  Of course, if you did a close-open sequence at 
(3), it may achieve the same goal but only really for NFS.  If you have 
something weaker than close-to-open coherency, you still need to do 
something unique in step (3).

I don't know that I see a perfect model.  Pushing reads past point (3) 
is easy and fixes raw on top of NFS.  I think we want to do that because 
it's low-hanging fruit.  A block driver hook for (3) also seems 
appealing because we can make use of it easily in QED.

That said, I'm open to suggestions of a better model.  Delaying open 
(especially if you open, then close, then open again) seems a bit hacky.

With respect to the devices, I think the question of when block devices 
can begin accessing drives is orthogonal to this discussion.  Even 
without delaying open, we could simply not give them their 
BlockDriverStates until realize() or something like that.

Regards,

Anthony Liguori

>   As soon as an
> open/close sequence writes to the image for some format, we probably
> have a bug. For example, what about this mounted flag that you were
> discussing for QED?
>
>    
>> But if we do invalidate_cache() as you suggested with a close/open of
>> the qcow2 layer, and also acquire and release a lock in the file layer
>> by propagating the invalidate_cache(), that should work robustly with NFS.
>>
>> I think that's a simpler change.  Do you see additional advantages to
>> delaying the open?
>>      
> Just that it makes it very obvious if a device model is doing bad things
> and accessing the image before it should. The difference is a failed
> request vs. silently corrupted data.
>
> Kevin
>
>    

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 14:13           ` Kevin Wolf
  2010-09-13 14:34             ` Anthony Liguori
@ 2010-09-13 19:29             ` Stefan Hajnoczi
  2010-09-13 20:03               ` Kevin Wolf
  1 sibling, 1 reply; 51+ messages in thread
From: Stefan Hajnoczi @ 2010-09-13 19:29 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Anthony Liguori, Anthony Liguori, qemu-devel, Stefan Hajnoczi,
	Juan Quintela

On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf <kwolf@redhat.com> wrote:
> Am 13.09.2010 15:42, schrieb Anthony Liguori:
>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>> Yeah, one of the key design points of live migration is to minimize the
>>>> number of failure scenarios where you lose a VM.  If someone typed the
>>>> wrong command line or shared storage hasn't been mounted yet and we
>>>> delay failure until live migration is in the critical path, that would
>>>> be terribly unfortunate.
>>>>
>>> We would catch most of them if we try to open the image when migration
>>> starts and immediately close it again until migration is (almost)
>>> completed, so that no other code can possibly use it before the source
>>> has really closed it.
>>>
>>
>> I think the only real advantage is that we fix NFS migration, right?
>
> That's the one that we know about, yes.
>
> The rest is not a specific scenario, but a strong feeling that having an
> image opened twice at the same time feels dangerous. As soon as an
> open/close sequence writes to the image for some format, we probably
> have a bug. For example, what about this mounted flag that you were
> discussing for QED?

There is some room left to work in, even if we can't check in open().
One idea would be to do the check asynchronously once I/O begins.  It
is actually easy to check L1/L2 tables as they are loaded.

The only barrier relationship between I/O and checking is that an
allocating write (which will need to update L1/L2 tables) is only
allowed after check completes.  Otherwise reads and non-allocating
writes may proceed while the image is not yet fully checked.  We can
detect when a table element is an invalid offset and discard it.

Stefan


* Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 19:29             ` Stefan Hajnoczi
@ 2010-09-13 20:03               ` Kevin Wolf
  2010-09-13 20:09                 ` Anthony Liguori
  0 siblings, 1 reply; 51+ messages in thread
From: Kevin Wolf @ 2010-09-13 20:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Anthony Liguori, Anthony Liguori, qemu-devel, Stefan Hajnoczi,
	Juan Quintela

Am 13.09.2010 21:29, schrieb Stefan Hajnoczi:
> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 13.09.2010 15:42, schrieb Anthony Liguori:
>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>> Yeah, one of the key design points of live migration is to minimize the
>>>>> number of failure scenarios where you lose a VM.  If someone typed the
>>>>> wrong command line or shared storage hasn't been mounted yet and we
>>>>> delay failure until live migration is in the critical path, that would
>>>>> be terribly unfortunate.
>>>>>
>>>> We would catch most of them if we try to open the image when migration
>>>> starts and immediately close it again until migration is (almost)
>>>> completed, so that no other code can possibly use it before the source
>>>> has really closed it.
>>>>
>>>
>>> I think the only real advantage is that we fix NFS migration, right?
>>
>> That's the one that we know about, yes.
>>
>> The rest is not a specific scenario, but a strong feeling that having an
>> image opened twice at the same time feels dangerous. As soon as an
>> open/close sequence writes to the image for some format, we probably
>> have a bug. For example, what about this mounted flag that you were
>> discussing for QED?
> 
> There is some room left to work in, even if we can't check in open().
> One idea would be to do the check asynchronously once I/O begins.  It
> is actually easy to check L1/L2 tables as they are loaded.
> 
> The only barrier relationship between I/O and checking is that an
> allocating write (which will need to update L1/L2 tables) is only
> allowed after check completes.  Otherwise reads and non-allocating
> writes may proceed while the image is not yet fully checked.  We can
> detect when a table element is an invalid offset and discard it.

I'm not even talking about such complicated things. You wanted to have a
dirty flag in the header, right? So when we allow opening an image
twice, you get this sequence with migration:

Source: open
Destination: open (with dirty image)
Source: close

The image is now marked as clean, even though the destination is still
working on it.

Kevin


* Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 20:03               ` Kevin Wolf
@ 2010-09-13 20:09                 ` Anthony Liguori
  2010-09-14  8:28                   ` Kevin Wolf
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-13 20:09 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Juan Quintela

On 09/13/2010 03:03 PM, Kevin Wolf wrote:
> Am 13.09.2010 21:29, schrieb Stefan Hajnoczi:
>    
>> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>      
>>> Am 13.09.2010 15:42, schrieb Anthony Liguori:
>>>        
>>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>          
>>>>>> Yeah, one of the key design points of live migration is to minimize the
>>>>>> number of failure scenarios where you lose a VM.  If someone typed the
>>>>>> wrong command line or shared storage hasn't been mounted yet and we
>>>>>> delay failure until live migration is in the critical path, that would
>>>>>> be terribly unfortunate.
>>>>>>
>>>>>>              
>>>>> We would catch most of them if we try to open the image when migration
>>>>> starts and immediately close it again until migration is (almost)
>>>>> completed, so that no other code can possibly use it before the source
>>>>> has really closed it.
>>>>>
>>>>>            
>>>> I think the only real advantage is that we fix NFS migration, right?
>>>>          
>>> That's the one that we know about, yes.
>>>
>>> The rest is not a specific scenario, but a strong feeling that having an
>>> image opened twice at the same time feels dangerous. As soon as an
>>> open/close sequence writes to the image for some format, we probably
>>> have a bug. For example, what about this mounted flag that you were
>>> discussing for QED?
>>>        
>> There is some room left to work in, even if we can't check in open().
>> One idea would be to do the check asynchronously once I/O begins.  It
>> is actually easy to check L1/L2 tables as they are loaded.
>>
>> The only barrier relationship between I/O and checking is that an
>> allocating write (which will need to update L1/L2 tables) is only
>> allowed after check completes.  Otherwise reads and non-allocating
>> writes may proceed while the image is not yet fully checked.  We can
>> detect when a table element is an invalid offset and discard it.
>>      
> I'm not even talking about such complicated things. You wanted to have a
> dirty flag in the header, right? So when we allow opening an image
> twice, you get this sequence with migration:
>
> Source: open
> Destination: open (with dirty image)
> Source: close
>
> The image is now marked as clean, even though the destination is still
> working on it.
>    

The dirty flag should be read on demand (which is the first time we 
fetch an L1/L2 table).

I agree that the life cycle of the block drivers is getting fuzzy.  Need 
to think quite a bit here.

Regards,

Anthony Liguori

> Kevin
>    


* Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 20:09                 ` Anthony Liguori
@ 2010-09-14  8:28                   ` Kevin Wolf
  0 siblings, 0 replies; 51+ messages in thread
From: Kevin Wolf @ 2010-09-14  8:28 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Stefan Hajnoczi, qemu-devel, Stefan Hajnoczi, Juan Quintela

Am 13.09.2010 22:09, schrieb Anthony Liguori:
> On 09/13/2010 03:03 PM, Kevin Wolf wrote:
>> Am 13.09.2010 21:29, schrieb Stefan Hajnoczi:
>>    
>>> On Mon, Sep 13, 2010 at 3:13 PM, Kevin Wolf<kwolf@redhat.com>  wrote:
>>>      
>>>> Am 13.09.2010 15:42, schrieb Anthony Liguori:
>>>>        
>>>>> On 09/13/2010 08:39 AM, Kevin Wolf wrote:
>>>>>          
>>>>>>> Yeah, one of the key design points of live migration is to minimize the
>>>>>>> number of failure scenarios where you lose a VM.  If someone typed the
>>>>>>> wrong command line or shared storage hasn't been mounted yet and we
>>>>>>> delay failure until live migration is in the critical path, that would
>>>>>>> be terribly unfortunate.
>>>>>>>
>>>>>>>              
>>>>>> We would catch most of them if we try to open the image when migration
>>>>>> starts and immediately close it again until migration is (almost)
>>>>>> completed, so that no other code can possibly use it before the source
>>>>>> has really closed it.
>>>>>>
>>>>>>            
>>>>> I think the only real advantage is that we fix NFS migration, right?
>>>>>          
>>>> That's the one that we know about, yes.
>>>>
>>>> The rest is not a specific scenario, but a strong feeling that having an
>>>> image opened twice at the same time feels dangerous. As soon as an
>>>> open/close sequence writes to the image for some format, we probably
>>>> have a bug. For example, what about this mounted flag that you were
>>>> discussing for QED?
>>>>        
>>> There is some room left to work in, even if we can't check in open().
>>> One idea would be to do the check asynchronously once I/O begins.  It
>>> is actually easy to check L1/L2 tables as they are loaded.
>>>
>>> The only barrier relationship between I/O and checking is that an
>>> allocating write (which will need to update L1/L2 tables) is only
>>> allowed after check completes.  Otherwise reads and non-allocating
>>> writes may proceed while the image is not yet fully checked.  We can
>>> detect when a table element is an invalid offset and discard it.
>>>      
>> I'm not even talking about such complicated things. You wanted to have a
>> dirty flag in the header, right? So when we allow opening an image
>> twice, you get this sequence with migration:
>>
>> Source: open
>> Destination: open (with dirty image)
>> Source: close
>>
>> The image is now marked as clean, even though the destination is still
>> working on it.
>>    
> 
> The dirty flag should be read on demand (which is the first time we 
> fetch an L1/L2 table).
> 
> I agree that the life cycle of the block drivers is getting fuzzy.  Need 
> to think quite a bit here.

The point is that after the close the flag is wrong on disk. That's
completely unrelated to when you read it. If anything, you could also
delay writing the flag until the first write, so that the close comes first.

But honestly, I'm not very excited about pulling half of the open()
logic into read/write functions.

Kevin


* Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13 14:34             ` Anthony Liguori
@ 2010-09-14  9:47               ` Avi Kivity
  2010-09-14 12:51                 ` Anthony Liguori
  0 siblings, 1 reply; 51+ messages in thread
From: Avi Kivity @ 2010-09-14  9:47 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Juan Quintela, qemu-devel, Stefan Hajnoczi

  On 09/13/2010 04:34 PM, Anthony Liguori wrote:
> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>> I think the only real advantage is that we fix NFS migration, right?
>> That's the one that we know about, yes.
>>
>> The rest is not a specific scenario, but a strong feeling that having an
>> image opened twice at the same time feels dangerous.
>
> We've never really had clear semantics about live migration and block 
> driver's life cycles.  At a high level, for live migration to work, we 
> need the following sequence:
>
> 1) src> flush all pending writes to disk
> 2) <barrier>
> 3) dst> invalidate any cached data
> 4) dst> start guest

That's pretty complicated, compared to

1) src> close
2) dst> open
3) dst> start guest

There are two failure scenarios with this model:

1. dst cannot open the image

We fix that by killing dst and continuing src (which has to re-open its 
images).

2. dst cannot open the image, and src cannot as well

In this case, what would be gained by having an image handle open in one 
of the hosts, but no way to open it again?  As soon as the surviving 
qemu exited (or crashed), the image would be lost for ever.

To get (1) working correctly we need an event that tells management that 
all initialization has completed and the guest is ready to run (so 
management can terminate the source).

-- 
error compiling committee.c: too many arguments to function


* Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-14  9:47               ` Avi Kivity
@ 2010-09-14 12:51                 ` Anthony Liguori
  2010-09-14 13:16                   ` Avi Kivity
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-14 12:51 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, Juan Quintela, qemu-devel, Stefan Hajnoczi

On 09/14/2010 04:47 AM, Avi Kivity wrote:
>  On 09/13/2010 04:34 PM, Anthony Liguori wrote:
>> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>>> I think the only real advantage is that we fix NFS migration, right?
>>> That's the one that we know about, yes.
>>>
>>> The rest is not a specific scenario, but a strong feeling that 
>>> having an
>>> image opened twice at the same time feels dangerous.
>>
>> We've never really had clear semantics about live migration and block 
>> driver's life cycles.  At a high level, for live migration to work, 
>> we need the following sequence:
>>
>> 1) src> flush all pending writes to disk
>> 2) <barrier>
>> 3) dst> invalidate any cached data
>> 4) dst> start guest
>
> That's pretty complicated, compared to
>
> 1) src> close

1.5) <barrier>

> 2) dst> open
> 3) dst> start guest

You need to make sure the open happens *after* the close.

You're just using close to flush all pending writes or open to 
invalidate any cached data.

Regards,

Anthony Liguori

> There are two failure scenarios with this model:
>
> 1. dst cannot open the image
>
> We fix that by killing dst and continuing src (which has to re-open 
> its images).
>
> 2. dst cannot open the image, and src cannot as well
>
> In this case, what would be gained by having an image handle open in 
> one of the hosts, but no way to open it again?  As soon as the 
> surviving qemu exited (or crashed), the image would be lost for ever.
>
> To get (1) working correctly we need an event that tells management 
> that all initialization has completed and the guest is ready to run 
> (so management can terminate the source).
>


* Re: [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-14 12:51                 ` Anthony Liguori
@ 2010-09-14 13:16                   ` Avi Kivity
  0 siblings, 0 replies; 51+ messages in thread
From: Avi Kivity @ 2010-09-14 13:16 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, Juan Quintela, qemu-devel, Stefan Hajnoczi

  On 09/14/2010 02:51 PM, Anthony Liguori wrote:
> On 09/14/2010 04:47 AM, Avi Kivity wrote:
>>  On 09/13/2010 04:34 PM, Anthony Liguori wrote:
>>> On 09/13/2010 09:13 AM, Kevin Wolf wrote:
>>>>> I think the only real advantage is that we fix NFS migration, right?
>>>> That's the one that we know about, yes.
>>>>
>>>> The rest is not a specific scenario, but a strong feeling that 
>>>> having an
>>>> image opened twice at the same time feels dangerous.
>>>
>>> We've never really had clear semantics about live migration and 
>>> block driver's life cycles.  At a high level, for live migration to 
>>> work, we need the following sequence:
>>>
>>> 1) src> flush all pending writes to disk
>>> 2) <barrier>
>>> 3) dst> invalidate any cached data
>>> 4) dst> start guest
>>
>> That's pretty complicated, compared to
>>
>> 1) src> close
>
> 1.5) <barrier>
>
>> 2) dst> open
>> 3) dst> start guest
>
> You need to make sure the open happens *after* the close.
>
> You're just using close to flush all pending writes or open to 
> invalidate any cached data.


Right, that's simpler than teaching all block format drivers about 
invalidating any cached data, or using nfs locks to force the host to do 
the same.

It also complies with close-to-open consistency without relying on 
implementation details.


-- 
error compiling committee.c: too many arguments to function


* [Qemu-devel] Re: [PATCH 1/3] block: allow migration to work with image files
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files Anthony Liguori
  2010-09-12 10:37   ` Avi Kivity
  2010-09-13  8:21   ` Kevin Wolf
@ 2010-09-15 15:53   ` Juan Quintela
  2 siblings, 0 replies; 51+ messages in thread
From: Juan Quintela @ 2010-09-15 15:53 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

Anthony Liguori <aliguori@us.ibm.com> wrote:
> Image files have two types of data: immutable data that describes things like
> image size, backing files, etc. and mutable data that includes offset and
> reference count tables.
>
> Today, image formats aggressively cache mutable data to improve performance.  In
> some cases, this happens before a guest even starts.  When dealing with live
> migration, since a file is open on two machines, the caching of meta data can
> lead to data corruption.
>
> This patch addresses this by introducing a mechanism to invalidate any cached
> mutable data a block driver may have which is then used by the live migration
> code.
>
> NB, this still requires coherent shared storage.  Addressing migration without
> coherent shared storage (i.e. NFS) requires additional work.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>

For the NFS case, we just need 2 different cases:
- outgoing: nothing to do (generic code already do an fsync())
- incoming: we need to reopen the image.

For the rest, I agree.  Once told that, I can implement my changes here.

Later, Juan.


* [Qemu-devel] Re: [PATCH 1/3] block: allow migration to work with image files
  2010-09-12 13:28       ` Avi Kivity
  2010-09-12 15:26         ` Anthony Liguori
@ 2010-09-15 15:57         ` Juan Quintela
  1 sibling, 0 replies; 51+ messages in thread
From: Juan Quintela @ 2010-09-15 15:57 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

Avi Kivity <avi@redhat.com> wrote:
>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>
>> Backing files and logical size shouldn't change during live migration.
>
> Why not?

It is supposed to be the same on both sides O:-)
If you want to extend a disk, you need to tell qemu about that.  Either
before or after migration, but not "during" migration.

>
>>
>> But even so, I think the interface make sense.  It's basically, drop
>> anything you have cached that may change during migration.  What
>> needs to be read is dependent on features/format.
>
> Sure.  Certainly as an incremental step.

Later, Juan.


* [Qemu-devel] Re: [PATCH 1/3] block: allow migration to work with image files
  2010-09-12 17:51               ` Avi Kivity
@ 2010-09-15 16:00                 ` Juan Quintela
  0 siblings, 0 replies; 51+ messages in thread
From: Juan Quintela @ 2010-09-15 16:00 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

Avi Kivity <avi@redhat.com> wrote:
>  On 09/12/2010 07:10 PM, Anthony Liguori wrote:
>> On 09/12/2010 11:06 AM, Avi Kivity wrote:
>>>  On 09/12/2010 05:26 PM, Anthony Liguori wrote:
>>>> On 09/12/2010 08:28 AM, Avi Kivity wrote:
>>>>>  On 09/12/2010 03:06 PM, Anthony Liguori wrote:
>>>>>>
>>>>>> Backing files and logical size shouldn't change during live
>>>>>> migration.
>>>>>
>>>>> Why not?
>>>>
>>>> To make our lives easier.
>>>
>>> It means management needs to block volume resize while a live
>>> migration takes place.  Since live migration is typically done by
>>> the system automatically, while volume resize happens in response
>>> to user request, this isn't a good idea.  Both in terms of user
>>> experience, and in terms of pushing more complexity to management.
>>
>> We don't do volume resize today so it's a moot point.
>>
>
> Let's be prepared for the future.

This is better done once we are able to create the machine during
migration.  As we have everything organized today, this is basically
impossible :(

The problem is not related to live migration; it is related to the fact
that we need to give the same arguments on both the source & target machines.

Later, Juan.


* [Qemu-devel] Re: [PATCH 1/3] block: allow migration to work with image files
  2010-09-13  8:21   ` Kevin Wolf
  2010-09-13 13:27     ` Anthony Liguori
@ 2010-09-15 16:03     ` Juan Quintela
  2010-09-16  7:54       ` Kevin Wolf
  1 sibling, 1 reply; 51+ messages in thread
From: Juan Quintela @ 2010-09-15 16:03 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, qemu-devel, Stefan Hajnoczi

Kevin Wolf <kwolf@redhat.com> wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:

> There is a simple generic implementation:
>
> drv = bs->drv;
> drv->close(bs);
> drv->open(bs, bs->open_flags);
>
> Note that this only reopens e.g. the qcow2 layer, but not the image
> file, which is bs->file.
>
> This works for all simple case, that is, one format on top of one or
> more protocols, where the protocols don't need invalidation. I think
> this includes everything that is possible today. With -blockdev we might
> need to revise this to include the lower layers, too. (But only
> sometimes, because we don't want to reopen the file)

we require it for nfs consistency.  We need to:

source: fsync()
target: open() (after the fsync).

About your 1st comment, I am doing:

bdrv_close(bdrv);
bdrv_open(bdrv);

Do you mean that this doesn't close and reopen the file with qcow2?

Later, Juan.


* [Qemu-devel] Re: [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing Anthony Liguori
  2010-09-11 16:53   ` Stefan Hajnoczi
@ 2010-09-15 16:06   ` Juan Quintela
  2010-09-16 15:40     ` Anthony Liguori
  2010-09-16  8:08   ` Kevin Wolf
  2 siblings, 1 reply; 51+ messages in thread
From: Juan Quintela @ 2010-09-15 16:06 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

Anthony Liguori <aliguori@us.ibm.com> wrote:
> The use of protocols in backing_files is currently broken because of some
> checks for adjusting relative pathnames.
>
> Additionally, there's a spurious read when using an nbd protocol that can be
> quite destructive when using copy-on-read.  Potentially, this can lead to
> probing an image file over top of NBD but this is completely wrong as NBD
> devices are not growable.
>
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> ---
> NB: this is absolutely not ideal.  A more elegant suggestion would be
> appreciated.  I don't think NBD cleanly fits the model of a protocol as it
> stands today.

Bad, bad boy, you fixed two things in a single patch.

>
> diff --git a/block.c b/block.c
> index cd2ee31..a32d5dd 100644
> --- a/block.c
> +++ b/block.c
> @@ -344,6 +344,12 @@ static int find_image_format(const char *filename, BlockDriver **pdrv)
>          return ret;
>      }
>  
> +    if (strcmp(bs->drv->protocol_name, "nbd") == 0) {
> +        drv = bs->drv;
> +        bdrv_delete(bs);
> +        goto out;
> +    }
> +
>      /* Return the raw BlockDriver * to scsi-generic devices or empty drives */
>      if (bs->sg || !bdrv_is_inserted(bs)) {
>          bdrv_delete(bs);
> @@ -373,6 +379,7 @@ static int find_image_format(const char *filename, BlockDriver **pdrv)
>              }
>          }
>      }
> +out:
>      if (!drv) {
>          ret = -ENOENT;
>      }

I have no opinion about this change.

> @@ -603,10 +610,16 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>          BlockDriver *back_drv = NULL;
>  
>          bs->backing_hd = bdrv_new("");
> -        path_combine(backing_filename, sizeof(backing_filename),
> -                     filename, bs->backing_file);
> -        if (bs->backing_format[0] != '\0')
> -            back_drv = bdrv_find_format(bs->backing_format);
> +        back_drv = bdrv_find_protocol(bs->backing_file);
> +        if (!back_drv) {
> +            path_combine(backing_filename, sizeof(backing_filename),
> +                         filename, bs->backing_file);
> +            if (bs->backing_format[0] != '\0')
> +                back_drv = bdrv_find_format(bs->backing_format);
> +        } else {
> +            pstrcpy(backing_filename, sizeof(backing_filename),
> +                    bs->backing_file);
> +        }
>  
>          /* backing files always opened read-only */
>          back_flags =

But this one breaks my setup; I have to back out this patch to be able to
launch guests with qcow2 file images.

Later, Juan.


* [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-12 13:08     ` Anthony Liguori
  2010-09-12 13:26       ` Avi Kivity
@ 2010-09-15 16:10       ` Juan Quintela
  1 sibling, 0 replies; 51+ messages in thread
From: Juan Quintela @ 2010-09-15 16:10 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Kevin Wolf, Anthony Liguori, Avi Kivity, Stefan Hajnoczi, qemu-devel

Anthony Liguori <anthony@codemonkey.ws> wrote:
> On 09/12/2010 05:42 AM, Avi Kivity wrote:
>>  On 09/11/2010 05:04 PM, Anthony Liguori wrote:
>>> This fixes a couple nasty problems relating to live migration.
>>>
>>> 1) When dealing with shared storage with weak coherence (i.e. NFS),
>>> even if
>>>     we re-read, we may end up with undesired caching.  By delaying
>>> any reads
>>>     until we absolutely have to, we decrease the likelihood of any
>>> undesirable
>>>     caching.
>>>
>>> 2) When dealing with copy-on-read, the local storage acts as a
>>> cache.  We need
>>>     to make sure to avoid any reads to avoid polluting the local cache.
>>>
>>> +
>>>   static void ide_identify(IDEState *s)
>>>   {
>>>       uint16_t *p;
>>> @@ -105,6 +132,8 @@ static void ide_identify(IDEState *s)
>>>       return;
>>>       }
>>>
>>> +    guess_geometry(s);
>>> +
>>
>> This can cause a disk read, no?  Shouldn't it be made asynchronous?
>
> Yes, it should.  I'm not sure there's an easy way to make it
> asynchronous though not because of the block layer but because of how
> these functions are called.
>
>>
>> Or just move it to just before the guest starts?
>
> We don't really have a notion of "guest starts" today although maybe
> we should.

I agree that we should have more states.  "Expecting migration"/"migration
done" are quite special cases that are handled ad hoc now.  Having
some state like NeverRun would make it trivial to have vm_start() do the
right thing on the 1st run.

Later, Juan.

> Regards,
>
> Anthony Liguori


* [Qemu-devel] Re: [PATCH 3/3] disk: don't read from disk until the guest starts
  2010-09-13  8:32   ` Kevin Wolf
  2010-09-13 13:29     ` Anthony Liguori
@ 2010-09-15 16:16     ` Juan Quintela
  1 sibling, 0 replies; 51+ messages in thread
From: Juan Quintela @ 2010-09-15 16:16 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, qemu-devel, Stefan Hajnoczi

Kevin Wolf <kwolf@redhat.com> wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>> This fixes a couple nasty problems relating to live migration.
>> 
>> 1) When dealing with shared storage with weak coherence (i.e. NFS), even if
>>    we re-read, we may end up with undesired caching.  By delaying any reads
>>    until we absolutely have to, we decrease the likelihood of any undesirable
>>    caching.
>> 
>> 2) When dealing with copy-on-read, the local storage acts as a cache.  We need
>>    to make sure to avoid any reads to avoid polluting the local cache.
>> 
>> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
>
> I think we should also delay even opening the image file at all to the
> latest possible point to avoid that new problems of this kind are
> introduced. Ideally, opening the image would be the very last thing we
> do before telling the migration source that we're set and it should
> close the images.
>
> Even better would be to only open the image when the source has already
> closed it (we could completely avoid the invalidation/reopen then), but
> I think you were afraid that we might lose the image on both ends.

But then we have formats like qcow2 that store the "number_of_sectors"
inside the file.

And we need the size of the disk to initialize some structures :(

About having the image opened in two different machines .....

- Having it opened in two places makes it "interesting" for coherence reasons.
- Not having it makes the failure case of anything failing during
  migration ... interesting.

The real solution for "this" problem is to just send the
size/head/cylinder/sector ... values in the migration stream and not have
to open the image.

The problem here is that, with qemu's architecture, we have to know these
values before we call the migration functions.

As far as I can see, until we are able to create a machine from
inside the monitor (i.e. qemu initialization & machine creation are
decoupled), we are not going to be able to fix migration properly (using
that very same mechanism to create the machine during migration).

Later, Juan.


* [Qemu-devel] Re: [PATCH 1/3] block: allow migration to work with image files
  2010-09-15 16:03     ` Juan Quintela
@ 2010-09-16  7:54       ` Kevin Wolf
  0 siblings, 0 replies; 51+ messages in thread
From: Kevin Wolf @ 2010-09-16  7:54 UTC (permalink / raw)
  To: Juan Quintela; +Cc: Anthony Liguori, qemu-devel, Stefan Hajnoczi

Am 15.09.2010 18:03, schrieb Juan Quintela:
> Kevin Wolf <kwolf@redhat.com> wrote:
>> Am 11.09.2010 16:04, schrieb Anthony Liguori:
> 
>> There is a simple generic implementation:
>>
>> drv = bs->drv;
>> drv->close(bs);
>> drv->open(bs, bs->open_flags);
>>
>> Note that this only reopens e.g. the qcow2 layer, but not the image
>> file, which is bs->file.
>>
>> This works for all simple case, that is, one format on top of one or
>> more protocols, where the protocols don't need invalidation. I think
>> this includes everything that is possible today. With -blockdev we might
>> need to revise this to include the lower layers, too. (But only
>> sometimes, because we don't want to reopen the file)
> 
> we require it for nfs consistency.  We need to:
> 
> source: fsync()
> target: open() (after the fsync).
> 
> About your 1st comment, I am doing:
> 
> bdrv_close(bdrv);
> bdrv_open(bdrv);
> 
> Do you mean that this doesn't close and open the file again with qcow2?

It does. bdrv_open/close take care of the whole stack.

Kevin
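A toy model of the generic invalidation Kevin describes above: call the format driver's close() and then open() again, so cached metadata (e.g. qcow2's L1 and refcount tables) is re-read from the file. The structs are simplified stand-ins for QEMU's BlockDriverState/BlockDriver, with a cache_generation counter standing in for the metadata cache being refreshed:

```c
/* Simplified stand-ins for QEMU's block layer types; names and fields
 * are illustrative only. */
#include <assert.h>

typedef struct BDRVState BDRVState;

typedef struct BlockDrv {
    int  (*open)(BDRVState *bs, int flags);
    void (*close)(BDRVState *bs);
} BlockDrv;

struct BDRVState {
    const BlockDrv *drv;
    int open_flags;
    int cache_generation;  /* bumped on each open, modelling metadata re-read */
};

static int toy_open(BDRVState *bs, int flags)
{
    (void)flags;
    bs->cache_generation++;  /* "prefetch the L1 table" etc. */
    return 0;
}

static void toy_close(BDRVState *bs)
{
    (void)bs;                /* "drop cached metadata" */
}

static const BlockDrv toy_drv = { toy_open, toy_close };

/* The generic implementation from the thread: reopen only the format
 * layer, leaving the underlying file alone. */
static int bdrv_invalidate_cache(BDRVState *bs)
{
    bs->drv->close(bs);
    return bs->drv->open(bs, bs->open_flags);
}
```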


* [Qemu-devel] Re: [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-11 14:04 ` [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing Anthony Liguori
  2010-09-11 16:53   ` Stefan Hajnoczi
  2010-09-15 16:06   ` [Qemu-devel] " Juan Quintela
@ 2010-09-16  8:08   ` Kevin Wolf
  2010-09-16 13:00     ` Anthony Liguori
  2 siblings, 1 reply; 51+ messages in thread
From: Kevin Wolf @ 2010-09-16  8:08 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: Juan Quintela, qemu-devel, Stefan Hajnoczi

Am 11.09.2010 16:04, schrieb Anthony Liguori:
> The use of protocols in backing_files is currently broken because of some
> checks for adjusting relative pathnames.
> 
> Additionally, there's a spurious read when using an nbd protocol that can be
> quite destructive when using copy-on-read.  Potentially, this can lead to
> probing an image file over top of NBD but this is completely wrong as NBD
> devices are not growable.
> 
> Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
> ---
> NB: this is absolutely not ideal.  A more elegant suggestion would be
> appreciated.  I don't think NBD cleanly fits the model of a protocol as it
> stands today.
> 
> diff --git a/block.c b/block.c
> index cd2ee31..a32d5dd 100644
> --- a/block.c
> +++ b/block.c
> @@ -344,6 +344,12 @@ static int find_image_format(const char *filename, BlockDriver **pdrv)
>          return ret;
>      }
>  
> +    if (strcmp(bs->drv->protocol_name, "nbd") == 0) {
> +        drv = bs->drv;
> +        bdrv_delete(bs);
> +        goto out;
> +    }

Is nbd really the only protocol that behaves like this? I don't like
hardcoding driver names in generic block layer code.

Kevin


* [Qemu-devel] Re: [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-16  8:08   ` Kevin Wolf
@ 2010-09-16 13:00     ` Anthony Liguori
  2010-09-16 14:08       ` Kevin Wolf
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-16 13:00 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Anthony Liguori, Juan Quintela, qemu-devel, Stefan Hajnoczi

On 09/16/2010 03:08 AM, Kevin Wolf wrote:
> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>    
>> The use of protocols in backing_files is currently broken because of some
>> checks for adjusting relative pathnames.
>>
>> Additionally, there's a spurious read when using an nbd protocol that can be
>> quite destructive when using copy-on-read.  Potentially, this can lead to
>> probing an image file over top of NBD but this is completely wrong as NBD
>> devices are not growable.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>> ---
>> NB: this is absolutely not ideal.  A more elegant suggestion would be
>> appreciated.  I don't think NBD cleanly fits the model of a protocol as it
>> stands today.
>>
>> diff --git a/block.c b/block.c
>> index cd2ee31..a32d5dd 100644
>> --- a/block.c
>> +++ b/block.c
>> @@ -344,6 +344,12 @@ static int find_image_format(const char *filename, BlockDriver **pdrv)
>>           return ret;
>>       }
>>
>> +    if (strcmp(bs->drv->protocol_name, "nbd") == 0) {
>> +        drv = bs->drv;
>> +        bdrv_delete(bs);
>> +        goto out;
>> +    }
>>      
> Is nbd really the only protocol that behaves like this? I don't like
> hardcoding driver names in generic block layer code.
>    

I'll drop this chunk from the patch, as using backing_fmt achieves the 
same goal.  The important hunk is the next one, which removes the 
assumption that URIs are filenames.

Regards,

Anthony Liguori

> Kevin
>    


* Re: [Qemu-devel] Re: [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-16 13:00     ` Anthony Liguori
@ 2010-09-16 14:08       ` Kevin Wolf
  0 siblings, 0 replies; 51+ messages in thread
From: Kevin Wolf @ 2010-09-16 14:08 UTC (permalink / raw)
  To: Anthony Liguori
  Cc: Anthony Liguori, qemu-devel, Stefan Hajnoczi, Juan Quintela

Am 16.09.2010 15:00, schrieb Anthony Liguori:
> On 09/16/2010 03:08 AM, Kevin Wolf wrote:
>> Am 11.09.2010 16:04, schrieb Anthony Liguori:
>>    
>>> The use of protocols in backing_files is currently broken because of some
>>> checks for adjusting relative pathnames.
>>>
>>> Additionally, there's a spurious read when using an nbd protocol that can be
>>> quite destructive when using copy-on-read.  Potentially, this can lead to
>>> probing an image file over top of NBD but this is completely wrong as NBD
>>> devices are not growable.
>>>
>>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>> ---
>>> NB: this is absolutely not ideal.  A more elegant suggestion would be
>>> appreciated.  I don't think NBD cleanly fits the model of a protocol as it
>>> stands today.
>>>
>>> diff --git a/block.c b/block.c
>>> index cd2ee31..a32d5dd 100644
>>> --- a/block.c
>>> +++ b/block.c
>>> @@ -344,6 +344,12 @@ static int find_image_format(const char *filename, BlockDriver **pdrv)
>>>           return ret;
>>>       }
>>>
>>> +    if (strcmp(bs->drv->protocol_name, "nbd") == 0) {
>>> +        drv = bs->drv;
>>> +        bdrv_delete(bs);
>>> +        goto out;
>>> +    }
>>>      
>> Is nbd really the only protocol that behaves like this? I don't like
>> hardcoding driver names in generic block layer code.
>>    
> 
> I'll drop this chunk from the patch as using backing_fmt achieves the 
> same goal.  The important hunk is the next one that removes assumptions 
> that URIs are filenames.

Right, but the next hunk is broken, as Juan already mentioned. You always
get a protocol back, so with this patch applied you never resolve relative
paths for backing files.

Kevin


* Re: [Qemu-devel] Re: [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-15 16:06   ` [Qemu-devel] " Juan Quintela
@ 2010-09-16 15:40     ` Anthony Liguori
  2010-09-17  8:53       ` Kevin Wolf
  0 siblings, 1 reply; 51+ messages in thread
From: Anthony Liguori @ 2010-09-16 15:40 UTC (permalink / raw)
  To: Juan Quintela; +Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi

On 09/15/2010 11:06 AM, Juan Quintela wrote:
> Anthony Liguori<aliguori@us.ibm.com>  wrote:
>    
>> The use of protocols in backing_files is currently broken because of some
>> checks for adjusting relative pathnames.
>>
>> Additionally, there's a spurious read when using an nbd protocol that can be
>> quite destructive when using copy-on-read.  Potentially, this can lead to
>> probing an image file over top of NBD but this is completely wrong as NBD
>> devices are not growable.
>>
>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>> ---
>> NB: this is absolutely not ideal.  A more elegant suggestion would be
>> appreciated.  I don't think NBD cleanly fits the model of a protocol as it
>> stands today.
>>      
> Bad, bad boy, you fixed two things in a single patch.
>    

You're not supposed to notice that :-)  It was an RFC series, I was 
really just looking to start a conversation.

>> @@ -603,10 +610,16 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>>           BlockDriver *back_drv = NULL;
>>
>>           bs->backing_hd = bdrv_new("");
>> -        path_combine(backing_filename, sizeof(backing_filename),
>> -                     filename, bs->backing_file);
>> -        if (bs->backing_format[0] != '\0')
>> -            back_drv = bdrv_find_format(bs->backing_format);
>> +        back_drv = bdrv_find_protocol(bs->backing_file);
>> +        if (!back_drv) {
>> +            path_combine(backing_filename, sizeof(backing_filename),
>> +                         filename, bs->backing_file);
>> +            if (bs->backing_format[0] != '\0')
>> +                back_drv = bdrv_find_format(bs->backing_format);
>> +        } else {
>> +            pstrcpy(backing_filename, sizeof(backing_filename),
>> +                    bs->backing_file);
>> +        }
>>
>>           /* backing files always opened read-only */
>>           back_flags =
>>      
> But this one breaks my setup, I have to backout this patch to be able to
> launch guests with qcow2 file images.
>    

Kevin, do you have an opinion about the best way to resolve this?

The relative/absolute path is broken for non-file URIs.  I think we 
could mark a protocol as having a file URI type or we could push the 
absolute/relative conversion down to the file/phys_dev protocol.

I think the former approach is probably the least invasive.
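The decision the hunk above needs to make can be sketched as: copy the backing-file string verbatim only when it carries a protocol prefix, otherwise resolve it relative to the image, the way path_combine() does. is_protocol_uri() is a simplified stand-in for bdrv_find_protocol(), and the whole sketch ignores real-world wrinkles like Windows drive letters:

```c
/* Illustrative only: distinguish "nbd:..."-style protocol names from
 * plain paths when resolving a qcow2 backing file. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

static int is_protocol_uri(const char *name)
{
    const char *p = strchr(name, ':');

    /* Require an alphabetic "proto:" prefix; a plain path has no colon. */
    if (!p || p == name) {
        return 0;
    }
    for (const char *q = name; q < p; q++) {
        if (!((*q >= 'a' && *q <= 'z') || (*q >= 'A' && *q <= 'Z'))) {
            return 0;
        }
    }
    return 1;
}

static void resolve_backing(char *dst, size_t len,
                            const char *image, const char *backing)
{
    if (is_protocol_uri(backing) || backing[0] == '/') {
        /* Protocol or absolute path: use as-is. */
        snprintf(dst, len, "%s", backing);
    } else {
        /* Relative name: resolve against the image's directory,
         * like path_combine() in block.c. */
        const char *slash = strrchr(image, '/');
        if (slash) {
            snprintf(dst, len, "%.*s%s",
                     (int)(slash - image + 1), image, backing);
        } else {
            snprintf(dst, len, "%s", backing);
        }
    }
}
```

Marking protocols that *do* use file-style URIs (the "former approach" above) would then just be a flag consulted inside is_protocol_uri().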

Regards,

Anthony Liguori

> Later, Juan.
>
>    


* Re: [Qemu-devel] Re: [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing
  2010-09-16 15:40     ` Anthony Liguori
@ 2010-09-17  8:53       ` Kevin Wolf
  0 siblings, 0 replies; 51+ messages in thread
From: Kevin Wolf @ 2010-09-17  8:53 UTC (permalink / raw)
  To: Anthony Liguori; +Cc: qemu-devel, Stefan Hajnoczi, Juan Quintela

Am 16.09.2010 17:40, schrieb Anthony Liguori:
> On 09/15/2010 11:06 AM, Juan Quintela wrote:
>> Anthony Liguori<aliguori@us.ibm.com>  wrote:
>>    
>>> The use of protocols in backing_files is currently broken because of some
>>> checks for adjusting relative pathnames.
>>>
>>> Additionally, there's a spurious read when using an nbd protocol that can be
>>> quite destructive when using copy-on-read.  Potentially, this can lead to
>>> probing an image file over top of NBD but this is completely wrong as NBD
>>> devices are not growable.
>>>
>>> Signed-off-by: Anthony Liguori<aliguori@us.ibm.com>
>>> ---
>>> NB: this is absolutely not ideal.  A more elegant suggestion would be
>>> appreciated.  I don't think NBD cleanly fits the model of a protocol as it
>>> stands today.
>>>      
>> Bad, bad boy, you fixed two things in a single patch.
>>    
> 
> You're not supposed to notice that :-)  It was an RFC series, I was 
> really just looking to start a conversation.
> 
>>> @@ -603,10 +610,16 @@ int bdrv_open(BlockDriverState *bs, const char *filename, int flags,
>>>           BlockDriver *back_drv = NULL;
>>>
>>>           bs->backing_hd = bdrv_new("");
>>> -        path_combine(backing_filename, sizeof(backing_filename),
>>> -                     filename, bs->backing_file);
>>> -        if (bs->backing_format[0] != '\0')
>>> -            back_drv = bdrv_find_format(bs->backing_format);
>>> +        back_drv = bdrv_find_protocol(bs->backing_file);
>>> +        if (!back_drv) {
>>> +            path_combine(backing_filename, sizeof(backing_filename),
>>> +                         filename, bs->backing_file);
>>> +            if (bs->backing_format[0] != '\0')
>>> +                back_drv = bdrv_find_format(bs->backing_format);
>>> +        } else {
>>> +            pstrcpy(backing_filename, sizeof(backing_filename),
>>> +                    bs->backing_file);
>>> +        }
>>>
>>>           /* backing files always opened read-only */
>>>           back_flags =
>>>      
>> But this one breaks my setup, I have to backout this patch to be able to
>> launch guests with qcow2 file images.
>>    
> 
> Kevin, do you have an opinion about the best way to resolve this?
> 
> The relative/absolute path is broken for non-file URIs.  I think we 
> could mark a protocol as having a file URI type or we could push the 
> absolute/relative conversion down to the file/phys_dev protocol.
> 
> I think the former approach is probably the least invasive.

I agree. It gets a bit hard for things like blkdebug or blkverify, which
actually get two file names, but I guess they usually won't be used as
backing files; and if they are, it's probably reasonable to require
absolute paths or to resolve them relative to the working directory
instead of the image. They are only for debugging anyway.

Kevin


end of thread, other threads:[~2010-09-17  8:53 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
2010-09-11 14:04 [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Anthony Liguori
2010-09-11 14:04 ` [Qemu-devel] [PATCH 1/3] block: allow migration to work with image files Anthony Liguori
2010-09-12 10:37   ` Avi Kivity
2010-09-12 13:06     ` Anthony Liguori
2010-09-12 13:28       ` Avi Kivity
2010-09-12 15:26         ` Anthony Liguori
2010-09-12 16:06           ` Avi Kivity
2010-09-12 17:10             ` Anthony Liguori
2010-09-12 17:51               ` Avi Kivity
2010-09-15 16:00                 ` [Qemu-devel] " Juan Quintela
2010-09-15 15:57         ` Juan Quintela
2010-09-13  8:21   ` Kevin Wolf
2010-09-13 13:27     ` Anthony Liguori
2010-09-15 16:03     ` Juan Quintela
2010-09-16  7:54       ` Kevin Wolf
2010-09-15 15:53   ` Juan Quintela
2010-09-11 14:04 ` [Qemu-devel] [PATCH 2/3] block-nbd: fix use of protocols in backing files and nbd probing Anthony Liguori
2010-09-11 16:53   ` Stefan Hajnoczi
2010-09-11 17:27     ` Anthony Liguori
2010-09-11 17:45       ` Anthony Liguori
2010-09-15 16:06   ` [Qemu-devel] " Juan Quintela
2010-09-16 15:40     ` Anthony Liguori
2010-09-17  8:53       ` Kevin Wolf
2010-09-16  8:08   ` Kevin Wolf
2010-09-16 13:00     ` Anthony Liguori
2010-09-16 14:08       ` Kevin Wolf
2010-09-11 14:04 ` [Qemu-devel] [PATCH 3/3] disk: don't read from disk until the guest starts Anthony Liguori
2010-09-11 17:24   ` Stefan Hajnoczi
2010-09-11 17:34     ` Anthony Liguori
2010-09-12 10:42   ` Avi Kivity
2010-09-12 13:08     ` Anthony Liguori
2010-09-12 13:26       ` Avi Kivity
2010-09-12 15:29         ` Anthony Liguori
2010-09-12 16:04           ` Avi Kivity
2010-09-15 16:10       ` [Qemu-devel] " Juan Quintela
2010-09-13  8:32   ` Kevin Wolf
2010-09-13 13:29     ` Anthony Liguori
2010-09-13 13:39       ` Kevin Wolf
2010-09-13 13:42         ` Anthony Liguori
2010-09-13 14:13           ` Kevin Wolf
2010-09-13 14:34             ` Anthony Liguori
2010-09-14  9:47               ` Avi Kivity
2010-09-14 12:51                 ` Anthony Liguori
2010-09-14 13:16                   ` Avi Kivity
2010-09-13 19:29             ` Stefan Hajnoczi
2010-09-13 20:03               ` Kevin Wolf
2010-09-13 20:09                 ` Anthony Liguori
2010-09-14  8:28                   ` Kevin Wolf
2010-09-15 16:16     ` Juan Quintela
2010-09-12 10:46 ` [Qemu-devel] [RFC][PATCH 0/3] Fix caching issues with live migration Avi Kivity
2010-09-12 13:12   ` Anthony Liguori
