All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack
@ 2014-12-09 18:15 Dr. David Alan Gilbert (git)
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 1/2] Restore atapi_dma flag across migration Dr. David Alan Gilbert (git)
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2014-12-09 18:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, jsnow

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

This pair of patches fixes a problem where IDE/ATAPI cdrom
reads get lost/corrupted over migration.

The first of the patches (restore atapi_dma flag) is
a simple fix that I think is safe; it no longer causes
corruption in the case we saw, but does still trigger
a long timeout.

The second is a hack; it throws a medium error causing
the guest to retry the command in the case where migration
happens just between the IDE/ATAPI command being submitted
and the bmdma being finished.   This recovers a lot
faster than the timeout.

Only tried on Linux guests so far; I think it might be possible
to replace both of these by reparsing the command buffer for
ATAPI; I'm just not confident I know when that's safe to do,
and I wanted to see how disgusted people were by the 2nd hack.

Dave

Dr. David Alan Gilbert (2):
  Restore atapi_dma flag across migration
  atapi migration: Throw recoverable error to avoid recovery

 hw/ide/atapi.c    | 17 +++++++++++++++++
 hw/ide/core.c     |  1 +
 hw/ide/internal.h |  2 ++
 hw/ide/pci.c      | 11 +++++++++++
 4 files changed, 31 insertions(+)

-- 
2.1.0

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [Qemu-devel] [PATCH 1/2] Restore atapi_dma flag across migration
  2014-12-09 18:15 [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack Dr. David Alan Gilbert (git)
@ 2014-12-09 18:15 ` Dr. David Alan Gilbert (git)
  2014-12-10  5:04   ` John Snow
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery Dr. David Alan Gilbert (git)
  2015-01-07 16:26 ` [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack Dr. David Alan Gilbert
  2 siblings, 1 reply; 15+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2014-12-09 18:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, jsnow

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

If a migration happens just after the guest has kicked
off an ATAPI command and kicked off DMA, we lose the atapi_dma
flag, and the destination tries to complete the command as PIO
rather than DMA.  This upsets Linux; modern libata based kernels
stumble and recover OK, older kernels end up passing bad data
to userspace.

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/ide/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index d4af5e2..ac3f015 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -2417,6 +2417,7 @@ static int ide_drive_pio_post_load(void *opaque, int version_id)
     s->end_transfer_func = transfer_end_table[s->end_transfer_fn_idx];
     s->data_ptr = s->io_buffer + s->cur_io_buffer_offset;
     s->data_end = s->data_ptr + s->cur_io_buffer_len;
+    s->atapi_dma = s->feature & 1; /* as per cmd_packet */
 
     return 0;
 }
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-09 18:15 [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack Dr. David Alan Gilbert (git)
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 1/2] Restore atapi_dma flag across migration Dr. David Alan Gilbert (git)
@ 2014-12-09 18:15 ` Dr. David Alan Gilbert (git)
  2014-12-10  6:14   ` John Snow
                     ` (2 more replies)
  2015-01-07 16:26 ` [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack Dr. David Alan Gilbert
  2 siblings, 3 replies; 15+ messages in thread
From: Dr. David Alan Gilbert (git) @ 2014-12-09 18:15 UTC (permalink / raw)
  To: qemu-devel; +Cc: pbonzini, jsnow

From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>

(With the previous atapi_dma flag recovery)
If migration happens between the ATAPI command being written and the
bmdma being started, the DMA is dropped.  Eventually the guest times
out and recovers, but that can take many seconds.
(This is rare, on a pingpong reading the CD continuously I hit
this about ~1/30-1/50 migrates)

I don't think we've got enough state to be able to recover safely
at this point, so I throw a 'medium error, no seek complete'
that I'm assuming guests will try and recover from an apparently
dirty CD.

OK, it's a hack, the real solution is probably to push a lot of
ATAPI state into the migration stream, but this is a fix that
works with no stream changes. Tested only on Linux (both RHEL5
(pre-libata) and RHEL7).

Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
---
 hw/ide/atapi.c    | 17 +++++++++++++++++
 hw/ide/internal.h |  2 ++
 hw/ide/pci.c      | 11 +++++++++++
 3 files changed, 30 insertions(+)

diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
index c63b7e5..e17799c 100644
--- a/hw/ide/atapi.c
+++ b/hw/ide/atapi.c
@@ -394,6 +394,23 @@ static void ide_atapi_cmd_read(IDEState *s, int lba, int nb_sectors,
     }
 }
 
+
+/* Called by *_restart_bh when the transfer function points
+ * to ide_atapi_cmd
+ */
+void ide_atapi_dma_restart(IDEState *s)
+{
+    /*
+     * I'm not sure we have enough stored to restart the command
+     * safely, so give the guest an error it should recover from.
+     * I'm assuming most guests will try to recover from something
+     * listed as a medium error on a CD; it seems to work on Linux.
+     * This would be more of a problem if we did any other type of
+     * DMA operation.
+     */
+    ide_atapi_cmd_error(s, MEDIUM_ERROR, ASC_NO_SEEK_COMPLETE);
+}
+
 static inline uint8_t ide_atapi_set_profile(uint8_t *buf, uint8_t *index,
                                             uint16_t profile)
 {
diff --git a/hw/ide/internal.h b/hw/ide/internal.h
index 8a3eca4..8b65285 100644
--- a/hw/ide/internal.h
+++ b/hw/ide/internal.h
@@ -289,6 +289,7 @@ typedef struct IDEDMAOps IDEDMAOps;
 #define ATAPI_INT_REASON_TAG            0xf8
 
 /* same constants as bochs */
+#define ASC_NO_SEEK_COMPLETE                 0x02
 #define ASC_ILLEGAL_OPCODE                   0x20
 #define ASC_LOGICAL_BLOCK_OOR                0x21
 #define ASC_INV_FIELD_IN_CMD_PACKET          0x24
@@ -529,6 +530,7 @@ void ide_dma_error(IDEState *s);
 
 void ide_atapi_cmd_ok(IDEState *s);
 void ide_atapi_cmd_error(IDEState *s, int sense_key, int asc);
+void ide_atapi_dma_restart(IDEState *s);
 void ide_atapi_io_error(IDEState *s, int ret);
 
 void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val);
diff --git a/hw/ide/pci.c b/hw/ide/pci.c
index bee5ad3..e3f2054 100644
--- a/hw/ide/pci.c
+++ b/hw/ide/pci.c
@@ -235,6 +235,17 @@ static void bmdma_restart_bh(void *opaque)
         }
     } else if (error_status & IDE_RETRY_FLUSH) {
         ide_flush_cache(bmdma_active_if(bm));
+    } else {
+        IDEState *s = bmdma_active_if(bm);
+
+        /*
+         * We've not got any bits to tell us about ATAPI - but
+         * we do have the end_transfer_func that tells us what
+         * we're trying to do.
+         */
+        if (s->end_transfer_func == ide_atapi_cmd) {
+            ide_atapi_dma_restart(s);
+        }
     }
 }
 
-- 
2.1.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 1/2] Restore atapi_dma flag across migration
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 1/2] Restore atapi_dma flag across migration Dr. David Alan Gilbert (git)
@ 2014-12-10  5:04   ` John Snow
  0 siblings, 0 replies; 15+ messages in thread
From: John Snow @ 2014-12-10  5:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel; +Cc: pbonzini



On 12/09/2014 01:15 PM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> If a migration happens just after the guest has kicked
> off an ATAPI command and kicked off DMA, we lose the atapi_dma
> flag, and the destination tries to complete the command as PIO
> rather than DMA.  This upsets Linux; modern libata based kernels
> stumble and recover OK, older kernels end up passing bad data
> to userspace.
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>   hw/ide/core.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/hw/ide/core.c b/hw/ide/core.c
> index d4af5e2..ac3f015 100644
> --- a/hw/ide/core.c
> +++ b/hw/ide/core.c
> @@ -2417,6 +2417,7 @@ static int ide_drive_pio_post_load(void *opaque, int version_id)
>       s->end_transfer_func = transfer_end_table[s->end_transfer_fn_idx];
>       s->data_ptr = s->io_buffer + s->cur_io_buffer_offset;
>       s->data_end = s->data_ptr + s->cur_io_buffer_len;
> +    s->atapi_dma = s->feature & 1; /* as per cmd_packet */
>
>       return 0;
>   }
>

This looks fine to me; though perhaps a more comprehensive fix might 
prevent us needing this at all. As the code exists today, it's 
definitely an error to lose this value, as you've noticed.

Reviewed-by: John Snow <jsnow@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery Dr. David Alan Gilbert (git)
@ 2014-12-10  6:14   ` John Snow
  2014-12-10 12:05     ` Dr. David Alan Gilbert
  2014-12-10 20:09     ` Dr. David Alan Gilbert
  2014-12-10 22:04   ` Paolo Bonzini
  2015-01-16 17:28   ` John Snow
  2 siblings, 2 replies; 15+ messages in thread
From: John Snow @ 2014-12-10  6:14 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git); +Cc: pbonzini, qemu-devel



On 12/09/2014 01:15 PM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> (With the previous atapi_dma flag recovery)
> If migration happens between the ATAPI command being written and the
> bmdma being started, the DMA is dropped.  Eventually the guest times
> out and recovers, but that can take many seconds.
> (This is rare, on a pingpong reading the CD continuously I hit
> this about ~1/30-1/50 migrates)
>
> I don't think we've got enough state to be able to recover safely
> at this point, so I throw a 'medium error, no seek complete'
> that I'm assuming guests will try and recover from an apparently
> dirty CD.
>
> OK, it's a hack, the real solution is probably to push a lot of
> ATAPI state into the migration stream, but this is a fix that
> works with no stream changes. Tested only on Linux (both RHEL5
> (pre-libata) and RHEL7).
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>   hw/ide/atapi.c    | 17 +++++++++++++++++
>   hw/ide/internal.h |  2 ++
>   hw/ide/pci.c      | 11 +++++++++++
>   3 files changed, 30 insertions(+)
>
> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
> index c63b7e5..e17799c 100644
> --- a/hw/ide/atapi.c
> +++ b/hw/ide/atapi.c
> @@ -394,6 +394,23 @@ static void ide_atapi_cmd_read(IDEState *s, int lba, int nb_sectors,
>       }
>   }
>
> +
> +/* Called by *_restart_bh when the transfer function points
> + * to ide_atapi_cmd
> + */
> +void ide_atapi_dma_restart(IDEState *s)
> +{
> +    /*
> +     * I'm not sure we have enough stored to restart the command
> +     * safely, so give the guest an error it should recover from.
> +     * I'm assuming most guests will try to recover from something
> +     * listed as a medium error on a CD; it seems to work on Linux.
> +     * This would be more of a problem if we did any other type of
> +     * DMA operation.
> +     */
> +    ide_atapi_cmd_error(s, MEDIUM_ERROR, ASC_NO_SEEK_COMPLETE);
> +}
> +

Is this safe for non-data commands? Can we even get there in such a case?

>   static inline uint8_t ide_atapi_set_profile(uint8_t *buf, uint8_t *index,
>                                               uint16_t profile)
>   {
> diff --git a/hw/ide/internal.h b/hw/ide/internal.h
> index 8a3eca4..8b65285 100644
> --- a/hw/ide/internal.h
> +++ b/hw/ide/internal.h
> @@ -289,6 +289,7 @@ typedef struct IDEDMAOps IDEDMAOps;
>   #define ATAPI_INT_REASON_TAG            0xf8
>
>   /* same constants as bochs */
> +#define ASC_NO_SEEK_COMPLETE                 0x02
>   #define ASC_ILLEGAL_OPCODE                   0x20
>   #define ASC_LOGICAL_BLOCK_OOR                0x21
>   #define ASC_INV_FIELD_IN_CMD_PACKET          0x24
> @@ -529,6 +530,7 @@ void ide_dma_error(IDEState *s);
>
>   void ide_atapi_cmd_ok(IDEState *s);
>   void ide_atapi_cmd_error(IDEState *s, int sense_key, int asc);
> +void ide_atapi_dma_restart(IDEState *s);
>   void ide_atapi_io_error(IDEState *s, int ret);
>
>   void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val);
> diff --git a/hw/ide/pci.c b/hw/ide/pci.c
> index bee5ad3..e3f2054 100644
> --- a/hw/ide/pci.c
> +++ b/hw/ide/pci.c
> @@ -235,6 +235,17 @@ static void bmdma_restart_bh(void *opaque)
>           }
>       } else if (error_status & IDE_RETRY_FLUSH) {
>           ide_flush_cache(bmdma_active_if(bm));
> +    } else {
> +        IDEState *s = bmdma_active_if(bm);
> +
> +        /*
> +         * We've not got any bits to tell us about ATAPI - but
> +         * we do have the end_transfer_func that tells us what
> +         * we're trying to do.
> +         */
> +        if (s->end_transfer_func == ide_atapi_cmd) {
> +            ide_atapi_dma_restart(s);
> +        }

OK, so when the restart routines get invoked we add a hook to see if we 
were in the middle of an ATAPI command and acknowledge that we don't 
know how to properly handle this.

Isn't this going to run on every vmstate change, though? I think we 
don't clear out end_transfer_func on success, so this might fire off 
more than we want it to, although I guess end_transfer_func is usually 
going to get set to ide_atapi_cmd_reply_end if it finishes normally ...

>       }
>   }
>
>

Indeed a hack, but it's probably appropriate: if our code cannot in fact 
handle ATAPI migration, throwing an error or disabling migration is the 
correct thing to do, but I don't think users would be very happy with 
the second option. I feel that this is an OK workaround because it 
should not introduce spurious errors or retries for cases where we 
manage to avoid migrating in the middle of the loop. This will at least 
let the currently broken case limp along until we fix it more properly.

What makes me the most curious is how this plays out in Windows if this 
case is triggered. Throw a trace around the fake error and see if you 
can't observe it getting called during a pingpong test while Windows 
reads a CD.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-10  6:14   ` John Snow
@ 2014-12-10 12:05     ` Dr. David Alan Gilbert
  2014-12-10 20:09     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2014-12-10 12:05 UTC (permalink / raw)
  To: John Snow; +Cc: pbonzini, qemu-devel

* John Snow (jsnow@redhat.com) wrote:
> 
> 
> On 12/09/2014 01:15 PM, Dr. David Alan Gilbert (git) wrote:
> >From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> >(With the previous atapi_dma flag recovery)
> >If migration happens between the ATAPI command being written and the
> >bmdma being started, the DMA is dropped.  Eventually the guest times
> >out and recovers, but that can take many seconds.
> >(This is rare, on a pingpong reading the CD continuously I hit
> >this about ~1/30-1/50 migrates)
> >
> >I don't think we've got enough state to be able to recover safely
> >at this point, so I throw a 'medium error, no seek complete'
> >that I'm assuming guests will try and recover from an apparently
> >dirty CD.
> >
> >OK, it's a hack, the real solution is probably to push a lot of
> >ATAPI state into the migration stream, but this is a fix that
> >works with no stream changes. Tested only on Linux (both RHEL5
> >(pre-libata) and RHEL7).
> >
> >Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> >---
> >  hw/ide/atapi.c    | 17 +++++++++++++++++
> >  hw/ide/internal.h |  2 ++
> >  hw/ide/pci.c      | 11 +++++++++++
> >  3 files changed, 30 insertions(+)
> >
> >diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
> >index c63b7e5..e17799c 100644
> >--- a/hw/ide/atapi.c
> >+++ b/hw/ide/atapi.c
> >@@ -394,6 +394,23 @@ static void ide_atapi_cmd_read(IDEState *s, int lba, int nb_sectors,
> >      }
> >  }
> >
> >+
> >+/* Called by *_restart_bh when the transfer function points
> >+ * to ide_atapi_cmd
> >+ */
> >+void ide_atapi_dma_restart(IDEState *s)
> >+{
> >+    /*
> >+     * I'm not sure we have enough stored to restart the command
> >+     * safely, so give the guest an error it should recover from.
> >+     * I'm assuming most guests will try to recover from something
> >+     * listed as a medium error on a CD; it seems to work on Linux.
> >+     * This would be more of a problem if we did any other type of
> >+     * DMA operation.
> >+     */
> >+    ide_atapi_cmd_error(s, MEDIUM_ERROR, ASC_NO_SEEK_COMPLETE);
> >+}
> >+
> 
> Is this safe for non-data commands? Can we even get there in such a case?

See below.

> >  static inline uint8_t ide_atapi_set_profile(uint8_t *buf, uint8_t *index,
> >                                              uint16_t profile)
> >  {
> >diff --git a/hw/ide/internal.h b/hw/ide/internal.h
> >index 8a3eca4..8b65285 100644
> >--- a/hw/ide/internal.h
> >+++ b/hw/ide/internal.h
> >@@ -289,6 +289,7 @@ typedef struct IDEDMAOps IDEDMAOps;
> >  #define ATAPI_INT_REASON_TAG            0xf8
> >
> >  /* same constants as bochs */
> >+#define ASC_NO_SEEK_COMPLETE                 0x02
> >  #define ASC_ILLEGAL_OPCODE                   0x20
> >  #define ASC_LOGICAL_BLOCK_OOR                0x21
> >  #define ASC_INV_FIELD_IN_CMD_PACKET          0x24
> >@@ -529,6 +530,7 @@ void ide_dma_error(IDEState *s);
> >
> >  void ide_atapi_cmd_ok(IDEState *s);
> >  void ide_atapi_cmd_error(IDEState *s, int sense_key, int asc);
> >+void ide_atapi_dma_restart(IDEState *s);
> >  void ide_atapi_io_error(IDEState *s, int ret);
> >
> >  void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val);
> >diff --git a/hw/ide/pci.c b/hw/ide/pci.c
> >index bee5ad3..e3f2054 100644
> >--- a/hw/ide/pci.c
> >+++ b/hw/ide/pci.c
> >@@ -235,6 +235,17 @@ static void bmdma_restart_bh(void *opaque)
> >          }
> >      } else if (error_status & IDE_RETRY_FLUSH) {
> >          ide_flush_cache(bmdma_active_if(bm));
> >+    } else {
> >+        IDEState *s = bmdma_active_if(bm);
> >+
> >+        /*
> >+         * We've not got any bits to tell us about ATAPI - but
> >+         * we do have the end_transfer_func that tells us what
> >+         * we're trying to do.
> >+         */
> >+        if (s->end_transfer_func == ide_atapi_cmd) {
> >+            ide_atapi_dma_restart(s);
> >+        }
> 
> OK, so when the restart routines get invoked we add a hook to see if we were
> in the middle of an ATAPI command and acknowledge that we don't know how to
> properly handle this.

As to your qeustion above about non-data commands; hmm probably - but how
do I guard it? I guess I could check for the atapi_dma flag the previous
patch fixed.
(This is all probably still broken for non-DMA atapi transfers)

> Isn't this going to run on every vmstate change, though?

There aren't many - only starting/stopping the CPU does it; and bmdma_restart_cb
guards it by 'if (!running)' exit, so it'll only do it when the CPU starts
running again.

> I think we don't
> clear out end_transfer_func on success, so this might fire off more than we
> want it to, although I guess end_transfer_func is usually going to get set
> to ide_atapi_cmd_reply_end if it finishes normally ...

Right, or if ide_transfer_stop is called.

> >      }
> >  }
> >
> >
> 
> Indeed a hack, but it's probably appropriate: if our code cannot in fact
> handle ATAPI migration, throwing an error or disabling migration is the
> correct thing to do, but I don't think users would be very happy with the
> second option. I feel that this is an OK workaround because it should not
> introduce spurious errors or retries for cases where we manage to avoid
> migrating in the middle of the loop. This will at least let the currently
> broken case limp along until we fix it more properly.
> 
> What makes me the most curious is how this plays out in Windows if this case
> is triggered. Throw a trace around the fake error and see if you can't
> observe it getting called during a pingpong test while Windows reads a CD.

Yeh, I'm going to figure out how to try that.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-10  6:14   ` John Snow
  2014-12-10 12:05     ` Dr. David Alan Gilbert
@ 2014-12-10 20:09     ` Dr. David Alan Gilbert
  1 sibling, 0 replies; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2014-12-10 20:09 UTC (permalink / raw)
  To: John Snow; +Cc: pbonzini, qemu-devel

* John Snow (jsnow@redhat.com) wrote:

> What makes me the most curious is how this plays out in Windows if this case
> is triggered. Throw a trace around the fake error and see if you can't
> observe it getting called during a pingpong test while Windows reads a CD.

I've not managed to trigger it on Windows (2012).
I've run InfraRecorder's track test with heavy migration and it
doesn't seem to trigger this case, or cause any extra errors.
(Curiously it does complain about not being able to read extended
error information and say our CDROM is giving about 35% C2 errors - 
even without migration; but I think this is just bogus junk it's
picking up, although feel free to polish the scratches out of
our virtual CD).

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery Dr. David Alan Gilbert (git)
  2014-12-10  6:14   ` John Snow
@ 2014-12-10 22:04   ` Paolo Bonzini
  2014-12-11 19:45     ` Dr. David Alan Gilbert
  2015-01-16 17:28   ` John Snow
  2 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2014-12-10 22:04 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel; +Cc: jsnow



On 09/12/2014 19:15, Dr. David Alan Gilbert (git) wrote:
> (With the previous atapi_dma flag recovery)
> If migration happens between the ATAPI command being written and the
> bmdma being started, the DMA is dropped.  Eventually the guest times
> out and recovers, but that can take many seconds.
> (This is rare, on a pingpong reading the CD continuously I hit
> this about ~1/30-1/50 migrates)
> 
> I don't think we've got enough state to be able to recover safely
> at this point, so I throw a 'medium error, no seek complete'
> that I'm assuming guests will try and recover from an apparently
> dirty CD.

What is the part of the state that is not being migrated?  Also, what is
the ATAPI command that is being run?

Paolo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-10 22:04   ` Paolo Bonzini
@ 2014-12-11 19:45     ` Dr. David Alan Gilbert
  2014-12-18 19:39       ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2014-12-11 19:45 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: jsnow, qemu-devel

* Paolo Bonzini (pbonzini@redhat.com) wrote:
> 
> 
> On 09/12/2014 19:15, Dr. David Alan Gilbert (git) wrote:
> > (With the previous atapi_dma flag recovery)
> > If migration happens between the ATAPI command being written and the
> > bmdma being started, the DMA is dropped.  Eventually the guest times
> > out and recovers, but that can take many seconds.
> > (This is rare, on a pingpong reading the CD continuously I hit
> > this about ~1/30-1/50 migrates)
> > 
> > I don't think we've got enough state to be able to recover safely
> > at this point, so I throw a 'medium error, no seek complete'
> > that I'm assuming guests will try and recover from an apparently
> > dirty CD.
> 

> Also, what is the ATAPI command that is being run?

The command in my test case is:
28 00 00 06 df b8 00 00 40 00 00 00
so just a standard READ (10); I guess it's possible that other OSs/operations
maybe doing different READ variants.

> What is the part of the state that is not being migrated?

I can't see lba, io_buffer_index, io_buffer_size or cd_sector_size being passed
to a VMSTATE_ macro; and having noticed that those are missing I didn't dig much
deeper.
(I'll admit to not really understanding the interaction between the core.c, atapi.c
and pci.c dma code properly to know if it can be reconstructed).

Dave

> Paolo
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-11 19:45     ` Dr. David Alan Gilbert
@ 2014-12-18 19:39       ` Dr. David Alan Gilbert
  2014-12-18 23:42         ` John Snow
  0 siblings, 1 reply; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2014-12-18 19:39 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: jsnow, qemu-devel

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> * Paolo Bonzini (pbonzini@redhat.com) wrote:
> > 
> > 
> > On 09/12/2014 19:15, Dr. David Alan Gilbert (git) wrote:
> > > (With the previous atapi_dma flag recovery)
> > > If migration happens between the ATAPI command being written and the
> > > bmdma being started, the DMA is dropped.  Eventually the guest times
> > > out and recovers, but that can take many seconds.
> > > (This is rare, on a pingpong reading the CD continuously I hit
> > > this about ~1/30-1/50 migrates)
> > > 
> > > I don't think we've got enough state to be able to recover safely
> > > at this point, so I throw a 'medium error, no seek complete'
> > > that I'm assuming guests will try and recover from an apparently
> > > dirty CD.
> > 
> 
> > Also, what is the ATAPI command that is being run?
> 
> The command in my test case is:
> 28 00 00 06 df b8 00 00 40 00 00 00
> so just a standard READ (10); I guess it's possible that other OSs/operations
> maybe doing different READ variants.
> 
> > What is the part of the state that is not being migrated?
> 
> I can't see lba, io_buffer_index, io_buffer_size or cd_sector_size being passed
> to a VMSTATE_ macro; and having noticed that those are missing I didn't dig much
> deeper.
> (I'll admit to not really understanding the interaction between the core.c, atapi.c
> and pci.c dma code properly to know if it can be reconstructed).

Actually, another argument for this is that even if we come along and add
the extra state as a subsection, we can still do this as a fallback when receiving
older streams.

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-18 19:39       ` Dr. David Alan Gilbert
@ 2014-12-18 23:42         ` John Snow
  0 siblings, 0 replies; 15+ messages in thread
From: John Snow @ 2014-12-18 23:42 UTC (permalink / raw)
  To: Dr. David Alan Gilbert, Paolo Bonzini; +Cc: qemu-devel



On 12/18/2014 02:39 PM, Dr. David Alan Gilbert wrote:
> * Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
>> * Paolo Bonzini (pbonzini@redhat.com) wrote:
>>>
>>>
>>> On 09/12/2014 19:15, Dr. David Alan Gilbert (git) wrote:
>>>> (With the previous atapi_dma flag recovery)
>>>> If migration happens between the ATAPI command being written and the
>>>> bmdma being started, the DMA is dropped.  Eventually the guest times
>>>> out and recovers, but that can take many seconds.
>>>> (This is rare, on a pingpong reading the CD continuously I hit
>>>> this about ~1/30-1/50 migrates)
>>>>
>>>> I don't think we've got enough state to be able to recover safely
>>>> at this point, so I throw a 'medium error, no seek complete'
>>>> that I'm assuming guests will try and recover from an apparently
>>>> dirty CD.
>>>
>>
>>> Also, what is the ATAPI command that is being run?
>>
>> The command in my test case is:
>> 28 00 00 06 df b8 00 00 40 00 00 00
>> so just a standard READ (10); I guess it's possible that other OSs/operations
>> maybe doing different READ variants.
>>
>>> What is the part of the state that is not being migrated?
>>
>> I can't see lba, io_buffer_index, io_buffer_size or cd_sector_size being passed
>> to a VMSTATE_ macro; and having noticed that those are missing I didn't dig much
>> deeper.
>> (I'll admit to not really understanding the interaction between the core.c, atapi.c
>> and pci.c dma code properly to know if it can be reconstructed).
>
> Actually, another argument for this is that even if we come along and add
> the extra state as a subsection, we can still do this as a fallback when receiving
> older streams.
>

Fair enough. I vote we give this hack a shot for now, then. I can always 
put this in my basket of things to look at ($later) after I have 
finished ($other_things).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack
  2014-12-09 18:15 [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack Dr. David Alan Gilbert (git)
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 1/2] Restore atapi_dma flag across migration Dr. David Alan Gilbert (git)
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery Dr. David Alan Gilbert (git)
@ 2015-01-07 16:26 ` Dr. David Alan Gilbert
  2015-01-30 16:07   ` Dr. David Alan Gilbert
  2 siblings, 1 reply; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2015-01-07 16:26 UTC (permalink / raw)
  To: kwolf, stefanha; +Cc: pbonzini, jsnow, qemu-devel


Oops, forgot to include Kevin and Stefan on cc for this.

Dave

* Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> 
> This pair of patches fixes a problem where IDE/ATAPI cdrom
> reads get lost/corrupted over migration.
> 
> The first of the patches (restore atapi_dma flag) is
> a simple fix that I think is safe; it no longer causes
> corruption in the case we saw, but does still trigger
> a long timeout.
> 
> The second is a hack; it throws a medium error causing
> the guest to retry the command in the case where migration
> happens just between the IDE/ATAPI command being submitted
> and the bmdma being finished.   This recovers a lot
> faster than the timeout.
> 
> Only tried on Linux guests so far; I think it might be possible
> to replace both of these by reparsing the command buffer for
> ATAPI; I'm just not confident I know when that's safe to do,
> and I wanted to see how disgusted people were by the 2nd hack.
> 
> Dave
> 
> Dr. David Alan Gilbert (2):
>   Restore atapi_dma flag across migration
>   atapi migration: Throw recoverable error to avoid recovery
> 
>  hw/ide/atapi.c    | 17 +++++++++++++++++
>  hw/ide/core.c     |  1 +
>  hw/ide/internal.h |  2 ++
>  hw/ide/pci.c      | 11 +++++++++++
>  4 files changed, 31 insertions(+)
> 
> -- 
> 2.1.0
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2014-12-09 18:15 ` [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery Dr. David Alan Gilbert (git)
  2014-12-10  6:14   ` John Snow
  2014-12-10 22:04   ` Paolo Bonzini
@ 2015-01-16 17:28   ` John Snow
  2015-02-02 12:11     ` Kevin Wolf
  2 siblings, 1 reply; 15+ messages in thread
From: John Snow @ 2015-01-16 17:28 UTC (permalink / raw)
  To: Dr. David Alan Gilbert (git), qemu-devel; +Cc: pbonzini



On 12/09/2014 01:15 PM, Dr. David Alan Gilbert (git) wrote:
> From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
>
> (With the previous atapi_dma flag recovery)
> If migration happens between the ATAPI command being written and the
> bmdma being started, the DMA is dropped.  Eventually the guest times
> out and recovers, but that can take many seconds.
> (This is rare, on a pingpong reading the CD continuously I hit
> this about ~1/30-1/50 migrates)
>
> I don't think we've got enough state to be able to recover safely
> at this point, so I throw a 'medium error, no seek complete'
> that I'm assuming guests will try and recover from an apparently
> dirty CD.
>
> OK, it's a hack, the real solution is probably to push a lot of
> ATAPI state into the migration stream, but this is a fix that
> works with no stream changes. Tested only on Linux (both RHEL5
> (pre-libata) and RHEL7).
>
> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
> ---
>   hw/ide/atapi.c    | 17 +++++++++++++++++
>   hw/ide/internal.h |  2 ++
>   hw/ide/pci.c      | 11 +++++++++++
>   3 files changed, 30 insertions(+)
>
> diff --git a/hw/ide/atapi.c b/hw/ide/atapi.c
> index c63b7e5..e17799c 100644
> --- a/hw/ide/atapi.c
> +++ b/hw/ide/atapi.c
> @@ -394,6 +394,23 @@ static void ide_atapi_cmd_read(IDEState *s, int lba, int nb_sectors,
>       }
>   }
>
> +
> +/* Called by *_restart_bh when the transfer function points
> + * to ide_atapi_cmd
> + */
> +void ide_atapi_dma_restart(IDEState *s)
> +{
> +    /*
> +     * I'm not sure we have enough stored to restart the command
> +     * safely, so give the guest an error it should recover from.
> +     * I'm assuming most guests will try to recover from something
> +     * listed as a medium error on a CD; it seems to work on Linux.
> +     * This would be more of a problem if we did any other type of
> +     * DMA operation.
> +     */
> +    ide_atapi_cmd_error(s, MEDIUM_ERROR, ASC_NO_SEEK_COMPLETE);
> +}
> +
>   static inline uint8_t ide_atapi_set_profile(uint8_t *buf, uint8_t *index,
>                                               uint16_t profile)
>   {
> diff --git a/hw/ide/internal.h b/hw/ide/internal.h
> index 8a3eca4..8b65285 100644
> --- a/hw/ide/internal.h
> +++ b/hw/ide/internal.h
> @@ -289,6 +289,7 @@ typedef struct IDEDMAOps IDEDMAOps;
>   #define ATAPI_INT_REASON_TAG            0xf8
>
>   /* same constants as bochs */
> +#define ASC_NO_SEEK_COMPLETE                 0x02
>   #define ASC_ILLEGAL_OPCODE                   0x20
>   #define ASC_LOGICAL_BLOCK_OOR                0x21
>   #define ASC_INV_FIELD_IN_CMD_PACKET          0x24
> @@ -529,6 +530,7 @@ void ide_dma_error(IDEState *s);
>
>   void ide_atapi_cmd_ok(IDEState *s);
>   void ide_atapi_cmd_error(IDEState *s, int sense_key, int asc);
> +void ide_atapi_dma_restart(IDEState *s);
>   void ide_atapi_io_error(IDEState *s, int ret);
>
>   void ide_ioport_write(void *opaque, uint32_t addr, uint32_t val);
> diff --git a/hw/ide/pci.c b/hw/ide/pci.c
> index bee5ad3..e3f2054 100644
> --- a/hw/ide/pci.c
> +++ b/hw/ide/pci.c
> @@ -235,6 +235,17 @@ static void bmdma_restart_bh(void *opaque)
>           }
>       } else if (error_status & IDE_RETRY_FLUSH) {
>           ide_flush_cache(bmdma_active_if(bm));
> +    } else {
> +        IDEState *s = bmdma_active_if(bm);
> +
> +        /*
> +         * We've not got any bits to tell us about ATAPI - but
> +         * we do have the end_transfer_func that tells us what
> +         * we're trying to do.
> +         */
> +        if (s->end_transfer_func == ide_atapi_cmd) {
> +            ide_atapi_dma_restart(s);
> +        }
>       }
>   }
>
>

I think this workaround is sensible until we develop a more precise fix.

Reviewed-by: John Snow <jsnow@redhat.com>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack
  2015-01-07 16:26 ` [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack Dr. David Alan Gilbert
@ 2015-01-30 16:07   ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 15+ messages in thread
From: Dr. David Alan Gilbert @ 2015-01-30 16:07 UTC (permalink / raw)
  To: kwolf, stefanha, qemu-devel; +Cc: pbonzini, jsnow

* Dr. David Alan Gilbert (dgilbert@redhat.com) wrote:
> 
> Oops, forgot to include Kevin and Stefan on cc for this.

Ping;
John R-b'd the two patches:
  https://lists.gnu.org/archive/html/qemu-devel/2014-12/msg01133.html
  http://lists.gnu.org/archive/html/qemu-devel/2015-01/msg01956.html

although hmm, possibly before I added both of you on
the cc.

Dave

> 
> Dave
> 
> * Dr. David Alan Gilbert (git) (dgilbert@redhat.com) wrote:
> > From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> > 
> > This pair of patches fixes a problem where IDE/ATAPI cdrom
> > reads get lost/corrupted over migration.
> > 
> > The first of the patches (restore atapi_dma flag) is
> > a simple fix that I think is safe; it no longer causes
> > corruption in the case we saw, but does still trigger
> > a long timeout.
> > 
> > The second is a hack; it throws a medium error causing
> > the guest to retry the command in the case where migration
> > happens just between the IDE/ATAPI command being submitted
> > and the bmdma being finished.   This recovers a lot
> > faster than the timeout.
> > 
> > Only tried on Linux guests so far; I think it might be possible
> > to replace both of these by reparsing the command buffer for
> > ATAPI; I'm just not confident I know when that's safe to do,
> > and I wanted to see how disgusted people were by the 2nd hack.
> > 
> > Dave
> > 
> > Dr. David Alan Gilbert (2):
> >   Restore atapi_dma flag across migration
> >   atapi migration: Throw recoverable error to avoid recovery
> > 
> >  hw/ide/atapi.c    | 17 +++++++++++++++++
> >  hw/ide/core.c     |  1 +
> >  hw/ide/internal.h |  2 ++
> >  hw/ide/pci.c      | 11 +++++++++++
> >  4 files changed, 31 insertions(+)
> > 
> > -- 
> > 2.1.0
> > 
> > 
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery
  2015-01-16 17:28   ` John Snow
@ 2015-02-02 12:11     ` Kevin Wolf
  0 siblings, 0 replies; 15+ messages in thread
From: Kevin Wolf @ 2015-02-02 12:11 UTC (permalink / raw)
  To: John Snow; +Cc: pbonzini, Dr. David Alan Gilbert (git), qemu-devel

Am 16.01.2015 um 18:28 hat John Snow geschrieben:
> 
> 
> On 12/09/2014 01:15 PM, Dr. David Alan Gilbert (git) wrote:
> >From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
> >
> >(With the previous atapi_dma flag recovery)
> >If migration happens between the ATAPI command being written and the
> >bmdma being started, the DMA is dropped.  Eventually the guest times
> >out and recovers, but that can take many seconds.
> >(This is rare, on a pingpong reading the CD continuously I hit
> >this about ~1/30-1/50 migrates)
> >
> >I don't think we've got enough state to be able to recover safely
> >at this point, so I throw a 'medium error, no seek complete'
> >that I'm assuming guests will try and recover from an apparently
> >dirty CD.
> >
> >OK, it's a hack, the real solution is probably to push a lot of
> >ATAPI state into the migration stream, but this is a fix that
> >works with no stream changes. Tested only on Linux (both RHEL5
> >(pre-libata) and RHEL7).
> >
> >Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

> I think this workaround is sensible until we develop a more precise fix.
> 
> Reviewed-by: John Snow <jsnow@redhat.com>

Thanks, applied all to the block branch.

Kevin

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-02-02 12:12 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-09 18:15 [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack Dr. David Alan Gilbert (git)
2014-12-09 18:15 ` [Qemu-devel] [PATCH 1/2] Restore atapi_dma flag across migration Dr. David Alan Gilbert (git)
2014-12-10  5:04   ` John Snow
2014-12-09 18:15 ` [Qemu-devel] [PATCH 2/2] atapi migration: Throw recoverable error to avoid recovery Dr. David Alan Gilbert (git)
2014-12-10  6:14   ` John Snow
2014-12-10 12:05     ` Dr. David Alan Gilbert
2014-12-10 20:09     ` Dr. David Alan Gilbert
2014-12-10 22:04   ` Paolo Bonzini
2014-12-11 19:45     ` Dr. David Alan Gilbert
2014-12-18 19:39       ` Dr. David Alan Gilbert
2014-12-18 23:42         ` John Snow
2015-01-16 17:28   ` John Snow
2015-02-02 12:11     ` Kevin Wolf
2015-01-07 16:26 ` [Qemu-devel] [PATCH 0/2] ATAPI migration fix/hack Dr. David Alan Gilbert
2015-01-30 16:07   ` Dr. David Alan Gilbert

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.