All of lore.kernel.org
 help / color / mirror / Atom feed
* spapr_events: Sure we may ignore migrate_add_blocker() failure?
@ 2021-07-15 13:32 Markus Armbruster
  2021-07-19  2:31 ` David Gibson
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Armbruster @ 2021-07-15 13:32 UTC (permalink / raw)
  To: Aravinda Prasad, Ganesh Goudar, David Gibson, qemu-devel

Commit 2500fb423a "migration: Include migration support for machine
check handling" adds this:

    ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
    if (ret == -EBUSY) {
        /*
         * We don't want to abort so we let the migration to continue.
         * In a rare case, the machine check handler will run on the target.
         * Though this is not preferable, it is better than aborting
         * the migration or killing the VM.
         */
        warn_report("Received a fwnmi while migration was in progress");
    }

migrate_add_blocker() can fail in two ways:

1. -EBUSY: migration is already in progress

   Ignoring this one is clearly intentional.  The comment explains why.
   I'm taking it at face value (I'm a spapr ignoramus).  Aside: I doubt
   the warning is going to help users.

2. -EACCES: we're running with -only-migratable

   Why may we ignore -only-migratable here?

By the way, we leak @local_err on failure.  I'll post a patch, but I'd
like my question answered first.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?
  2021-07-15 13:32 spapr_events: Sure we may ignore migrate_add_blocker() failure? Markus Armbruster
@ 2021-07-19  2:31 ` David Gibson
  2021-07-19  7:18   ` Markus Armbruster
  0 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2021-07-19  2:31 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Aravinda Prasad, Ganesh Goudar, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2668 bytes --]

On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
> Commit 2500fb423a "migration: Include migration support for machine
> check handling" adds this:
> 
>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
>     if (ret == -EBUSY) {
>         /*
>          * We don't want to abort so we let the migration to continue.
>          * In a rare case, the machine check handler will run on the target.
>          * Though this is not preferable, it is better than aborting
>          * the migration or killing the VM.
>          */
>         warn_report("Received a fwnmi while migration was in progress");
>     }
> 
> migrate_add_blocker() can fail in two ways:
> 
> 1. -EBUSY: migration is already in progress
> 
>    Ignoring this one is clearly intentional.  The comment explains why.
>    I'm taking it at face value (I'm a spapr ignoramus).

Right.  The argument isn't really about papr particularly, except
insofar as understanding what fwnmi is.  fwnmi (FirmWare assisted NMI)
is a reporting mechanism for certain low-level hardware failures
(think memory ECC or cpu level faults, IIRC).  If we migrate between
detecting and reporting the error, then the particulars we report will
be mostly meaningless since they relate to hardware we're no longer
running on.  Hence the migration blocker.

However, migrating away from a (non-fatal) fwnmi error is a pretty
reasonable response, so we don't want to actually fail a migration if
its already in progress.

>    Aside: I doubt
>    the warning is going to help users.

You're probably right, but it's not very clear how to do better.  It
might possibly help someone in tech support explain why the reported
fwnmi doesn't seem to match the hardware the guest is (now) running
on.

> 2. -EACCES: we're running with -only-migratable
> 
>    Why may we ignore -only-migratable here?

Short answer: because I didn't think about that case.  Long answer:
I think we probably shoud ignore it anyway.  As above, receiving a
fwnmi doesn't really prevent migration, it just means that if you're
unlucky it can report stale information.  Since migrating away from a
possibly-dubious host would be a reasonable response to a non-fatal
fwnmi, I don't think we want to simply prohibit fwnmi entirely with
-only-migratable.

> By the way, we leak @local_err on failure.  I'll post a patch, but I'd
> like my question answered first.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?
  2021-07-19  2:31 ` David Gibson
@ 2021-07-19  7:18   ` Markus Armbruster
  2021-07-19  7:20     ` David Gibson
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Armbruster @ 2021-07-19  7:18 UTC (permalink / raw)
  To: David Gibson; +Cc: Aravinda Prasad, Ganesh Goudar, qemu-devel

David Gibson <david@gibson.dropbear.id.au> writes:

> On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
>> Commit 2500fb423a "migration: Include migration support for machine
>> check handling" adds this:
>> 
>>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
>>     if (ret == -EBUSY) {
>>         /*
>>          * We don't want to abort so we let the migration to continue.
>>          * In a rare case, the machine check handler will run on the target.
>>          * Though this is not preferable, it is better than aborting
>>          * the migration or killing the VM.
>>          */
>>         warn_report("Received a fwnmi while migration was in progress");
>>     }
>> 
>> migrate_add_blocker() can fail in two ways:
>> 
>> 1. -EBUSY: migration is already in progress
>> 
>>    Ignoring this one is clearly intentional.  The comment explains why.
>>    I'm taking it at face value (I'm a spapr ignoramus).
>
> Right.  The argument isn't really about papr particularly, except
> insofar as understanding what fwnmi is.  fwnmi (FirmWare assisted NMI)
> is a reporting mechanism for certain low-level hardware failures
> (think memory ECC or cpu level faults, IIRC).  If we migrate between
> detecting and reporting the error, then the particulars we report will
> be mostly meaningless since they relate to hardware we're no longer
> running on.  Hence the migration blocker.
>
> However, migrating away from a (non-fatal) fwnmi error is a pretty
> reasonable response, so we don't want to actually fail a migration if
> its already in progress.
>
>>    Aside: I doubt
>>    the warning is going to help users.
>
> You're probably right, but it's not very clear how to do better.  It
> might possibly help someone in tech support explain why the reported
> fwnmi doesn't seem to match the hardware the guest is (now) running
> on.

Perhaps pointing to the actual problem could help: the FWNMI's
information is mostly meaningless.

>> 2. -EACCES: we're running with -only-migratable
>> 
>>    Why may we ignore -only-migratable here?
>
> Short answer: because I didn't think about that case.  Long answer:
> I think we probably shoud ignore it anyway.  As above, receiving a
> fwnmi doesn't really prevent migration, it just means that if you're
> unlucky it can report stale information.  Since migrating away from a
> possibly-dubious host would be a reasonable response to a non-fatal
> fwnmi, I don't think we want to simply prohibit fwnmi entirely with
> -only-migratable.

I think the comment text and placement could be improved to make clear
ignoring this failure is intentional, too.  How do you like the
following?

diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
index a8f2cc6bdc..54d8e856d3 100644
--- a/hw/ppc/spapr_events.c
+++ b/hw/ppc/spapr_events.c
@@ -911,16 +911,14 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
         }
     }
 
+    /*
+     * Try to block migration while FWNMI is being handled, so the
+     * machine check handler runs where the information passed to it
+     * actually makes sense.  This won't actually block migration,
+     * only delay it slightly.  If the attempt fails, carry on.
+     */
     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, NULL);
     if (ret == -EBUSY) {
-        /*
-         * We don't want to abort so we let the migration to continue.
-         * In a rare case, the machine check handler will run on the target.
-         * Though this is not preferable, it is better than aborting
-         * the migration or killing the VM. It is okay to call
-         * migrate_del_blocker on a blocker that was not added (which the
-         * nmi-interlock handler would do when it's called after this).
-         */
         warn_report("Received a fwnmi while migration was in progress");
     }
 
>> By the way, we leak @local_err on failure.  I'll post a patch, but I'd
>> like my question answered first.



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?
  2021-07-19  7:18   ` Markus Armbruster
@ 2021-07-19  7:20     ` David Gibson
  2021-07-19 10:41       ` Markus Armbruster
  0 siblings, 1 reply; 14+ messages in thread
From: David Gibson @ 2021-07-19  7:20 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Aravinda Prasad, Ganesh Goudar, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 4459 bytes --]

On Mon, Jul 19, 2021 at 09:18:07AM +0200, Markus Armbruster wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
> >> Commit 2500fb423a "migration: Include migration support for machine
> >> check handling" adds this:
> >> 
> >>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
> >>     if (ret == -EBUSY) {
> >>         /*
> >>          * We don't want to abort so we let the migration to continue.
> >>          * In a rare case, the machine check handler will run on the target.
> >>          * Though this is not preferable, it is better than aborting
> >>          * the migration or killing the VM.
> >>          */
> >>         warn_report("Received a fwnmi while migration was in progress");
> >>     }
> >> 
> >> migrate_add_blocker() can fail in two ways:
> >> 
> >> 1. -EBUSY: migration is already in progress
> >> 
> >>    Ignoring this one is clearly intentional.  The comment explains why.
> >>    I'm taking it at face value (I'm a spapr ignoramus).
> >
> > Right.  The argument isn't really about papr particularly, except
> > insofar as understanding what fwnmi is.  fwnmi (FirmWare assisted NMI)
> > is a reporting mechanism for certain low-level hardware failures
> > (think memory ECC or cpu level faults, IIRC).  If we migrate between
> > detecting and reporting the error, then the particulars we report will
> > be mostly meaningless since they relate to hardware we're no longer
> > running on.  Hence the migration blocker.
> >
> > However, migrating away from a (non-fatal) fwnmi error is a pretty
> > reasonable response, so we don't want to actually fail a migration if
> > its already in progress.
> >
> >>    Aside: I doubt
> >>    the warning is going to help users.
> >
> > You're probably right, but it's not very clear how to do better.  It
> > might possibly help someone in tech support explain why the reported
> > fwnmi doesn't seem to match the hardware the guest is (now) running
> > on.
> 
> Perhaps pointing to the actual problem could help: the FWNMI's
> information is mostly meaningless.

Sorry, I don't follow what you're suggesting.

> 
> >> 2. -EACCES: we're running with -only-migratable
> >> 
> >>    Why may we ignore -only-migratable here?
> >
> > Short answer: because I didn't think about that case.  Long answer:
> > I think we probably shoud ignore it anyway.  As above, receiving a
> > fwnmi doesn't really prevent migration, it just means that if you're
> > unlucky it can report stale information.  Since migrating away from a
> > possibly-dubious host would be a reasonable response to a non-fatal
> > fwnmi, I don't think we want to simply prohibit fwnmi entirely with
> > -only-migratable.
> 
> I think the comment text and placement could be improved to make clear
> ignoring this failure is intentional, too.  How do you like the
> following?

That's fair..

> 
> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
> index a8f2cc6bdc..54d8e856d3 100644
> --- a/hw/ppc/spapr_events.c
> +++ b/hw/ppc/spapr_events.c
> @@ -911,16 +911,14 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
>          }
>      }
>  
> +    /*
> +     * Try to block migration while FWNMI is being handled, so the
> +     * machine check handler runs where the information passed to it
> +     * actually makes sense.  This won't actually block migration,
> +     * only delay it slightly.  If the attempt fails, carry on.
> +     */
>      ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, NULL);
>      if (ret == -EBUSY) {
> -        /*
> -         * We don't want to abort so we let the migration to continue.
> -         * In a rare case, the machine check handler will run on the target.
> -         * Though this is not preferable, it is better than aborting
> -         * the migration or killing the VM. It is okay to call
> -         * migrate_del_blocker on a blocker that was not added (which the
> -         * nmi-interlock handler would do when it's called after this).
> -         */
>          warn_report("Received a fwnmi while migration was in progress");
>      }

LGTM.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?
  2021-07-19  7:20     ` David Gibson
@ 2021-07-19 10:41       ` Markus Armbruster
  2021-07-19 11:00         ` -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?) Markus Armbruster
  2021-07-21  6:26         ` spapr_events: Sure we may ignore migrate_add_blocker() failure? David Gibson
  0 siblings, 2 replies; 14+ messages in thread
From: Markus Armbruster @ 2021-07-19 10:41 UTC (permalink / raw)
  To: David Gibson; +Cc: Aravinda Prasad, Ganesh Goudar, qemu-devel

David Gibson <david@gibson.dropbear.id.au> writes:

> On Mon, Jul 19, 2021 at 09:18:07AM +0200, Markus Armbruster wrote:
>> David Gibson <david@gibson.dropbear.id.au> writes:
>> 
>> > On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
>> >> Commit 2500fb423a "migration: Include migration support for machine
>> >> check handling" adds this:
>> >> 
>> >>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
>> >>     if (ret == -EBUSY) {
>> >>         /*
>> >>          * We don't want to abort so we let the migration to continue.
>> >>          * In a rare case, the machine check handler will run on the target.
>> >>          * Though this is not preferable, it is better than aborting
>> >>          * the migration or killing the VM.
>> >>          */
>> >>         warn_report("Received a fwnmi while migration was in progress");
>> >>     }
>> >> 
>> >> migrate_add_blocker() can fail in two ways:
>> >> 
>> >> 1. -EBUSY: migration is already in progress
>> >> 
>> >>    Ignoring this one is clearly intentional.  The comment explains why.
>> >>    I'm taking it at face value (I'm a spapr ignoramus).
>> >
>> > Right.  The argument isn't really about papr particularly, except
>> > insofar as understanding what fwnmi is.  fwnmi (FirmWare assisted NMI)
>> > is a reporting mechanism for certain low-level hardware failures
>> > (think memory ECC or cpu level faults, IIRC).  If we migrate between
>> > detecting and reporting the error, then the particulars we report will
>> > be mostly meaningless since they relate to hardware we're no longer
>> > running on.  Hence the migration blocker.
>> >
>> > However, migrating away from a (non-fatal) fwnmi error is a pretty
>> > reasonable response, so we don't want to actually fail a migration if
>> > its already in progress.
>> >
>> >>    Aside: I doubt
>> >>    the warning is going to help users.
>> >
>> > You're probably right, but it's not very clear how to do better.  It
>> > might possibly help someone in tech support explain why the reported
>> > fwnmi doesn't seem to match the hardware the guest is (now) running
>> > on.
>> 
>> Perhaps pointing to the actual problem could help: the FWNMI's
>> information is mostly meaningless.
>
> Sorry, I don't follow what you're suggesting.

We warn

    warning: Received a fwnmi while migration was in progress

when we fail to block migration because it's already in progress.
But what does this mean?  Perhaps warn like this:

    warning: FWNMI while migration is in progress
    The guest's report for this may be less than useful.

My phrasing may well be off, but I hope you get the idea.

Note that we keep quiet when we fail to block migration due to
-only-migrate.  I agree with that.  The failure makes a difference only
when migration gets triggered in a narrow time window, which should be
quite rare.  Would be nice to warn when migration does get triggered in
that time window, though.  Not sure it's worth the trouble, in
particular if we'd have to create infrastructure first.

>
>> 
>> >> 2. -EACCES: we're running with -only-migratable
>> >> 
>> >>    Why may we ignore -only-migratable here?
>> >
>> > Short answer: because I didn't think about that case.  Long answer:
>> > I think we probably shoud ignore it anyway.  As above, receiving a
>> > fwnmi doesn't really prevent migration, it just means that if you're
>> > unlucky it can report stale information.  Since migrating away from a
>> > possibly-dubious host would be a reasonable response to a non-fatal
>> > fwnmi, I don't think we want to simply prohibit fwnmi entirely with
>> > -only-migratable.
>> 
>> I think the comment text and placement could be improved to make clear
>> ignoring this failure is intentional, too.  How do you like the
>> following?
>
> That's fair..
>
>> 
>> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
>> index a8f2cc6bdc..54d8e856d3 100644
>> --- a/hw/ppc/spapr_events.c
>> +++ b/hw/ppc/spapr_events.c
>> @@ -911,16 +911,14 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
>>          }
>>      }
>>  
>> +    /*
>> +     * Try to block migration while FWNMI is being handled, so the
>> +     * machine check handler runs where the information passed to it
>> +     * actually makes sense.  This won't actually block migration,
>> +     * only delay it slightly.  If the attempt fails, carry on.
>> +     */
>>      ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, NULL);
>>      if (ret == -EBUSY) {
>> -        /*
>> -         * We don't want to abort so we let the migration to continue.
>> -         * In a rare case, the machine check handler will run on the target.
>> -         * Though this is not preferable, it is better than aborting
>> -         * the migration or killing the VM. It is okay to call
>> -         * migrate_del_blocker on a blocker that was not added (which the
>> -         * nmi-interlock handler would do when it's called after this).
>> -         */
>>          warn_report("Received a fwnmi while migration was in progress");
>>      }
>
> LGTM.

Thanks, I'll post this.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?)
  2021-07-19 10:41       ` Markus Armbruster
@ 2021-07-19 11:00         ` Markus Armbruster
  2021-07-19 12:42           ` Dr. David Alan Gilbert
  2021-11-02 14:30           ` Juan Quintela
  2021-07-21  6:26         ` spapr_events: Sure we may ignore migrate_add_blocker() failure? David Gibson
  1 sibling, 2 replies; 14+ messages in thread
From: Markus Armbruster @ 2021-07-19 11:00 UTC (permalink / raw)
  To: qemu-devel; +Cc: Juan Quintela, Dr. David Alan Gilbert, David Gibson

We appear to use migration blockers in two ways:

(1) Prevent migration for an indefinite time, typically due to use of
some feature that isn't compatible with migration.

(2) Delay migration for a short time.

Option -only-migrate is designed for (1).  It interferes with (2).

Example for (1): device "x-pci-proxy-dev" doesn't support migration.  It
adds a migration blocker on realize, and deletes it on unrealize.  With
-only-migrate, device realize fails.  Works as designed.

Example for (2): spapr_mce_req_event() makes an effort to prevent
migration degrate the reporting of FWNMIs.  It adds a migration blocker
when it receives one, and deletes it when it's done handling it.  This
is a best effort; if migration is already in progress by the time FWNMI
is received, we simply carry on, and that's okay.  However, option
-only-migrate sabotages the best effort entirely.

While this isn't exactly terrible, it may be a weakness in our thinking
and our infrastructure.  I'm bringing it up so the people in charge are
aware :)



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?)
  2021-07-19 11:00         ` -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?) Markus Armbruster
@ 2021-07-19 12:42           ` Dr. David Alan Gilbert
  2021-07-20  5:30             ` -only-migrate and the two different uses of migration blockers Markus Armbruster
  2021-11-02 14:30           ` Juan Quintela
  1 sibling, 1 reply; 14+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-19 12:42 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Juan Quintela, qemu-devel, David Gibson

* Markus Armbruster (armbru@redhat.com) wrote:
> We appear to use migration blockers in two ways:
> 
> (1) Prevent migration for an indefinite time, typically due to use of
> some feature that isn't compatible with migration.
> 
> (2) Delay migration for a short time.
> 
> Option -only-migrate is designed for (1).  It interferes with (2).
> 
> Example for (1): device "x-pci-proxy-dev" doesn't support migration.  It
> adds a migration blocker on realize, and deletes it on unrealize.  With
> -only-migrate, device realize fails.  Works as designed.
> 
> Example for (2): spapr_mce_req_event() makes an effort to prevent
> migration degrate the reporting of FWNMIs.  It adds a migration blocker
> when it receives one, and deletes it when it's done handling it.  This
> is a best effort; if migration is already in progress by the time FWNMI
> is received, we simply carry on, and that's okay.  However, option
> -only-migrate sabotages the best effort entirely.

That's interesting; it's the first time I've heard of anyone using it as
'best effort'.  I've always regarded blockers as blocking.

> While this isn't exactly terrible, it may be a weakness in our thinking
> and our infrastructure.  I'm bringing it up so the people in charge are
> aware :)

Thanks.

It almost feels like they need a way to temporarily hold off
'completion' of migratio - i.e. the phase where we stop the CPU and
write the device data;  mind you you'd also probably want it to stop
cold-migrates/snapshots?

Dave

-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: -only-migrate and the two different uses of migration blockers
  2021-07-19 12:42           ` Dr. David Alan Gilbert
@ 2021-07-20  5:30             ` Markus Armbruster
  2021-07-21  6:32               ` David Gibson
  0 siblings, 1 reply; 14+ messages in thread
From: Markus Armbruster @ 2021-07-20  5:30 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, qemu-devel, David Gibson

"Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:

> * Markus Armbruster (armbru@redhat.com) wrote:
>> We appear to use migration blockers in two ways:
>> 
>> (1) Prevent migration for an indefinite time, typically due to use of
>> some feature that isn't compatible with migration.
>> 
>> (2) Delay migration for a short time.
>> 
>> Option -only-migrate is designed for (1).  It interferes with (2).
>> 
>> Example for (1): device "x-pci-proxy-dev" doesn't support migration.  It
>> adds a migration blocker on realize, and deletes it on unrealize.  With
>> -only-migrate, device realize fails.  Works as designed.
>> 
>> Example for (2): spapr_mce_req_event() makes an effort to prevent
>> migration degrate the reporting of FWNMIs.  It adds a migration blocker
>> when it receives one, and deletes it when it's done handling it.  This
>> is a best effort; if migration is already in progress by the time FWNMI
>> is received, we simply carry on, and that's okay.  However, option
>> -only-migrate sabotages the best effort entirely.
>
> That's interesting; it's the first time I've heard of anyone using it as
> 'best effort'.  I've always regarded blockers as blocking.

Me too, until I found this one.

>> While this isn't exactly terrible, it may be a weakness in our thinking
>> and our infrastructure.  I'm bringing it up so the people in charge are
>> aware :)
>
> Thanks.
>
> It almost feels like they need a way to temporarily hold off
> 'completion' of migratio - i.e. the phase where we stop the CPU and
> write the device data;  mind you you'd also probably want it to stop
> cold-migrates/snapshots?

Yes, a proper way to delay 'completion' for a bit would be clearer, and
wouldn't let -only-migrate interfere.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: spapr_events: Sure we may ignore migrate_add_blocker() failure?
  2021-07-19 10:41       ` Markus Armbruster
  2021-07-19 11:00         ` -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?) Markus Armbruster
@ 2021-07-21  6:26         ` David Gibson
  1 sibling, 0 replies; 14+ messages in thread
From: David Gibson @ 2021-07-21  6:26 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Aravinda Prasad, Ganesh Goudar, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 5829 bytes --]

On Mon, Jul 19, 2021 at 12:41:09PM +0200, Markus Armbruster wrote:
> David Gibson <david@gibson.dropbear.id.au> writes:
> 
> > On Mon, Jul 19, 2021 at 09:18:07AM +0200, Markus Armbruster wrote:
> >> David Gibson <david@gibson.dropbear.id.au> writes:
> >> 
> >> > On Thu, Jul 15, 2021 at 03:32:06PM +0200, Markus Armbruster wrote:
> >> >> Commit 2500fb423a "migration: Include migration support for machine
> >> >> check handling" adds this:
> >> >> 
> >> >>     ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, &local_err);
> >> >>     if (ret == -EBUSY) {
> >> >>         /*
> >> >>          * We don't want to abort so we let the migration to continue.
> >> >>          * In a rare case, the machine check handler will run on the target.
> >> >>          * Though this is not preferable, it is better than aborting
> >> >>          * the migration or killing the VM.
> >> >>          */
> >> >>         warn_report("Received a fwnmi while migration was in progress");
> >> >>     }
> >> >> 
> >> >> migrate_add_blocker() can fail in two ways:
> >> >> 
> >> >> 1. -EBUSY: migration is already in progress
> >> >> 
> >> >>    Ignoring this one is clearly intentional.  The comment explains why.
> >> >>    I'm taking it at face value (I'm a spapr ignoramus).
> >> >
> >> > Right.  The argument isn't really about papr particularly, except
> >> > insofar as understanding what fwnmi is.  fwnmi (FirmWare assisted NMI)
> >> > is a reporting mechanism for certain low-level hardware failures
> >> > (think memory ECC or cpu level faults, IIRC).  If we migrate between
> >> > detecting and reporting the error, then the particulars we report will
> >> > be mostly meaningless since they relate to hardware we're no longer
> >> > running on.  Hence the migration blocker.
> >> >
> >> > However, migrating away from a (non-fatal) fwnmi error is a pretty
> >> > reasonable response, so we don't want to actually fail a migration if
> >> > its already in progress.
> >> >
> >> >>    Aside: I doubt
> >> >>    the warning is going to help users.
> >> >
> >> > You're probably right, but it's not very clear how to do better.  It
> >> > might possibly help someone in tech support explain why the reported
> >> > fwnmi doesn't seem to match the hardware the guest is (now) running
> >> > on.
> >> 
> >> Perhaps pointing to the actual problem could help: the FWNMI's
> >> information is mostly meaningless.
> >
> > Sorry, I don't follow what you're suggesting.
> 
> We warn
> 
>     warning: Received a fwnmi while migration was in progress
> 
> when we fail to block migration because it's already in progress.
> But what does this mean?  Perhaps warn like this:
> 
>     warning: FWNMI while migration is in progress
>     The guest's report for this may be less than useful.
> 
> My phrasing may well be off, but I hope you get the idea.

I see your point.  It may be some time before this reaches the top of
priority list, however.

> Note that we keep quiet when we fail to block migration due to
> -only-migrate.  I agree with that.  The failure makes a difference only
> when migration gets triggered in a narrow time window, which should be
> quite rare.  Would be nice to warn when migration does get triggered in
> that time window, though.  Not sure it's worth the trouble, in
> particular if we'd have to create infrastructure first.
> 
> >
> >> 
> >> >> 2. -EACCES: we're running with -only-migratable
> >> >> 
> >> >>    Why may we ignore -only-migratable here?
> >> >
> >> > Short answer: because I didn't think about that case.  Long answer:
> >> > I think we probably shoud ignore it anyway.  As above, receiving a
> >> > fwnmi doesn't really prevent migration, it just means that if you're
> >> > unlucky it can report stale information.  Since migrating away from a
> >> > possibly-dubious host would be a reasonable response to a non-fatal
> >> > fwnmi, I don't think we want to simply prohibit fwnmi entirely with
> >> > -only-migratable.
> >> 
> >> I think the comment text and placement could be improved to make clear
> >> ignoring this failure is intentional, too.  How do you like the
> >> following?
> >
> > That's fair..
> >
> >> 
> >> diff --git a/hw/ppc/spapr_events.c b/hw/ppc/spapr_events.c
> >> index a8f2cc6bdc..54d8e856d3 100644
> >> --- a/hw/ppc/spapr_events.c
> >> +++ b/hw/ppc/spapr_events.c
> >> @@ -911,16 +911,14 @@ void spapr_mce_req_event(PowerPCCPU *cpu, bool recovered)
> >>          }
> >>      }
> >>  
> >> +    /*
> >> +     * Try to block migration while FWNMI is being handled, so the
> >> +     * machine check handler runs where the information passed to it
> >> +     * actually makes sense.  This won't actually block migration,
> >> +     * only delay it slightly.  If the attempt fails, carry on.
> >> +     */
> >>      ret = migrate_add_blocker(spapr->fwnmi_migration_blocker, NULL);
> >>      if (ret == -EBUSY) {
> >> -        /*
> >> -         * We don't want to abort so we let the migration to continue.
> >> -         * In a rare case, the machine check handler will run on the target.
> >> -         * Though this is not preferable, it is better than aborting
> >> -         * the migration or killing the VM. It is okay to call
> >> -         * migrate_del_blocker on a blocker that was not added (which the
> >> -         * nmi-interlock handler would do when it's called after this).
> >> -         */
> >>          warn_report("Received a fwnmi while migration was in progress");
> >>      }
> >
> > LGTM.
> 
> Thanks, I'll post this.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: -only-migrate and the two different uses of migration blockers
  2021-07-20  5:30             ` -only-migrate and the two different uses of migration blockers Markus Armbruster
@ 2021-07-21  6:32               ` David Gibson
  2021-07-22 18:00                 ` Dr. David Alan Gilbert
  2021-11-02 14:32                 ` Juan Quintela
  0 siblings, 2 replies; 14+ messages in thread
From: David Gibson @ 2021-07-21  6:32 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Juan Quintela, Dr. David Alan Gilbert, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2661 bytes --]

On Tue, Jul 20, 2021 at 07:30:16AM +0200, Markus Armbruster wrote:
> "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> 
> > * Markus Armbruster (armbru@redhat.com) wrote:
> >> We appear to use migration blockers in two ways:
> >> 
> >> (1) Prevent migration for an indefinite time, typically due to use of
> >> some feature that isn't compatible with migration.
> >> 
> >> (2) Delay migration for a short time.
> >> 
> >> Option -only-migrate is designed for (1).  It interferes with (2).
> >> 
> >> Example for (1): device "x-pci-proxy-dev" doesn't support migration.  It
> >> adds a migration blocker on realize, and deletes it on unrealize.  With
> >> -only-migrate, device realize fails.  Works as designed.
> >> 
> >> Example for (2): spapr_mce_req_event() makes an effort to prevent
> >> migration degrate the reporting of FWNMIs.  It adds a migration blocker
> >> when it receives one, and deletes it when it's done handling it.  This
> >> is a best effort; if migration is already in progress by the time FWNMI
> >> is received, we simply carry on, and that's okay.  However, option
> >> -only-migrate sabotages the best effort entirely.
> >
> > That's interesting; it's the first time I've heard of anyone using it as
> > 'best effort'.  I've always regarded blockers as blocking.
> 
> Me too, until I found this one.

Right, it may well have been the first usage this way, this fwnmi
stuff isn't super old.

> >> While this isn't exactly terrible, it may be a weakness in our thinking
> >> and our infrastructure.  I'm bringing it up so the people in charge are
> >> aware :)
> >
> > Thanks.
> >
> > It almost feels like they need a way to temporarily hold off
> > 'completion' of migratio - i.e. the phase where we stop the CPU and
> > write the device data;  mind you you'd also probably want it to stop
> > cold-migrates/snapshots?
> 
> Yes, a proper way to delay 'completion' for a bit would be clearer, and
> wouldn't let -only-migrate interfere.

Right.  If that becomes a thing, we should use it here.  Note that
this one use case probably isn't a very strong argument for it,
though.  The only problem here is slightly less that optimal error
reporting in a rare edge case (hardware fault occurs by chance at the
same time as a migration).


.... and, also, I half-suspect that the whole fwnmi feature exists
more to tick IBM RAS check boxes than because anyone will actually use
it.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: -only-migrate and the two different uses of migration blockers
  2021-07-21  6:32               ` David Gibson
@ 2021-07-22 18:00                 ` Dr. David Alan Gilbert
  2021-07-25  6:25                   ` David Gibson
  2021-11-02 14:32                 ` Juan Quintela
  1 sibling, 1 reply; 14+ messages in thread
From: Dr. David Alan Gilbert @ 2021-07-22 18:00 UTC (permalink / raw)
  To: David Gibson; +Cc: Juan Quintela, Markus Armbruster, qemu-devel

* David Gibson (david@gibson.dropbear.id.au) wrote:
> On Tue, Jul 20, 2021 at 07:30:16AM +0200, Markus Armbruster wrote:
> > "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> > 
> > > * Markus Armbruster (armbru@redhat.com) wrote:
> > >> We appear to use migration blockers in two ways:
> > >> 
> > >> (1) Prevent migration for an indefinite time, typically due to use of
> > >> some feature that isn't compatible with migration.
> > >> 
> > >> (2) Delay migration for a short time.
> > >> 
> > >> Option -only-migrate is designed for (1).  It interferes with (2).
> > >> 
> > >> Example for (1): device "x-pci-proxy-dev" doesn't support migration.  It
> > >> adds a migration blocker on realize, and deletes it on unrealize.  With
> > >> -only-migrate, device realize fails.  Works as designed.
> > >> 
> > >> Example for (2): spapr_mce_req_event() makes an effort to prevent
> > >> migration degrate the reporting of FWNMIs.  It adds a migration blocker
> > >> when it receives one, and deletes it when it's done handling it.  This
> > >> is a best effort; if migration is already in progress by the time FWNMI
> > >> is received, we simply carry on, and that's okay.  However, option
> > >> -only-migrate sabotages the best effort entirely.
> > >
> > > That's interesting; it's the first time I've heard of anyone using it as
> > > 'best effort'.  I've always regarded blockers as blocking.
> > 
> > Me too, until I found this one.
> 
> Right, it may well have been the first usage this way, this fwnmi
> stuff isn't super old.
> 
> > >> While this isn't exactly terrible, it may be a weakness in our thinking
> > >> and our infrastructure.  I'm bringing it up so the people in charge are
> > >> aware :)
> > >
> > > Thanks.
> > >
> > > It almost feels like they need a way to temporarily hold off
> > > 'completion' of migratio - i.e. the phase where we stop the CPU and
> > > write the device data;  mind you you'd also probably want it to stop
> > > cold-migrates/snapshots?
> > 
> > Yes, a proper way to delay 'completion' for a bit would be clearer, and
> > wouldn't let -only-migrate interfere.
> 
> Right.  If that becomes a thing, we should use it here.  Note that
> this one use case probably isn't a very strong argument for it,
> though.  The only problem here is slightly less that optimal error
> reporting in a rare edge case (hardware fault occurs by chance at the
> same time as a migration).

Can you at least put a scary comment in to say why it's so odd.

If you wanted a choice of a different bad way to do this, since you have
savevm_htab_handlers, you might be able to make htab_save_iterate claim
there's always more to do.

> 
> .... and, also, I half-suspect that the whole fwnmi feature exists
> more to tick IBM RAS check boxes than because anyone will actually use
> it.

Ah at least it's always reliable....

Dave

> -- 
> David Gibson			| I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
> 				| _way_ _around_!
> http://www.ozlabs.org/~dgibson


-- 
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: -only-migrate and the two different uses of migration blockers
  2021-07-22 18:00                 ` Dr. David Alan Gilbert
@ 2021-07-25  6:25                   ` David Gibson
  0 siblings, 0 replies; 14+ messages in thread
From: David Gibson @ 2021-07-25  6:25 UTC (permalink / raw)
  To: Dr. David Alan Gilbert; +Cc: Juan Quintela, Markus Armbruster, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 3430 bytes --]

On Thu, Jul 22, 2021 at 07:00:56PM +0100, Dr. David Alan Gilbert wrote:
> * David Gibson (david@gibson.dropbear.id.au) wrote:
> > On Tue, Jul 20, 2021 at 07:30:16AM +0200, Markus Armbruster wrote:
> > > "Dr. David Alan Gilbert" <dgilbert@redhat.com> writes:
> > > 
> > > > * Markus Armbruster (armbru@redhat.com) wrote:
> > > >> We appear to use migration blockers in two ways:
> > > >> 
> > > >> (1) Prevent migration for an indefinite time, typically due to use of
> > > >> some feature that isn't compatible with migration.
> > > >> 
> > > >> (2) Delay migration for a short time.
> > > >> 
> > > >> Option -only-migrate is designed for (1).  It interferes with (2).
> > > >> 
> > > >> Example for (1): device "x-pci-proxy-dev" doesn't support migration.  It
> > > >> adds a migration blocker on realize, and deletes it on unrealize.  With
> > > >> -only-migrate, device realize fails.  Works as designed.
> > > >> 
> > > >> Example for (2): spapr_mce_req_event() makes an effort to prevent
> > > >> migration degrate the reporting of FWNMIs.  It adds a migration blocker
> > > >> when it receives one, and deletes it when it's done handling it.  This
> > > >> is a best effort; if migration is already in progress by the time FWNMI
> > > >> is received, we simply carry on, and that's okay.  However, option
> > > >> -only-migrate sabotages the best effort entirely.
> > > >
> > > > That's interesting; it's the first time I've heard of anyone using it as
> > > > 'best effort'.  I've always regarded blockers as blocking.
> > > 
> > > Me too, until I found this one.
> > 
> > Right, it may well have been the first usage this way, this fwnmi
> > stuff isn't super old.
> > 
> > > >> While this isn't exactly terrible, it may be a weakness in our thinking
> > > >> and our infrastructure.  I'm bringing it up so the people in charge are
> > > >> aware :)
> > > >
> > > > Thanks.
> > > >
> > > > It almost feels like they need a way to temporarily hold off
> > > > 'completion' of migratio - i.e. the phase where we stop the CPU and
> > > > write the device data;  mind you you'd also probably want it to stop
> > > > cold-migrates/snapshots?
> > > 
> > > Yes, a proper way to delay 'completion' for a bit would be clearer, and
> > > wouldn't let -only-migrate interfere.
> > 
> > Right.  If that becomes a thing, we should use it here.  Note that
> > this one use case probably isn't a very strong argument for it,
> > though.  The only problem here is slightly less that optimal error
> > reporting in a rare edge case (hardware fault occurs by chance at the
> > same time as a migration).
> 
> Can you at least put a scary comment in to say why it's so odd.
> 
> If you wanted a choice of a different bad way to do this, since you have
> savevm_htab_handlers, you might be able to make htab_save_iterate claim
> there's always more to do.

That would only work if the hash MMU is in use, which won't be the
case with most current systems.

> > .... and, also, I half-suspect that the whole fwnmi feature exists
> > more to tick IBM RAS check boxes than because anyone will actually use
> > it.
> 
> Ah at least it's always reliable....
> 
> Dave
> 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: -only-migrate and the two different uses of migration blockers
  2021-07-19 11:00         ` -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?) Markus Armbruster
  2021-07-19 12:42           ` Dr. David Alan Gilbert
@ 2021-11-02 14:30           ` Juan Quintela
  1 sibling, 0 replies; 14+ messages in thread
From: Juan Quintela @ 2021-11-02 14:30 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: qemu-devel, Dr. David Alan Gilbert, David Gibson

Markus Armbruster <armbru@redhat.com> wrote:
> We appear to use migration blockers in two ways:
>
> (1) Prevent migration for an indefinite time, typically due to use of
> some feature that isn't compatible with migration.
>
> (2) Delay migration for a short time.
>
> Option -only-migrate is designed for (1).  It interferes with (2).
>
> Example for (1): device "x-pci-proxy-dev" doesn't support migration.  It
> adds a migration blocker on realize, and deletes it on unrealize.  With
> -only-migrate, device realize fails.  Works as designed.
>
> Example for (2): spapr_mce_req_event() makes an effort to prevent
> migration degrate the reporting of FWNMIs.  It adds a migration blocker
> when it receives one, and deletes it when it's done handling it.  This
> is a best effort; if migration is already in progress by the time FWNMI
> is received, we simply carry on, and that's okay.  However, option
> -only-migrate sabotages the best effort entirely.
>
> While this isn't exactly terrible, it may be a weakness in our thinking
> and our infrastructure.  I'm bringing it up so the people in charge are
> aware :)

Hi

On the past, we have talked about this (but done nothing).

What we "thought" was to change save_complete() to just return the
equivalent of -EAGAIN, i.e. right now it is not a good time for doing a
migration, wait a little while and try again.

There is no code for that.

Fixing this will also help with latency issues.  When we move to the
complation stage, we have the equivalent of a sync in the block layer.
that can take a long time, but we don't have a way to timeout and get
back to normal migration and try it a bit later.

Later, Juan.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: -only-migrate and the two different uses of migration blockers
  2021-07-21  6:32               ` David Gibson
  2021-07-22 18:00                 ` Dr. David Alan Gilbert
@ 2021-11-02 14:32                 ` Juan Quintela
  1 sibling, 0 replies; 14+ messages in thread
From: Juan Quintela @ 2021-11-02 14:32 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-devel, Markus Armbruster, Dr. David Alan Gilbert

David Gibson <david@gibson.dropbear.id.au> wrote:
> On Tue, Jul 20, 2021 at 07:30:16AM +0200, Markus Armbruster wrote:
>
> Right, it may well have been the first usage this way, this fwnmi
> stuff isn't super old.
>
>> >> While this isn't exactly terrible, it may be a weakness in our thinking
>> >> and our infrastructure.  I'm bringing it up so the people in charge are
>> >> aware :)
>> >
>> > Thanks.
>> >
>> > It almost feels like they need a way to temporarily hold off
>> > 'completion' of migratio - i.e. the phase where we stop the CPU and
>> > write the device data;  mind you you'd also probably want it to stop
>> > cold-migrates/snapshots?
>> 
>> Yes, a proper way to delay 'completion' for a bit would be clearer, and
>> wouldn't let -only-migrate interfere.
>
> Right.  If that becomes a thing, we should use it here.  Note that
> this one use case probably isn't a very strong argument for it,
> though.  The only problem here is slightly less that optimal error
> reporting in a rare edge case (hardware fault occurs by chance at the
> same time as a migration).
>
>
> .... and, also, I half-suspect that the whole fwnmi feature exists
> more to tick IBM RAS check boxes than because anyone will actually use
> it.

Right now the problem is when we broke migration stream.  Sometime there
is a field that is only needed on rare ocassions (othrewise we would
have found it right away).  But only thing that we can do now is abort
the migration.  If we were able to say, try it a little later, we could
fix that kind of trouble.

That is more or less what you have here.

Later, Juan.



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-11-02 14:53 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-15 13:32 spapr_events: Sure we may ignore migrate_add_blocker() failure? Markus Armbruster
2021-07-19  2:31 ` David Gibson
2021-07-19  7:18   ` Markus Armbruster
2021-07-19  7:20     ` David Gibson
2021-07-19 10:41       ` Markus Armbruster
2021-07-19 11:00         ` -only-migrate and the two different uses of migration blockers (was: spapr_events: Sure we may ignore migrate_add_blocker() failure?) Markus Armbruster
2021-07-19 12:42           ` Dr. David Alan Gilbert
2021-07-20  5:30             ` -only-migrate and the two different uses of migration blockers Markus Armbruster
2021-07-21  6:32               ` David Gibson
2021-07-22 18:00                 ` Dr. David Alan Gilbert
2021-07-25  6:25                   ` David Gibson
2021-11-02 14:32                 ` Juan Quintela
2021-11-02 14:30           ` Juan Quintela
2021-07-21  6:26         ` spapr_events: Sure we may ignore migrate_add_blocker() failure? David Gibson

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.