From: George Dunlap <dunlapg@umich.edu>
To: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
Cc: "Jan Beulich" <jbeulich@suse.com>,
	"Juergen Gross" <jgross@suse.de>,
	"Dario Faggioli" <dfaggioli@suse.com>,
	"Andrew Cooper" <andrew.cooper3@citrix.com>,
	xen-devel <xen-devel@lists.xenproject.org>,
	"Jürgen Groß" <jgross@suse.com>
Subject: Re: [Xen-devel] Xen crash after S3 suspend - Xen 4.13 and newer
Date: Fri, 14 Oct 2022 17:42:40 +0100
Message-ID: <CAFLBxZZUknp80CQx9rhQhD=hzqV+eOR5Me=eyU1sJUHMaMUybg@mail.gmail.com>
In-Reply-To: <YymUZCfLZRWl6xr5@mail-itl>

On Tue, Sep 20, 2022 at 11:23 AM Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> wrote:

>
> I have two (non-exclusive) ideas here:
> 1. If old_cpu is actually still available, do not move it at all.
> 2. Use sched_migrate() instead of sched_set_res().
>

Other possibilities:

3.  Make sure that svc->rqd is set to NULL when the affinity is broken.

Currently, on vcpu creation, sched_init_vcpu() expects to set the pcpu; and
it looks like for credit2, svc->rqd may not be set until the first time
the unit is woken up (that's the 'if' part of the 'if/else' clause whose
'else' contains the ASSERT() you're hitting).  If we set the runqueues to
NULL when we break the CPU affinity on suspend, then on wake the unit
would "take" the runqueue assigned by restore_vcpu_affinity().
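
For reference, the wake path in question looks roughly like this --
paraphrased from memory, so helper names and signatures may not match the
current tree exactly:

/* Rough shape of csched2_unit_wake() in xen/common/sched/credit2.c;
 * paraphrased, not the exact code. */
static void csched2_unit_wake(const struct scheduler *ops,
                              struct sched_unit *unit)
{
    struct csched2_unit *svc = csched2_unit(unit);

    /* ... early-exit and delayed-runq-add handling elided ... */

    /* Add into the new runqueue if necessary */
    if ( svc->rqd == NULL )
        runq_assign(ops, unit);  /* First wakeup: bind to this pcpu's runqueue */
    else
        /* Otherwise the recorded runqueue must match the unit's current
         * pcpu -- this is the ASSERT() that fires after resume. */
        ASSERT(c2rqd(ops, sched_unit_master(unit)) == svc->rqd);

    /* ... runq_insert() / runq_tickle() ... */
}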

4. Make csched2_unit_wake() tolerant of pcpus changing under its feet.

#3 would potentially make things more robust, but would require adding some
sort of callback to notify schedulers that affinity had been broken.  ATM
this might only be used by credit2.
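
Concretely, something like the following -- the hook name and the call site
are entirely hypothetical, but runq_deassign() already exists in credit2 to
do the heavy lifting, if I remember the helpers right:

/* HYPOTHETICAL: a new per-scheduler hook, invoked from the suspend /
 * cpu-offline path once a unit's affinity has been broken. */
static void csched2_unit_affinity_broken(const struct scheduler *ops,
                                         struct sched_unit *unit)
{
    struct csched2_unit *svc = csched2_unit(unit);

    /*
     * Drop the stale runqueue binding.  The next csched2_unit_wake()
     * then takes the "svc->rqd == NULL" path above and binds to the
     * runqueue implied by whatever resource restore_vcpu_affinity()
     * assigned.
     */
    if ( svc->rqd != NULL )
        runq_deassign(ops, unit);
}

Schedulers that don't cache a runqueue binding could simply leave the hook
NULL.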

#4 would potentially be dangerous: some other bit of credit2 code may
assume that svc->rqd is still valid.


> Here is the patch that fixes it for me:
> ---8<---
> diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
> index 83455fbde1c8..dcf202d8b307 100644
> --- a/xen/common/sched/core.c
> +++ b/xen/common/sched/core.c
> @@ -1246,19 +1246,29 @@ void restore_vcpu_affinity(struct domain *d)
>              }
>          }
>
> -        res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
> +        /* Prefer old cpu if available. */
> +        if ( cpumask_test_cpu(old_cpu, cpumask_scratch_cpu(cpu)) )
> +            res = get_sched_res(old_cpu);
> +        else
> +            res = get_sched_res(cpumask_any(cpumask_scratch_cpu(cpu)));
>          sched_set_res(unit, res);
>
>          spin_unlock_irq(lock);
>
> -        /* v->processor might have changed, so reacquire the lock. */
> -        lock = unit_schedule_lock_irq(unit);
> -        res = sched_pick_resource(unit_scheduler(unit), unit);
> -        sched_set_res(unit, res);
> -        spin_unlock_irq(lock);
> -
> +        /*
> +         * If different cpu was chosen, it was random, let scheduler do
> +         * proper decision.
> +         */
>          if ( old_cpu != sched_unit_master(unit) )
> +        {
> +            /* v->processor might have changed, so reacquire the lock. */
> +            lock = unit_schedule_lock_irq(unit);
> +            res = sched_pick_resource(unit_scheduler(unit), unit);
> +            sched_migrate(unit_scheduler(unit), unit, res->master_cpu);
> +            spin_unlock_irq(lock);
> +
>              sched_move_irqs(unit);
> +        }
>      }
>
>      rcu_read_unlock(&sched_res_rculock);
> ---8<---
>
> I have several doubts here:
>
> 1. If old_cpu is available, is sched_set_res() needed at all?
> 2. Should both calls be changed to sched_migrate()? Currently I changed
>    only the second one, in case the scheduler could be confused about
>    old_cpu not being available anymore.
> 3. Are there any extra locking requirements for sched_migrate() at this
>    stage? The long comment above sched_unit_migrate_start() suggests
>    there might be, but I'm not sure if that's really the case during
>    resume.
> 4. Related to the above - should thaw_domains() be modified to call
>    restore_vcpu_affinity() for all domains first, and unpause only
>    later? That could reduce locking requirements, I guess.
>

Unfortunately this code has had a lot of churn since the last time I really
engaged with it; I'm going to have to come back to this on Monday.

Jürgen / Dario, any thoughts?

 -George

