On Thu, 2016-08-11 at 16:51 +0100, Andrew Cooper wrote:
> On 11/08/16 15:59, Dario Faggioli wrote:
> >
> > Which, I think needs at least this hunk (from 6b53bb4ab3c9 "sched:
> > better handle (not) inserting idle vCPUs in runqueues"):
> >
> > diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> > index 2beebe8..fddcd52 100644
> > --- a/xen/common/schedule.c
> > +++ b/xen/common/schedule.c
> > @@ -240,20 +240,22 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor)
> >      init_timer(&v->poll_timer, poll_timer_fn,
> >                 v, v->processor);
> >
> > -    /* Idle VCPUs are scheduled immediately. */
> > +    v->sched_priv = SCHED_OP(DOM2OP(d), alloc_vdata, v, d->sched_priv);
> > +    if ( v->sched_priv == NULL )
> > +        return 1;
> > +
> > +    TRACE_2D(TRC_SCHED_DOM_ADD, v->domain->domain_id, v->vcpu_id);
> > +
> > +    /* Idle VCPUs are scheduled immediately, so don't put them in runqueue. */
> >      if ( is_idle_domain(d) )
> >      {
> >          per_cpu(schedule_data, v->processor).curr = v;
> >          v->is_running = 1;
> >      }
> > -
> > -    TRACE_2D(TRC_SCHED_DOM_ADD, v->domain->domain_id, v->vcpu_id);
> > -
> > -    v->sched_priv = SCHED_OP(DOM2OP(d), alloc_vdata, v, d->sched_priv);
> > -    if ( v->sched_priv == NULL )
> > -        return 1;
> > -
> > -    SCHED_OP(DOM2OP(d), insert_vcpu, v);
> > +    else
> > +    {
> > +        SCHED_OP(DOM2OP(d), insert_vcpu, v);
> > +    }
> >
> >      return 0;
> >  }
> >
> > So, yeah, it's proving a little more complicated than how I thought it
> > would have, just by looking at the patches. :-/
> >
> > Will let know.
> FWIW, this looks very similar to the regression I just raised against
> Xen 4.7 "[Xen-devel] Scheduler regression in 4.7".  The stack traces are
> suspiciously similar.
>
I thought the same at the beginning, but they actually may not be the
same or even related.
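
To make the effect of the quoted hunk concrete, here is a minimal
standalone sketch of the control flow it leaves at the end of
sched_init_vcpu(). It is only a toy model: struct toy_vcpu, its fields,
toy_insert_vcpu(), toy_sched_init_vcpu() and toy_insert_calls are all
hypothetical names made up for illustration, not Xen code, and the
allocation and tracing steps are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct vcpu; the fields are hypothetical, not Xen's. */
struct toy_vcpu {
    bool idle;         /* models is_idle_domain(v->domain) */
    bool is_running;   /* models v->is_running */
    bool in_runqueue;  /* set once the toy insert hook has run */
};

/* Counts calls to the toy insert hook, for illustration only. */
static int toy_insert_calls;

/* Models SCHED_OP(DOM2OP(d), insert_vcpu, v): idle vCPUs must not get here. */
static void toy_insert_vcpu(struct toy_vcpu *v)
{
    assert(!v->idle);
    v->in_runqueue = true;
    toy_insert_calls++;
}

/* Models the tail of sched_init_vcpu() with the hunk applied. */
static int toy_sched_init_vcpu(struct toy_vcpu *v)
{
    if (v->idle)
        v->is_running = true;   /* scheduled immediately, no runqueue */
    else
        toy_insert_vcpu(v);     /* only non-idle vCPUs are inserted */
    return 0;
}
```

In this model, initialising an idle toy vCPU marks it as running and
never reaches the insert hook, while a non-idle one is inserted exactly
once. Dropping the else branch would make the assert in
toy_insert_vcpu() fire for the idle case, which is only loosely
analogous to the real failure and is not meant to reproduce it.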

This happens at early boot, and the reason is that we try to call
csched_cpu_pick() on the idle vcpus, which does not make any sense, and
in fact one of the ASSERT()s triggers.

In your case, the system has already booted fine. And the reason for
that is you're looking at 4.7, and 4.7 is no longer calling
insert_vcpu(), which then calls csched_cpu_pick(), on idle vcpus at
boot, thanks to the patch I'm mentioning above.

And in fact, I confirm that, on 4.6, with "just" the hunk above of said
patch, I can boot, create a (large) VM, play a bit with it, shut it down
or reboot it, and shut down the host as well.

Also, yours seems to _explode_ because of a race on the runq (in
IS_RUNQ_IDLE()), whereas this one _asserts_ here:

        /* Pick an online CPU from the proper affinity mask */
        csched_balance_cpumask(vc, balance_step, &cpus);
        cpumask_and(&cpus, &cpus, online);

        /* If present, prefer vc's current processor */
        cpu = cpumask_test_cpu(vc->processor, &cpus)
                ? vc->processor
                : cpumask_cycle(vc->processor, &cpus);
        ASSERT(cpumask_test_cpu(cpu, &cpus));

Because, as I said, we're at early boot and, most likely, there's
almost no one in online!

> I expect they have the same root cause.
>
No, I think they're two different things.

Regards,
Dario
-- 
<> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)