On 10/03/15 16:59, Paul E. McKenney wrote: > On Tue, Mar 10, 2015 at 03:30:42PM +0000, James Hogan wrote: >> Hi Paul, >> >> On 03/03/15 17:42, Paul E. McKenney wrote: >>> From: "Paul E. McKenney" >>> >>> This commit removes the open-coded CPU-offline notification with new >>> common code. This change avoids calling scheduler code using RCU from >>> an offline CPU that RCU is ignoring. This commit is compatible with >>> the existing code in not checking for timeout during a prior offline >>> for a given CPU. >>> >>> Signed-off-by: Paul E. McKenney >>> Cc: James Hogan >>> Cc: >> >> I gave this a try via linux-next, but unfortunately it causes the >> following warning every time a CPU goes down: >> META213-Thread0 DSP [LogF] CPU1: unable to kill > > That is certainly not what I had in mind, thank you for finding this! > >> If I add printks, I see that the state on entry to both cpu_wait_death >> and cpu_report_death is already CPU_POST_DEAD, suggesting that it hasn't >> changed from its initial value. >> >> Should arches other than x86 now be calling cpu_set_state_online()? The >> patchlet below seems to resolve it for Meta (not sure if that is the >> best place in the startup sequence to do it, perhaps it doesn't matter). >> >> diff --git a/arch/metag/kernel/smp.c b/arch/metag/kernel/smp.c >> index ac3a199e33e7..430e379ec71f 100644 >> --- a/arch/metag/kernel/smp.c >> +++ b/arch/metag/kernel/smp.c >> @@ -383,6 +383,7 @@ asmlinkage void secondary_start_kernel(void) >> * OK, now it's safe to let the boot CPU continue >> */ >> set_cpu_online(cpu, true); >> + cpu_set_state_online(cpu); >> complete(&cpu_running); >> >> /* >> >> Looking at the comment before cpu_set_state_online: >>> /* >>> * Mark the specified CPU online. >>> * >>> * Note that it is permissible to omit this call entirely, as is >>> * done in architectures that do no CPU-hotplug error checking. >>> */ >> >> Which suggests it wasn't wrong to omit it before your patches came >> along. > > And that suggestion is quite correct. The idea was indeed to accommodate > architectures that do not do error checking. > > Does the following patch (on top of current -next) remove the need for > your addition of cpu_set_state_online() above? Don't forget the "oldstate == ", otherwise it'll work for the wrong reason :-/ Checking for CPU_POST_DEAD does seem to fix the immediate problem, however this still leaves open the possibility of a single timeout propagating to all further offlines after CPU_DEAD_FROZEN gets set. I've confirmed that by adding a delay loop only on the second cpu_report_death() call, and sure enough the 2nd and further offlines all fail even though the CPU stops immediately after the 2nd one. If this check is primarily so that CPU_DEAD_FROZEN is set if cpu_wait_death timed out, would it be better to instead check explicitly for CPU_BROKEN? diff --git a/kernel/smpboot.c b/kernel/smpboot.c index 18688e0b0422..c697f73d82d6 100644 --- a/kernel/smpboot.c +++ b/kernel/smpboot.c @@ -460,7 +460,7 @@ bool cpu_report_death(void) do { oldstate = atomic_read(&per_cpu(cpu_hotplug_state, cpu)); - if (oldstate == CPU_ONLINE) + if (oldstate != CPU_BROKEN) newstate = CPU_DEAD; else newstate = CPU_DEAD_FROZEN; Cheers James > > Thanx, Paul > > ------------------------------------------------------------------------ > > diff --git a/kernel/smpboot.c b/kernel/smpboot.c > index 18688e0b0422..80400e019c86 100644 > --- a/kernel/smpboot.c > +++ b/kernel/smpboot.c > @@ -460,7 +460,7 @@ bool cpu_report_death(void) > > do { > oldstate = atomic_read(&per_cpu(cpu_hotplug_state, cpu)); > - if (oldstate == CPU_ONLINE) > + if (oldstate == CPU_ONLINE || CPU_POST_DEAD) > newstate = CPU_DEAD; > else > newstate = CPU_DEAD_FROZEN; >