From: "Rafael J. Wysocki" <rjw@sisk.pl>
To: Colin Cross <ccross@android.com>
Cc: linux-pm@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org, Kevin Hilman <khilman@ti.com>,
	Len Brown <len.brown@intel.com>,
	Russell King <linux@arm.linux.org.uk>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	Kay Sievers <kay.sievers@vrfy.org>,
	Amit Kucheria <amit.kucheria@linaro.org>,
	Arjan van de Ven <arjan@linux.intel.com>,
	Arnd Bergmann <arnd.bergmann@linaro.org>,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [linux-pm] [PATCHv3 3/5] cpuidle: add support for states that affect multiple cpus
Date: Sat, 5 May 2012 00:27:54 +0200
Message-ID: <201205050027.54452.rjw@sisk.pl>
In-Reply-To: <CAMbhsRRSGypR-N23vXbHSKLc_TPo6PTjQ2-PFicknCqX8tYq9Q@mail.gmail.com>

On Friday, May 04, 2012, Colin Cross wrote:
> On Fri, May 4, 2012 at 4:51 AM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > On Friday, May 04, 2012, Colin Cross wrote:
> >> On Thu, May 3, 2012 at 3:14 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote:
> > [...]
> >>
> >> >> +/**
> >> >> + * cpuidle_coupled_cpus_waiting - check if all cpus in a coupled set are waiting
> >> >> + * @coupled: the struct coupled that contains the current cpu
> >> >> + *
> >> >> + * Returns true if all cpus coupled to this target state are in the wait loop
> >> >> + */
> >> >> +static inline bool cpuidle_coupled_cpus_waiting(struct cpuidle_coupled *coupled)
> >> >> +{
> >> >> +     int alive;
> >> >> +     int waiting;
> >> >> +
> >> >> +     /*
> >> >> +      * Read alive before reading waiting so a booting cpu is not treated as
> >> >> +      * idle
> >> >> +      */
> >> >
> >> > Well, the comment doesn't really explain much.  In particular, why the boot CPU
> >> > could be treated as idle if the reads were in a different order.
> >>
> >> Hm, I think the race condition is on a cpu going down.  What about:
> >> Read alive before reading waiting.  If waiting is read before alive,
> >> this cpu could see another cpu as waiting just before it goes offline,
> >> between when the other cpu decrements waiting and when it
> >> decrements alive, which could cause alive == waiting when one cpu is
> >> not waiting.
> >
> > Reading them in this particular order doesn't stop the race, though.  I mean,
> > if the hotplug happens just right after you've read alive_count, you still have
> > a wrong value.  waiting_count is set independently, it seems, so there's no
> > ordering between the two on the "store" side and the "load" side ordering
> > doesn't matter.
> 
> As commented in the hotplug path, hotplug relies on the fact that one
> of the cpus in the cluster is involved in the hotplug of the cpu that
> is changing (this may not be true for multiple clusters, but it is
> easy to fix by IPI-ing to a cpu that is in the same cluster when that
> happens).

That's very fragile and potentially sets a trap for people trying to make
the kernel work on systems with multiple clusters.

> That means that waiting count is always guaranteed to be at
> least 1 less than alive count when alive count changes.  All this read
> ordering needs to do is make sure that this cpu doesn't see
> waiting_count == alive_count by reading them in the wrong order.

So, the concern seems to be that if the local CPU reorders the reads
from waiting_count and alive_count and enough time elapses between one
read and the other, the decrementation of waiting_count may happen
between them and then the CPU may use the outdated value for comparison,
right?

Still, though, even if the barrier is there, the modification of
alive_count in the hotplug notifier routine may not happen before
the read from alive_count in cpuidle_coupled_cpus_waiting() is completed.
Isn't that a problem?

> > I would just make the CPU hotplug notifier routine block until
> > cpuidle_enter_state_coupled() is done and the latter return immediately
> > if the CPU hotplug notifier routine is in progress, perhaps falling back
> > to the safe state.  Or I would make the CPU hotplug notifier routine
> > disable the "coupled cpuidle" entirely on DOWN_PREPARE and UP_PREPARE
> > and only re-enable it after the hotplug has been completed.
> 
> I'll take a look at disabling coupled idle completely during hotplug.

Great, thanks!
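
To illustrate what I mean by that, a bare-bones notifier could look more
or less like the sketch below.  cpuidle_coupled_prevent_idle() and
cpuidle_coupled_allow_idle() are placeholder names here, not something
that exists in the patch:

static int cpuidle_coupled_cpu_notify(struct notifier_block *nb,
				      unsigned long action, void *hcpu)
{
	int cpu = (unsigned long)hcpu;

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_UP_PREPARE:
	case CPU_DOWN_PREPARE:
		/* keep cpus out of the coupled path while hotplug runs */
		cpuidle_coupled_prevent_idle(cpu);
		break;
	case CPU_ONLINE:
	case CPU_DEAD:
	case CPU_UP_CANCELED:
	case CPU_DOWN_FAILED:
		/* hotplug finished (or was aborted), coupled idle is safe again */
		cpuidle_coupled_allow_idle(cpu);
		break;
	}

	return NOTIFY_OK;
}

with the notifier block registered via register_cpu_notifier() at init time.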

> >> >> +     alive = atomic_read(&coupled->alive_count);
> >> >> +     smp_rmb();
> >> >> +     waiting = atomic_read(&coupled->waiting_count);
> >> >
> >> > Have you considered using one atomic variable to accommodate both counters
> >> > such that the upper half contains one counter and the lower half contains
> >> > the other?
> >>
> >> There are 3 counters (alive, waiting, and ready).  Do you want me to
> >> squish all of them into a single atomic_t, which would limit to 1023
> >> cpus?
> >
> > No.  I'd make sure that cpuidle_enter_state_coupled() didn't race with CPU
> > hotplug, so as to make alive_count stable from its standpoint, and I'd
> > put the two remaining counters into one atomic_t variable.
> 
> I'll take a look at using a single atomic_t.  My initial worry was
> that the increased contention on the shared variable would cause more
> cmpxchg retries, but since waiting_count and ready_count are designed
> to be modified in sequential phases that shouldn't be an issue.
> 
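
In case it helps, the kind of packing I was thinking of is something like
the sketch below.  The field width and the coupled_count member are made
up for illustration:

#define WAITING_BITS	16
#define WAITING_MASK	((1 << WAITING_BITS) - 1)

/* waiting_count in the low bits, ready_count in the high bits */
static inline int coupled_waiting(struct cpuidle_coupled *coupled)
{
	return atomic_read(&coupled->coupled_count) & WAITING_MASK;
}

static inline int coupled_ready(struct cpuidle_coupled *coupled)
{
	return atomic_read(&coupled->coupled_count) >> WAITING_BITS;
}

static inline void coupled_inc_ready(struct cpuidle_coupled *coupled)
{
	atomic_add(1 << WAITING_BITS, &coupled->coupled_count);
}

That way a single atomic_read() yields a consistent snapshot of both
counters, which should remove the need for most of the read barriers.
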
[...]
> >> >> +     while (!need_resched() && !cpuidle_coupled_cpus_waiting(coupled)) {
> >> >> +             entered_state = cpuidle_enter_state(dev, drv,
> >> >> +                     dev->safe_state_index);
> >> >> +
> >> >> +             local_irq_enable();
> >> >> +             while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
> >> >> +                     cpu_relax();
> >> >
> >> > Hmm.  What exactly is this loop supposed to achieve?
> >> This is to ensure that the outstanding wakeups have been processed so
> >> we don't go to idle with an interrupt pending and immediately wake up.
> >
> > I see.  Is it actually safe to reenable interrupts at this point, though?
> 
> I think so.  The normal idle loop will enable interrupts in a similar
> fashion to what happens here.  There are two things to worry about: a
> processed interrupt causing work to be scheduled that should bring
> this cpu out of idle, or changing the next timer which would
> invalidate the current requested state.  The first is handled by
> checking need_resched() after interrupts are disabled again, the
> second is currently unhandled but does not affect correct operation,
> it just races into a less-than-optimal idle state.

I see.

> >> >> +             local_irq_disable();
> >> >
> >> > Anyway, you seem to be calling it twice along with this enabling/disabling of
> >> > interrupts.  I'd put that into a separate function and explain its role in a
> >> > kerneldoc comment.
> >>
> >> I left it here to be obvious that I was enabling interrupts in the
> >> idle path, but I can refactor it out if you prefer.
> >
> > Well, you can call the function to make it obvious. :-)
> >
> > Anyway, I think that code duplication is a worse thing than a reasonable
> > amount of non-obviousness, so to speak.
> >
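
Something like the helper below is what I have in mind, just as a sketch
(the name is made up):

static void cpuidle_coupled_handle_pokes(struct cpuidle_device *dev)
{
	/* let any pending poke interrupts be taken before idling again */
	local_irq_enable();
	while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
		cpu_relax();
	local_irq_disable();
}

Then both places in cpuidle_enter_state_coupled() can simply call it.
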
> >> >> +     }
> >> >> +
> >> >> +     /* give a chance to process any remaining pokes */
> >> >> +     local_irq_enable();
> >> >> +     while (cpumask_test_cpu(dev->cpu, &cpuidle_coupled_poked_mask))
> >> >> +             cpu_relax();
> >> >> +     local_irq_disable();
> >> >> +
> >> >> +     if (need_resched()) {
> >> >> +             cpuidle_coupled_set_not_waiting(dev, coupled);
> >> >> +             goto out;
> >> >> +     }
> >> >> +
> >> >> +     /*
> >> >> +      * All coupled cpus are probably idle.  There is a small chance that
> >> >> +      * one of the other cpus just became active.  Increment a counter when
> >> >> +      * ready, and spin until all coupled cpus have incremented the counter.
> >> >> +      * Once a cpu has incremented the counter, it cannot abort idle and must
> >> >> +      * spin until either the count has hit alive_count, or another cpu
> >> >> +      * leaves idle.
> >> >> +      */
> >> >> +
> >> >> +     smp_mb__before_atomic_inc();
> >> >> +     atomic_inc(&coupled->ready_count);
> >> >> +     smp_mb__after_atomic_inc();
> >> >
> >> > It seems that at least one of these barriers is unnecessary ...
> >> The first is to ensure ordering between ready_count and waiting count,
> >
> > Are you afraid that the test against waiting_count from
> > cpuidle_coupled_cpus_waiting() may get reordered after the incrementation
> > of ready_count or is it something else?
> 
> Yes, ready_count must not be incremented before waiting_count == alive_count.

Well, control doesn't reach the atomic_inc() statement if this condition
is not satisfied, so I don't see how it could possibly be reordered before
the while () loop without breaking the control flow guarantees.

> >> the second is for ready_count vs. alive_count and requested_state.
> >
> > This one I can understand, but ...
> >
> >> >> +     /* alive_count can't change while ready_count > 0 */
> >> >> +     alive = atomic_read(&coupled->alive_count);
> >
> > What happens if CPU hotplug happens right here?
> 
> According to the comment above that line that can't happen -
> alive_count can't change while ready_count > 0, because that implies
> that all cpus are waiting and none can be in the hotplug path where
> alive_count is changed.  Looking at it again that is not entirely
> true, alive_count could change on systems with >2 cpus, but I think it
> can't cause an issue because alive_count would be 2 greater than
> waiting_count before alive_count was changed.  Either way, it will be
> fixed by disabling coupled idle during hotplug.

Yup.

> >> >> +     while (atomic_read(&coupled->ready_count) != alive) {
> >> >> +             /* Check if any other cpus bailed out of idle. */
> >> >> +             if (!cpuidle_coupled_cpus_waiting(coupled)) {
> >> >> +                     atomic_dec(&coupled->ready_count);
> >> >> +                     smp_mb__after_atomic_dec();
> >
> > And the barrier here?  Even if the old value of ready_count leaks into
> > the while () loop after retry, that doesn't seem to matter.
> 
> All of these will be academic if ready_count and waiting_count share
> an atomic_t.
> waiting_count must not be decremented by exiting the while loop after
> the retry label until ready_count is decremented here, but that is
> also protected by the barrier in set_not_waiting.  One of them could
> be dropped.
> 
> >> >> +                     goto retry;
> >> >> +             }
> >> >> +
> >> >> +             cpu_relax();
> >> >> +     }
> >> >> +
> >> >> +     /* all cpus have acked the coupled state */
> >> >> +     smp_rmb();
> >> >
> >> > What is the barrier here for?
> >> This protects ready_count vs. requested_state.  It is already
> >> implicitly protected by the atomic_inc_return in set_waiting, but I
> >> thought it would be better to protect it explicitly here.  I think I
> >> added the smp_mb__after_atomic_inc above later, which makes this one
> >> superfluous, so I'll drop it.
> >
> > OK
> >
> >> >> +
> >> >> +     next_state = cpuidle_coupled_get_state(dev, coupled);
> >> >> +
> >> >> +     entered_state = cpuidle_enter_state(dev, drv, next_state);
> >> >> +
> >> >> +     cpuidle_coupled_set_not_waiting(dev, coupled);
> >> >> +     atomic_dec(&coupled->ready_count);
> >> >> +     smp_mb__after_atomic_dec();
> >> >> +
> >> >> +out:
> >> >> +     /*
> >> >> +      * Normal cpuidle states are expected to return with irqs enabled.
> >> >> +      * That leads to an inefficiency where a cpu receiving an interrupt
> >> >> +      * that brings it out of idle will process that interrupt before
> >> >> +      * exiting the idle enter function and decrementing ready_count.  All
> >> >> +      * other cpus will need to spin waiting for the cpu that is processing
> >> >> +      * the interrupt.  If the driver returns with interrupts disabled,
> >> >> +      * all other cpus will loop back into the safe idle state instead of
> >> >> +      * spinning, saving power.
> >> >> +      *
> >> >> +      * Calling local_irq_enable here allows coupled states to return with
> >> >> +      * interrupts disabled, but won't cause problems for drivers that
> >> >> +      * exit with interrupts enabled.
> >> >> +      */
> >> >> +     local_irq_enable();
> >> >> +
> >> >> +     /*
> >> >> +      * Wait until all coupled cpus have exited idle.  There is no risk that
> >> >> +      * a cpu exits and re-enters the ready state because this cpu has
> >> >> +      * already decremented its waiting_count.
> >> >> +      */
> >> >> +     while (atomic_read(&coupled->ready_count) != 0)
> >> >> +             cpu_relax();
> >> >> +
> >> >> +     smp_rmb();
> >> >
> >> > And here?
> >>
> >> This was to protect ready_count vs. looping back in and reading
> >> alive_count.
> >
> > Well, I'm lost. :-)
> >
> > You've not modified anything after the previous smp_mb__after_atomic_dec(),
> > so what exactly is the reordering this is supposed to work against?
> >
> > And while we're at it, I'm not quite sure what the things that the previous
> > smp_mb__after_atomic_dec() separates from each other are.
> 
> Instead of justifying all of these, let me try the combined atomic_t
> trick and justify the (many fewer) remaining barriers.

OK, cool! :-)

Thanks,
Rafael
