* [PATCH] sched: Do not release current rq lock on non contended double_lock_balance()
@ 2016-06-13 16:37 Steven Rostedt
  2016-06-14 11:58 ` Peter Zijlstra
  0 siblings, 1 reply; 7+ messages in thread
From: Steven Rostedt @ 2016-06-13 16:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	Andrew Morton, Nick Piggin

Back in 2008 Gregory Haskins noticed that there was unfair locking when
dealing with double_lock_balance(). The old code (which still exists
when CONFIG_PREEMPT is disabled), on contention, will check the
ordering of the run-queue locks (by address) to determine if it should
release the current lock before taking the second lock.

	if (!spin_trylock(busiest->lock)) {
		if (busiest < this_rq) {
			spin_unlock(&this_rq->lock);
			spin_lock(&busiest->lock);
			spin_lock(&this_rq->lock);
		} else
			spin_lock(&busiest->lock);
	}

This gives unfair access to higher-numbered CPUs trying to take a
contended lock. A lower-numbered CPU (one whose rq structure has a lower
address) won't release its own lock when trying to take the next lock.
This can escalate locking latency: a waiter on the current
'this_rq->lock' now effectively also has to wait for the
'busiest->lock' to be taken.

The solution was to simply release the current (this_rq) lock and then
take both locks.

	spin_unlock(&this_rq->lock);
	double_rq_lock(this_rq, busiest);

Where double_rq_lock() takes the locks in order of the rq's address.
This way, if there was a waiter on the rq lock, the ticket spinlock
guarantees that the waiter acquires the lock immediately, and the
previous owner has to wait its turn now that it released the lock.
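
For reference, double_rq_lock() looks roughly like this (paraphrased
from memory, untested; shown only to illustrate the address ordering):

	static void double_rq_lock(struct rq *rq1, struct rq *rq2)
		__acquires(rq1->lock)
		__acquires(rq2->lock)
	{
		BUG_ON(!irqs_disabled());
		if (rq1 == rq2) {
			raw_spin_lock(&rq1->lock);
			__acquire(rq2->lock);	/* Fake it out ;) */
		} else {
			if (rq1 < rq2) {
				/* lower rq address is always locked first */
				raw_spin_lock(&rq1->lock);
				raw_spin_lock_nested(&rq2->lock, SINGLE_DEPTH_NESTING);
			} else {
				raw_spin_lock(&rq2->lock);
				raw_spin_lock_nested(&rq1->lock, SINGLE_DEPTH_NESTING);
			}
		}
	}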

What I found while testing on an 80-core machine was a lot of
"ping-ponging" of the locks.

Let's say an RT task is woken up on a CPU from another CPU, and the CPU
the task was woken on is already running an RT task and tries to push:


	CPU 0				CPU 1
	-----				-----
    [ wake up ]
				     spin_lock(cpu1_rq->lock);
    spin_lock(cpu1_rq->lock)
				    double_lock_balance()
				    [ release cpu1_rq->lock ]
				    spin_lock(cpu1_rq->lock)
    [due to ticket, now acquires
     cpu1_rq->lock ]

    [goes to push task]
    double_lock_balance()
    [ release cpu1_rq->lock ]
                                   [ acquires lock ]
				   spin_lock(cpu2_rq->lock)
				   [ blocks as cpu2 is using it ]

And then CPU 2 calls double_lock_balance() and the ping-pong continues.

What I could not understand about Gregory's patch is that regardless of
contention, the currently held lock is always released, opening up a
window for this ping ponging to occur. When I changed the code to only
release on contention of the second lock, things improved tremendously.
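
The return value of double_lock_balance() still tells callers whether
this_rq->lock was dropped and thus whether state must be re-validated.
Roughly how the callers use it (paraphrased and trimmed from
find_lock_lowest_rq() in kernel/sched/rt.c, untested):

	if (double_lock_balance(rq, lowest_rq)) {
		/*
		 * We had to drop rq->lock: the task may have migrated,
		 * changed affinity, or started running in the meantime,
		 * so everything must be re-checked.
		 */
		if (unlikely(task_rq(task) != rq ||
			     !cpumask_test_cpu(lowest_rq->cpu,
					       tsk_cpus_allowed(task)) ||
			     task_running(rq, task) ||
			     !task_on_rq_queued(task))) {
			double_unlock_balance(rq, lowest_rq);
			lowest_rq = NULL;
		}
	}

With the trylock fast path returning 0, that re-validation is skipped
whenever the second lock was taken without the first one ever being
dropped.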

The problem that Gregory was facing was that there was an unfair access
when there was contention. But what his solution did was to always
release the lock even when there was no contention on the second lock,
which happened to cause more contention later on.

I talked with Peter Zijlstra about this, and he pointed me to the
thread where this patch was discussed back in 2008. As I read the
thread, Nick Piggin brought up this very issue. After that short
discussion, it was decided to let the waiting task have the lock even
when there was no contention, to be even more "fair". The problem,
though, is that Nick's suggestion was never tested. If it had been back
then, I'm sure Gregory would have decided to only release the lock if
there was contention on the second lock.

Link: http://lkml.kernel.org/r/20080825201534.23217.14936.stgit@dev.haskins.net

Suggested-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ec2e8d23527e..3ed9ef770085 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1548,10 +1548,15 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
 	__acquires(busiest->lock)
 	__acquires(this_rq->lock)
 {
-	raw_spin_unlock(&this_rq->lock);
-	double_rq_lock(this_rq, busiest);
+	int ret = 0;
+
+	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
+		raw_spin_unlock(&this_rq->lock);
+		double_rq_lock(this_rq, busiest);
+		ret = 1;
+	}
 
-	return 1;
+	return ret;
 }
 
 #else


* Re: [PATCH] sched: Do not release current rq lock on non contended double_lock_balance()
  2016-06-13 16:37 [PATCH] sched: Do not release current rq lock on non contended double_lock_balance() Steven Rostedt
@ 2016-06-14 11:58 ` Peter Zijlstra
  2016-06-14 17:52   ` Steven Rostedt
  2016-06-14 18:02   ` Steven Rostedt
  0 siblings, 2 replies; 7+ messages in thread
From: Peter Zijlstra @ 2016-06-14 11:58 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	Andrew Morton, Nick Piggin

On Mon, Jun 13, 2016 at 12:37:32PM -0400, Steven Rostedt wrote:
> The solution was to simply release the current (this_rq) lock and then
> take both locks.
> 
> 	spin_unlock(&this_rq->lock);
> 	double_rq_lock(this_rq, busiest);

> What I could not understand about Gregory's patch is that regardless of
> contention, the currently held lock is always released, opening up a
> window for this ping ponging to occur. When I changed the code to only
> release on contention of the second lock, things improved tremendously.

It's simpler to reason about, and there wasn't a problem with it at the time.

The above puts a strict limit on hold time and is fair because of the
queueing.

> +++ b/kernel/sched/sched.h
> @@ -1548,10 +1548,15 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
>  	__acquires(busiest->lock)
>  	__acquires(this_rq->lock)
>  {
> +	int ret = 0;
> +
> +	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
> +		raw_spin_unlock(&this_rq->lock);
> +		double_rq_lock(this_rq, busiest);
> +		ret = 1;
> +	}
>  
> +	return ret;
>  }

This relies on trylock not being allowed to steal the lock, which I think
is true for all fair spinlocks (for ticket locks this must be true, but
stealing is possible with qspinlock, for example).

And it does indeed make the hold time harder to analyze.

For instance; pull_rt_task() does:

	for_each_cpu() {
		double_lock_balance(this, that);
		...
		double_unlock_balance(this, that);
	}

Which, with the trylock, ends up with a max possible hold time of
O(nr_cpus).

Unlikely, sure, but RT is a game of upper bounds etc.

So should we maybe do something like:

	if (unlikely(raw_spin_is_contended(&this_rq->lock) ||
	             !raw_spin_trylock(&busiest->lock))) {
		raw_spin_unlock(&this_rq->lock);
		double_rq_lock(this_rq, busiest);
		ret = 1;
	}

?
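
Folded into the patched _double_lock_balance(), that would look something
like this (sketch only, untested):

	static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
		__releases(this_rq->lock)
		__acquires(busiest->lock)
		__acquires(this_rq->lock)
	{
		int ret = 0;

		/* fall back to the fair slow path if our own lock has waiters */
		if (unlikely(raw_spin_is_contended(&this_rq->lock) ||
			     !raw_spin_trylock(&busiest->lock))) {
			raw_spin_unlock(&this_rq->lock);
			double_rq_lock(this_rq, busiest);
			ret = 1;
		}

		return ret;
	}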

> 	CPU 0				CPU 1
> 	-----				-----
>     [ wake up ]
> 				     spin_lock(cpu1_rq->lock);
>     spin_lock(cpu1_rq->lock)
> 				    double_lock_balance()
> 				    [ release cpu1_rq->lock ]
> 				    spin_lock(cpu1_rq->lock)
>     [due to ticket, now acquires
>      cpu1_rq->lock ]
> 
>     [goes to push task]
>     double_lock_balance()
>     [ release cpu1_rq->lock ]
>                                    [ acquires lock ]
> 				   spin_lock(cpu2_rq->lock)
> 				   [ blocks as cpu2 is using it ]
> 

Also, it's not entirely clear this scenario illustrates how your
change is better, because here the lock _is_ contended, so we'll fail
the trylock, no?


* Re: [PATCH] sched: Do not release current rq lock on non contended double_lock_balance()
  2016-06-14 11:58 ` Peter Zijlstra
@ 2016-06-14 17:52   ` Steven Rostedt
  2016-06-14 18:02   ` Steven Rostedt
  1 sibling, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2016-06-14 17:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	Andrew Morton, Nick Piggin

> 
> > 	CPU 0				CPU 1
> > 	-----				-----
> >     [ wake up ]
> > 				     spin_lock(cpu1_rq->lock);
> >     spin_lock(cpu1_rq->lock)
> > 				    double_lock_balance()
> > 				    [ release cpu1_rq->lock ]
> > 				    spin_lock(cpu1_rq->lock)
> >     [due to ticket, now acquires
> >      cpu1_rq->lock ]
> > 
> >     [goes to push task]
> >     double_lock_balance()
> >     [ release cpu1_rq->lock ]
> >                                    [ acquires lock ]
> > 				   spin_lock(cpu2_rq->lock)
> > 				   [ blocks as cpu2 is using it ]
> >   
> 
> Also, it's not entirely clear this scenario illustrates how your
> change is better, because here the lock _is_ contended, so we'll fail
> the trylock, no?

Sorry, I should have been more specific: that double_lock_balance() was
grabbing cpu2_rq->lock (another rq lock), where there was no contention.

-- Steve


* Re: [PATCH] sched: Do not release current rq lock on non contended double_lock_balance()
  2016-06-14 11:58 ` Peter Zijlstra
  2016-06-14 17:52   ` Steven Rostedt
@ 2016-06-14 18:02   ` Steven Rostedt
  2016-06-14 19:42     ` Peter Zijlstra
  2016-06-15 11:14     ` Peter Zijlstra
  1 sibling, 2 replies; 7+ messages in thread
From: Steven Rostedt @ 2016-06-14 18:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	Andrew Morton, Nick Piggin

On Tue, 14 Jun 2016 13:58:20 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> 
> The above puts a strict limit on hold time and is fair because of the
> queueing.
> 
> > +++ b/kernel/sched/sched.h
> > @@ -1548,10 +1548,15 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest)
> >  	__acquires(busiest->lock)
> >  	__acquires(this_rq->lock)
> >  {
> > +	int ret = 0;
> > +
> > +	if (unlikely(!raw_spin_trylock(&busiest->lock))) {
> > +		raw_spin_unlock(&this_rq->lock);
> > +		double_rq_lock(this_rq, busiest);
> > +		ret = 1;
> > +	}
> >  
> > +	return ret;
> >  }  
> 
> This relies on trylock not being allowed to steal the lock, which I think
> is true for all fair spinlocks (for ticket locks this must be true, but
> stealing is possible with qspinlock, for example).
> 
> And it does indeed make the hold time harder to analyze.
> 
> For instance; pull_rt_task() does:
> 
> 	for_each_cpu() {
> 		double_lock_balance(this, that);
> 		...
> 		double_unlock_balance(this, that);
> 	}
> 
> Which, with the trylock, ends up with a max possible hold time of
> O(nr_cpus).

Sure, but I think we should try to limit that loop too, because that
loop itself is what is triggering the large latency for me, because
it constantly releases a spinlock and has to wait. This loop is done
with preemption disabled.

> 
> Unlikely, sure, but RT is a game of upper bounds etc.

Sure, but should we force worst case all the time?

We do a lot of optimization to allow for good throughput as well.

> 
> So should we maybe do something like:
> 
> 	if (unlikely(raw_spin_is_contended(&this_rq->lock) ||
> 	             !raw_spin_trylock(&busiest->lock))) {

Why do we care if this_rq is contended? That's exactly what causes
large latency to happen. When we let go of this_rq, this fast path
becomes much slower, because now it must wait for whatever was waiting
on the lock to finish. The more CPUs you have, the bigger this issue
becomes.

If there's a loop of O(nr_cpus) (which is still technically O(1)), then
another CPU may need to wait for that loop to finish, but the loop
itself is kept tighter. If we always release the lock, we allow other
CPUs to continue at the expense of the one CPU making progress; the
constant factor of that O(nr_cpus) becomes much larger, and we
consolidate all the latency onto a single CPU, which may be the one
running the highest priority task in the system.

I'm seeing latencies of hundreds of microseconds because of this. With
this patch, that latency disappears.

-- Steve



> 		raw_spin_unlock(&this_rq->lock);
> 		double_rq_lock(this_rq, busiest);
> 		ret = 1;
> 	}


* Re: [PATCH] sched: Do not release current rq lock on non contended double_lock_balance()
  2016-06-14 18:02   ` Steven Rostedt
@ 2016-06-14 19:42     ` Peter Zijlstra
  2016-06-15 11:14     ` Peter Zijlstra
  1 sibling, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2016-06-14 19:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	Andrew Morton, Nick Piggin

On Tue, Jun 14, 2016 at 02:02:28PM -0400, Steven Rostedt wrote:
> On Tue, 14 Jun 2016 13:58:20 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> > And it does indeed make the hold time harder to analyze.
> > 
> > For instance; pull_rt_task() does:
> > 
> > 	for_each_cpu() {
> > 		double_lock_balance(this, that);
> > 		...
> > 		double_unlock_balance(this, that);
> > 	}
> > 
> > Which, with the trylock, ends up with a max possible hold time of
> > O(nr_cpus).
> 
> Sure, but I think we should try to limit that loop too, because that
> loop itself is what is triggering the large latency for me, because
> it constantly releases a spinlock and has to wait. This loop is done
> with preemption disabled.

Much worse, it's done with IRQs disabled. But that affects only the local
CPU. Holding the lock that long affects all other CPUs too.

> > Unlikely, sure, but RT is a game of upper bounds etc.
> 
> Sure, but should we force worst case all the time?

How is that relevant? Either you have a bounded operation or you don't.

> We do a lot of optimization to allow for good throughput as well.

Only while keeping within the upper bounds. The moment you let go of that,
you've destroyed RT.

> > So should we maybe do something like:
> > 
> > 	if (unlikely(raw_spin_is_contended(&this_rq->lock) ||
> > 	             !raw_spin_trylock(&busiest->lock))) {
> 
> Why do we care if this_rq is contended? 

To bound hold time.

> That's exactly what causes
> large latency to happen. When we let go of this_rq, this fast path
> becomes much slower, because now it must wait for whatever was waiting
> on the lock to finish. The more CPUs you have, the bigger this issue
> becomes.

Yes, icky issue.

And while the numbers look pretty, I'm not sure you haven't introduced
another, less likely, issue.


* Re: [PATCH] sched: Do not release current rq lock on non contended double_lock_balance()
  2016-06-14 18:02   ` Steven Rostedt
  2016-06-14 19:42     ` Peter Zijlstra
@ 2016-06-15 11:14     ` Peter Zijlstra
  2016-06-15 16:13       ` Steven Rostedt
  1 sibling, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2016-06-15 11:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	Andrew Morton, Nick Piggin

On Tue, Jun 14, 2016 at 02:02:28PM -0400, Steven Rostedt wrote:
> > For instance; pull_rt_task() does:
> > 
> > 	for_each_cpu() {
> > 		double_lock_balance(this, that);
> > 		...
> > 		double_unlock_balance(this, that);
> > 	}
> > 
> > Which, with the trylock, ends up with a max possible hold time of
> > O(nr_cpus).
> 
> Sure, but I think we should try to limit that loop too, because that
> loop itself is what is triggering the large latency for me, because
> it constantly releases a spinlock and has to wait. This loop is done
> with preemption disabled.

OK, so should not the whole HAVE_RT_PUSH_IPI thing have avoided that
loop entirely? And therefore made the point moot?

In any case, can't we add another cpupri for pushable tasks and use that
to find the highest priority task to pull and avoid the loop thus?


* Re: [PATCH] sched: Do not release current rq lock on non contended double_lock_balance()
  2016-06-15 11:14     ` Peter Zijlstra
@ 2016-06-15 16:13       ` Steven Rostedt
  0 siblings, 0 replies; 7+ messages in thread
From: Steven Rostedt @ 2016-06-15 16:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Ingo Molnar, Thomas Gleixner, Clark Williams,
	Andrew Morton, Nick Piggin

On Wed, 15 Jun 2016 13:14:53 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> OK, so should not the whole HAVE_RT_PUSH_IPI thing have avoided that
> loop entirely? And therefore made the point moot?

I believe there was another issue that we had in our tests. But I don't
have the trace available with me. I'll rerun the tests when I get back
home and have some more concrete examples for you.

> 
> In any case, can't we add another cpupri for pushable tasks and use that
> to find the highest priority task to pull and avoid the loop thus?

I thought about this too, but I was a bit concerned about the
complexities it would add. I can look into it, though. I'm currently in
NYC for personal reasons and will take a look at this when I get back.

-- Steve

