linux-kernel.vger.kernel.org archive mirror
* 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets
@ 2012-09-14  7:47 Nikolay Ulyanitsky
  2012-09-14 18:40 ` Borislav Petkov
  2012-09-14 21:27 ` 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected Borislav Petkov
  0 siblings, 2 replies; 115+ messages in thread
From: Nikolay Ulyanitsky @ 2012-09-14  7:47 UTC (permalink / raw)
  To: linux-kernel

Hi
I compiled the 3.6-rc5 kernel with the same config from 3.5.3 and got
the 15-20% performance drop of PostgreSQL 9.2 on AMD chipsets (880G,
990X).

CentOS 6.3 x86_64
PostgreSQL 9.2
cpufreq scaling_governor - performance

# /etc/init.d/postgresql initdb
# echo "fsync = off" >> /var/lib/pgsql/data/postgresql.conf
# /etc/init.d/postgresql start
# su - postgres
$ psql
# create database pgbench;
# \q

# pgbench -i pgbench && pgbench -c 10 -t 10000 pgbench
tps = 4670.635648 (including connections establishing)
tps = 4673.630345 (excluding connections establishing)

On kernel 3.5.3:
tps = ~5800

1) Host 1 - 15-20% performance drop
AMD Phenom(tm) II X6 1090T Processor
MB: AMD 880G
RAM: 16 Gb DDR3
SSD: PLEXTOR PX-256M3 256Gb

2) Host 2 - 15-20% performance drop
AMD Phenom(tm) II X6 1055T Processor
MB: AMD 990X
RAM: 32 Gb DDR3
SSD: Corsair Performance Pro 128Gb

3) Host 3 - no problems - same performance
Intel E6300
MB: Intel® P43 / ICH10
RAM: 4 Gb DDR3
HDD: SATA 7200 rpm

Kernel config - http://pastebin.com/cFpg5JSJ

Any ideas?

Thx

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets
  2012-09-14  7:47 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets Nikolay Ulyanitsky
@ 2012-09-14 18:40 ` Borislav Petkov
  2012-09-14 18:51   ` Borislav Petkov
  2012-09-14 21:27 ` 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected Borislav Petkov
  1 sibling, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-14 18:40 UTC (permalink / raw)
  To: Nikolay Ulyanitsky; +Cc: linux-kernel, Andreas Herrmann

On Fri, Sep 14, 2012 at 10:47:44AM +0300, Nikolay Ulyanitsky wrote:
> Hi
> I compiled the 3.6-rc5 kernel with the same config from 3.5.3 and got
> the 15-20% performance drop of PostgreSQL 9.2 on AMD chipsets (880G,
> 990X).
> 
> CentOS 6.3 x86_64
> PostgreSQL 9.2
> cpufreq scaling_governor - performance
> 
> # /etc/init.d/postgresql initdb
> # echo "fsync = off" >> /var/lib/pgsql/data/postgresql.conf
> # /etc/init.d/postgresql start
> # su - postgres
> $ psql
> # create database pgbench;
> # \q
> 
> # pgbench -i pgbench && pgbench -c 10 -t 10000 pgbench
> tps = 4670.635648 (including connections establishing)
> tps = 4673.630345 (excluding connections establishing)

Ok, I was able to reproduce it here too, albeit with different userspace
(debian testing and postgres 9.1).

I'll try a coarse bisection of the -rcs first, to see where the
regression appeared.

Thanks for reporting this.

-- 
Regards/Gruss,
Boris.


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets
  2012-09-14 18:40 ` Borislav Petkov
@ 2012-09-14 18:51   ` Borislav Petkov
  0 siblings, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-14 18:51 UTC (permalink / raw)
  To: Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann

On Fri, Sep 14, 2012 at 08:40:32PM +0200, Borislav Petkov wrote:
> I'll try a coarse bisection of the -rcs first, to see where the
> regression appeared.

Ok, 3.6-rc1 already shows the regression so I have to do the normal
bisection now to see which patch caused it.

Stay tuned.

-- 
Regards/Gruss,
Boris.


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14  7:47 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets Nikolay Ulyanitsky
  2012-09-14 18:40 ` Borislav Petkov
@ 2012-09-14 21:27 ` Borislav Petkov
  2012-09-14 21:40   ` Peter Zijlstra
                     ` (2 more replies)
  1 sibling, 3 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-14 21:27 UTC (permalink / raw)
  To: Nikolay Ulyanitsky, Mike Galbraith
  Cc: linux-kernel, Andreas Herrmann, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar

(Adding everybody to CC and leaving the below for reference.)

Guys,

as Nikolay says below, we have a regression in 3.6 with pgbench's
benchmark in postgresql.

I was able to reproduce it on another box here and did a bisection run.
It pointed to the commit below.

And yes, reverting that commit fixes the issue here.

@Nikolay: can you try reverting it from 3.6-rc5 and check whether the
regression disappears at your end?

Thanks.

commit 970e178985cadbca660feb02f4d2ee3a09f7fdda
Author: Mike Galbraith <efault@gmx.de>
Date:   Tue Jun 12 05:18:32 2012 +0200

    sched: Improve scalability via 'CPU buddies', which withstand random perturbations
    
    Traversing an entire package is not only expensive, it also leads to tasks
    bouncing all over a partially idle and possibly quite large package.  Fix
    that up by assigning a 'buddy' CPU to try to motivate.  Each buddy may try
    to motivate that one other CPU, if it's busy, tough, it may then try its
    SMT sibling, but that's all this optimization is allowed to cost.
    
    Sibling cache buddies are cross-wired to prevent bouncing.
    
    4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:
    
     clients     1       2       4        8       16       32       64      128
     ..........................................................................
     pre        30      41     118      645     3769     6214    12233    14312
     post      299     603    1211     2418     4697     6847    11606    14557
    
    A nice increase in performance.
    
    Signed-off-by: Mike Galbraith <efault@gmx.de>
    Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>


On Fri, Sep 14, 2012 at 10:47:44AM +0300, Nikolay Ulyanitsky wrote:
> Hi
> I compiled the 3.6-rc5 kernel with the same config from 3.5.3 and got
> the 15-20% performance drop of PostgreSQL 9.2 on AMD chipsets (880G,
> 990X).
> 
> CentOS 6.3 x86_64
> PostgreSQL 9.2
> cpufreq scaling_governor - performance
> 
> # /etc/init.d/postgresql initdb
> # echo "fsync = off" >> /var/lib/pgsql/data/postgresql.conf
> # /etc/init.d/postgresql start
> # su - postgres
> $ psql
> # create database pgbench;
> # \q
> 
> # pgbench -i pgbench && pgbench -c 10 -t 10000 pgbench
> tps = 4670.635648 (including connections establishing)
> tps = 4673.630345 (excluding connections establishing)
> 
> On kernel 3.5.3:
> tps = ~5800
> 
> 1) Host 1 - 15-20% performance drop
> AMD Phenom(tm) II X6 1090T Processor
> MB: AMD 880G
> RAM: 16 Gb DDR3
> SSD: PLEXTOR PX-256M3 256Gb
> 
> 2) Host 2 - 15-20% performance drop
> AMD Phenom(tm) II X6 1055T Processor
> MB: AMD 990X
> RAM: 32 Gb DDR3
> SSD: Corsair Performance Pro 128Gb
> 
> 3) Host 3 - no problems - same performance
> Intel E6300
> MB: Intel® P43 / ICH10
> RAM: 4 Gb DDR3
> HDD: SATA 7200 rpm
> 
> Kernel config - http://pastebin.com/cFpg5JSJ
> 
> Any ideas?
> 
> Thx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Regards/Gruss,
    Boris.


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:27 ` 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected Borislav Petkov
@ 2012-09-14 21:40   ` Peter Zijlstra
  2012-09-14 21:44     ` Linus Torvalds
  2012-09-14 21:45     ` Borislav Petkov
  2012-09-14 21:42   ` Linus Torvalds
  2012-09-15  4:11   ` Mike Galbraith
  2 siblings, 2 replies; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-14 21:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nikolay Ulyanitsky, Mike Galbraith, linux-kernel,
	Andreas Herrmann, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, 2012-09-14 at 23:27 +0200, Borislav Petkov wrote:
> 
> I was able to reproduce it on another box here and did a bisection run.
> It pointed to the commit below.
> 
> And yes, reverting that commit fixes the issue here.

Hmm, cute. What kind of machine did you test it on? Nikolay's machines
look to be smallish AMD X6 or ancient Intel c2d (the patch will indeed
have absolutely no effect on a dual core).

I'll see about running pgbench on a bigger Intel tomorrow if Mike
doesn't beat me to it.

The problem the patch is trying to address is not having to scan an
entire package for idle cores on every wakeup now that packages are
getting stupid big.

Regressing Postgres otoh isn't nice either..

Anyway, I guess I'm fine with nixing this patch until we figure out
something smarter..

I'm also curious to know wth postgres does that this patch makes such a
big difference...


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:27 ` 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected Borislav Petkov
  2012-09-14 21:40   ` Peter Zijlstra
@ 2012-09-14 21:42   ` Linus Torvalds
  2012-09-15  3:33     ` Mike Galbraith
  2012-09-24 15:00     ` Mel Gorman
  2012-09-15  4:11   ` Mike Galbraith
  2 siblings, 2 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-14 21:42 UTC (permalink / raw)
  To: Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Linus Torvalds,
	Andrew Morton, Thomas Gleixner, Ingo Molnar

On Fri, Sep 14, 2012 at 2:27 PM, Borislav Petkov <bp@alien8.de> wrote:
>
> as Nikolay says below, we have a regression in 3.6 with pgbench's
> benchmark in postgresql.
>
> I was able to reproduce it on another box here and did a bisection run.
> It pointed to the commit below.

Ok. I guess we should just revert it. However, before we do that,
maybe Mike can make it just use the exact old semantics of
select_idle_sibling() in the update_top_cache_domain() logic.

Because the patch in question seems to do two things:
 (a) cache the "idle_buddy" logic, so that we don't have those costly loops
 (b) change it to do that "left-right" thing.

and that (b) thing may be what causes a regression for you.

So my gut feel is that the patch was wrong to begin with, exactly
because it did two independent changes. It *should* have treated those
two issues as independent changes and separate commits.

Maybe I'm mis-reading it. Mike? Peter?

            Linus


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:40   ` Peter Zijlstra
@ 2012-09-14 21:44     ` Linus Torvalds
  2012-09-14 21:56       ` Peter Zijlstra
  2012-09-14 21:45     ` Borislav Petkov
  1 sibling, 1 reply; 115+ messages in thread
From: Linus Torvalds @ 2012-09-14 21:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, Sep 14, 2012 at 2:40 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> The problem the patch is trying to address is not having to scan an
> entire package for idle cores on every wakeup now that packages are
> getting stupid big.

No, it does something *else* too. That whole "left-right" logic to
(according to the commit message) "prevent bouncing" is entirely new,
afaik.

So it is *not* just about avoiding to have to scan the whole package.
It changes actual semantics too. No?

            Linus


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:40   ` Peter Zijlstra
  2012-09-14 21:44     ` Linus Torvalds
@ 2012-09-14 21:45     ` Borislav Petkov
  1 sibling, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-14 21:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nikolay Ulyanitsky, Mike Galbraith, linux-kernel,
	Andreas Herrmann, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, Sep 14, 2012 at 11:40:43PM +0200, Peter Zijlstra wrote:
> Hmm, cute. What kind of machine did you test it on? Nikolay's machines
> look to be smallish AMD X6 or ancient Intel c2d (the patch will indeed
> have absolutely no effect on a dual core).

Yep, I took an X6 too. So it's a single-socket, 6-core AMD, F10h.

> I'll see about running pgbench on a bigger Intel tomorrow if Mike
> doesn't beat me to it.

Can try that too on one of the bigger machines I have, if needed.

> The problem the patch is trying to address is not having to scan an
> entire package for idle cores on every wakeup now that packages are
> getting stupid big.
> 
> Regressing Postgres otoh isn't nice either..
> 
> Anyway, I guess I'm fine with nixing this patch until we figure out
> something smarter..
> 
> I'm also curious to know wth postgres does that this patch makes such a
> big difference...

I'm using 9.1 in Debian testing while Nikolay is using 9.2.

Thanks.

-- 
Regards/Gruss,
    Boris.


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:44     ` Linus Torvalds
@ 2012-09-14 21:56       ` Peter Zijlstra
  2012-09-14 21:59         ` Peter Zijlstra
  2012-09-14 22:01         ` Linus Torvalds
  0 siblings, 2 replies; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-14 21:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, 2012-09-14 at 14:44 -0700, Linus Torvalds wrote:
> On Fri, Sep 14, 2012 at 2:40 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > The problem the patch is trying to address is not having to scan an
> > entire package for idle cores on every wakeup now that packages are
> > getting stupid big.
> 
> No, it does something *else* too. That whole "left-right" logic to
> (according to the commit message) "prevent bouncing" is entirely new,
> afaik.
> 
> So it is *not* just about avoiding to have to scan the whole package.
> It changes actual semantics too. No?

Both things change semantics, not looking at the entire package is new
too. But yeah I guess you could look at the exact cross-stitching as an
enhancement to the 'idle_buddy' thing.




* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:56       ` Peter Zijlstra
@ 2012-09-14 21:59         ` Peter Zijlstra
  2012-09-15  3:57           ` Mike Galbraith
  2012-09-14 22:01         ` Linus Torvalds
  1 sibling, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-14 21:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, 2012-09-14 at 23:56 +0200, Peter Zijlstra wrote:
> On Fri, 2012-09-14 at 14:44 -0700, Linus Torvalds wrote:
> > On Fri, Sep 14, 2012 at 2:40 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > >
> > > The problem the patch is trying to address is not having to scan an
> > > entire package for idle cores on every wakeup now that packages are
> > > getting stupid big.
> > 
> > No, it does something *else* too. That whole "left-right" logic to
> > (according to the commit message) "prevent bouncing" is entirely new,
> > afaik.
> > 
> > So it is *not* just about avoiding to have to scan the whole package.
> > It changes actual semantics too. No?
> 
> Both things change semantics, not looking at the entire package is new
> too. But yeah I guess you could look at the exact cross-stitching as an
> enhancement to the 'idle_buddy' thing.

What I'm saying is that having an idle_buddy means you have to assign
one in the first place, his left-right stuff might not be the simplest
means to do that -- in fact I suggested he do a simple shift first time
I saw that patch.

So if not the left-right thing, you still need to do _something_ to make
the idle_buddy work at all. So it's not entirely separate.


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:56       ` Peter Zijlstra
  2012-09-14 21:59         ` Peter Zijlstra
@ 2012-09-14 22:01         ` Linus Torvalds
  2012-09-14 22:10           ` Peter Zijlstra
  2012-09-14 22:14           ` Borislav Petkov
  1 sibling, 2 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-14 22:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, Sep 14, 2012 at 2:56 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> Both things change semantics, not looking at the entire package is new
> too.

Well, the "idle_buddy" thing on its own could be considered to be
purely a caching thing.

Sure, it doesn't take tsk_cpus_allowed() into account while setting up
the cache (since it's not dynamic enough), but *assuming* the common
case is that people let threads be on any of the cores of a package,
it should be possible to make the cache 100% equivalent with no
semantic change. No?

The code doesn't even try to do that kind of "don't change semantics",
though, and makes the idle-buddy thing entirely different.

            Linus


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 22:01         ` Linus Torvalds
@ 2012-09-14 22:10           ` Peter Zijlstra
  2012-09-14 22:20             ` Linus Torvalds
  2012-09-14 22:14           ` Borislav Petkov
  1 sibling, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-14 22:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, 2012-09-14 at 15:01 -0700, Linus Torvalds wrote:
> Sure, it doesn't take tsk_cpus_allowed() into account while setting up
> the cache (since it's not dynamic enough), but *assuming* the common
> case is that people let threads be on any of the cores of a package,
> it should be possible to make the cache 100% equivalent with no
> semantic change. No? 

I'm not seeing how it could be. Only ever looking at 1 other cpu
(regardless which one) cannot be the same as checking 'all' of them.

Suppose we have the 6-core AMD chip: a task being woken on cpu0 would
look at cpus 1-5 (the entire package shares cache) to see if any of them
was idle. Only looking at a single cpu avoids looking at the other
4.

The chance of finding an idle cpu to run on is much bigger the more cpus
you look at (also more expensive).



* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 22:01         ` Linus Torvalds
  2012-09-14 22:10           ` Peter Zijlstra
@ 2012-09-14 22:14           ` Borislav Petkov
  1 sibling, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-14 22:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Nikolay Ulyanitsky, Mike Galbraith, linux-kernel,
	Andreas Herrmann, Andrew Morton, Thomas Gleixner, Ingo Molnar

On Fri, Sep 14, 2012 at 03:01:44PM -0700, Linus Torvalds wrote:
> On Fri, Sep 14, 2012 at 2:56 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > Both things change semantics, not looking at the entire package is new
> > too.
> 
> Well, the "idle_buddy" thing on its own could be considered to be
> purely a caching thing.
> 
> Sure, it doesn't take tsk_cpus_allowed() into account while setting up
> the cache (since it's not dynamic enough), but *assuming* the common
> case is that people let threads be on any of the cores of a package,
> it should be possible to make the cache 100% equivalent with no
> semantic change. No?
> 
> The code doesn't even try to do that kind of "don't change semantics",
> though, and makes the idle-buddy thing entirely different.

Could it be that this machine has a non-power-of-two number of cores (6)
and maybe the idle buddy selection doesn't work as expected?

 (Although it shouldn't, since the commit message also speaks of a
  non-power-of-two core count (40) and it works fine there.)

Anyway, here's the topology of the box:

$ for i in $(ls cpu?/topology/*); do echo -n $i": "; cat $i; done
cpu0/topology/core_id: 0
cpu0/topology/core_siblings: 3f
cpu0/topology/core_siblings_list: 0-5
cpu0/topology/physical_package_id: 0
cpu0/topology/thread_siblings: 01
cpu0/topology/thread_siblings_list: 0
cpu1/topology/core_id: 1
cpu1/topology/core_siblings: 3f
cpu1/topology/core_siblings_list: 0-5
cpu1/topology/physical_package_id: 0
cpu1/topology/thread_siblings: 02
cpu1/topology/thread_siblings_list: 1
cpu2/topology/core_id: 2
cpu2/topology/core_siblings: 3f
cpu2/topology/core_siblings_list: 0-5
cpu2/topology/physical_package_id: 0
cpu2/topology/thread_siblings: 04
cpu2/topology/thread_siblings_list: 2
cpu3/topology/core_id: 3
cpu3/topology/core_siblings: 3f
cpu3/topology/core_siblings_list: 0-5
cpu3/topology/physical_package_id: 0
cpu3/topology/thread_siblings: 08
cpu3/topology/thread_siblings_list: 3
cpu4/topology/core_id: 4
cpu4/topology/core_siblings: 3f
cpu4/topology/core_siblings_list: 0-5
cpu4/topology/physical_package_id: 0
cpu4/topology/thread_siblings: 10
cpu4/topology/thread_siblings_list: 4
cpu5/topology/core_id: 5
cpu5/topology/core_siblings: 3f
cpu5/topology/core_siblings_list: 0-5
cpu5/topology/physical_package_id: 0
cpu5/topology/thread_siblings: 20
cpu5/topology/thread_siblings_list: 5

-- 
Regards/Gruss,
    Boris.


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 22:10           ` Peter Zijlstra
@ 2012-09-14 22:20             ` Linus Torvalds
  0 siblings, 0 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-14 22:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, Sep 14, 2012 at 3:10 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> I'm not seeing how it could be. Only ever looking at 1 other cpu
> (regardless which one) cannot be the same as checking 'all' of them.

Oh, you're right, it has that idle_cpu() in the loop too. So yeah, you
can't make it be even remotely equivalent. You can only make it
equivalent for the "all other cpu's are idle" case.

It doesn't even do *that*, though.

In fact, as far as I can tell, it looks like a cpu could be its own
idle_buddy. That whole logic looks very odd.

I vote we just revert it as "insane". The code really doesn't seem to
make any sense.

               Linus


* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:42   ` Linus Torvalds
@ 2012-09-15  3:33     ` Mike Galbraith
  2012-09-15 16:16       ` Andi Kleen
  2012-09-24 15:00     ` Mel Gorman
  1 sibling, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-15  3:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Nikolay Ulyanitsky, linux-kernel,
	Andreas Herrmann, Peter Zijlstra, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, 2012-09-14 at 14:42 -0700, Linus Torvalds wrote: 
> On Fri, Sep 14, 2012 at 2:27 PM, Borislav Petkov <bp@alien8.de> wrote:
> >
> > as Nikolay says below, we have a regression in 3.6 with pgbench's
> > benchmark in postgresql.
> >
> > I was able to reproduce it on another box here and did a bisection run.
> > It pointed to the commit below.
> 
> Ok. I guess we should just revert it. However, before we do that,
> maybe Mike can make it just use the exact old semantics of
> select_idle_sibling() in the update_top_cache_domain() logic.
> 
> Because the patch in question seems to do two things:
>  (a) cache the "idle_buddy" logic, so that we don't have those costly loops
>  (b) change it to do that "left-right" thing.
> 
> and that (b) thing may be what causes a regression for you.
> 
> So my gut feel is that the patch was wrong to begin with, exactly
> because it did two independent changes. It *should* have treated those
> two issues as independent changes and separate commits.
> 
> Maybe I'm mis-reading it. Mike? Peter?

It does two things, but it's one problem.  If you crawl over the whole
package, you constantly pull tasks all over the package, which as you
can see from the numbers hurts quite a lot.

The only reason I can think of why pgbench might suffer is postgres's
userspace spinlocks.  If you always look for an idle core, you improve
the odds that the wakeup won't preempt a lock holder, sending others
into a long spin.

-Mike



* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:59         ` Peter Zijlstra
@ 2012-09-15  3:57           ` Mike Galbraith
  0 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-15  3:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, 2012-09-14 at 23:59 +0200, Peter Zijlstra wrote: 
> On Fri, 2012-09-14 at 23:56 +0200, Peter Zijlstra wrote:
> > On Fri, 2012-09-14 at 14:44 -0700, Linus Torvalds wrote:
> > > On Fri, Sep 14, 2012 at 2:40 PM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > >
> > > > The problem the patch is trying to address is not having to scan an
> > > > entire package for idle cores on every wakeup now that packages are
> > > > getting stupid big.
> > > 
> > > No, it does something *else* too. That whole "left-right" logic to
> > > (according to the commit message) "prevent bouncing" is entirely new,
> > > afaik.
> > > 
> > > So it is *not* just about avoiding to have to scan the whole package.
> > > It changes actual semantics too. No?
> > 
> > Both things change semantics, not looking at the entire package is new
> > too. But yeah I guess you could look at the exact cross-stitching as an
> > enhancement to the 'idle_buddy' thing.
> 
> What I'm saying is that having an idle_buddy means you have to assign
> one in the first place, his left-right stuff might not be the simplest
> means to do that -- in fact I suggested he do a simple shift first time
> I saw that patch.

Shift just means that upon perturbation, tasks shift their way around
the package vs randomly bouncing around; that's why I cross-wired.

> So if not the left-right thing, you still need to do _something_ to make
> the idle_buddy work at all. So it's not entirely separate.

Yeah.

-Mike




* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:27 ` 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected Borislav Petkov
  2012-09-14 21:40   ` Peter Zijlstra
  2012-09-14 21:42   ` Linus Torvalds
@ 2012-09-15  4:11   ` Mike Galbraith
       [not found]     ` <CA+55aFz1A7HbMYS9o-GTS5Zm=Xx8MUD7cR05GMVo--2E34jcgQ@mail.gmail.com>
  2012-09-15 10:44     ` Borislav Petkov
  2 siblings, 2 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-15  4:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Fri, 2012-09-14 at 23:27 +0200, Borislav Petkov wrote: 
> (Adding everybody to CC and leaving the below for reference.)
> 
> Guys,
> 
> as Nikolay says below, we have a regression in 3.6 with pgbench's
> benchmark in postgresql.
> 
> I was able to reproduce it on another box here and did a bisection run.
> It pointed to the commit below.
> 
> And yes, reverting that commit fixes the issue here.

My wild (and only) theory is that this is userspace spinlock related.
If so, starting the server and benchmark SCHED_BATCH should not only
kill the regression, but likely improve throughput as well.

-Mike



* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
       [not found]     ` <CA+55aFz1A7HbMYS9o-GTS5Zm=Xx8MUD7cR05GMVo--2E34jcgQ@mail.gmail.com>
@ 2012-09-15  4:42       ` Mike Galbraith
  0 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-15  4:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Herrmann, Nikolay Ulyanitsky, Borislav Petkov,
	Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Andrew Morton,
	linux-kernel

On Fri, 2012-09-14 at 21:15 -0700, Linus Torvalds wrote:
> We don't do random crazy "use another scheduler" to hide problems with
> the default one.

I was proposing a diagnostic.
> 
> And since you are the author of the regressing patch, and don't seem
> to treat the regression as something serious, I really see no
> alternative to just reverting it.

Wow, I don't know how you obtained that impression, but fine, nuke it.

>       Linus
> 
> On Sep 14, 2012 9:11 PM, "Mike Galbraith" <efault@gmx.de> wrote:
> > On Fri, 2012-09-14 at 23:27 +0200, Borislav Petkov wrote:
> > > (Adding everybody to CC and leaving the below for reference.)
> > >
> > > Guys,
> > >
> > > as Nikolay says below, we have a regression in 3.6 with pgbench's
> > > benchmark in postgresql.
> > >
> > > I was able to reproduce it on another box here and did a bisection run.
> > > It pointed to the commit below.
> > >
> > > And yes, reverting that commit fixes the issue here.
> >
> > My wild (and only) theory is that this is userspace spinlock related.
> > If so, starting the server and benchmark SCHED_BATCH should not only
> > kill the regression, but likely improve throughput as well.
> >
> > -Mike



* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15  4:11   ` Mike Galbraith
       [not found]     ` <CA+55aFz1A7HbMYS9o-GTS5Zm=Xx8MUD7cR05GMVo--2E34jcgQ@mail.gmail.com>
@ 2012-09-15 10:44     ` Borislav Petkov
  2012-09-15 14:47       ` Mike Galbraith
  1 sibling, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-15 10:44 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Sat, Sep 15, 2012 at 06:11:02AM +0200, Mike Galbraith wrote:
> My wild (and only) theory is that this is userspace spinlock related.
> If so, starting the server and benchmark SCHED_BATCH should not only
> kill the regression, but likely improve throughput as well.

FWIW,

I went and tried it. Here are the exact steps:

$ ps ax | grep postgres
 2066 ?        S      0:01 /usr/lib/postgresql/9.1/bin/postgres -D /var/lib/postgresql/9.1/main -c config_file=/etc/postgresql/9.1/main/postgresql.conf
 2070 ?        Ss     0:07 postgres: writer process
 2071 ?        Ss     0:05 postgres: wal writer process
 2072 ?        Ss     0:01 postgres: autovacuum launcher process
 2073 ?        Ss     0:01 postgres: stats collector process
 5788 pts/0    S+     0:00 grep postgres

# set to SCHED_BATCH
$ schedtool -B 2066 2070 2071 2072 2073

# verify:
$ schedtool 2066 2070 2071 2072 2073
PID  2066: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
PID  2070: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
PID  2071: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
PID  2072: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
PID  2073: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f

$ su - postgres
postgres@hhost:~$ export PATH=$PATH:/usr/lib/postgresql/9.1/bin/
postgres@hhost:~$ schedtool -B -e pgbench -i pgbench && pgbench -c 10 -t 10000 pgbench

...

tps = 4388.118940 (including connections establishing)
tps = 4391.771875 (excluding connections establishing)

=> even better than the results with 3.5 (had something around 3900ish
on that particular configuration).

HTH.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 10:44     ` Borislav Petkov
@ 2012-09-15 14:47       ` Mike Galbraith
  2012-09-15 15:18         ` Borislav Petkov
  0 siblings, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-15 14:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Sat, 2012-09-15 at 12:44 +0200, Borislav Petkov wrote: 
> On Sat, Sep 15, 2012 at 06:11:02AM +0200, Mike Galbraith wrote:
> > My wild (and only) theory is that this is userspace spinlock related.
> > If so, starting the server and benchmark SCHED_BATCH should not only
> > kill the regression, but likely improve throughput as well.
> 
> FWIW,
> 
> I went and tried it. Here are the exact steps:
> 
> $ ps ax | grep postgres
>  2066 ?        S      0:01 /usr/lib/postgresql/9.1/bin/postgres -D /var/lib/postgresql/9.1/main -c config_file=/etc/postgresql/9.1/main/postgresql.conf
>  2070 ?        Ss     0:07 postgres: writer process
>  2071 ?        Ss     0:05 postgres: wal writer process
>  2072 ?        Ss     0:01 postgres: autovacuum launcher process
>  2073 ?        Ss     0:01 postgres: stats collector process
>  5788 pts/0    S+     0:00 grep postgres
> 
> # set to SCHED_BATCH
> $ schedtool -B 2066 2070 2071 2072 2073
> 
> # verify:
> $ schedtool 2066 2070 2071 2072 2073
> PID  2066: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
> PID  2070: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
> PID  2071: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
> PID  2072: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
> PID  2073: PRIO   0, POLICY B: SCHED_BATCH   , NICE   0, AFFINITY 0x3f
> 
> $ su - postgres
> postgres@hhost:~$ export PATH=$PATH:/usr/lib/postgresql/9.1/bin/
> postgres@hhost:~$ schedtool -B -e pgbench -i pgbench && pgbench -c 10 -t 10000 pgbench
> 
> ...
> 
> tps = 4388.118940 (including connections establishing)
> tps = 4391.771875 (excluding connections establishing)
> 
> => even better than the results with 3.5 (had something around 3900ish
> on that particular configuration).

Increasing /proc/sys/kernel/sched_min_granularity_ns to roughly half of
sched_latency_ns should also help.  That will allow LAST_BUDDY to do
its job: try to hand the CPU back to a preempted task if possible.  The
change that increased sched_nr_latency to 8 should have injured
postgres as well.  ATM, it's disabled unless you're massively loaded.

I _think_ it's about preemption, but it doesn't matter, patch is toast.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 14:47       ` Mike Galbraith
@ 2012-09-15 15:18         ` Borislav Petkov
  2012-09-15 16:13           ` Mike Galbraith
  0 siblings, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-15 15:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Sat, Sep 15, 2012 at 04:47:39PM +0200, Mike Galbraith wrote:
> Increasing /proc/sys/kernel/sched_min_granularity_ns to roughly half
> of sched_latency_ns should also help. That will allow LAST_BUDDY to do
> its job: try to hand the CPU back to a preempted task if possible.

Just for my n00b scheduler understanding: this way you're practically
extending the timeslice of the task so that it gets done without being
preempted and the lock-holding period of the preempted task gets smaller
and thus you get more completed transactions in postgres during the
benchmark run?

> The change that increased sched_nr_latency to 8 should have injured
> > postgres as well. ATM, it's disabled unless you're massively loaded.
>
> I _think_ it's about preemption, but it doesn't matter, patch is
> toast.

In any case, if you wanna retry the buddy thing for 3.7, ping me and I
can run it on the assortment of machines I have here.

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 15:18         ` Borislav Petkov
@ 2012-09-15 16:13           ` Mike Galbraith
  2012-09-15 19:44             ` Borislav Petkov
  0 siblings, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-15 16:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Sat, 2012-09-15 at 17:18 +0200, Borislav Petkov wrote: 
> On Sat, Sep 15, 2012 at 04:47:39PM +0200, Mike Galbraith wrote:
> > Increasing /proc/sys/kernel/sched_min_granularity_ns to roughly half
> > of sched_latency_ns should also help. That will allow LAST_BUDDY to do
> > its job: try to hand the CPU back to a preempted task if possible.
> 
> Just for my n00b scheduler understanding: this way you're practically
> extending the timeslice of the task so that it gets done without being
> preempted and the lock-holding period of the preempted task gets smaller
> and thus you get more completed transactions in postgres during the
> benchmark run?

Not really, preemption will happen, but when the preempting task goes to
sleep (or uses its fair share), instead of selecting the leftmost task
(lowest vruntime), the preempted task gets the CPU back if we can do
that without violating fairness.  If the preempted task happens to be a
userland spinlock holder, it then releases the lock sooner, others don't
spin as long, they do more work, and there's less playing space heater
while the lock holder waits for spinners to eat enough CPU to become
less deserving than it.

> > The change that increased sched_nr_latency to 8 should have injured
> > postgres as well. ATM, it's disabled unless you're massively loaded.
> >
> > I _think_ it's about preemption, but it doesn't matter, patch is
> > toast.
> 
> In any case, if you wanna retry the buddy thing for 3.7, ping me and I
> can run it on the assortment of machines I have here.

Thanks, but off the top of my head I see no way to fix it up without
there being side effects somewhere, there always are.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15  3:33     ` Mike Galbraith
@ 2012-09-15 16:16       ` Andi Kleen
  2012-09-15 16:36         ` Mike Galbraith
  0 siblings, 1 reply; 115+ messages in thread
From: Andi Kleen @ 2012-09-15 16:16 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

Mike Galbraith <efault@gmx.de> writes:
>
> The only reason I can think of why pgbench might suffer is postgres's
> userspace spinlocks.  If you always look for an idle core, you improve
> the odds that the wakeup won't preempt a lock holder, sending others
> into a long spin.

User space spinlocks like this unfortunately have a tendency to break
with all kinds of scheduler changes. We've seen this frequently too
with other users. The best bet currently is to use the real time
scheduler, but with various tweaks to get its overhead down.

Ultimately the problem is that user space spinlocks with CPU
oversubscription make for a very unstable setup, and small changes can
easily disturb it.

Just using futex is unfortunately not the answer either.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 16:16       ` Andi Kleen
@ 2012-09-15 16:36         ` Mike Galbraith
  2012-09-15 17:08           ` richard -rw- weinberger
  2012-09-15 21:32           ` Alan Cox
  0 siblings, 2 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-15 16:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Sat, 2012-09-15 at 09:16 -0700, Andi Kleen wrote: 
> Mike Galbraith <efault@gmx.de> writes:
> >
> > The only reason I can think of why pgbench might suffer is postgres's
> > userspace spinlocks.  If you always look for an idle core, you improve
> > the odds that the wakeup won't preempt a lock holder, sending others
> > into a long spin.
> 
> User space spinlocks like this unfortunately have a tendency to break
> with all kinds of scheduler changes. We've seen this frequently too
> with other users. The best bet currently is to use the real time
> scheduler, but with various tweaks to get its overhead down.

Yeah, that's one way, but decidedly sub-optimal.

> Ultimately the problem is that user space spinlocks with CPU
> oversubscription make for a very unstable setup, and small changes can
> easily disturb it.
> 
> Just using futex is unfortunately not the answer either.

Yes, postgres performs loads better with its spinlocks, but due to
that, it necessarily _hates_ preemption.  How is the scheduler
supposed to know that any specific userland task _really_ shouldn't be
preempted at any specific time, lest bad things follow?

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 16:36         ` Mike Galbraith
@ 2012-09-15 17:08           ` richard -rw- weinberger
  2012-09-16  4:48             ` Mike Galbraith
  2012-09-15 21:32           ` Alan Cox
  1 sibling, 1 reply; 115+ messages in thread
From: richard -rw- weinberger @ 2012-09-15 17:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andi Kleen, Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Sat, Sep 15, 2012 at 6:36 PM, Mike Galbraith <efault@gmx.de> wrote:
>> Just using futex is unfortunately not the answer either.
>
> Yes, postgres performs loads better with its spinlocks, but due to
> that, it necessarily _hates_ preemption.  How is the scheduler
> supposed to know that any specific userland task _really_ shouldn't be
> preempted at any specific time, lest bad things follow?

Why do custom userspace spinlocks perform better than futex()-based ones?
I thought we had futex() to get rid of the custom ones...
Does futex() only make sense when things like priority inheritance are needed?

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 16:13           ` Mike Galbraith
@ 2012-09-15 19:44             ` Borislav Petkov
  0 siblings, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-15 19:44 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Linus Torvalds, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Sat, Sep 15, 2012 at 06:13:45PM +0200, Mike Galbraith wrote:
> > Just for my n00b scheduler understanding: this way you're practically
> > extending the timeslice of the task so that it gets done without being
> > preempted and the lock-holding period of the preempted task gets smaller
> > and thus you get more completed transactions in postgres during the
> > benchmark run?
> 
> Not really, preemption will happen, but when the preempting task goes to
> sleep (or uses its fair share), instead of selecting the leftmost task
> (lowest vruntime), the preempted task gets the CPU back if we can do
> that without violating fairness.  If the preempted task happens to be a
> userland spinlock holder, it then releases the lock sooner, others don't
> spin as long, they do more work, and there's less playing space heater
> while the lock holder waits for spinners to eat enough CPU to become
> less deserving than it.

Ok, I definitely grok this. Thanks for explaining.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 16:36         ` Mike Galbraith
  2012-09-15 17:08           ` richard -rw- weinberger
@ 2012-09-15 21:32           ` Alan Cox
  2012-09-16  4:35             ` Mike Galbraith
  1 sibling, 1 reply; 115+ messages in thread
From: Alan Cox @ 2012-09-15 21:32 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Andi Kleen, Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

> Yes, postgres performs loads better with its spinlocks, but due to
> that, it necessarily _hates_ preemption.  How is the scheduler
> supposed to know that any specific userland task _really_ shouldn't be
> preempted at any specific time, lest bad things follow?

You provide a shared page for a process group so it can write hints to
which is kernel mapped so the scheduler can peek..

Alan

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 21:32           ` Alan Cox
@ 2012-09-16  4:35             ` Mike Galbraith
  2012-09-16 19:57               ` Linus Torvalds
  2012-09-19 12:35               ` Mike Galbraith
  0 siblings, 2 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-16  4:35 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Sat, 2012-09-15 at 22:32 +0100, Alan Cox wrote: 
> > Yes, postgres performs loads better with its spinlocks, but due to
> > that, it necessarily _hates_ preemption.  How is the scheduler
> > supposed to know that any specific userland task _really_ shouldn't be
> > preempted at any specific time, lest bad things follow?
> 
> You provide a shared page for a process group so it can write hints to
> which is kernel mapped so the scheduler can peek..

Or perhaps a flag ala SCHED_RESET_ON_FORK to provide a not necessarily
followed hint.  That hint could be to simply always try the LAST_BUDDY
thing with flagged tasks, since we know that works (postgres inspired
LAST_BUDDY).  Even with postgres-like things, fast mover kthreads etc
punching through isn't necessarily a bad thing, you just need to avoid
the punch leaving a gigantic hole.

Oh, while I'm thinking about it, there's another scenario that could
cause the select_idle_sibling() change to affect pgbench on largeish
packages, but it boils down to preemption odds as well.  IIRC pgbench
_was_ at least 1:N, ie one process driving the whole load.  The waker of
many (a singularly bad idea as a way to generate load) being preempted by
its wakees stalls the whole load, so expensive spreading of wakees to
the four winds ala WAKE_BALANCE becomes attractive, that pain being
markedly less intense than having multiple cores go idle while the
creator of work waits for one.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-15 17:08           ` richard -rw- weinberger
@ 2012-09-16  4:48             ` Mike Galbraith
  0 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-16  4:48 UTC (permalink / raw)
  To: richard -rw- weinberger
  Cc: Andi Kleen, Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Sat, 2012-09-15 at 19:08 +0200, richard -rw- weinberger wrote: 
> On Sat, Sep 15, 2012 at 6:36 PM, Mike Galbraith <efault@gmx.de> wrote:
> >> Just using futex is unfortunately not the answer either.
> >
> > Yes, postgres performs loads better with its spinlocks, but due to
> > that, it necessarily _hates_ preemption.  How is the scheduler
> > supposed to know that any specific userland task _really_ shouldn't be
> > preempted at any specific time, lest bad things follow?
> 
> Why do custom userspace spinlocks perform better than futex()-based ones?
> I thought we had futex() to get rid of the custom ones...
> Does futex() only make sense when things like priority inheritance are needed?

Dunno.  Likely because data doesn't go cold when you spin a bit; go to
sleep instead, and the next guy may stomp your cache flat with size XXL
boots.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-16  4:35             ` Mike Galbraith
@ 2012-09-16 19:57               ` Linus Torvalds
  2012-09-17  8:08                 ` Mike Galbraith
  2012-09-19 12:35               ` Mike Galbraith
  1 sibling, 1 reply; 115+ messages in thread
From: Linus Torvalds @ 2012-09-16 19:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Alan Cox, Andi Kleen, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Sat, Sep 15, 2012 at 9:35 PM, Mike Galbraith <efault@gmx.de> wrote:
>
> Oh, while I'm thinking about it, there's another scenario that could
> cause the select_idle_sibling() change to affect pgbench on largeish
> packages, but it boils down to preemption odds as well.

So here's a possible suggestion..

Let's assume that the scheduler code to find the next idle CPU on the
package is actually a good idea, and we shouldn't mess with the idea.

But at the same time, it's clearly an *expensive* idea, which is why
you introduced the "only test a single CPU buddy" approach instead.
But that didn't work, and you can come up with multiple reasons why it
wouldn't work. Plus, quite fundamentally, it's rather understandable
that "try to find an idle CPU on the same package" really would be a
good idea, right?

So instead of limiting the "try to find an idle CPU on the same
package" to "pick *one* idle CPU on the same package to try", how
about just trying to make the whole "find another idle CPU" much
cheaper, and much more scalable that way?

Quite frankly, the loop in select_idle_sibling() is insanely expensive
for what it really wants to do. All it really wants to do is:
 - iterate over idle processors on this package
 - make sure that idle processor is in the cpu's allowed.

Right?

But the whole use of "cpumask_intersects()" etc is quite expensive,
and there's that crazy double loop to do the above. So no wonder that
it's expensive and causes scalability problems. That would be
*especially* true if nr_cpumask_bits is big, and we have
CONFIG_CPUMASK_OFFSTACK defined.

So I would suggest:
 (a) revert the original commit (already done in my tree)
 (b) look at just making the loop *much* cheaper.

For example, all those "cpumask_intersects()" and "cpumask_first()"
things are *really* expensive, and expand to tons of code especially
for the OFFSTACK case (and aren't exactly free even for the smaller
case). And it really is all stupidly and badly done. I bet we can make
that code faster without really changing the end result at all, just
changing the algorithm.

For example, what if we got rid of all the crazy "sd groups" crap at
run-time, and just built a single linked circular list of CPU's on the
same package?

Then we'd replace that crazy-expensive double loop over sd->groups and
for_each_cpu() crap (not to mention cpumask_first_and() etc) with just
a simple loop over that (and pick the *next* idle cpu, instead of
doing that crazy "pick first one in a bitmask after and'ing").

In fact, looking at select_idle_sibling(), I just want to puke. The
thing is crazy.

Why the hell isn't the *very* first thing that function does just a simple

    if (idle_cpu(target))
        return target;

instead it does totally f*cking insane things, and checks whether
"target == cpu && idle_cpu(cpu)".

The code is shit. Just fix the shit, instead of trying to come up with
some totally different model. Ok? I bet just fixing it to not have
insane double loops would already get 90% of the speedup that Mike's
original patch did, but without the downsides of having to pick just a
single idle-buddy.

We might also possibly add a "look at SMT buddy first" case, because
Mike is probably right that bouncing all over the package isn't
necessarily a good idea unless we really need to. But that would be a
different thing.

            Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-16 19:57               ` Linus Torvalds
@ 2012-09-17  8:08                 ` Mike Galbraith
  2012-09-17 10:07                   ` Ingo Molnar
  0 siblings, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-17  8:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alan Cox, Andi Kleen, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Sun, 2012-09-16 at 12:57 -0700, Linus Torvalds wrote: 
> On Sat, Sep 15, 2012 at 9:35 PM, Mike Galbraith <efault@gmx.de> wrote:
> >
> > Oh, while I'm thinking about it, there's another scenario that could
> > cause the select_idle_sibling() change to affect pgbench on largeish
> > packages, but it boils down to preemption odds as well.
> 
> So here's a possible suggestion..
> 
> Let's assume that the scheduler code to find the next idle CPU on the
> package is actually a good idea, and we shouldn't mess with the idea.

We should definitely mess with the idea, as it causes some problems.

> But at the same time, it's clearly an *expensive* idea, which is why
> you introduced the "only test a single CPU buddy" approach instead.
> But that didn't work, and you can come up with multiple reasons why it
> wouldn't work. Plus, quite fundamentally, it's rather understandable
> that "try to find an idle CPU on the same package" really would be a
> good idea, right?

I would argue that it did work, it shut down the primary source of pain,
which I do not believe to be the traversal cost, rather the bouncing.

    4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:
    
     clients     1       2       4        8       16       32       64      128
     ..........................................................................
     pre        30      41     118      645     3769     6214    12233    14312
     post      299     603    1211     2418     4697     6847    11606    14557

10x at 1 pair shouldn't be traversal, the whole box is otherwise idle.
We'll do a lot more (ever more futile) traversal as load increases, but
at the same time, our futile attempts fail more frequently, so we shoot
ourselves in the foot less frequently.

The down side is (appears to be) that I also shut down some ~odd case
preemption salvation, salvation that only large packages will receive.

The problem as I see it is that we're making light tasks _too_ mobile,
turning an optimization into a pessimization for light tasks.  For
longer running tasks this mobility within a large package isn't such a
big deal, but for fast movers, it hurts a lot.

> So instead of limiting the "try to find an idle CPU on the same
> package" to "pick *one* idle CPU on the same package to try", how
> about just trying to make the whole "find another idle CPU" much
> cheaper, and much more scalable that way?

I'm all for anything that makes the fastpath lighter.  We're too fat.

> Quite frankly, the loop in select_idle_sibling() is insanely expensive
> for what it really wants to do. All it really wants to do is:
>  - iterate over idle processors on this package
>  - make sure that idle processor is in the cpu's allowed.
> 
> Right?

For longish running tasks, yeah.  For the problematic loads mentioned,
you'd be better off killing select_idle_sibling() entirely, and turning
WAKE_BALANCE on methinks, that would buy them more preemption salvation.

> But the whole use of "cpumask_intersects()" etc is quite expensive,
> and there's that crazy double loop to do the above. So no wonder that
> it's expensive and causes scalability problems. That would be
> *especially* true if nr_cpumask_bits is big, and we have
> CONFIG_CPUMASK_OFFSTACK defined.
> 
> So I would suggest:
>  (a) revert the original commit (already done in my tree)
>  (b) look at just making the loop *much* cheaper.
> 
> For example, all those "cpumask_intersects()" and "cpumask_first()"
> things are *really* expensive, and expand to tons of code especially
> for the OFFSTACK case (and aren't exactly free even for the smaller
> case). And it really is all stupidly and badly done. I bet we can make
> that code faster without really changing the  end result at all, just
> changing the algorithm.
> 
> For example, what if we got rid of all the crazy "sd groups" crap at
> run-time, and just built a single linked circular list of CPU's on the
> same package?
> 
> Then we'd replace that crazy-expensive double loop over sd->groups and
> for_each_cpu() crap (not to mention cpumask_first_and() etc) with just
> a simple loop over that (and pick the *next* idle cpu, instead of
> doing that crazy "pick first one in a bitmask after and'ing").

Agreed, cheaper traversal should be doable, and would be nice.

> In fact, looking at select_idle_sibling(), I just want to puke. The
> thing is crazy.
> 
> Why the hell isn't the *very* first thing that function does just a simple
> 
>     if (idle_cpu(target))
>         return target;
> 
> instead it does totally f*cking insane things, and checks whether
> "target == cpu && idle_cpu(cpu)".

Hm, in the current incarnation, target is either this_cpu or task_cpu()
if wake_affine() said "no", so yeah, seems you're right.

> The code is shit. Just fix the shit, instead of trying to come up with
> some totally different model. Ok? I bet just fixing it to not have
> insane double loops would already get 90% of the speedup that Mike's
> original patch did, but without the downsides of having to pick just a
> single idle-buddy.

I'm skeptical, but would love to be wrong about that.

> We might also possibly add a "look at SMT buddy first" case, because
> Mike is probably right that bouncing all over the package isn't
> necessarily a good idea unless we really need to. But that would be a
> different thing.

The code used to do that; it was recently modified to look for an idle
core first.  Testing that in enterprise, because I desperately needed to
combat the 2.6.32->3.0 throughput loss, is how I landed here.  What I
found while integrating was Dr. Jekyll and Mr. Hyde.  Less balancing is
more.. until it's not enough.  Damn.  I beat up select_idle_sibling()
to turn regression into the needed progression.

Box _regressed_ only in that finding an idle SMT sibling could save us
from ourselves before if box _had_ SMT, but the bounce problem was lying
there from day one.  I didn't see it before, because I didn't go looking
for evilness on enterprise hardware, used to have no access to sexy
hardware, and didn't know things like "opterons suck" and "L3 is not
nearly as wonderful as shared L2".

On an opteron, best thing you can do for fast mover loads is to turn
select_idle_sibling() off, you need a lot of overlap to win.  On Intel
hardware otoh, 3.0 kernel now kicks the snot out of 2.6.32 at fast mover
localhost _and_ aim7 compute.  Opterons don't respond to much of
anything other than "if load is fast mover, turn it the hell off", but
large Intel packages much appreciated Mr. Hyde having his fangs pulled.

You're welcome to keep the fully fanged version of Mr. Hyde ;-) but I
now know beyond doubt that he's one evil SOB, so I will keep this patch
until better psychopath therapy kit comes along, and already have knobs
set such that LAST_BUDDY provides preemption salvation.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-17  8:08                 ` Mike Galbraith
@ 2012-09-17 10:07                   ` Ingo Molnar
  2012-09-17 10:47                     ` Mike Galbraith
  2012-09-17 14:39                     ` Andi Kleen
  0 siblings, 2 replies; 115+ messages in thread
From: Ingo Molnar @ 2012-09-17 10:07 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Borislav Petkov,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Andrew Morton, Thomas Gleixner


* Mike Galbraith <efault@gmx.de> wrote:

> On Sun, 2012-09-16 at 12:57 -0700, Linus Torvalds wrote: 
> > On Sat, Sep 15, 2012 at 9:35 PM, Mike Galbraith <efault@gmx.de> wrote:
> > >
> > > Oh, while I'm thinking about it, there's another scenario that could
> > > cause the select_idle_sibling() change to affect pgbench on largeish
> > > packages, but it boils down to preemption odds as well.
> > 
> > So here's a possible suggestion..
> > 
> > Let's assume that the scheduler code to find the next idle CPU on the
> > package is actually a good idea, and we shouldn't mess with the idea.
> 
> We should definitely mess with the idea, as it causes some problems.
> 
> > But at the same time, it's clearly an *expensive* idea, 
> > which is why you introduced the "only test a single CPU 
> > buddy" approach instead. But that didn't work, and you can 
> > come up with multiple reasons why it wouldn't work. Plus, 
> > quite fundamentally, it's rather understandable that "try to 
> > find an idle CPU on the same package" really would be a good 
> > idea, right?
> 
> I would argue that it did work: it shut down the primary 
> source of pain, which I believe is not the traversal 
> cost but rather the bouncing.
> 
>     4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:
>     
>      clients     1       2       4        8       16       32       64      128
>      ..........................................................................
>      pre        30      41     118      645     3769     6214    12233    14312
>      post      299     603    1211     2418     4697     6847    11606    14557

That's a very tempting speedup for a simpler and more 
fundamental workload than postgresql's somewhat weird
user-space spinlocks that burn CPU time in user-space
instead of blocking/waiting on a futex.
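
As an aside, the futex blocking being contrasted with user-space spinning
can be sketched in a few lines. This is a hedged illustration only: glibc
provides no futex() wrapper, so the raw syscall is used, and the helper
names (futex_wait/futex_wake) are made up here, not anything PostgreSQL
or the kernel defines:

```c
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdint.h>

/* Raw futex syscall; glibc exposes no wrapper for it. */
static long futex(uint32_t *uaddr, int op, uint32_t val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/*
 * Sleep in the kernel until *addr is changed and we are woken, instead
 * of burning CPU in a user-space spin loop.  Returns -1/EAGAIN at once
 * if *addr != val (the lock changed under us before we could sleep).
 */
static long futex_wait(uint32_t *addr, uint32_t val)
{
	return futex(addr, FUTEX_WAIT_PRIVATE, val);
}

/* Wake up to n waiters blocked on addr; returns how many were woken. */
static long futex_wake(uint32_t *addr, uint32_t n)
{
	return futex(addr, FUTEX_WAKE_PRIVATE, n);
}
```

A contended futex-based lock waits in futex_wait() and lets the waker
call futex_wake(), which is the blocking behaviour the spinlock-style
code lacks.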

IIRC mysql does this properly and outperforms postgresql
on this benchmark, in an apples-to-apples configuration?

> 10x at 1 pair shouldn't be traversal, the whole box is 
> otherwise idle. We'll do a lot more (ever more futile) 
> traversal as load increases, but at the same time, our futile 
> attempts fail more frequently, so we shoot ourselves in the 
> foot less frequently.
> 
> The down side is (appears to be) that I also shut down some 
> ~odd case preemption salvation, salvation that only large 
> packages will receive.
> 
> The problem as I see it is that we're making light tasks _too_ 
> mobile, turning an optimization into a pessimization for light 
> tasks.  For longer running tasks this mobility within a large 
> package isn't such a big deal, but for fast movers, it hurts a 
> lot.

There's not enough time to resolve this for v3.6, so I agree 
with the revert - would you be willing to post a v2 of your 
original patch? I really think we want your tbench speedups, 
quite a few real-world messaging applications use the tbench 
patterns of scheduling.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-17 10:07                   ` Ingo Molnar
@ 2012-09-17 10:47                     ` Mike Galbraith
  2012-09-17 14:39                     ` Andi Kleen
  1 sibling, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-17 10:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Alan Cox, Andi Kleen, Borislav Petkov,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Andrew Morton, Thomas Gleixner

On Mon, 2012-09-17 at 12:07 +0200, Ingo Molnar wrote: 
> * Mike Galbraith <efault@gmx.de> wrote:
> 

> >     4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:
> >     
> >      clients     1       2       4        8       16       32       64      128
> >      ..........................................................................
> >      pre        30      41     118      645     3769     6214    12233    14312
> >      post      299     603    1211     2418     4697     6847    11606    14557
> 
> That's a very tempting speedup for a simpler and more 
> fundamental workload than postgresql's somewhat weird
> user-space spinlocks that burn CPU time in user-space
> instead of blocking/waiting on a futex.
> 
> IIRC mysql does this properly and outperforms postgresql
> on this benchmark, in an apples-to-apples configuration?

It's been a while since I fiddled with oltp (lost my fast mysql db,
every attempt to re-create it produced a complete slug), but postgres
was always the throughput winner here.

> > 10x at 1 pair shouldn't be traversal, the whole box is 
> > otherwise idle. We'll do a lot more (ever more futile) 
> > traversal as load increases, but at the same time, our futile 
> > attempts fail more frequently, so we shoot ourselves in the 
> > foot less frequently.
> > 
> > The down side is (appears to be) that I also shut down some 
> > ~odd case preemption salvation, salvation that only large 
> > packages will receive.
> > 
> > The problem as I see it is that we're making light tasks _too_ 
> > mobile, turning an optimization into a pessimization for light 
> > tasks.  For longer running tasks this mobility within a large 
> > package isn't such a big deal, but for fast movers, it hurts a 
> > lot.
> 
> There's not enough time to resolve this for v3.6, so I agree 
> with the revert - would you be willing to post a v2 of your 
> original patch? I really think we want your tbench speedups, 
> quite a few real-world messaging applications use the tbench 
> patterns of scheduling.

I don't know what a v2 would look like, but I can keep thinking about
this irritating little <naughty words elided>.  Peter's a lot hairier
chested, not to mention having a sense of _taste_ :) so it might be
better to just consider my patch a diagnostic, and let him fix it up in
a (likely lots) less tummy distressing manner.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-17 10:07                   ` Ingo Molnar
  2012-09-17 10:47                     ` Mike Galbraith
@ 2012-09-17 14:39                     ` Andi Kleen
  1 sibling, 0 replies; 115+ messages in thread
From: Andi Kleen @ 2012-09-17 14:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Linus Torvalds, Alan Cox, Andi Kleen,
	Borislav Petkov, Nikolay Ulyanitsky, linux-kernel,
	Andreas Herrmann, Peter Zijlstra, Andrew Morton, Thomas Gleixner

> IIRC mysql does this properly and outperforms postgresql

Properly = futexes?

It depends on the MySQL engine. Last time I looked InnoDB 
did use custom spinlocks too.  Some of the other MySQL
engines seem to use adaptive pthread mutexes in glibc, but those
have their problems too.

In general, unfortunately, MySQL is not a single load; it's more 
like a wide range of loads depending on the underlying storage engine.
PostgreSQL at least is more consistent in its problems.
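
For reference, the adaptive pthread mutexes mentioned above can be
requested explicitly. A minimal sketch, assuming glibc (the adaptive
type is a GNU extension, and the helper name below is invented for
illustration): an adaptive mutex spins briefly in user space before
falling back to a futex wait, unlike a custom spinlock that can burn
CPU indefinitely.

```c
#define _GNU_SOURCE
#include <pthread.h>

/*
 * Initialize m as a glibc "adaptive" mutex: on contention it spins a
 * bounded number of times in user space, then blocks in the kernel.
 * Returns 0 on success, an errno value otherwise.
 */
int init_adaptive_mutex(pthread_mutex_t *m)
{
	pthread_mutexattr_t attr;
	int ret;

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;

	ret = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
	if (!ret)
		ret = pthread_mutex_init(m, &attr);

	pthread_mutexattr_destroy(&attr);
	return ret;
}
```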

-Andi


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-16  4:35             ` Mike Galbraith
  2012-09-16 19:57               ` Linus Torvalds
@ 2012-09-19 12:35               ` Mike Galbraith
  2012-09-19 14:54                 ` Ingo Molnar
  1 sibling, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-19 12:35 UTC (permalink / raw)
  To: Alan Cox
  Cc: Andi Kleen, Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Sun, 2012-09-16 at 06:35 +0200, Mike Galbraith wrote:

> Oh, while I'm thinking about it, there's another scenario that could
> cause the select_idle_sibling() change to affect pgbench on largeish
> packages, but it boils down to preemption odds as well.  IIRC pgbench
> _was_ at least 1:N, ie one process driving the whole load.  Waker of
> many (singularly bad idea as a way to generate load) being preempted by
> its wakees stalls the whole load, so expensive spreading of wakees to
> the four winds ala WAKE_BALANCE becomes attractive, that pain being
> markedly less intense than having multiple cores go idle while creator
> of work waits for one.

Enabling SMT on little E5620 box says that's the deal.  pgbench as run
is 1:N, and all you have to do is disable select_idle_sibling() entirely
to see that for _this_ (~odd) load, max spread and lower wakeup latency
for the mother of all work itself is a good thing.

pgbench -i pgbench && pgbench -c $N -T 10 pgbench

N=   1     2     4     8    16    32    64
  1336  2482  3752  3485  3327  2928  2290  virgin 3.6.0-rc6
  1408  2457  3363  3070  2938  2368  1757  +revert reverted 
  1310  2492  2487  2729  2186   975   874  +revert + select_idle_sibling() disabled
  1407  2505  3422  3137  3093  2828  2250  +revert + schedctl -B /etc/init.d/postgresql restart
  1321  2403  2515  2759  2420  2301  1894  +revert + schedctl -B /etc/init.d/postgresql restart + select_idle_sibling() disabled

Hohum, damned if ya do, damned if ya don't.  Damn.

-Mike 



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-19 12:35               ` Mike Galbraith
@ 2012-09-19 14:54                 ` Ingo Molnar
  2012-09-19 15:23                   ` Mike Galbraith
  0 siblings, 1 reply; 115+ messages in thread
From: Ingo Molnar @ 2012-09-19 14:54 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Alan Cox, Andi Kleen, Linus Torvalds, Borislav Petkov,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Andrew Morton, Thomas Gleixner


* Mike Galbraith <efault@gmx.de> wrote:

> On Sun, 2012-09-16 at 06:35 +0200, Mike Galbraith wrote:
> 
> > Oh, while I'm thinking about it, there's another scenario 
> > that could cause the select_idle_sibling() change to affect 
> > pgbench on largeish packages, but it boils down to 
> > preemption odds as well.  IIRC pgbench _was_ at least 1:N, 
> > ie one process driving the whole load.  Waker of many 
> > (singularly bad idea as a way to generate load) being 
> > preempted by its wakees stalls the whole load, so expensive 
> > spreading of wakees to the four winds ala WAKE_BALANCE 
> > becomes attractive, that pain being markedly less intense 
> > than having multiple cores go idle while creator of work 
> > waits for one.
> 
> Enabling SMT on little E5620 box says that's the deal.  
> pgbench as run is 1:N, and all you have to do is disable 
> select_idle_sibling() entirely to see that for _this_ (~odd) 
> load, max spread and lower wakeup latency for the mother of 
> all work itself is a good thing.
> 
> pgbench -i pgbench && pgbench -c $N -T 10 pgbench
> 
> N=   1     2     4     8    16    32    64
>   1336  2482  3752  3485  3327  2928  2290  virgin 3.6.0-rc6
>   1408  2457  3363  3070  2938  2368  1757  +revert reverted 
>   1310  2492  2487  2729  2186   975   874  +revert + select_idle_sibling() disabled
>   1407  2505  3422  3137  3093  2828  2250  +revert + schedctl -B /etc/init.d/postgresql restart
>   1321  2403  2515  2759  2420  2301  1894  +revert + schedctl -B /etc/init.d/postgresql restart + select_idle_sibling() disabled
> 
> Hohum, damned if ya do, damned if ya don't.  Damn.

As a test, could you mark that 'big PostgreSQL central work 
queue process' with some high priority (renice -20?), to make 
sure it's never preempted by wakees? Does that recover 
performance as well?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-19 14:54                 ` Ingo Molnar
@ 2012-09-19 15:23                   ` Mike Galbraith
  0 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-19 15:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Alan Cox, Andi Kleen, Linus Torvalds, Borislav Petkov,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Peter Zijlstra, Andrew Morton, Thomas Gleixner

On Wed, 2012-09-19 at 16:54 +0200, Ingo Molnar wrote: 
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > On Sun, 2012-09-16 at 06:35 +0200, Mike Galbraith wrote:
> > 
> > > Oh, while I'm thinking about it, there's another scenario 
> > > that could cause the select_idle_sibling() change to affect 
> > > pgbench on largeish packages, but it boils down to 
> > > preemption odds as well.  IIRC pgbench _was_ at least 1:N, 
> > > ie one process driving the whole load.  Waker of many 
> > > (singularly bad idea as a way to generate load) being 
> > > preempted by its wakees stalls the whole load, so expensive 
> > > spreading of wakees to the four winds ala WAKE_BALANCE 
> > > becomes attractive, that pain being markedly less intense 
> > > than having multiple cores go idle while creator of work 
> > > waits for one.
> > 
> > Enabling SMT on little E5620 box says that's the deal.  
> > pgbench as run is 1:N, and all you have to do is disable 
> > select_idle_sibling() entirely to see that for _this_ (~odd) 
> > load, max spread and lower wakeup latency for the mother of 
> > all work itself is a good thing.
> > 
> > pgbench -i pgbench && pgbench -c $N -T 10 pgbench
> > 
> > N=   1     2     4     8    16    32    64
> >   1336  2482  3752  3485  3327  2928  2290  virgin 3.6.0-rc6
> >   1408  2457  3363  3070  2938  2368  1757  +revert reverted 
> >   1310  2492  2487  2729  2186   975   874  +revert + select_idle_sibling() disabled
> >   1407  2505  3422  3137  3093  2828  2250  +revert + schedctl -B /etc/init.d/postgresql restart
> >   1321  2403  2515  2759  2420  2301  1894  +revert + schedctl -B /etc/init.d/postgresql restart + select_idle_sibling() disabled
> > 
> > Hohum, damned if ya do, damned if ya don't.  Damn.
> 
> As a test, could you mark that 'big PostgreSQL central work 
> queue process' with some high priority (renice -20?), to make 
> sure it's never preempted by wakees? Does that recover 
> performance as well?

schedctl -B started postgres SCHED_BATCH, so pgbench won't be preempted
since it's the only SCHED_NORMAL task left in the lot.  All the others
are postmaster processes, and SCHED_BATCH.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-14 21:42   ` Linus Torvalds
  2012-09-15  3:33     ` Mike Galbraith
@ 2012-09-24 15:00     ` Mel Gorman
  2012-09-24 15:23       ` Nikolay Ulyanitsky
                         ` (2 more replies)
  1 sibling, 3 replies; 115+ messages in thread
From: Mel Gorman @ 2012-09-24 15:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Fri, Sep 14, 2012 at 02:42:44PM -0700, Linus Torvalds wrote:
> On Fri, Sep 14, 2012 at 2:27 PM, Borislav Petkov <bp@alien8.de> wrote:
> >
> > as Nikolay says below, we have a regression in 3.6 with pgbench's
> > benchmark in postgresql.
> >
> > I was able to reproduce it on another box here and did a bisection run.
> > It pointed to the commit below.
> 
> Ok. I guess we should just revert it. However, before we do that,
> maybe Mike can make it just use the exact old semantics of
> select_idle_sibling() in the update_top_cache_domain() logic.
> 

The patch that is being reverted was meant to fix problems with
commit 4dcfe102 (sched: Avoid SMT siblings in select_idle_sibling() if
possible). That patch made select_idle_sibling() quite fat and I know it
is responsible for a 2% regression in a kernel compile benchmark between
kernel 3.1 and 3.2 on an old AMD Phenom II X4 940. Reverting Mike's patch
might fix this Postgres regression but it reintroduces the overhead caused
by commit 4dcfe102 for other cases.  I do not have a suggestion on how to
make this better, I'm just pointing out that the revert has some downsides.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 15:00     ` Mel Gorman
@ 2012-09-24 15:23       ` Nikolay Ulyanitsky
  2012-09-24 15:53         ` Borislav Petkov
  2012-09-24 15:30       ` Peter Zijlstra
  2012-09-25  4:16       ` Mike Galbraith
  2 siblings, 1 reply; 115+ messages in thread
From: Nikolay Ulyanitsky @ 2012-09-24 15:23 UTC (permalink / raw)
  To: linux-kernel

On 24 September 2012 18:00, Mel Gorman <mgorman@suse.de> wrote:
> Reverting Mike's patch might fix this Postgres regression but it reintroduces the overhead caused
> by commit 4dcfe102 for other cases.

I just tested pgbench on kernel 3.6-rc7.
There is no performance regression between 3.5.3 and 3.6-rc7 on AMD X6.
Thx

--
With best regards,
Nikolay

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 15:00     ` Mel Gorman
  2012-09-24 15:23       ` Nikolay Ulyanitsky
@ 2012-09-24 15:30       ` Peter Zijlstra
  2012-09-24 15:51         ` Mike Galbraith
  2012-09-24 15:52         ` Linus Torvalds
  2012-09-25  4:16       ` Mike Galbraith
  2 siblings, 2 replies; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-24 15:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	Mike Galbraith, linux-kernel, Andreas Herrmann, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Mon, 2012-09-24 at 16:00 +0100, Mel Gorman wrote:
> On Fri, Sep 14, 2012 at 02:42:44PM -0700, Linus Torvalds wrote:
> > On Fri, Sep 14, 2012 at 2:27 PM, Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > as Nikolay says below, we have a regression in 3.6 with pgbench's
> > > benchmark in postgresql.
> > >
> > > I was able to reproduce it on another box here and did a bisection run.
> > > It pointed to the commit below.
> > 
> > Ok. I guess we should just revert it. However, before we do that,
> > maybe Mike can make it just use the exact old semantics of
> > select_idle_sibling() in the update_top_cache_domain() logic.
> > 
> 
> The patch that is being reverted was meant to fix problems with
> commit 4dcfe102 (sched: Avoid SMT siblings in select_idle_sibling() if
> possible). That patch made select_idle_sibling() quite fat and I know it
> is responsible for a 2% regression in a kernel compile benchmark between
> kernel 3.1 and 3.2 on an old AMD Phenom II X4 940. Reverting Mike's patch
> might fix this Postgres regression but it reintroduces the overhead caused
> by commit 4dcfe102 for other cases.  I do not have a suggestion on how to
> make this better, I'm just pointing out that the revert has some downsides.


Something like the below removes a number of cpumask operations, which
on big machines can be quite expensive.

No idea if it's sufficient, but it's a start.

Anyway, does anybody have any clue as to why AMD and Intel machines
behave significantly differently here? Does an Intel box with HT disabled
behave similarly to AMD, or is it something about the micro-architecture?

---
 kernel/sched/fair.c | 22 +++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..8757097 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2661,17 +2661,29 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
+			int candidate = nr_cpu_ids;
 
+			/*
+			 * In the SMT case the groups are the SMT-siblings,
+			 * otherwise they're singleton groups.
+			 */
 			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
+					continue;
+
+				/*
+				 * If any of the SMT-siblings are !idle, the
+				 * core isn't idle.
+				 */
 				if (!idle_cpu(i))
 					goto next;
+
+				if (candidate == nr_cpu_ids)
+					candidate = i;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
+			target = candidate;
+
 			goto done;
 next:
 			sg = sg->next;


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 15:30       ` Peter Zijlstra
@ 2012-09-24 15:51         ` Mike Galbraith
  2012-09-24 15:52         ` Linus Torvalds
  1 sibling, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-24 15:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Mon, 2012-09-24 at 17:30 +0200, Peter Zijlstra wrote:

> Anyway, does anybody have any clue as to why AMD and Intel machines
> behave significantly differently here? Does an Intel box with HT disabled
> behave similarly to AMD, or is it something about the micro-architecture?

If you mean pgbench, it should act about the same on both.  As soon as
pgbench starts using significant CPU and has to compete, it'll start
having trouble.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 15:30       ` Peter Zijlstra
  2012-09-24 15:51         ` Mike Galbraith
@ 2012-09-24 15:52         ` Linus Torvalds
  2012-09-24 16:07           ` Peter Zijlstra
  2012-09-24 16:12           ` Peter Zijlstra
  1 sibling, 2 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-24 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Mon, Sep 24, 2012 at 8:30 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> No idea if it's sufficient, but it's a start.

Can we please do this too?

    diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
    index 96e2b18b6283..2010c1ece7b3 100644
    --- a/kernel/sched/fair.c
    +++ b/kernel/sched/fair.c
    @@ -2634,25 +2634,12 @@ find_idlest_cpu(struct sched_group *group,
struct task_struct *p, int this_cpu)
      */
     static int select_idle_sibling(struct task_struct *p, int target)
     {
    -	int cpu = smp_processor_id();
    -	int prev_cpu = task_cpu(p);
     	struct sched_domain *sd;
     	struct sched_group *sg;
     	int i;

    -	/*
    -	 * If the task is going to be woken-up on this cpu and if it is
    -	 * already idle, then it is the right target.
    -	 */
    -	if (target == cpu && idle_cpu(cpu))
    -		return cpu;
    -
    -	/*
    -	 * If the task is going to be woken-up on the cpu where it previously
    -	 * ran and if it is currently idle, then it the right target.
    -	 */
    -	if (target == prev_cpu && idle_cpu(prev_cpu))
    -		return prev_cpu;
    +	if (idle_cpu(target))
    +		return target;

     	/*
     	 * Otherwise, iterate the domains and find an elegible idle cpu.

(obviously whitespace-damaged). The whole "let's test prev_cpu or cpu"
seems stupid and counter-productive. The only possible values for
'target' are the two we test for.

Your patch looks odd, though. Why do you use some complex initial
value for 'candidate' (nr_cpu_ids) instead of a simple and readable
one (-1)?

And the whole "if we find any non-idle cpu, skip the whole domain"
logic really seems a bit odd (that's not new to your patch, though).
Can somebody explain what the whole point of that idiotically written
function is?

                  Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 15:23       ` Nikolay Ulyanitsky
@ 2012-09-24 15:53         ` Borislav Petkov
  0 siblings, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-24 15:53 UTC (permalink / raw)
  To: Nikolay Ulyanitsky
  Cc: linux-kernel, Mike Galbraith, Andreas Herrmann, Peter Zijlstra,
	Linus Torvalds, Andrew Morton, Thomas Gleixner, Ingo Molnar

On Mon, Sep 24, 2012 at 06:23:37PM +0300, Nikolay Ulyanitsky wrote:
> On 24 September 2012 18:00, Mel Gorman <mgorman@suse.de> wrote:
> > Reverting Mike's patch might fix this Postgres regression but it reintroduces the overhead caused
> > by commit 4dcfe102 for other cases.
> 
> I just tested pgbench on kernel 3.6-rc7.
> There is no performance regression between 3.5.3 and 3.6-rc7 on AMD X6.

Yes, that's what I'm seeing here too. Thanks for testing.

Just one thing, please hit reply-to-all when you reply to the thread so
that everybody can get your mail. I'm re-adding them for now.

Thanks.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 15:52         ` Linus Torvalds
@ 2012-09-24 16:07           ` Peter Zijlstra
  2012-09-24 16:33             ` Linus Torvalds
  2012-09-24 16:12           ` Peter Zijlstra
  1 sibling, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-24 16:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Mon, 2012-09-24 at 08:52 -0700, Linus Torvalds wrote:
> Your patch looks odd, though. Why do you use some complex initial
> value for 'candidate' (nr_cpu_ids) instead of a simple and readable
> one (-1)? 

nr_cpu_ids is the typical no-value value for cpumask operations -- yes
this is annoying and I keep doing it wrong far too often.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 15:52         ` Linus Torvalds
  2012-09-24 16:07           ` Peter Zijlstra
@ 2012-09-24 16:12           ` Peter Zijlstra
  2012-09-24 16:30             ` Linus Torvalds
  1 sibling, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-24 16:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Mon, 2012-09-24 at 08:52 -0700, Linus Torvalds wrote:
> And the whole "if we find any non-idle cpu, skip the whole domain"
> logic really seems a bit odd (that's not new to your patch, though).
> Can somebody explain what the whole point of that idiotically written
> function is? 

So we're looking for an idle cpu around @target. We prefer a cpu of an
idle core, since SMT-siblings share L[12] cache. The way we do this is
by iterating the topology tree downwards starting at the LLC (L3) cache
level. Its groups are either the SMT-siblings or singleton groups.

In case it's the SMT siblings, we want to skip the group if there's a
non-idle cpu in that mask, to avoid sharing L[12].

If we don't find an empty core, we go down the domain tree, either
finding a NULL domain and terminating, or finding the SMT domain, and
we'll see if there's an idle SMT sibling.

Is it pretty? Not really. Is there a better way of doing it? Possibly;
I just haven't thought of it yet :/
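
The policy described above ("prefer an entirely idle core, else settle
for an idle SMT sibling") can be modeled outside the kernel. A toy
sketch with plain arrays standing in for sched_domain/sched_group; the
function name and layout are invented here, not kernel code:

```c
/*
 * Toy model of the select_idle_sibling() search: ncpus CPUs are
 * grouped into cores of `smt` consecutive siblings.  First pass:
 * return the first CPU of a core whose siblings are all idle (so we
 * don't share L1/L2).  Second pass: settle for any idle sibling.
 * Returns -1 if nothing is idle.
 */
static int find_idle_target(const int *idle, int ncpus, int smt)
{
	int core, i;

	/* Prefer an entirely idle core. */
	for (core = 0; core < ncpus; core += smt) {
		int all_idle = 1;

		for (i = core; i < core + smt; i++)
			if (!idle[i])
				all_idle = 0;
		if (all_idle)
			return core;
	}

	/* No fully idle core: fall back to any idle SMT sibling. */
	for (i = 0; i < ncpus; i++)
		if (idle[i])
			return i;

	return -1;
}
```

The "one busy sibling disqualifies the whole core" rule is exactly the
`goto next` in the real code: a core counts as idle only if every
hardware thread on it is idle.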

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 16:12           ` Peter Zijlstra
@ 2012-09-24 16:30             ` Linus Torvalds
  2012-09-24 16:52               ` Borislav Petkov
  2012-09-24 16:54               ` Peter Zijlstra
  0 siblings, 2 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-24 16:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Mon, Sep 24, 2012 at 9:12 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> So we're looking for an idle cpu around @target. We prefer a cpu of an
> idle core, since SMT-siblings share L[12] cache. The way we do this is
> by iterating the topology tree downwards starting at the LLC (L3) cache
> level. Its groups are either the SMT-siblings or singleton groups.

So if it's really guaranteed to be SMT-siblings or singleton groups, then
the whole "for_each_cpu()" is a total disaster. That's a truly
expensive way to look up adjacent CPU's. Is there no saner way to look
up that thing? Like a simple circular list of SMT siblings (I realize
that on x86 that list is either one or two, but other SMT
implementations are groups of four or more).

So I suspect your patch largely makes things faster (avoid those
insane cpumask operations), but the for_each_cpu() one is still an
absolutely horrible way to find a couple of basically statically known
(modulo hotplug, which is disabled here anyway) CPU's. So even if the
algorithm makes sense at some higher level, it doesn't really seem to
make sense from an implementation standpoint.

Also, do we really want to spread things out that aggressively?
How/why do we know that we don't want to share L2 caches, for example?
It sounds like a bad idea from a power standpoint, and possibly
performance too.

                    Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 16:07           ` Peter Zijlstra
@ 2012-09-24 16:33             ` Linus Torvalds
  2012-09-24 16:54               ` Peter Zijlstra
  0 siblings, 1 reply; 115+ messages in thread
From: Linus Torvalds @ 2012-09-24 16:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Mon, Sep 24, 2012 at 9:07 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2012-09-24 at 08:52 -0700, Linus Torvalds wrote:
>> Your patch looks odd, though. Why do you use some complex initial
>> value for 'candidate' (nr_cpu_ids) instead of a simple and readable
>> one (-1)?
>
> nr_cpu_ids is the typical no-value value for cpumask operations -- yes
> this is annoying and I keep doing it wrong far too often.

Can we please just fix it? Making the excuse that it's the "typical
no-value" is still stupid, because it's a f*cking moronic no-value.

Whoever thinks that it's smart to test against "nr_cpu_ids" when
there's a much more natural value (-1) is crazy. Not only does the
source code look more complex, the code it *generates* is clearly more
complex and worse too.

Sure, the "scan bits" bitops will return ">= nr_cpu_ids" for the "I
couldn't find a bit" thing, but that doesn't mean that everything else
should.

            Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 16:30             ` Linus Torvalds
@ 2012-09-24 16:52               ` Borislav Petkov
  2012-09-24 16:54               ` Peter Zijlstra
  1 sibling, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-24 16:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Mon, Sep 24, 2012 at 09:30:05AM -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 9:12 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > So we're looking for an idle cpu around @target. We prefer a cpu of an
> > idle core, since SMT-siblings share L[12] cache. The way we do this is
> > by iterating the topology tree downwards starting at the LLC (L3) cache
> > level. Its groups are either the SMT-siblings or singleton groups.
> 
> So if it's really guaranteed to be SMT-siblings or singleton groups, then
> the whole "for_each_cpu()" is a total disaster. That's a truly
> expensive way to look up adjacent CPU's. Is there no saner way to look
> up that thing? Like a simple circular list of SMT siblings (I realize
> that on x86 that list is either one or two, but other SMT
> implementations are groups of four or more).
> 
> So I suspect your patch largely makes things faster (avoid those
> insane cpumask operations), but the for_each_cpu() one is still an
> absolutely horrible way to find a couple of basically statically known
> (modulo hotplug, which is disabled here anyway) CPU's. So even if the
> algorithm makes sense at some higher level, it doesn't really seem to
> make sense from an implementation standpoint.
> 
> Also, do we really want to spread things out that aggressively?
> How/why do we know that we don't want to share L2 caches, for example?
> It sounds like a bad idea from a power standpoint, and possibly
> performance too.

Right,

maybe the quicker lookup would be the other way around, down the cache
hierarchy: check the CPUs sharing L1, then L2 and if there's no idle
cpu, fall back to the L3-sharing ones and then simply grab one. I don't
know whether that could work though, we'd need to run it heavily.
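
For illustration, the lookup order suggested here could be sketched roughly as
below. Everything in it is a made-up stand-in: cpu_idle_flag[] plays the role
of idle_cpu(), and sharers() fakes a topology where an SMT pair shares L1/L2
and all eight CPUs share the L3; real sharing masks differ per machine.

```c
/*
 * Illustrative sketch only: walk outward through the cache levels
 * (L1 siblings, then L2, then L3) and take the first idle CPU found
 * at the closest level, falling back to the target itself.
 */
#define NCPUS	8
#define NLEVELS	3		/* 0 = L1, 1 = L2, 2 = L3 */

static int cpu_idle_flag[NCPUS];

/* Hypothetical box: SMT pair {cpu, cpu^4} shares L1 and L2, all share L3. */
static unsigned int sharers(int level, int cpu)
{
	if (level < 2)
		return (1u << cpu) | (1u << (cpu ^ 4));
	return (1u << NCPUS) - 1;
}

static int find_idle_near(int target)
{
	int level, cpu;

	if (cpu_idle_flag[target])
		return target;

	for (level = 0; level < NLEVELS; level++) {
		unsigned int mask = sharers(level, target);

		for (cpu = 0; cpu < NCPUS; cpu++)
			if ((mask & (1u << cpu)) && cpu_idle_flag[cpu])
				return cpu;
	}
	return target;		/* nothing idle, stay where we are */
}
```

The point of the ordering is that an idle L1/L2 sibling is always preferred
over an idle CPU that is only reachable through the L3.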

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 16:33             ` Linus Torvalds
@ 2012-09-24 16:54               ` Peter Zijlstra
  2012-09-25 12:10                 ` Hillf Danton
  0 siblings, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-24 16:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar

On Mon, 2012-09-24 at 09:33 -0700, Linus Torvalds wrote:
> Sure, the "scan bits" bitops will return ">= nr_cpu_ids" for the "I
> couldn't find a bit" thing, but that doesn't mean that everything else
> should. 

Fair enough..

---
 kernel/sched/fair.c | 42 +++++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..329f78d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2634,25 +2634,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i;
 
-	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
-	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
-
-	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
-	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
+	if (idle_cpu(target))
+		return target;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
@@ -2661,18 +2648,31 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
+			int candidate = -1;
 
+			/*
+			 * In the SMT case the groups are the SMT-siblings,
+			 * otherwise they're singleton groups.
+			 */
 			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
+					continue;
+
+				/*
+				 * If any of the SMT-siblings are !idle, the
+				 * core isn't idle.
+				 */
 				if (!idle_cpu(i))
 					goto next;
+
+				if (candidate < 0)
+					candidate = i;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
+			if (candidate >= 0) {
+				target = candidate;
+				goto done;
+			}
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 16:30             ` Linus Torvalds
  2012-09-24 16:52               ` Borislav Petkov
@ 2012-09-24 16:54               ` Peter Zijlstra
  2012-09-24 17:44                 ` Peter Zijlstra
  2012-09-24 18:26                 ` Mike Galbraith
  1 sibling, 2 replies; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-24 16:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 09:30 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 9:12 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > So we're looking for an idle cpu around @target. We prefer a cpu of an
> > idle core, since SMT-siblings share L[12] cache. The way we do this is
> > by iterating the topology tree downwards starting at the LLC (L3) cache
> > level. Its groups are either the SMT-siblings or singleton groups.
> 
> So if it's really guaranteed to be SMT-siblings or singleton groups, then
> the whole "for_each_cpu()" is a total disaster. That's a truly
> expensive way to look up adjacent CPU's. Is there no saner way to look
> up that thing? Like a simple circular list of SMT siblings (I realize
> that on x86 that list is either one or two, but other SMT
> implementations are groups of four or more).

SMT siblings aren't actually adjacent in the cpu number space (on x86 at
least).

So the alternative you suggest is pointer chasing a list, is that really
much better than scanning a mostly empty bitmap?

I've no idea how bad these bitmap scanning instructions are on modern
chips. But let me try and come up with the list thing, I think we've
actually got that someplace as well.
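
A very rough sketch of the circular-list alternative Linus describes, for
comparison: the struct and field names below are invented for the sketch,
not the kernel's actual topology data.

```c
/*
 * Illustrative sketch: keep the SMT siblings of each CPU on a small
 * circular list and chase pointers instead of scanning a mostly-empty
 * cpumask.  The walk is O(threads per core), independent of NR_CPUS.
 */
struct smt_node {
	int cpu;
	int idle;
	struct smt_node *next;	/* circular: last sibling points back */
};

/* Return the first idle sibling of 'start', or -1 if none is idle. */
static int idle_sibling(struct smt_node *start)
{
	struct smt_node *n;

	for (n = start->next; n != start; n = n->next)
		if (n->idle)
			return n->cpu;
	return -1;
}
```

With four-way SMT (as Linus notes some non-x86 implementations have), the
ring simply has four nodes; the cost of the walk does not grow with the
number of CPUs in the system.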

> So I suspect your patch largely makes things faster (avoid those
> insane cpumask operations), but the for_each_cpu() one is still an
> absolutely horrible way to find a couple of basically statically known
> (modulo hotplug, which is disabled here anyway) CPU's. So even if the
> algorithm makes sense at some higher level, it doesn't really seem to
> make sense from an implementation standpoint.

Agreed.

> Also, do we really want to spread things out that aggressively?
> How/why do we know that we don't want to share L2 caches, for example?
> It sounds like a bad idea from a power standpoint, and possibly
> performance too.

IIRC this current stuff is the result of Mike and Suresh running a few
benchmarks.. Mike, Suresh, either one of you remember this? Otherwise
I'll have to go trawl the archives.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 16:54               ` Peter Zijlstra
@ 2012-09-24 17:44                 ` Peter Zijlstra
  2012-09-25 13:23                   ` Mel Gorman
  2012-09-24 18:26                 ` Mike Galbraith
  1 sibling, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-24 17:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky, Mike Galbraith,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 18:54 +0200, Peter Zijlstra wrote:
> But let me try and come up with the list thing, I think we've
> actually got that someplace as well. 

OK, I'm sure the below can be written better, but my brain is gone for
the day...

---
 include/linux/sched.h |   1 +
 kernel/sched/core.c   |   1 +
 kernel/sched/fair.c   | 102 +++++++++++++++++++++++++++++++++++---------------
 3 files changed, 73 insertions(+), 31 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0beac68..d72ea68 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -888,6 +888,7 @@ struct sched_group {
 	atomic_t ref;
 
 	unsigned int group_weight;
+	int group_first;
 	struct sched_group_power *sgp;
 
 	/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b38f00e..1177eb1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5781,6 +5781,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 
 	do {
 		sg->group_weight = cpumask_weight(sched_group_cpus(sg));
+		sg->group_first = cpumask_first(sched_group_cpus(sg));
 		sg = sg->next;
 	} while (sg != sd->groups);
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..601bc38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2634,50 +2634,90 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
  */
 static int select_idle_sibling(struct task_struct *p, int target)
 {
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
-	struct sched_group *sg;
-	int i;
+	struct sched_domain *sd_smt, *sd_llc;
+	struct sched_group *sg_smt, *sg_llc;
 
 	/*
-	 * If the task is going to be woken-up on this cpu and if it is
-	 * already idle, then it is the right target.
+	 * If the target is idle, easy peasy, we're done.
 	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
+	if (idle_cpu(target))
+		return target;
 
 	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
+	 * Otherwise, see if there's an idle core in the cache domain.
 	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
+	sd_llc = rcu_dereference(per_cpu(sd_llc, target));
+	sg_llc = sd_llc->groups;
+	do {
+		int candidate = -1;
+
+		sd_smt = rcu_dereference(per_cpu(sd_llc, sg_llc->group_first));
+		for_each_lower_domain(sd_smt) {
+			if (sd_smt->flags & SD_SHARE_CPUPOWER) /* aka. SMT */
+				break;
+		}
+
+		if (!sd_smt) {
+			int cpu = sg_llc->group_first; /* Assume singleton group */
+
+			if (!idle_cpu(cpu))
+				goto next_llc;
+
+			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+				goto next_llc;
+
+			return cpu;
+		}
+
+		sg_smt = sd_smt->groups;
+		do {
+			int cpu = sg_smt->group_first; /* Assume singleton group */
+
+			if (!idle_cpu(cpu)) /* core is not idle, skip to next core */
+				goto next_llc;
+
+			if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+				goto next_smt;
+
+			if (candidate < 0)
+				candidate = cpu;
+
+next_smt:
+			sg_smt = sg_smt->next;
+		} while (sg_smt != sd_smt->groups);
+
+		if (candidate >= 0)
+			return candidate;
+
+next_llc:
+		sg_llc = sg_llc->next;
+	} while (sg_llc != sd_llc->groups);
 
 	/*
-	 * Otherwise, iterate the domains and find an elegible idle cpu.
+	 * Failing that, see if there's an idle SMT sibling.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
+	sd_smt = rcu_dereference(per_cpu(sd_llc, target));
+	for_each_lower_domain(sd_smt) {
+		if (sd_smt->flags & SD_SHARE_CPUPOWER) /* aka. SMT */
+			break;
+	}
+
+	if (sd_smt) {
+		sg_smt = sd_smt->groups;
 		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
+			int cpu = sg_smt->group_first; /* Assume singleton group */
 
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
+			if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+			    idle_cpu(cpu))
+				return cpu;
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
+			sg_smt = sg_smt->next;
+		} while (sg_smt != sd_smt->groups);
 	}
-done:
+
+	/*
+	 * OK, no idle siblings of any kind, take what we started with.
+	 */
 	return target;
 }
 


^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 16:54               ` Peter Zijlstra
  2012-09-24 17:44                 ` Peter Zijlstra
@ 2012-09-24 18:26                 ` Mike Galbraith
  2012-09-24 19:12                   ` Linus Torvalds
  1 sibling, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-24 18:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 18:54 +0200, Peter Zijlstra wrote: 
> On Mon, 2012-09-24 at 09:30 -0700, Linus Torvalds wrote:

> > Also, do we really want to spread things out that aggressively?
> > How/why do we know that we don't want to share L2 caches, for example?
> > It sounds like a bad idea from a power standpoint, and possibly
> > performance too.
> 
> IIRC this current stuff is the result of Mike and Suresh running a few
> benchmarks.. Mike, Suresh, either one of you remember this? Otherwise
> I'll have to go trawl the archives.

Aside from the cache pollution I recall having been mentioned, on my
E5620, cross core is a tbench win over affine, cross thread is not.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 18:26                 ` Mike Galbraith
@ 2012-09-24 19:12                   ` Linus Torvalds
  2012-09-24 19:20                     ` Borislav Petkov
                                       ` (2 more replies)
  0 siblings, 3 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-24 19:12 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith <efault@gmx.de> wrote:
>
> Aside from the cache pollution I recall having been mentioned, on my
> E5620, cross core is a tbench win over affine, cross thread is not.

Oh, I agree with trying to avoid HT threads, the resource contention
easily gets too bad.

It's more a question of "if we have real cores with separate L1's but
shared L2's, go with those first, before we start distributing it out
to separate L2's".

             Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 19:12                   ` Linus Torvalds
@ 2012-09-24 19:20                     ` Borislav Petkov
  2012-09-25  1:57                       ` Mike Galbraith
  2012-09-25  1:39                     ` Mike Galbraith
  2012-09-25 21:11                     ` Suresh Siddha
  2 siblings, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-24 19:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, Sep 24, 2012 at 12:12:18PM -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith <efault@gmx.de> wrote:
> >
> > Aside from the cache pollution I recall having been mentioned, on my
> > E5620, cross core is a tbench win over affine, cross thread is not.
> 
> Oh, I agree with trying to avoid HT threads, the resource contention
> easily gets too bad.
> 
> It's more a question of "if we have real cores with separate L1's but
> shared L2's, go with those first, before we start distributing it out
> to separate L2's".

Yes, this is exactly what I meant before. We basically want to avoid
unnecessary, high-volume probe traffic over the L3 or memory controller,
if possible.

So, trying harder to select an L2 sibling would be more beneficial,
IMHO, instead of scanning the whole node.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 19:12                   ` Linus Torvalds
  2012-09-24 19:20                     ` Borislav Petkov
@ 2012-09-25  1:39                     ` Mike Galbraith
  2012-09-25 21:11                     ` Suresh Siddha
  2 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-25  1:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 12:12 -0700, Linus Torvalds wrote: 
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith <efault@gmx.de> wrote:
> >
> > Aside from the cache pollution I recall having been mentioned, on my
> > E5620, cross core is a tbench win over affine, cross thread is not.
> 
> Oh, I agree with trying to avoid HT threads, the resource contention
> easily gets too bad.
> 
> It's more a question of "if we have real cores with separate L1's but
> shared L2's, go with those first, before we start distributing it out
> to separate L2's".

Oh absolutely, if you have cores with shared L2, that's _the_ way to go,
shared L2 is lovely, and what select_idle_sibling() was originally all
about.  Westmere manages to cough up a tbench win without that.  Heck, I
generated some numbers for Borislav yesterday showing the thing winning
at _netperf TCP_RR_, one byte synchronous ping pong.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 19:20                     ` Borislav Petkov
@ 2012-09-25  1:57                       ` Mike Galbraith
  2012-09-25  2:11                         ` Linus Torvalds
  0 siblings, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-25  1:57 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 21:20 +0200, Borislav Petkov wrote: 
> On Mon, Sep 24, 2012 at 12:12:18PM -0700, Linus Torvalds wrote:
> > On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith <efault@gmx.de> wrote:
> > >
> > > Aside from the cache pollution I recall having been mentioned, on my
> > > E5620, cross core is a tbench win over affine, cross thread is not.
> > 
> > Oh, I agree with trying to avoid HT threads, the resource contention
> > easily gets too bad.
> > 
> > It's more a question of "if we have real cores with separate L1's but
> > shared L2's, go with those first, before we start distributing it out
> > to separate L2's".
> 
> Yes, this is exactly what I meant before. We basically want to avoid
> unnecessary, high-volume probe traffic over the L3 or memory controller,
> if possible.
> 
> So, trying harder to select an L2 sibling would be more beneficial,
> IMHO, instead of scanning the whole node.

If those L2 siblings are cores, oh yeah.  Do any modern packages have
multi-core shared L2?

-Mike



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25  1:57                       ` Mike Galbraith
@ 2012-09-25  2:11                         ` Linus Torvalds
  2012-09-25  2:49                           ` Mike Galbraith
  2012-09-25 11:58                           ` Peter Zijlstra
  0 siblings, 2 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-25  2:11 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, Sep 24, 2012 at 6:57 PM, Mike Galbraith <efault@gmx.de> wrote:
>
> If those L2 siblings are cores, oh yeah.  Do any modern packages have
> multi-core shared L2?

The upcoming AMD "steamroller" is supposed to have enough separation
between the cores sharing the L2 cache to probably be worth splitting
them up (they do share the FP unit, and some ifetch).

That's still somewhere in between HT and true multi-core, but it looks
to be closer to multi-core than HT (the current bulldozer/piledriver
is too, but it shares so much of the instruction decoder that I think
it's better to think of it as HT than as really multiple cores -
there's way too much sharing going on).

In the not-so-distant past, we had the intel "Dunnington" Xeon, which
was iirc basically three Core 2 duo's bolted together (ie three
clusters of two cores sharing L2, and a fully shared L3). So that was
a true multi-core with fairly big shared L2, and it really would be
sad to not use the second core aggressively.

So it's not very common, but it's not unheard of either.

            Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25  2:11                         ` Linus Torvalds
@ 2012-09-25  2:49                           ` Mike Galbraith
  2012-09-25  3:10                             ` Linus Torvalds
  2012-09-25 11:58                           ` Peter Zijlstra
  1 sibling, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-25  2:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 19:11 -0700, Linus Torvalds wrote:

> In the not-so-distant past, we had the intel "Dunnington" Xeon, which
> was iirc basically three Core 2 duo's bolted together (ie three
> clusters of two cores sharing L2, and a fully shared L3). So that was
> a true multi-core with fairly big shared L2, and it really would be
> sad to not use the second core aggressively.

Ah.  That's what I did to select_idle_sibling() in a nutshell, converted
the problematic large L3 packages into multiple ~core2duo pairs, modulo
shared L2 'course.  Bounce proof, and on Westmere, the jabbering back
and forth in L3 somehow doesn't hurt as much as expected, so the things
act (more or less, L2 traffic _does_ matter;) like the real deal.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25  2:49                           ` Mike Galbraith
@ 2012-09-25  3:10                             ` Linus Torvalds
  2012-09-25  3:20                               ` Mike Galbraith
  0 siblings, 1 reply; 115+ messages in thread
From: Linus Torvalds @ 2012-09-25  3:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, Sep 24, 2012 at 7:49 PM, Mike Galbraith <efault@gmx.de> wrote:
>
> Ah.  That's what I did to select_idle_sibling() in a nutshell, converted
> the problematic large L3 packages into multiple ~core2duo pairs, modulo
> shared L2 'course.  Bounce proof, and on Westmere, the jabbering back
> and forth in L3 somehow doesn't hurt as much as expected, so the things
> act (more or less, L2 traffic _does_ matter;) like the real deal.

Right. But your patch *only* looked at the pair.

Which may be bounce-proof, but we also saw that it was unacceptable.

              Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25  3:10                             ` Linus Torvalds
@ 2012-09-25  3:20                               ` Mike Galbraith
  2012-09-25  3:32                                 ` Linus Torvalds
  0 siblings, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-25  3:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 20:10 -0700, Linus Torvalds wrote: 
> On Mon, Sep 24, 2012 at 7:49 PM, Mike Galbraith <efault@gmx.de> wrote:
> >
> > Ah.  That's what I did to select_idle_sibling() in a nutshell, converted
> > the problematic large L3 packages into multiple ~core2duo pairs, modulo
> > shared L2 'course.  Bounce proof, and on Westmere, the jabbering back
> > and forth in L3 somehow doesn't hurt as much as expected, so the things
> > act (more or less, L2 traffic _does_ matter;) like the real deal.
> 
> Right. But your patch *only* looked at the pair.
> 
> Which may be bounce-proof, but we also saw that it was unacceptable.

Yes.  Cross wiring traverse _start_ points should eliminate (well, damp)
bounce as well without killing the 1:N latency/preempt benefits of large
L3 packages.  You'll still take a lot of L2 misses while doing futile
traverse when fully committed, but that's a separate issue.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25  3:20                               ` Mike Galbraith
@ 2012-09-25  3:32                                 ` Linus Torvalds
  2012-09-25  3:43                                   ` Mike Galbraith
  0 siblings, 1 reply; 115+ messages in thread
From: Linus Torvalds @ 2012-09-25  3:32 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, Sep 24, 2012 at 8:20 PM, Mike Galbraith <efault@gmx.de> wrote:
>
> Yes.  Cross wiring traverse _start_ points should eliminate (well, damp)
> bounce as well without killing the 1:N latency/preempt benefits of large
> L3 packages.

Yes, a "test buddy first, then check the other cores in the package"
hybrid approach might be reasonable.

Of course, that's effectively what the whole "prev_cpu" thing is kind
of supposed to also do, isn't it? Because it's even lovelier if you
can avoid bouncing around by trying to hit a previous CPU that might
just have some of the old data in the caches still.
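
The "buddy first, then the rest of the package" ordering being discussed can
be sketched very loosely as below; idle_flag[] is a made-up stand-in for
idle_cpu(), and none of this is the actual scheduler code.

```c
/*
 * Loose sketch of the hybrid approach: try the wakeup target, then
 * the task's previous CPU (which may still hold warm cache lines),
 * and only then scan the remaining CPUs in the package.
 */
#define PKG_CPUS 4

static int idle_flag[PKG_CPUS];

static int pick_cpu(int target, int prev)
{
	int cpu;

	if (idle_flag[target])
		return target;
	if (idle_flag[prev])
		return prev;	/* cache-warm candidate before scanning */
	for (cpu = 0; cpu < PKG_CPUS; cpu++)
		if (idle_flag[cpu])
			return cpu;
	return target;		/* nothing idle: stay put */
}
```

The ordering encodes the preference discussed above: cache warmth (target,
then prev_cpu) beats merely finding some idle CPU elsewhere in the package.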

      Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25  3:32                                 ` Linus Torvalds
@ 2012-09-25  3:43                                   ` Mike Galbraith
  0 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-25  3:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 20:32 -0700, Linus Torvalds wrote: 
> On Mon, Sep 24, 2012 at 8:20 PM, Mike Galbraith <efault@gmx.de> wrote:
> >
> > Yes.  Cross wiring traverse _start_ points should eliminate (well, damp)
> > bounce as well without killing the 1:N latency/preempt benefits of large
> > L3 packages.
> 
> Yes, a "test buddy first, then check the other cores in the package"
> hybrid approach might be reasonable.
> 
> Of course, that's effectively what the whole "prev_cpu" thing is kind
> of supposed to also do, isn't it? Because it's even lovelier if you
> can avoid bouncing around by trying to hit a previous CPU that might
> just have some of the old data in the caches still.

prev_cpu can be anywhere, so buddies sometimes need help getting back
together when they've been disrupted, but yeah, in the general case it's
local, so you want prev_cpu if it can be had.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 15:00     ` Mel Gorman
  2012-09-24 15:23       ` Nikolay Ulyanitsky
  2012-09-24 15:30       ` Peter Zijlstra
@ 2012-09-25  4:16       ` Mike Galbraith
  2 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-25  4:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Peter Zijlstra, Andrew Morton,
	Thomas Gleixner, Ingo Molnar

On Mon, 2012-09-24 at 16:00 +0100, Mel Gorman wrote: 
> On Fri, Sep 14, 2012 at 02:42:44PM -0700, Linus Torvalds wrote:
> > On Fri, Sep 14, 2012 at 2:27 PM, Borislav Petkov <bp@alien8.de> wrote:
> > >
> > > as Nikolay says below, we have a regression in 3.6 with pgbench's
> > > benchmark in postgresql.
> > >
> > > I was able to reproduce it on another box here and did a bisection run.
> > > It pointed to the commit below.
> > 
> > Ok. I guess we should just revert it. However, before we do that,
> > maybe Mike can make it just use the exact old semantics of
> > select_idle_sibling() in the update_top_cache_domain() logic.
> > 
> 
> The patch being reverted was meant to fix problems with
> commit 4dcfe102 (sched: Avoid SMT siblings in select_idle_sibling() if
> possible). That patch made select_idle_sibling() quite fat and I know it
> is responsible for a 2% regression in a kernel compile benchmark between
> kernel 3.1 and 3.2 on an old AMD Phenom II X4 940. Reverting Mike's patch
> might fix this Postgres regression but it reintroduces the overhead caused
> by commit 4dcfe102 for other cases.  I do not have a suggestion on how to
> make this better, I'm just pointing out that the revert has some downsides.

Yeah.  The very good thing about this is that Linus becoming interested
in a problem like two faced little (naughty word) select_idle_sibling()
brings more minds to bear, so we'll end up in better shape.  Right now,
we're in not so wonderful shape with or without my patch, depending very
much on which processor you're running which load on.  That needs to get
better.  AMD is being hurt _bad_ on fast movers, but just kill the damn
thing on AMD processors, and pgbench will fall through the floor.

Aiming darts would be lots easier if the bullseye would stop moving ;-) 

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25  2:11                         ` Linus Torvalds
  2012-09-25  2:49                           ` Mike Galbraith
@ 2012-09-25 11:58                           ` Peter Zijlstra
  2012-09-25 13:17                             ` Borislav Petkov
  1 sibling, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-25 11:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Borislav Petkov, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Mon, 2012-09-24 at 19:11 -0700, Linus Torvalds wrote:
> In the not-so-distant past, we had the intel "Dunnington" Xeon, which
> was iirc basically three Core 2 duo's bolted together (ie three
> clusters of two cores sharing L2, and a fully shared L3). So that was
> a true multi-core with fairly big shared L2, and it really would be
> sad to not use the second core aggressively. 

Ah indeed. My Core2Quad didn't have an L3 afaik (it's sitting around
without a PSU atm so checking gets a little hard) so the LLC level was
the L2 and all worked out right (it also not having SMT helped of
course).

But if there was a Xeon chip that did add a package L3 then yes, all
this would become more interesting still. We'd need to extend the
scheduler topology a bit as well, I don't think it can currently handle
this well.

So I guess we get to do some work for steamroller.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 16:54               ` Peter Zijlstra
@ 2012-09-25 12:10                 ` Hillf Danton
  0 siblings, 0 replies; 115+ messages in thread
From: Hillf Danton @ 2012-09-25 12:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Mel Gorman, Borislav Petkov, Nikolay Ulyanitsky,
	Mike Galbraith, linux-kernel, Andreas Herrmann, Andrew Morton,
	Th

On Tue, Sep 25, 2012 at 12:54 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> On Mon, 2012-09-24 at 09:33 -0700, Linus Torvalds wrote:
>> Sure, the "scan bits" bitops will return ">= nr_cpu_ids" for the "I
>> couldn't find a bit" thing, but that doesn't mean that everything else
>> should.
>
> Fair enough..
>
> ---
>  kernel/sched/fair.c | 42 +++++++++++++++++++++---------------------
>  1 file changed, 21 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b800a1..329f78d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2634,25 +2634,12 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>   */
>  static int select_idle_sibling(struct task_struct *p, int target)
>  {
> -       int cpu = smp_processor_id();
> -       int prev_cpu = task_cpu(p);
>         struct sched_domain *sd;
>         struct sched_group *sg;
>         int i;
>
> -       /*
> -        * If the task is going to be woken-up on this cpu and if it is
> -        * already idle, then it is the right target.
> -        */
> -       if (target == cpu && idle_cpu(cpu))
> -               return cpu;
> -
> -       /*
> -        * If the task is going to be woken-up on the cpu where it previously
> -        * ran and if it is currently idle, then it the right target.
> -        */
> -       if (target == prev_cpu && idle_cpu(prev_cpu))
> -               return prev_cpu;
> +       if (idle_cpu(target))
> +               return target;
>
>         /*
>          * Otherwise, iterate the domains and find an elegible idle cpu.
> @@ -2661,18 +2648,31 @@ static int select_idle_sibling(struct task_struct *p, int target)
>         for_each_lower_domain(sd) {
>                 sg = sd->groups;
>                 do {
> -                       if (!cpumask_intersects(sched_group_cpus(sg),
> -                                               tsk_cpus_allowed(p)))
> -                               goto next;
> +                       int candidate = -1;
>
> +                       /*
> +                        * In the SMT case the groups are the SMT-siblings,
> +                        * otherwise they're singleton groups.
> +                        */
>                         for_each_cpu(i, sched_group_cpus(sg)) {
> +                               if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
> +                                       continue;
> +
> +                               /*
> +                                * If any of the SMT-siblings are !idle, the
> +                                * core isn't idle.
> +                                */
>                                 if (!idle_cpu(i))
>                                         goto next;
> +
> +                               if (candidate < 0)
> +                                       candidate = i;


Any reason to determine candidate by scanning a non-idle core?
>                         }
>
> -                       target = cpumask_first_and(sched_group_cpus(sg),
> -                                       tsk_cpus_allowed(p));
> -                       goto done;
> +                       if (candidate >= 0) {
> +                               target = candidate;
> +                               goto done;
> +                       }
>  next:
>                         sg = sg->next;
>                 } while (sg != sd->groups);
>
> --

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 11:58                           ` Peter Zijlstra
@ 2012-09-25 13:17                             ` Borislav Petkov
  2012-09-25 17:00                               ` Borislav Petkov
  0 siblings, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-25 13:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Mike Galbraith, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, Sep 25, 2012 at 01:58:06PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 19:11 -0700, Linus Torvalds wrote:
> > In the not-so-distant past, we had the intel "Dunnington" Xeon, which
> > was iirc basically three Core 2 duo's bolted together (ie three
> > clusters of two cores sharing L2, and a fully shared L3). So that was
> > a true multi-core with fairly big shared L2, and it really would be
> > sad to not use the second core aggressively. 
> 
> Ah indeed. My Core2Quad didn't have an L3 afaik (it's sitting around
> without a PSU atm so checking gets a little hard) so the LLC level was
> the L2 and all worked out right (it also not having SMT helped of
> course).
> 
> But if there was a Xeon chip that did add a package L3 then yes, all
> this would become more interesting still. We'd need to extend the
> scheduler topology a bit as well, I don't think it can currently handle
> this well.
> 
> So I guess we get to do some work for steamroller.

Right, but before that we can still do some experimenting on Bulldozer
- we have the shared 2M L2 there too and it would be nice to improve
select_idle_sibling there.

For example, I did some measurements a couple of days ago on Bulldozer
of tbench with and without select_idle_sibling:

tbench runs single-socket OR-B (box has 8 cores, 4 CUs) (tbench_srv
localhost), tbench default settings as in debian testing

# clients                                                       1       2       4       8       12      16
3.6-rc6+tip/auto-latest                                         115.91  238.571 469.606 1865.77 1863.08 1851.46
3.6-rc6+tip/auto-latest-kill select_idle_sibling():             354.619 534.714 900.069 1969.35 1955.91 1940.84


3.6-rc6+tip/auto-latest
-----------------------
Throughput 115.91 MB/sec   1 clients  1 procs  max_latency=0.296 ms
Throughput 238.571 MB/sec  2 clients  2 procs  max_latency=1.296 ms
Throughput 469.606 MB/sec  4 clients  4 procs  max_latency=0.340 ms
Throughput 1865.77 MB/sec  8 clients  8 procs  max_latency=3.393 ms
Throughput 1863.08 MB/sec  12 clients  12 procs  max_latency=0.322 ms
Throughput 1851.46 MB/sec  16 clients  16 procs  max_latency=2.059 ms

3.6-rc6+tip/auto-latest-kill select_idle_sibling()
--------------------------------------------------
Throughput 354.619 MB/sec  1 clients  1 procs  max_latency=0.321 ms
Throughput 534.714 MB/sec  2 clients  2 procs  max_latency=2.651 ms
Throughput 900.069 MB/sec  4 clients  4 procs  max_latency=10.823 ms
Throughput 1969.35 MB/sec  8 clients  8 procs  max_latency=1.630 ms
Throughput 1955.91 MB/sec  12 clients  12 procs  max_latency=3.236 ms
Throughput 1940.84 MB/sec  16 clients  16 procs  max_latency=0.314 ms

So improving this select_idle_sibling thing wouldn't be such a bad
thing.

Btw, I'll run your patch at http://marc.info/?l=linux-kernel&m=134850571330618
with the same benchmark to see what it brings.

Thanks.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 17:44                 ` Peter Zijlstra
@ 2012-09-25 13:23                   ` Mel Gorman
  2012-09-25 14:36                     ` Peter Zijlstra
  0 siblings, 1 reply; 115+ messages in thread
From: Mel Gorman @ 2012-09-25 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	Mike Galbraith, linux-kernel, Andreas Herrmann, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Mon, Sep 24, 2012 at 07:44:17PM +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-24 at 18:54 +0200, Peter Zijlstra wrote:
> > But let me try and come up with the list thing, I think we've
> > actually got that someplace as well. 
> 
> OK, I'm sure the below can be written better, but my brain is gone for
> the day...
> 

It crashes on boot due to the fact that you created a function-scope variable
called sd_llc in select_idle_sibling() and shadowed the actual sd_llc you
were interested in. Result: dereferenced uninitialised pointer and kaboom.
Trivial to fix so it boots at least.

This is a silly test for a scheduler patch but as "sched: Avoid SMT siblings
in select_idle_sibling() if possible" regressed 2% back in 3.2, it seemed
reasonable to retest with it.

KERNBENCH
                               3.6.0                 3.6.0                 3.6.0
                         rc6-vanilla    rc6-mikebuddy-v1r1  rc6-idlesibling-v1r1
User    min         352.47 (  0.00%)      351.77 (  0.20%)      352.30 (  0.05%)
User    mean        353.10 (  0.00%)      352.78 (  0.09%)      352.77 (  0.09%)
User    stddev        0.41 (  0.00%)        0.56 (-36.13%)        0.35 ( 15.16%)
User    max         353.55 (  0.00%)      353.43 (  0.03%)      353.31 (  0.07%)
System  min          34.86 (  0.00%)       34.83 (  0.09%)       35.37 ( -1.46%)
System  mean         35.35 (  0.00%)       35.29 (  0.16%)       35.63 ( -0.80%)
System  stddev        0.41 (  0.00%)        0.40 (  0.10%)        0.15 ( 62.26%)
System  max          35.94 (  0.00%)       36.05 ( -0.31%)       35.81 (  0.36%)
Elapsed min         110.18 (  0.00%)      109.65 (  0.48%)      110.04 (  0.13%)
Elapsed mean        110.21 (  0.00%)      109.75 (  0.42%)      110.15 (  0.06%)
Elapsed stddev        0.03 (  0.00%)        0.07 (-167.83%)        0.09 (-207.56%)
Elapsed max         110.26 (  0.00%)      109.86 (  0.36%)      110.26 (  0.00%)
CPU     min         352.00 (  0.00%)      353.00 ( -0.28%)      352.00 (  0.00%)
CPU     mean        352.00 (  0.00%)      353.00 ( -0.28%)      352.00 (  0.00%)
CPU     stddev        0.00 (  0.00%)        0.00 (  0.00%)        0.00 (  0.00%)
CPU     max         352.00 (  0.00%)      353.00 ( -0.28%)      352.00 (  0.00%)

mikebuddy-v1r1 is Mike's patch that just got reverted. idlesibling is
Peter's patch. "Elapsed mean" time is the main value of interest. Mike's
patch gains 0.42% which is less than the 2% lost but at least the gain is
outside the noise. idlesibling makes very little difference. "System mean"
is also interesting because even though idlesibling shows a "regression", it
also shows that the variation between runs is reduced. That might indicate
that fewer cache misses are being incurred in the select_idle_sibling()
code although that is a bit of a leap of faith.

The machine is in use at the moment but I'll queue up a test this evening to
gather a profile to confirm time is even being spent in select_idle_sibling().
Just because 2% was lost in select_idle_sibling() back in 3.2 does not
mean squat now.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 13:23                   ` Mel Gorman
@ 2012-09-25 14:36                     ` Peter Zijlstra
  0 siblings, 0 replies; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-25 14:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Borislav Petkov, Nikolay Ulyanitsky,
	Mike Galbraith, linux-kernel, Andreas Herrmann, Andrew Morton,
	Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Tue, 2012-09-25 at 14:23 +0100, Mel Gorman wrote:
> It crashes on boot due to the fact that you created a function-scope variable
> called sd_llc in select_idle_sibling() and shadowed the actual sd_llc you
> were interested in. 

D'0h!

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 13:17                             ` Borislav Petkov
@ 2012-09-25 17:00                               ` Borislav Petkov
  2012-09-25 17:21                                 ` Linus Torvalds
  0 siblings, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-25 17:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Mike Galbraith, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, Sep 25, 2012 at 03:17:36PM +0200, Borislav Petkov wrote:
> For example, I did some measurements a couple of days ago on Bulldozer
> of tbench with and without select_idle_sibling:

Here are updated benchmark results with your patch here:
http://marc.info/?l=linux-kernel&m=134850871822587

I think this pretty much confirms Mel's results.

tbench runs single-socket OR-B  (box has 8 cores, 4 CUs) (tbench_srv localhost), tbench default settings as in debian testing

# clients							1	2	4	8	12	16
3.6-rc6+tip/auto-latest						115.91	238.571	469.606	1865.77	1863.08	1851.46
3.6-rc6+tip/auto-latest-kill select_idle_sibling():		354.619	534.714	900.069	1969.35	1955.91	1940.84
3.6-rc6+tip/auto-latest-revert-the-revert			114.001	223.171	408.507	1771.48	1757.08	1736.12
3.6-rc7+tip/auto-latest-select_idle_sibling-lists		107.39  222.439 435.255 1659.42 1697.43 1685.92

3.6-rc6+tip/auto-latest
-----------------------
Throughput 115.91 MB/sec   1 clients  1 procs  max_latency=0.296 ms
Throughput 238.571 MB/sec  2 clients  2 procs  max_latency=1.296 ms
Throughput 469.606 MB/sec  4 clients  4 procs  max_latency=0.340 ms
Throughput 1865.77 MB/sec  8 clients  8 procs  max_latency=3.393 ms
Throughput 1863.08 MB/sec  12 clients  12 procs  max_latency=0.322 ms
Throughput 1851.46 MB/sec  16 clients  16 procs  max_latency=2.059 ms

3.6-rc6+tip/auto-latest-kill select_idle_sibling()
--------------------------------------------------
Throughput 354.619 MB/sec  1 clients  1 procs  max_latency=0.321 ms
Throughput 534.714 MB/sec  2 clients  2 procs  max_latency=2.651 ms
Throughput 900.069 MB/sec  4 clients  4 procs  max_latency=10.823 ms
Throughput 1969.35 MB/sec  8 clients  8 procs  max_latency=1.630 ms
Throughput 1955.91 MB/sec  12 clients  12 procs  max_latency=3.236 ms
Throughput 1940.84 MB/sec  16 clients  16 procs  max_latency=0.314 ms

3.6-rc6+tip/auto-latest-revert-the-revert
-----------------------------------------
Throughput 114.001 MB/sec  1 clients  1 procs  max_latency=0.352 ms
Throughput 223.171 MB/sec  2 clients  2 procs  max_latency=0.348 ms
Throughput 408.507 MB/sec  4 clients  4 procs  max_latency=0.388 ms
Throughput 1771.48 MB/sec  8 clients  8 procs  max_latency=0.280 ms
Throughput 1757.08 MB/sec  12 clients  12 procs  max_latency=3.280 ms
Throughput 1736.12 MB/sec  16 clients  16 procs  max_latency=0.333 ms

3.6-rc7+tip/auto-latest-select_idle_sibling-lists
-------------------------------------------------
Throughput 107.39 MB/sec  1 clients  1 procs  max_latency=0.372 ms
Throughput 222.439 MB/sec  2 clients  2 procs  max_latency=0.345 ms
Throughput 435.255 MB/sec  4 clients  4 procs  max_latency=0.346 ms
Throughput 1659.42 MB/sec  8 clients  8 procs  max_latency=3.497 ms
Throughput 1697.43 MB/sec  12 clients  12 procs  max_latency=3.205 ms
Throughput 1685.92 MB/sec  16 clients  16 procs  max_latency=0.331 ms

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread
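A quick post-processing sketch of the numbers quoted above: the relative change of the kill-select_idle_sibling() run against the tip/auto-latest baseline, per client count (the dictionaries just transcribe the tbench table):

```python
# Throughput (MB/sec) from the tbench table above.
baseline = {1: 115.91, 2: 238.571, 4: 469.606, 8: 1865.77, 12: 1863.08, 16: 1851.46}
no_sis   = {1: 354.619, 2: 534.714, 4: 900.069, 8: 1969.35, 12: 1955.91, 16: 1940.84}

def pct_change(new, old):
    """Relative throughput change in percent."""
    return 100.0 * (new - old) / old

for clients in sorted(baseline):
    delta = pct_change(no_sis[clients], baseline[clients])
    print(f"{clients:2d} clients: {delta:+6.1f}%")
```

This makes the shape of the result obvious: killing select_idle_sibling() is a multiple-fold win while cores are underutilized (1-4 clients) and shrinks to a ~5% win once all 8 cores are busy.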

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 17:00                               ` Borislav Petkov
@ 2012-09-25 17:21                                 ` Linus Torvalds
  2012-09-25 18:42                                   ` Borislav Petkov
                                                     ` (2 more replies)
  0 siblings, 3 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-25 17:21 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Linus Torvalds, Mike Galbraith,
	Mel Gorman, Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov <bp@alien8.de> wrote:
>
> 3.6-rc6+tip/auto-latest-kill select_idle_sibling()

Is this literally just removing it entirely? Because apart from the
latency spike at 4 procs (and the latency numbers look very noisy, so
that's probably just noise), it looks clearly superior to everything
else. On that benchmark, at least.

How does pgbench look? That's the one that apparently really wants to
spread out, possibly due to user-level spinlocks. So I assume it will
show the reverse pattern, with "kill select_idle_sibling" being the
worst case. Sad, because it really would be lovely to just remove that
thing ;)

          Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 17:21                                 ` Linus Torvalds
@ 2012-09-25 18:42                                   ` Borislav Petkov
  2012-09-25 19:08                                     ` Linus Torvalds
  2012-09-26  2:23                                     ` Mike Galbraith
  2012-09-26  2:00                                   ` Mike Galbraith
  2012-09-26 16:32                                   ` Borislav Petkov
  2 siblings, 2 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-25 18:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mike Galbraith, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov <bp@alien8.de> wrote:
> >
> > 3.6-rc6+tip/auto-latest-kill select_idle_sibling()
> 
> Is this literally just removing it entirely?

Basically yes:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a14b990..016ba387c7f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2640,6 +2640,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
        struct sched_group *sg;
        int i;
 
+       goto done;
+
        /*
         * If the task is going to be woken-up on this cpu and if it is
         * already idle, then it is the right target.

> Because apart from the latency spike at 4 procs (and the latency
> numbers look very noisy, so that's probably just noise), it looks
> clearly superior to everything else. On that benchmark, at least.

Yep, I need more results for a more reliable say here.

> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case.

Let me run pgbench tomorrow (I had run it only on an older family 0x10
single-node box) on Bulldozer to check that out. And we haven't started
the multi-node measurements at all.

> Sad, because it really would be lovely to just remove that thing ;)

Right, so why did we need it at all, in the first place? There has to be
some reason for it.

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 18:42                                   ` Borislav Petkov
@ 2012-09-25 19:08                                     ` Linus Torvalds
  2012-09-26  2:23                                     ` Mike Galbraith
  1 sibling, 0 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-25 19:08 UTC (permalink / raw)
  To: Borislav Petkov, Peter Zijlstra, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Tue, Sep 25, 2012 at 11:42 AM, Borislav Petkov <bp@alien8.de> wrote:
>>
>> Is this literally just removing it entirely?
>
> Basically yes:

Ok, so you make it just always select 'target'. Fine. I wondered if
you just removed the calling logic entirely.

>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case.
>
> Let me run pgbench tomorrow (I had run it only on an older family 0x10
> single-node box) on Bulldozer to check that out. And we haven't started
> the multi-node measurements at all.

Ack, this clearly needs much more testing. That said, I really would
*love* to just get rid of the function entirely.

>> Sad, because it really would be lovely to just remove that thing ;)
>
> Right, so why did we need it all, in the first place? There has to be
> some reason for it.

I'm not entirely convinced.

Looking at the history of that thing, it's long and tortuous, and has
a few commits completely fixing the "logic" of it (eg see commit
99bd5e2f245d).

To the point where I don't think it necessarily even matches what the
original cause for it was. So it's *possible* that we have a case of
historical code that may have improved performance originally on at
least some machines, but that has (a) been changed due to it being
broken and (b) CPU's have changed too, so it may well be that it
simply doesn't help any more.

And we've had problems with this function before. See for example:
 - 4dcfe1025b51: sched: Avoid SMT siblings in select_idle_sibling() if possible
 - 518cd6234178: sched: Only queue remote wakeups when crossing cache boundaries

so we've basically had odd special-case "tuning" of this function from
the original. I do not think that there is any solid reason to believe
that it does what it used to do, or that what it used to do makes
sense any more.

It's entirely possible that "prev_cpu" basically ends up being the
better choice for spreading things out.

That said, my *guess* is that when you run pgbench, you'll see the
same regression that we saw due to Mike's patch too. It simply looks
like tbench wants to have minimal cpu selection and avoid moving
things around, while pgbench probably wants to spread out maximally.

             Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-24 19:12                   ` Linus Torvalds
  2012-09-24 19:20                     ` Borislav Petkov
  2012-09-25  1:39                     ` Mike Galbraith
@ 2012-09-25 21:11                     ` Suresh Siddha
  2 siblings, 0 replies; 115+ messages in thread
From: Suresh Siddha @ 2012-09-25 21:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Peter Zijlstra, Mel Gorman, Borislav Petkov,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar

On Mon, 2012-09-24 at 12:12 -0700, Linus Torvalds wrote:
> On Mon, Sep 24, 2012 at 11:26 AM, Mike Galbraith <efault@gmx.de> wrote:
> >
> > Aside from the cache pollution I recall having been mentioned, on my
> > E5620, cross core is a tbench win over affine, cross thread is not.
> 
> Oh, I agree with trying to avoid HT threads, the resource contention
> easily gets too bad.
> 
> It's more a question of "if we have real cores with separate L1's but
> shared L2's, go with those first, before we start distributing it out
> to separate L2's".

There is one issue though. If the tasks continue to run in this state
and the periodic balance notices an idle L2, it will force migrate
(using active migration) one of the tasks to the idle L2. As the
periodic balance tries to spread the load as far as possible to take
maximum advantage of the available resources (and the perf advantage of
this really depends on the workload, cache usage/memory bw, the upside
of turbo etc).

But I am not sure if this was the reason why we chose to spread it out
to separate L2's during wakeup.

Anyways, this is one of the places where Paul Turner's task load
average tracking patches will be useful. Depending on how long a task
typically runs, we can probably even choose an SMT sibling or a separate
L2 to run on.

thanks,
suresh


^ permalink raw reply	[flat|nested] 115+ messages in thread
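The load-tracking idea Suresh sketches could, very roughly, look like the toy decision rule below. All names and thresholds are invented for illustration; this is not the actual load-average patches, just the shape of the heuristic they would enable:

```python
def pick_wake_cpu(avg_runtime_us, smt_sibling_idle, remote_l2_idle,
                  migration_cost_us=50):
    """Toy placement: short-running tasks stay cache-hot near the waker,
    long-running tasks prefer a whole idle core behind a separate L2.
    Returns one of 'smt', 'remote_l2', 'target'."""
    if avg_runtime_us < migration_cost_us:
        # The run is shorter than the cost of refilling a cold cache:
        # a nearby (possibly SMT) wakeup wins.
        return "smt" if smt_sibling_idle else "target"
    # Long runs: resource contention dominates, so spread out if we can.
    if remote_l2_idle:
        return "remote_l2"
    return "smt" if smt_sibling_idle else "target"

print(pick_wake_cpu(10, smt_sibling_idle=True, remote_l2_idle=True))
print(pick_wake_cpu(500, smt_sibling_idle=True, remote_l2_idle=True))
```

With per-task runtime history, the same machinery could also inform the periodic balancer, addressing the active-migration issue mentioned above.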

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 17:21                                 ` Linus Torvalds
  2012-09-25 18:42                                   ` Borislav Petkov
@ 2012-09-26  2:00                                   ` Mike Galbraith
  2012-09-26  2:22                                     ` Linus Torvalds
  2012-09-26 16:32                                   ` Borislav Petkov
  2 siblings, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-26  2:00 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, 2012-09-25 at 10:21 -0700, Linus Torvalds wrote: 
> On Tue, Sep 25, 2012 at 10:00 AM, Borislav Petkov <bp@alien8.de> wrote:
> >
> > 3.6-rc6+tip/auto-latest-kill select_idle_sibling()
> 
> Is this literally just removing it entirely? Because apart from the
> latency spike at 4 procs (and the latency numbers look very noisy, so
> that's probably just noise), it looks clearly superior to everything
> else. On that benchmark, at least.

Yes.  On AMD, the best thing you can do for fast switchers AFAICT is
turn it off.  Different story on Intel.

> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case. Sad, because it really would be lovely to just remove that
> thing ;)

It _is_ irritating.  There's nohz, governors, and then on top of that
radically different cross-CPU data-blasting ability.  On Intel, it wins at
the same fast movers it demolishes on AMD.  Throttle it, and that goes
away, along with some other issues.

Or just kill it, then integrate what it does for you into a smarter,
lighter wakeup balance.. but then that has to climb those same hills.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26  2:00                                   ` Mike Galbraith
@ 2012-09-26  2:22                                     ` Linus Torvalds
  2012-09-26  2:42                                       ` Mike Galbraith
  2012-09-26 17:15                                       ` Borislav Petkov
  0 siblings, 2 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-26  2:22 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, Sep 25, 2012 at 7:00 PM, Mike Galbraith <efault@gmx.de> wrote:
>
> Yes.  On AMD, the best thing you can do for fast switchers AFAICT is
> turn it off.  Different story on Intel.

I doubt it's all that different on Intel.

Your patch showed improvement for Intel too on this same benchmark
(tbench). Borislav just went even further. I'd suggest testing that
patch on Intel too, and wouldn't be surprised at all if it shows
improvement there too.

It's pgbench that then regressed with your patch, and I suspect it
will regress with Borislav's too.

So I'm sure there are architecture differences (where HT in particular
probably changes optimal scheduling strategy, although I'd expect the
bulldozer approach to not be *that* different - but I don't know if BD
shows up as "HT siblings" or not, so dissimilar topology
interpretation may make it *look* very different).

So I suspect the architectural differences are smaller than you claim,
and it's much more about the loads in question.

You probably looked at the fact that the original report from Nikolay
says that the Intel E6300 hadn't regressed on pgbench, but I suspect
you didn't realize that E6300 is just a dual-core CPU without even HT.
So I doubt it's about "Intel vs AMD", it's more about "six cores" vs
"just two".

And the thing is - with just two cores, the fact that your patch
didn't change the Intel numbers is totally irrelevant. With two cores,
the whole "buddy_cpu" was equivalent to the old code, since there was
ever only one other core to begin with!

So AMD and Intel do have differences, but they aren't all that radical.

          Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 18:42                                   ` Borislav Petkov
  2012-09-25 19:08                                     ` Linus Torvalds
@ 2012-09-26  2:23                                     ` Mike Galbraith
  2012-09-26 17:17                                       ` Borislav Petkov
  1 sibling, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-26  2:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, 2012-09-25 at 20:42 +0200, Borislav Petkov wrote:

> Right, so why did we need it at all, in the first place? There has to be
> some reason for it.

Easy.  Take two communicating tasks.  Is an affine wakeup a good idea?
It depends on how much execution overlap there is.  Wake affine when
there is overlap larger than cache miss cost, and you just tossed
throughput into the bin.

select_idle_sibling() was originally about shared L2, where any overlap
was salvageable.  On modern processors with no shared L2, you have to
get past the cost, but the gain is still there.  Intel wins with loads
that AMD loses very badly on, so I can only guess that Intel must feed
caches more efficiently.  Dunno.  It just doesn't matter though, point
is that there is a win to be had in both cases, the breakeven just isn't
at the same point.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread
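Mike's breakeven argument can be written down as a back-of-envelope model. All numbers and names are invented purely for illustration; the only content is the inequality he states (affine wakeups win while the serialized overlap is smaller than the cross-CPU cache-refill cost):

```python
def affine_wakeup_wins(overlap_us, cache_refill_us):
    """An affine wakeup serializes the tasks' overlapping execution;
    a cross-CPU wakeup pays a cache refill instead.  Affine wins only
    while the serialized overlap is the cheaper of the two."""
    return overlap_us < cache_refill_us

# Same communicating pair, two hypothetical platforms that differ only
# in how expensive it is to refill the working set on another core:
for platform, refill_us in [("fast-refill", 2.0), ("slow-refill", 8.0)]:
    for overlap_us in (1.0, 5.0):
        choice = "affine" if affine_wakeup_wins(overlap_us, refill_us) else "spread"
        print(f"{platform}: {overlap_us:.0f}us overlap -> {choice}")
```

The point of the model is that both platforms have a crossover; they just reach it at different overlaps, which is consistent with "there is a win to be had in both cases, the breakeven just isn't at the same point."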

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26  2:22                                     ` Linus Torvalds
@ 2012-09-26  2:42                                       ` Mike Galbraith
  2012-09-26 17:15                                       ` Borislav Petkov
  1 sibling, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-26  2:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, 2012-09-25 at 19:22 -0700, Linus Torvalds wrote: 
> On Tue, Sep 25, 2012 at 7:00 PM, Mike Galbraith <efault@gmx.de> wrote:
> >
> > Yes.  On AMD, the best thing you can do for fast switchers AFAICT is
> > turn it off.  Different story on Intel.
> 
> I doubt it's all that different on Intel.

The behavioral difference is pretty large; the question is why.

<quoting self>
> Am I on the right track here? Or do you mean something completely
> different? Please explain it more verbosely.

A picture is worth a thousand words they say...

x3550 M3 E5620, SMT off, revert reverted, nohz off, zero knob twiddles,
governor=performance.

tbench    1     2     4
        398   820  1574  -select_idle_sibling() 
        454   902  1574  +select_idle_sibling()
        397   737  1556  +select_idle_sibling() virgin source

netperf TCP_RR, one unbound pair
114674   -select_idle_sibling()
131422   +select_idle_sibling()
111551   +select_idle_sibling() virgin source

These 1:1 buddy pairs scheduled cross core on E5620 feel no pain once
you kill the bouncing.  The bounce pain with 4 cores is _tons_ less
intense than on the 10 core Westmere, but it's still quite visible.  The
point though is that cross core doesn't hurt Westmere, but demolishes
Opteron for some reason.  (OTOH, bounce _helps_ fugly 1:N load.. grr;)
</quoting self>

> Your patch showed improvement for Intel too on this same benchmark
> (tbench). Borislav just went even further. I'd suggest testing that
> patch on Intel too, and wouldn't be surprised at all if it shows
> improvement there too.

See above.

> It's pgbench that then regressed with your patch, and I suspect it
> will regress with Borislav's too.

Yeah, strongly suspect you're right.

> You probably looked at the fact that the original report from Nikolay
> says that the Intel E6300 hadn't regressed on pgbench, but I suspect
> you didn't realize that E6300 is just a dual-core CPU without even HT.
> So I doubt it's about "Intel vs AMD", it's more about "six cores" vs
> "just two".

No, I knew, and yeah, it's about number of paths.

> And the thing is - with just two cores, the fact that your patch
> didn't change the Intel numbers is totally irrelevant. With two cores,
> the whole "buddy_cpu" was equivalent to the old code, since there was
> ever only one other core to begin with!
> 
> So AMD and Intel do have differences, but they aren't all that radical.

Looks fairly radical to me, but as noted in mail to Boris, it boils down
to "what does it cost, and where does the breakeven lie?".

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-25 17:21                                 ` Linus Torvalds
  2012-09-25 18:42                                   ` Borislav Petkov
  2012-09-26  2:00                                   ` Mike Galbraith
@ 2012-09-26 16:32                                   ` Borislav Petkov
  2012-09-26 18:19                                     ` Linus Torvalds
  2 siblings, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-26 16:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mike Galbraith, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> How does pgbench look? That's the one that apparently really wants to
> spread out, possibly due to user-level spinlocks. So I assume it will
> show the reverse pattern, with "kill select_idle_sibling" being the
> worst case. Sad, because it really would be lovely to just remove that
> thing ;)

Yep, correct. It hurts.

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance governor

tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance governor + kill select_idle_sibling

tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)


-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26  2:22                                     ` Linus Torvalds
  2012-09-26  2:42                                       ` Mike Galbraith
@ 2012-09-26 17:15                                       ` Borislav Petkov
  1 sibling, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-26 17:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mike Galbraith, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Tue, Sep 25, 2012 at 07:22:22PM -0700, Linus Torvalds wrote:
> So I'm sure there are architecture differences (where HT in particular
> probably changes optimal scheduling strategy, although I'd expect
> the bulldozer approach to not be *that* different - but I don't know
> if BD shows up as "HT siblings" or not, so dissimilar topology
> interpretation may make it *look* very different).

Right, those cores sharing an L2 are thread siblings on BD:

$ grep . /sys/devices/system/cpu/cpu0/topology/*
/sys/devices/system/cpu/cpu0/topology/core_id:0
/sys/devices/system/cpu/cpu0/topology/core_siblings:ff
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-7
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings:03
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0-1

much like HT siblings on this single-socket Sandybridge: 

$ grep . /sys/devices/system/cpu/cpu0/topology/*
/sys/devices/system/cpu/cpu0/topology/core_id:0
/sys/devices/system/cpu/cpu0/topology/core_siblings:ff
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0-7
/sys/devices/system/cpu/cpu0/topology/physical_package_id:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings:11
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0,4

Although I don't know whether those thread siblings on this SB box are
actual HT siblings, sharing almost all resources, judging by the core
ids.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26  2:23                                     ` Mike Galbraith
@ 2012-09-26 17:17                                       ` Borislav Petkov
  0 siblings, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-26 17:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linus Torvalds, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Wed, Sep 26, 2012 at 04:23:26AM +0200, Mike Galbraith wrote:
> On Tue, 2012-09-25 at 20:42 +0200, Borislav Petkov wrote:
> 
> > Right, so why did we need it all, in the first place? There has to be
> > some reason for it.
> 
> Easy.  Take two communicating tasks.  Is an affine wakeup a good idea?
> It depends on how much execution overlap there is.  Wake affine when
> there is overlap larger than cache miss cost, and you just tossed
> throughput into the bin.
> 
> select_idle_sibling() was originally about shared L2, where any overlap
> was salvageable.  On modern processors with no shared L2,

Oh, but we do have shared L2s in the Bulldozer uarch (a subset of the
modern AMD processors :)).

> you have to get past the cost, but the gain is still there. Intel
> wins with loads that AMD loses very badly on, so I can only guess that
> Intel must feed caches more efficiently. Dunno. It just doesn't matter
> though, point is that there is a win to be had in both cases, the
> breakeven just isn't at the same point.

Well, I guess selecting the proper core in the hierarchy depending on
the workload is one of those hard problems.

Teaching select_idle_sibling() to detect the breakeven point and act
accordingly would not be that easy then...

Thanks.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26 16:32                                   ` Borislav Petkov
@ 2012-09-26 18:19                                     ` Linus Torvalds
  2012-09-26 21:37                                       ` Borislav Petkov
                                                         ` (2 more replies)
  0 siblings, 3 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-26 18:19 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Peter Zijlstra, Mike Galbraith,
	Mel Gorman, Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

[-- Attachment #1: Type: text/plain, Size: 3242 bytes --]

On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
>> How does pgbench look? That's the one that apparently really wants to
>> spread out, possibly due to user-level spinlocks. So I assume it will
>> show the reverse pattern, with "kill select_idle_sibling" being the
>> worst case. Sad, because it really would be lovely to just remove that
>> thing ;)
>
> Yep, correct. It hurts.

I'm *so* not surprised.

That said, I think your "kill select_idle_sibling()" one was
interesting, but the wrong kind of "get rid of that logic".

It always selected target_cpu, but the fact is, that doesn't really
sound very sane. The target cpu is either the previous cpu or the
current cpu, depending on whether they should be balanced or not. But
that still doesn't make any *sense*.

In fact, the whole select_idle_sibling() logic makes no sense
what-so-ever to me. It seems to be total garbage.

For example, it starts with the maximum target scheduling domain, and
works its way in over the scheduling groups within that domain. What
the f*ck is the logic of that kind of crazy thing? It never makes
sense to look at a biggest domain first. If you want to be close to
something, you want to look at the *smallest* domain first. But
because it looks at things in the wrong order, it then needs to have
that inner loop saying "does this group actually cover the cpu I am
interested in?"

Please tell me I am mis-reading this?

But starting from the biggest ("llc" group) is wrong *anyway*, since
it means that it starts looking at the L3 level, and then if it finds
an acceptable cpu inside that level, it's all done. But that's
*crazy*. Once again, it's much better to try to find an idle sibling
*closeby* rather than at the L3 level. No? So once again, we should
start at the inner level and if we can't find something really close,
we work our way out, rather than starting from the outer level and
working our way in.

If I read the code correctly, we can have both "prev" and "cpu" in the
same L2 domain, but because we start looking at the L3 domain, we may
end up picking another "affine" CPU that isn't even sharing L2's
*before* we pick one that actually *is* sharing L2's with the target
CPU. But that code is confusing enough with the scheduler groups inner
loop that maybe I am mis-reading it entirely.

There are other oddities in select_idle_sibling() too, if I read
things correctly.

For example, it uses "cpu_idle(target)", but if we're actively trying
to move to the current CPU (ie wake_affine() returned true), then
target is the current cpu, which is certainly *not* going to be idle
for a sync wakeup. So it should actually check whether it's a sync
wakeup and the only thing pending is that synchronous waker, no?

Maybe I'm missing something really fundamental, but it all really does
look very odd to me.

Attached is a totally untested and probably very buggy patch, so
please consider it a "shouldn't we do something like this instead" RFC
rather than anything serious. So this RFC patch is more a "ok, the
patch tries to fix the above oddnesses, please tell me where I went
wrong" than anything else.

Comments?

                    Linus

[-- Attachment #2: patch.diff --]
[-- Type: application/octet-stream, Size: 3060 bytes --]

 kernel/sched/fair.c | 80 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 43 insertions(+), 37 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96e2b18b6283..25817cff72c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2632,53 +2632,53 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int target, struct sched_domain *affine)
 {
-	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
-	struct sched_domain *sd;
-	struct sched_group *sg;
+	struct sched_domain *sd, *llc_sd;
 	int i;
 
 	/*
 	 * If the task is going to be woken-up on this cpu and if it is
 	 * already idle, then it is the right target.
 	 */
-	if (target == cpu && idle_cpu(cpu))
-		return cpu;
-
-	/*
-	 * If the task is going to be woken-up on the cpu where it previously
-	 * ran and if it is currently idle, then it the right target.
-	 */
-	if (target == prev_cpu && idle_cpu(prev_cpu))
-		return prev_cpu;
+	if (idle_cpu(target))
+		return target;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
 	 */
-	sd = rcu_dereference(per_cpu(sd_llc, target));
-	for_each_lower_domain(sd) {
-		sg = sd->groups;
-		do {
-			if (!cpumask_intersects(sched_group_cpus(sg),
-						tsk_cpus_allowed(p)))
-				goto next;
+	llc_sd = rcu_dereference(per_cpu(sd_llc, target));
+	for_each_domain(target, sd) {
+		for_each_cpu(i, sched_domain_span(sd)) {
+			if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
+				continue;
+			if (!idle_cpu(i))
+				continue;
+			return  i;
+		}
+		/* Don't iterate past the last level cache domain */
+		if (sd == llc_sd)
+			break;
+		/* Don't iterate past the affinity level */
+		if (sd == affine)
+			break;
+	}
+	return -1;
+}
 
-			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (!idle_cpu(i))
-					goto next;
-			}
+/*
+ * For synchronous wake-ups: is the currently running
+ * process the only pending process of this CPU runqueue?
+ */
+static inline int single_running(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
-next:
-			sg = sg->next;
-		} while (sg != sd->groups);
-	}
-done:
-	return target;
+#ifdef CONFIG_SMP
+	if (!llist_empty(&rq->wake_list))
+		return 0;
+#endif
+	return cpu_rq(cpu)->nr_running <= 1;
 }
 
 /*
@@ -2759,11 +2759,17 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	}
 
 	if (affine_sd) {
-		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync))
+		if (cpu == prev_cpu || wake_affine(affine_sd, p, sync)) {
 			prev_cpu = cpu;
+			if (sync && single_running(cpu)) {
+				new_cpu = cpu;
+				goto unlock;
+			}
+		}
 
-		new_cpu = select_idle_sibling(p, prev_cpu);
-		goto unlock;
+		new_cpu = select_idle_sibling(p, prev_cpu, affine_sd);
+		if (new_cpu >= 0)
+			goto unlock;
 	}
 
 	while (sd) {

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26 18:19                                     ` Linus Torvalds
@ 2012-09-26 21:37                                       ` Borislav Petkov
  2012-09-27  5:09                                         ` Mike Galbraith
  2012-09-27  7:17                                         ` david
  2012-09-27  4:32                                       ` Mike Galbraith
  2012-09-27  8:21                                       ` Peter Zijlstra
  2 siblings, 2 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-26 21:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Mike Galbraith, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Wed, Sep 26, 2012 at 11:19:42AM -0700, Linus Torvalds wrote:
> I'm *so* not surprised.
> 
> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".

Yeah.

> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
> 
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.
> 
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
> 
> Please tell me I am mis-reading this?

First of all, I'm so *not* a scheduler guy so take this with a great
pinch of salt.

The way I understand it is, you either want to share L2 with a process,
because, for example, both working sets fit in the L2 and/or there's
some sharing which saves you moving everything over the L3. This is
where selecting a core on the same L2 is actually a good thing.

Or, they're too big to fit into the L2 and they start kicking each-other
out. Then you want to spread them out to different L2s - i.e., different
HT groups in Intel-speak.

Oh, and then there's the userspace spinlocks thingie where Mike's patch
hurts us.

Btw, Mike, you can jump in anytime :-)

So I'd say, this is the hard scheduling problem where fitting the
workload to the architecture doesn't make everyone happy.

A crazy thought: one could go and sample tasks while running their
timeslices with the perf counters to know exactly what type of workload
we're looking at. I.e., do I have a large number of L2 evictions? Yes,
then spread them out. No, then select the other core on the L2. And so
on.

> But starting from the biggest ("llc" group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it
> finds an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No?

Exactly my thoughts a couple of days ago but see above.

> So once again, we should start at the inner level and if we can't find
> something really close, we work our way out, rather than starting from
> the outer level and working our way in.
>
> If I read the code correctly, we can have both "prev" and "cpu" in
> the same L2 domain, but because we start looking at the L3 domain, we
> may end up picking another "affine" CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.
>
> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
>
> For example, it uses "cpu_idle(target)", but if we're actively trying
> to move to the current CPU (ie wake_affine() returned true), then
> target is the current cpu, which is certainly *not* going to be idle
> for a sync wakeup. So it should actually check whether it's a sync
> wakeup and the only thing pending is that synchronous waker, no?
>
> Maybe I'm missing something really fundamental, but it all really does
> look very odd to me.
>
> Attached is a totally untested and probably very buggy patch, so
> please consider it a "shouldn't we do something like this instead" RFC
> rather than anything serious. So this RFC patch is more a "ok, the
> patch tries to fix the above oddnesses, please tell me where I went
> wrong" than anything else.
>
> Comments?

Let me look at it tomorrow, on a fresh head. Too late here now.

Thanks.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26 18:19                                     ` Linus Torvalds
  2012-09-26 21:37                                       ` Borislav Petkov
@ 2012-09-27  4:32                                       ` Mike Galbraith
  2012-09-27  8:21                                       ` Peter Zijlstra
  2 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-27  4:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote: 
> On Wed, Sep 26, 2012 at 9:32 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Tue, Sep 25, 2012 at 10:21:28AM -0700, Linus Torvalds wrote:
> >> How does pgbench look? That's the one that apparently really wants to
> >> spread out, possibly due to user-level spinlocks. So I assume it will
> >> show the reverse pattern, with "kill select_idle_sibling" being the
> >> worst case. Sad, because it really would be lovely to just remove that
> >> thing ;)
> >
> > Yep, correct. It hurts.
> 
> I'm *so* not surprised.

Any other result would have induced mushroom cloud, glazed eyes, and jaw
meets floor here.

> That said, I think your "kill select_idle_sibling()" one was
> interesting, but the wrong kind of "get rid of that logic".
> 
> It always selected target_cpu, but the fact is, that doesn't really
> sound very sane. The target cpu is either the previous cpu or the
> current cpu, depending on whether they should be balanced or not. But
> that still doesn't make any *sense*.
> 
> In fact, the whole select_idle_sibling() logic makes no sense
> what-so-ever to me. It seems to be total garbage.

Oh, it's not _that_ bad.  It does have its troubles, but if it were
complete shite it wouldn't make the numbers that I showed, and wouldn't
make the even better numbers it does with some other loads.

> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. If you want to be close to
> something, you want to look at the *smallest* domain first. But
> because it looks at things in the wrong order, it then needs to have
> that inner loop saying "does this group actually cover the cpu I am
> interested in?"
> 
> Please tell me I am mis-reading this?

We start at MC to get the tbench win I showed (Intel) vs loss at SMT.
Riddle me this, why does that produce the wins I showed?  I'm still
hoping someone can shed some light on why the heck there's such a
disparity in processor behaviors.

> But starting from the biggest ("llc" group) is wrong *anyway*, since
> it means that it starts looking at the L3 level, and then if it finds
> an acceptable cpu inside that level, it's all done. But that's
> *crazy*. Once again, it's much better to try to find an idle sibling
> *closeby* rather than at the L3 level. No? So once again, we should
> start at the inner level and if we can't find something really close,
> we work our way out, rather than starting from the outer level and
> working our way in.

Domains on my E5620 look like so when SMT is enabled (seldom):

[    0.473692] CPU0 attaching sched-domain:
[    0.477616]  domain 0: span 0,4 level SIBLING
[    0.481982]   groups: 0 (cpu_power = 589) 4 (cpu_power = 589)
[    0.487805]   domain 1: span 0-7 level MC
[    0.491829]    groups: 0,4 (cpu_power = 1178) 1,5 (cpu_power = 1178) 2,6 (cpu_power = 1178) 3,7 (cpu_power = 1178)
...

I usually have SMT off, which gives me more oomph at the bottom end (smt
affects turboboost gizmo methinks), have only one domain, so say I'm
waking from CPU0.  With cross wire thingy, we'll always wake to CPU1 if
idle.  That demonstrably works well despite it being L3.  Box coughs up
wins at fast movers I too would expect L3 to lose at.  If L2 is my only
viable target for fast movers, I'm stuck with SMT siblings, which I have
measured.  They aren't wonderful for this.  They do improve max
throughput markedly though, so aren't a complete waste of silicon ;-)

I wonder what domains look like on Bulldog. (boot w. sched_debug)

> If I read the code correctly, we can have both "prev" and "cpu" in the
> same L2 domain, but because we start looking at the L3 domain, we may
> end up picking another "affine" CPU that isn't even sharing L2's
> *before* we pick one that actually *is* sharing L2's with the target
> CPU. But that code is confusing enough with the scheduler groups inner
> loop that maybe I am mis-reading it entirely.

Yup, and on Intel, it manages to not suck.

> There are other oddities in select_idle_sibling() too, if I read
> things correctly.
> 
> For example, it uses "cpu_idle(target)", but if we're actively trying
> to move to the current CPU (ie wake_affine() returned true), then
> target is the current cpu, which is certainly *not* going to be idle
> for a sync wakeup. So it should actually check whether it's a sync
> wakeup and the only thing pending is that synchronous waker, no?

Your logic is fine, but the missing element is that the sync wakeup hint
doesn't imply as much as you think it does.

> Maybe I'm missing something really fundamental, but it all really does
> look very odd to me.

Yeah, the sync hint.

> Attached is a totally untested and probably very buggy patch, so
> please consider it a "shouldn't we do something like this instead" RFC
> rather than anything serious. So this RFC patch is more a "ok, the
> patch tries to fix the above oddnesses, please tell me where I went
> wrong" than anything else.
> 
> Comments?

You busted it all to pieces with the sync hint.  Take for example mysql
+oltp, I've run that a zillion times.  It does sync wakeups iirc, been
quite a while, but definitely produced wins on my Q6600.  Those wins can
only exist if there is reclaimable overlap.  Sure, the Q6600 is taking
advantage of its shared L2, but that just moves the breakeven.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26 21:37                                       ` Borislav Petkov
@ 2012-09-27  5:09                                         ` Mike Galbraith
  2012-09-27  5:18                                           ` Borislav Petkov
  2012-09-27  5:47                                           ` Ingo Molnar
  2012-09-27  7:17                                         ` david
  1 sibling, 2 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-27  5:09 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Wed, 2012-09-26 at 23:37 +0200, Borislav Petkov wrote:

> The way I understand it is, you either want to share L2 with a process,
> because, for example, both working sets fit in the L2 and/or there's
> some sharing which saves you moving everything over the L3. This is
> where selecting a core on the same L2 is actually a good thing.

Yeah, and if the wakee can't get to the L2 hot data instantly, it may be
better to let wakee drag the data to an instantly accessible spot.

> Or, they're too big to fit into the L2 and they start kicking each-other
> out. Then you want to spread them out to different L2s - i.e., different
> HT groups in Intel-speak.
> 
> Oh, and then there's the userspace spinlocks thingie where Mike's patch
> hurts us.
> 
> Btw, Mike, you can jump in anytime :-)

I think the pgbench problem is more about latency for the 1 in 1:N than
spinlocks.

> So I'd say, this is the hard scheduling problem where fitting the
> workload to the architecture doesn't make everyone happy.

Yup.  I find it hard at least.

> A crazy thought: one could go and sample tasks while running their
> timeslices with the perf counters to know exactly what type of workload
> we're looking at. I.e., do I have a large number of L2 evictions? Yes,
> then spread them out. No, then select the other core on the L2. And so
> on.

Hm.  That sampling better be really cheap.  Might help... but how does
that affect pgbench and ilk that must spread regardless of footprints.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  5:09                                         ` Mike Galbraith
@ 2012-09-27  5:18                                           ` Borislav Petkov
  2012-09-27  5:44                                             ` Mike Galbraith
  2012-09-27  5:47                                           ` Ingo Molnar
  1 sibling, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-27  5:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Linus Torvalds, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
> > The way I understand it is, you either want to share L2 with a process,
> > because, for example, both working sets fit in the L2 and/or there's
> > some sharing which saves you moving everything over the L3. This is
> > where selecting a core on the same L2 is actually a good thing.
> 
> Yeah, and if the wakee can't get to the L2 hot data instantly, it may be
> better to let wakee drag the data to an instantly accessible spot.

Yep, then moving it to another L2 is the same.

[ … ]

> > A crazy thought: one could go and sample tasks while running their
> > timeslices with the perf counters to know exactly what type of workload
> > we're looking at. I.e., do I have a large number of L2 evictions? Yes,
> > then spread them out. No, then select the other core on the L2. And so
> > on.
> 
> Hm.  That sampling better be really cheap.  Might help...

Yeah, that's why I said sampling and not run the perfcounters during
every timeslice.

But if you count the proper events, you should be able to know exactly
what the workload is doing (compute-bound, io-bound, contention, etc...)

> but how does that affect pgbench and ilk that must spread regardless
> of footprints.

Well, how do you measure latency of the 1 process in the 1:N case? Maybe
pipeline stalls of the 1 along with some way to recognize it is the 1 in
the 1:N case.

Hmm.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  5:18                                           ` Borislav Petkov
@ 2012-09-27  5:44                                             ` Mike Galbraith
  0 siblings, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-27  5:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Thu, 2012-09-27 at 07:18 +0200, Borislav Petkov wrote: 
> On Thu, Sep 27, 2012 at 07:09:28AM +0200, Mike Galbraith wrote:
>  but how does that affect pgbench and ilk that must spread regardless
> > of footprints.
> 
> Well, how do you measure latency of the 1 process in the 1:N case? Maybe
> pipeline stalls of the 1 along with some way to recognize it is the 1 in
> the 1:N case.

Best is to let userland tell us it's critical.  Smarts are expensive.  A
class of its own (my wakees do _not_ preempt me, and I don't care that
you think this is unfair to the unwashed masses who will otherwise
_starve_ without me feeding them) makes sense for these guys.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  5:09                                         ` Mike Galbraith
  2012-09-27  5:18                                           ` Borislav Petkov
@ 2012-09-27  5:47                                           ` Ingo Molnar
  2012-09-27  5:59                                             ` Ingo Molnar
  2012-09-27  6:34                                             ` Mike Galbraith
  1 sibling, 2 replies; 115+ messages in thread
From: Ingo Molnar @ 2012-09-27  5:47 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha


* Mike Galbraith <efault@gmx.de> wrote:

> I think the pgbench problem is more about latency for the 1 in 
> 1:N than spinlocks.

So my understanding of the psql workload is that basically we've 
got a central psql proxy process that is distributing work to 
worker psql processes. If a freshly woken worker process ever 
preempts the central proxy process then it is preventing a lot 
of new work from getting distributed.

Correct?

So the central proxy psql process is 'much more important' to 
run than any of the worker processes - an importance that is not 
(currently) visible from the behavioral statistics the scheduler 
keeps on tasks.

So the scheduler has the following problem here: a new wakee 
might be starved enough and the proxy might have run long enough 
to really justify the preemption here and now. The buddy 
statistics help avoid some of these cases - but not all and the 
difference is measurable.

Yet the 'best' way for psql to run is for this proxy process to 
never be preempted. Your SCHED_BATCH experiments confirmed that.

The way remote CPU selection affects it is that if we ever get 
more aggressive in selecting a remote CPU then we, as a side 
effect, also reduce the chance of harmful preemption of the 
central proxy psql process.

So in that sense sibling selection is somewhat of an indirect 
red herring: it really only helps psql indirectly by preventing 
the harmful preemption. It also, somewhat paradoxically argues 
for suboptimal code: for example tearing apart buddies is 
beneficial in the psql workload, because it also allows the more 
important part of the buddy to run more (the proxy).

In that sense the *real* problem isn't even parallelism (although 
we obviously should improve the decisions there - and the logic 
has suffered in the past from the psql dilemma outlined above), 
but whether the scheduler can (and should) identify the central 
proxy and keep it running as much as possible, deprioritizing 
fairness, wakeup buddies, runtime overlap and cache affinity 
considerations.

There's two broad solutions that I can see:

 - Add a kernel solution to somehow identify 'central' processes
   and bias them. Xorg is a similar kind of process, so it would
   help other workloads as well. That way lie dragons, but might
   be worth an attempt or two. We already try to do a couple of
   robust metrics, like overlap statistics to identify buddies. 

 - Let user-space occasionally identify its important (and less
   important) tasks - say psql could mark its worker processes as
   SCHED_BATCH and keep its central process(es) higher prio. A
   single line of obvious code in 100 KLOCs of user-space code.

Just to confirm, if you turn off all preemption via a hack 
(basically if you turn SCHED_OTHER into SCHED_BATCH), does psql 
perform and scale much better, with the quality of sibling 
selection and spreading of processes only being a secondary 
effect?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  5:47                                           ` Ingo Molnar
@ 2012-09-27  5:59                                             ` Ingo Molnar
  2012-09-27  6:34                                             ` Mike Galbraith
  1 sibling, 0 replies; 115+ messages in thread
From: Ingo Molnar @ 2012-09-27  5:59 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha


* Ingo Molnar <mingo@kernel.org> wrote:

> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > I think the pgbench problem is more about latency for the 1 
> > in 1:N than spinlocks.
> 
> So my understanding of the psql workload is that basically 
> we've got a central psql proxy process that is distributing 
> work to worker psql processes. If a freshly woken worker 
> process ever preempts the central proxy process then it is 
> preventing a lot of new work from getting distributed.

Also, I'd like to stress that despite the optimization dilemma, 
the psql workload is *important*. More important than tbench - 
because psql does some real SQL work and it also matches the 
design of many real desktop and server workloads.

So if indeed the above is the main problem of psql it would be 
nice to add a 'perf bench sched proxy' testcase that emulates it 
- that would remove psql version dependencies and would ease the 
difficulty of running the benchmarks.

We already have 'perf bench sched pipe' and 'perf bench sched 
messaging' - but neither shows the psql pattern currently.

I suspect a couple of udelay()s in the messaging benchmark would 
do the trick? The wakeup work there already matches much of what 
psql looks like.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  5:47                                           ` Ingo Molnar
  2012-09-27  5:59                                             ` Ingo Molnar
@ 2012-09-27  6:34                                             ` Mike Galbraith
  2012-09-27  6:41                                               ` Ingo Molnar
  1 sibling, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-27  6:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, 2012-09-27 at 07:47 +0200, Ingo Molnar wrote: 
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > I think the pgbench problem is more about latency for the 1 in 
> > 1:N than spinlocks.
> 
> So my understanding of the psql workload is that basically we've 
> got a central psql proxy process that is distributing work to 
> worker psql processes. If a freshly woken worker process ever 
> preempts the central proxy process then it is preventing a lot 
> of new work from getting distributed.
> 
> Correct?

Yeah, that's my understanding of the thing, and I played with it quite a
bit in the past (only refreshed memories briefly in present).

> So the central proxy psql process is 'much more important' to 
> run than any of the worker processes - an importance that is not 
> (currently) visible from the behavioral statistics the scheduler 
> keeps on tasks.

Yeah.  We had the adaptive waker thing, but it stopped being a winner at
the one load it originally did help quite a lot, and it didn't help
pgbench all that much in its then form anyway iirc.

> So the scheduler has the following problem here: a new wakee 
> might be starved enough and the proxy might have run long enough 
> to really justify the preemption here and now. The buddy 
> statistics help avoid some of these cases - but not all and the 
> difference is measurable.
> 
> Yet the 'best' way for psql to run is for this proxy process to 
> never be preempted. Your SCHED_BATCH experiments confirmed that.

Yes.

> The way remote CPU selection affects it is that if we ever get 
> more aggressive in selecting a remote CPU then we, as a side 
> effect, also reduce the chance of harmful preemption of the 
> central proxy psql process.

Right.

> So in that sense sibling selection is somewhat of an indirect 
> red herring: it really only helps psql indirectly by preventing 
> the harmful preemption. It also, somewhat paradoxically argues 
> for suboptimal code: for example tearing apart buddies is 
> beneficial in the psql workload, because it also allows the more 
> important part of the buddy to run more (the proxy).

Yes, I believe preemption dominates, but it's not alone, you can see
that in the numbers.

> In that sense the *real* problem isn't even parallelism (although 
> we obviously should improve the decisions there - and the logic 
> has suffered in the past from the psql dilemma outlined above), 
> but whether the scheduler can (and should) identify the central 
> proxy and keep it running as much as possible, deprioritizing 
> fairness, wakeup buddies, runtime overlap and cache affinity 
> considerations.
> 
> There's two broad solutions that I can see:
> 
>  - Add a kernel solution to somehow identify 'central' processes
>    and bias them. Xorg is a similar kind of process, so it would
>    help other workloads as well. That way lie dragons, but might
>    be worth an attempt or two. We already try to do a couple of
>    robust metrics, like overlap statistics to identify buddies.

What we do now works well for X and friends I think, because there
aren't so many buddies.  It might work better though, and for the same
reasons.  I've in fact [re]invented a SCHED_SERVER class a few times,
but never one that survived my own scrutiny for long.

Arrr, here there be dragons is true ;-)

> - Let user-space occasionally identify its important (and less
>    important) tasks - say psql could mark its worker processes as
>    SCHED_BATCH and keep its central process(es) higher prio. A
>    single line of obvious code in 100 KLOCs of user-space code.
> 
> Just to confirm, if you turn off all preemption via a hack 
> (basically if you turn SCHED_OTHER into SCHED_BATCH), does psql 
> perform and scale much better, with the quality of sibling 
> selection and spreading of processes only being a secondary 
> effect?

That has always been the case here.  Preemption dominates.  Others
should play with it too, and let their boxen speak.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  6:34                                             ` Mike Galbraith
@ 2012-09-27  6:41                                               ` Ingo Molnar
  2012-09-27  6:54                                                 ` Mike Galbraith
  0 siblings, 1 reply; 115+ messages in thread
From: Ingo Molnar @ 2012-09-27  6:41 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha


* Mike Galbraith <efault@gmx.de> wrote:

> > Just to confirm, if you turn off all preemption via a hack 
> > (basically if you turn SCHED_OTHER into SCHED_BATCH), does 
> > psql perform and scale much better, with the quality of 
> > sibling selection and spreading of processes only being a 
> > secondary effect?
> 
> That has always been the case here.  Preemption dominates.

Yes, so we get the best psql performance if we allow the central 
proxy process to dominate a single CPU (IIRC it can easily go up 
to 100% CPU utilization on that CPU - it is what determines max 
psql throughput), and not let any worker run there much, right?

> Others should play with it too, and let their boxen speak.

Do you have an easy-to-apply hack patch by chance that has the 
effect of turning off all such preemption, which people could 
try?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  6:41                                               ` Ingo Molnar
@ 2012-09-27  6:54                                                 ` Mike Galbraith
  2012-09-27  7:10                                                   ` Ingo Molnar
  0 siblings, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-27  6:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, 2012-09-27 at 08:41 +0200, Ingo Molnar wrote: 
> * Mike Galbraith <efault@gmx.de> wrote:
> 
> > > Just to confirm, if you turn off all preemption via a hack 
> > > (basically if you turn SCHED_OTHER into SCHED_BATCH), does 
> > > psql perform and scale much better, with the quality of 
> > > sibling selection and spreading of processes only being a 
> > > secondary effect?
> > 
> > That has always been the case here.  Preemption dominates.
> 
> Yes, so we get the best psql performance if we allow the central 
> proxy process to dominate a single CPU (IIRC it can easily go up 
> to 100% CPU utilization on that CPU - it is what determines max 
> psql throughput), and not let any worker run there much, right?

Running the thing RT didn't cut it iirc (will try that again).  For RT,
we won't look for an empty spot on wakeup, we'll just squash an ant.

> > Others should play with it too, and let their boxen speak.
> 
> Do you have an easy-to-apply hack patch by chance that has the 
> effect of turning off all such preemption, which people could 
> try?

They don't need any hacks, all they have to do is start postgresql
SCHED_BATCH, then run pgbench the same way.

I use schedctl, but in chrt speak, chrt -b 0 /etc/init.d/postgresql
start, and then the same for pgbench itself.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  6:54                                                 ` Mike Galbraith
@ 2012-09-27  7:10                                                   ` Ingo Molnar
  2012-09-27 16:25                                                     ` Borislav Petkov
  2012-09-27 17:44                                                     ` Linus Torvalds
  0 siblings, 2 replies; 115+ messages in thread
From: Ingo Molnar @ 2012-09-27  7:10 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Borislav Petkov, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha


* Mike Galbraith <efault@gmx.de> wrote:

> > Do you have an easy-to-apply hack patch by chance that has 
> > the effect of turning off all such preemption, which people 
> > could try?
> 
> They don't need any hacks, all they have to do is start 
> postgresql SCHED_BATCH, then run pgbench the same way.
> 
> I use schedctl, but in chrt speak, chrt -b 0 
> /etc/init.d/postgresql start, and then the same for pgbench 
> itself.

Just in case someone prefers patches to user-space approaches (I 
certainly do!), here's one that turns off wakeup driven 
preemption by default.

It can be turned back on via:

  echo WAKEUP_PREEMPTION > /debug/sched_features

and off again via:

  echo NO_WAKEUP_PREEMPTION > /debug/sched_features 

(the patch is completely untested and such.)

The theory would be that this patch fixes psql performance, with 
CPU selection being a measurable but second order of magnitude 
effect. How well does practice match theory in this case?

Thanks,

	Ingo

---------
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..f936552 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2907,7 +2907,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 	 * is driven by the tick):
 	 */
-	if (unlikely(p->policy != SCHED_NORMAL))
+	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
 		return;
 
 	find_matching_se(&se, &pse);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..e68e69a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -32,6 +32,11 @@ SCHED_FEAT(LAST_BUDDY, true)
 SCHED_FEAT(CACHE_HOT_BUDDY, true)
 
 /*
+ * Allow wakeup-time preemption of the current task:
+ */
+SCHED_FEAT(WAKEUP_PREEMPTION, false)
+
+/*
  * Use arch dependent cpu power functions
  */
 SCHED_FEAT(ARCH_POWER, true)

^ permalink raw reply related	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26 21:37                                       ` Borislav Petkov
  2012-09-27  5:09                                         ` Mike Galbraith
@ 2012-09-27  7:17                                         ` david
  2012-09-27  7:55                                           ` Mike Galbraith
  2012-09-27 10:20                                           ` Borislav Petkov
  1 sibling, 2 replies; 115+ messages in thread
From: david @ 2012-09-27  7:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Peter Zijlstra, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Wed, 26 Sep 2012, Borislav Petkov wrote:

>> It always selected target_cpu, but the fact is, that doesn't really
>> sound very sane. The target cpu is either the previous cpu or the
>> current cpu, depending on whether they should be balanced or not. But
>> that still doesn't make any *sense*.
>>
>> In fact, the whole select_idle_sibling() logic makes no sense
>> what-so-ever to me. It seems to be total garbage.
>>
>> For example, it starts with the maximum target scheduling domain, and
>> works its way in over the scheduling groups within that domain. What
>> the f*ck is the logic of that kind of crazy thing? It never makes
>> sense to look at a biggest domain first. If you want to be close to
>> something, you want to look at the *smallest* domain first. But
>> because it looks at things in the wrong order, it then needs to have
>> that inner loop saying "does this group actually cover the cpu I am
>> interested in?"
>>
>> Please tell me I am mis-reading this?
>
> First of all, I'm so *not* a scheduler guy so take this with a great
> pinch of salt.
>
> The way I understand it is, you either want to share L2 with a process,
> because, for example, both working sets fit in the L2 and/or there's
> some sharing which saves you moving everything over the L3. This is
> where selecting a core on the same L2 is actually a good thing.
>
> Or, they're too big to fit into the L2 and they start kicking each-other
> out. Then you want to spread them out to different L2s - i.e., different
> HT groups in Intel-speak.

an observation from an outsider here.

if you do overload an L2 cache, then the core will be busy all the time and 
you will end up migrating a task away from that core.

It seems to me that trying to figure out if you are going to overload the 
L2 is an impossible task, so just assume that it will all fit, and the 
worst case is you have one balancing cycle where you can't do as much work 
and then the normal balancing will kick in and move something anyway.

over the long term, the work lost due to not moving optimally right away 
is probably much less than the work lost due to trying to figure out the 
perfect thing to do.

and since the perfect thing to do is going to be both workload and chip 
specific, trying to model that in your decision making is a lost cause.

David Lang

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  7:17                                         ` david
@ 2012-09-27  7:55                                           ` Mike Galbraith
  2012-09-27 10:20                                           ` Borislav Petkov
  1 sibling, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-27  7:55 UTC (permalink / raw)
  To: david
  Cc: Borislav Petkov, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, 2012-09-27 at 00:17 -0700, david@lang.hm wrote:

> over the long term, the work lost due to not moving optimally right away 
> is probably much less than the work lost due to trying to figure out the 
> perfect thing to do.

Yeah, "Perfect is the enemy of good" definitely applies.  Once you're
ramped, less is more.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-26 18:19                                     ` Linus Torvalds
  2012-09-26 21:37                                       ` Borislav Petkov
  2012-09-27  4:32                                       ` Mike Galbraith
@ 2012-09-27  8:21                                       ` Peter Zijlstra
  2012-09-27 16:48                                         ` david
  2 siblings, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-27  8:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Mike Galbraith, Mel Gorman, Nikolay Ulyanitsky,
	linux-kernel, Andreas Herrmann, Andrew Morton, Thomas Gleixner,
	Ingo Molnar, Suresh Siddha

On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
> 
> For example, it starts with the maximum target scheduling domain, and
> works its way in over the scheduling groups within that domain. What
> the f*ck is the logic of that kind of crazy thing? It never makes
> sense to look at a biggest domain first. 

That's about SMT: it was felt that you don't want SMT siblings first
because typically SMT siblings are somewhat under-powered compared to
actual cores.

Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
only has the LLC domain, if you want more we'll need to fix that. For
now it's a fixed:

 SMT
 MC (llc)
 CPU (package/machine-for-!numa)
 NUMA

So in your patch, your for_each_domain() loop will really only do the
SMT/MC levels and prefer an SMT sibling over an idle core.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  7:17                                         ` david
  2012-09-27  7:55                                           ` Mike Galbraith
@ 2012-09-27 10:20                                           ` Borislav Petkov
  2012-09-27 13:38                                             ` Mike Galbraith
  2012-09-27 16:55                                             ` david
  1 sibling, 2 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-27 10:20 UTC (permalink / raw)
  To: david
  Cc: Linus Torvalds, Peter Zijlstra, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, Sep 27, 2012 at 12:17:22AM -0700, david@lang.hm wrote:
> It seems to me that trying to figure out if you are going to
> overload the L2 is an impossible task, so just assume that it will
> all fit, and the worst case is you have one balancing cycle where
> you can't do as much work and then the normal balancing will kick in
> and move something anyway.

Right, and this implies that when the load balancer runs, it will
definitely move the task away from the L2. But what do I do in the cases
where the two tasks don't overload the L2 and it is actually beneficial
to keep them there? How does the load balancer know that?

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 10:20                                           ` Borislav Petkov
@ 2012-09-27 13:38                                             ` Mike Galbraith
  2012-09-27 16:55                                             ` david
  1 sibling, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-27 13:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: david, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, 2012-09-27 at 12:20 +0200, Borislav Petkov wrote: 
> On Thu, Sep 27, 2012 at 12:17:22AM -0700, david@lang.hm wrote:
> > It seems to me that trying to figure out if you are going to
> > overload the L2 is an impossible task, so just assume that it will
> > all fit, and the worst case is you have one balancing cycle where
> > you can't do as much work and then the normal balancing will kick in
> > and move something anyway.
> 
> Right, and this implies that when the load balancer runs, it will
> definitely move the task away from the L2. But what do I do in the cases
> where the two tasks don't overload the L2 and it is actually beneficial
> to keep them there? How does the load balancer know that?

It doesn't, but it has task_hot().  A preempted buddy may be pulled, but
the next wakeup will try to bring buddies back together.

-Mike 



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  7:10                                                   ` Ingo Molnar
@ 2012-09-27 16:25                                                     ` Borislav Petkov
  2012-09-27 17:44                                                     ` Linus Torvalds
  1 sibling, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-27 16:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Linus Torvalds, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, Sep 27, 2012 at 09:10:11AM +0200, Ingo Molnar wrote:
> The theory would be that this patch fixes psql performance, with CPU
> selection being a measurable but second order of magnitude effect. How
> well does practice match theory in this case?

Yeah, it looks a bit better than default linux. A whopping 9% perf delta
:-).

v3.6-rc7-1897-g28381f207bd7 (linus from 26/9 + tip/auto-latest) + performance governor
======================================================================================

plain
-----
tps = 4574.570857 (including connections establishing)
tps = 4579.166159 (excluding connections establishing)

kill select_idle_sibling
------------------------
tps = 2230.354093 (including connections establishing)
tps = 2231.412169 (excluding connections establishing)

NO_WAKEUP_PREEMPTION
--------------------
tps = 4991.206742 (including connections establishing)
tps = 4996.743622 (excluding connections establishing)

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  8:21                                       ` Peter Zijlstra
@ 2012-09-27 16:48                                         ` david
  2012-09-27 17:38                                           ` Peter Zijlstra
  0 siblings, 1 reply; 115+ messages in thread
From: david @ 2012-09-27 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> On Wed, 2012-09-26 at 11:19 -0700, Linus Torvalds wrote:
>>
>> For example, it starts with the maximum target scheduling domain, and
>> works its way in over the scheduling groups within that domain. What
>> the f*ck is the logic of that kind of crazy thing? It never makes
>> sense to look at a biggest domain first.
>
> That's about SMT, it was felt that you don't want SMT siblings first
> because typically SMT siblings are somewhat under-powered compared to
> actual cores.
>
> Also, the whole scheduler topology thing doesn't have L2/L3 domains, it
> only has the LLC domain, if you want more we'll need to fix that. For
> now its a fixed:
>
> SMT
> MC (llc)
> CPU (package/machine-for-!numa)
> NUMA
>
> So in your patch, your for_each_domain() loop will really only do the
> SMT/MC levels and prefer an SMT sibling over an idle core.

I think you are being too smart for your own good. You don't know if it's 
best to move them further apart or not. I'm arguing that you can't know.

so I'm saying do the simple thing.

if a core is overloaded, move to an idle core that is as close as possible 
to the core you start from (as much shared as possible).

if this does not overload the shared resource, you did the right thing.

if this does overload the shared resource, it's still no worse than 
leaving it on the original core (which was shared everything, so you've 
reduced the sharing a little bit)

the next balancing cycle you then work to move something again, and since 
both the original and new core show as overloaded (due to the contention 
on the shared resources), you move something to another core that shares 
just a little less.

Yes, this means that it may take more balancing cycles to move things far 
enough apart to reduce the sharing enough to avoid overload of the shared 
resource, but I don't see any way that you can possibly guess if two 
processes are going to overload the shared resource ahead of time.

It may be that simply moving to a HT core (and no longer contending for 
registers) is enough to let both processes fly, or it may be that the 
overload is in a shared floating point unit or L1 cache and you need to 
move further away, or you may find the contention is in the L2 cache and 
move further away, or it could be in the L3 cache, or it could be in the 
memory interface (NUMA)

Without being able to predict the future, you don't know how far away you 
need to move the tasks to have them operate at the optimal level. All that 
you do know is that the shorter the move, the less expensive the move. So 
make each move be as short as possible, and measure again to see if that 
was enough.

For some workloads, it will be. For many workloads the least expensive 
move won't be.

The question is if doing multiple, cheap moves (requiring simple checking 
for each move) ends up being a win compared to doing better guessing over 
when the more expensive moves are worth it.

Given how chips change from year to year, I don't see how the 'better 
guessing' is going to survive more than a couple of chip releases in any 
case.

David Lang

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 10:20                                           ` Borislav Petkov
  2012-09-27 13:38                                             ` Mike Galbraith
@ 2012-09-27 16:55                                             ` david
  1 sibling, 0 replies; 115+ messages in thread
From: david @ 2012-09-27 16:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Linus Torvalds, Peter Zijlstra, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, 27 Sep 2012, Borislav Petkov wrote:

> On Thu, Sep 27, 2012 at 12:17:22AM -0700, david@lang.hm wrote:
>> It seems to me that trying to figure out if you are going to
>> overload the L2 is an impossible task, so just assume that it will
>> all fit, and the worst case is you have one balancing cycle where
>> you can't do as much work and then the normal balancing will kick in
>> and move something anyway.
>
> Right, and this implies that when the load balancer runs, it will
> definitely move the task away from the L2. But what do I do in the cases
> where the two tasks don't overload the L2 and it is actually beneficial
> to keep them there? How does the load balancer know that?

no, I'm saying that you should assume that the two tasks won't overload 
the L2, try it, and if they do overload the L2, move one of the tasks 
again the next balancing cycle.

there is a lot of possible sharing going on between 'cores'

shared everything (a single core)
different registers, shared everything else (HT core)
shared floating point, shared cache, different everything else
shared L2/L3/Memory, different everything else
shared L3/Memory, different everything else
shared Memory, different everything else
different everything

and just wait a couple of years and someone will add a new entry to this 
list (if I haven't already missed a few :-)

the more that is shared, the cheaper it is to move the process (the less 
cached state you throw away), so ideally you want to move the process as 
little as possible, just enough to eliminate whatever the contended 
resource is. But since you really don't know the footprint of each process 
in each of these layers, all you can measure is what percentage of the 
total core time the process used, so just move it a little and see if that 
was enough.

David Lang

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 16:48                                         ` david
@ 2012-09-27 17:38                                           ` Peter Zijlstra
  2012-09-27 17:45                                             ` david
  0 siblings, 1 reply; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-27 17:38 UTC (permalink / raw)
  To: david
  Cc: Linus Torvalds, Borislav Petkov, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, 2012-09-27 at 09:48 -0700, david@lang.hm wrote:
> I think you are being too smart for your own good. You don't know if it's 
> best to move them further apart or not. 

Well yes and no.. You're right, however in general the load-balancer has
always tried to not use (SMT) siblings whenever possible, in that regard
not using an idle sibling is consistent here.

Also, for short running tasks the wakeup balancing is typically all we
have, the 'big' periodic load-balancer will 'never' see them, making the
multiple moves argument hard.

Measuring resource contention on the various levels is a fun research
subject, I've spoken to various people who are/were doing so, I've
always encouraged them to send their code just so we can see/learn, even
if not integrate, sadly I can't remember ever having seen any of it :/

And yeah, all the load-balancing stuff is very near to scrying or
tealeaf reading. We can't know all current state (too expensive) nor can
we know the future.

That said, I'm all for less/simpler code, pesky benchmarks aside ;-)

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27  7:10                                                   ` Ingo Molnar
  2012-09-27 16:25                                                     ` Borislav Petkov
@ 2012-09-27 17:44                                                     ` Linus Torvalds
  2012-09-27 18:05                                                       ` Borislav Petkov
  1 sibling, 1 reply; 115+ messages in thread
From: Linus Torvalds @ 2012-09-27 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Borislav Petkov, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, Sep 27, 2012 at 12:10 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> Just in case someone prefers patches to user-space approaches (I
> certainly do!), here's one that turns off wakeup driven
> preemption by default.

Ok, so apparently this fixes performance in a big way, and might allow
us to simplify select_idle_sibling(), which is clearly way too random.

That is, if we could make it automatic, some way. Not the "let the
user tune it" - that's just fundamentally broken.

What is the common pattern for the wakeups for psql?

Can we detect this somehow? Are they sync? It looks wrong to preempt
for sync wakeups, for example, but we seem to do that.

Or could we just improve the heuristics. What happens if the
scheduling granularity is increased, for example? It's set to 1ms
right now, with a logarithmic scaling by number of cpus.
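The "1ms with a logarithmic scaling by number of cpus" works out roughly as below. This is a simplified model of the kernel's boot-time tunable scaling; the exact base constant should be treated as an assumption.

```python
import math

BASE_WAKEUP_GRANULARITY_NS = 1_000_000   # 1 ms, before CPU scaling (assumed base)

def scaled_granularity_ns(ncpus: int) -> int:
    # Logarithmic tunable scaling: factor = 1 + ilog2(ncpus)
    factor = 1 + int(math.log2(ncpus))
    return BASE_WAKEUP_GRANULARITY_NS * factor

# e.g. 1 CPU  -> factor 1 -> 1 ms
#      8 CPUs -> factor 4 -> 4 ms effective wakeup granularity
```

The factor grows slowly, so even large machines only stretch the granularity a few-fold over the single-CPU base.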

        Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 17:38                                           ` Peter Zijlstra
@ 2012-09-27 17:45                                             ` david
  2012-09-27 18:09                                               ` Peter Zijlstra
                                                                 ` (2 more replies)
  0 siblings, 3 replies; 115+ messages in thread
From: david @ 2012-09-27 17:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Borislav Petkov, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, 27 Sep 2012, Peter Zijlstra wrote:

> On Thu, 2012-09-27 at 09:48 -0700, david@lang.hm wrote:
>> I think you are being too smart for your own good. You don't know if it's
>> best to move them further apart or not.
>
> Well yes and no.. You're right, however in general the load-balancer has
> always tried to not use (SMT) siblings whenever possible, in that regard
> not using an idle sibling is consistent here.
>
> Also, for short running tasks the wakeup balancing is typically all we
> have, the 'big' periodic load-balancer will 'never' see them, making the
> multiple moves argument hard.

For the initial startup of a new process, finding as idle and remote a core 
to start on (minimum sharing with existing processes) is probably the 
smart thing to do.

But I thought that this conversation (pgbench) was dealing with long 
running processes, and how to deal with the overload where one master 
process is kicking off many child processes and the core that the master 
process starts off on gets overloaded as a result, with the question being 
how to spread the load out from this one core as it gets overloaded.

David Lang

> Measuring resource contention on the various levels is a fun research
> subject, I've spoken to various people who are/were doing so, I've
> always encouraged them to send their code just so we can see/learn, even
> if not integrate, sadly I can't remember ever having seen any of it :/
>
> And yeah, all the load-balancing stuff is very near to scrying or
> tealeaf reading. We can't know all current state (too expensive) nor can
> we know the future.
>
> That said, I'm all for less/simpler code, pesky benchmarks aside ;-)
>

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 17:44                                                     ` Linus Torvalds
@ 2012-09-27 18:05                                                       ` Borislav Petkov
  2012-09-27 18:19                                                         ` Linus Torvalds
  0 siblings, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-27 18:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Mike Galbraith, Peter Zijlstra, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
> Or could we just improve the heuristics. What happens if the
> scheduling granularity is increased, for example? It's set to 1ms
> right now, with a logarithmic scaling by number of cpus.

/proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
------------------------------------------------------
tps = 4994.730809 (including connections establishing)
tps = 5000.260764 (excluding connections establishing)

A bit better over the default NO_WAKEUP_PREEMPTION setting.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 17:45                                             ` david
@ 2012-09-27 18:09                                               ` Peter Zijlstra
  2012-09-27 18:15                                               ` Linus Torvalds
  2012-09-27 18:24                                               ` Borislav Petkov
  2 siblings, 0 replies; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-27 18:09 UTC (permalink / raw)
  To: david
  Cc: Linus Torvalds, Borislav Petkov, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, 2012-09-27 at 10:45 -0700, david@lang.hm wrote:
> But I thought that this conversation (pgbench) was dealing with long 
> running processes, 

Ah, I think we've got a confusion on long vs short.. yes pgbench is a
long-running process, however the tasks might not be long in runnable
state. Ie it receives a request, computes a bit, blocks on IO, computes
a bit, replies, goes idle to wait for a new request.

If all those runnable sections are short enough, it will 'never' be
around when the periodic load-balancer does its thing, since that only
looks at the tasks in runnable state at that moment in time.

I say 'never' because while it will occasionally show up due to pure
chance, it will unlikely be a very big player in placement.

Once a cpu is overloaded enough to get real queueing they'll show up,
get dispersed and then its back to wakeup stuff.

Then again, it might be completely irrelevant to pgbench, its been a
while since I looked at how it schedules.
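The "'never' seen by the periodic balancer" point can be checked with a toy simulation: sample random instants in a task's request cycle and count how often a balancer tick would catch it runnable. The duty-cycle numbers are the illustrative ones from the description above, not measurements of pgbench.

```python
import random

def fraction_seen_runnable(runnable_us: int, period_us: int,
                           samples: int = 100_000, seed: int = 42) -> float:
    """Fraction of random sampling instants at which a task that is
    runnable for `runnable_us` out of every `period_us` (then blocked
    on IO / idle for the rest) would be caught in runnable state."""
    rng = random.Random(seed)
    hits = sum(rng.uniform(0, period_us) < runnable_us for _ in range(samples))
    return hits / samples

# A handler runnable ~1 ms out of every ~10 ms is caught by a periodic
# balancer tick only ~10% of the time -- wakeup placement does the rest.
```

That is the sense in which such tasks are 'never' big players in periodic placement: they show up occasionally by pure chance, in proportion to their runnable fraction.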

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 17:45                                             ` david
  2012-09-27 18:09                                               ` Peter Zijlstra
@ 2012-09-27 18:15                                               ` Linus Torvalds
  2012-09-27 18:24                                               ` Borislav Petkov
  2 siblings, 0 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-27 18:15 UTC (permalink / raw)
  To: david
  Cc: Peter Zijlstra, Borislav Petkov, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, Sep 27, 2012 at 10:45 AM,  <david@lang.hm> wrote:
>
> For the initial starup of a new process, finding as idle and remote a core
> to start on (minimum sharing with existing processes) is probably the smart
> thing to do.

Actually, no.

It's *exec* that should go remote. New processes (fork, vfork or
clone) absolutely should *not* go remote at all.

vfork() should stay on the same CPU (synchronous wakeup), fork()
should possibly go SMT (likely exec in the near future will spread it
out), and clone should likely just stay close too.

           Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 18:05                                                       ` Borislav Petkov
@ 2012-09-27 18:19                                                         ` Linus Torvalds
  2012-09-27 18:29                                                           ` Peter Zijlstra
  0 siblings, 1 reply; 115+ messages in thread
From: Linus Torvalds @ 2012-09-27 18:19 UTC (permalink / raw)
  To: Borislav Petkov, Linus Torvalds, Ingo Molnar, Mike Galbraith,
	Peter Zijlstra, Mel Gorman, Nikolay Ulyanitsky, linux-kernel,
	Andreas Herrmann, Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
>> Or could we just improve the heuristics. What happens if the
>> scheduling granularity is increased, for example? It's set to 1ms
>> right now, with a logarithmic scaling by number of cpus.
>
> /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> ------------------------------------------------------
> tps = 4994.730809 (including connections establishing)
> tps = 5000.260764 (excluding connections establishing)
>
> A bit better over the default NO_WAKEUP_PREEMPTION setting.

Ok, so this gives us something possible to actually play with.

For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

(Btw, "linear" right now looks like 1:1. That's linear, but it's a
very aggressive linearity. Something like "factor = (cpus+1)/2" would
also be linear, but by a less extreme factor.)
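The scaling variants being compared can be sketched side by side. The "log" and "linear" factors mirror the tunable-scaling modes in spirit; "half_linear" is the hypothetical (cpus+1)/2 variant floated above.

```python
import math

def factor(ncpus: int, scaling: str) -> int:
    """Tunable-scaling factors under discussion (simplified sketch)."""
    if scaling == "log":
        return 1 + int(math.log2(ncpus))  # slow growth
    if scaling == "linear":
        return ncpus                      # the aggressive 1:1 linearity
    if scaling == "half_linear":
        return (ncpus + 1) // 2           # still linear, half the slope
    return 1                              # no scaling

# e.g. with 64 cpus: log -> 7, 1:1 linear -> 64, (cpus+1)//2 -> 32
```

The gap between the three only opens up on larger machines, which is where the choice of scaling actually matters.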

              Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 17:45                                             ` david
  2012-09-27 18:09                                               ` Peter Zijlstra
  2012-09-27 18:15                                               ` Linus Torvalds
@ 2012-09-27 18:24                                               ` Borislav Petkov
  2 siblings, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-27 18:24 UTC (permalink / raw)
  To: david
  Cc: Peter Zijlstra, Linus Torvalds, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Ingo Molnar, Suresh Siddha

On Thu, Sep 27, 2012 at 10:45:06AM -0700, david@lang.hm wrote:
> On Thu, 27 Sep 2012, Peter Zijlstra wrote:
> 
> >On Thu, 2012-09-27 at 09:48 -0700, david@lang.hm wrote:
> >>I think you are being too smart for your own good. You don't know if it's
> >>best to move them further apart or not.
> >
> >Well yes and no.. You're right, however in general the load-balancer has
> >always tried to not use (SMT) siblings whenever possible, in that regard
> >not using an idle sibling is consistent here.
> >
> >Also, for short running tasks the wakeup balancing is typically all we
> >have, the 'big' periodic load-balancer will 'never' see them, making the
> >multiple moves argument hard.
> 
> For the initial startup of a new process, finding as idle and remote
> a core to start on (minimum sharing with existing processes) is
> probably the smart thing to do.

Right,

but we don't schedule to the SMT siblings, as Peter says above. So we
can't get to the case where two SMT siblings are not overloaded and the
processes remain on the same L2.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 18:19                                                         ` Linus Torvalds
@ 2012-09-27 18:29                                                           ` Peter Zijlstra
  2012-09-27 19:24                                                             ` Borislav Petkov
  2012-09-27 19:40                                                             ` Linus Torvalds
  0 siblings, 2 replies; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-27 18:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Ingo Molnar, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, 2012-09-27 at 11:19 -0700, Linus Torvalds wrote:
> On Thu, Sep 27, 2012 at 11:05 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Thu, Sep 27, 2012 at 10:44:26AM -0700, Linus Torvalds wrote:
> >> Or could we just improve the heuristics. What happens if the
> >> scheduling granularity is increased, for example? It's set to 1ms
> >> right now, with a logarithmic scaling by number of cpus.
> >
> > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > ------------------------------------------------------
> > tps = 4994.730809 (including connections establishing)
> > tps = 5000.260764 (excluding connections establishing)
> >
> > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> 
> Ok, so this gives us something possible to actually play with.
> 
> For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?

Don't forget to run the desktop interactivity benchmarks after you're
done wriggling with this knob... wakeup preemption is important for most
of those.



^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 18:29                                                           ` Peter Zijlstra
@ 2012-09-27 19:24                                                             ` Borislav Petkov
  2012-09-28  3:50                                                               ` Mike Galbraith
  2012-09-27 19:40                                                             ` Linus Torvalds
  1 sibling, 1 reply; 115+ messages in thread
From: Borislav Petkov @ 2012-09-27 19:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
> > >> Or could we just improve the heuristics. What happens if the
> > >> scheduling granularity is increased, for example? It's set to 1ms
> > >> right now, with a logarithmic scaling by number of cpus.
> > >
> > > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > > ------------------------------------------------------
> > > tps = 4994.730809 (including connections establishing)
> > > tps = 5000.260764 (excluding connections establishing)
> > >
> > > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> > 
> > Ok, so this gives us something possible to actually play with.
> > 
> > For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> > than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
> 
> Don't forget to run the desktop interactivity benchmarks after you're
> done wriggling with this knob... wakeup preemption is important for most
> those.

Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
wakeup_granularity go to 4ms:

sched_autogroup_enabled:1
sched_child_runs_first:0
sched_latency_ns:24000000
sched_migration_cost_ns:500000
sched_min_granularity_ns:3000000
sched_nr_migrate:32
sched_rt_period_us:1000000
sched_rt_runtime_us:950000
sched_shares_window_ns:10000000
sched_time_avg_ms:1000
sched_tunable_scaling:2
sched_wakeup_granularity_ns:4000000

pgbench results look good:

tps = 4997.675331 (including connections establishing)
tps = 5003.256870 (excluding connections establishing)

This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

Thanks.

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 18:29                                                           ` Peter Zijlstra
  2012-09-27 19:24                                                             ` Borislav Petkov
@ 2012-09-27 19:40                                                             ` Linus Torvalds
  2012-09-28  4:13                                                               ` Mike Galbraith
  2012-09-28  8:37                                                               ` Peter Zijlstra
  1 sibling, 2 replies; 115+ messages in thread
From: Linus Torvalds @ 2012-09-27 19:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Borislav Petkov, Ingo Molnar, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
>
> Don't forget to run the desktop interactivity benchmarks after you're
> done wriggling with this knob... wakeup preemption is important for most
> those.

So I don't think we want to *just* wiggle that knob per se. We
definitely don't want to hurt latency on actual interactive tasks. But
it's interesting that it helps psql so much, and that there seems to
be some interaction with select_idle_sibling().

So I do have a few things I react to when looking at that wakeup granularity..

I wonder about this comment, for example:

         * By using 'se' instead of 'curr' we penalize light tasks, so
         * they get preempted easier. That is, if 'se' < 'curr' then
         * the resulting gran will be larger, therefore penalizing the
         * lighter, if otoh 'se' > 'curr' then the resulting gran will
         * be smaller, again penalizing the lighter task.

why would we want to preempt light tasks easier? It sounds backwards
to me. If they are light, we have *less* reason to preempt them, since
they are more likely to just go to sleep on their own, no?

Another question is whether the fact that this same load interacts
with select_idle_sibling() is perhaps a sign that maybe the preemption
logic is all fine, but it interacts badly with the "pick new cpu"
code. In particular, after having changed rq's, is the vruntime really
comparable? IOW, maybe this is an interaction between "place_entity()"
and then the immediately following (?) call to check wakeup
preemption?

The fact that *either* changing select_idle_sibling() *or* changing
the wakeup preemption granularity seems to have such a huge impact
does seem to tie them together somehow for this particular load. No?

             Linus

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 19:24                                                             ` Borislav Petkov
@ 2012-09-28  3:50                                                               ` Mike Galbraith
  2012-09-28 12:30                                                                 ` Borislav Petkov
  0 siblings, 1 reply; 115+ messages in thread
From: Mike Galbraith @ 2012-09-28  3:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Peter Zijlstra, Linus Torvalds, Ingo Molnar, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, 2012-09-27 at 21:24 +0200, Borislav Petkov wrote: 
> On Thu, Sep 27, 2012 at 08:29:44PM +0200, Peter Zijlstra wrote:
> > > >> Or could we just improve the heuristics. What happens if the
> > > >> scheduling granularity is increased, for example? It's set to 1ms
> > > >> right now, with a logarithmic scaling by number of cpus.
> > > >
> > > > /proc/sys/kernel/sched_wakeup_granularity_ns=10000000 (10ms)
> > > > ------------------------------------------------------
> > > > tps = 4994.730809 (including connections establishing)
> > > > tps = 5000.260764 (excluding connections establishing)
> > > >
> > > > A bit better over the default NO_WAKEUP_PREEMPTION setting.
> > > 
> > > Ok, so this gives us something possible to actually play with.
> > > 
> > > For example, maybe SCHED_TUNABLESCALING_LINEAR is more appropriate
> > > than SCHED_TUNABLESCALING_LOG. At least for WAKEUP_PREEMPTION. Hmm?
> > 
> > Don't forget to run the desktop interactivity benchmarks after you're
> > done wriggling with this knob... wakeup preemption is important for most
> > those.
> 
> Setting sched_tunable_scaling to SCHED_TUNABLESCALING_LINEAR made
> wakeup_granularity go to 4ms:
> 
> sched_autogroup_enabled:1
> sched_child_runs_first:0
> sched_latency_ns:24000000
> sched_migration_cost_ns:500000
> sched_min_granularity_ns:3000000
> sched_nr_migrate:32
> sched_rt_period_us:1000000
> sched_rt_runtime_us:950000
> sched_shares_window_ns:10000000
> sched_time_avg_ms:1000
> sched_tunable_scaling:2
> sched_wakeup_granularity_ns:4000000
> 
> pgbench results look good:
> 
> tps = 4997.675331 (including connections establishing)
> tps = 5003.256870 (excluding connections establishing)
> 
> This is still with Ingo's NO_WAKEUP_PREEMPTION patch.

And wakeup preemption is still disabled as well, correct?

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 19:40                                                             ` Linus Torvalds
@ 2012-09-28  4:13                                                               ` Mike Galbraith
  2012-09-28  8:37                                                               ` Peter Zijlstra
  1 sibling, 0 replies; 115+ messages in thread
From: Mike Galbraith @ 2012-09-28  4:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Borislav Petkov, Ingo Molnar, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Thu, 2012-09-27 at 12:40 -0700, Linus Torvalds wrote: 
> On Thu, Sep 27, 2012 at 11:29 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> >
> > Don't forget to run the desktop interactivity benchmarks after you're
> > done wriggling with this knob... wakeup preemption is important for most
> > those.
> 
> So I don't think we want to *just* wiggle that knob per se. We
> definitely don't want to hurt latency on actual interactive asks. But
> it's interesting that it helps psql so much, and that there seems to
> be some interaction with the select_idle_sibling().
> 
> So I do have a few things I react to when looking at that wakeup granularity..
> 
> I wonder about this comment, for example:
> 
>          * By using 'se' instead of 'curr' we penalize light tasks, so
>          * they get preempted easier. That is, if 'se' < 'curr' then
>          * the resulting gran will be larger, therefore penalizing the
>          * lighter, if otoh 'se' > 'curr' then the resulting gran will
>          * be smaller, again penalizing the lighter task.
> 
> why would we want to preempt light tasks easier? It sounds backwards
> to me. If they are light, we have *less* reason to preempt them, since
> they are more likely to just go to sleep on their own, no?

Ah, that particular 'light' refers to se->load.weight.

> Another question is whether the fact that this same load interacts
> with select_idle_sibling() is perhaps a sign that maybe the preemption
> logic is all fine, but it interacts badly with the "pick new cpu"
> code. In particular, after having changed rq's, is the vruntime really
> comparable? IOW, maybe this is an interaction between "place_entity()"
> and then the immediately following (?) call to check wakeup
> preemption?

I think vruntime should be fine.  We take the delta between the
task's vruntime when it went to sleep and its previous rq min_vruntime
to capture progress made while it slept, and apply the relative offset
in the task's new home, so a task can migrate and still have a chance to
preempt on wakeup.
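That offset-preserving migration can be sketched as below. This is a simplified model of the fair-class dequeue/enqueue behaviour being described; the function names are illustrative, not kernel symbols.

```python
# On dequeue (sleep/migrate) the task keeps its vruntime relative to
# the old runqueue's min_vruntime; on enqueue it is rebased onto the
# new runqueue.  The relative lead/lag -- and thus the chance to
# preempt on wakeup -- survives the move between CPUs.

def dequeue(task_vruntime: float, old_rq_min_vruntime: float) -> float:
    return task_vruntime - old_rq_min_vruntime       # store relative offset

def enqueue(relative_vruntime: float, new_rq_min_vruntime: float) -> float:
    return relative_vruntime + new_rq_min_vruntime   # rebase on new rq

# A task 2.0 behind its old rq's min_vruntime is still 2.0 behind on
# the new one, even though the absolute clocks differ wildly:
offset = dequeue(98.0, 100.0)        # -2.0
rebased = enqueue(offset, 500.0)     # 498.0
```

Being "behind" (negative offset) is what gives a freshly woken migrant a chance to preempt on its new CPU.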

> The fact that *either* changing select_idle_sibling() *or* changing
> the wakeup preemption granularity seems to have such a huge impact
> does seem to tie them together somehow for this particular load. No?

The way I read it, Boris had wakeup preemption disabled.

-Mike


^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-27 19:40                                                             ` Linus Torvalds
  2012-09-28  4:13                                                               ` Mike Galbraith
@ 2012-09-28  8:37                                                               ` Peter Zijlstra
  1 sibling, 0 replies; 115+ messages in thread
From: Peter Zijlstra @ 2012-09-28  8:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Borislav Petkov, Ingo Molnar, Mike Galbraith, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha, Paul Turner

On Thu, 2012-09-27 at 12:40 -0700, Linus Torvalds wrote:
> I wonder about this comment, for example:
> 
>          * By using 'se' instead of 'curr' we penalize light tasks, so
>          * they get preempted easier. That is, if 'se' < 'curr' then
>          * the resulting gran will be larger, therefore penalizing the
>          * lighter, if otoh 'se' > 'curr' then the resulting gran will
>          * be smaller, again penalizing the lighter task.
> 
> why would we want to preempt light tasks easier? It sounds backwards
> to me. If they are light, we have *less* reason to preempt them, since
> they are more likely to just go to sleep on their own, no?

No, the weight comes from nice, and nicing a task doesn't make it want to
run less. So preempting them sooner means they disturb the heavier less,
which is I think what you want with nice.
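A rough sketch of the weight-scaled granularity the quoted comment describes: the preemption threshold is scaled by the *wakee's* weight, so a lighter wakee sees a larger gran and preempts the running task less easily. This is a simplification of the kernel's logic, with the nice+5 weight value taken from the standard nice-to-weight table.

```python
NICE_0_WEIGHT = 1024    # load weight of a nice-0 task
BASE_GRAN = 1.0         # wakeup granularity in ms (pre-scaling, assumed)

def wakeup_gran(wakee_weight: int) -> float:
    """Granularity scaled by the wakee's weight: lighter wakee,
    larger gran, harder for it to preempt (simplified sketch)."""
    return BASE_GRAN * NICE_0_WEIGHT / wakee_weight

def should_preempt(curr_vruntime: float, wakee_vruntime: float,
                   wakee_weight: int) -> bool:
    # Preempt only if the wakee's vruntime lead exceeds its granularity.
    return curr_vruntime - wakee_vruntime > wakeup_gran(wakee_weight)

# A niced (lighter, weight 335 ~ nice +5) wakee needs a bigger vruntime
# lead than a nice-0 wakee to preempt the same current task:
assert wakeup_gran(335) > wakeup_gran(1024)
```

So the scaling penalizes the lighter party on both sides: a light wakee preempts less readily, and a light current task is preempted sooner by a heavy wakee.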

> Another question is whether the fact that this same load interacts
> with select_idle_sibling() is perhaps a sign that maybe the preemption
> logic is all fine, but it interacts badly with the "pick new cpu"
> code. In particular, after having changed rq's, is the vruntime really
> comparable? IOW, maybe this is an interaction between "place_entity()"
> and then the immediately following (?) call to check wakeup
> preemption? 

No, the vruntime comparison between cpus is dubious; it's not complete
nonsense but it's not 'correct' either. PJT has patches to improve that
based on his per-entity tracking stuff.

^ permalink raw reply	[flat|nested] 115+ messages in thread

* Re: 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected
  2012-09-28  3:50                                                               ` Mike Galbraith
@ 2012-09-28 12:30                                                                 ` Borislav Petkov
  0 siblings, 0 replies; 115+ messages in thread
From: Borislav Petkov @ 2012-09-28 12:30 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Peter Zijlstra, Linus Torvalds, Ingo Molnar, Mel Gorman,
	Nikolay Ulyanitsky, linux-kernel, Andreas Herrmann,
	Andrew Morton, Thomas Gleixner, Suresh Siddha

On Fri, Sep 28, 2012 at 05:50:20AM +0200, Mike Galbraith wrote:
> And wakeup preemption is still disabled as well, correct?

Yes it is by default anyway:

$ cat /mnt/dbg/sched_features
GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY CACHE_HOT_BUDDY NO_WAKEUP_PREEMPTION ARCH_POWER NO_HRTICK NO_DOUBLE_TICK LB_BIAS OWNER_SPIN NONTASK_POWER TTWU_QUEUE NO_FORCE_SD_OVERLAP RT_RUNTIME_SHARE NO_LB_MIN

NO_WAKEUP_PREEMPTION brings 9% improvement with pgbench, btw:

http://marc.info/?l=linux-kernel&m=134876312310048

-- 
Regards/Gruss,
Boris.

^ permalink raw reply	[flat|nested] 115+ messages in thread

end of thread, other threads:[~2012-09-28 12:29 UTC | newest]

Thread overview: 115+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-09-14  7:47 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets Nikolay Ulyanitsky
2012-09-14 18:40 ` Borislav Petkov
2012-09-14 18:51   ` Borislav Petkov
2012-09-14 21:27 ` 20% performance drop on PostgreSQL 9.2 from kernel 3.5.3 to 3.6-rc5 on AMD chipsets - bisected Borislav Petkov
2012-09-14 21:40   ` Peter Zijlstra
2012-09-14 21:44     ` Linus Torvalds
2012-09-14 21:56       ` Peter Zijlstra
2012-09-14 21:59         ` Peter Zijlstra
2012-09-15  3:57           ` Mike Galbraith
2012-09-14 22:01         ` Linus Torvalds
2012-09-14 22:10           ` Peter Zijlstra
2012-09-14 22:20             ` Linus Torvalds
2012-09-14 22:14           ` Borislav Petkov
2012-09-14 21:45     ` Borislav Petkov
2012-09-14 21:42   ` Linus Torvalds
2012-09-15  3:33     ` Mike Galbraith
2012-09-15 16:16       ` Andi Kleen
2012-09-15 16:36         ` Mike Galbraith
2012-09-15 17:08           ` richard -rw- weinberger
2012-09-16  4:48             ` Mike Galbraith
2012-09-15 21:32           ` Alan Cox
2012-09-16  4:35             ` Mike Galbraith
2012-09-16 19:57               ` Linus Torvalds
2012-09-17  8:08                 ` Mike Galbraith
2012-09-17 10:07                   ` Ingo Molnar
2012-09-17 10:47                     ` Mike Galbraith
2012-09-17 14:39                     ` Andi Kleen
2012-09-19 12:35               ` Mike Galbraith
2012-09-19 14:54                 ` Ingo Molnar
2012-09-19 15:23                   ` Mike Galbraith
2012-09-24 15:00     ` Mel Gorman
2012-09-24 15:23       ` Nikolay Ulyanitsky
2012-09-24 15:53         ` Borislav Petkov
2012-09-24 15:30       ` Peter Zijlstra
2012-09-24 15:51         ` Mike Galbraith
2012-09-24 15:52         ` Linus Torvalds
2012-09-24 16:07           ` Peter Zijlstra
2012-09-24 16:33             ` Linus Torvalds
2012-09-24 16:54               ` Peter Zijlstra
2012-09-25 12:10                 ` Hillf Danton
2012-09-24 16:12           ` Peter Zijlstra
2012-09-24 16:30             ` Linus Torvalds
2012-09-24 16:52               ` Borislav Petkov
2012-09-24 16:54               ` Peter Zijlstra
2012-09-24 17:44                 ` Peter Zijlstra
2012-09-25 13:23                   ` Mel Gorman
2012-09-25 14:36                     ` Peter Zijlstra
2012-09-24 18:26                 ` Mike Galbraith
2012-09-24 19:12                   ` Linus Torvalds
2012-09-24 19:20                     ` Borislav Petkov
2012-09-25  1:57                       ` Mike Galbraith
2012-09-25  2:11                         ` Linus Torvalds
2012-09-25  2:49                           ` Mike Galbraith
2012-09-25  3:10                             ` Linus Torvalds
2012-09-25  3:20                               ` Mike Galbraith
2012-09-25  3:32                                 ` Linus Torvalds
2012-09-25  3:43                                   ` Mike Galbraith
2012-09-25 11:58                           ` Peter Zijlstra
2012-09-25 13:17                             ` Borislav Petkov
2012-09-25 17:00                               ` Borislav Petkov
2012-09-25 17:21                                 ` Linus Torvalds
2012-09-25 18:42                                   ` Borislav Petkov
2012-09-25 19:08                                     ` Linus Torvalds
2012-09-26  2:23                                     ` Mike Galbraith
2012-09-26 17:17                                       ` Borislav Petkov
2012-09-26  2:00                                   ` Mike Galbraith
2012-09-26  2:22                                     ` Linus Torvalds
2012-09-26  2:42                                       ` Mike Galbraith
2012-09-26 17:15                                       ` Borislav Petkov
2012-09-26 16:32                                   ` Borislav Petkov
2012-09-26 18:19                                     ` Linus Torvalds
2012-09-26 21:37                                       ` Borislav Petkov
2012-09-27  5:09                                         ` Mike Galbraith
2012-09-27  5:18                                           ` Borislav Petkov
2012-09-27  5:44                                             ` Mike Galbraith
2012-09-27  5:47                                           ` Ingo Molnar
2012-09-27  5:59                                             ` Ingo Molnar
2012-09-27  6:34                                             ` Mike Galbraith
2012-09-27  6:41                                               ` Ingo Molnar
2012-09-27  6:54                                                 ` Mike Galbraith
2012-09-27  7:10                                                   ` Ingo Molnar
2012-09-27 16:25                                                     ` Borislav Petkov
2012-09-27 17:44                                                     ` Linus Torvalds
2012-09-27 18:05                                                       ` Borislav Petkov
2012-09-27 18:19                                                         ` Linus Torvalds
2012-09-27 18:29                                                           ` Peter Zijlstra
2012-09-27 19:24                                                             ` Borislav Petkov
2012-09-28  3:50                                                               ` Mike Galbraith
2012-09-28 12:30                                                                 ` Borislav Petkov
2012-09-27 19:40                                                             ` Linus Torvalds
2012-09-28  4:13                                                               ` Mike Galbraith
2012-09-28  8:37                                                               ` Peter Zijlstra
2012-09-27  7:17                                         ` david
2012-09-27  7:55                                           ` Mike Galbraith
2012-09-27 10:20                                           ` Borislav Petkov
2012-09-27 13:38                                             ` Mike Galbraith
2012-09-27 16:55                                             ` david
2012-09-27  4:32                                       ` Mike Galbraith
2012-09-27  8:21                                       ` Peter Zijlstra
2012-09-27 16:48                                         ` david
2012-09-27 17:38                                           ` Peter Zijlstra
2012-09-27 17:45                                             ` david
2012-09-27 18:09                                               ` Peter Zijlstra
2012-09-27 18:15                                               ` Linus Torvalds
2012-09-27 18:24                                               ` Borislav Petkov
2012-09-25  1:39                     ` Mike Galbraith
2012-09-25 21:11                     ` Suresh Siddha
2012-09-25  4:16       ` Mike Galbraith
2012-09-15  4:11   ` Mike Galbraith
     [not found]     ` <CA+55aFz1A7HbMYS9o-GTS5Zm=Xx8MUD7cR05GMVo--2E34jcgQ@mail.gmail.com>
2012-09-15  4:42       ` Mike Galbraith
2012-09-15 10:44     ` Borislav Petkov
2012-09-15 14:47       ` Mike Galbraith
2012-09-15 15:18         ` Borislav Petkov
2012-09-15 16:13           ` Mike Galbraith
2012-09-15 19:44             ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).