* find_busiest_group using lots of CPU
@ 2009-09-30  8:18 Jens Axboe
  2009-10-05 12:31 ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2009-09-30  8:18 UTC (permalink / raw)
  To: Linux Kernel; +Cc: mingo, a.p.zijlstra

Hi,

I stuffed a few more SSDs into my test box. Running a simple workload
that just does streaming reads from 10 processes (throughput is around
2.2GB/sec), find_busiest_group() is using > 10% of the CPU time. This is
a 64-thread box.

The top two profile entries are:

    10.86%      fio  [kernel]                [k] find_busiest_group
                |          
                |--99.91%-- thread_return
                |          io_schedule
                |          sys_io_getevents
                |          system_call_fastpath
                |          0x7f4b50b61604
                |          |          
                |           --100.00%-- td_io_getevents
                |                     io_u_queued_complete
                |                     thread_main
                |                     run_threads
                |                     main
                |                     __libc_start_main
                 --0.09%-- [...]

     5.78%      fio  [kernel]                [k] cpumask_next_and
                |          
                |--67.21%-- thread_return
                |          io_schedule
                |          sys_io_getevents
                |          system_call_fastpath
                |          0x7f4b50b61604
                |          |          
                |           --100.00%-- td_io_getevents
                |                     io_u_queued_complete
                |                     thread_main
                |                     run_threads
                |                     main
                |                     __libc_start_main
                |          
                 --32.79%-- find_busiest_group
                           thread_return
                           io_schedule
                           sys_io_getevents
                           system_call_fastpath
                           0x7f4b50b61604
                           |          
                            --100.00%-- td_io_getevents
                                      io_u_queued_complete
                                      thread_main
                                      run_threads
                                      main
                                      __libc_start_main

This is with SCHED_DEBUG=y and SCHEDSTATS=y enabled; I just tried with
both disabled, but that yields the same result (well, actually worse,
22% spent in there; dunno if that's normal "fluctuation"). GROUP_SCHED
is not set. This seems way excessive!

-- 
Jens Axboe



* Re: find_busiest_group using lots of CPU
  2009-09-30  8:18 find_busiest_group using lots of CPU Jens Axboe
@ 2009-10-05 12:31 ` Peter Zijlstra
  2009-10-06  7:51   ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2009-10-05 12:31 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel, mingo

On Wed, 2009-09-30 at 10:18 +0200, Jens Axboe wrote:
> Hi,
> 
> I stuffed a few more SSDs into my test box. Running a simple workload
> that just does streaming reads from 10 processes (throughput is around
> 2.2GB/sec), find_busiest_group() is using > 10% of the CPU time. This is
> a 64-thread box.
> 
> The top two profile entries are:
> 
>     10.86%      fio  [kernel]                [k] find_busiest_group
>                 |          
>                 |--99.91%-- thread_return
>                 |          io_schedule
>                 |          sys_io_getevents
>                 |          system_call_fastpath
>                 |          0x7f4b50b61604
>                 |          |          
>                 |           --100.00%-- td_io_getevents
>                 |                     io_u_queued_complete
>                 |                     thread_main
>                 |                     run_threads
>                 |                     main
>                 |                     __libc_start_main
>                  --0.09%-- [...]
> 
>      5.78%      fio  [kernel]                [k] cpumask_next_and
>                 |          
>                 |--67.21%-- thread_return
>                 |          io_schedule
>                 |          sys_io_getevents
>                 |          system_call_fastpath
>                 |          0x7f4b50b61604
>                 |          |          
>                 |           --100.00%-- td_io_getevents
>                 |                     io_u_queued_complete
>                 |                     thread_main
>                 |                     run_threads
>                 |                     main
>                 |                     __libc_start_main
>                 |          
>                  --32.79%-- find_busiest_group
>                            thread_return
>                            io_schedule
>                            sys_io_getevents
>                            system_call_fastpath
>                            0x7f4b50b61604
>                            |          
>                             --100.00%-- td_io_getevents
>                                       io_u_queued_complete
>                                       thread_main
>                                       run_threads
>                                       main
>                                       __libc_start_main
> 
> This is with SCHED_DEBUG=y and SCHEDSTATS=y enabled; I just tried with
> both disabled, but that yields the same result (well, actually worse,
> 22% spent in there; dunno if that's normal "fluctuation"). GROUP_SCHED
> is not set. This seems way excessive!

io_schedule() straight into find_busiest_group() leads me to think this
could be SD_BALANCE_NEWIDLE, does something like:

for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; 
do 
	val=`cat $i`; echo $((val & ~0x02)) > $i; 
done

[ assuming SCHED_DEBUG=y ]

Cure things?

If so, then it's spending time looking for work, which there might not be
on your machine, since everything is waiting for IO or somesuch.

Not really sure what to do about it, though; this is a quad-socket
Nehalem, right? We could possibly disable SD_BALANCE_NEWIDLE at the
NODE level, but that would again decrease throughput in things like
kbuild.
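
As an aside, if it helps with testing, the old values can be saved first
and restored afterwards; a rough (untested) sketch, using the same paths
and the same 0x02 == SD_BALANCE_NEWIDLE bit as above:

SAVE=/tmp/sd-flags.save
: > $SAVE
for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags;
do
	val=`cat $i`; echo "$i $val" >> $SAVE;    # remember old value
	echo $((val & ~0x02)) > $i;               # clear SD_BALANCE_NEWIDLE
done

# later, to put things back:
while read path val; do echo $val > $path; done < $SAVE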



* Re: find_busiest_group using lots of CPU
  2009-10-05 12:31 ` Peter Zijlstra
@ 2009-10-06  7:51   ` Jens Axboe
  2009-10-06 11:20     ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2009-10-06  7:51 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Linux Kernel, mingo

On Mon, Oct 05 2009, Peter Zijlstra wrote:
> On Wed, 2009-09-30 at 10:18 +0200, Jens Axboe wrote:
> > Hi,
> > 
> > I stuffed a few more SSDs into my test box. Running a simple workload
> > that just does streaming reads from 10 processes (throughput is around
> > 2.2GB/sec), find_busiest_group() is using > 10% of the CPU time. This is
> > a 64-thread box.
> > 
> > The top two profile entries are:
> > 
> >     10.86%      fio  [kernel]                [k] find_busiest_group
> >                 |          
> >                 |--99.91%-- thread_return
> >                 |          io_schedule
> >                 |          sys_io_getevents
> >                 |          system_call_fastpath
> >                 |          0x7f4b50b61604
> >                 |          |          
> >                 |           --100.00%-- td_io_getevents
> >                 |                     io_u_queued_complete
> >                 |                     thread_main
> >                 |                     run_threads
> >                 |                     main
> >                 |                     __libc_start_main
> >                  --0.09%-- [...]
> > 
> >      5.78%      fio  [kernel]                [k] cpumask_next_and
> >                 |          
> >                 |--67.21%-- thread_return
> >                 |          io_schedule
> >                 |          sys_io_getevents
> >                 |          system_call_fastpath
> >                 |          0x7f4b50b61604
> >                 |          |          
> >                 |           --100.00%-- td_io_getevents
> >                 |                     io_u_queued_complete
> >                 |                     thread_main
> >                 |                     run_threads
> >                 |                     main
> >                 |                     __libc_start_main
> >                 |          
> >                  --32.79%-- find_busiest_group
> >                            thread_return
> >                            io_schedule
> >                            sys_io_getevents
> >                            system_call_fastpath
> >                            0x7f4b50b61604
> >                            |          
> >                             --100.00%-- td_io_getevents
> >                                       io_u_queued_complete
> >                                       thread_main
> >                                       run_threads
> >                                       main
> >                                       __libc_start_main
> > 
> > This is with SCHED_DEBUG=y and SCHEDSTATS=y enabled; I just tried with
> > both disabled, but that yields the same result (well, actually worse,
> > 22% spent in there; dunno if that's normal "fluctuation"). GROUP_SCHED
> > is not set. This seems way excessive!
> 
> io_schedule() straight into find_busiest_group() leads me to think this
> could be SD_BALANCE_NEWIDLE, does something like:
> 
> for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; 
> do 
> 	val=`cat $i`; echo $((val & ~0x02)) > $i; 
> done
> 
> [ assuming SCHED_DEBUG=y ]
> 
> Cure things?

I can try; as mentioned, it doesn't look any better with SCHED_DEBUG=n.

> If so, then it's spending time looking for work, which there might not be
> on your machine, since everything is waiting for IO or somesuch.

OK, it just seems way excessive for something that is only 10 tasks and
not even that context-switch intensive.

> Not really sure what to do about it, though; this is a quad-socket
> Nehalem, right? We could possibly disable SD_BALANCE_NEWIDLE at the
> NODE level, but that would again decrease throughput in things like
> kbuild.

Yes, it's a quad-socket Nehalem. I'll see if disabling NEWIDLE makes a
difference; I need to run some other tests on that box today anyway.

-- 
Jens Axboe



* Re: find_busiest_group using lots of CPU
  2009-10-06  7:51   ` Jens Axboe
@ 2009-10-06 11:20     ` Jens Axboe
  2009-10-06 11:47       ` Ingo Molnar
  2009-10-06 12:04       ` Peter Zijlstra
  0 siblings, 2 replies; 8+ messages in thread
From: Jens Axboe @ 2009-10-06 11:20 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Linux Kernel, mingo

On Tue, Oct 06 2009, Jens Axboe wrote:
> On Mon, Oct 05 2009, Peter Zijlstra wrote:
> > On Wed, 2009-09-30 at 10:18 +0200, Jens Axboe wrote:
> > > Hi,
> > > 
> > > I stuffed a few more SSDs into my test box. Running a simple workload
> > > that just does streaming reads from 10 processes (throughput is around
> > > 2.2GB/sec), find_busiest_group() is using > 10% of the CPU time. This is
> > > a 64-thread box.
> > > 
> > > The top two profile entries are:
> > > 
> > >     10.86%      fio  [kernel]                [k] find_busiest_group
> > >                 |          
> > >                 |--99.91%-- thread_return
> > >                 |          io_schedule
> > >                 |          sys_io_getevents
> > >                 |          system_call_fastpath
> > >                 |          0x7f4b50b61604
> > >                 |          |          
> > >                 |           --100.00%-- td_io_getevents
> > >                 |                     io_u_queued_complete
> > >                 |                     thread_main
> > >                 |                     run_threads
> > >                 |                     main
> > >                 |                     __libc_start_main
> > >                  --0.09%-- [...]
> > > 
> > >      5.78%      fio  [kernel]                [k] cpumask_next_and
> > >                 |          
> > >                 |--67.21%-- thread_return
> > >                 |          io_schedule
> > >                 |          sys_io_getevents
> > >                 |          system_call_fastpath
> > >                 |          0x7f4b50b61604
> > >                 |          |          
> > >                 |           --100.00%-- td_io_getevents
> > >                 |                     io_u_queued_complete
> > >                 |                     thread_main
> > >                 |                     run_threads
> > >                 |                     main
> > >                 |                     __libc_start_main
> > >                 |          
> > >                  --32.79%-- find_busiest_group
> > >                            thread_return
> > >                            io_schedule
> > >                            sys_io_getevents
> > >                            system_call_fastpath
> > >                            0x7f4b50b61604
> > >                            |          
> > >                             --100.00%-- td_io_getevents
> > >                                       io_u_queued_complete
> > >                                       thread_main
> > >                                       run_threads
> > >                                       main
> > >                                       __libc_start_main
> > > 
> > > This is with SCHED_DEBUG=y and SCHEDSTATS=y enabled; I just tried with
> > > both disabled, but that yields the same result (well, actually worse,
> > > 22% spent in there; dunno if that's normal "fluctuation"). GROUP_SCHED
> > > is not set. This seems way excessive!
> > 
> > io_schedule() straight into find_busiest_group() leads me to think this
> > could be SD_BALANCE_NEWIDLE, does something like:
> > 
> > for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; 
> > do 
> > 	val=`cat $i`; echo $((val & ~0x02)) > $i; 
> > done
> > 
> > [ assuming SCHED_DEBUG=y ]
> > 
> > Cure things?
> 
> I can try; as mentioned, it doesn't look any better with SCHED_DEBUG=n.

It does, it's gone from the profiles.

-- 
Jens Axboe



* Re: find_busiest_group using lots of CPU
  2009-10-06 11:20     ` Jens Axboe
@ 2009-10-06 11:47       ` Ingo Molnar
  2009-10-06 11:56         ` Jens Axboe
  2009-10-06 12:04       ` Peter Zijlstra
  1 sibling, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2009-10-06 11:47 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Peter Zijlstra, Linux Kernel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Tue, Oct 06 2009, Jens Axboe wrote:
> > On Mon, Oct 05 2009, Peter Zijlstra wrote:
> > > On Wed, 2009-09-30 at 10:18 +0200, Jens Axboe wrote:
> > > > Hi,
> > > > 
> > > > I stuffed a few more SSDs into my test box. Running a simple workload
> > > > that just does streaming reads from 10 processes (throughput is around
> > > > 2.2GB/sec), find_busiest_group() is using > 10% of the CPU time. This is
> > > > a 64-thread box.
> > > > 
> > > > The top two profile entries are:
> > > > 
> > > >     10.86%      fio  [kernel]                [k] find_busiest_group
> > > >                 |          
> > > >                 |--99.91%-- thread_return
> > > >                 |          io_schedule
> > > >                 |          sys_io_getevents
> > > >                 |          system_call_fastpath
> > > >                 |          0x7f4b50b61604
> > > >                 |          |          
> > > >                 |           --100.00%-- td_io_getevents
> > > >                 |                     io_u_queued_complete
> > > >                 |                     thread_main
> > > >                 |                     run_threads
> > > >                 |                     main
> > > >                 |                     __libc_start_main
> > > >                  --0.09%-- [...]
> > > > 
> > > >      5.78%      fio  [kernel]                [k] cpumask_next_and
> > > >                 |          
> > > >                 |--67.21%-- thread_return
> > > >                 |          io_schedule
> > > >                 |          sys_io_getevents
> > > >                 |          system_call_fastpath
> > > >                 |          0x7f4b50b61604
> > > >                 |          |          
> > > >                 |           --100.00%-- td_io_getevents
> > > >                 |                     io_u_queued_complete
> > > >                 |                     thread_main
> > > >                 |                     run_threads
> > > >                 |                     main
> > > >                 |                     __libc_start_main
> > > >                 |          
> > > >                  --32.79%-- find_busiest_group
> > > >                            thread_return
> > > >                            io_schedule
> > > >                            sys_io_getevents
> > > >                            system_call_fastpath
> > > >                            0x7f4b50b61604
> > > >                            |          
> > > >                             --100.00%-- td_io_getevents
> > > >                                       io_u_queued_complete
> > > >                                       thread_main
> > > >                                       run_threads
> > > >                                       main
> > > >                                       __libc_start_main
> > > > 
> > > > This is with SCHED_DEBUG=y and SCHEDSTATS=y enabled; I just tried with
> > > > both disabled, but that yields the same result (well, actually worse,
> > > > 22% spent in there; dunno if that's normal "fluctuation"). GROUP_SCHED
> > > > is not set. This seems way excessive!
> > > 
> > > io_schedule() straight into find_busiest_group() leads me to think this
> > > could be SD_BALANCE_NEWIDLE, does something like:
> > > 
> > > for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; 
> > > do 
> > > 	val=`cat $i`; echo $((val & ~0x02)) > $i; 
> > > done
> > > 
> > > [ assuming SCHED_DEBUG=y ]
> > > 
> > > Cure things?
> > 
> > I can try; as mentioned, it doesn't look any better with SCHED_DEBUG=n.
> 
> It does, it's gone from the profiles.

Peter mentioned SCHED_DEBUG=y to have 
/proc/sys/kernel/sched_domain/cpu*/domain*/flags available.
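
A quick way to check is something like the below (with SCHED_DEBUG=n the
sched_domain directory simply isn't created):

  ls /proc/sys/kernel/sched_domain/cpu0/ 2>/dev/null \
      || echo "no sched_domain sysctls - SCHED_DEBUG not set?"
  grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags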

	Ingo


* Re: find_busiest_group using lots of CPU
  2009-10-06 11:47       ` Ingo Molnar
@ 2009-10-06 11:56         ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2009-10-06 11:56 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, Linux Kernel

On Tue, Oct 06 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Tue, Oct 06 2009, Jens Axboe wrote:
> > > On Mon, Oct 05 2009, Peter Zijlstra wrote:
> > > > On Wed, 2009-09-30 at 10:18 +0200, Jens Axboe wrote:
> > > > > Hi,
> > > > > 
> > > > > I stuffed a few more SSDs into my test box. Running a simple workload
> > > > > that just does streaming reads from 10 processes (throughput is around
> > > > > 2.2GB/sec), find_busiest_group() is using > 10% of the CPU time. This is
> > > > > a 64-thread box.
> > > > > 
> > > > > The top two profile entries are:
> > > > > 
> > > > >     10.86%      fio  [kernel]                [k] find_busiest_group
> > > > >                 |          
> > > > >                 |--99.91%-- thread_return
> > > > >                 |          io_schedule
> > > > >                 |          sys_io_getevents
> > > > >                 |          system_call_fastpath
> > > > >                 |          0x7f4b50b61604
> > > > >                 |          |          
> > > > >                 |           --100.00%-- td_io_getevents
> > > > >                 |                     io_u_queued_complete
> > > > >                 |                     thread_main
> > > > >                 |                     run_threads
> > > > >                 |                     main
> > > > >                 |                     __libc_start_main
> > > > >                  --0.09%-- [...]
> > > > > 
> > > > >      5.78%      fio  [kernel]                [k] cpumask_next_and
> > > > >                 |          
> > > > >                 |--67.21%-- thread_return
> > > > >                 |          io_schedule
> > > > >                 |          sys_io_getevents
> > > > >                 |          system_call_fastpath
> > > > >                 |          0x7f4b50b61604
> > > > >                 |          |          
> > > > >                 |           --100.00%-- td_io_getevents
> > > > >                 |                     io_u_queued_complete
> > > > >                 |                     thread_main
> > > > >                 |                     run_threads
> > > > >                 |                     main
> > > > >                 |                     __libc_start_main
> > > > >                 |          
> > > > >                  --32.79%-- find_busiest_group
> > > > >                            thread_return
> > > > >                            io_schedule
> > > > >                            sys_io_getevents
> > > > >                            system_call_fastpath
> > > > >                            0x7f4b50b61604
> > > > >                            |          
> > > > >                             --100.00%-- td_io_getevents
> > > > >                                       io_u_queued_complete
> > > > >                                       thread_main
> > > > >                                       run_threads
> > > > >                                       main
> > > > >                                       __libc_start_main
> > > > > 
> > > > > This is with SCHED_DEBUG=y and SCHEDSTATS=y enabled; I just tried with
> > > > > both disabled, but that yields the same result (well, actually worse,
> > > > > 22% spent in there; dunno if that's normal "fluctuation"). GROUP_SCHED
> > > > > is not set. This seems way excessive!
> > > > 
> > > > io_schedule() straight into find_busiest_group() leads me to think this
> > > > could be SD_BALANCE_NEWIDLE, does something like:
> > > > 
> > > > for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; 
> > > > do 
> > > > 	val=`cat $i`; echo $((val & ~0x02)) > $i; 
> > > > done
> > > > 
> > > > [ assuming SCHED_DEBUG=y ]
> > > > 
> > > > Cure things?
> > > 
> > > I can try; as mentioned, it doesn't look any better with SCHED_DEBUG=n.
> > 
> > It does, it's gone from the profiles.
> 
> Peter mentioned SCHED_DEBUG=y to have 
> /proc/sys/kernel/sched_domain/cpu*/domain*/flags available.

In case it wasn't clear, what I did was run the 'for...' loop from Peter
(on a kernel with SCHED_DEBUG=y) and it cured the "problem". Peter
asked whether it cured things, to which I replied "It does, it's gone
from the profiles".

-- 
Jens Axboe



* Re: find_busiest_group using lots of CPU
  2009-10-06 11:20     ` Jens Axboe
  2009-10-06 11:47       ` Ingo Molnar
@ 2009-10-06 12:04       ` Peter Zijlstra
  2009-10-06 12:14         ` Ingo Molnar
  1 sibling, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2009-10-06 12:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Linux Kernel, mingo, Arnaldo Carvalho de Melo

On Tue, 2009-10-06 at 13:20 +0200, Jens Axboe wrote:
> > > io_schedule() straight into find_busiest_group() leads me to think this
> > > could be SD_BALANCE_NEWIDLE, does something like:
> > > 
> > > for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; 
> > > do 
> > >     val=`cat $i`; echo $((val & ~0x02)) > $i; 
> > > done
> > > 
> > > [ assuming SCHED_DEBUG=y ]
> > > 
> > > Cure things?
> > 
> > I can try,
> 
> It does, it's gone from the profiles.

Hrmm, ok. I'll go prod at the NEWIDLE bits to see if anything stands
out.

If only perf-annotate could generate annotated C... maybe I'll look at
that.

Thanks!



* Re: find_busiest_group using lots of CPU
  2009-10-06 12:04       ` Peter Zijlstra
@ 2009-10-06 12:14         ` Ingo Molnar
  0 siblings, 0 replies; 8+ messages in thread
From: Ingo Molnar @ 2009-10-06 12:14 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Jens Axboe, Linux Kernel, Arnaldo Carvalho de Melo


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> On Tue, 2009-10-06 at 13:20 +0200, Jens Axboe wrote:
> > > > io_schedule() straight into find_busiest_group() leads me to think this
> > > > could be SD_BALANCE_NEWIDLE, does something like:
> > > > 
> > > > for i in /proc/sys/kernel/sched_domain/cpu*/domain*/flags; 
> > > > do 
> > > >     val=`cat $i`; echo $((val & ~0x02)) > $i; 
> > > > done
> > > > 
> > > > [ assuming SCHED_DEBUG=y ]
> > > > 
> > > > Cure things?
> > > 
> > > I can try,
> > 
> > It does, it's gone from the profiles.
> 
> Hrmm,. ok. I'll go prod at the NEWIDLE bits to see if anything stands
> out.
> 
> If only perf-annotate could generate annotated C... maybe I'll look at 
> that.

hey, it can do that: build a vmlinux with DEBUG_INFO=y and it should
just work :-)
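
Something along these lines should then show source interleaved with the
asm (the vmlinux path is just a placeholder):

  perf annotate -k /path/to/vmlinux find_busiest_group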

	Ingo

