All of lore.kernel.org
* balance storm
@ 2014-05-26  3:04 Libo Chen
  2014-05-26  5:11 ` Mike Galbraith
  2014-05-26  7:56 ` Mike Galbraith
  0 siblings, 2 replies; 33+ messages in thread
From: Libo Chen @ 2014-05-26  3:04 UTC (permalink / raw)
  To: tglx, mingo, LKML; +Cc: Greg KH, Li Zefan

hi,
    my box has 16 CPUs (E5-2658: 8 cores, 2 threads per core). I ran a test on
3.4.24 stable: start 50 copies of the same process, where every process is this sample:

	#include <unistd.h>

	int main(void)
	{
		for (;;) {
			unsigned int i = 0;

			while (i < 100)
				i++;

			usleep(100);
		}

		return 0;
	}

the result is that each process uses 15% CPU time, and the perf tool shows ~700,000 migrations in 5 seconds.

  	PID USER      PR  NI  VIRT  RES  SHR S   %CPU %MEM    TIME+  COMMAND
 	4374 root      20   0  6020  332  256 S     15  0.0   0:03.73 a2.out
	4371 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
	4373 root      20   0  6020  332  256 R     15  0.0   0:03.72 a2.out
 	4377 root      20   0  6020  332  256 R     15  0.0   0:03.72 a2.out
 	4389 root      20   0  6020  328  256 S     15  0.0   0:03.71 a2.out
 	4391 root      20   0  6020  332  256 S     15  0.0   0:03.72 a2.out
 	4394 root      20   0  6020  332  256 S     15  0.0   0:03.70 a2.out
 	4398 root      20   0  6020  328  256 S     15  0.0   0:03.71 a2.out
 	4403 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
 	4405 root      20   0  6020  328  256 S     15  0.0   0:03.72 a2.out
 	4407 root      20   0  6020  332  256 S     15  0.0   0:03.73 a2.out
 	4369 root      20   0  6020  332  256 S     15  0.0   0:03.72 a2.out
 	4370 root      20   0  6020  332  256 S     15  0.0   0:03.70 a2.out
 	4372 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
 	4375 root      20   0  6020  332  256 S     15  0.0   0:03.70 a2.out
 	4378 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
 	4379 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
 	4380 root      20   0  6020  332  256 S     15  0.0   0:03.72 a2.out
 	4381 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
 	4383 root      20   0  6020  332  256 S     15  0.0   0:03.69 a2.out
 	4384 root      20   0  6020  332  256 S     15  0.0   0:03.72 a2.out
 	4386 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
 	4387 root      20   0  6020  328  256 S     15  0.0   0:03.70 a2.out
 	4388 root      20   0  6020  332  256 R     15  0.0   0:03.72 a2.out
 	4390 root      20   0  6020  332  256 S     15  0.0   0:03.70 a2.out
 	4392 root      20   0  6020  332  256 S     15  0.0   0:03.72 a2.out
 	4393 root      20   0  6020  332  256 S     15  0.0   0:03.72 a2.out
 	4395 root      20   0  6020  332  256 S     15  0.0   0:03.70 a2.out
 	4396 root      20   0  6020  328  256 S     15  0.0   0:03.71 a2.out
 	4397 root      20   0  6020  332  256 S     15  0.0   0:03.70 a2.out
 	4399 root      20   0  6020  332  256 R     15  0.0   0:03.72 a2.out
 	4400 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
 	4402 root      20   0  6020  332  256 S     15  0.0   0:03.70 a2.out
 	4404 root      20   0  6020  332  256 R     15  0.0   0:03.69 a2.out
 	4406 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out
 	4408 root      20   0  6020  328  256 R     15  0.0   0:03.71 a2.out
 	4409 root      20   0  6020  332  256 R     15  0.0   0:03.71 a2.out
 	4410 root      20   0  6020  328  256 S     15  0.0   0:03.72 a2.out
 	4411 root      20   0  6020  332  256 S     15  0.0   0:03.71 a2.out

===========================================================================

when I revert commit 908a3283728d92df36e0c7cd63304fd35e93a8a9:

	diff --git a/kernel/sched.c b/kernel/sched.c
	index 1874c74..4cdc91c 100644
	--- a/kernel/sched.c
	+++ b/kernel/sched.c
	@@ -5138,7 +5138,20 @@ EXPORT_SYMBOL(task_nice);
 	 */
 	int idle_cpu(int cpu)
 	{
	-       return cpu_curr(cpu) == cpu_rq(cpu)->idle;
	+       struct rq *rq = cpu_rq(cpu);
	+
	+       if (rq->curr != rq->idle)
	+               return 0;
	+
	+       if (rq->nr_running)
	+               return 0;
	+
	+#ifdef CONFIG_SMP
	+       if (!llist_empty(&rq->wake_list))
	+               return 0;
	+#endif
	+
	+       return 1;
	}

 	the result is that each process uses 3-5% CPU time, and the perf tool shows only ~1k migrations in 5 seconds.

 	4444 root      20   0  6020  328  256 S      5  0.0   2:18.49 a2.out
 	4469 root      20   0  6020  328  256 S      5  0.0   2:15.93 a2.out
 	4423 root      20   0  6020  328  256 S      5  0.0   2:14.36 a2.out
 	4433 root      20   0  6020  332  256 S      5  0.0   2:15.81 a2.out
 	4466 root      20   0  6020  328  256 S      4  0.0   2:17.62 a2.out
	4428 root      20   0  6020  332  256 S      4  0.0   2:13.92 a2.out
	4457 root      20   0  6020  332  256 R      4  0.0   2:15.30 a2.out
	4429 root      20   0  6020  328  256 R      4  0.0   2:17.13 a2.out
	4431 root      20   0  6020  332  256 S      3  0.0   2:15.91 a2.out
	4438 root      20   0  6020  332  256 S      3  0.0   2:14.04 a2.out
	4439 root      20   0  6020  332  256 S      3  0.0   2:15.94 a2.out
	4462 root      20   0  6020  332  256 R      3  0.0   2:16.40 a2.out
 	4422 root      20   0  6020  328  256 S      3  0.0   2:17.41 a2.out
	4434 root      20   0  6020  332  256 R      3  0.0   2:15.67 a2.out
	4440 root      20   0  6020  332  256 S      3  0.0   2:14.40 a2.out
 	4447 root      20   0  6020  332  256 S      3  0.0   2:16.02 a2.out
 	4448 root      20   0  6020  332  256 S      3  0.0   2:16.40 a2.out
 	4453 root      20   0  6020  332  256 R      3  0.0   2:15.75 a2.out
	4459 root      20   0  6020  328  256 S      3  0.0   2:16.66 a2.out
	4461 root      20   0  6020  332  256 S      3  0.0   2:15.77 a2.out
 	4471 root      20   0  6020  328  256 S      3  0.0   2:20.68 a2.out
 	4424 root      20   0  6020  328  256 S      3  0.0   2:15.90 a2.out
 	4427 root      20   0  6020  332  256 S      3  0.0   2:14.28 a2.out
 	4432 root      20   0  6020  332  256 S      3  0.0   2:14.63 a2.out
 	4435 root      20   0  6020  328  256 S      3  0.0   2:15.32 a2.out
 	4436 root      20   0  6020  328  256 S      3  0.0   2:15.40 a2.out
 	4437 root      20   0  6020  332  256 S      3  0.0   2:15.42 a2.out
 	4441 root      20   0  6020  332  256 S      3  0.0   2:18.59 a2.out
 	4443 root      20   0  6020  332  256 S      3  0.0   2:14.82 a2.out
 	4445 root      20   0  6020  332  256 R      3  0.0   2:13.12 a2.out
 	4449 root      20   0  6020  332  256 R      3  0.0   2:21.37 a2.out
 	4450 root      20   0  6020  332  256 S      3  0.0   2:15.78 a2.out
 	4451 root      20   0  6020  332  256 S      3  0.0   2:16.25 a2.out
 	4455 root      20   0  6020  332  256 S      3  0.0   2:18.58 a2.out
 	4456 root      20   0  6020  332  256 S      3  0.0   2:16.37 a2.out
 	4458 root      20   0  6020  328  256 S      3  0.0   2:18.03 a2.out
 	4460 root      20   0  6020  332  256 S      3  0.0   2:14.04 a2.out
 	4463 root      20   0  6020  328  256 S      3  0.0   2:16.74 a2.out
 	4464 root      20   0  6020  328  256 S      3  0.0   2:18.11 a2.out

I guess task migration takes up a lot of CPU, so I did another test: I used the
taskset tool to bind each task to a fixed CPU. The result was in line with
expectations: CPU usage dropped to 5%.
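
[Editor's note: a minimal sketch of that pinning experiment, assuming the binary
is named a2.out as in the top output above and that the box exposes 16 CPUs. It
is a dry run that only echoes the commands:]

```shell
# Dry-run sketch of the taskset experiment: spread 50 copies of the
# benchmark round-robin across the 16 CPUs.  Pipe the output to `sh`
# to actually launch the pinned copies.  Migrations can then be
# counted the way Mike does later in the thread:
#   perf stat -a -e sched:sched_migrate_task -- sleep 5
NCPUS=16
for i in $(seq 0 49); do
    echo "taskset -c $((i % NCPUS)) ./a2.out &"
done
```

With each copy pinned, select_idle_sibling() has nothing to bounce, which
matches the ~5% usage reported above.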

other tests:
- 3.15 upstream has the same problem as 3.4.24.
- SUSE SP2 has low CPU usage, about 5%.

So I think 15% CPU usage and this rate of migration events are too high. How can this be fixed?


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-26  3:04 balance storm Libo Chen
@ 2014-05-26  5:11 ` Mike Galbraith
  2014-05-26 12:16   ` Libo Chen
  2014-05-26  7:56 ` Mike Galbraith
  1 sibling, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-26  5:11 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan

On Mon, 2014-05-26 at 11:04 +0800, Libo Chen wrote: 
> hi,
>     my box has 16 cpu (E5-2658,8 core, 2 thread per core), i did a test on
> 3.4.24stable, startup 50 same process, every process is sample:
> 
>  	#include <unistd.h>
> 
>  	int main()
>  	{
>           	for (;;)
>           	{
>                   	unsigned int i = 0;
>                  	 while (i< 100){
>                      	 i++;
>                   	}
>                   	usleep(100);
>           	}
> 
>          	 return 0;
>   	}
> 
> the result is process uses 15% cpu time, perf tool shows 70w migrations in 5 second.

See e0a79f52 sched: Fix select_idle_sibling() bouncing cow syndrome

That commit will fix expensive as hell bouncing for most real loads, but
it won't fix your test.  Doing nothing but wake, select_idle_sibling()
will be traversing all cores/siblings mightily, taking L2 misses as it
traverses, bouncing wakees that do _nothing_ when an idle CPU is found.

Your synthetic test is the absolute worst case scenario.  There has to
be work between wakeups for select_idle_sibling() to have any chance
whatsoever of turning in a win.  At 0 work, it becomes 100% overhead.

> I guess task migration takes up a lot of cpu, so i did another test. use taskset tool to bind
> a task to a fixed cpu. Results in line with expectations, cpu usage is down to 5%.
> 
> other test:
> - 3.15upstream has the same problem with 3.4.24.
> - suse sp2 has low cpu usage about 5%.

SLE11-SP2 has a patch which fixes that behavior, but of course at the
expense of other load types.  A trade.  It also throttles nohz, which
can have substantial cost when cross CPU scheduling.

> so I think 15% cpu usage and migration event are too high, how to fixed?

You can't for free, low latency wakeup can be worth one hell of a lot.

You could do a decayed hit/miss or such to shut the thing off when the
price is just too high.  Restricting migrations per unit time per task
also helps cut the cost, but hurts tasks that could have gotten to the
CPU quicker, and started your next bit of work.  Anything you do there
is going to be a rob Peter to pay Paul thing.

-Mike



* Re: balance storm
  2014-05-26  3:04 balance storm Libo Chen
  2014-05-26  5:11 ` Mike Galbraith
@ 2014-05-26  7:56 ` Mike Galbraith
  2014-05-26 11:49   ` Libo Chen
  1 sibling, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-26  7:56 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan

On Mon, 2014-05-26 at 11:04 +0800, Libo Chen wrote: 
> hi,
>     my box has 16 cpu (E5-2658,8 core, 2 thread per core), i did a test on
> 3.4.24stable, startup 50 same process, every process is sample:
> 
>  	#include <unistd.h>
> 
>  	int main()
>  	{
>           	for (;;)
>           	{
>                   	unsigned int i = 0;
>                  	 while (i< 100){
>                      	 i++;
>                   	}
>                   	usleep(100);
>           	}
> 
>          	 return 0;
>   	}
> 
> the result is process uses 15% cpu time, perf tool shows 70w migrations in 5 second.

My 8 socket 64 core DL980 running 256 copies (3.14-rt5) munches ~4%/copy
per top, and does roughly 1 sh*tload migrations, nano-work loop or not.
Turn SD_SHARE_PKG_RESOURCES off at MC (not a noop here), and consumption
drops to ~2%/copy, and migrations ('course) mostly go away.

vogelweide:/abuild/mike/:[0]# perf stat -a -e sched:sched_migrate_task -- sleep 5

 Performance counter stats for 'system wide':

              3108      sched:sched_migrate_task                                    

       5.001367910 seconds time elapsed

(turns SD_SHARE_PKG_RESOURCES back on)

vogelweide:/abuild/mike/:[0]# perf stat -a -e sched:sched_migrate_task -- sleep 5

 Performance counter stats for 'system wide':

           4182334      sched:sched_migrate_task                                    

       5.001365023 seconds time elapsed

vogelweide:/abuild/mike/:[0]# 



* Re: balance storm
  2014-05-26  7:56 ` Mike Galbraith
@ 2014-05-26 11:49   ` Libo Chen
  2014-05-26 14:03     ` Mike Galbraith
  2014-05-27  9:48     ` Peter Zijlstra
  0 siblings, 2 replies; 33+ messages in thread
From: Libo Chen @ 2014-05-26 11:49 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On 2014/5/26 15:56, Mike Galbraith wrote:
> On Mon, 2014-05-26 at 11:04 +0800, Libo Chen wrote: 
>> hi,
>>     my box has 16 cpu (E5-2658,8 core, 2 thread per core), i did a test on
>> 3.4.24stable, startup 50 same process, every process is sample:
>>
>>  	#include <unistd.h>
>>
>>  	int main()
>>  	{
>>           	for (;;)
>>           	{
>>                   	unsigned int i = 0;
>>                  	 while (i< 100){
>>                      	 i++;
>>                   	}
>>                   	usleep(100);
>>           	}
>>
>>          	 return 0;
>>   	}
>>
>> the result is process uses 15% cpu time, perf tool shows 70w migrations in 5 second.
> 
> My 8 socket 64 core DL980 running 256 copies (3.14-rt5) munches ~4%/copy
> per top, and does roughly 1 sh*tload migrations, nano-work loop or not.
> Turn SD_SHARE_PKG_RESOURCES off at MC (not a noop here), and consumption
> drops to ~2%/copy, and migrations ('course) mostly go away.

How do I turn off SD_SHARE_PKG_RESOURCES from userspace?

> 
> vogelweide:/abuild/mike/:[0]# perf stat -a -e sched:sched_migrate_task -- sleep 5
> 
>  Performance counter stats for 'system wide':
> 
>               3108      sched:sched_migrate_task                                    
> 
>        5.001367910 seconds time elapsed
> 
> (turns SD_SHARE_PKG_RESOURCES back on)
> 
> vogelweide:/abuild/mike/:[0]# perf stat -a -e sched:sched_migrate_task -- sleep 5
> 
>  Performance counter stats for 'system wide':
> 
>            4182334      sched:sched_migrate_task                                    
> 
>        5.001365023 seconds time elapsed
> 
> vogelweide:/abuild/mike/:[0]# 
> 
> 
> 




* Re: balance storm
  2014-05-26  5:11 ` Mike Galbraith
@ 2014-05-26 12:16   ` Libo Chen
  2014-05-26 14:19     ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Libo Chen @ 2014-05-26 12:16 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On 2014/5/26 13:11, Mike Galbraith wrote:
> On Mon, 2014-05-26 at 11:04 +0800, Libo Chen wrote: 
>> hi,
>>     my box has 16 cpu (E5-2658,8 core, 2 thread per core), i did a test on
>> 3.4.24stable, startup 50 same process, every process is sample:
>>
>>  	#include <unistd.h>
>>
>>  	int main()
>>  	{
>>           	for (;;)
>>           	{
>>                   	unsigned int i = 0;
>>                  	 while (i< 100){
>>                      	 i++;
>>                   	}
>>                   	usleep(100);
>>           	}
>>
>>          	 return 0;
>>   	}
>>
>> the result is process uses 15% cpu time, perf tool shows 70w migrations in 5 second.
> 
> See e0a79f52 sched: Fix select_idle_sibling() bouncing cow syndrome
> 
> That commit will fix expensive as hell bouncing for most real loads, but
> it won't fix your test.  Doing nothing but wake, select_idle_sibling()
> will be traversing all cores/siblings mightily, taking L2 misses as it
> traverses, bouncing wakees that do _nothing_ when an idle CPU is found.
> 
> Your synthetic test is the absolute worst case scenario.  There has to
> be work between wakeups for select_idle_sibling() to have any chance
> whatsoever of turning in a win.  At 0 work, it becomes 100% overhead.

It is not synthetic; it is a real problem in our product. Under no load, it
wastes a lot of CPU time.

> 
>> I guess task migration takes up a lot of cpu, so i did another test. use taskset tool to bind
>> a task to a fixed cpu. Results in line with expectations, cpu usage is down to 5%.
>>
>> other test:
>> - 3.15upstream has the same problem with 3.4.24.
>> - suse sp2 has low cpu usage about 5%.
> 
> SLE11-SP2 has a patch which fixes that behavior, but of course at the
> expense of other load types.  A trade.  It also throttles nohz, which
> can have substantial cost when cross CPU scheduling.

Which patch?

> 
>> so I think 15% cpu usage and migration event are too high, how to fixed?
> 
> You can't for free, low latency wakeup can be worth one hell of a lot.
> 
> You could do a decayed hit/miss or such to shut the thing off when the
> price is just too high.  Restricting migrations per unit time per task
> also helps cut the cost, but hurts tasks that could have gotten to the
> CPU quicker, and started your next bit of work.  Anything you do there
> is going to be a rob Peter to pay Paul thing.
> 

I have tried changing sched_migration_cost and sched_nr_migrate in /proc,
but it made no difference.  Any other suggestions?

I still think this is a problem in the scheduler.  It would be better to solve
this issue directly instead of using a workaround.


thanks,
Libo

> -Mike
> 
> 




* Re: balance storm
  2014-05-26 11:49   ` Libo Chen
@ 2014-05-26 14:03     ` Mike Galbraith
  2014-05-27  7:44       ` Libo Chen
  2014-05-27  9:48     ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-26 14:03 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On Mon, 2014-05-26 at 19:49 +0800, Libo Chen wrote:

> how to turn off SD_SHARE_PKG_RESOURCES in userspace ?

I use a script Ingo gave me years and years ago to
twiddle /proc/sys/kernel/sched_domain/cpuN/domainN/flags domain wise.
Doing that won't do you any good without a handler to build/tear down
sd_llc when you poke at flags though.  You can easily add a sched
feature to play with it.

-Mike



* Re: balance storm
  2014-05-26 12:16   ` Libo Chen
@ 2014-05-26 14:19     ` Mike Galbraith
  2014-05-27  7:56       ` Libo Chen
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-26 14:19 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On Mon, 2014-05-26 at 20:16 +0800, Libo Chen wrote: 
> On 2014/5/26 13:11, Mike Galbraith wrote:

> > Your synthetic test is the absolute worst case scenario.  There has to
> > be work between wakeups for select_idle_sibling() to have any chance
> > whatsoever of turning in a win.  At 0 work, it becomes 100% overhead.
> 
> not synthetic, it is a real problem in our product. under no load, waste
> much cpu time.

What happens in your product if you apply the commit I pointed out?

> >> so I think 15% cpu usage and migration event are too high, how to fixed?
> > 
> > You can't for free, low latency wakeup can be worth one hell of a lot.
> > 
> > You could do a decayed hit/miss or such to shut the thing off when the
> > price is just too high.  Restricting migrations per unit time per task
> > also helps cut the cost, but hurts tasks that could have gotten to the
> > CPU quicker, and started your next bit of work.  Anything you do there
> > is going to be a rob Peter to pay Paul thing.
> > 
> 
> I had tried to change sched_migration_cost and sched_nr_migrate in /proc,
> but no use.  any other  suggestion?
> 
> I still think this is a problem to schedular.  it is better to directly solve
> this issue instead of a workaroud

I didn't say it wasn't a problem, it is.  I said whatever you do will be
a tradeoff.

-Mike



* Re: balance storm
  2014-05-26 14:03     ` Mike Galbraith
@ 2014-05-27  7:44       ` Libo Chen
  2014-05-27  8:12         ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Libo Chen @ 2014-05-27  7:44 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On 2014/5/26 22:03, Mike Galbraith wrote:
> On Mon, 2014-05-26 at 19:49 +0800, Libo Chen wrote:
> 
>> how to turn off SD_SHARE_PKG_RESOURCES in userspace ?
> 
> I use a script Ingo gave me years and years ago to
> twiddle /proc/sys/kernel/sched_domain/cpuN/domainN/flags domain wise.
> Doing that won't do you any good without a handler to build/tear down
> sd_llc when you poke at flags though.  You can easily add a sched
> feature to play with it.


I made a simple script:

   for ((i=0;i<=15;i++))
   do
           echo 4143 > /proc/sys/kernel/sched_domain/cpu$i/domain1/flags
   done

In our kernel SD_SHARE_PKG_RESOURCES is 0x0200, and the original flag value is 4655;
domain1's name is MC.
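
[Editor's note: a quick sanity check on those numbers, clearing the 0x0200 bit from 4655:]

```shell
# 4655 is the original MC-domain flags value quoted above; clearing
# the SD_SHARE_PKG_RESOURCES bit (0x0200) gives the value the script
# writes back into /proc.
echo $(( 4655 & ~0x0200 ))   # prints 4143
```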

But the migration events don't drop the way yours did. What is the problem? I
would rather not recompile the kernel :(


> 
> -Mike
> 
> 
> 




* Re: balance storm
  2014-05-26 14:19     ` Mike Galbraith
@ 2014-05-27  7:56       ` Libo Chen
  2014-05-27  9:55         ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Libo Chen @ 2014-05-27  7:56 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On 2014/5/26 22:19, Mike Galbraith wrote:
> On Mon, 2014-05-26 at 20:16 +0800, Libo Chen wrote: 
>> On 2014/5/26 13:11, Mike Galbraith wrote:
> 
>>> Your synthetic test is the absolute worst case scenario.  There has to
>>> be work between wakeups for select_idle_sibling() to have any chance
>>> whatsoever of turning in a win.  At 0 work, it becomes 100% overhead.
>>
>> not synthetic, it is a real problem in our product. under no load, waste
>> much cpu time.
> 
> What happens in your product if you apply the commit I pointed out?

Under no load, CPU usage goes up to 60%, but the same apps cost 10% on
SUSE SP1.  The apps use a lot of timers.

I am not sure that commit is the root cause, but there is definitely a CPU usage
difference between 3.4.24 and SUSE SP1, e.g. in my synthetic test above.

> 
>>>> so I think 15% cpu usage and migration event are too high, how to fixed?
>>>
>>> You can't for free, low latency wakeup can be worth one hell of a lot.
>>>
>>> You could do a decayed hit/miss or such to shut the thing off when the
>>> price is just too high.  Restricting migrations per unit time per task
>>> also helps cut the cost, but hurts tasks that could have gotten to the
>>> CPU quicker, and started your next bit of work.  Anything you do there
>>> is going to be a rob Peter to pay Paul thing.
>>>
>>
>> I had tried to change sched_migration_cost and sched_nr_migrate in /proc,
>> but no use.  any other  suggestion?
>>
>> I still think this is a problem to schedular.  it is better to directly solve
>> this issue instead of a workaroud
> 
> I didn't say it wasn't a problem, it is.  I said whatever you do will be
> a tradeoff.
> 
> -Mike
> 
> 
> 




* Re: balance storm
  2014-05-27  7:44       ` Libo Chen
@ 2014-05-27  8:12         ` Mike Galbraith
  0 siblings, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2014-05-27  8:12 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On Tue, 2014-05-27 at 15:44 +0800, Libo Chen wrote: 
> On 2014/5/26 22:03, Mike Galbraith wrote:
> > On Mon, 2014-05-26 at 19:49 +0800, Libo Chen wrote:
> > 
> >> how to turn off SD_SHARE_PKG_RESOURCES in userspace ?
> > 
> > I use a script Ingo gave me years and years ago to
> > twiddle /proc/sys/kernel/sched_domain/cpuN/domainN/flags domain wise.
> > Doing that won't do you any good without a handler to build/tear down
> > sd_llc when you poke at flags though.  You can easily add a sched
> > feature to play with it.
> 
> 
> I make a simple script:
> 
>    for ((i=0;i<=15;i++))
>    do
>            echo 4143 > /proc/sys/kernel/sched_domain/cpu$i/domain1/flags
>    done
> 
> In our kernel  SD_SHARE_PKG_RESOURCE is 0x0200, the original flag value is 4655,
> domain1's name is MC.
> 
> but migrations event doesn't reduce like yours, what problem?  I wouldn't like
> recompile kernel :(

Hm, I thought you were a kernel hacker, but guess not since that would
be a really weird thing for a kernel hacker to say :)  Problem is that
there's no handler in your kernel to convert your flag poking to sd_llc
poking.

I could send you a patchlet, but that ain't gonna work, neither that nor
the commit I pointed out will seep into the kernel via osmosis.  There
should be a kernel hacker somewhere near you, look down the hall by the
water cooler, when you find one, he/she should be able to help.

-Mike



* Re: balance storm
  2014-05-26 11:49   ` Libo Chen
  2014-05-26 14:03     ` Mike Galbraith
@ 2014-05-27  9:48     ` Peter Zijlstra
  2014-05-27 10:05       ` Mike Galbraith
  2014-05-27 12:55       ` Libo Chen
  1 sibling, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-27  9:48 UTC (permalink / raw)
  To: Libo Chen; +Cc: Mike Galbraith, tglx, mingo, LKML, Greg KH, Li Zefan


On Mon, May 26, 2014 at 07:49:10PM +0800, Libo Chen wrote:
> On 2014/5/26 15:56, Mike Galbraith wrote:
> > On Mon, 2014-05-26 at 11:04 +0800, Libo Chen wrote: 
> >> hi,
> >>     my box has 16 cpu (E5-2658,8 core, 2 thread per core), i did a test on
> >> 3.4.24stable, startup 50 same process, every process is sample:
> >>
> >>  	#include <unistd.h>
> >>
> >>  	int main()
> >>  	{
> >>           	for (;;)
> >>           	{
> >>                   	unsigned int i = 0;
> >>                  	 while (i< 100){
> >>                      	 i++;
> >>                   	}
> >>                   	usleep(100);
> >>           	}
> >>
> >>          	 return 0;
> >>   	}
> >>
> >> the result is process uses 15% cpu time, perf tool shows 70w migrations in 5 second.
> > 
> > My 8 socket 64 core DL980 running 256 copies (3.14-rt5) munches ~4%/copy
> > per top, and does roughly 1 sh*tload migrations, nano-work loop or not.
> > Turn SD_SHARE_PKG_RESOURCES off at MC (not a noop here), and consumption
> > drops to ~2%/copy, and migrations ('course) mostly go away.

So: 

1) what kind of weird ass workload is that? Why are you waking up so
often to do no work?

2) turning share_pkg_resources on/off is a horrid hack whichever way
around you turn it.

So I suppose this is due to the select_idle_sibling() nonsense again,
where we assumes L3 is a fair compromise between cheap enough and
effective enough.

Of course, Intel keeps growing the cpu count covered by L3 to ridiculous
sizes; 8 cores is nowhere near their top silly, which shifts the
balance, and there's always going to be pathological cases (like the
proposed workload) where it's just always going to suck eggs.

Also, when running 50 such things on a 16 cpu machine, you get roughly 3
per cpu, since their runtime is stupid low, I would expect it to pretty
much always hit an idle cpu, which in turn should inhibit the migration.

Then again, maybe the timer slack is causing you grief, resulting in all
3 being woken at the same time, instead of having them staggered.
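
[Editor's note: the staggering idea could be sketched as a launcher along these
lines; STEP_US is a made-up per-copy offset and a2.out is the sample binary
from the start of the thread. It is a dry run that only echoes the commands:]

```shell
# Dry-run sketch of staggered startup: give each copy a different
# initial delay so the 100us sleep periods do not all expire in
# lockstep.  STEP_US is an assumed offset in microseconds, and
# usleep(1) here is the (non-POSIX) utility; any sub-second sleep
# works.  Pipe the output to `sh` to actually launch the copies.
NCOPIES=50
STEP_US=20
i=0
while [ "$i" -lt "$NCOPIES" ]; do
    echo "{ usleep $((i * STEP_US)); exec ./a2.out; } &"
    i=$((i + 1))
done
```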

In any case, I'm not sure what the 'regression' report is against, as
there's only a single kernel version mentioned: 3.4, and that's almost a
dinosaur.



* Re: balance storm
  2014-05-27  7:56       ` Libo Chen
@ 2014-05-27  9:55         ` Mike Galbraith
  2014-05-27 12:50           ` Libo Chen
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-27  9:55 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On Tue, 2014-05-27 at 15:56 +0800, Libo Chen wrote: 
> On 2014/5/26 22:19, Mike Galbraith wrote:
> > On Mon, 2014-05-26 at 20:16 +0800, Libo Chen wrote: 
> >> On 2014/5/26 13:11, Mike Galbraith wrote:
> > 
> >>> Your synthetic test is the absolute worst case scenario.  There has to
> >>> be work between wakeups for select_idle_sibling() to have any chance
> >>> whatsoever of turning in a win.  At 0 work, it becomes 100% overhead.
> >>
> >> not synthetic, it is a real problem in our product. under no load, waste
> >> much cpu time.
> > 
> > What happens in your product if you apply the commit I pointed out?
> 
> under no load, cpu usage is up to 60%, but the same apps cost 10% on
> susp sp1.  The apps use a lot of timer.

Something is rotten.  3.14-rt contains that commit, I ran your test with
256 threads on 64 core box, saw ~4%.

Putting master/nopreempt config on box and doing the same test, box is
chewing up truckloads of CPU, but not from migrations. 

perf top -g --sort=symbol

Samples: 7M of event 'cycles', Event count (approx.): 1316249172581
-   82.56%  [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
      - 96.59% __nanosleep_nocancel
           100.00% __libc_start_main
        2.88% __poll
           0
+    1.56%  [k] native_write_msr_safe
+    1.21%  [k] update_cfs_shares
+    0.92%  [k] __schedule
+    0.88%  [k] _raw_spin_lock
+    0.73%  [k] update_cfs_rq_blocked_load
+    0.62%  [k] idle_cpu
+    0.47%  [.] usleep
+    0.41%  [k] cpuidle_enter_state
+    0.37%  [k] set_task_cpu

Oh, 256 * usleep(100) is not a great idea.

	-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27  9:48     ` Peter Zijlstra
@ 2014-05-27 10:05       ` Mike Galbraith
  2014-05-27 10:43         ` Peter Zijlstra
  2014-05-27 12:55       ` Libo Chen
  1 sibling, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-27 10:05 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Libo Chen, tglx, mingo, LKML, Greg KH, Li Zefan

On Tue, 2014-05-27 at 11:48 +0200, Peter Zijlstra wrote:

> So I suppose this is due to the select_idle_sibling() nonsense again,
> where we assumes L3 is a fair compromise between cheap enough and
> effective enough.

Nodz.

> Of course, Intel keeps growing the cpu count covered by L3 to ridiculous
> sizes, 8 cores isn't nowhere near their top silly, which shifts the
> balance, and there's always going to be pathological cases (like the
> proposed workload) where its just always going to suck eggs.

Test is as pathological as it gets.  15 core + SMT wouldn't be pretty.

-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27 10:05       ` Mike Galbraith
@ 2014-05-27 10:43         ` Peter Zijlstra
  2014-05-27 10:55           ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-27 10:43 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: Libo Chen, tglx, mingo, LKML, Greg KH, Li Zefan

[-- Attachment #1: Type: text/plain, Size: 1199 bytes --]

On Tue, May 27, 2014 at 12:05:33PM +0200, Mike Galbraith wrote:
> On Tue, 2014-05-27 at 11:48 +0200, Peter Zijlstra wrote:
> 
> > So I suppose this is due to the select_idle_sibling() nonsense again,
> > where we assumes L3 is a fair compromise between cheap enough and
> > effective enough.
> 
> Nodz.
> 
> > Of course, Intel keeps growing the cpu count covered by L3 to ridiculous
> > sizes, 8 cores isn't nowhere near their top silly, which shifts the
> > balance, and there's always going to be pathological cases (like the
> > proposed workload) where its just always going to suck eggs.
> 
> Test is as pathological as it gets.  15 core + SMT wouldn't be pretty.

So one thing we could maybe do is measure the cost of
select_idle_sibling(), just like we do for idle_balance() and compare
this against the tasks avg runtime.

We can go all crazy and do reduced searches; like test every n-th cpu in
the mask, or make it statistical and do a full search ever n wakeups.

Not sure what's a good approach. But L3 spanning more and more CPUs is
not something that's going to get cured anytime soon I'm afraid.

Not to mention bloody SMT which makes the whole mess worse.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]
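The "test every n-th cpu in the mask" idea above can be sketched in plain C. This is a hypothetical user-space model with invented names, not the actual select_idle_sibling() code: it bounds the scan cost on a large LLC domain by probing only every n-th CPU from a rotating start point.

```c
#include <stdbool.h>

/*
 * Hypothetical model of a reduced idle-CPU search: instead of testing
 * every CPU sharing the L3, probe only every stride-th CPU starting
 * from `start`, so the scan cost stays bounded as LLC domains grow.
 * Returns the first idle CPU found, or -1 if none was probed idle.
 */
static int select_idle_reduced(const bool *idle, int ncpus,
			       int stride, int start)
{
	int probed, cpu;

	for (probed = 0, cpu = start % ncpus; probed * stride < ncpus;
	     probed++, cpu = (cpu + stride) % ncpus) {
		if (idle[cpu])
			return cpu;
	}
	return -1;		/* fall back to the previous CPU */
}
```

With stride 1 this degenerates to the full search; rotating `start` per wakeup makes the reduced probe statistical in the way described above.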

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27 10:43         ` Peter Zijlstra
@ 2014-05-27 10:55           ` Mike Galbraith
  2014-05-27 12:56             ` Libo Chen
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-27 10:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Libo Chen, tglx, mingo, LKML, Greg KH, Li Zefan

On Tue, 2014-05-27 at 12:43 +0200, Peter Zijlstra wrote: 
> On Tue, May 27, 2014 at 12:05:33PM +0200, Mike Galbraith wrote:
> > On Tue, 2014-05-27 at 11:48 +0200, Peter Zijlstra wrote:
> > 
> > > So I suppose this is due to the select_idle_sibling() nonsense again,
> > > where we assumes L3 is a fair compromise between cheap enough and
> > > effective enough.
> > 
> > Nodz.
> > 
> > > Of course, Intel keeps growing the cpu count covered by L3 to ridiculous
> > > sizes, 8 cores isn't nowhere near their top silly, which shifts the
> > > balance, and there's always going to be pathological cases (like the
> > > proposed workload) where its just always going to suck eggs.
> > 
> > Test is as pathological as it gets.  15 core + SMT wouldn't be pretty.
> 
> So one thing we could maybe do is measure the cost of
> select_idle_sibling(), just like we do for idle_balance() and compare
> this against the tasks avg runtime.
> 
> We can go all crazy and do reduced searches; like test every n-th cpu in
> the mask, or make it statistical and do a full search ever n wakeups.
> 
> Not sure what's a good approach. But L3 spanning more and more CPUs is
> not something that's going to get cured anytime soon I'm afraid.
> 
> Not to mention bloody SMT which makes the whole mess worse.

I think we should keep it dirt simple and above all dirt cheap.  The per
task migration cap per unit time should meet that bill, limit the damage
potential, while also limiting the good, but that's tough.  I don't see
any way to make it perfect, so I'll settle for good enough.

-Mike
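The per-task migration cap per unit time could look roughly like the following. This is a user-space sketch with invented names and numbers; the real thing would live in the task struct and use the scheduler clock:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-task state for rate-limiting migrations. */
struct task_mig_state {
	uint64_t window_start_ns;	/* start of current accounting window */
	unsigned int migrations;	/* migrations seen in this window */
};

#define MIG_WINDOW_NS	1000000000ULL	/* 1 second window (assumed) */
#define MIG_CAP		8		/* max migrations per window (assumed) */

/* Return true if a migration is allowed at time now_ns, and account it. */
static bool migration_allowed(struct task_mig_state *ts, uint64_t now_ns)
{
	if (now_ns - ts->window_start_ns >= MIG_WINDOW_NS) {
		ts->window_start_ns = now_ns;	/* new window: reset budget */
		ts->migrations = 0;
	}
	if (ts->migrations >= MIG_CAP)
		return false;			/* cap hit: stay put */
	ts->migrations++;
	return true;
}
```

It is dirt cheap (one subtraction and one compare on the fast path) and, as noted, it limits the damage and the good alike: once the budget is spent the task stays on its CPU until the window rolls over.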



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27  9:55         ` Mike Galbraith
@ 2014-05-27 12:50           ` Libo Chen
  2014-05-27 13:20             ` Mike Galbraith
  2014-05-27 20:53             ` Thomas Gleixner
  0 siblings, 2 replies; 33+ messages in thread
From: Libo Chen @ 2014-05-27 12:50 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On 2014/5/27 17:55, Mike Galbraith wrote:
> On Tue, 2014-05-27 at 15:56 +0800, Libo Chen wrote: 
>> > On 2014/5/26 22:19, Mike Galbraith wrote:
>>> > > On Mon, 2014-05-26 at 20:16 +0800, Libo Chen wrote: 
>>>> > >> On 2014/5/26 13:11, Mike Galbraith wrote:
>>> > > 
>>>>> > >>> Your synthetic test is the absolute worst case scenario.  There has to
>>>>> > >>> be work between wakeups for select_idle_sibling() to have any chance
>>>>> > >>> whatsoever of turning in a win.  At 0 work, it becomes 100% overhead.
>>>> > >>
>>>> > >> not synthetic, it is a real problem in our product. under no load, waste
>>>> > >> much cpu time.
>>> > > 
>>> > > What happens in your product if you apply the commit I pointed out?
>> > 
>> > under no load, cpu usage is up to 60%, but the same apps cost 10% on
>> > susp sp1.  The apps use a lot of timer.
> Something is rotten.  3.14-rt contains that commit, I ran your test with
> 256 threads on 64 core box, saw ~4%.
> 
> Putting master/nopreempt config on box and doing the same test, box is
> chewing up truckloads of CPU, but not from migrations. 
> 
> perf top -g --sort=symbol
in my box:

perf top -g --sort=symbol

Events: 3K cycles
 73.27%  [k] read_hpet
  4.30%  [k] _raw_spin_lock_irqsave
  1.88%  [k] __schedule
  1.00%  [k] idle_cpu
  0.91%  [k] native_write_msr_safe
  0.68%  [k] select_task_rq_fair
  0.51%  [k] module_get_kallsym
  0.49%  [.] sem_post
  0.44%  [.] main
  0.41%  [k] menu_select
  0.39%  [k] _raw_spin_lock
  0.38%  [k] __switch_to
  0.33%  [k] _raw_spin_lock_irq
  0.32%  [k] format_decode
  0.29%  [.] usleep
  0.28%  [.] symbols__insert
  0.27%  [k] tick_nohz_stop_sched_tick
  0.27%  [k] update_stats_wait_end
  0.26%  [k] apic_timer_interrupt
  0.25%  [k] enqueue_entity
  0.25%  [k] sched_clock_local
  0.24%  [k] _raw_spin_unlock_irqrestore
  0.24%  [k] select_idle_sibling
  0.22%  [k] number
  0.22%  [k] kallsyms_expand_symbol
  0.21%  [k] rcu_irq_exit
  0.20%  [k] ktime_get
  0.20%  [k] rb_insert_color
  0.20%  [k] set_next_entity
  0.19%  [k] vsnprintf
  0.19%  [k] try_to_wake_up
  0.18%  [k] __hrtimer_start_range_ns
  0.18%  [k] update_cfs_load
  0.17%  [k] rcu_idle_exit_common
  0.17%  [k] do_nanosleep
  0.17%  [.] __GI___libc_nanosleep
  0.17%  [k] trace_hardirqs_off
  0.16%  [k] irq_exit
  0.16%  [k] timerqueue_add



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27  9:48     ` Peter Zijlstra
  2014-05-27 10:05       ` Mike Galbraith
@ 2014-05-27 12:55       ` Libo Chen
  2014-05-27 13:13         ` Peter Zijlstra
  1 sibling, 1 reply; 33+ messages in thread
From: Libo Chen @ 2014-05-27 12:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Mike Galbraith, tglx, mingo, LKML, Greg KH, Li Zefan

On 2014/5/27 17:48, Peter Zijlstra wrote:
> So: 
> 
> 1) what kind of weird ass workload is that? Why are you waking up so
> often to do no work?

it's just a testcase; I agree it doesn't exist in the real world.

> 
> 2) turning on/off share_pkg_resource is a horrid hack whichever way
> aruond you turn it.
> 
> So I suppose this is due to the select_idle_sibling() nonsense again,
> where we assumes L3 is a fair compromise between cheap enough and
> effective enough.
> 
> Of course, Intel keeps growing the cpu count covered by L3 to ridiculous
> sizes, 8 cores isn't nowhere near their top silly, which shifts the
> balance, and there's always going to be pathological cases (like the
> proposed workload) where its just always going to suck eggs.
> 
> Also, when running 50 such things on a 16 cpu machine, you get roughly 3
> per cpu, since their runtime is stupid low, I would expect it to pretty
> much always hit an idle cpu, which in turn should inhibit the migration.
> 
> Then again, maybe the timer slack is causing you grief, resulting in all
> 3 being woken at the same time, instead of having them staggered.
> 
> In any case, I'm not sure what the 'regression' report is against, as
> there's only a single kernel version mentioned: 3.4, and that's almost a
> dinosaur.

upstream has the same problem, as I mentioned before.
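The timer-slack point quoted above (all three tasks per CPU woken at the same time instead of staggered) can be illustrated with a small sketch; the helper name and numbers are purely illustrative:

```c
/*
 * Illustrative only: with a 100us sleep period and k tasks sharing a
 * CPU, offsetting task i's first sleep by i * period / k staggers the
 * subsequent periodic wakeups instead of letting timer slack batch
 * them onto the same expiry.
 */
static unsigned int initial_offset_us(unsigned int task_idx,
				      unsigned int tasks_per_cpu,
				      unsigned int period_us)
{
	return (task_idx % tasks_per_cpu) * period_us / tasks_per_cpu;
}
```

For 3 tasks per CPU and a 100us period this yields offsets of 0, 33 and 66 microseconds, so the wakeups no longer coincide.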



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27 10:55           ` Mike Galbraith
@ 2014-05-27 12:56             ` Libo Chen
  0 siblings, 0 replies; 33+ messages in thread
From: Libo Chen @ 2014-05-27 12:56 UTC (permalink / raw)
  To: Mike Galbraith, Peter Zijlstra; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan

On 2014/5/27 18:55, Mike Galbraith wrote:
> On Tue, 2014-05-27 at 12:43 +0200, Peter Zijlstra wrote: 
>> On Tue, May 27, 2014 at 12:05:33PM +0200, Mike Galbraith wrote:
>>> On Tue, 2014-05-27 at 11:48 +0200, Peter Zijlstra wrote:
>>>
>>>> So I suppose this is due to the select_idle_sibling() nonsense again,
>>>> where we assumes L3 is a fair compromise between cheap enough and
>>>> effective enough.
>>>
>>> Nodz.
>>>
>>>> Of course, Intel keeps growing the cpu count covered by L3 to ridiculous
>>>> sizes, 8 cores isn't nowhere near their top silly, which shifts the
>>>> balance, and there's always going to be pathological cases (like the
>>>> proposed workload) where its just always going to suck eggs.
>>>
>>> Test is as pathological as it gets.  15 core + SMT wouldn't be pretty.
>>
>> So one thing we could maybe do is measure the cost of
>> select_idle_sibling(), just like we do for idle_balance() and compare
>> this against the tasks avg runtime.
>>
>> We can go all crazy and do reduced searches; like test every n-th cpu in
>> the mask, or make it statistical and do a full search ever n wakeups.
>>
>> Not sure what's a good approach. But L3 spanning more and more CPUs is
>> not something that's going to get cured anytime soon I'm afraid.
>>
>> Not to mention bloody SMT which makes the whole mess worse.
> 
> I think we should keep it dirt simple and above all dirt cheap.  The per
> task migration cap per unit time should meet that bill, limit the damage
> potential, while also limiting the good, but that's tough.  I don't see

agree

> any way to make it perfect, so I'll settle for good enough.
> 
> -Mike
> 
> 
> 
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27 12:55       ` Libo Chen
@ 2014-05-27 13:13         ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-27 13:13 UTC (permalink / raw)
  To: Libo Chen; +Cc: Mike Galbraith, tglx, mingo, LKML, Greg KH, Li Zefan

On Tue, May 27, 2014 at 08:55:20PM +0800, Libo Chen wrote:
> On 2014/5/27 17:48, Peter Zijlstra wrote:

> > In any case, I'm not sure what the 'regression' report is against, as
> > there's only a single kernel version mentioned: 3.4, and that's almost a

> upstream has the same problem, I have mentioned before.

Not on anything that landed in my inbox I think, but that's not the
point. For a regression report you need _2_ kernel versions, one with
and one without the 'problem'.

Providing one (or two) that have a problem doesn't qualify.

In any case, I didn't see the original email, but I got the impression
that it was complaining about 'new' behaviour from the bits I did see as
quoted.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27 12:50           ` Libo Chen
@ 2014-05-27 13:20             ` Mike Galbraith
  2014-05-28  1:04               ` Libo Chen
  2014-05-27 20:53             ` Thomas Gleixner
  1 sibling, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-27 13:20 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz

On Tue, 2014-05-27 at 20:50 +0800, Libo Chen wrote:

> in my box:
> 
> perf top -g --sort=symbol
> 
> Events: 3K cycles
>  73.27%  [k] read_hpet
>   4.30%  [k] _raw_spin_lock_irqsave
>   1.88%  [k] __schedule
>   1.00%  [k] idle_cpu
>   0.91%  [k] native_write_msr_safe
>   0.68%  [k] select_task_rq_fair
>   0.51%  [k] module_get_kallsym
>   0.49%  [.] sem_post
>   0.44%  [.] main
>   0.41%  [k] menu_select
>   0.39%  [k] _raw_spin_lock
>   0.38%  [k] __switch_to
>   0.33%  [k] _raw_spin_lock_irq
>   0.32%  [k] format_decode
>   0.29%  [.] usleep
>   0.28%  [.] symbols__insert
>   0.27%  [k] tick_nohz_stop_sched_tick
>   0.27%  [k] update_stats_wait_end
>   0.26%  [k] apic_timer_interrupt
>   0.25%  [k] enqueue_entity
>   0.25%  [k] sched_clock_local
>   0.24%  [k] _raw_spin_unlock_irqrestore
>   0.24%  [k] select_idle_sibling

read_hpet?  Are you booting box notsc or something?  Migration cost is
the least of your worries.

-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27 12:50           ` Libo Chen
  2014-05-27 13:20             ` Mike Galbraith
@ 2014-05-27 20:53             ` Thomas Gleixner
  2014-05-28  1:06               ` Libo Chen
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Gleixner @ 2014-05-27 20:53 UTC (permalink / raw)
  To: Libo Chen; +Cc: Mike Galbraith, mingo, LKML, Greg KH, Li Zefan, peterz

On Tue, 27 May 2014, Libo Chen wrote:
> On 2014/5/27 17:55, Mike Galbraith wrote:
> > On Tue, 2014-05-27 at 15:56 +0800, Libo Chen wrote: 
> >> > On 2014/5/26 22:19, Mike Galbraith wrote:
> >>> > > On Mon, 2014-05-26 at 20:16 +0800, Libo Chen wrote: 
> >>>> > >> On 2014/5/26 13:11, Mike Galbraith wrote:
> >>> > > 
> >>>>> > >>> Your synthetic test is the absolute worst case scenario.  There has to
> >>>>> > >>> be work between wakeups for select_idle_sibling() to have any chance
> >>>>> > >>> whatsoever of turning in a win.  At 0 work, it becomes 100% overhead.
> >>>> > >>
> >>>> > >> not synthetic, it is a real problem in our product. under no load, waste
> >>>> > >> much cpu time.
> >>> > > 
> >>> > > What happens in your product if you apply the commit I pointed out?
> >> > 
> >> > under no load, cpu usage is up to 60%, but the same apps cost 10% on
> >> > susp sp1.  The apps use a lot of timer.
> > Something is rotten.  3.14-rt contains that commit, I ran your test with
> > 256 threads on 64 core box, saw ~4%.
> > 
> > Putting master/nopreempt config on box and doing the same test, box is
> > chewing up truckloads of CPU, but not from migrations. 
> > 
> > perf top -g --sort=symbol
> in my box:
> 
> perf top -g --sort=symbol
> 
> Events: 3K cycles
>  73.27%  [k] read_hpet

Why is that machine using read_hpet() ?

Please provide the output of 

# dmesg | grep -i tsc

and

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource

and

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27 13:20             ` Mike Galbraith
@ 2014-05-28  1:04               ` Libo Chen
  2014-05-28  1:53                 ` Mike Galbraith
  0 siblings, 1 reply; 33+ messages in thread
From: Libo Chen @ 2014-05-28  1:04 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz, Huang Qiang

On 2014/5/27 21:20, Mike Galbraith wrote:
> On Tue, 2014-05-27 at 20:50 +0800, Libo Chen wrote:
> 
>> in my box:
>>
>> perf top -g --sort=symbol
>>
>> Events: 3K cycles
>>  73.27%  [k] read_hpet
>>   4.30%  [k] _raw_spin_lock_irqsave
>>   1.88%  [k] __schedule
>>   1.00%  [k] idle_cpu
>>   0.91%  [k] native_write_msr_safe
>>   0.68%  [k] select_task_rq_fair
>>   0.51%  [k] module_get_kallsym
>>   0.49%  [.] sem_post
>>   0.44%  [.] main
>>   0.41%  [k] menu_select
>>   0.39%  [k] _raw_spin_lock
>>   0.38%  [k] __switch_to
>>   0.33%  [k] _raw_spin_lock_irq
>>   0.32%  [k] format_decode
>>   0.29%  [.] usleep
>>   0.28%  [.] symbols__insert
>>   0.27%  [k] tick_nohz_stop_sched_tick
>>   0.27%  [k] update_stats_wait_end
>>   0.26%  [k] apic_timer_interrupt
>>   0.25%  [k] enqueue_entity
>>   0.25%  [k] sched_clock_local
>>   0.24%  [k] _raw_spin_unlock_irqrestore
>>   0.24%  [k] select_idle_sibling
> 
> read_hpet?  Are you booting box notsc or something?  Migration cost is
> the least of your worries.

Oh yes, there is no TSC, only HPET, in my box. I don't know why read_hpet
is so hot, but when I bind the tasks to CPUs (three per CPU), the cost
drops rapidly, yet perf shows read_hpet is still hot.

after bind

Events: 561K cycles
 64.18%  [kernel]              [k] read_hpet
  5.51%  usleep                [.] main
  2.71%  [kernel]              [k] __schedule
  1.82%  [kernel]              [k] _raw_spin_lock_irqsave
  1.56%  libc-2.11.3.so        [.] usleep
  1.07%  [kernel]              [k] apic_timer_interrupt
  0.89%  libc-2.11.3.so        [.] __GI___libc_nanosleep
  0.82%  [kernel]              [k] native_write_msr_safe
  0.82%  [kernel]              [k] ktime_get
  0.71%  [kernel]              [k] trace_hardirqs_off
  0.63%  [kernel]              [k] __switch_to
  0.60%  [kernel]              [k] _raw_spin_unlock_irqrestore
  0.47%  [kernel]              [k] menu_select
  0.46%  [kernel]              [k] _raw_spin_lock
  0.45%  [kernel]              [k] enqueue_entity
  0.45%  [kernel]              [k] sched_clock_local
  0.43%  [kernel]              [k] try_to_wake_up
  0.42%  [kernel]              [k] hrtimer_nanosleep
  0.36%  [kernel]              [k] do_nanosleep
  0.35%  [kernel]              [k] _raw_spin_lock_irq
  0.34%  [kernel]              [k] rb_insert_color
  0.29%  [kernel]              [k] update_curr
  0.29%  [kernel]              [k] native_sched_clock
  0.28%  [kernel]              [k] hrtimer_interrupt
  0.28%  [kernel]              [k] rcu_idle_exit_common
  0.27%  [kernel]              [k] hrtimer_init
  0.27%  [kernel]              [k] __hrtimer_start_range_ns
  0.26%  [kernel]              [k] __rb_erase_color
  0.26%  [kernel]              [k] lock_hrtimer_base
  0.25%  [kernel]              [k] trace_hardirqs_on
  0.23%  [kernel]              [k] rcu_idle_enter_common
  0.23%  [kernel]              [k] cpuidle_idle_call
  0.23%  [kernel]              [k] finish_task_switch
  0.22%  [kernel]              [k] set_next_entity
  0.22%  [kernel]              [k] cpuacct_charge
  0.22%  [kernel]              [k] pick_next_task_fair
  0.21%  [kernel]              [k] sys_nanosleep
  0.20%  [kernel]              [k] rb_next
  0.20%  [kernel]              [k] start_critical_timings
> 
> -Mike
> 
> 
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-27 20:53             ` Thomas Gleixner
@ 2014-05-28  1:06               ` Libo Chen
  0 siblings, 0 replies; 33+ messages in thread
From: Libo Chen @ 2014-05-28  1:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, mingo, LKML, Greg KH, Li Zefan, peterz, Huang Qiang

On 2014/5/28 4:53, Thomas Gleixner wrote:
> On Tue, 27 May 2014, Libo Chen wrote:
>> On 2014/5/27 17:55, Mike Galbraith wrote:
>>> On Tue, 2014-05-27 at 15:56 +0800, Libo Chen wrote: 
>>>>> On 2014/5/26 22:19, Mike Galbraith wrote:
>>>>>>> On Mon, 2014-05-26 at 20:16 +0800, Libo Chen wrote: 
>>>>>>>>> On 2014/5/26 13:11, Mike Galbraith wrote:
>>>>>>>
>>>>>>>>>>> Your synthetic test is the absolute worst case scenario.  There has to
>>>>>>>>>>> be work between wakeups for select_idle_sibling() to have any chance
>>>>>>>>>>> whatsoever of turning in a win.  At 0 work, it becomes 100% overhead.
>>>>>>>>>
>>>>>>>>> not synthetic, it is a real problem in our product. under no load, waste
>>>>>>>>> much cpu time.
>>>>>>>
>>>>>>> What happens in your product if you apply the commit I pointed out?
>>>>>
>>>>> under no load, cpu usage is up to 60%, but the same apps cost 10% on
>>>>> susp sp1.  The apps use a lot of timer.
>>> Something is rotten.  3.14-rt contains that commit, I ran your test with
>>> 256 threads on 64 core box, saw ~4%.
>>>
>>> Putting master/nopreempt config on box and doing the same test, box is
>>> chewing up truckloads of CPU, but not from migrations. 
>>>
>>> perf top -g --sort=symbol
>> in my box:
>>
>> perf top -g --sort=symbol
>>
>> Events: 3K cycles
>>  73.27%  [k] read_hpet
> 
> Why is that machine using read_hpet() ?
> 
> Please provide the output of 
> 
> # dmesg | grep -i tsc
> 

Euler:/home # dmesg  | grep -i tsc
[    0.000000] Fast TSC calibration using PIT
[    0.226921] TSC synchronization [CPU#0 -> CPU#1]:
[    0.227142] Measured 1053728 cycles TSC warp between CPUs, turning off TSC clock.
[    0.008000] Marking TSC unstable due to check_tsc_sync_source failed

> and
> 
> # cat /sys/devices/system/clocksource/clocksource0/available_clocksource

hpet acpi_pm

> 
> and
> 
> # cat /sys/devices/system/clocksource/clocksource0/current_clocksource

hpet

> 
> Thanks,
> 
> 	tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> .
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28  1:04               ` Libo Chen
@ 2014-05-28  1:53                 ` Mike Galbraith
  2014-05-28  6:54                   ` Libo Chen
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-28  1:53 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz, Huang Qiang

On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:

> oh yes, no tsc only hpet in my box.

Making poor E5-2658 box a crippled wreck.

-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28  1:53                 ` Mike Galbraith
@ 2014-05-28  6:54                   ` Libo Chen
  2014-05-28  8:16                     ` Mike Galbraith
  2014-05-28  9:08                     ` Thomas Gleixner
  0 siblings, 2 replies; 33+ messages in thread
From: Libo Chen @ 2014-05-28  6:54 UTC (permalink / raw)
  To: Mike Galbraith; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz, Huang Qiang

On 2014/5/28 9:53, Mike Galbraith wrote:
> On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
> 
>> oh yes, no tsc only hpet in my box.
> 
> Making poor E5-2658 box a crippled wreck.

Yes, it is. But CPU usage drops from 15% to 5% when binding to CPUs, so
maybe read_hpet is not the root cause.

thanks,
Libo

> 
> -Mike
> 
> 
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28  6:54                   ` Libo Chen
@ 2014-05-28  8:16                     ` Mike Galbraith
  2014-05-28  9:08                     ` Thomas Gleixner
  1 sibling, 0 replies; 33+ messages in thread
From: Mike Galbraith @ 2014-05-28  8:16 UTC (permalink / raw)
  To: Libo Chen; +Cc: tglx, mingo, LKML, Greg KH, Li Zefan, peterz, Huang Qiang

On Wed, 2014-05-28 at 14:54 +0800, Libo Chen wrote: 
> On 2014/5/28 9:53, Mike Galbraith wrote:
> > On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
> > 
> >> oh yes, no tsc only hpet in my box.
> > 
> > Making poor E5-2658 box a crippled wreck.
> 
> yes,it is. But cpu usage will be down from 15% to 5% when binding cpu, so maybe read_hpet
> is not the root cause.

I don't think anyone will be particularly interested in making kernel
changes based upon the behavior of a broken box.  The problem we were
discussing is real enough though, it's just a question of severity.

-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28  6:54                   ` Libo Chen
  2014-05-28  8:16                     ` Mike Galbraith
@ 2014-05-28  9:08                     ` Thomas Gleixner
  2014-05-28 10:30                       ` Peter Zijlstra
                                         ` (2 more replies)
  1 sibling, 3 replies; 33+ messages in thread
From: Thomas Gleixner @ 2014-05-28  9:08 UTC (permalink / raw)
  To: Libo Chen
  Cc: Mike Galbraith, mingo, LKML, Greg KH, Li Zefan, peterz, Huang Qiang

On Wed, 28 May 2014, Libo Chen wrote:

> On 2014/5/28 9:53, Mike Galbraith wrote:
> > On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
> > 
> >> oh yes, no tsc only hpet in my box.
> > 
> > Making poor E5-2658 box a crippled wreck.
> 
> yes,it is. But cpu usage will be down from 15% to 5% when binding
> cpu, so maybe read_hpet is not the root cause.

Definitely hpet _IS_ the root cause on a machine as large as this,
simply because everything gets serialized on the hpet access.

Binding stuff to cpus just makes the timing behaviour different, so
the hpet serialization is not that prominent, but still bad enough.

Talk to your HW/BIOS vendor. The kernel cannot do anything about
defunct hardware.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28  9:08                     ` Thomas Gleixner
@ 2014-05-28 10:30                       ` Peter Zijlstra
  2014-05-28 10:52                         ` Borislav Petkov
  2014-05-28 11:43                       ` Libo Chen
  2014-05-29  7:57                       ` Libo Chen
  2 siblings, 1 reply; 33+ messages in thread
From: Peter Zijlstra @ 2014-05-28 10:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Libo Chen, Mike Galbraith, mingo, LKML, Greg KH, Li Zefan,
	Huang Qiang, bp

[-- Attachment #1: Type: text/plain, Size: 3758 bytes --]

On Wed, May 28, 2014 at 11:08:40AM +0200, Thomas Gleixner wrote:
> On Wed, 28 May 2014, Libo Chen wrote:
> 
> > On 2014/5/28 9:53, Mike Galbraith wrote:
> > > On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
> > > 
> > >> oh yes, no tsc only hpet in my box.
> > > 
> > > Making poor E5-2658 box a crippled wreck.
> > 
> > yes,it is. But cpu usage will be down from 15% to 5% when binding
> > cpu, so maybe read_hpet is not the root cause.
> 
> Definitely hpet _IS_ the root cause on a machine as large as this,
> simply because everything gets serialized on the hpet access.
> 
> Binding stuff to cpus just makes the timing behaviour different, so
> the hpet serialization is not that prominent, but still bad enough.
> 
> Talk to your HW/BIOS vendor. The kernel cannot do anything about
> defunct hardware.

---
Subject: x86: FW_BUG when the TSC goes funny on hardware where it really should be stable

It happens far too often on 'consumer' grade hardware, and sometimes on
'enterprise' too that the TSC gets marked unstable due to FW fuckage,
complain more loudly in this case.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/include/asm/tsc.h  | 1 +
 arch/x86/kernel/cpu/amd.c   | 4 +++-
 arch/x86/kernel/cpu/intel.c | 4 +++-
 arch/x86/kernel/tsc.c       | 7 +++++++
 4 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tsc.h b/arch/x86/include/asm/tsc.h
index 94605c0e9cee..e33853ee0416 100644
--- a/arch/x86/include/asm/tsc.h
+++ b/arch/x86/include/asm/tsc.h
@@ -52,6 +52,7 @@ extern int check_tsc_unstable(void);
 extern int check_tsc_disabled(void);
 extern unsigned long native_calibrate_tsc(void);
 
+extern int tsc_should_be_reliable;
 extern int tsc_clocksource_reliable;
 
 /*
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index ce8b8ff0e0ef..46012d2ca5a1 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -483,8 +483,10 @@ static void early_init_amd(struct cpuinfo_x86 *c)
 	if (c->x86_power & (1 << 8)) {
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
-		if (!check_tsc_unstable())
+		if (!check_tsc_unstable()) {
+			tsc_should_be_reliable = 1;
 			set_sched_clock_stable();
+		}
 	}
 
 #ifdef CONFIG_X86_64
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index a80029035bf2..2273ca1166bc 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -88,8 +88,10 @@ static void early_init_intel(struct cpuinfo_x86 *c)
 	if (c->x86_power & (1 << 8)) {
 		set_cpu_cap(c, X86_FEATURE_CONSTANT_TSC);
 		set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC);
-		if (!check_tsc_unstable())
+		if (!check_tsc_unstable()) {
+			tsc_should_be_reliable = 1;
 			set_sched_clock_stable();
+		}
 	}
 
 	/* Penwell and Cloverview have the TSC which doesn't sleep on S3 */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 57e5ce126d5a..1f93827561d8 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -40,6 +40,7 @@ static int __read_mostly tsc_disabled = -1;
 
 static struct static_key __use_tsc = STATIC_KEY_INIT;
 
+int tsc_should_be_reliable;
 int tsc_clocksource_reliable;
 
 /*
@@ -994,6 +995,12 @@ void mark_tsc_unstable(char *reason)
 		clear_sched_clock_stable();
 		disable_sched_clock_irqtime();
 		pr_info("Marking TSC unstable due to %s\n", reason);
+
+		if (tsc_should_be_reliable) {
+			pr_err(FW_BUG "TSC unstable even though it should be; "
+				      "HW/BIOS broken, contact your vendor.\n");
+		}
+
 		/* Change only the rating, when not registered */
 		if (clocksource_tsc.mult)
 			clocksource_mark_unstable(&clocksource_tsc);

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28 10:30                       ` Peter Zijlstra
@ 2014-05-28 10:52                         ` Borislav Petkov
  0 siblings, 0 replies; 33+ messages in thread
From: Borislav Petkov @ 2014-05-28 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Libo Chen, Mike Galbraith, mingo, LKML, Greg KH,
	Li Zefan, Huang Qiang

On Wed, May 28, 2014 at 12:30:19PM +0200, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 11:08:40AM +0200, Thomas Gleixner wrote:
> > On Wed, 28 May 2014, Libo Chen wrote:
> > 
> > > On 2014/5/28 9:53, Mike Galbraith wrote:
> > > > On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
> > > > 
> > > >> oh yes, no tsc only hpet in my box.
> > > > 
> > > > Making poor E5-2658 box a crippled wreck.
> > > 
> > > yes,it is. But cpu usage will be down from 15% to 5% when binding
> > > cpu, so maybe read_hpet is not the root cause.
> > 
> > Definitely hpet _IS_ the root cause on a machine as large as this,
> > simply because everything gets serialized on the hpet access.
> > 
> > Binding stuff to cpus just makes the timing behaviour different, so
> > the hpet serialization is not that prominent, but still bad enough.
> > 
> > Talk to your HW/BIOS vendor. The kernel cannot do anything about
> > defunct hardware.
> 
> ---
> Subject: x86: FW_BUG when the TSC goes funny on hardware where it really should be stable
> 
> It happens far too often on 'consumer' grade hardware, and sometimes on
> 'enterprise' too that the TSC gets marked unstable due to FW fuckage,
> complain more loudly in this case.
> 
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28  9:08                     ` Thomas Gleixner
  2014-05-28 10:30                       ` Peter Zijlstra
@ 2014-05-28 11:43                       ` Libo Chen
  2014-05-28 11:55                         ` Mike Galbraith
  2014-05-29  7:57                       ` Libo Chen
  2 siblings, 1 reply; 33+ messages in thread
From: Libo Chen @ 2014-05-28 11:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, mingo, LKML, Greg KH, Li Zefan, peterz,
	Huang Qiang, Peter Zijlstra, Borislav Petkov, Greg KH

On 2014/5/28 17:08, Thomas Gleixner wrote:
> On Wed, 28 May 2014, Libo Chen wrote:
> 
>> On 2014/5/28 9:53, Mike Galbraith wrote:
>>> On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
>>>
>>>> Oh yes, there is no TSC, only HPET, in my box.
>>>
>>> Making poor E5-2658 box a crippled wreck.
>>
>> Yes, it is. But CPU usage drops from 15% to 5% when binding to
>> CPUs, so maybe read_hpet is not the root cause.
> 
> Definitely hpet _IS_ the root cause on a machine as large as this,
> simply because everything gets serialized on the hpet access.
> 
> Binding stuff to cpus just makes the timing behaviour different, so
> the hpet serialization is not that prominent, but still bad enough.
> 
> Talk to your HW/BIOS vendor. The kernel cannot do anything about
> defunct hardware.

Thank you for your reply, but SUSE SP2 performs very well in this same
scenario. Could there have been a bug that the community fixed later,
so that the good result there is just a coincidence?

Libo

> 
> Thanks,
> 
> 	tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28 11:43                       ` Libo Chen
@ 2014-05-28 11:55                         ` Mike Galbraith
  2014-05-29  7:58                           ` Libo Chen
  0 siblings, 1 reply; 33+ messages in thread
From: Mike Galbraith @ 2014-05-28 11:55 UTC (permalink / raw)
  To: Libo Chen
  Cc: Thomas Gleixner, mingo, LKML, Greg KH, Li Zefan, peterz,
	Huang Qiang, Borislav Petkov

On Wed, 2014-05-28 at 19:43 +0800, Libo Chen wrote: 
> On 2014/5/28 17:08, Thomas Gleixner wrote:
> > On Wed, 28 May 2014, Libo Chen wrote:
> > 
> >> On 2014/5/28 9:53, Mike Galbraith wrote:
> >>> On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
> >>>
> >>>> Oh yes, there is no TSC, only HPET, in my box.
> >>>
> >>> Making poor E5-2658 box a crippled wreck.
> >>
> >> Yes, it is. But CPU usage drops from 15% to 5% when binding to
> >> CPUs, so maybe read_hpet is not the root cause.
> > 
> > Definitely hpet _IS_ the root cause on a machine as large as this,
> > simply because everything gets serialized on the hpet access.
> > 
> > Binding stuff to cpus just makes the timing behaviour different, so
> > the hpet serialization is not that prominent, but still bad enough.
> > 
> > Talk to your HW/BIOS vendor. The kernel cannot do anything about
> > defunct hardware.
> 
> Thank you for your reply, but SUSE SP2 performs very well in this same
> scenario. Could there have been a bug that the community fixed later,
> so that the good result there is just a coincidence?

I'm quite sure it's because of the patches I mentioned.

-Mike


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: balance storm
  2014-05-28  9:08                     ` Thomas Gleixner
  2014-05-28 10:30                       ` Peter Zijlstra
  2014-05-28 11:43                       ` Libo Chen
@ 2014-05-29  7:57                       ` Libo Chen
  2 siblings, 0 replies; 33+ messages in thread
From: Libo Chen @ 2014-05-29  7:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mike Galbraith, mingo, LKML, Greg KH, Li Zefan, peterz, Huang Qiang

On 2014/5/28 17:08, Thomas Gleixner wrote:
> On Wed, 28 May 2014, Libo Chen wrote:
> 
>> On 2014/5/28 9:53, Mike Galbraith wrote:
>>> On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
>>>
>>>> Oh yes, there is no TSC, only HPET, in my box.
>>>
>>> Making poor E5-2658 box a crippled wreck.
>>
>> Yes, it is. But CPU usage drops from 15% to 5% when binding to
>> CPUs, so maybe read_hpet is not the root cause.
> 
> Definitely hpet _IS_ the root cause on a machine as large as this,
> simply because everything gets serialized on the hpet access.
> 
> Binding stuff to cpus just makes the timing behaviour different, so
> the hpet serialization is not that prominent, but still bad enough.
> 
> Talk to your HW/BIOS vendor. The kernel cannot do anything about
> defunct hardware.

I got it!

thanks,
Libo

> 
> Thanks,
> 
> 	tglx



^ permalink raw reply	[flat|nested] 33+ messages in thread
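(Aside, not part of the original thread: a quick way to confirm the situation Libo describes, "no tsc only hpet", is to read the standard clocksource sysfs files. The paths below are the kernel's standard ones on any recent Linux; the example outputs in the comments are what a box like Libo's would be expected to show, not captured from his machine.)

```shell
# Which clocksource did the kernel actually select, and what else is usable?
cs=/sys/devices/system/clocksource/clocksource0
cat "$cs/current_clocksource"    # on an affected box: "hpet" instead of "tsc"
cat "$cs/available_clocksource"  # "tsc" absent here means it was rejected
# dmesg | grep -i clocksource    # usually records why the TSC was dropped
```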

* Re: balance storm
  2014-05-28 11:55                         ` Mike Galbraith
@ 2014-05-29  7:58                           ` Libo Chen
  0 siblings, 0 replies; 33+ messages in thread
From: Libo Chen @ 2014-05-29  7:58 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Thomas Gleixner, mingo, LKML, Greg KH, Li Zefan, peterz,
	Huang Qiang, Borislav Petkov

On 2014/5/28 19:55, Mike Galbraith wrote:
> On Wed, 2014-05-28 at 19:43 +0800, Libo Chen wrote: 
>> On 2014/5/28 17:08, Thomas Gleixner wrote:
>>> On Wed, 28 May 2014, Libo Chen wrote:
>>>
>>>> On 2014/5/28 9:53, Mike Galbraith wrote:
>>>>> On Wed, 2014-05-28 at 09:04 +0800, Libo Chen wrote:
>>>>>
>>>>>> Oh yes, there is no TSC, only HPET, in my box.
>>>>>
>>>>> Making poor E5-2658 box a crippled wreck.
>>>>
> >>>> Yes, it is. But CPU usage drops from 15% to 5% when binding to
> >>>> CPUs, so maybe read_hpet is not the root cause.
>>>
>>> Definitely hpet _IS_ the root cause on a machine as large as this,
>>> simply because everything gets serialized on the hpet access.
>>>
>>> Binding stuff to cpus just makes the timing behaviour different, so
>>> the hpet serialization is not that prominent, but still bad enough.
>>>
>>> Talk to your HW/BIOS vendor. The kernel cannot do anything about
>>> defunct hardware.
>>
>> Thank you for your reply, but SUSE SP2 performs very well in this same
>> scenario. Could there have been a bug that the community fixed later,
>> so that the good result there is just a coincidence?
> 
> I'm quite sure it's because of the patches I mentioned.

I see.


thanks,
Libo

> 
> -Mike
> 
> 
> 



^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2014-05-29  7:59 UTC | newest]

Thread overview: 33+ messages
2014-05-26  3:04 balance storm Libo Chen
2014-05-26  5:11 ` Mike Galbraith
2014-05-26 12:16   ` Libo Chen
2014-05-26 14:19     ` Mike Galbraith
2014-05-27  7:56       ` Libo Chen
2014-05-27  9:55         ` Mike Galbraith
2014-05-27 12:50           ` Libo Chen
2014-05-27 13:20             ` Mike Galbraith
2014-05-28  1:04               ` Libo Chen
2014-05-28  1:53                 ` Mike Galbraith
2014-05-28  6:54                   ` Libo Chen
2014-05-28  8:16                     ` Mike Galbraith
2014-05-28  9:08                     ` Thomas Gleixner
2014-05-28 10:30                       ` Peter Zijlstra
2014-05-28 10:52                         ` Borislav Petkov
2014-05-28 11:43                       ` Libo Chen
2014-05-28 11:55                         ` Mike Galbraith
2014-05-29  7:58                           ` Libo Chen
2014-05-29  7:57                       ` Libo Chen
2014-05-27 20:53             ` Thomas Gleixner
2014-05-28  1:06               ` Libo Chen
2014-05-26  7:56 ` Mike Galbraith
2014-05-26 11:49   ` Libo Chen
2014-05-26 14:03     ` Mike Galbraith
2014-05-27  7:44       ` Libo Chen
2014-05-27  8:12         ` Mike Galbraith
2014-05-27  9:48     ` Peter Zijlstra
2014-05-27 10:05       ` Mike Galbraith
2014-05-27 10:43         ` Peter Zijlstra
2014-05-27 10:55           ` Mike Galbraith
2014-05-27 12:56             ` Libo Chen
2014-05-27 12:55       ` Libo Chen
2014-05-27 13:13         ` Peter Zijlstra
