* perf on 2.6.38-rc4 wedges my box
@ 2011-02-09 17:38 Jeff Moyer
  2011-02-09 17:47 ` David Ahern
  2011-02-10 21:38 ` Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Jeff Moyer @ 2011-02-09 17:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Peter Zijlstra, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

Hi,

I'm trying out willy's ata_ram driver[1], and in so doing have managed to
wedge my box while using perf record on an aio-stress run:

[root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
[root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0
write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
[root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16
/dev/sds
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0

and there it sits.  On the console, I see:

NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100
NOHZ: local_softirq_pending 100

The number of messages varies, but this is the most I've seen (it
doesn't keep repeating).  At this point, the machine does not respond to
pings.  As I don't have physical access at the moment, I can't try
alt-sysrq, but might be able to do that tomorrow.  It's probably worth
noting that I've witnessed similar behavior with real devices, so it's
not just the ata_ram driver.

Any ideas on how to track this down?

Thanks!

Jeff

[1] http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=refs/heads/ata-ram

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-09 17:38 perf on 2.6.38-rc4 wedges my box Jeff Moyer
@ 2011-02-09 17:47 ` David Ahern
  2011-02-09 18:22   ` Jeff Moyer
  2011-02-10 21:38 ` Peter Zijlstra
  1 sibling, 1 reply; 11+ messages in thread
From: David Ahern @ 2011-02-09 17:47 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo



On 02/09/11 10:38, Jeff Moyer wrote:
> Hi,
> 
> I'm trying out willy's ata_ram driver[1], and in so doing have managed to
> wedge my box while using perf record on an aio-stress run:
> 
> [root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
> [root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
> write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
> thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
> [root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16
> /dev/sds

Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
profiling? Add -v to see if the fallback to S/W is happening now.

David


> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
> 
> and there it sits.  On the console, I see:
> 
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> 
> The number of messages varies, but this is the most I've seen (it
> doesn't keep repeating).  At this point, the machine does not respond to
> pings.  As I don't have physical access at the moment, I can't try
> alt-sysrq, but might be able to do that tomorrow.  It's probably worth
> noting that I've witnessed similar behavior with real devices, so it's
> not just the ata_ram driver.
> 
> Any ideas on how to track this down?
> 
> Thanks!
> 
> Jeff
> 
> [1] http://git.kernel.org/?p=linux/kernel/git/willy/misc.git;a=shortlog;h=refs/heads/ata-ram
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-09 17:47 ` David Ahern
@ 2011-02-09 18:22   ` Jeff Moyer
  2011-02-09 20:12     ` David Ahern
  0 siblings, 1 reply; 11+ messages in thread
From: Jeff Moyer @ 2011-02-09 18:22 UTC (permalink / raw)
  To: David Ahern
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo

David Ahern <dsahern@gmail.com> writes:

> Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
> profiling? Add -v to see if the fallback to S/W is happening now.

Thanks for the suggestion, David.  I tried:

# perf record -v ls
  Warning:  ... trying to fall back to cpu-clock-ticks

couldn't open /proc/-1/status
couldn't open /proc/-1/maps
[ls output]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.008 MB perf.data (~363 samples) ]

If I explicitly set '-e cpu-clock', then the output is the same,
except that the warning is gone.  What's up with the /proc/-1/*?

Now, when running perf record -e cpu-clock on the aio-stress run,
unsurprisingly, I get the same result:

# perf record -e cpu-clock -v -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
couldn't open /proc/-1/status
couldn't open /proc/-1/maps
adding stage write
starting with write
file size 1024MB, record size 4KB, depth 32, ios per iteration 8
max io_submit 16, buffer alignment set to 4KB
threads 1 files 1 contexts 1 context offset 2MB verification off
adding file /dev/sds thread 0

and there it sits.  In this case, however, I did not see the NOHZ
warnings on the console, and this time the machine is still responding
to ping, but nothing else.

Cheers,
Jeff


* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-09 18:22   ` Jeff Moyer
@ 2011-02-09 20:12     ` David Ahern
  2011-02-09 22:11       ` Arnaldo Carvalho de Melo
  0 siblings, 1 reply; 11+ messages in thread
From: David Ahern @ 2011-02-09 20:12 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo



On 02/09/11 11:22, Jeff Moyer wrote:
> David Ahern <dsahern@gmail.com> writes:
> 
>> Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
>> profiling? Add -v to see if the fallback to S/W is happening now.
> 
> Thanks for the suggestion, David.  I tried:
> 
> # perf record -v ls
>   Warning:  ... trying to fall back to cpu-clock-ticks
> 
> couldn't open /proc/-1/status
> couldn't open /proc/-1/maps
> [ls output]
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.008 MB perf.data (~363 samples) ]
> 
> If I explicitly set '-e cpu-clock', then the output is the same,
> except that the warning is gone.  What's up with the /proc/-1/*?

target_{pid,tid} are initialized to -1 in builtin-record.c. I believe the
tid version is making its way through the event__synthesize_xxx code
(event__synthesize_thread -> __event__synthesize_thread ->
event__synthesize_comm and event__synthesize_mmap_events).

> 
> Now, when running perf record -e cpu-clock on the aio-stress run,
> unsurprisingly, I get the same result:
> 
> # perf record -e cpu-clock -v -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> couldn't open /proc/-1/status
> couldn't open /proc/-1/maps
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
> 
> and there it sits.  In this case, however, I did not see the NOHZ
> warnings on the console, and this time the machine is still responding
> to ping, but nothing else.

cpu-clock is handled through hrtimers if that helps understand the lockup.

David

> 
> Cheers,
> Jeff


* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-09 20:12     ` David Ahern
@ 2011-02-09 22:11       ` Arnaldo Carvalho de Melo
  0 siblings, 0 replies; 11+ messages in thread
From: Arnaldo Carvalho de Melo @ 2011-02-09 22:11 UTC (permalink / raw)
  To: David Ahern
  Cc: Jeff Moyer, linux-kernel, Peter Zijlstra, Ingo Molnar, Paul Mackerras

Em Wed, Feb 09, 2011 at 01:12:22PM -0700, David Ahern escreveu:
> 
> 
> On 02/09/11 11:22, Jeff Moyer wrote:
> > David Ahern <dsahern@gmail.com> writes:
> > 
> >> Have you tried '-e cpu-clock' for S/W based profiling vs the default H/W
> >> profiling? Add -v to see if the fallback to S/W is happening now.
> > 
> > Thanks for the suggestion, David.  I tried:
> > 
> > # perf record -v ls
> >   Warning:  ... trying to fall back to cpu-clock-ticks
> > 
> > couldn't open /proc/-1/status
> > couldn't open /proc/-1/maps
> > [ls output]
> > [ perf record: Woken up 1 times to write data ]
> > [ perf record: Captured and wrote 0.008 MB perf.data (~363 samples) ]
> > 
> > If I explicitly set '-e cpu-clock', then the output is the same,
> > except that the warning is gone.  What's up with the /proc/-1/*?
> 
> target_{pid,tid} are initialized to -1 in builtin-record.c. I believe the
> tid version is making its way through the event__synthesize_xxx code
> (event__synthesize_thread -> __event__synthesize_thread ->
> event__synthesize_comm and event__synthesize_mmap_events).

Yes, I'm working on a patch, probably just this will fix it:

diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 07f8d6d..dd27b9f 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -680,7 +680,7 @@ static int __cmd_record(int argc, const char **argv)
 					       perf_event__synthesize_guest_os);
 
 	if (!system_wide)
-		perf_event__synthesize_thread(target_tid,
+		perf_event__synthesize_thread(evsel_list->threads->map[0],
 					      process_synthesized_event,
 					      session);
 	else
 
---

So that it gets the child_pid or the tid passed via --tid.

But thinking about it more, I think the correct fix is to pass the
thread_map evsel_list->threads, which will cover both the case of just
one thread (--tid) and that of a group of threads (--pid, when the
thread_map will have more than one tid).

I'll get a patch done later.

> > 
> > Now, when running perf record -e cpu-clock on the aio-stress run,
> > unsurprisingly, I get the same result:
> > 
> > # perf record -e cpu-clock -v -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> > couldn't open /proc/-1/status
> > couldn't open /proc/-1/maps
> > adding stage write
> > starting with write
> > file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> > max io_submit 16, buffer alignment set to 4KB
> > threads 1 files 1 contexts 1 context offset 2MB verification off
> > adding file /dev/sds thread 0
> > 
> > and there it sits.  In this case, however, I did not see the NOHZ
> > warnings on the console, and this time the machine is still responding
> > to ping, but nothing else.
> 
> cpu-clock is handled through hrtimers if that helps understand the lockup.
> 
> David
> 
> > 
> > Cheers,
> > Jeff


* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-09 17:38 perf on 2.6.38-rc4 wedges my box Jeff Moyer
  2011-02-09 17:47 ` David Ahern
@ 2011-02-10 21:38 ` Peter Zijlstra
  2011-02-11 16:35   ` David Ahern
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2011-02-10 21:38 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo

On Wed, 2011-02-09 at 12:38 -0500, Jeff Moyer wrote:
> Hi,
> 
> I'm trying out willy's ata_ram driver[1], and in so doing have managed to
> wedge my box while using perf record on an aio-stress run:
> 
> [root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
> [root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
> write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
> thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
> [root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16
> /dev/sds
> adding stage write
> starting with write
> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
> max io_submit 16, buffer alignment set to 4KB
> threads 1 files 1 contexts 1 context offset 2MB verification off
> adding file /dev/sds thread 0
> 
> and there it sits.  On the console, I see:
> 
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> NOHZ: local_softirq_pending 100
> 
> The number of messages varies, but this is the most I've seen (it
> doesn't keep repeating).  At this point, the machine does not respond to
> pings.  As I don't have physical access at the moment, I can't try
> alt-sysrq, but might be able to do that tomorrow.  It's probably worth
> noting that I've witnessed similar behavior with real devices, so it's
> not just the ata_ram driver.
> 
> Any ideas on how to track this down?

So I took linus' tree of about half an hour ago, added
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git ata-ram
(fixed up some Kconfig/Makefile rejects), google'd aio-stress
(http://fsbench.filesystems.org/bench/aio-stress.c) and set out to
reproduce the above..

Sadly it all seems to work here; it's spending ~15% in
_raw_spin_lock_irq, which, when I use -g, breaks down like:

-     14.13%  aio-stress  [kernel.kallsyms]  [k] _raw_spin_lock_irq 
   - _raw_spin_lock_irq                                             
      + 44.14% __make_request                                       
      + 20.91% __aio_get_req                                        
      + 10.15% aio_run_iocb                                         
      + 7.37% do_io_submit                                          
      + 6.55% scsi_request_fn                                       
      + 5.48% generic_unplug_device                                 
      + 3.58% aio_put_req                                           
      + 0.92% generic_make_request                                  
      + 0.91% __generic_unplug_device





* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-10 21:38 ` Peter Zijlstra
@ 2011-02-11 16:35   ` David Ahern
  2011-02-11 17:53     ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: David Ahern @ 2011-02-11 16:35 UTC (permalink / raw)
  To: Peter Zijlstra, Jeff Moyer
  Cc: linux-kernel, Ingo Molnar, Paul Mackerras, Arnaldo Carvalho de Melo



On 02/10/11 14:38, Peter Zijlstra wrote:
> On Wed, 2011-02-09 at 12:38 -0500, Jeff Moyer wrote:
>> Hi,
>>
>> I'm trying out willy's ata_ram driver[1], and in so doing have managed to
>> wedge my box while using perf record on an aio-stress run:
>>
>> [root@metallica ~]# modprobe ata_ram capacity=2097152 preallocate=1
>> [root@metallica ~]# ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sds
>> adding stage write
>> starting with write
>> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
>> max io_submit 16, buffer alignment set to 4KB
>> threads 1 files 1 contexts 1 context offset 2MB verification off
>> adding file /dev/sds thread 0
>> write on /dev/sds (621.30 MB/s) 1024.00 MB in 1.65s
>> thread 0 write totals (621.27 MB/s) 1024.00 MB in 1.65s
>> [root@metallica ~]# perf record -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16
>> /dev/sds
>> adding stage write
>> starting with write
>> file size 1024MB, record size 4KB, depth 32, ios per iteration 8
>> max io_submit 16, buffer alignment set to 4KB
>> threads 1 files 1 contexts 1 context offset 2MB verification off
>> adding file /dev/sds thread 0
>>
>> and there it sits.  On the console, I see:
>>
>> NOHZ: local_softirq_pending 100
>> NOHZ: local_softirq_pending 100
>> NOHZ: local_softirq_pending 100
>> NOHZ: local_softirq_pending 100
>> NOHZ: local_softirq_pending 100
>>
>> The number of messages varies, but this is the most I've seen (it
>> doesn't keep repeating).  At this point, the machine does not respond to
>> pings.  As I don't have physical access at the moment, I can't try
>> alt-sysrq, but might be able to do that tomorrow.  It's probably worth
>> noting that I've witnessed similar behavior with real devices, so it's
>> not just the ata_ram driver.
>>
>> Any ideas on how to track this down?
> 
> So I took linus' tree of about half an hour ago, added
> git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git ata-ram
> (fixed up some Kconfig/Makefile rejects), google'd aio-stress
> (http://fsbench.filesystems.org/bench/aio-stress.c) and set out to
> reproduce the above..
> 
> Sadly it all seems to work here; it's spending ~15% in
> _raw_spin_lock_irq, which, when I use -g, breaks down like:
> 
> -     14.13%  aio-stress  [kernel.kallsyms]  [k] _raw_spin_lock_irq 
>    - _raw_spin_lock_irq                                             
>       + 44.14% __make_request                                       
>       + 20.91% __aio_get_req                                        
>       + 10.15% aio_run_iocb                                         
>       + 7.37% do_io_submit                                          
>       + 6.55% scsi_request_fn                                       
>       + 5.48% generic_unplug_device                                 
>       + 3.58% aio_put_req                                           
>       + 0.92% generic_make_request                                  
>       + 0.91% __generic_unplug_device

I'm guessing in your case perf is using hardware cycles for profiling.

I was able to reproduce the lockup in a VM which uses cpu-clock for
profiling - like Jeff's case. The VM is running Fedora 14 with 2.6.38-rc4.

In the host, one of qemu-kvm's threads is pegging the CPU. A backtrace
of the vcpus is given below. stop-bt-cont at various intervals shows
similar traces, and xtime is not advancing; only the innermost function
changes (e.g., from __rb_rotate_left to __rb_rotate_right).

David



[Switching to Thread 1]
(gdb) bt
#0  native_safe_halt () at
/opt/kernel/src/arch/x86/include/asm/irqflags.h:50
#1  0xffffffff81009d76 in arch_safe_halt () at
/opt/kernel/src/arch/x86/include/asm/paravirt.h:110
#2  default_idle () at /opt/kernel/src/arch/x86/kernel/process.c:381
#3  0xffffffff8100132a in cpu_idle () at
/opt/kernel/src/arch/x86/kernel/process_64.c:139
#4  0xffffffff81392c89 in rest_init () at /opt/kernel/src/init/main.c:463
#5  0xffffffff81697c23 in start_kernel () at /opt/kernel/src/init/main.c:713
#6  0xffffffff816972af in x86_64_start_reservations
(real_mode_data=<value optimized out>) at
/opt/kernel/src/arch/x86/kernel/head64.c:127
#7  0xffffffff816973b9 in x86_64_start_kernel (real_mode_data=0x93950
<Address 0x93950 out of bounds>) at
/opt/kernel/src/arch/x86/kernel/head64.c:97
#8  0x0000000000000000 in ?? ()

(gdb) thread 2
[Switching to thread 2 (Thread 2)]#0  __rb_rotate_left
(node=0xffff8800782dd8f0, root=0xffff88007fd0e358) at
/opt/kernel/src/lib/rbtree.c:37

(gdb) bt
#0  __rb_rotate_left (node=0xffff8800782dd8f0, root=0xffff88007fd0e358)
at /opt/kernel/src/lib/rbtree.c:37
#1  0xffffffff81206bcc in rb_insert_color (node=0xffff8800782dd8f0,
root=0xffff88007fd0e358) at /opt/kernel/src/lib/rbtree.c:130
#2  0xffffffff81207cb5 in timerqueue_add (head=0xffff88007fd0e358,
node=0xffff8800782dd8f0) at /opt/kernel/src/lib/timerqueue.c:56
#3  0xffffffff8105f595 in enqueue_hrtimer (timer=0xffff8800782dd8f0,
base=0xffff88007fd0e348) at /opt/kernel/src/kernel/hrtimer.c:848
#4  0xffffffff8105fc75 in __hrtimer_start_range_ns
(timer=0xffff8800782dd8f0, tim=..., delta_ns=0, mode=<value optimized
out>, wakeup=0)
    at /opt/kernel/src/kernel/hrtimer.c:961
#5  0xffffffff810b662e in perf_swevent_start_hrtimer
(event=0xffff8800782dd800) at /opt/kernel/src/kernel/perf_event.c:5092
#6  0xffffffff810b669c in cpu_clock_event_start (event=<value optimized
out>, flags=<value optimized out>) at
/opt/kernel/src/kernel/perf_event.c:5126
#7  0xffffffff810ba515 in perf_ctx_adjust_freq (ctx=0xffff880079930600,
period=999848) at /opt/kernel/src/kernel/perf_event.c:1726
#8  0xffffffff810ba690 in perf_rotate_context () at
/opt/kernel/src/kernel/perf_event.c:1787
#9  perf_event_task_tick () at /opt/kernel/src/kernel/perf_event.c:1821
#10 0xffffffff8103d8d8 in scheduler_tick () at
/opt/kernel/src/kernel/sched.c:3784
#11 0xffffffff8104effe in update_process_times (user_tick=0) at
/opt/kernel/src/kernel/timer.c:1274
#12 0xffffffff81069587 in tick_sched_timer (timer=0xffff88007fd0e3f0) at
/opt/kernel/src/kernel/time/tick-sched.c:760
#13 0xffffffff8105f75d in __run_hrtimer (timer=0xffff88007fd0e3f0,
now=0xffff88007fd03f48) at /opt/kernel/src/kernel/hrtimer.c:1197
#14 0xffffffff8105ff5e in hrtimer_interrupt (dev=<value optimized out>)
at /opt/kernel/src/kernel/hrtimer.c:1283
#15 0xffffffff813acb4e in local_apic_timer_interrupt (regs=<value
optimized out>) at /opt/kernel/src/arch/x86/kernel/apic/apic.c:844
#16 smp_apic_timer_interrupt (regs=<value optimized out>) at
/opt/kernel/src/arch/x86/kernel/apic/apic.c:871
#17 <signal handler called>


> 
> 
> 


* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-11 16:35   ` David Ahern
@ 2011-02-11 17:53     ` Peter Zijlstra
  2011-02-11 18:23       ` Peter Zijlstra
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Zijlstra @ 2011-02-11 17:53 UTC (permalink / raw)
  To: David Ahern
  Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo

On Fri, 2011-02-11 at 09:35 -0700, David Ahern wrote:
> I'm guessing in your case perf is using hardware cycles for profiling.
> 
> I was able to reproduce the lockup in a VM which uses cpu-clock for
> profiling - like Jeff's case. The VM is running Fedora 14 with
> 2.6.38-rc4.
> 
Ah, indeed, when I use:

  perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb

things did come apart, something like the below cured that problem (but
did show the pending softirq thing and triggered something iffy in the
backtrace code -- will have to stare at those still)


---
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a353a4d..36fb410 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5123,6 +5123,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state < PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5174,7 +5178,7 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
-		hrtimer_cancel(&hwc->hrtimer);
+		hrtimer_try_to_cancel(&hwc->hrtimer);
 	}
 }
 




* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-11 17:53     ` Peter Zijlstra
@ 2011-02-11 18:23       ` Peter Zijlstra
  2011-02-11 18:47         ` David Ahern
  2011-02-16 13:50         ` [tip:perf/core] perf: Optimize hrtimer events tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Zijlstra @ 2011-02-11 18:23 UTC (permalink / raw)
  To: David Ahern
  Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo

On Fri, 2011-02-11 at 18:53 +0100, Peter Zijlstra wrote:
> On Fri, 2011-02-11 at 09:35 -0700, David Ahern wrote:
> > I'm guessing in your case perf is using hardware cycles for profiling.
> > 
> > I was able to reproduce the lockup in a VM which uses cpu-clock for
> > profiling - like Jeff's case. The VM is running Fedora 14 with
> > 2.6.38-rc4.
> > 
> Ah, indeed, when I use:
> 
>   perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb
> 
> things did come apart, something like the below cured that problem (but
> did show the pending softirq thing and triggered something iffy in the
> backtrace code -- will have to stare at those still)

So while this doesn't explain these weird things, it should have at
least one race fewer -- hrtimer_init() on a possibly still-running timer
didn't seem like a very good idea. Also, since hrtimers tick in
nanoseconds, the whole freq adjustment seemed unnecessary.

---
 kernel/perf_event.c |   37 +++++++++++++++++++++++++++++++++----
 1 files changed, 33 insertions(+), 4 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 999835b..08428b3 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5051,6 +5051,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state < PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5077,9 +5081,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hwc->hrtimer.function = perf_swevent_hrtimer;
-
 	period = local64_read(&hwc->period_left);
 	if (period) {
 		if (period < 0)
@@ -5102,7 +5103,31 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
 		local64_set(&hwc->period_left, ktime_to_ns(remaining));
 
-		hrtimer_cancel(&hwc->hrtimer);
+		hrtimer_try_to_cancel(&hwc->hrtimer);
+	}
+}
+
+static void perf_swevent_init_hrtimer(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (!is_sampling_event(event))
+		return;
+
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hwc->hrtimer.function = perf_swevent_hrtimer;
+
+	/*
+	 * Since hrtimers have a fixed rate, we can do a static freq->period
+	 * mapping and avoid the whole period adjust feedback stuff.
+	 */
+	if (event->attr.freq) {
+		long freq = event->attr.sample_freq;
+
+		event->attr.sample_period = NSEC_PER_SEC / freq;
+		hwc->sample_period = event->attr.sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+		event->attr.freq = 0;
 	}
 }
 
@@ -5158,6 +5183,8 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
 
@@ -5235,6 +5262,8 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
 




* Re: perf on 2.6.38-rc4 wedges my box
  2011-02-11 18:23       ` Peter Zijlstra
@ 2011-02-11 18:47         ` David Ahern
  2011-02-16 13:50         ` [tip:perf/core] perf: Optimize hrtimer events tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: David Ahern @ 2011-02-11 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jeff Moyer, linux-kernel, Ingo Molnar, Paul Mackerras,
	Arnaldo Carvalho de Melo



On 02/11/11 11:23, Peter Zijlstra wrote:
> On Fri, 2011-02-11 at 18:53 +0100, Peter Zijlstra wrote:
>> On Fri, 2011-02-11 at 09:35 -0700, David Ahern wrote:
>>> I'm guessing in your case perf is using hardware cycles for profiling.
>>>
>>> I was able to reproduce the lockup in a VM which uses cpu-clock for
>>> profiling - like Jeff's case. The VM is running Fedora 14 with
>>> 2.6.38-rc4.
>>>
>> Ah, indeed, when I use:
>>
>>   perf record -gfe task-clock -- ./aio-stress -O -o 0 -r 4 -d 32 -b 16 /dev/sdb
>>
>> things did come apart, something like the below cured that problem (but
>> did show the pending softirq thing and triggered something iffy in the
>> backtrace code -- will have to stare at those still)
> 
> So while this doesn't explain these weird things, it should have at
> least one race fewer -- hrtimer_init() on a possibly still-running timer
> didn't seem like a very good idea. Also, since hrtimers tick in
> nanoseconds, the whole freq adjustment seemed unnecessary.

This solved the lockup with my repro on 2.6.38-rc4.

I did try 2.6.37 earlier today (without this patch), and it locked up
pretty quickly as well.

David

> 
> ---
>  kernel/perf_event.c |   37 +++++++++++++++++++++++++++++++++----
>  1 files changed, 33 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 999835b..08428b3 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -5051,6 +5051,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
>  	u64 period;
>  
>  	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
> +
> +	if (event->state < PERF_EVENT_STATE_ACTIVE)
> +		return HRTIMER_NORESTART;
> +
>  	event->pmu->read(event);
>  
>  	perf_sample_data_init(&data, 0);
> @@ -5077,9 +5081,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
>  	if (!is_sampling_event(event))
>  		return;
>  
> -	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> -	hwc->hrtimer.function = perf_swevent_hrtimer;
> -
>  	period = local64_read(&hwc->period_left);
>  	if (period) {
>  		if (period < 0)
> @@ -5102,7 +5103,31 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
>  		ktime_t remaining = hrtimer_get_remaining(&hwc->hrtimer);
>  		local64_set(&hwc->period_left, ktime_to_ns(remaining));
>  
> -		hrtimer_cancel(&hwc->hrtimer);
> +		hrtimer_try_to_cancel(&hwc->hrtimer);
> +	}
> +}
> +
> +static void perf_swevent_init_hrtimer(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	if (!is_sampling_event(event))
> +		return;
> +
> +	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	hwc->hrtimer.function = perf_swevent_hrtimer;
> +
> +	/*
> +	 * Since hrtimers have a fixed rate, we can do a static freq->period
> +	 * mapping and avoid the whole period adjust feedback stuff.
> +	 */
> +	if (event->attr.freq) {
> +		long freq = event->attr.sample_freq;
> +
> +		event->attr.sample_period = NSEC_PER_SEC / freq;
> +		hwc->sample_period = event->attr.sample_period;
> +		local64_set(&hwc->period_left, hwc->sample_period);
> +		event->attr.freq = 0;
>  	}
>  }
>  
> @@ -5158,6 +5183,8 @@ static int cpu_clock_event_init(struct perf_event *event)
>  	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
>  		return -ENOENT;
>  
> +	perf_swevent_init_hrtimer(event);
> +
>  	return 0;
>  }
>  
> @@ -5235,6 +5262,8 @@ static int task_clock_event_init(struct perf_event *event)
>  	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
>  		return -ENOENT;
>  
> +	perf_swevent_init_hrtimer(event);
> +
>  	return 0;
>  }
>  
> 
> 


* [tip:perf/core] perf: Optimize hrtimer events
  2011-02-11 18:23       ` Peter Zijlstra
  2011-02-11 18:47         ` David Ahern
@ 2011-02-16 13:50         ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 11+ messages in thread
From: tip-bot for Peter Zijlstra @ 2011-02-16 13:50 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: linux-kernel, hpa, mingo, a.p.zijlstra, tglx, mingo

Commit-ID:  ba3dd36c6775264ee6e7354ba1aabcd6e86d7298
Gitweb:     http://git.kernel.org/tip/ba3dd36c6775264ee6e7354ba1aabcd6e86d7298
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 15 Feb 2011 12:41:46 +0100
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Wed, 16 Feb 2011 13:30:57 +0100

perf: Optimize hrtimer events

There is no need to re-initialize the hrtimer every time we start it,
so don't do that (shaves a few cycles). Also, since we know hrtimers
run at a fixed rate (nanoseconds) we can pre-compute the desired
frequency at which they tick. This avoids us having to go through the
whole adaptive frequency feedback logic (shaves another few cycles).

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1297448589.5226.47.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/perf_event.c |   35 ++++++++++++++++++++++++++++++++---
 1 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index e03be08..a0a6987 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -5602,6 +5602,10 @@ static enum hrtimer_restart perf_swevent_hrtimer(struct hrtimer *hrtimer)
 	u64 period;
 
 	event = container_of(hrtimer, struct perf_event, hw.hrtimer);
+
+	if (event->state != PERF_EVENT_STATE_ACTIVE)
+		return HRTIMER_NORESTART;
+
 	event->pmu->read(event);
 
 	perf_sample_data_init(&data, 0);
@@ -5628,9 +5632,6 @@ static void perf_swevent_start_hrtimer(struct perf_event *event)
 	if (!is_sampling_event(event))
 		return;
 
-	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hwc->hrtimer.function = perf_swevent_hrtimer;
-
 	period = local64_read(&hwc->period_left);
 	if (period) {
 		if (period < 0)
@@ -5657,6 +5658,30 @@ static void perf_swevent_cancel_hrtimer(struct perf_event *event)
 	}
 }
 
+static void perf_swevent_init_hrtimer(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	if (!is_sampling_event(event))
+		return;
+
+	hrtimer_init(&hwc->hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hwc->hrtimer.function = perf_swevent_hrtimer;
+
+	/*
+	 * Since hrtimers have a fixed rate, we can do a static freq->period
+	 * mapping and avoid the whole period adjust feedback stuff.
+	 */
+	if (event->attr.freq) {
+		long freq = event->attr.sample_freq;
+
+		event->attr.sample_period = NSEC_PER_SEC / freq;
+		hwc->sample_period = event->attr.sample_period;
+		local64_set(&hwc->period_left, hwc->sample_period);
+		event->attr.freq = 0;
+	}
+}
+
 /*
  * Software event: cpu wall time clock
  */
@@ -5709,6 +5734,8 @@ static int cpu_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_CPU_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
 
@@ -5787,6 +5814,8 @@ static int task_clock_event_init(struct perf_event *event)
 	if (event->attr.config != PERF_COUNT_SW_TASK_CLOCK)
 		return -ENOENT;
 
+	perf_swevent_init_hrtimer(event);
+
 	return 0;
 }
 


end of thread, other threads:[~2011-02-16 13:50 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
2011-02-09 17:38 perf on 2.6.38-rc4 wedges my box Jeff Moyer
2011-02-09 17:47 ` David Ahern
2011-02-09 18:22   ` Jeff Moyer
2011-02-09 20:12     ` David Ahern
2011-02-09 22:11       ` Arnaldo Carvalho de Melo
2011-02-10 21:38 ` Peter Zijlstra
2011-02-11 16:35   ` David Ahern
2011-02-11 17:53     ` Peter Zijlstra
2011-02-11 18:23       ` Peter Zijlstra
2011-02-11 18:47         ` David Ahern
2011-02-16 13:50         ` [tip:perf/core] perf: Optimize hrtimer events tip-bot for Peter Zijlstra
