[PATCH] Wait to remove active timer when rescheduling hrtimer

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] Wait to remove active timer when rescheduling hrtimer
@ 2011-01-21 17:52 Vernon Mauery
  2011-01-21 21:32 ` Thomas Gleixner
  0 siblings, 1 reply; 4+ messages in thread
From: Vernon Mauery @ 2011-01-21 17:52 UTC (permalink / raw)
  To: linux-kernel; +Cc: Thomas Gleixner, Vernon Mauery

In the current hrtimer implementation, when explicitly
removing a timer, the hrtimer_cancel function waits
until the timer is not currently running before actually
calling remove_hrtimer.  But in the __hrtimer_start_range_ns
function (called by hrtimer_start, among others), it
calls remove_hrtimer without checking to see if the
timer is currently running.  This doesn't seem to cause
much trouble in the vanilla kernel, but in the -RT
kernel, this causes a silent hang when exercising the
sch_htb module on a 10GbE interface.

Signed-off-by: Vernon Mauery <vernux@us.ibm.com>
---
 kernel/hrtimer.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 0c8d7c0..c7e8ba0 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -935,6 +935,11 @@ int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
 	base = lock_hrtimer_base(timer, &flags);

 	/* Remove an active timer from the queue: */
+	while (hrtimer_callback_running(timer)) {
+		unlock_hrtimer_base(timer, &flags);
+		hrtimer_wait_for_timer(timer);
+		base = lock_hrtimer_base(timer, &flags);
+	}
 	ret = remove_hrtimer(timer, base);

 	/* Switch the timer base, if necessary: */
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] Wait to remove active timer when rescheduling hrtimer
  2011-01-21 17:52 [PATCH] Wait to remove active timer when rescheduling hrtimer Vernon Mauery
@ 2011-01-21 21:32 ` Thomas Gleixner
  2011-01-21 23:06   ` Vernon Mauery
  0 siblings, 1 reply; 4+ messages in thread
From: Thomas Gleixner @ 2011-01-21 21:32 UTC (permalink / raw)
  To: Vernon Mauery; +Cc: linux-kernel



On Fri, 21 Jan 2011, Vernon Mauery wrote:

> In the current hrtimer implementation, when explicitly
> removing a timer, the hrtimer_cancel function waits
> until the timer is not currently running before actually
> calling remove_hrtimer.  But in the __hrtimer_start_range_ns
> function (called by hrtimer_start, among others), it
> calls remove_hrtimer without checking to see if the
> timer is currently running.  This doesn't seem to cause
> much trouble in the vanilla kernel, but in the -RT
> kernel, this causes a silent hang when exercising the
> sch_htb module on a 10GbE interface.
> 
> Signed-off-by: Vernon Mauery <vernux@us.ibm.com>
> ---
>  kernel/hrtimer.c |    5 +++++
>  1 files changed, 5 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index 0c8d7c0..c7e8ba0 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -935,6 +935,11 @@ int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
>  	base = lock_hrtimer_base(timer, &flags);
>  
>  	/* Remove an active timer from the queue: */
> +	while (hrtimer_callback_running(timer)) {
> +		unlock_hrtimer_base(timer, &flags);
> +		hrtimer_wait_for_timer(timer);
> +		base = lock_hrtimer_base(timer, &flags);
> +	}
>  	ret = remove_hrtimer(timer, base);
>  
>  	/* Switch the timer base, if necessary: */

Hmm, why does it hang ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] Wait to remove active timer when rescheduling hrtimer
  2011-01-21 21:32 ` Thomas Gleixner
@ 2011-01-21 23:06   ` Vernon Mauery
  2011-01-22  9:43     ` Thomas Gleixner
  0 siblings, 1 reply; 4+ messages in thread
From: Vernon Mauery @ 2011-01-21 23:06 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel

On Fri, Jan 21, 2011 at 1:32 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>
> On Fri, 21 Jan 2011, Vernon Mauery wrote:
>
>> In the current hrtimer implementation, when explicitly
>> removing a timer, the hrtimer_cancel function waits
>> until the timer is not currently running before actually
>> calling remove_hrtimer.  But in the __hrtimer_start_range_ns
>> function (called by hrtimer_start, among others), it
>> calls remove_hrtimer without checking to see if the
>> timer is currently running.  This doesn't seem to cause
>> much trouble in the vanilla kernel, but in the -RT
>> kernel, this causes a silent hang when exercising the
>> sch_htb module on a 10GbE interface.
>>
>> Signed-off-by: Vernon Mauery <vernux@us.ibm.com>
>> ---
>>  kernel/hrtimer.c |    5 +++++
>>  1 files changed, 5 insertions(+), 0 deletions(-)
>>
>> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
>> index 0c8d7c0..c7e8ba0 100644
>> --- a/kernel/hrtimer.c
>> +++ b/kernel/hrtimer.c
>> @@ -935,6 +935,11 @@ int __hrtimer_start_range_ns(struct hrtimer *timer, ktime_t tim,
>>       base = lock_hrtimer_base(timer, &flags);
>>
>>       /* Remove an active timer from the queue: */
>> +     while (hrtimer_callback_running(timer)) {
>> +             unlock_hrtimer_base(timer, &flags);
>> +             hrtimer_wait_for_timer(timer);
>> +             base = lock_hrtimer_base(timer, &flags);
>> +     }
>>       ret = remove_hrtimer(timer, base);
>>
>>       /* Switch the timer base, if necessary: */
>
> Hmm, why does it hang ?

It hangs when I set up QoS on a 10GbE interface and run traffic over
it.  Here is what I do:
#!/bin/bash

if [ -z "$1" ]; then
        ETH=eth2
else
        ETH="$1"
fi
IP=`ip addr show $ETH | grep 'inet\>' | sed 's/.*inet \([^/]*\)\/.*/\1/'`
IP=`echo $IP | sed 's/\([^.]*\)\.\([^.]*\)\.\([^.]*\)\.\([^.]*\)/\1.\2.\3.95/'`

SPEED=`ethtool $ETH | grep Speed | sed 's/[^0-9]*\([0-9]*\).*/\1/'`
case $SPEED in
   10000) ZEROS=00 ;;
    1000) ZEROS=0 ;;
 default) ZEROS='' ;;
esac

tc qdisc del dev $ETH root >&/dev/null || :
[ "$2" == "stop" ] && exit 0
set -x
tc qdisc add dev $ETH root handle 1: htb default 30 r2q 600$ZEROS
tc class add dev $ETH parent 1: classid 1:1 htb rate 30${ZEROS}mbit
tc class add dev $ETH parent 1:1 classid 1:10 htb rate 5${ZEROS}mbit prio 1
tc class add dev $ETH parent 1:1 classid 1:20 htb rate 5${ZEROS}mbit prio 2
tc class add dev $ETH parent 1:1 classid 1:30 htb rate 8${ZEROS}mbit
set +x

echo "netperf -l 2000 -H 10.1.1.94 -t UDP_STREAM -- -m 65505"
----------------------8< snip 8<-------------------------------

After running the script, I run a netperf command similar to the one
that the script prints, that runs traffic over the interface that we
just set up QoS on.  Within 10-15 seconds, the machine goes silent.
(this is on a 2.6.33.7-rt29 kernel)

My guess would be that without this patch, it is possible to have a
timer currently in the tree, not delete it and then schedule it again.
 Not on this machine or on this kernel, I saw a similar problem
(caused by the same thing: using sch_htb on -rt kernel), but instead
of a silent hang, it gave me an oops that looked like this:

Unable to handle kernel NULL pointer dereference at 0000000000000010
 RIP: [<ffffffff8113b38c>] rb_erase+0x1f3/0x2b1
PGD 14e150067 PUD 142d34067 PMD 0
Oops: 0000 [1] PREEMPT SMP
CPU 2
Modules linked in: sch_htb pktgen nfs nfsd lockd nfs_acl auth_rpcgss exportfs
ipmi_devintf ipmi_si ipmi_msghandler ibm_rtl ipv6 autofs4 i2c_dev i2c_core hidp
rfcomm l2cap bluetooth sunrpc dm_mirror dm_multipath scsi_dh dm_mod video
output sbs sbshc battery ac parport_pc lp parport sg bnx2 button netxen_nic
serio_raw amd64_edac edac_core pcspkr shpchp mptsas mptscsih mptbase
scsi_transport_sas sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd
Pid: 38, comm: sirq-hrtimer/2 Not tainted 2.6.24.7-139.el5rt #1
RIP: 0010:[<ffffffff8113b38c>]  [<ffffffff8113b38c>] rb_erase+0x1f3/0x2b1
RSP: 0018:ffff81014f16fe50  EFLAGS: 00010082
RAX: 0000000000000000 RBX: ffff81014640bac8 RCX: ffff810001085780
RDX: 0000000000000000 RSI: ffff8100010076a8 RDI: 0000000000000000
RBP: ffff81014f16fe60 R08: ffff81033f15dac8 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000000 R12: ffff8100010076a8
R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000080
FS:  00007ff9960016e0(0000) GS:ffff81014fc09cc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000010 CR3: 000000014e188000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process sirq-hrtimer/2 (pid: 38, threadinfo ffff81014f16e000, task
ffff81014f16c300)
Stack:  ffff81013a5e67d0 ffff810001007698 ffff81014f16fe90 ffffffff81054dfc
 ffffffff81227401 ffff81013a5e67d0 ffff810001085640 0000000000000002
 ffff81014f16fec0 ffffffff81055cbb 0000000000000002 ffffffff815005e8
Call Trace:
 [<ffffffff81054dfc>] __remove_hrtimer+0x6e/0x7b
 [<ffffffff81227401>] ? qdisc_watchdog+0x0/0x23
 [<ffffffff81055cbb>] run_hrtimer_softirq+0x7a/0x14e
 [<ffffffff81043d26>] ksoftirqd+0x16a/0x26f
 [<ffffffff81043bbc>] ? ksoftirqd+0x0/0x26f
 [<ffffffff81043bbc>] ? ksoftirqd+0x0/0x26f
 [<ffffffff8105261c>] kthread+0x49/0x79
 [<ffffffff8100d088>] child_rip+0xa/0x12
 [<ffffffff810525d3>] ? kthread+0x0/0x79
 [<ffffffff8100d07e>] ? child_rip+0x0/0x12


My most common experience is to just experience a silent hang, which
allows me to do very little in the way of debugging -- unless I catch
it before it goes completely silent, not even sysrq works.  I have
managed to crash the machine on occasion, but the crash kernel never
seems to get very far before it dies off too... Very odd behavior.  I
tried to capture a sysrq+T, but it only gave me the names of the
processes, not the backtraces, and then the per CPU info.  I didn't
see anything that stuck out in the per-CPU info, but right after the
dump, the machine went completely silent again.

Sample from sysrq+T:

sshd          S ffff880417e65ea8     0  3621   3204 0x00000080

Call Trace:
bash          S ffff880417dd0200     0  3636   3621 0x00000080

Call Trace:
sshd          S 0000000000000000     0  3637   3204 0x00000080

Call Trace:
bash          S ffff880417ddc440     0  3662   3637 0x00000080

Call Trace:
top           S ffff88002838af48     0  3686   3662 0x00000080

Call Trace:
netperf       R  running task        0  3708   3636 0x00000080

Call Trace:
Sched Debug Version: v0.09, 2.6.33.7-rt29.35.el5rt #1
now at 134835.593881 msecs
  .jiffies                                 : 4294787328
  .sysctl_sched_latency                    : 20.000000
  .sysctl_sched_min_granularity            : 4.000000
  .sysctl_sched_wakeup_granularity         : 4.000000
  .sysctl_sched_child_runs_first           : 0.000000
  .sysctl_sched_features                   : 7917179
  .sysctl_sched_tunable_scaling            : 1 (logaritmic)

cpu#0, 2833.945 MHz
  .nr_running                    : 1
  .load                          : 0
  .nr_switches                   : 1030470
  .nr_load_updates               : 120033
  .nr_uninterruptible            : 141
  .next_balance                  : 4294.787429
  .curr->pid                     : 3708
  .clock                         : 122477.541494
  .cpu_load[0]                   : 0
  .cpu_load[1]                   : 4
  .cpu_load[2]                   : 56
  .cpu_load[3]                   : 122
  .cpu_load[4]                   : 168
  .rt.rt_nr_running              : 1
  .rt.rt_nr_uninterruptible      : 26920
  .rto_schedule                  : 534494
  .rto_schedule_tail             : 0
  .rto_wakeup                    : 0
  .rto_pulled                    : 0
  .rto_pushed                    : 0
  .yld_count                     : 0
  .sched_switch                  : 0
  .sched_count                   : 1070624
  .sched_goidle                  : 279923
  .avg_idle                      : 93170
  .ttwu_count                    : 640746
  .ttwu_local                    : 613469
  .bkl_count                     : 299

cfs_rq[0]:
  .exec_clock                    : 4267.411040
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 11634.164614
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000
  .spread0                       : 0.000000
  .nr_running                    : 0
  .load                          : 0
  .nr_spread_over                : 12978

rt_rq[0]:
  .rt_nr_running                 : 1
  .rt_throttled                  : 0
  .rt_time                       : 7.163839
  .rt_runtime                    : 850.000000

runnable tasks:
            task   PID         tree-key  switches  prio
exec-runtime         sum-exec        sum-sleep
----------------------------------------------------------------------------------------------------------
R        netperf  3708         0.000000    206988     9
0.000000      3237.342882      2155.844073



Any ideas on this?  The patch I sent fixes the problem.  The idea came
from a proposed patch a long time ago that just added a hrtimer_cancel
call just before the hrtimer_start call in the sch_api.c watchdog
code.  I figured if the explicit cancel before the start fixed the
problem, there was something that the cancel did that the start alone
didn't.  What I found was that the cancel waits if the timer is
currently running, while start just brute-force cancels the timer.

--Vernon

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] Wait to remove active timer when rescheduling hrtimer
  2011-01-21 23:06   ` Vernon Mauery
@ 2011-01-22  9:43     ` Thomas Gleixner
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Gleixner @ 2011-01-22  9:43 UTC (permalink / raw)
  To: Vernon Mauery; +Cc: linux-kernel

On Fri, 21 Jan 2011, Vernon Mauery wrote:
> > Hmm, why does it hang ?
> 
> 
> After running the script, I run a netperf command similar to the one
> that the script prints, that runs traffic over the interface that we
> just set up QoS on.  Within 10-15 seconds, the machine goes silent.
> (this is on a 2.6.33.7-rt29 kernel)
> 
> My guess would be that without this patch, it is possible to have a
> timer currently in the tree, not delete it and then schedule it again.
>  Not on this machine or on this kernel, I saw a similar problem
> (caused by the same thing: using sch_htb on -rt kernel), but instead
> of a silent hang, it gave me an oops that looked like this:
>
>  [<ffffffff81054dfc>] __remove_hrtimer+0x6e/0x7b
>  [<ffffffff81227401>] ? qdisc_watchdog+0x0/0x23
>  [<ffffffff81055cbb>] run_hrtimer_softirq+0x7a/0x14e
>  [<ffffffff81043d26>] ksoftirqd+0x16a/0x26f
>  [<ffffffff81043bbc>] ? ksoftirqd+0x0/0x26f
>  [<ffffffff81043bbc>] ? ksoftirqd+0x0/0x26f
>  [<ffffffff8105261c>] kthread+0x49/0x79
>  [<ffffffff8100d088>] child_rip+0xa/0x12
>  [<ffffffff810525d3>] ? kthread+0x0/0x79
>  [<ffffffff8100d07e>] ? child_rip+0x0/0x12

 
> Any ideas on this?  The patch I sent fixes the problem.  The idea came
> from a proposed patch a long time ago that just added a hrtimer_cancel
> call just before the hrtimer_start call in the sch_api.c watchdog
> code.  I figured if the explicit cancel before the start fixed the
> problem, there was something that the cancel did that the start alone
> didn't.  What I found was that the cancel waits if the timer is
> currently running, while start just brute-force cancels the timer.

Hmm. That's weird. The remove in start() and in the softirq code is
protected by the base lock. So whoever comes first, removes the
timer. So if the timer call back runs then the remove in start() will
be a noop. And if the timer is removed right before the softirq wants
to run the callback then the softirq wont see it anymore.

So I fear while your patch makes the hang/oops go away it is papering
over the real bug. Can you try to reproduce with function tracing (add
timer events as well) enabled ? If yes, set ftrace_dump_on_oops. So we
should see the history which led to this problem.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2011-01-22  9:43 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-21 17:52 [PATCH] Wait to remove active timer when rescheduling hrtimer Vernon Mauery
2011-01-21 21:32 ` Thomas Gleixner
2011-01-21 23:06   ` Vernon Mauery
2011-01-22  9:43     ` Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.