* rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-01-27 16:03 UTC (permalink / raw)
To: xen-devel, Wei Liu, Ian Campbell
Hi,
While I'm working on support for 64K pages in netfront, I got
an rcu_sched self-detected stall message. It happens when netback is
disabling the vif device due to an error.

I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
the processor is stuck in xenvif_rx_queue_purge?
Here is the log:
vif vif-20-0 vif20.0: txreq.offset: 3410, size: 342, end: 1382
vif vif-20-0 vif20.0: fatal error; disabling device
INFO: rcu_sched self-detected stall on CPU { 1} (t=2101 jiffies g=37266 c=37265 q=2649)
Task dump for CPU 1:
vif20.0-q0-gues R running task 0 12617 2 0x00000002
Call trace:
[<ffff800000089038>] dump_backtrace+0x0/0x124
[<ffff80000008916c>] show_stack+0x10/0x1c
[<ffff8000000bb1b4>] sched_show_task+0x98/0xf8
[<ffff8000000bdca0>] dump_cpu_task+0x3c/0x4c
[<ffff8000000d9fd4>] rcu_dump_cpu_stacks+0xa4/0xf8
[<ffff8000000dd1a8>] rcu_check_callbacks+0x478/0x748
[<ffff8000000e0b20>] update_process_times+0x38/0x6c
[<ffff8000000eedd0>] tick_sched_timer+0x64/0x1b4
[<ffff8000000e10a8>] __run_hrtimer+0x88/0x234
[<ffff8000000e19c0>] hrtimer_interrupt+0x108/0x2b0
[<ffff8000005934c4>] arch_timer_handler_virt+0x28/0x38
[<ffff8000000d5e88>] handle_percpu_devid_irq+0x88/0x11c
[<ffff8000000d1ec0>] generic_handle_irq+0x30/0x4c
[<ffff8000000d21dc>] __handle_domain_irq+0x5c/0xac
[<ffff8000000823b8>] gic_handle_irq+0x30/0x80
Exception stack(0xffff800013a07c20 to 0xffff800013a07d40)
7c20: 058ed000 ffff0000 058ed9d8 ffff0000 13a07d60 ffff8000 0053c418 ffff8000
7c40: 00000000 00000000 0000ecf2 00000000 058ed9ec ffff0000 00000000 00000000
7c60: 00000001 00000000 00000000 00000000 00001800 00000000 feacbe9d 0000060d
7c80: 1ce5d6e0 ffff8000 13a07a90 ffff8000 00000400 00000000 ffffffff ffffffff
7ca0: 0013d000 00000000 00000090 00000000 00000040 00000000 9a272028 0000ffff
7cc0: 00099e64 ffff8000 00411010 00000000 df8fbb70 0000ffff 058ed000 ffff0000
7ce0: 058ed9d8 ffff0000 058ed000 ffff0000 058ed988 ffff0000 00956000 ffff8000
7d00: 19204840 ffff8000 000c75f8 ffff8000 13a04000 ffff8000 008a0598 ffff8000
7d20: 00000000 00000000 13a07d60 ffff8000 0053c3bc ffff8000 13a07d60 ffff8000
[<ffff8000000854e4>] el1_irq+0x64/0xc0
[<ffff80000053c448>] xenvif_rx_queue_purge+0x1c/0x30
[<ffff80000053ea34>] xenvif_kthread_guest_rx+0x210/0x29c
[<ffff8000000b1060>] kthread+0xd8/0xf0
Regards,
--
Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: Wei Liu @ 2015-01-27 16:45 UTC (permalink / raw)
To: Julien Grall; +Cc: Wei Liu, Ian Campbell, xen-devel
On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
> Hi,
>
> While I'm working on support for 64K pages in netfront, I got
> an rcu_sched self-detected stall message. It happens when netback is
> disabling the vif device due to an error.
>
> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
> the processor is stuck in xenvif_rx_queue_purge?
>
When you try to release an SKB, the core network driver needs to enter
an RCU critical region to clean up. dst_release, for one, calls call_rcu.
Wei.
* Re: rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-01-27 16:47 UTC (permalink / raw)
To: Wei Liu; +Cc: Ian Campbell, xen-devel
On 27/01/15 16:45, Wei Liu wrote:
> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>> Hi,
>>
>> While I'm working on support for 64K pages in netfront, I got
>> an rcu_sched self-detected stall message. It happens when netback is
>> disabling the vif device due to an error.
>>
>> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
>> the processor is stuck in xenvif_rx_queue_purge?
>>
>
> When you try to release an SKB, the core network driver needs to enter
> an RCU critical region to clean up. dst_release, for one, calls call_rcu.
But this message shouldn't happen under normal conditions or because of
netfront. Right?
Regards,
--
Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: Wei Liu @ 2015-01-27 16:53 UTC (permalink / raw)
To: Julien Grall; +Cc: Wei Liu, Ian Campbell, xen-devel
On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
> On 27/01/15 16:45, Wei Liu wrote:
> > On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
> >> Hi,
> >>
> >> While I'm working on support for 64K pages in netfront, I got
> >> an rcu_sched self-detected stall message. It happens when netback is
> >> disabling the vif device due to an error.
> >>
> >> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
> >> the processor is stuck in xenvif_rx_queue_purge?
> >>
> >
> > When you try to release an SKB, the core network driver needs to enter
> > an RCU critical region to clean up. dst_release, for one, calls call_rcu.
>
> But this message shouldn't happen under normal conditions or because of
> netfront. Right?
>
I've never seen a report like this before, even in cases where netfront is
buggy.
Wei.
> Regards,
>
> --
> Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: David Vrabel @ 2015-01-27 16:56 UTC (permalink / raw)
To: xen-devel
On 27/01/15 16:45, Wei Liu wrote:
> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>> Hi,
>>
>> While I'm working on support for 64K pages in netfront, I got
>> an rcu_sched self-detected stall message. It happens when netback is
>> disabling the vif device due to an error.
>>
>> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
>> the processor is stuck in xenvif_rx_queue_purge?
>>
>
> When you try to release an SKB, the core network driver needs to enter
> an RCU critical region to clean up. dst_release, for one, calls call_rcu.
This is RCU detecting a soft-lockup. You're either spinning on the
spinlock or the guest rx queue is corrupt and cannot be drained.
David
* Re: rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-01-28 16:45 UTC (permalink / raw)
To: Wei Liu; +Cc: David Vrabel, Ian Campbell, xen-devel
On 27/01/15 16:53, Wei Liu wrote:
> On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
>> On 27/01/15 16:45, Wei Liu wrote:
>>> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>>>> Hi,
>>>>
>>>> While I'm working on support for 64K pages in netfront, I got
>>>> an rcu_sched self-detected stall message. It happens when netback is
>>>> disabling the vif device due to an error.
>>>>
>>>> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
>>>> the processor is stuck in xenvif_rx_queue_purge?
>>>>
>>>
>>> When you try to release an SKB, the core network driver needs to enter
>>> an RCU critical region to clean up. dst_release, for one, calls call_rcu.
>>
>> But this message shouldn't happen under normal conditions or because of
>> netfront. Right?
>>
>
> I've never seen a report like this before, even in cases where netfront is
> buggy.
This is only happening when preemption is not enabled (i.e.
CONFIG_PREEMPT_NONE in the config file) in the backend kernel.
When the vif is disabled, the loop in xenvif_kthread_guest_rx turns
into an infinite loop. In my case, the code executed looks like:
1. for (;;) {
2. xenvif_wait_for_rx_work(queue);
3.
4. if (kthread_should_stop())
5. break;
6.
7. if (unlikely(vif->disabled && queue->id == 0)) {
8. xenvif_carrier_off(vif);
9. xenvif_rx_queue_purge(queue);
10. continue;
11. }
12. }
The wait on line 2 will return immediately because the vif is disabled
(see xenvif_have_rx_work).
We are on queue 0, so the condition on line 7 is true. Therefore we will
loop via line 10. And so on...
On platforms where preemption is not enabled, this thread will never
yield to another thread (unless the domain is destroyed).
Regards,
--
Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: David Vrabel @ 2015-01-28 17:06 UTC (permalink / raw)
To: Julien Grall, Wei Liu; +Cc: Ian Campbell, xen-devel
On 28/01/15 16:45, Julien Grall wrote:
> On 27/01/15 16:53, Wei Liu wrote:
>> On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
>>> On 27/01/15 16:45, Wei Liu wrote:
>>>> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>>>>> Hi,
>>>>>
>>>>> While I'm working on support for 64K pages in netfront, I got
>>>>> an rcu_sched self-detected stall message. It happens when netback is
>>>>> disabling the vif device due to an error.
>>>>>
>>>>> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
>>>>> the processor is stuck in xenvif_rx_queue_purge?
>>>>>
>>>>
>>>> When you try to release an SKB, the core network driver needs to enter
>>>> an RCU critical region to clean up. dst_release, for one, calls call_rcu.
>>>
>>> But this message shouldn't happen under normal conditions or because of
>>> netfront. Right?
>>>
>>
>> I've never seen a report like this before, even in cases where netfront is
>> buggy.
>
> This is only happening when preemption is not enabled (i.e.
> CONFIG_PREEMPT_NONE in the config file) in the backend kernel.
>
> When the vif is disabled, the loop in xenvif_kthread_guest_rx turns
> into an infinite loop. In my case, the code executed looks like:
>
>
> 1. for (;;) {
> 2. xenvif_wait_for_rx_work(queue);
> 3.
> 4. if (kthread_should_stop())
> 5. break;
> 6.
> 7. if (unlikely(vif->disabled && queue->id == 0)) {
> 8. xenvif_carrier_off(vif);
> 9. xenvif_rx_queue_purge(queue);
> 10. continue;
> 11. }
> 12. }
>
> The wait on line 2 will return immediately because the vif is disabled
> (see xenvif_have_rx_work).
>
> We are on queue 0, so the condition on line 7 is true. Therefore we will
> loop via line 10. And so on...
>
> On platforms where preemption is not enabled, this thread will never
> yield to another thread (unless the domain is destroyed).
I'm not sure why we have a continue in the vif->disabled case and not
just a break. Can you try that?
David
* Re: rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-01-28 17:27 UTC (permalink / raw)
To: David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel
On 28/01/15 17:06, David Vrabel wrote:
> On 28/01/15 16:45, Julien Grall wrote:
>> On 27/01/15 16:53, Wei Liu wrote:
>>> On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
>>>> On 27/01/15 16:45, Wei Liu wrote:
>>>>> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>>>>>> Hi,
>>>>>>
>>>>>> While I'm working on support for 64K pages in netfront, I got
>>>>>> an rcu_sched self-detected stall message. It happens when netback is
>>>>>> disabling the vif device due to an error.
>>>>>>
>>>>>> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
>>>>>> the processor is stuck in xenvif_rx_queue_purge?
>>>>>>
>>>>>
>>>>> When you try to release an SKB, the core network driver needs to enter
>>>>> an RCU critical region to clean up. dst_release, for one, calls call_rcu.
>>>>
>>>> But this message shouldn't happen under normal conditions or because of
>>>> netfront. Right?
>>>>
>>>
>>> I've never seen a report like this before, even in cases where netfront is
>>> buggy.
>>
>> This is only happening when preemption is not enabled (i.e.
>> CONFIG_PREEMPT_NONE in the config file) in the backend kernel.
>>
>> When the vif is disabled, the loop in xenvif_kthread_guest_rx turns
>> into an infinite loop. In my case, the code executed looks like:
>>
>>
>> 1. for (;;) {
>> 2. xenvif_wait_for_rx_work(queue);
>> 3.
>> 4. if (kthread_should_stop())
>> 5. break;
>> 6.
>> 7. if (unlikely(vif->disabled && queue->id == 0)) {
>> 8. xenvif_carrier_off(vif);
>> 9. xenvif_rx_queue_purge(queue);
>> 10. continue;
>> 11. }
>> 12. }
>>
>> The wait on line 2 will return immediately because the vif is disabled
>> (see xenvif_have_rx_work).
>>
>> We are on queue 0, so the condition on line 7 is true. Therefore we will
>> loop via line 10. And so on...
>>
>> On platforms where preemption is not enabled, this thread will never
>> yield to another thread (unless the domain is destroyed).
>
> I'm not sure why we have a continue in the vif->disabled case and not
> just a break. Can you try that?
So I applied this small patch:
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 908e65e..9448c6c 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2110,7 +2110,7 @@ int xenvif_kthread_guest_rx(void *data)
if (unlikely(vif->disabled && queue->id == 0)) {
xenvif_carrier_off(vif);
xenvif_rx_queue_purge(queue);
- continue;
+ break;
}
if (!skb_queue_empty(&queue->rx_queue))
While I don't get the rcu_sched stall message anymore, when I destroy the
guest the backend hits a NULL pointer dereference:
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = ffff800000a50000
[00000000] *pgd=00000083de82a003, *pud=00000083de82b003, *pmd=00000083de82c003, *pte=00600000e1110707
Internal error: Oops: 96000006 [#1] SMP
Modules linked in:
CPU: 4 PID: 34 Comm: xenwatch Not tainted 3.19.0-rc5-xen-seattle+ #13
Hardware name: AMD Seattle (RevA) Development Board (Overdrive) (DT)
task: ffff80001ea39480 ti: ffff80001ea78000 task.ti: ffff80001ea78000
PC is at exit_creds+0x18/0x70
LR is at __put_task_struct+0x3c/0xd4
pc : [<ffff8000000b2d94>] lr : [<ffff800000094990>] pstate: 80000145
sp : ffff80001ea7bc50
x29: ffff80001ea7bc50 x28: 0000000000000000
x27: 0000000000000000 x26: 0000000000000000
x25: 0000000000000000 x24: ffff80001eb3c840
x23: ffff80001eb3c840 x22: 000000000006c560
x21: ffff0000011f7000 x20: 0000000000000000
x19: ffff80001ba06680 x18: 0000ffffd2635bd0
x17: 0000ffff839e4074 x16: 00000000deadbeef
x15: ffffffffffffffff x14: 0ffffffffffffffe
x13: 0000000000000028 x12: 0000000000000010
x11: 0000000000000030 x10: 0101010101010101
x9 : ffff80001ea7b8e0 x8 : ffff7c01cf6e2740
x7 : 0000000000000000 x6 : 0000000000002fc9
x5 : 0000000000000000 x4 : 0000000000000001
x3 : 0000000000000000 x2 : ffff80001ba06690
x1 : 0000000000000000 x0 : 0000000000000000
Process xenwatch (pid: 34, stack limit = 0xffff80001ea78058)
Stack: (0xffff80001ea7bc50 to 0xffff80001ea7c000)
bc40: 1ea7bc70 ffff8000 00094990 ffff8000
bc60: 1ba06680 ffff8000 008b45a8 ffff8000 1ea7bc90 ffff8000 000b15f0 ffff8000
bc80: 1ba06680 ffff8000 005bcab8 ffff8000 1ea7bcc0 ffff8000 00541efc ffff8000
bca0: 011ed000 ffff0000 00000000 00000000 011f7000 ffff0000 00000006 00000000
bcc0: 1ea7bd00 ffff8000 00540984 ffff8000 1ce23680 ffff8000 00000006 00000000
bce0: 00752cf0 ffff8000 00000001 00000000 00752e38 ffff8000 1ea7bd98 ffff8000
bd00: 1ea7bd40 ffff8000 00540bcc ffff8000 1ce23680 ffff8000 1cce0c00 ffff8000
bd20: 00000000 00000000 1cce0c00 ffff8000 009b0288 ffff8000 1ea7be20 ffff8000
bd40: 1ea7bd70 ffff8000 0048011c ffff8000 1ce23700 ffff8000 1cf71000 ffff8000
bd60: 009a6258 ffff8000 00a36d38 00000000 1ea7bdb0 ffff8000 00480ea4 ffff8000
bd80: 1b89d800 ffff8000 009a62b0 ffff8000 009a6258 ffff8000 00a36d38 ffff8000
bda0: 00a36e30 ffff8000 0047f7c0 ffff8000 1ea7bdc0 ffff8000 0047f82c ffff8000
bdc0: 1ea7be30 ffff8000 000b1064 ffff8000 1ea48cc0 ffff8000 009dbfe8 ffff8000
bde0: 008552d8 ffff8000 00000000 00000000 0047f778 ffff8000 00000000 00000000
be00: 1ea7be30 ffff8000 00000000 ffff8000 1ea39480 ffff8000 000c75f8 ffff8000
be20: 1ea7be20 ffff8000 1ea7be20 ffff8000 00000000 00000000 00085930 ffff8000
be40: 000b0f88 ffff8000 1ea48cc0 ffff8000 00000000 00000000 00000000 00000000
be60: 00000000 00000000 1ea48cc0 ffff8000 00000000 00000000 00000000 00000000
be80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bea0: 1ea7bea0 ffff8000 1ea7bea0 ffff8000 00000000 ffff8000 00000000 00000000
bec0: 1ea7bec0 ffff8000 1ea7bec0 ffff8000 00000000 00000000 00000000 00000000
bee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf00: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf20: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf40: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf60: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bf80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfa0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000005 00000000
bfe0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Call trace:
[<ffff8000000b2d94>] exit_creds+0x18/0x70
[<ffff80000009498c>] __put_task_struct+0x38/0xd4
[<ffff8000000b15ec>] kthread_stop+0xc0/0x130
[<ffff800000541ef8>] xenvif_disconnect+0x58/0xd0
[<ffff800000540980>] set_backend_state+0x134/0x278
[<ffff800000540bc8>] frontend_changed+0x8c/0xec
[<ffff800000480118>] xenbus_otherend_changed+0x9c/0xa4
[<ffff800000480ea0>] frontend_changed+0xc/0x18
[<ffff80000047f828>] xenwatch_thread+0xb0/0x140
[<ffff8000000b1060>] kthread+0xd8/0xf0
Code: f9000bf3 aa0003f3 f9422401 f9422000 (b9400021)
---[ end trace af11d521ee530da8 ]---
Regards,
--
Julien Grall
* Re: rcu_sched self-detect stall when disable vif device
From: David Vrabel @ 2015-01-30 16:04 UTC (permalink / raw)
To: Julien Grall, David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel
On 28/01/15 17:27, Julien Grall wrote:
> On 28/01/15 17:06, David Vrabel wrote:
>> On 28/01/15 16:45, Julien Grall wrote:
>>> On 27/01/15 16:53, Wei Liu wrote:
>>>> On Tue, Jan 27, 2015 at 04:47:45PM +0000, Julien Grall wrote:
>>>>> On 27/01/15 16:45, Wei Liu wrote:
>>>>>> On Tue, Jan 27, 2015 at 04:03:52PM +0000, Julien Grall wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> While I'm working on support for 64K pages in netfront, I got
>>>>>>> an rcu_sched self-detected stall message. It happens when netback is
>>>>>>> disabling the vif device due to an error.
>>>>>>>
>>>>>>> I'm using Linux 3.19-rc5 on Seattle (ARM64). Any idea why
>>>>>>> the processor is stuck in xenvif_rx_queue_purge?
>>>>>>>
>>>>>>
>>>>>> When you try to release an SKB, the core network driver needs to enter
>>>>>> an RCU critical region to clean up. dst_release, for one, calls call_rcu.
>>>>>
>>>>> But this message shouldn't happen under normal conditions or because of
>>>>> netfront. Right?
>>>>>
>>>>
>>>> I've never seen a report like this before, even in cases where netfront is
>>>> buggy.
>>>
>>> This is only happening when preemption is not enabled (i.e.
>>> CONFIG_PREEMPT_NONE in the config file) in the backend kernel.
>>>
>>> When the vif is disabled, the loop in xenvif_kthread_guest_rx turns
>>> into an infinite loop. In my case, the code executed looks like:
>>>
>>>
>>> 1. for (;;) {
>>> 2. xenvif_wait_for_rx_work(queue);
>>> 3.
>>> 4. if (kthread_should_stop())
>>> 5. break;
>>> 6.
>>> 7. if (unlikely(vif->disabled && queue->id == 0)) {
>>> 8. xenvif_carrier_off(vif);
>>> 9. xenvif_rx_queue_purge(queue);
>>> 10. continue;
>>> 11. }
>>> 12. }
>>>
>>> The wait on line 2 will return immediately because the vif is disabled
>>> (see xenvif_have_rx_work).
>>>
>>> We are on queue 0, so the condition on line 7 is true. Therefore we will
>>> loop via line 10. And so on...
>>>
>>> On platforms where preemption is not enabled, this thread will never
>>> yield to another thread (unless the domain is destroyed).
>>
>> I'm not sure why we have a continue in the vif->disabled case and not
>> just a break. Can you try that?
>
> So I applied this small patch:
>
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 908e65e..9448c6c 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -2110,7 +2110,7 @@ int xenvif_kthread_guest_rx(void *data)
> if (unlikely(vif->disabled && queue->id == 0)) {
> xenvif_carrier_off(vif);
> xenvif_rx_queue_purge(queue);
> - continue;
> + break;
> }
>
> if (!skb_queue_empty(&queue->rx_queue))
How about this?
8<------------------------------------------
xen-netback: stop the guest rx thread after a fatal error
After commit e9d8b2c2968499c1f96563e6522c56958d5a1d0d (xen-netback:
disable rogue vif in kthread context), a fatal (protocol) error would
leave the guest Rx thread spinning, wasting CPU time. Commit
ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e (xen-netback: reintroduce
guest Rx stall detection) made this even worse by removing a
cond_resched() from this path.
A fatal error is non-recoverable so just allow the guest Rx thread to
exit. This requires taking additional refs to the task so the thread
exiting early is handled safely.
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
diff --git a/drivers/net/xen-netback/interface.c b/drivers/net/xen-netback/interface.c
index 9259a73..037f74f 100644
--- a/drivers/net/xen-netback/interface.c
+++ b/drivers/net/xen-netback/interface.c
@@ -578,6 +578,7 @@ int xenvif_connect(struct xenvif_queue *queue, unsigned long tx_ring_ref,
goto err_rx_unbind;
}
queue->task = task;
+ get_task_struct(task);
task = kthread_create(xenvif_dealloc_kthread,
(void *)queue, "%s-dealloc", queue->name);
@@ -634,6 +635,7 @@ void xenvif_disconnect(struct xenvif *vif)
if (queue->task) {
kthread_stop(queue->task);
+ put_task_struct(queue->task);
queue->task = NULL;
}
diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 908e65e..c8ce701 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -2109,8 +2109,7 @@ int xenvif_kthread_guest_rx(void *data)
*/
if (unlikely(vif->disabled && queue->id == 0)) {
xenvif_carrier_off(vif);
- xenvif_rx_queue_purge(queue);
- continue;
+ break;
}
if (!skb_queue_empty(&queue->rx_queue))
* Re: rcu_sched self-detect stall when disable vif device
From: Julien Grall @ 2015-02-02 13:54 UTC (permalink / raw)
To: David Vrabel, Wei Liu; +Cc: Ian Campbell, xen-devel
Hi David,
On 30/01/15 16:04, David Vrabel wrote:
> How about this?
This is working for me. Thanks!
> 8<------------------------------------------
> xen-netback: stop the guest rx thread after a fatal error
>
> After commit e9d8b2c2968499c1f96563e6522c56958d5a1d0d (xen-netback:
> disable rogue vif in kthread context), a fatal (protocol) error would
> leave the guest Rx thread spinning, wasting CPU time. Commit
> ecf08d2dbb96d5a4b4bcc53a39e8d29cc8fef02e (xen-netback: reintroduce
> guest Rx stall detection) made this even worse by removing a
> cond_resched() from this path.
>
> A fatal error is non-recoverable so just allow the guest Rx thread to
> exit. This requires taking additional refs to the task so the thread
> exiting early is handled safely.
>
> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Reported-by: Julien Grall <julien.grall@linaro.org>
Tested-by: Julien Grall <julien.grall@linaro.org>
Regards,
--
Julien Grall