* new netfront and occasional receive path lockup
@ 2010-08-22 16:43 Christophe Saout
  2010-08-22 18:37 ` Christophe Saout
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Christophe Saout @ 2010-08-22 16:43 UTC (permalink / raw)
  To: xen-devel

Hi,

I've been playing with some of the new pvops code, namely the DomU guest
code.  What I've been observing on one of the virtual machines is that
the network (vif) is dying after about ten to sixty minutes of uptime.
The unfortunate thing here is that I can only reproduce it on a
production VM and have been unlucky so far in triggering the bug on a
test machine.  While this has not been tragic - rebooting fixed the
issue - unfortunately I can't spend very much time on debugging after
the issue pops up.

Now, what is happening is that the receive path goes dead.  The DomU can
send packets to Dom0 and those are visible using tcpdump on the Dom0 on
the virtual interface, but not the other way around.

Now, I have done more than one change at a time (I'd like to avoid
having to pin it down, since I can only reproduce it on a production
machine, as I said, so suggestions are welcome), but my suspicion is
that it might have to do with the new "smart polling" feature in
xen/netfront.  Note that I have also updated Dom0 to pull in the latest
dom0/backend and netback changes, just to make sure it's not due to an
issue that has already been fixed there, but I'm still seeing the same
behaviour.

The production machine doesn't have much network load, but deals with a
lot of small network requests (mostly DNS and SMTP) - a workload which
is hard to reproduce on the test machine.  Heavy network load (NFS, FTP
and so on) for days hasn't triggered the problem.  Also, segmentation
offloading and similar settings don't have any effect.

The machine has 2 physical CPUs, the VM has 2 virtual CPUs, and the DomU
kernel has PREEMPT enabled.

I've been looking at the code to see if there might be a race condition
somewhere - something like a situation where the hrtimer doesn't run
while Dom0 believes the DomU should be polling and therefore doesn't
emit an interrupt - but I'm afraid I don't know enough to judge this (I
mean, there are spinlocks, and those look safe to me).

Do you have any suggestions on what to try?  I can trigger the issue on
the production VM again, but debugging must take no more than a few
minutes once it happens.  Access is only possible via the console.
Neither Dom0 nor the guest shows anything unusual in the kernel
messages, and both continue to behave normally after the network goes
dead (I'm also able to shut down the guest normally).

Thanks,
	Christophe


* Re: new netfront and occasional receive path lockup
  2010-08-22 16:43 new netfront and occasional receive path lockup Christophe Saout
@ 2010-08-22 18:37 ` Christophe Saout
  2010-08-24  0:53   ` Jeremy Fitzhardinge
  2010-08-23 14:26 ` Christophe Saout
  2010-08-24  0:46 ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 31+ messages in thread
From: Christophe Saout @ 2010-08-22 18:37 UTC (permalink / raw)
  To: xen-devel

Hi again,

> I've been looking at the code to see if there might be a race condition
> somewhere - something like a situation where the hrtimer doesn't run
> while Dom0 believes the DomU should be polling and therefore doesn't
> emit an interrupt - but I'm afraid I don't know enough to judge this (I
> mean, there are spinlocks, and those look safe to me).

Hmm, looking a bit more.

rx.sring->private.netif.smartpoll_active lies in a piece of memory that
is shared between netback and netfront, is that right?

If that is so, the tx spinlock in netfront only protects against
simultaneous modifications from another thread in netfront, so netback
can read smartpoll_active while netfront is fiddling with it.  Is that
safe?

Note that when the lockup occurs, /proc/interrupts in the guest doesn't
show any interrupts arriving for eth0 anymore.  Are there any conditions
where netback waits for netfront to retrieve packets even when new
packets arrive? (like e.g. when the ring is full and there is backlog
into the network stack or something?)  Any way to debug this from the
Dom0 side?  Like looking into the state of the ring from userspace?
Debug options?

	Christophe


* Re: new netfront and occasional receive path lockup
  2010-08-22 16:43 new netfront and occasional receive path lockup Christophe Saout
  2010-08-22 18:37 ` Christophe Saout
@ 2010-08-23 14:26 ` Christophe Saout
  2010-08-23 16:04   ` Konrad Rzeszutek Wilk
  2010-08-24  0:46 ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 31+ messages in thread
From: Christophe Saout @ 2010-08-23 14:26 UTC (permalink / raw)
  To: xen-devel

Hi yet again,

[not quoting everything again]

I finally managed to trigger the issue on the test VM, which has been
stuck in that state since last night and can be inspected.  Apparently
the tx ring on the netback side is full, since every packet sent is
immediately dropped (as seen from ifconfig output).  No interrupts
arriving in the guest.

Still, I'm wondering what would be the best course of action to debug
this now.  Should I have compiled some debugger support into the
hypervisor?  (gdbsx apparently needs that)

Thanks,
	Christophe


* Re: new netfront and occasional receive path lockup
  2010-08-23 14:26 ` Christophe Saout
@ 2010-08-23 16:04   ` Konrad Rzeszutek Wilk
  2010-08-23 17:09     ` Christophe Saout
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-08-23 16:04 UTC (permalink / raw)
  To: Christophe Saout; +Cc: xen-devel

On Mon, Aug 23, 2010 at 04:26:52PM +0200, Christophe Saout wrote:
> Hi yet again,
> 
> [not quoting everything again]
> 
> I finally managed to trigger the issue on the test VM, which has been
> stuck in that state since last night and can be inspected.  Apparently
> the tx ring on the netback side is full, since every packet sent is
> immediately dropped (as seen from ifconfig output).  No interrupts
> arriving in the guest.

What is the kernel and hypervisor in Dom0? And what is it in DomU?

> 
> Still, I'm wondering what would be the best course of action to debug
> this now.  Should I have compiled some debugger support into the
> hypervisor?  (gdbsx apparently needs that)

Sure.  An easier path might be to do 'xm debug-keys q', which should
trigger the debug IRQ handler.  In DomU that should print out all of
the event channel bits, which we can analyze to see if the proper bits
are not set (and hence the IRQ handler isn't picking up from the ring
buffer).
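
For reference, a rough sketch of how that looks from the Dom0 command
line - the debug-key output lands in the Xen console ring rather than
on stdout:

  # Ask the hypervisor to dump domain/event-channel state ('q' is
  # the debug key referred to above):
  xm debug-keys q

  # Read the result back from the hypervisor console buffer:
  xm dmesg | tail -n 100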


* Re: new netfront and occasional receive path lockup
  2010-08-23 16:04   ` Konrad Rzeszutek Wilk
@ 2010-08-23 17:09     ` Christophe Saout
  0 siblings, 0 replies; 31+ messages in thread
From: Christophe Saout @ 2010-08-23 17:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel

Hi Konrad,

> > I finally managed to trigger the issue on the test VM, which has been
> > stuck in that state since last night and can be inspected.  Apparently
> > the tx ring on the netback side is full, since every packet sent is
> > immediately dropped (as seen from ifconfig output).  No interrupts
> > arriving in the guest.
> 
> What is the kernel and hypervisor in Dom0? And what is it in DomU?

The hypervisor is from the Xen 4.0.0 release and the Dom0 kernel is from
Jeremy's 2.6.32 stable branch for pvops Dom0 (lately with the
xen/dom0/backend branches merged on top, because I hoped there might be
some fixes that help).  The same kernel has been working fine as a
guest, but my newer one - an upstream 2.6.35 with some of the upstream
fixes branches applied and xen/netfront pulled in - is now causing this
issue.  Everything else is working just fine, so I am pretty sure it is
related to a netfront-specific change and not to anything else.

> > hypervisor? (gdbsx apparently needs that)
> 
> Sure.

Also, I noticed that "gdb /path/to/vmlinux /proc/kcore" does allow me to
inspect the memory.  I'll try to see if I can pinpoint some of the
interesting memory locations.
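
For illustration, the sort of inspection that makes possible - the
struct and field names below come from the smartpoll patch attached
later in this thread, and NP_ADDR is a hypothetical placeholder for the
address of the device's netfront_info:

  gdb /path/to/vmlinux /proc/kcore
  (gdb) print/x ((struct netfront_info *) NP_ADDR)->rx.sring->private.netif.smartpoll_active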

> An easier path might be to do 'xm debug-keys q' which should
> trigger the debug irq handler. In DomU that should print out all of the
> event channel bits which we can analyze that and see if the
> proper bits are not set (and hence the IRQ handler isn't picking up
> from the ring buffer).

I'm not exactly sure how to read the output of that.
http://www.saout.de/assets/xm-debug-q.txt

	Christophe


* Re: new netfront and occasional receive path lockup
  2010-08-22 16:43 new netfront and occasional receive path lockup Christophe Saout
  2010-08-22 18:37 ` Christophe Saout
  2010-08-23 14:26 ` Christophe Saout
@ 2010-08-24  0:46 ` Jeremy Fitzhardinge
  2010-08-25  0:51   ` Xu, Dongxiao
  2 siblings, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-08-24  0:46 UTC (permalink / raw)
  To: Christophe Saout; +Cc: Xu, Dongxiao, xen-devel

 On 08/22/2010 09:43 AM, Christophe Saout wrote:
> Hi,
>
> I've been playing with some of the new pvops code, namely the DomU guest
> code.  What I've been observing on one of the virtual machines is that
> the network (vif) is dying after about ten to sixty minutes of uptime.
> The unfortunate thing here is that I can only reproduce it on a
> production VM and have been unlucky so far in triggering the bug on a
> test machine.  While this has not been tragic - rebooting fixed the
> issue - unfortunately I can't spend very much time on debugging after
> the issue pops up.

Ah, OK.  I've seen this a couple of times as well.  And it just happened
to me then...


> Now, what is happening is that the receive path goes dead.  The DomU can
> send packets to Dom0 and those are visible using tcpdump on the Dom0 on
> the virtual interface, but not the other way around.

I hadn't got to that level of diagnosis, but I can confirm that that's
what seems to be happening here too.

> Now, I have done more than one change at a time (I'd like to avoid
> having to pin it down, since I can only reproduce it on a production
> machine, as I said, so suggestions are welcome), but my suspicion is
> that it might have to do with the new "smart polling" feature in
> xen/netfront.  Note that I have also updated Dom0 to pull in the latest
> dom0/backend and netback changes, just to make sure it's not due to an
> issue that has already been fixed there, but I'm still seeing the same
> behaviour.

I agree.  I think I started seeing this once I merged smartpoll into
netfront.

    J


* Re: new netfront and occasional receive path lockup
  2010-08-22 18:37 ` Christophe Saout
@ 2010-08-24  0:53   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-08-24  0:53 UTC (permalink / raw)
  To: Christophe Saout; +Cc: Xu, Dongxiao, xen-devel

 On 08/22/2010 11:37 AM, Christophe Saout wrote:
> Hmm, looking a bit more.
>
> rx.sring->private.netif.smartpoll_active lies in a piece of memory that
> is shared between netback and netfront, is that right?
>
> If that is so, the tx spinlock in netfront only protects against
> simultaneous modifications from another thread in netfront, so netback
> can read smartpoll_active while netfront is fiddling with it.  Is that
> safe?

It depends on exactly how it is used.  But any use of cross-CPU shared
memory must carefully consider access ordering, and possibly needs
explicit barriers to make sure that the expected ordering is actually
seen by all CPUs.
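
As a minimal sketch of the discipline meant here (a userspace C11
analogue; the kernel of that era would use its own wmb()/rmb()-style
barriers instead, and the helper names are hypothetical):

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Stand-in for a flag living in memory shared between two domains. */
  struct shared_flag_example {
      _Atomic int smartpoll_active;
  };

  /* Publisher: release ordering makes all earlier writes (e.g. ring
   * updates) visible to the peer before the flag itself changes. */
  static void set_polling(struct shared_flag_example *s, bool on)
  {
      atomic_store_explicit(&s->smartpoll_active, on,
                            memory_order_release);
  }

  /* Consumer: acquire ordering ensures that, once the flag is seen,
   * the reads that follow also see the publisher's earlier writes. */
  static bool peer_polling(struct shared_flag_example *s)
  {
      return atomic_load_explicit(&s->smartpoll_active,
                                  memory_order_acquire);
  }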

    J


* RE: new netfront and occasional receive path lockup
  2010-08-24  0:46 ` Jeremy Fitzhardinge
@ 2010-08-25  0:51   ` Xu, Dongxiao
  2010-09-09 18:50     ` Pasi Kärkkäinen
  0 siblings, 1 reply; 31+ messages in thread
From: Xu, Dongxiao @ 2010-08-25  0:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Christophe Saout; +Cc: xen-devel

Hi Christophe,

Thanks for finding and checking the problem.
I will try to reproduce the issue and check what caused the problem.

Thanks,
Dongxiao


* Re: new netfront and occasional receive path lockup
  2010-08-25  0:51   ` Xu, Dongxiao
@ 2010-09-09 18:50     ` Pasi Kärkkäinen
  2010-09-10  0:55       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 31+ messages in thread
From: Pasi Kärkkäinen @ 2010-09-09 18:50 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel, Christophe Saout

On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
> Hi Christophe,
> 
> Thanks for finding and checking the problem.
> I will try to reproduce the issue and check what caused the problem.
> 

Hello,

Was this issue resolved?  Some users have been complaining about
"network freezing up" issues recently on ##xen on IRC.

-- Pasi


* Re: new netfront and occasional receive path lockup
  2010-09-09 18:50     ` Pasi Kärkkäinen
@ 2010-09-10  0:55       ` Jeremy Fitzhardinge
  2010-09-10  1:45         ` Xu, Dongxiao
  0 siblings, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-10  0:55 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: Xu, Dongxiao, xen-devel, Christophe Saout

 On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>> Hi Christophe,
>>
>> Thanks for finding and checking the problem.
>> I will try to reproduce the issue and check what caused the problem.
>>
> Hello,
>
> Was this issue resolved?  Some users have been complaining about
> "network freezing up" issues recently on ##xen on IRC.

Yeah, I'll add a command-line parameter to disable smartpoll (and leave
it off by default).

    J


* RE: new netfront and occasional receive path lockup
  2010-09-10  0:55       ` Jeremy Fitzhardinge
@ 2010-09-10  1:45         ` Xu, Dongxiao
  2010-09-10  2:25           ` Jeremy Fitzhardinge
  2010-09-12  1:00           ` Gerald Turner
  0 siblings, 2 replies; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-10  1:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Pasi Kärkkäinen
  Cc: xen-devel, Christophe Saout

[-- Attachment #1: Type: text/plain, Size: 4434 bytes --]

Hi Jeremy and Pasi,

I was frustrated that I couldn't reproduce this bug at my site.

However, I investigated the code, and indeed there is one race condition
that probably causes the bug.  See the attached patch.

Could anybody who is seeing this bug help to try it?  Much appreciated!

Thanks,
Dongxiao



[-- Attachment #2: 0001-Fix-one-race-condition-for-netfront-smartpoll-logic.patch --]
[-- Type: application/octet-stream, Size: 1694 bytes --]

From 4304521d61573332033e3799e28f6ffb12a0654a Mon Sep 17 00:00:00 2001
From: Dongxiao Xu <dongxiao.xu@intel.com>
Date: Fri, 10 Sep 2010 09:00:54 +0800
Subject: [PATCH] Fix one race condition for netfront smartpoll logic

Consider the following case: netfront's poll does not find any data,
so it clears the shared flag to indicate that it is not polling.
However, at this moment (netfront has cleared the flag but is still
in the hrtimer callback), netback triggers an interrupt to netfront,
whose interrupt handler tries to start the timer; the start fails
since the timer is still alive.

Add logic so that, if starting the new hrtimer fails, netfront clears
the shared flag to indicate that it is not polling.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
---
 drivers/net/xen-netfront.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e894dd2..03e19b0 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1397,10 +1397,15 @@ static irqreturn_t xennet_interrupt(int irq, void *dev_id)
 			napi_schedule(&np->napi);
 	}
 
-	if (np->smart_poll.feature_smart_poll)
-		hrtimer_start(&np->smart_poll.timer,
-			ktime_set(0, NANO_SECOND/np->smart_poll.smart_poll_freq),
-			HRTIMER_MODE_REL);
+	if (np->smart_poll.feature_smart_poll) {
+		if ( hrtimer_start(&np->smart_poll.timer,
+			ktime_set(0,NANO_SECOND/np->smart_poll.smart_poll_freq),
+			HRTIMER_MODE_REL) ) {
+			printk(KERN_DEBUG "Failed to start hrtimer,"
+					"use interrupt mode for this packet\n");
+			np->rx.sring->private.netif.smartpoll_active = 0;
+		}
+	}
 
 	spin_unlock_irqrestore(&np->tx_lock, flags);
 
-- 
1.6.3



* Re: new netfront and occasional receive path lockup
  2010-09-10  1:45         ` Xu, Dongxiao
@ 2010-09-10  2:25           ` Jeremy Fitzhardinge
  2010-09-10  2:37             ` Xu, Dongxiao
  2010-09-12  1:00           ` Gerald Turner
  1 sibling, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-10  2:25 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: xen-devel, Christophe Saout

 On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.

Perhaps you have been trying to reproduce it in the wrong conditions?  I
have generally seen this bug when the networking is under very light
load, such as a couple of fairly idle dom0<->domU ssh connections.  I'm
not sure that I've seen it under heavy load.

> However, I investigated the code, and indeed there is one race
> condition that probably causes the bug.  See the attached patch.
>
> Could anybody who is seeing this bug help to try it?  Much appreciated!

Thanks for looking into this.  Your logic seems reasonable, so I'll
apply it (however I also added a patch to make smartpoll default to
"off"; I guess I can switch that to default on again to make sure it
gets tested, but leave the option as a workaround if there are still
problems).

However, I am concerned about these manipulations of a cross-CPU shared
variable without any barriers or other ordering constraints.  Are you
sure this code is correct under any reordering (either by the compiler
or CPUs), and if the compiler decides to access it more or less often
than the source says it should?

Thanks,
    J


* RE: new netfront and occasional receive path lockup
  2010-09-10  2:25           ` Jeremy Fitzhardinge
@ 2010-09-10  2:37             ` Xu, Dongxiao
  2010-09-10  2:42               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-10  2:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel, Christophe Saout

Jeremy Fitzhardinge wrote:
>  On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
>> Hi Jeremy and Pasi,
>> 
>> I was frustrated that I couldn't reproduce this bug at my site.
> 
> Perhaps you have been trying to reproduce it in the wrong conditions?
> I have generally seen this bug when the networking is under very
> light load, such as a couple of fairly idle dom0<->domU ssh
> connections.  I'm not sure that I've seen it under heavy load.   
> 
>> However, I investigated the code, and indeed there is one race
>> condition that probably causes the bug.  See the attached patch.
>> 
>> Could anybody who is seeing this bug help to try it?  Much appreciated!
> 
> Thanks for looking into this.  Your logic seems reasonable, so I'll
> apply it (however I also added a patch to make smartpoll default to
> "off"; I guess I can switch that to default on again to make sure it
> gets tested, but leave the option as a workaround if there are still
> problems).    
> 
> However, I am concerned about these manipulations of a cross-CPU
> shared variable without any barriers or other ordering constraints.
> Are you sure this code is correct under any reordering (either by the
> compiler or CPUs), and if the compiler decides to access it more or
> less often than the source says it should?

Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
It is a flag in the shared ring structure, so operations on this flag
are handled the same way as the other components of the shared ring,
e.g. under the spinlock, etc.

I will leave dom0 and domU ssh connections open for some time to see if
the bug still shows up.

Thanks,
Dongxiao


* Re: new netfront and occasional receive path lockup
  2010-09-10  2:37             ` Xu, Dongxiao
@ 2010-09-10  2:42               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-10  2:42 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: xen-devel, Christophe Saout

 On 09/10/2010 12:37 PM, Xu, Dongxiao wrote:
>> However, I am concerned about these manipulations of a cross-CPU
>> shared variable without any barriers or other ordering constraints.
>> Are you sure this code is correct under any reordering (either by the
>> compiler or CPUs), and if the compiler decides to access it more or
>> less often than the source says it should?
> Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
> It is a flag in the shared ring structure, so operations on this flag
> are handled the same way as the other components of the shared ring,
> e.g. under the spinlock, etc.

Spinlocks are no use for inter-domain synchronization, only within a
domain.  The other ring operations are carefully ordered with
appropriate memory barriers in specific places; that's why I'm a bit
concerned about their absence for the smartpoll_active flag.  Even if
they are not necessary, I'd like to see an analysis as to why.
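
For reference, this is roughly the producer-side pattern from Xen's
io/ring.h that the "carefully ordered" remark refers to (paraphrased
from memory, so treat the details as approximate):

  /* Publish new requests and decide whether the peer needs an event.
   * The wmb() makes the ring entries visible before the producer
   * index; the mb() orders the index update against the req_event
   * check that decides whether to notify. */
  #define RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(_r, _notify) do {        \
      RING_IDX __old = (_r)->sring->req_prod;                          \
      RING_IDX __new = (_r)->req_prod_pvt;                             \
      wmb(); /* peer sees requests /before/ updated producer index */  \
      (_r)->sring->req_prod = __new;                                   \
      mb();  /* peer sees new requests /before/ we check req_event */  \
      (_notify) = ((RING_IDX)(__new - (_r)->sring->req_event) <        \
                   (RING_IDX)(__new - __old));                         \
  } while (0)

Nothing comparable guards smartpoll_active, which is the concern here.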

    J


* Re: new netfront and occasional receive path lockup
  2010-09-10  1:45         ` Xu, Dongxiao
  2010-09-10  2:25           ` Jeremy Fitzhardinge
@ 2010-09-12  1:00           ` Gerald Turner
  2010-09-12  8:55             ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-12  1:00 UTC (permalink / raw)
  To: xen-devel

"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:

> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.
>
> However, I investigated the code, and indeed there is one race condition
> that probably causes the bug.  See the attached patch.
>
> Could anybody who is seeing this bug help to try it?  Much appreciated!
>

Hello, I experienced this problem with netfront and the smartpoll code
causing my domUs' bridge interfaces to fail.

I've been building a Xen server using Debian Squeeze, Xen 4.0.1-rc6.
For weeks the server had been running solid with just three domU's.  In
the last few days I significantly increased the number of domU's (13
total) and have been having terrible packet drop problems.  Randomly,
maybe after 10 to 60 minutes of uptime, a domU or two will fall victim
to bridge failure.  There's no syslog/dmesg output.  The only evidence
of the problem can be seen through network stats on dom0 (the domU
vifX.X interfaces have huge TX drops), and 'brctl showmacs' output is
missing the MAC addresses for the domU's that have failed.

I'm not doing anything interesting with networking.  eth0/peth0 on dom0
with static IP, vifX.0 on domU, no DHCP, no firewall rules (other than
fail2ban), static IP assigned within in each domU.

I'm using PV and the Debian -xen-amd64 flavor kernel in dom0 and all
domU's (no interest in HVM).

I've tried dozens of attempts to solve this:

  * Screwed with ethtool -K XXX tx off on dom0, domU, physical
    interface.

  * Removed 'network-bridge' setup from xend and setup 'br0' the Debian
    Way.

  * Commented out 'iptables_setup' from 'vif-bridge' script which was
    producing lots of iptables noise.

  * Use 'mac=' in domU vif config.

  * Tried latest vanilla 2.6.35.5 kernel (netfront driver is
    pre-smartpoll) - I didn't give this kernel enough time to break; I
    saw TX drops on boot and assumed the problem was still there, but my
    judgement was incorrect - all domU's get a few TX drops while the
    kernel boots (probably ARPs while vifX.X is up but before the domU
    ifup's its eth0 on boot).

Friday morning a fellow named 'Nrg_' on ##xen immediately diagnosed this
as possibly being related to the smartpoll bug in the netfront driver.

I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
confirmed the netfront driver is patched with an earlier version of the
smartpoll code.

I manually merged Debian's kernel with Jeremy's updates to the netfront
driver in his git repository.

  $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606

Deployed this new image on all domU's (except for two of them, as a
control group) and added the kernel parameter
xen_netfront.use_smartpoll=0 in grub.
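
(A sketch of how that looks with Squeeze's stock grub2 setup - the
exact file and mechanism may differ per install:)

  # /etc/default/grub inside the domU, then run update-grub:
  GRUB_CMDLINE_LINUX="xen_netfront.use_smartpoll=0"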

Problem solved.  Only the two domU's I left unpatched get victimized.
The rest of the hosts have been up for over a day and have not lost any
packets.

P.S. this is my first NNTP post thru gmane, I have no idea if it will
reach the list, keep Message-Id/References intact, and CC Christophe,
Jeremy, Dongxiao et al.



-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5


* Re: Re: new netfront and occasional receive path lockup
  2010-09-12  1:00           ` Gerald Turner
@ 2010-09-12  8:55             ` Jeremy Fitzhardinge
  2010-09-12 17:23               ` Pasi Kärkkäinen
  2010-09-12 22:40               ` Gerald Turner
  0 siblings, 2 replies; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-12  8:55 UTC (permalink / raw)
  To: Gerald Turner; +Cc: Xu, Dongxiao, xen-devel

 On 09/12/2010 11:00 AM, Gerald Turner wrote:
> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> confirmed the netfront driver is patched with an earlier version of
> the smartpoll code.
>
> I manually merged Debian's kernel with Jeremy's updates to the netfront
> driver in his git repository.
>
>   $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>
> Deployed this new image on all domU's (except for two of them, as a
> control group) and added the kernel parameter
> xen_netfront.use_smartpoll=0 in grub.

That's good to hear.  But I also included a fix from Dongxiao which, if
correct, means it should work with use_smartpoll=1 (or nothing, as
that's the default).  Could you verify whether the fix in
cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?

> Problem solved.  Only the two domU's I left unpatched get victimized.
> The rest of the hosts have been up for over a day and have not lost any
> packets.
>
> P.S. this is my first NNTP post thru gmane, I have no idea if it will
> reach the list, keep Message-Id/References intact, and CC Christophe,
> Jeremy, Dongxiao et al.

There were no cc:s.

Thanks,
    J


* Re: Re: new netfront and occasional receive path lockup
  2010-09-12  8:55             ` Jeremy Fitzhardinge
@ 2010-09-12 17:23               ` Pasi Kärkkäinen
  2010-09-12 22:40               ` Gerald Turner
  1 sibling, 0 replies; 31+ messages in thread
From: Pasi Kärkkäinen @ 2010-09-12 17:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, xen-devel, Gerald Turner

On Sun, Sep 12, 2010 at 06:55:48PM +1000, Jeremy Fitzhardinge wrote:
>  On 09/12/2010 11:00 AM, Gerald Turner wrote:
> > I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> > confirmed the netfront driver is patched with an earlier version of
> > the smartpoll code.
> >
> > I manually merged Debian's kernel with Jeremy's updates to the netfront
> > driver in his git repository.
> >
> >   $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
> >
> > Deployed this new image on all domU's (except for two of them, as a
> > control group) and updated grub kernel parameter with
> > xen_netfront.use_smartpoll=0.
> 
> That's good to hear.  But I also included a fix from Dongxiao which, if
> correct, means it should work with use_smartpoll=1 (or nothing, as
> that's the default).  Could you verify whether the fix in
> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
> 

It'd be good to get the fix(es) into xen/stable-2.6.32.x as well..

Or can you use "use_smartpoll=0" in the current xen/stable-2.6.32.x branch? 

-- Pasi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: new netfront and occasional receive path lockup
  2010-09-12  8:55             ` Jeremy Fitzhardinge
  2010-09-12 17:23               ` Pasi Kärkkäinen
@ 2010-09-12 22:40               ` Gerald Turner
  2010-09-13  0:03                 ` Gerald Turner
  1 sibling, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-12 22:40 UTC (permalink / raw)
  To: xen-devel

Jeremy Fitzhardinge <jeremy@goop.org> writes:

>  On 09/12/2010 11:00 AM, Gerald Turner wrote:
>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>> confirmed the netfront driver is patched with an earlier version of
>> the smartpoll code.
>>
>> I manually merged Debian's kernel with Jeremy's updates to the
>> netfront driver in his git repository.
>>
>>   $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>>
>> Deployed this new image on all domU's (except for two of them, as a
>> control group) and updated grub kernel parameter with
>> xen_netfront.use_smartpoll=0.
>
> That's good to hear.  But I also included a fix from Dongxiao which,
> if correct, means it should work with use_smartpoll=1 (or nothing, as
> that's the default).  Could you verify whether the fix in
> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>

I've been running with use_smartpoll=1 for a few hours this afternoon;
looks like Dongxiao's bugfix works.

-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-12 22:40               ` Gerald Turner
@ 2010-09-13  0:03                 ` Gerald Turner
  2010-09-13  0:54                   ` Xu, Dongxiao
  0 siblings, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-13  0:03 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, xen-devel

Gerald Turner <gturner@unzane.com> writes:

> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>>  On 09/12/2010 11:00 AM, Gerald Turner wrote:
>>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>>> confirmed the netfront driver is patched with an earlier version of
>>> the smartpoll code.
>>>
>>> I manually merged Debian's kernel with Jeremy's updates to the
>>> netfront driver in his git repository.
>>>
>>>   $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>>>
>>> Deployed this new image on all domU's (except for two of them, as a
>>> control group) and updated grub kernel parameter with
>>> xen_netfront.use_smartpoll=0.
>>
>> That's good to hear.  But I also included a fix from Dongxiao which,
>> if correct, means it should work with use_smartpoll=1 (or nothing, as
>> that's the default).  Could you verify whether the fix in
>> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>>
>
> I've been running with use_smartpoll=1 for a few hours this afternoon;
> looks like Dongxiao's bugfix works.
>

I spoke too soon!  use_smartpoll set to 1 still exhibits the
problem: a few domU's lost network after about 60 minutes of uptime.
Sorry for the bad news...

-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: Re: new netfront and occasional receive path lockup
  2010-09-13  0:03                 ` Gerald Turner
@ 2010-09-13  0:54                   ` Xu, Dongxiao
  2010-09-13  2:12                     ` Gerald Turner
  0 siblings, 1 reply; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-13  0:54 UTC (permalink / raw)
  To: Gerald Turner, Jeremy Fitzhardinge; +Cc: xen-devel

Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
> 
>> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>> 
>>>  On 09/12/2010 11:00 AM, Gerald Turner wrote:
>>>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>>>> confirmed the netfront driver is patched with an earlier version
>>>> of the smartpoll code.
>>>> 
>>>> I manually merged Debian's kernel with Jeremy's updates to the
>>>> netfront driver in his git repository.
>>>> 
>>>>   $ git diff
>>>> 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c
>>>> 8475f0c00e0606 
>>>> 
>>>> Deployed this new image on all domU's (except for two of them, as a
>>>> control group) and updated grub kernel parameter with
>>>> xen_netfront.use_smartpoll=0.
>>> 
>>> That's good to hear.  But I also included a fix from Dongxiao which,
>>> if correct, means it should work with use_smartpoll=1 (or nothing,
>>> as that's the default).  Could you verify whether the fix in
>>> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>>> 
>> 
>> I've been running with use_smartpoll=1 for a few hours this
>> afternoon; looks like Dongxiao's bugfix works.
>> 
> 
> I spoke too soon!  use_smartpoll set to 1 still exhibits the
> problem: a few domU's lost network after about 60 minutes of uptime.
> Sorry for the bad news...

Hi Gerald,

Sorry for the inconvenience. I will continue to look into it.

Does this bug only happen when you launch multiple domUs?
I tried a single domU and could not catch the bug.

Thanks,
Dongxiao

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13  0:54                   ` Xu, Dongxiao
@ 2010-09-13  2:12                     ` Gerald Turner
  2010-09-13  2:34                       ` Xu, Dongxiao
  0 siblings, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-13  2:12 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel

"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:

> Does this bug only happen when you launch multiple domUs?  I tried a
> single domU and could not catch the bug.
>

I've been working on this server for about two weeks; I hadn't noticed
the problem for the first week, when I only had 3 domUs.  It started
happening when I added 10 more domUs.  The problem would happen quickly,
within 10 minutes, always affecting at least two domUs at random, and
affect more domUs over time.

Saturday I installed the updated driver with Jeremy's use_smartpoll
parameter, ran for 24 hours with smartpoll disabled, no problems.

Today I've been trying with smartpoll enabled.  It took an hour to
affect two domUs - noticeably longer than in previous days, before
installing your patch.  I still have 9 other domUs running with
smartpoll enabled, four hours of uptime; I'm surprised they haven't been
affected yet.  Could there be another less-frequent race in
smart_poll_function?
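
To make the shape of race I'm wondering about concrete, here is a tiny
userspace model of the polling handshake as I understand it from this
thread.  Every name in it is an assumption for illustration; it is not
the driver code:

  /* Model: backend samples smartpoll_active, queues a packet, and only
   * notifies if the frontend was NOT polling; frontend, seeing no work
   * yet, stops polling.  If the backend's sample lands just before the
   * frontend's clear, the packet gets neither a poll nor an interrupt. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int smartpoll_active = 1;  /* shared ring flag */
  static atomic_int queued;                /* packet awaiting rx */
  static atomic_int notified;              /* event channel fired */

  static void *backend(void *arg)
  {
      int polling = atomic_load(&smartpoll_active);  /* sample first */
      atomic_store(&queued, 1);                      /* then queue */
      if (!polling)
          atomic_store(&notified, 1);  /* notify only if frontend idle */
      return arg;
  }

  int main(void)
  {
      pthread_t t;
      pthread_create(&t, NULL, backend, NULL);

      if (!atomic_load(&queued))               /* frontend sees no work */
          atomic_store(&smartpoll_active, 0);  /* ...and stops polling */

      pthread_join(t, NULL);
      if (queued && !notified && !smartpoll_active)
          puts("lost wakeup: packet queued, no irq, nobody polling");
      else
          puts("benign interleaving this run");
      return 0;
  }

A window that narrow would only be hit occasionally, which would fit
lockups that take minutes or hours to show up.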

--
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: Re: new netfront and occasional receive path lockup
  2010-09-13  2:12                     ` Gerald Turner
@ 2010-09-13  2:34                       ` Xu, Dongxiao
  2010-09-13  4:38                         ` Gerald Turner
  0 siblings, 1 reply; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-13  2:34 UTC (permalink / raw)
  To: Gerald Turner; +Cc: Jeremy Fitzhardinge, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]

Gerald Turner wrote:
> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
> 
>> Does this bug only happen when you launch multiple domUs?  I tried a
>> single domU and could not catch the bug.
>> 
> 
> I've been working on this server for about two weeks; I hadn't
> noticed the problem for the first week, when I only had 3 domUs.  It
> started happening when I added 10 more domUs.  The problem would
> happen quickly, within 10 minutes, always affecting at least two
> domUs at random, and affect more domUs over time.    
> 
> Saturday I installed the updated driver with Jeremy's use_smartpoll
> parameter, ran for 24 hours with smartpoll disabled, no problems. 
> 
> Today I've been trying with smartpoll enabled.  It took an hour to
> affect two domUs - noticeably longer than in previous days, before
> installing your patch.  I still have 9 other domUs running
> with smartpoll enabled, four hours of uptime; I'm surprised they haven't
> been affected yet.  Could there be another less-frequent race in
> smart_poll_function?     

Hi Gerald, 

Thanks for your detailed information.

Unfortunately I don't have a platform at hand that can launch more than 10 guests.

Here is another patch (see attached file) that fixes another potential race.

Do you have the bandwidth to give it a try? Thanks in advance!

Best Regards,
-- Dongxiao

[-- Attachment #2: 0001-Netfront-Fix-another-potential-race-condition.patch --]
[-- Type: application/octet-stream, Size: 1270 bytes --]

From af4cfa73d54e59686aad8bf1a5d6ec0223c3dd32 Mon Sep 17 00:00:00 2001
From: Dongxiao Xu <dongxiao.xu@intel.com>
Date: Mon, 13 Sep 2010 10:17:58 +0800
Subject: [PATCH] Netfront: Fix another potential race condition

When trying to start the next hrtimer from the current callback,
we should check its return value and do error handling.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
---
 drivers/net/xen-netfront.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 03e19b0..47f651e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1372,10 +1372,15 @@ static enum hrtimer_restart smart_poll_function(struct hrtimer *timer)
 		np->smart_poll.active = 0;
 	}
 
-	if (np->rx.sring->private.netif.smartpoll_active)
-		hrtimer_start(timer,
+	if (np->rx.sring->private.netif.smartpoll_active) {
+		if ( hrtimer_start(timer,
 			ktime_set(0, NANO_SECOND/psmart_poll->smart_poll_freq),
-			HRTIMER_MODE_REL);
+			HRTIMER_MODE_REL) ) {
+			printk(KERN_DEBUG "Failed to start hrtimer, "
+					"use interrupt mode for this packet\n");
+			np->rx.sring->private.netif.smartpoll_active = 0;
+		}
+	}
 
 end:
 	spin_unlock_irqrestore(&np->tx_lock, flags);
-- 
1.6.3


[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13  2:34                       ` Xu, Dongxiao
@ 2010-09-13  4:38                         ` Gerald Turner
  2010-09-13 16:01                           ` Gerald Turner
  0 siblings, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-13  4:38 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel

"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:

> Thanks for your detailed information.
>
> Unfortunately I don't have a platform at hand that can launch more
> than 10 guests.
>
> Here is another patch (see attached file) that fixes another potential
> race.
>
> Do you have the bandwidth to give it a try? Thanks in advance!
>

I built a kernel with your additional patch.

I have it running on all 13 domU's with use_smartpoll=1.

I'll report tomorrow morning whether there were any lockups.

FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
patch.

-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13  4:38                         ` Gerald Turner
@ 2010-09-13 16:01                           ` Gerald Turner
  2010-09-13 16:08                             ` Pasi Kärkkäinen
  2010-09-14  0:26                             ` Xu, Dongxiao
  0 siblings, 2 replies; 31+ messages in thread
From: Gerald Turner @ 2010-09-13 16:01 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel

Gerald Turner <gturner@unzane.com> writes:

> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>
>> Thanks for your detailed information.
>>
>> Unfortunately I don't have a platform at hand that can launch more
>> than 10 guests.
>>
>> Here is another patch (see attached file) that fixes another
>> potential race.
>>
>> Do you have the bandwidth to give it a try? Thanks in advance!
>>
>
> I built a kernel with your additional patch.
>
> I have it running on all 13 domU's with use_smartpoll=1.
>
> I'll report tomorrow morning whether there were any lockups.
>
> FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
> patch.
>

Sorry bad news again...

Had 5 lockups within 4 hours.

Then I restarted all domUs with use_smartpoll=0 and haven't had any
lockups in 7 hours.

-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13 16:01                           ` Gerald Turner
@ 2010-09-13 16:08                             ` Pasi Kärkkäinen
  2010-09-13 19:36                               ` Jeremy Fitzhardinge
  2010-09-14  0:26                             ` Xu, Dongxiao
  1 sibling, 1 reply; 31+ messages in thread
From: Pasi Kärkkäinen @ 2010-09-13 16:08 UTC (permalink / raw)
  To: Gerald Turner; +Cc: Xu, Dongxiao, xen-devel, Jeremy Fitzhardinge

On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
> 
> > "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
> >
> >> Thanks for your detailed information.
> >>
> >> Unfortunately I don't have a platform at hand that can launch more
> >> than 10 guests.
> >>
> >> Here is another patch (see attached file) that fixes another
> >> potential race.
> >>
> >> Do you have the bandwidth to give it a try? Thanks in advance!
> >>
> >
> > I built a kernel with your additional patch.
> >
> > I have it running on all 13 domU's with use_smartpoll=1.
> >
> > I'll report tomorrow morning whether there were any lockups.
> >
> > FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
> > patch.
> >
> 
> Sorry bad news again...
> 
> Had 5 lockups within 4 hours.
> 
> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> lockups in 7 hours.
>

I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
until this is sorted out..

-- Pasi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13 16:08                             ` Pasi Kärkkäinen
@ 2010-09-13 19:36                               ` Jeremy Fitzhardinge
  2010-09-14  8:25                                 ` Ian Campbell
  0 siblings, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-13 19:36 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: Xu, Dongxiao, xen-devel, Gerald Turner

 On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
>> Gerald Turner <gturner@unzane.com> writes:
>>
>>> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>>>
>>>> Thanks for your detailed information.
>>>>
>>>> Unfortunately I don't have a platform at hand that can launch more
>>>> than 10 guests.
>>>>
>>>> Here is another patch (see attached file) that fixes another
>>>> potential race.
>>>>
>>>> Do you have the bandwidth to give it a try? Thanks in advance!
>>>>
>>> I built a kernel with your additional patch.
>>>
>>> I have it running on all 13 domU's with use_smartpoll=1.
>>>
>>> I'll report tomorrow morning whether there were any lockups.
>>>
>>> FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
>>> patch.
>>>
>> Sorry bad news again...
>>
>> Had 5 lockups within 4 hours.
>>
>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
>> lockups in 7 hours.
>>
> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> until this is sorted out..

Agreed.

    J

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: Re: new netfront and occasional receive path lockup
  2010-09-13 16:01                           ` Gerald Turner
  2010-09-13 16:08                             ` Pasi Kärkkäinen
@ 2010-09-14  0:26                             ` Xu, Dongxiao
  1 sibling, 0 replies; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-14  0:26 UTC (permalink / raw)
  To: Gerald Turner; +Cc: Jeremy Fitzhardinge, xen-devel

Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
> 
>> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>> 
>>> Thanks for your detailed information.
>>>
>>> Unfortunately I don't have a platform at hand that can launch more
>>> than 10 guests.
>>>
>>> Here is another patch (see attached file) that fixes another
>>> potential race.
>>>
>>> Do you have the bandwidth to give it a try? Thanks in advance!
>>> 
>> 
>> I built a kernel with your additional patch.
>> 
>> I have it running on all 13 domU's with use_smartpoll=1.
>> 
>> I'll report tomorrow morning whether there were any lockups.
>> 
>> FYI, total today I had 6 lockups with use_smartpoll=1 and the
>> previous patch. 
>> 
> 
> Sorry bad news again...
> 
> Had 5 lockups within 4 hours.
> 
> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> lockups in 7 hours. 

Thanks Gerald.
I will try to find a local environment to do more investigation.

Best Regards,
-- Dongxiao

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13 19:36                               ` Jeremy Fitzhardinge
@ 2010-09-14  8:25                                 ` Ian Campbell
  2010-09-14 17:54                                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Campbell @ 2010-09-14  8:25 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, xen-devel, Gerald Turner

On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> > On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> >> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> >> lockups in 7 hours.
> >>
> > I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> > until this is sorted out..
> 
> Agreed.

Should we also consider adding a netback option to disable it for the
system as a whole? Or are the issues strictly in-guest only?

Perhaps netback should support a xenstore key to allow a toolstack to
configure this property per guest?
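
Something like the sketch below on the netback side, perhaps -- purely
illustrative, and both the key name "feature-smart-poll" and the helper
are made up here, not an existing interface:

  /* Sketch: netback consulting a hypothetical per-vif xenstore key. */
  #include <xen/xenbus.h>

  static int smartpoll_enabled(struct xenbus_device *dev)
  {
  	int val;

  	/* toolstack would write <backend-path>/feature-smart-poll = 0|1 */
  	if (xenbus_scanf(XBT_NIL, dev->nodename,
  			 "feature-smart-poll", "%d", &val) != 1)
  		val = 1;	/* key absent: keep the current default */
  	return val;
  }

Netback could check that at connect time and simply refuse to enter
smart-poll mode for a vif when it reads 0.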

Ian.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-14  8:25                                 ` Ian Campbell
@ 2010-09-14 17:54                                   ` Jeremy Fitzhardinge
  2010-09-14 18:44                                     ` Pasi Kärkkäinen
  0 siblings, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-14 17:54 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Xu, Dongxiao, xen-devel, Gerald Turner

 On 09/14/2010 01:25 AM, Ian Campbell wrote:
> On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
>> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
>>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
>>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
>>>> lockups in 7 hours.
>>>>
>>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
>>> until this is sorted out..
>> Agreed.
> Should we also consider adding a netback option to disable it for the
> system as a whole? Or are the issues strictly in-guest only?
>
> Perhaps netback should support a xenstore key to allow a toolstack to
> configure this property per guest?

It depends on what the problem is.  If there's a basic problem with the
smartpoll front<->back communication protocol then we'll probably have
to revert the whole thing and start over.  If the bug is just something
in the frontend then we can disable it there until resolved.

Fortunately I haven't pushed netfront smartpoll support upstream yet, so
the userbase is still fairly limited.  I hope.

    J

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-14 17:54                                   ` Jeremy Fitzhardinge
@ 2010-09-14 18:44                                     ` Pasi Kärkkäinen
  2010-09-15  9:46                                       ` Ian Campbell
  0 siblings, 1 reply; 31+ messages in thread
From: Pasi Kärkkäinen @ 2010-09-14 18:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, xen-devel, Ian Campbell, Gerald Turner

On Tue, Sep 14, 2010 at 10:54:27AM -0700, Jeremy Fitzhardinge wrote:
>  On 09/14/2010 01:25 AM, Ian Campbell wrote:
> > On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> >> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> >>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> >>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> >>>> lockups in 7 hours.
> >>>>
> >>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> >>> until this is sorted out..
> >> Agreed.
> > Should we also consider adding a netback option to disable it for the
> > system as a whole? Or are the issues strictly in-guest only?
> >
> > Perhaps netback should support a xenstore key to allow a toolstack to
> > configure this property per guest?
> 
> It depends on what the problem is.  If there's a basic problem with the
> smartpoll front<->back communication protocol then we'll probably have
> to revert the whole thing and start over.  If the bug is just something
> in the frontend then we can disable it there until resolved.
> 
> Fortunately I haven't pushed netfront smartpoll support upstream yet, so
> the userbase is still fairly limited.  I hope.
> 

There have been quite a few people on ##xen on irc complaining about it..

I think the smartpoll code has ended up in the Debian Squeeze 2.6.32-5-xen kernel..
Hopefully they'll pull the "Revert "xen/netfront: default smartpoll to on"" soon..

-- Pasi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-14 18:44                                     ` Pasi Kärkkäinen
@ 2010-09-15  9:46                                       ` Ian Campbell
  0 siblings, 0 replies; 31+ messages in thread
From: Ian Campbell @ 2010-09-15  9:46 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Jeremy Fitzhardinge, xen-devel, Xu, Dongxiao, Gerald Turner

On Tue, 2010-09-14 at 19:44 +0100, Pasi Kärkkäinen wrote:
> On Tue, Sep 14, 2010 at 10:54:27AM -0700, Jeremy Fitzhardinge wrote:
> >  On 09/14/2010 01:25 AM, Ian Campbell wrote:
> > > On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> > >> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> > >>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> > >>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> > >>>> lockups in 7 hours.
> > >>>>
> > >>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> > >>> until this is sorted out..
> > >> Agreed.
> > > Should we also consider adding a netback option to disable it for the
> > > system as a whole? Or are the issues strictly in-guest only?
> > >
> > > Perhaps netback should support a xenstore key to allow a toolstack to
> > > configure this property per guest?
> > 
> > It depends on what the problem is.  If there's a basic problem with the
> > smartpoll front<->back communication protocol then we'll probably have
> > to revert the whole thing and start over.  If the bug is just something
> > in the frontend then we can disable it there until resolved.
> > 
> > Fortunately I haven't pushed netfront smartpoll support upstream yet, so
> > the userbase is still fairly limited.  I hope.
> > 
> 
> There has been quite a few people on ##xen on irc complaining about it..
> 
> I think the smartpoll code has ended up in Debian Squeeze 2.6.32-5-xen kernel..
> Hopefully they'll pull the "Revert "xen/netfront: default smartpoll to on"" soon..

I've suggested it on debian-kernel.

Ian.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2010-09-15  9:46 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-22 16:43 new netfront and occasional receive path lockup Christophe Saout
2010-08-22 18:37 ` Christophe Saout
2010-08-24  0:53   ` Jeremy Fitzhardinge
2010-08-23 14:26 ` Christophe Saout
2010-08-23 16:04   ` Konrad Rzeszutek Wilk
2010-08-23 17:09     ` Christophe Saout
2010-08-24  0:46 ` Jeremy Fitzhardinge
2010-08-25  0:51   ` Xu, Dongxiao
2010-09-09 18:50     ` Pasi Kärkkäinen
2010-09-10  0:55       ` Jeremy Fitzhardinge
2010-09-10  1:45         ` Xu, Dongxiao
2010-09-10  2:25           ` Jeremy Fitzhardinge
2010-09-10  2:37             ` Xu, Dongxiao
2010-09-10  2:42               ` Jeremy Fitzhardinge
2010-09-12  1:00           ` Gerald Turner
2010-09-12  8:55             ` Jeremy Fitzhardinge
2010-09-12 17:23               ` Pasi Kärkkäinen
2010-09-12 22:40               ` Gerald Turner
2010-09-13  0:03                 ` Gerald Turner
2010-09-13  0:54                   ` Xu, Dongxiao
2010-09-13  2:12                     ` Gerald Turner
2010-09-13  2:34                       ` Xu, Dongxiao
2010-09-13  4:38                         ` Gerald Turner
2010-09-13 16:01                           ` Gerald Turner
2010-09-13 16:08                             ` Pasi Kärkkäinen
2010-09-13 19:36                               ` Jeremy Fitzhardinge
2010-09-14  8:25                                 ` Ian Campbell
2010-09-14 17:54                                   ` Jeremy Fitzhardinge
2010-09-14 18:44                                     ` Pasi Kärkkäinen
2010-09-15  9:46                                       ` Ian Campbell
2010-09-14  0:26                             ` Xu, Dongxiao
