* new netfront and occasional receive path lockup
@ 2010-08-22 16:43 Christophe Saout
  2010-08-22 18:37 ` Christophe Saout
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Christophe Saout @ 2010-08-22 16:43 UTC (permalink / raw)
  To: xen-devel

Hi,

I've been playing with some of the new pvops code, namely the DomU guest
code.  What I've been observing on one of the virtual machines is that
the network (vif) is dying after about ten to sixty minutes of uptime.
The unfortunate thing here is that I can only reproduce it on a
production VM and have been unlucky so far in triggering the bug on a
test machine.  While this has not been tragic - rebooting fixed the
issue - unfortunately I can't spend very much time on debugging after
the issue pops up.

Now, what is happening is that the receive path goes dead.  The DomU can
send packets to Dom0 and those are visible using tcpdump on the Dom0 on
the virtual interface, but not the other way around.

Now, I have done more than one change at a time (I'd like to avoid
having to pin it down, since I can only reproduce it on a production
machine, as I said, so suggestions are welcome), but my suspicion is
that it might have to do with the new "smart polling" feature in
xen/netfront.  Note that I have also updated Dom0 to pull in the latest
dom0/backend and netback changes, just to make sure it's not due to an
issue that has already been fixed there, but I'm still seeing the same
behaviour.

The production machine doesn't have much network load, but deals with a
lot of small network requests (mostly DNS and SMTP) - a workload which
is hard to reproduce on the test machine.  Heavy network load (NFS, FTP
and so on) for days hasn't triggered the problem.  Also, segmentation
offloading and similar settings don't have any effect.

The machine has 2 physical CPUs, the VM has 2 virtual CPUs, and the DomU
kernel has PREEMPT enabled.

I've been looking at the code to see if there might be a race condition
somewhere - something like a situation where the hrtimer doesn't run
while Dom0 believes the DomU should be polling and therefore doesn't
emit an interrupt - but I'm afraid I don't know enough to judge this (I
mean, there are spinlocks, and those look safe to me).

Do you have any suggestions on what to try?  I can trigger the issue on
the production VM again, but debugging must take no more than a few
minutes once it happens.  Access is only possible via the console.
Neither Dom0 nor the guest shows anything unusual in the kernel
messages, and both continue to behave normally after the network goes
dead (I'm also able to shut down the guest normally).

Thanks,
	Christophe


* Re: new netfront and occasional receive path lockup
  2010-08-22 16:43 new netfront and occasional receive path lockup Christophe Saout
@ 2010-08-22 18:37 ` Christophe Saout
  2010-08-24  0:53   ` Jeremy Fitzhardinge
  2010-08-23 14:26 ` Christophe Saout
  2010-08-24  0:46 ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 31+ messages in thread
From: Christophe Saout @ 2010-08-22 18:37 UTC (permalink / raw)
  To: xen-devel

Hi again,

> I've been looking at the code to see if there might be a race condition
> somewhere - something like a situation where the hrtimer doesn't run
> while Dom0 believes the DomU should be polling and therefore doesn't
> emit an interrupt - but I'm afraid I don't know enough to judge this (I
> mean, there are spinlocks, and those look safe to me).

Hmm, looking a bit more.

rx.sring->private.netif.smartpoll_active lies in a piece of memory that
is shared between netback and netfront, is that right?

If that is so, the tx spinlock in netfront only protects against
simultaneous modifications from another thread in netfront, so netback
can read smartpoll_active while netfront is fiddling with it.  Is that
safe?

Note that when the lockup occurs, /proc/interrupts in the guest doesn't
show any interrupts arriving for eth0 anymore.  Are there any conditions
where netback waits for netfront to retrieve packets even when new
packets arrive? (like e.g. when the ring is full and there is backlog
into the network stack or something?)  Any way to debug this from the
Dom0 side?  Like looking into the state of the ring from userspace?
Debug options?

	Christophe


* Re: new netfront and occasional receive path lockup
  2010-08-22 16:43 new netfront and occasional receive path lockup Christophe Saout
  2010-08-22 18:37 ` Christophe Saout
@ 2010-08-23 14:26 ` Christophe Saout
  2010-08-23 16:04   ` Konrad Rzeszutek Wilk
  2010-08-24  0:46 ` Jeremy Fitzhardinge
  2 siblings, 1 reply; 31+ messages in thread
From: Christophe Saout @ 2010-08-23 14:26 UTC (permalink / raw)
  To: xen-devel

Hi yet again,

[not quoting everything again]

I finally managed to trigger the issue on the test VM, which has been
stuck in that state since last night and can be inspected.  Apparently
the tx ring on the netback side is full, since every packet sent is
immediately dropped (as seen from ifconfig output).  No interrupts
arriving in the guest.

Still, I'm wondering what would be the best course of action to debug
this now.  Should I have compiled some debugger support into the
hypervisor?  (gdbsx apparently needs that)

Thanks,
	Christophe


* Re: new netfront and occasional receive path lockup
  2010-08-23 14:26 ` Christophe Saout
@ 2010-08-23 16:04   ` Konrad Rzeszutek Wilk
  2010-08-23 17:09     ` Christophe Saout
  0 siblings, 1 reply; 31+ messages in thread
From: Konrad Rzeszutek Wilk @ 2010-08-23 16:04 UTC (permalink / raw)
  To: Christophe Saout; +Cc: xen-devel

On Mon, Aug 23, 2010 at 04:26:52PM +0200, Christophe Saout wrote:
> Hi yet again,
> 
> [not quoting everything again]
> 
> I finally managed to trigger the issue on the test VM, which has been
> stuck in that state since last night and can be inspected.  Apparently
> the tx ring on the netback side is full, since every packet sent is
> immediately dropped (as seen from ifconfig output).  No interrupts
> arriving in the guest.

What is the kernel and hypervisor in Dom0? And what is it in DomU?

> 
> Still, I'm wondering what would be the best course of action to debug
> this now.  Should I have compiled some debugger support into the
> hypervisor?  (gdbsx apparently needs that)

Sure.  An easier path might be to do 'xm debug-keys q', which should
trigger the debug IRQ handler.  In DomU that should print out all of
the event channel bits, which we can analyze to see if the proper bits
are not set (and hence the IRQ handler isn't picking up from the ring
buffer).
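
For reference, a rough sketch of how that looks from the Dom0 command
line - the debug-key output lands in the Xen console ring rather than
on stdout:

  # Ask the hypervisor to dump domain/event-channel state ('q' is
  # the debug key referred to above):
  xm debug-keys q

  # Read the result back from the hypervisor console buffer:
  xm dmesg | tail -n 100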


* Re: new netfront and occasional receive path lockup
  2010-08-23 16:04   ` Konrad Rzeszutek Wilk
@ 2010-08-23 17:09     ` Christophe Saout
  0 siblings, 0 replies; 31+ messages in thread
From: Christophe Saout @ 2010-08-23 17:09 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: xen-devel

Hi Konrad,

> > I finally managed to trigger the issue on the test VM, which has been
> > stuck in that state since last night and can be inspected.  Apparently
> > the tx ring on the netback side is full, since every packet sent is
> > immediately dropped (as seen from ifconfig output).  No interrupts
> > arriving in the guest.
> 
> What is the kernel and hypervisor in Dom0? And what is it in DomU?

The hypervisor is from the Xen 4.0.0 release and the Dom0 kernel is from
Jeremy's 2.6.32 stable branch for pvops Dom0 (lately with the
xen/dom0/backend branches merged on top, because I hoped there might be
some fixes that help).  The same kernel has been working fine as a
guest, but my newer one - an upstream 2.6.35 with some of the upstream
fixes branches applied and xen/netfront pulled in - is now causing this
issue.  Everything else is working just fine, so I am pretty sure it is
related to a netfront-specific change and not to anything else.

> > hypervisor? (gdbsx apparently needs that)
> 
> Sure.

Also, I noticed that "gdb /path/to/vmlinux /proc/kcore" does allow me to
inspect the memory.  I'll try to see if I can pinpoint some of the
interesting memory locations.
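
For illustration, the sort of inspection that makes possible - the
struct and field names below come from the smartpoll patch attached
later in this thread, and NP_ADDR is a hypothetical placeholder for the
address of the device's netfront_info:

  gdb /path/to/vmlinux /proc/kcore
  (gdb) print/x ((struct netfront_info *) NP_ADDR)->rx.sring->private.netif.smartpoll_active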

> An easier path might be to do 'xm debug-keys q' which should
> trigger the debug irq handler. In DomU that should print out all of the
> event channel bits which we can analyze that and see if the
> proper bits are not set (and hence the IRQ handler isn't picking up
> from the ring buffer).

I'm not exactly sure how to read the output of that.
http://www.saout.de/assets/xm-debug-q.txt

	Christophe


* Re: new netfront and occasional receive path lockup
  2010-08-22 16:43 new netfront and occasional receive path lockup Christophe Saout
  2010-08-22 18:37 ` Christophe Saout
  2010-08-23 14:26 ` Christophe Saout
@ 2010-08-24  0:46 ` Jeremy Fitzhardinge
  2010-08-25  0:51   ` Xu, Dongxiao
  2 siblings, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-08-24  0:46 UTC (permalink / raw)
  To: Christophe Saout; +Cc: Xu, Dongxiao, xen-devel

 On 08/22/2010 09:43 AM, Christophe Saout wrote:
> Hi,
>
> I've been playing with some of the new pvops code, namely the DomU guest
> code.  What I've been observing on one of the virtual machines is that
> the network (vif) is dying after about ten to sixty minutes of uptime.
> The unfortunate thing here is that I can only reproduce it on a
> production VM and have been unlucky so far in triggering the bug on a
> test machine.  While this has not been tragic - rebooting fixed the
> issue - unfortunately I can't spend very much time on debugging after
> the issue pops up.

Ah, OK.  I've seen this a couple of times as well.  And it just happened
to me then...


> Now, what is happening is that the receive path goes dead.  The DomU can
> send packets to Dom0 and those are visible using tcpdump on the Dom0 on
> the virtual interface, but not the other way around.

I hadn't got to that level of diagnosis, but I can confirm that that's
what seems to be happening here too.

> Now, I have done more than one change at a time (I'd like to avoid
> having to pin it down, since I can only reproduce it on a production
> machine, as I said, so suggestions are welcome), but my suspicion is
> that it might have to do with the new "smart polling" feature in
> xen/netfront.  Note that I have also updated Dom0 to pull in the latest
> dom0/backend and netback changes, just to make sure it's not due to an
> issue that has already been fixed there, but I'm still seeing the same
> behaviour.

I agree.  I think I started seeing this once I merged smartpoll into
netfront.

    J


* Re: new netfront and occasional receive path lockup
  2010-08-22 18:37 ` Christophe Saout
@ 2010-08-24  0:53   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-08-24  0:53 UTC (permalink / raw)
  To: Christophe Saout; +Cc: Xu, Dongxiao, xen-devel

 On 08/22/2010 11:37 AM, Christophe Saout wrote:
> Hmm, looking a bit more.
>
> rx.sring->private.netif.smartpoll_active lies in a piece of memory that
> is shared between netback and netfront, is that right?
>
> If that is so, the tx spinlock in netfront only protects against
> simultaneous modifications from another thread in netfront, so netback
> can read smartpoll_active while netfront is fiddling with it.  Is that
> safe?

It depends on exactly how it is used.  But any use of cross-CPU shared
memory must carefully consider access ordering, and possibly needs
explicit barriers to make sure that the expected ordering is actually
seen by all CPUs.
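
As a minimal sketch of the discipline meant here (a userspace C11
analogue; the kernel of that era would use its own wmb()/rmb()-style
barriers instead, and the helper names are hypothetical):

  #include <stdatomic.h>
  #include <stdbool.h>

  /* Stand-in for a flag living in memory shared between two domains. */
  struct shared_flag_example {
      _Atomic int smartpoll_active;
  };

  /* Publisher: release ordering makes all earlier writes (e.g. ring
   * updates) visible to the peer before the flag itself changes. */
  static void set_polling(struct shared_flag_example *s, bool on)
  {
      atomic_store_explicit(&s->smartpoll_active, on,
                            memory_order_release);
  }

  /* Consumer: acquire ordering ensures that, once the flag is seen,
   * the reads that follow also see the publisher's earlier writes. */
  static bool peer_polling(struct shared_flag_example *s)
  {
      return atomic_load_explicit(&s->smartpoll_active,
                                  memory_order_acquire);
  }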

    J


* RE: new netfront and occasional receive path lockup
  2010-08-24  0:46 ` Jeremy Fitzhardinge
@ 2010-08-25  0:51   ` Xu, Dongxiao
  2010-09-09 18:50     ` Pasi Kärkkäinen
  0 siblings, 1 reply; 31+ messages in thread
From: Xu, Dongxiao @ 2010-08-25  0:51 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Christophe Saout; +Cc: xen-devel

Hi Christophe,

Thanks for finding and checking the problem.
I will try to reproduce the issue and check what caused the problem.

Thanks,
Dongxiao


* Re: new netfront and occasional receive path lockup
  2010-08-25  0:51   ` Xu, Dongxiao
@ 2010-09-09 18:50     ` Pasi Kärkkäinen
  2010-09-10  0:55       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 31+ messages in thread
From: Pasi Kärkkäinen @ 2010-09-09 18:50 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel, Christophe Saout

On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
> Hi Christophe,
> 
> Thanks for finding and checking the problem.
> I will try to reproduce the issue and check what caused the problem.
> 

Hello,

Was this issue resolved?  Some users have been complaining about
"network freezing up" issues recently on ##xen on IRC.

-- Pasi


* Re: new netfront and occasional receive path lockup
  2010-09-09 18:50     ` Pasi Kärkkäinen
@ 2010-09-10  0:55       ` Jeremy Fitzhardinge
  2010-09-10  1:45         ` Xu, Dongxiao
  0 siblings, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-10  0:55 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: Xu, Dongxiao, xen-devel, Christophe Saout

 On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>> Hi Christophe,
>>
>> Thanks for finding and checking the problem.
>> I will try to reproduce the issue and check what caused the problem.
>>
> Hello,
>
> Was this issue resolved?  Some users have been complaining about
> "network freezing up" issues recently on ##xen on IRC.

Yeah, I'll add a command-line parameter to disable smartpoll (and leave
it off by default).

    J


* RE: new netfront and occasional receive path lockup
  2010-09-10  0:55       ` Jeremy Fitzhardinge
@ 2010-09-10  1:45         ` Xu, Dongxiao
  2010-09-10  2:25           ` Jeremy Fitzhardinge
  2010-09-12  1:00           ` Gerald Turner
  0 siblings, 2 replies; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-10  1:45 UTC (permalink / raw)
  To: Jeremy Fitzhardinge, Pasi Kärkkäinen
  Cc: xen-devel, Christophe Saout

[-- Attachment #1: Type: text/plain, Size: 4434 bytes --]

Hi Jeremy and Pasi,

I was frustrated that I couldn't reproduce this bug at my site.

However, I investigated the code, and indeed there is one race condition
that probably causes the bug.  See the attached patch.

Could anybody who is seeing this bug help to try it?  Much appreciated!

Thanks,
Dongxiao



[-- Attachment #2: 0001-Fix-one-race-condition-for-netfront-smartpoll-logic.patch --]
[-- Type: application/octet-stream, Size: 1694 bytes --]

From 4304521d61573332033e3799e28f6ffb12a0654a Mon Sep 17 00:00:00 2001
From: Dongxiao Xu <dongxiao.xu@intel.com>
Date: Fri, 10 Sep 2010 09:00:54 +0800
Subject: [PATCH] Fix one race condition for netfront smartpoll logic

Consider the following case: netfront's poll does not find any data,
so it clears the shared flag to indicate that it is not polling.
However, at this moment (netfront has cleared the flag but is still
in the hrtimer callback), netback triggers an interrupt to netfront,
whose interrupt handler tries to start the timer; the start fails
since the timer is still alive.

Add logic so that, if starting the new hrtimer fails, netfront clears
the shared flag to indicate that it is not polling.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
---
 drivers/net/xen-netfront.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index e894dd2..03e19b0 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1397,10 +1397,15 @@ static irqreturn_t xennet_interrupt(int irq, void *dev_id)
 			napi_schedule(&np->napi);
 	}
 
-	if (np->smart_poll.feature_smart_poll)
-		hrtimer_start(&np->smart_poll.timer,
-			ktime_set(0, NANO_SECOND/np->smart_poll.smart_poll_freq),
-			HRTIMER_MODE_REL);
+	if (np->smart_poll.feature_smart_poll) {
+		if ( hrtimer_start(&np->smart_poll.timer,
+			ktime_set(0,NANO_SECOND/np->smart_poll.smart_poll_freq),
+			HRTIMER_MODE_REL) ) {
+			printk(KERN_DEBUG "Failed to start hrtimer,"
+					"use interrupt mode for this packet\n");
+			np->rx.sring->private.netif.smartpoll_active = 0;
+		}
+	}
 
 	spin_unlock_irqrestore(&np->tx_lock, flags);
 
-- 
1.6.3



* Re: new netfront and occasional receive path lockup
  2010-09-10  1:45         ` Xu, Dongxiao
@ 2010-09-10  2:25           ` Jeremy Fitzhardinge
  2010-09-10  2:37             ` Xu, Dongxiao
  2010-09-12  1:00           ` Gerald Turner
  1 sibling, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-10  2:25 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: xen-devel, Christophe Saout

 On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.

Perhaps you have been trying to reproduce it in the wrong conditions?  I
have generally seen this bug when the networking is under very light
load, such as a couple of fairly idle dom0<->domU ssh connections.  I'm
not sure that I've seen it under heavy load.

> However, I investigated the code, and indeed there is one race
> condition that probably causes the bug.  See the attached patch.
>
> Could anybody who is seeing this bug help to try it?  Much appreciated!

Thanks for looking into this.  Your logic seems reasonable, so I'll
apply it (however I also added a patch to make smartpoll default to
"off"; I guess I can switch that to default on again to make sure it
gets tested, but leave the option as a workaround if there are still
problems).

However, I am concerned about these manipulations of a cross-CPU shared
variable without any barriers or other ordering constraints.  Are you
sure this code is correct under any reordering (either by the compiler
or CPUs), and if the compiler decides to access it more or less often
than the source says it should?

Thanks,
    J


* RE: new netfront and occasional receive path lockup
  2010-09-10  2:25           ` Jeremy Fitzhardinge
@ 2010-09-10  2:37             ` Xu, Dongxiao
  2010-09-10  2:42               ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-10  2:37 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: xen-devel, Christophe Saout

Jeremy Fitzhardinge wrote:
>  On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
>> Hi Jeremy and Pasi,
>> 
>> I was frustrated that I couldn't reproduce this bug at my site.
> 
> Perhaps you have been trying to reproduce it in the wrong conditions?
> I have generally seen this bug when the networking is under very
> light load, such as a couple of fairly idle dom0<->domU ssh
> connections.  I'm not sure that I've seen it under heavy load.   
> 
>> However, I investigated the code, and indeed there is one race
>> condition that probably causes the bug.  See the attached patch.
>> 
>> Could anybody who is seeing this bug help to try it?  Much appreciated!
> 
> Thanks for looking into this.  Your logic seems reasonable, so I'll
> apply it (however I also added a patch to make smartpoll default to
> "off"; I guess I can switch that to default on again to make sure it
> gets tested, but leave the option as a workaround if there are still
> problems).    
> 
> However, I am concerned about these manipulations of a cross-CPU
> shared variable without any barriers or other ordering constraints.
> Are you sure this code is correct under any reordering (either by the
> compiler or CPUs), and if the compiler decides to access it more or
> less often than the source says it should?

Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
It is a flag in the shared ring structure, so operations on this flag
are handled the same way as the other components of the shared ring,
e.g. under the spinlock, etc.

I will leave dom0 and domU ssh connections open for some time to see if
the bug still shows up.

Thanks,
Dongxiao


* Re: new netfront and occasional receive path lockup
  2010-09-10  2:37             ` Xu, Dongxiao
@ 2010-09-10  2:42               ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-10  2:42 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: xen-devel, Christophe Saout

 On 09/10/2010 12:37 PM, Xu, Dongxiao wrote:
>> However, I am concerned about these manipulations of a cross-CPU
>> shared variable without any barriers or other ordering constraints.
>> Are you sure this code is correct under any reordering (either by the
>> compiler or CPUs), and if the compiler decides to access it more or
>> less often than the source says it should?
> Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
> It is a flag in the shared ring structure, so operations on this flag
> are handled the same way as the other components of the shared ring,
> e.g. under the spinlock, etc.

Spinlocks are no use for inter-domain synchronization, only within a
domain.  The other ring operations are carefully ordered with
appropriate memory barriers in specific places; that's why I'm a bit
concerned about their absence for the smartpoll_active flag.  Even if
they are not necessary, I'd like to see an analysis as to why.
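
For reference, this is roughly the producer-side pattern from Xen's
io/ring.h that the "carefully ordered" remark refers to (paraphrased
from memory, so treat the details as approximate):

  /* Publish new requests and decide whether the peer needs an event.
   * The wmb() makes the ring entries visible before the producer
   * index; the mb() orders the index update against the req_event
   * check that decides whether to notify. */
  #define RING_PUSH_REQUESTS_AND_CHECK_NOTIFY(_r, _notify) do {        \
      RING_IDX __old = (_r)->sring->req_prod;                          \
      RING_IDX __new = (_r)->req_prod_pvt;                             \
      wmb(); /* peer sees requests /before/ updated producer index */  \
      (_r)->sring->req_prod = __new;                                   \
      mb();  /* peer sees new requests /before/ we check req_event */  \
      (_notify) = ((RING_IDX)(__new - (_r)->sring->req_event) <        \
                   (RING_IDX)(__new - __old));                         \
  } while (0)

Nothing comparable guards smartpoll_active, which is the concern here.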

    J


* Re: new netfront and occasional receive path lockup
  2010-09-10  1:45         ` Xu, Dongxiao
  2010-09-10  2:25           ` Jeremy Fitzhardinge
@ 2010-09-12  1:00           ` Gerald Turner
  2010-09-12  8:55             ` Jeremy Fitzhardinge
  1 sibling, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-12  1:00 UTC (permalink / raw)
  To: xen-devel

"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:

> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.
>
> However, I investigated the code, and indeed there is one race condition
> that probably causes the bug.  See the attached patch.
>
> Could anybody who is seeing this bug help to try it?  Much appreciated!
>

Hello, I experienced this problem with netfront and the smartpoll code
causing my domUs' bridge interfaces to fail.

I've been building a Xen server using Debian Squeeze, Xen 4.0.1-rc6.
For weeks the server had been running solid with just three domU's.  In
the last few days I significantly increased the number of domU's (13
total) and have been having terrible packet drop problems.  Randomly,
maybe after 10 to 60 minutes of uptime, a domU or two will fall victim
to bridge failure.  There's no syslog/dmesg output.  The only evidence
of the problem can be seen through network stats on dom0 (the domU
vifX.X interfaces have huge TX drops), and 'brctl showmacs' output is
missing the MAC addresses for the domU's that have failed.

I'm not doing anything interesting with networking.  eth0/peth0 on dom0
with static IP, vifX.0 on domU, no DHCP, no firewall rules (other than
fail2ban), static IP assigned within in each domU.

I'm using PV and the Debian -xen-amd64 flavor kernel in dom0 and all
domU's (no interest in HVM).

I've tried dozens of attempts to solve this:

  * Screwed with ethtool -K XXX tx off on dom0, domU, physical
    interface.

  * Removed 'network-bridge' setup from xend and setup 'br0' the Debian
    Way.

  * Commented out 'iptables_setup' from 'vif-bridge' script which was
    producing lots of iptables noise.

  * Use 'mac=' in domU vif config.

  * Tried latest vanilla 2.6.35.5 kernel (netfront driver is
    pre-smartpoll) - I didn't give this kernel enough time to break; I
    saw TX drops on boot and assumed the problem was still there, but my
    judgement was incorrect - all domU's get a few TX drops while the
    kernel boots (probably ARPs while vifX.X is up but before the domU
    ifup's its eth0 on boot).

Friday morning a fellow named 'Nrg_' on ##xen immediately diagnosed this
as possibly being related to the smartpoll bug in the netfront driver.

I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
confirmed the netfront driver is patched with an earlier version of the
smartpoll code.

I manually merged Debian's kernel with Jeremy's updates to the netfront
driver in his git repository.

  $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606

Deployed this new image on all domU's (except for two of them, as a
control group) and added the kernel parameter
xen_netfront.use_smartpoll=0 in grub.
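
(A sketch of how that looks with Squeeze's stock grub2 setup - the
exact file and mechanism may differ per install:)

  # /etc/default/grub inside the domU, then run update-grub:
  GRUB_CMDLINE_LINUX="xen_netfront.use_smartpoll=0"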

Problem solved.  Only the two domU's I left unpatched get victimized.
The rest of the hosts have been up for over a day and have not lost any
packets.

P.S. this is my first NNTP post thru gmane, I have no idea if it will
reach the list, keep Message-Id/References intact, and CC Christophe,
Jeremy, Dongxiao et al.



-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5


* Re: Re: new netfront and occasional receive path lockup
  2010-09-12  1:00           ` Gerald Turner
@ 2010-09-12  8:55             ` Jeremy Fitzhardinge
  2010-09-12 17:23               ` Pasi Kärkkäinen
  2010-09-12 22:40               ` Gerald Turner
  0 siblings, 2 replies; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-12  8:55 UTC (permalink / raw)
  To: Gerald Turner; +Cc: Xu, Dongxiao, xen-devel

 On 09/12/2010 11:00 AM, Gerald Turner wrote:
> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> confirmed the netfront driver is patched with an earlier version of
> the smartpoll code.
>
> I manually merged Debian's kernel with Jeremy's updates to the netfront
> driver in his git repository.
>
>   $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>
> Deployed this new image on all domU's (except for two of them, as a
> control group) and added the kernel parameter
> xen_netfront.use_smartpoll=0 in grub.

That's good to hear.  But I also included a fix from Dongxiao which, if
correct, means it should work with use_smartpoll=1 (or nothing, as
that's the default).  Could you verify whether the fix in
cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?

> Problem solved.  Only the two domU's I left unpatched get victimized.
> The rest of the hosts have been up for over a day and have not lost any
> packets.
>
> P.S. this is my first NNTP post thru gmane, I have no idea if it will
> reach the list, keep Message-Id/References intact, and CC Christophe,
> Jeremy, Dongxiao et al.

There were no cc:s.

Thanks,
    J


* Re: Re: new netfront and occasional receive path lockup
  2010-09-12  8:55             ` Jeremy Fitzhardinge
@ 2010-09-12 17:23               ` Pasi Kärkkäinen
  2010-09-12 22:40               ` Gerald Turner
  1 sibling, 0 replies; 31+ messages in thread
From: Pasi Kärkkäinen @ 2010-09-12 17:23 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, xen-devel, Gerald Turner

On Sun, Sep 12, 2010 at 06:55:48PM +1000, Jeremy Fitzhardinge wrote:
>  On 09/12/2010 11:00 AM, Gerald Turner wrote:
> > I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> > confirmed the netfront driver is patched with an earlier version of
> > the smartpoll code.
> >
> > I manually merged Debian's kernel with Jeremy's updates to the netfront
> > driver in his git repository.
> >
> >   $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
> >
> > Deployed this new image on all domU's (except for two of them, as a
> > control group) and updated grub kernel parameter with
> > xen_netfront.use_smartpoll=0.
> 
> That's good to hear.  But I also included a fix from Dongxiao which, if
> correct, means it should work with use_smartpoll=1 (or nothing, as
> that's the default).  Could you verify whether the fix in
> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
> 

It'd be good to get the fix(es) into xen/stable-2.6.32.x as well..

Or can you use "use_smartpoll=0" in the current xen/stable-2.6.32.x branch? 

-- Pasi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: new netfront and occasional receive path lockup
  2010-09-12  8:55             ` Jeremy Fitzhardinge
  2010-09-12 17:23               ` Pasi Kärkkäinen
@ 2010-09-12 22:40               ` Gerald Turner
  2010-09-13  0:03                 ` Gerald Turner
  1 sibling, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-12 22:40 UTC (permalink / raw)
  To: xen-devel

Jeremy Fitzhardinge <jeremy@goop.org> writes:

>  On 09/12/2010 11:00 AM, Gerald Turner wrote:
>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>> confirmed the netfront driver is patched with an earlier version of
>> the smartpoll code.
>>
>> I manually merged Debian's kernel with Jeremy's updates to the
>> netfront driver in his git repository.
>>
>>   $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>>
>> Deployed this new image on all domU's (except for two of them, as a
>> control group) and updated grub kernel parameter with
>> xen_netfront.use_smartpoll=0.
>
> That's good to hear.  But I also included a fix from Dongxiao which,
> if correct, means it should work with use_smartpoll=1 (or nothing, as
> that's the default).  Could you verify whether the fix in
> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>

I've been running with use_smartpoll=1 for a few hours this afternoon;
looks like Dongxiao's bugfix works.

-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-12 22:40               ` Gerald Turner
@ 2010-09-13  0:03                 ` Gerald Turner
  2010-09-13  0:54                   ` Xu, Dongxiao
  0 siblings, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-13  0:03 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, xen-devel

Gerald Turner <gturner@unzane.com> writes:

> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>>  On 09/12/2010 11:00 AM, Gerald Turner wrote:
>>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>>> confirmed the netfront driver is patched with an earlier version of
>>> the smartpoll code.
>>>
>>> I manually merged Debian's kernel with Jeremy's updates to the
>>> netfront driver in his git repository.
>>>
>>>   $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>>>
>>> Deployed this new image on all domU's (except for two of them, as a
>>> control group) and updated grub kernel parameter with
>>> xen_netfront.use_smartpoll=0.
>>
>> That's good to hear.  But I also included a fix from Dongxiao which,
>> if correct, means it should work with use_smartpoll=1 (or nothing, as
>> that's the default).  Could you verify whether the fix in
>> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>>
>
> I've been running with use_smartpoll=1 for a few hours this afternoon;
> looks like Dongxiao's bugfix works.
>

I spoke too soon!  use_smartpoll set to 1 still exhibits the
problem: a few domU's lost network after about 60 minutes of uptime.
Sorry for the bad news...

-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: Re: new netfront and occasional receive path lockup
  2010-09-13  0:03                 ` Gerald Turner
@ 2010-09-13  0:54                   ` Xu, Dongxiao
  2010-09-13  2:12                     ` Gerald Turner
  0 siblings, 1 reply; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-13  0:54 UTC (permalink / raw)
  To: Gerald Turner, Jeremy Fitzhardinge; +Cc: xen-devel

Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
> 
>> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>> 
>>>  On 09/12/2010 11:00 AM, Gerald Turner wrote:
>>>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>>>> confirmed the netfront driver is patched with an earlier version
>>>> of the smartpoll code.
>>>> 
>>>> I manually merged Debian's kernel with Jeremy's updates to the
>>>> netfront driver in his git repository.
>>>> 
>>>>   $ git diff
>>>> 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c
>>>> 8475f0c00e0606 
>>>> 
>>>> Deployed this new image on all domU's (except for two of them, as a
>>>> control group) and updated grub kernel parameter with
>>>> xen_netfront.use_smartpoll=0.
>>> 
>>> That's good to hear.  But I also included a fix from Dongxiao which,
>>> if correct, means it should work with use_smartpoll=1 (or nothing,
>>> as that's the default).  Could you verify whether the fix in
>>> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>>> 
>> 
>> I've been running with use_smartpoll=1 for a few hours this
>> afternoon; looks like Dongxiao's bugfix works.
>> 
> 
> I spoke too soon!  use_smartpoll set to 1 still exhibits the
> problem: a few domU's lost network after about 60 minutes of uptime.
> Sorry for the bad news...

Hi Gerald,

Sorry for the inconvenience. I will continue to look into it.

Does this bug only happen when you launch multiple domUs?
I tried a single domU and could not catch the bug.

Thanks,
Dongxiao

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13  0:54                   ` Xu, Dongxiao
@ 2010-09-13  2:12                     ` Gerald Turner
  2010-09-13  2:34                       ` Xu, Dongxiao
  0 siblings, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-13  2:12 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel

"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:

> Does this bug only happen when you launch multiple domUs?  I tried a
> single domU and could not catch the bug.
>

I've been working on this server for about two weeks; I hadn't noticed
the problem for the first week, when I only had 3 domUs.  It started
happening when I added 10 more domUs.  The problem would happen quickly,
within 10 minutes, always affecting at least two domUs at random, and
affect more domUs over time.

Saturday I installed the updated driver with Jeremy's use_smartpoll
parameter, ran for 24 hours with smartpoll disabled, no problems.

Today I've been trying with smartpoll enabled.  It took an hour to
affect two domUs - noticeably longer than in previous days, before
installing your patch.  I still have 9 other domUs running with
smartpoll enabled, four hours of uptime; I'm surprised they haven't been
affected yet.  Could there be another less-frequent race in
smart_poll_function?
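
To make the shape of race I'm wondering about concrete, here is a tiny
userspace model of the polling handshake as I understand it from this
thread.  Every name in it is an assumption for illustration; it is not
the driver code:

  /* Model: backend samples smartpoll_active, queues a packet, and only
   * notifies if the frontend was NOT polling; frontend, seeing no work
   * yet, stops polling.  If the backend's sample lands just before the
   * frontend's clear, the packet gets neither a poll nor an interrupt. */
  #include <pthread.h>
  #include <stdatomic.h>
  #include <stdio.h>

  static atomic_int smartpoll_active = 1;  /* shared ring flag */
  static atomic_int queued;                /* packet awaiting rx */
  static atomic_int notified;              /* event channel fired */

  static void *backend(void *arg)
  {
      int polling = atomic_load(&smartpoll_active);  /* sample first */
      atomic_store(&queued, 1);                      /* then queue */
      if (!polling)
          atomic_store(&notified, 1);  /* notify only if frontend idle */
      return arg;
  }

  int main(void)
  {
      pthread_t t;
      pthread_create(&t, NULL, backend, NULL);

      if (!atomic_load(&queued))               /* frontend sees no work */
          atomic_store(&smartpoll_active, 0);  /* ...and stops polling */

      pthread_join(t, NULL);
      if (queued && !notified && !smartpoll_active)
          puts("lost wakeup: packet queued, no irq, nobody polling");
      else
          puts("benign interleaving this run");
      return 0;
  }

A window that narrow would only be hit occasionally, which would fit
lockups that take minutes or hours to show up.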

--
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: Re: new netfront and occasional receive path lockup
  2010-09-13  2:12                     ` Gerald Turner
@ 2010-09-13  2:34                       ` Xu, Dongxiao
  2010-09-13  4:38                         ` Gerald Turner
  0 siblings, 1 reply; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-13  2:34 UTC (permalink / raw)
  To: Gerald Turner; +Cc: Jeremy Fitzhardinge, xen-devel

[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]

Gerald Turner wrote:
> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
> 
>> Does this bug only happen when you launch multiple domUs?  I tried a
>> single domU and could not catch the bug.
>> 
> 
> I've been working on this server for about two weeks; I hadn't
> noticed the problem for the first week, when I only had 3 domUs.  It
> started happening when I added 10 more domUs.  The problem would
> happen quickly, within 10 minutes, always affecting at least two
> domUs at random, and affect more domUs over time.    
> 
> Saturday I installed the updated driver with Jeremy's use_smartpoll
> parameter, ran for 24 hours with smartpoll disabled, no problems. 
> 
> Today I've been trying with smartpoll enabled.  It took an hour to
> affect two domUs - noticeably longer than in previous days, before
> installing your patch.  I still have 9 other domUs running
> with smartpoll enabled, four hours of uptime; I'm surprised they haven't
> been affected yet.  Could there be another less-frequent race in
> smart_poll_function?     

Hi Gerald, 

Thanks for your detailed information.

Unfortunately I don't have a platform at hand that can launch more than 10 guests.

Here is another patch (see attached file) that fixes another potential race.

Do you have the bandwidth to give it a try? Thanks in advance!

Best Regards,
-- Dongxiao

[-- Attachment #2: 0001-Netfront-Fix-another-potential-race-condition.patch --]
[-- Type: application/octet-stream, Size: 1270 bytes --]

From af4cfa73d54e59686aad8bf1a5d6ec0223c3dd32 Mon Sep 17 00:00:00 2001
From: Dongxiao Xu <dongxiao.xu@intel.com>
Date: Mon, 13 Sep 2010 10:17:58 +0800
Subject: [PATCH] Netfront: Fix another potential race condition

When trying to start the next hrtimer from the current callback,
we should check its return value and do error handling.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
---
 drivers/net/xen-netfront.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
index 03e19b0..47f651e 100644
--- a/drivers/net/xen-netfront.c
+++ b/drivers/net/xen-netfront.c
@@ -1372,10 +1372,15 @@ static enum hrtimer_restart smart_poll_function(struct hrtimer *timer)
 		np->smart_poll.active = 0;
 	}
 
-	if (np->rx.sring->private.netif.smartpoll_active)
-		hrtimer_start(timer,
+	if (np->rx.sring->private.netif.smartpoll_active) {
+		if ( hrtimer_start(timer,
 			ktime_set(0, NANO_SECOND/psmart_poll->smart_poll_freq),
-			HRTIMER_MODE_REL);
+			HRTIMER_MODE_REL) ) {
+			printk(KERN_DEBUG "Failed to start hrtimer, "
+					"use interrupt mode for this packet\n");
+			np->rx.sring->private.netif.smartpoll_active = 0;
+		}
+	}
 
 end:
 	spin_unlock_irqrestore(&np->tx_lock, flags);
-- 
1.6.3


[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13  2:34                       ` Xu, Dongxiao
@ 2010-09-13  4:38                         ` Gerald Turner
  2010-09-13 16:01                           ` Gerald Turner
  0 siblings, 1 reply; 31+ messages in thread
From: Gerald Turner @ 2010-09-13  4:38 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel

"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:

> Thanks for your detailed information.
>
> Unfortunately I don't have a platform at hand that can launch more
> than 10 guests.
>
> Here is another patch (see attached file) that fixes another potential
> race.
>
> Do you have the bandwidth to give it a try? Thanks in advance!
>

I built a kernel with your additional patch.

I have it running on all 13 domU's with use_smartpoll=1.

I'll report tomorrow morning whether there were any lockups.

FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
patch.

-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13  4:38                         ` Gerald Turner
@ 2010-09-13 16:01                           ` Gerald Turner
  2010-09-13 16:08                             ` Pasi Kärkkäinen
  2010-09-14  0:26                             ` Xu, Dongxiao
  0 siblings, 2 replies; 31+ messages in thread
From: Gerald Turner @ 2010-09-13 16:01 UTC (permalink / raw)
  To: Xu, Dongxiao; +Cc: Jeremy Fitzhardinge, xen-devel

Gerald Turner <gturner@unzane.com> writes:

> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>
>> Thanks for your detailed information.
>>
>> Unfortunately I don't have a platform at hand that can launch more
>> than 10 guests.
>>
>> Here is another patch (see attached file) that fixes another
>> potential race.
>>
>> Do you have the bandwidth to give it a try? Thanks in advance!
>>
>
> I built a kernel with your additional patch.
>
> I have it running on all 13 domU's with use_smartpoll=1.
>
> I'll report tomorrow morning whether there were any lockups.
>
> FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
> patch.
>

Sorry bad news again...

Had 5 lockups within 4 hours.

Then I restarted all domUs with use_smartpoll=0 and haven't had any
lockups in 7 hours.

-- 
Gerald Turner  Email: gturner@unzane.com  JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13 16:01                           ` Gerald Turner
@ 2010-09-13 16:08                             ` Pasi Kärkkäinen
  2010-09-13 19:36                               ` Jeremy Fitzhardinge
  2010-09-14  0:26                             ` Xu, Dongxiao
  1 sibling, 1 reply; 31+ messages in thread
From: Pasi Kärkkäinen @ 2010-09-13 16:08 UTC (permalink / raw)
  To: Gerald Turner; +Cc: Xu, Dongxiao, xen-devel, Jeremy Fitzhardinge

On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
> 
> > "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
> >
> >> Thanks for your detailed information.
> >>
> >> Unfortunately I don't have a platform at hand that can launch more
> >> than 10 guests.
> >>
> >> Here is another patch (see attached file) that fixes another
> >> potential race.
> >>
> >> Do you have the bandwidth to give it a try? Thanks in advance!
> >>
> >
> > I built a kernel with your additional patch.
> >
> > I have it running on all 13 domU's with use_smartpoll=1.
> >
> > I'll report tomorrow morning whether there were any lockups.
> >
> > FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
> > patch.
> >
> 
> Sorry bad news again...
> 
> Had 5 lockups within 4 hours.
> 
> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> lockups in 7 hours.
>

I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
until this is sorted out..

-- Pasi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13 16:08                             ` Pasi Kärkkäinen
@ 2010-09-13 19:36                               ` Jeremy Fitzhardinge
  2010-09-14  8:25                                 ` Ian Campbell
  0 siblings, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-13 19:36 UTC (permalink / raw)
  To: Pasi Kärkkäinen; +Cc: Xu, Dongxiao, xen-devel, Gerald Turner

 On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
>> Gerald Turner <gturner@unzane.com> writes:
>>
>>> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>>>
>>>> Thanks for your detailed information.
>>>>
>>>> Unfortunately I don't have a platform at hand that can launch more
>>>> than 10 guests.
>>>>
>>>> Here is another patch (see attached file) that fixes another
>>>> potential race.
>>>>
>>>> Do you have the bandwidth to give it a try? Thanks in advance!
>>>>
>>> I built a kernel with your additional patch.
>>>
>>> I have it running on all 13 domU's with use_smartpoll=1.
>>>
>>> I'll report tomorrow morning whether there were any lockups.
>>>
>>> FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
>>> patch.
>>>
>> Sorry bad news again...
>>
>> Had 5 lockups within 4 hours.
>>
>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
>> lockups in 7 hours.
>>
> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> until this is sorted out..

Agreed.

    J

^ permalink raw reply	[flat|nested] 31+ messages in thread

* RE: Re: new netfront and occasional receive path lockup
  2010-09-13 16:01                           ` Gerald Turner
  2010-09-13 16:08                             ` Pasi Kärkkäinen
@ 2010-09-14  0:26                             ` Xu, Dongxiao
  1 sibling, 0 replies; 31+ messages in thread
From: Xu, Dongxiao @ 2010-09-14  0:26 UTC (permalink / raw)
  To: Gerald Turner; +Cc: Jeremy Fitzhardinge, xen-devel

Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
> 
>> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>> 
>>> Thanks for your detailed information.
>>>
>>> Unfortunately I don't have a platform at hand that can launch more
>>> than 10 guests.
>>>
>>> Here is another patch (see attached file) that fixes another
>>> potential race.
>>>
>>> Do you have the bandwidth to give it a try? Thanks in advance!
>>> 
>> 
>> I built a kernel with your additional patch.
>> 
>> I have it running on all 13 domU's with use_smartpoll=1.
>> 
>> I'll report tomorrow morning whether there were any lockups.
>> 
>> FYI, total today I had 6 lockups with use_smartpoll=1 and the
>> previous patch. 
>> 
> 
> Sorry bad news again...
> 
> Had 5 lockups within 4 hours.
> 
> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> lockups in 7 hours. 

Thanks Gerald.
I will try to find a local environment to do more investigation.

Best Regards,
-- Dongxiao

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-13 19:36                               ` Jeremy Fitzhardinge
@ 2010-09-14  8:25                                 ` Ian Campbell
  2010-09-14 17:54                                   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 31+ messages in thread
From: Ian Campbell @ 2010-09-14  8:25 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, xen-devel, Gerald Turner

On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> > On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> >> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> >> lockups in 7 hours.
> >>
> > I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> > until this is sorted out..
> 
> Agreed.

Should we also consider adding a netback option to disable it for the
system as a whole? Or are the issues strictly in-guest only?

Perhaps netback should support a xenstore key to allow a toolstack to
configure this property per guest?
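
Something like the sketch below on the netback side, perhaps -- purely
illustrative, and both the key name "feature-smart-poll" and the helper
are made up here, not an existing interface:

  /* Sketch: netback consulting a hypothetical per-vif xenstore key. */
  #include <xen/xenbus.h>

  static int smartpoll_enabled(struct xenbus_device *dev)
  {
  	int val;

  	/* toolstack would write <backend-path>/feature-smart-poll = 0|1 */
  	if (xenbus_scanf(XBT_NIL, dev->nodename,
  			 "feature-smart-poll", "%d", &val) != 1)
  		val = 1;	/* key absent: keep the current default */
  	return val;
  }

Netback could check that at connect time and simply refuse to enter
smart-poll mode for a vif when it reads 0.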

Ian.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-14  8:25                                 ` Ian Campbell
@ 2010-09-14 17:54                                   ` Jeremy Fitzhardinge
  2010-09-14 18:44                                     ` Pasi Kärkkäinen
  0 siblings, 1 reply; 31+ messages in thread
From: Jeremy Fitzhardinge @ 2010-09-14 17:54 UTC (permalink / raw)
  To: Ian Campbell; +Cc: Xu, Dongxiao, xen-devel, Gerald Turner

 On 09/14/2010 01:25 AM, Ian Campbell wrote:
> On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
>> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
>>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
>>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
>>>> lockups in 7 hours.
>>>>
>>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
>>> until this is sorted out..
>> Agreed.
> Should we also consider adding a netback option to disable it for the
> system as a whole? Or are the issues strictly in-guest only?
>
> Perhaps netback should support a xenstore key to allow a toolstack to
> configure this property per guest?

It depends on what the problem is.  If there's a basic problem with the
smartpoll front<->back communication protocol then we'll probably have
to revert the whole thing and start over.  If the bug is just something
in the frontend then we can disable it there until resolved.

Fortunately I haven't pushed netfront smartpoll support upstream yet, so
the userbase is still fairly limited.  I hope.

    J

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-14 17:54                                   ` Jeremy Fitzhardinge
@ 2010-09-14 18:44                                     ` Pasi Kärkkäinen
  2010-09-15  9:46                                       ` Ian Campbell
  0 siblings, 1 reply; 31+ messages in thread
From: Pasi Kärkkäinen @ 2010-09-14 18:44 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Xu, Dongxiao, xen-devel, Ian Campbell, Gerald Turner

On Tue, Sep 14, 2010 at 10:54:27AM -0700, Jeremy Fitzhardinge wrote:
>  On 09/14/2010 01:25 AM, Ian Campbell wrote:
> > On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> >> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> >>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> >>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> >>>> lockups in 7 hours.
> >>>>
> >>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> >>> until this is sorted out..
> >> Agreed.
> > Should we also consider adding a netback option to disable it for the
> > system as a whole? Or are the issues strictly in-guest only?
> >
> > Perhaps netback should support a xenstore key to allow a toolstack to
> > configure this property per guest?
> 
> It depends on what the problem is.  If there's a basic problem with the
> smartpoll front<->back communication protocol then we'll probably have
> to revert the whole thing and start over.  If the bug is just something
> in the frontend then we can disable it there until resolved.
> 
> Fortunately I haven't pushed netfront smartpoll support upstream yet, so
> the userbase is still fairly limited.  I hope.
> 

There have been quite a few people on ##xen on irc complaining about it..

I think the smartpoll code has ended up in the Debian Squeeze 2.6.32-5-xen kernel..
Hopefully they'll pull the "Revert "xen/netfront: default smartpoll to on"" soon..

-- Pasi

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Re: new netfront and occasional receive path lockup
  2010-09-14 18:44                                     ` Pasi Kärkkäinen
@ 2010-09-15  9:46                                       ` Ian Campbell
  0 siblings, 0 replies; 31+ messages in thread
From: Ian Campbell @ 2010-09-15  9:46 UTC (permalink / raw)
  To: Pasi Kärkkäinen
  Cc: Jeremy Fitzhardinge, xen-devel, Xu, Dongxiao, Gerald Turner

On Tue, 2010-09-14 at 19:44 +0100, Pasi Kärkkäinen wrote:
> On Tue, Sep 14, 2010 at 10:54:27AM -0700, Jeremy Fitzhardinge wrote:
> >  On 09/14/2010 01:25 AM, Ian Campbell wrote:
> > > On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> > >> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> > >>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> > >>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> > >>>> lockups in 7 hours.
> > >>>>
> > >>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> > >>> until this is sorted out..
> > >> Agreed.
> > > Should we also consider adding a netback option to disable it for the
> > > system as a whole? Or are the issues strictly in-guest only?
> > >
> > > Perhaps netback should support a xenstore key to allow a toolstack to
> > > configure this property per guest?
> > 
> > It depends on what the problem is.  If there's a basic problem with the
> > smartpoll front<->back communication protocol then we'll probably have
> > to revert the whole thing and start over.  If the bug is just something
> > in the frontend then we can disable it there until resolved.
> > 
> > Fortunately I haven't pushed netfront smartpoll support upstream yet, so
> > the userbase is still fairly limited.  I hope.
> > 
> 
> There has been quite a few people on ##xen on irc complaining about it..
> 
> I think the smartpoll code has ended up in Debian Squeeze 2.6.32-5-xen kernel..
> Hopefully they'll pull the "Revert "xen/netfront: default smartpoll to on"" soon..

I've suggested it on debian-kernel.

Ian.

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2010-09-15  9:46 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-08-22 16:43 new netfront and occasional receive path lockup Christophe Saout
2010-08-22 18:37 ` Christophe Saout
2010-08-24  0:53   ` Jeremy Fitzhardinge
2010-08-23 14:26 ` Christophe Saout
2010-08-23 16:04   ` Konrad Rzeszutek Wilk
2010-08-23 17:09     ` Christophe Saout
2010-08-24  0:46 ` Jeremy Fitzhardinge
2010-08-25  0:51   ` Xu, Dongxiao
2010-09-09 18:50     ` Pasi Kärkkäinen
2010-09-10  0:55       ` Jeremy Fitzhardinge
2010-09-10  1:45         ` Xu, Dongxiao
2010-09-10  2:25           ` Jeremy Fitzhardinge
2010-09-10  2:37             ` Xu, Dongxiao
2010-09-10  2:42               ` Jeremy Fitzhardinge
2010-09-12  1:00           ` Gerald Turner
2010-09-12  8:55             ` Jeremy Fitzhardinge
2010-09-12 17:23               ` Pasi Kärkkäinen
2010-09-12 22:40               ` Gerald Turner
2010-09-13  0:03                 ` Gerald Turner
2010-09-13  0:54                   ` Xu, Dongxiao
2010-09-13  2:12                     ` Gerald Turner
2010-09-13  2:34                       ` Xu, Dongxiao
2010-09-13  4:38                         ` Gerald Turner
2010-09-13 16:01                           ` Gerald Turner
2010-09-13 16:08                             ` Pasi Kärkkäinen
2010-09-13 19:36                               ` Jeremy Fitzhardinge
2010-09-14  8:25                                 ` Ian Campbell
2010-09-14 17:54                                   ` Jeremy Fitzhardinge
2010-09-14 18:44                                     ` Pasi Kärkkäinen
2010-09-15  9:46                                       ` Ian Campbell
2010-09-14  0:26                             ` Xu, Dongxiao
