From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jeremy Fitzhardinge Subject: Re: new netfront and occasional receive path lockup Date: Mon, 23 Aug 2010 17:46:37 -0700 Message-ID: <4C73166D.3030000@goop.org> References: <1282495384.12843.11.camel@leto.intern.saout.de> Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <1282495384.12843.11.camel@leto.intern.saout.de> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Christophe Saout Cc: "Xu, Dongxiao" , xen-devel@lists.xensource.com List-Id: xen-devel@lists.xenproject.org On 08/22/2010 09:43 AM, Christophe Saout wrote: > Hi, > > I've been playing with some of the new pvops code, namely DomU guest > code. What I've been observing on one of the virtual machines is that > the network (vif) is dying after about ten to sixty minutes of uptime. > The unfortunate thing here is that I can only repoduce it on a > production VM and have been unlucky so far to trigger the bug on a test > machine. While this has not been tragic - rebooting fixed the issue, > unfortunately I can't spend very much time on debugging after the issue > pops up. Ah, OK. I've seen this a couple of times as well. And it just happened to me then... > Now, what is happening is that the receive path goes dead. The DomU can > send packets to Dom0 and those are visible using tcpdump on the Dom0 on > the virtual interface, but not the other way around. I hadn't got to that level of diagnosis, but I can confirm that that's what seems to be happening here too. > Now, I have done more than one change at a time (I'd like to avoid going > into pinning it down since I can only reproduce it on a production > machine, as I said, so suggestions are welcome), but my suspicion is > that it might have to do with the new "smart polling" feature in > xen/netfront. Note that I have also updated Dom0 to pull in the latest > dom0/backend and netback changes, just to make sure it's not due to an > issue that has been fixed there, but I'm still seeing the same. I agree. I think I started seeing this once I merged smartpoll into netfront. J > The production machine is a machine that doesn't have much network load, > but deals with a lot of small network requests (DNS and smtp mostly). A > workload which is hard to reproduce on the test machine. Heavy network > load (NFS, FTP and so on) for days hasn't triggered the problem. Also, > segmentation offloading and similar settings don't have any effect. > > The machine has 2 physical and the VM 2 virtual CPUs, DomU has PREEMPT > enabled. > > I've been looking at the code, if there might be a race condition > somewhere, something like where one could run into a situation where the > hrtimer doesn't run and Dom0 believes the DomU should be polling and > doesn't emit an interrupt or something, but I'm afraid I don't know > enough to judge this (I mean, there are spinlocks which look safe to > me). > > Do you have any suggestions what to try? I can trigger the issue on the > production VM again, but debugging should not take more than a few > minutes if it happens. Access is only possible via the console. > Neither Dom0 nor the guest show anything unusual in the kernel message > and continue to behave normally after the network goes dead (also able > to shut down the guest normally). > > Thanks, > Christophe > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xensource.com > http://lists.xensource.com/xen-devel >