From mboxrd@z Thu Jan  1 00:00:00 1970
From: Scott Garron <xen-devel@sce.pridelands.org>
Subject: Re: Making snapshot of logical volumes handling HVM	domU
	causes OOPS and instability
Date: Tue, 31 Aug 2010 04:16:09 -0400
Message-ID: <4C7CBA49.2030306@sce.pridelands.org>
References: <4C7864BB.1010808@sce.pridelands.org> <4C7BE1C6.5030602@goop.org>
	<D5AB6E638E5A3E4B8F4406B113A5A19A2A4D1D5B@shsmsx501.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <D5AB6E638E5A3E4B8F4406B113A5A19A2A4D1D5B@shsmsx501.ccr.corp.intel.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "Xu, Dongxiao" <dongxiao.xu@intel.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, Daniel Stodden <daniel.stodden@citrix.com>
List-Id: xen-devel@lists.xenproject.org

>> Scott Garron wrote:
>>> Another issue that comes up is that if I run the 2.6.32.18 pvops
>>> kernel for my Linux domUs, after a time (usually only about an
>>> hour or so), the network interfaces stop responding.

> Jeremy Fitzhardinge wrote:
>> That's a separate problem in netfront that appears to be a bug in
>> the "smartpoll" code.  I think Dongxiao is looking into it.

On 8/31/2010 2:59 AM, Xu, Dongxiao wrote:
> Yes, I have been trying to reproduce it these days; however, I couldn't
> catch it locally. I tried both netperf and ping for a long time, but the
> bug was not triggered. What workload were you running when you hit the bug?

      I'd say that the whole machine is under moderate to high
utilization because it has 10 virtual machines running - three of which
are Windows 2008 Servers as HVM guests.  However, as far as the "load"
goes, most of the virtual machines are fairly idle and probably not
under much stress, overall.  Just to give you an idea, we have a
10Mbit/s connection to the Internet, and this server's physical network
interface (all 10 of the domUs' traffic, combined) usually accounts for
less than 2Mbit/s of the outbound traffic at any given point in the day.
Aside from Windows being Windows (the HVM guests are running graphical
desktops), I wouldn't say that any of them cause a high CPU load,
either.  Database load is fairly low to moderate on guests running MySQL
and/or PostgreSQL.  The only guest that seems to use more CPU and
RAM is one serving e-mail, and that's because it runs ClamAV and
SpamAssassin.  That e-mail server was one that kept its network
connectivity the longest, though (after a few hours, it did stop
responding, but that was after some guests with lighter loads stopped
responding).

      One observation, which may just be coincidental but seems at
least noteworthy, is that the virtual machines that are assigned less
RAM seem to lose connectivity more quickly than those with more RAM.
The most recent time that I was able to trigger the bug, the virtual
machine that lost connectivity was assigned only 384MB of RAM and was
running 2.6.32.18.  At the time, the rest of my paravirtualized guests
were running 2.6.31.14, and they didn't experience the problem.

      I've previously triggered the bug in multiple domUs that were
running a more recent kernel (I think it was 2.6.32.17 - before I
reverted to a netback-patched 2.6.31.14 kernel), and the first ones to
disappear from the network were ones that were only assigned 256MB.
Eventually, they all disappeared, though.  The only "load" on one of the
first to disappear is an installation of bind9, servicing about 50
domain names - none of which receive an abnormally high hit count.

      The first time I noticed the problem, I had started 7
paravirtualized guests, of varying memory assignments.  The moment I
started the 8th guest, an HVM Windows 2008 Server, the networking on all
of the running guests (the paravirt ones) stopped responding at the same
time.  That may also be something to try/look at.

      After a reboot, I avoided starting any of the HVM guests, and the
connectivity lasted a couple of hours on the 7 running paravirt guests,
but started disappearing one guest at a time, over the course of the
next few hours.

      I didn't mention in my previous e-mail that in order to get
networking to work in a stable fashion in the 2.6.31.14 kernel (the one
I reverted to), I had to apply the patch mentioned here:
http://lists.xensource.com/archives/html/xen-devel/2010-05/msg01570.html
Otherwise, networking became unstable immediately at the time of guest
creation.  That patch was already applied to the 2.6.32.18 kernel that
is giving me the eventual network loss problems, though.

      More specifics about my configuration can be found here:
http://www.pridelands.org/~simba/hurricane-server.txt

-- 
Scott Garron