From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christoffer Dall
Subject: Xen unstability on HP Moonshot m400
Date: Sat, 21 Mar 2015 13:34:48 +0100
Message-ID:
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============0047155615059037829=="
Return-path:
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Ian Campbell , Stefano Stabellini , "Hull, Jim"
Cc: Marc Zyngier , Robert Ricci , xen-devel@lists.xen.org, Pranavkumar Sawargaonkar
List-Id: xen-devel@lists.xenproject.org

--===============0047155615059037829==
Content-Type: multipart/alternative; boundary=001a1147b0d899748a0511cba867

--001a1147b0d899748a0511cba867
Content-Type: text/plain; charset=UTF-8

Hi,

I have been experiencing a problematic crash running Xen on the m400 over
the last few days. I already spoke to Ian and Stefano about this, but
thought I'd summarize what I've seen so far and loop in a wider audience.

The basic setup is this:
 - Two m400 nodes, one running Linux bare-metal, the other running Xen.
 - The Xen node runs Dom0 and one DomU.
 - The m400 has a Mellanox ConnectX-3 PCIe 10G ethernet card with two
   ports on it.
 - Dom0 uses NAT forwarding from Dom0's eth0 (which is connected to the
   internet) and regular bridging to eth1, which is connected to a
   private VLAN to the bare-metal node.
 - Dom0 and DomU are configured with 14GB of RAM and 4 CPUs each.
 - DomU runs apache2 serving the GCC manual (see
   https://github.com/chazy/kvmperf/blob/master/cmdline_tests/apache_install.sh )

The bare-metal node runs apache bench, like this:
"ab -n 100000 -c 100 http://10.10.1.120/gcc/index.html"

(10.10.1.120 is the DomU IP address of the bridged interface to eth1)

What happens now is that the entire Xen node goes down. I see various
errors in the kernel log, some examples:
http://pastebin.ubuntu.com/10642148/
http://pastebin.ubuntu.com/10642177/
http://pastebin.ubuntu.com/10642181/
http://pastebin.ubuntu.com/10635573/

All Linux kernels are 3.18 plus some tweaks for the m400 cartridge:
https://github.com/columbia/linux-kvm-arm/tree/columbia-armvirt-3.18
config: columbia_armvirt_defconfig (from the same tree:
https://github.com/columbia/linux-kvm-arm/blob/columbia-armvirt-3.18/arch/arm64/configs/columbia_armvirt_defconfig )

I have also tried applying a set of swiotlb fixes provided by Stefano to
both the Dom0 and DomU kernels, like this:
https://github.com/columbia/linux-kvm-arm/commits/columbia-armvirt-3.18-with-xen-fixes

With these patches I sometimes (but not always) also saw this error in
the kernel log:
http://pastebin.ubuntu.com/10635062/

Other data points of interest:
 - Bare-metal serving apache doesn't exhibit this behavior.
 - KVM guests with bridged networking on identical hardware/setup with
   the same kernels also don't exhibit this behavior.
 - Other physically identical nodes exhibit the same behavior.
 - Just running Dom0 serving apache without running DomU doesn't appear
   to exhibit this behavior.
 - Running apache on Dom0 and benchmarking the system using Dom0's IP
   address while DomU runs idle in the background causes this behavior
   ( http://pastebin.ubuntu.com/10642311/ ), but the system seems to stay
   alive (at least for much longer)!

Stefano suggested that this could be related to DMA cache coherency, but
I'm not sure how to investigate that further.

This is a somewhat urgent issue for us at Columbia, so I would appreciate
any feedback and/or ideas and will be happy to try out any debugging
steps to get to the bottom of this.

Thanks,
-Christoffer

--001a1147b0d899748a0511cba867
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
--001a1147b0d899748a0511cba867--

--===============0047155615059037829==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--===============0047155615059037829==--