From: Stefano Stabellini
Subject: Re: Xen unstability on HP Moonshot m400
Date: Mon, 23 Mar 2015 23:58:07 +0000
References: <1427114196.21742.265.camel@citrix.com>
To: Christoffer Dall
Cc: Robert Ricci, Ian Campbell, Stefano Stabellini, Marc Zyngier, xen-devel@lists.xen.org, msalter@redhat.com, "Hull, Jim", Pranavkumar Sawargaonkar
List-Id: xen-devel@lists.xenproject.org

On Mon, 23 Mar 2015, Christoffer Dall wrote:
> On Mon, Mar 23, 2015 at 1:36 PM, Ian Campbell wrote:
> On Sat, 2015-03-21 at 13:34 +0100, Christoffer Dall wrote:
> > Hi,
> >
> > I have been experiencing a problematic crash running Xen on m400 over
> > the last few days.  I already spoke to Ian and Stefano about this, but
> > thought I'd summarize what I've seen so far and loop in a wider
> > audience.
> >
> > The basic setup is this:
> >  - Two m400 nodes, one running Linux bare-metal, the other running
> > Xen.
> >  - The Xen node runs Dom0 and 1 DomU
> >  - The m400 has a Mellanox Connectx-3 PCIe 10G ethernet card with two
> > ports on it
> >  - Dom0 uses NAT forwarding from Dom0's eth0 (which is connected to
> > the internet) and regular bridging to eth1 which is connected to a
> > private VLAN to the bare-metal node
> >  - Dom0 and DomU are configured with 14GB of ram, 4 cpus each
> >  - DomU runs apache2 serving the GCC manual (see
> > https://github.com/chazy/kvmperf/blob/master/cmdline_tests/apache_install.sh)
> >
> > The bare-metal node runs apache bench, like this: "ab -n 100000 -c 100
> > http://secure-web.cisco.com/1r5tZ8-7RF8gHRANwFdizEZzgeMsjxVO0yKbYiV4zy7LeiUfYBXMkFq7FGW_SZ1x-VxdzyK-ErDsOUiQ9z2x-Ny7XkL_loHP8ene_BuNFscGyWmQ3r6CtXAYaZCY4xRmmPT1uJOsZDLMu7j-LfCOGmQDSdBwgW7QYukI2bCtTrXM/http%3A%2F%2F10.10.1.120%2Fgcc%2Findex.html"
> >
> > (10.10.1.120 is the DomU IP address of the bridged interface to eth1)
> >
> > What happens now is that the entire Xen node goes down.  I see various
> > errors in the kernel log, some examples:
> > http://pastebin.ubuntu.com/10642148/
> > http://pastebin.ubuntu.com/10642177/
> > http://pastebin.ubuntu.com/10642181/
> > http://pastebin.ubuntu.com/10635573/
> >
> >
> > All Linux kernels are 3.18 plus some tweaks for the m400 cartridge:
> > https://github.com/columbia/linux-kvm-arm/tree/columbia-armvirt-3.18
>
> Is it worth adding
> https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
> to your kernel? It isn't Xen specific but it's perhaps possible that Xen opens the window wider.
>
> How confident are you in
> https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb ?
> (although I suppose you aren't running in ACPI mode if you are running
> Xen?)
>
>
> I'm not confident at all, but Linux (last I checked was v3.19) doesn't boot without it, so not sure if there's an
> alternative?  Mark?

This patch is key: it doesn't look like it is setting
dev->archdata.dma_coherent appropriately; see the implementation of
set_arch_dma_coherent_ops.

>
> If we think the issue might be to do with coherency of foreign mappings
> undergoing i/o from dom0 and we've already ruled out disk (by using a
> loopback mounted rootfs) then it might be worth bodging netback to
> always copy too.
>
> Adding a call to skb_orphan_frags right before the netif_receive_skb in
> drivers/net/xen-netback/netback.c:xenvif_tx_submit is a simple but
> rather inefficient way of doing that (so I hope it doesn't perturb the
> issue).
>
>
> I'll be happy to try this.

If we are right and the problem is that the commit above does not set
dma_coherent to true (so the kernel believes the network card is not
coherent), then Ian's workaround should hide the problem.

>
> Stefano (who is more familiar with the Linux swiotlb side of things than
> me) is travelling this week so he'll be on West coast time, not sure
> when he gets off a plane nor if he's on email anyway (he's at ELC + this
> ARM ACPI thing)
>
>
> ok, we'll see what happens.
>
> -Christoffer