From mboxrd@z Thu Jan 1 00:00:00 1970
From: Christoffer Dall
Subject: Re: Xen unstability on HP Moonshot m400
Date: Mon, 23 Mar 2015 14:00:46 +0100
Message-ID:
References: <1427114196.21742.265.camel@citrix.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4339151336945146900=="
Return-path:
In-Reply-To: <1427114196.21742.265.camel@citrix.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Ian Campbell
Cc: Robert Ricci, Stefano Stabellini, Marc Zyngier, xen-devel@lists.xen.org, msalter@redhat.com, "Hull, Jim", Pranavkumar Sawargaonkar
List-Id: xen-devel@lists.xenproject.org

--===============4339151336945146900==
Content-Type: multipart/alternative; boundary=001a113769b22a40ab0511f441e2

--001a113769b22a40ab0511f441e2
Content-Type: text/plain; charset=UTF-8

On Mon, Mar 23, 2015 at 1:36 PM, Ian Campbell wrote:

> On Sat, 2015-03-21 at 13:34 +0100, Christoffer Dall wrote:
> > Hi,
> >
> > I have been experiencing a problematic crash running Xen on m400 over
> > the last few days.  I already spoke to Ian and Stefano about this, but
> > thought I'd summarize what I've seen so far and loop in a wider
> > audience.
> >
> > The basic setup is this:
> >  - Two m400 nodes, one running Linux bare-metal, the other running
> >    Xen.
> >  - The Xen node runs Dom0 and 1 DomU
> >  - The m400 has a Mellanox Connectx-3 PCIe 10G ethernet card with two
> >    parts on it
> >  - Dom0 uses NAT forwarding from Dom0's eth0 (which is connected to
> >    the internet) and regular bridging to eth1 which is connected to a
> >    private VLAN to the bare-metal node
> >  - Dom0 and DomU are configured with 14GB of ram, 4 cpus each
> >  - DomU runs apache2 serving the GCC manual (see
> >    https://github.com/chazy/kvmperf/blob/master/cmdline_tests/apache_install.sh)
> >
> > The bare-metal node runs apache bench, like this: "ab -n 100000 -c 100
> > http://secure-web.cisco.com/1r5tZ8-7RF8gHRANwFdizEZzgeMsjxVO0yKbYiV4zy7LeiUfYBXMkFq7FGW_SZ1x-VxdzyK-ErDsOUiQ9z2x-Ny7XkL_loHP8ene_BuNFscGyWmQ3r6CtXAYaZCY4xRmmPT1uJOsZDLMu7j-LfCOGmQDSdBwgW7QYukI2bCtTrXM/http%3A%2F%2F10.10.1.120%2Fgcc%2Findex.html"
> >
> > (10.10.1.120 is the DomU IP address of the bridged interface to eth1)
> >
> > What happens now is that the entire Xen node goes down.  I see various
> > errors in the kernel log, some examples:
> > http://pastebin.ubuntu.com/10642148/
> > http://pastebin.ubuntu.com/10642177/
> > http://pastebin.ubuntu.com/10642181/
> > http://pastebin.ubuntu.com/10635573/
> >
> >
> > All Linux kernels are 3.18 plus some tweaks for the m400 cartridge:
> > https://github.com/columbia/linux-kvm-arm/tree/columbia-armvirt-3.18
>
> Is it worth adding
> https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
> to your kernel? It isn't Xen specific but it's perhaps possible that Xen
> opens the window wider.
>
> How confident are you in
> https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb ?
> (although I suppose you aren't running in ACPI mode if you are running
> Xen?)
>

I'm not confident at all, but Linux (last I checked was v3.19) doesn't boot
without it, so not sure if there's an alternative?  Mark?

>
> If we think the issue might be to do with coherency of foreign mappings
> undergoing i/o from dom0 and we've already ruled out disk (by using a
> loopback mounted rootfs) then it might be worth bodging netback to
> always copy too.
>
> Adding a call to skb_orphan_frags right before the netif_receive_skb in
> drivers/net/xen-netback/netback.c:xenvif_tx_submit is a simple but
> rather inefficient way of doing that (so I hope it doesn't perturb the
> issue).
>

I'll be happy to try this.

>
> Stefano (who is more familiar with the Linux swiotlb side of things than
> me) is travelling this week so he'll be on West coast time, not sure
> when he gets off a plane nor if he's on email anyway (he's at ELC + this
> ARM ACPI thing)
>

ok, we'll see what happens.

-Christoffer

--001a113769b22a40ab0511f441e2
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable


--001a113769b22a40ab0511f441e2--

--===============4339151336945146900==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--===============4339151336945146900==--