* Xen unstability on HP Moonshot m400
@ 2015-03-21 12:34 Christoffer Dall
  2015-03-23 12:36 ` Ian Campbell
  0 siblings, 1 reply; 7+ messages in thread
From: Christoffer Dall @ 2015-03-21 12:34 UTC (permalink / raw)
  To: Ian Campbell, Stefano Stabellini, Hull, Jim
  Cc: Marc Zyngier, Robert Ricci, xen-devel, Pranavkumar Sawargaonkar


Hi,

I have been experiencing a problematic crash running Xen on m400 over the
last few days.  I already spoke to Ian and Stefano about this, but thought
I'd summarize what I've seen so far and loop in a wider audience.

The basic setup is this:
 - Two m400 nodes, one running Linux bare-metal, the other running Xen.
 - The Xen node runs Dom0 and 1 DomU
 - The m400 has a Mellanox ConnectX-3 PCIe 10G Ethernet card with two ports
on it
 - Dom0 uses NAT forwarding from Dom0's eth0 (which is connected to the
internet) and regular bridging to eth1 which is connected to a private VLAN
to the bare-metal node
 - Dom0 and DomU are configured with 14 GB of RAM and 4 CPUs each
 - DomU runs apache2 serving the GCC manual (see
https://github.com/chazy/kvmperf/blob/master/cmdline_tests/apache_install.sh)

The bare-metal node runs apache bench, like this: "ab -n 100000 -c 100
http://10.10.1.120/gcc/index.html"

(10.10.1.120 is the DomU IP address of the bridged interface to eth1)
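
For readers without ApacheBench at hand, the load pattern of the command
above (a fixed total number of GET requests with bounded concurrency) can be
approximated with the sketch below. This is not what was run for this
report; run_load and http_get are hypothetical helper names, and the sketch
makes no attempt to match ab's timing or reporting.

```python
# Sketch only: approximates the request pattern of "ab -n N -c C URL".
# run_load and http_get are hypothetical helper names, not part of ab.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def http_get(url, timeout=10.0):
    """Fetch url once and return the HTTP status code."""
    with urlopen(url, timeout=timeout) as response:
        response.read()  # drain the body, as a benchmark client would
        return response.status


def run_load(url, total, concurrency, fetch=http_get):
    """Issue `total` requests with at most `concurrency` in flight.

    Returns (succeeded, failed). A request counts as failed if fetch()
    raises or returns a status other than 200.
    """
    def one_request(_):
        try:
            return fetch(url) == 200
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(one_request, range(total)))
    succeeded = sum(results)
    return succeeded, total - succeeded
```

Calling run_load("http://10.10.1.120/gcc/index.html", 100000, 100) from the
bare-metal node would mirror the -n and -c values used above. The fetch
parameter exists only so the function can be exercised without a server.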

What happens now is that the entire Xen node goes down.  I see various
errors in the kernel log, some examples:
http://pastebin.ubuntu.com/10642148/
http://pastebin.ubuntu.com/10642177/
http://pastebin.ubuntu.com/10642181/
http://pastebin.ubuntu.com/10635573/

All Linux kernels are 3.18 plus some tweaks for the m400 cartridge:
https://github.com/columbia/linux-kvm-arm/tree/columbia-armvirt-3.18
config: columbia_armvirt_defconfig (from the same tree:
https://github.com/columbia/linux-kvm-arm/blob/columbia-armvirt-3.18/arch/arm64/configs/columbia_armvirt_defconfig
)

I have also tried applying a set of swiotlb fixes provided by Stefano to
both the Dom0 and DomU kernel, like this:
https://github.com/columbia/linux-kvm-arm/commits/columbia-armvirt-3.18-with-xen-fixes

With these patches I sometimes also saw this error in the kernel log (but
not always):
http://pastebin.ubuntu.com/10635062/

Other data points of interest:
 - Bare-metal serving apache doesn't exhibit this behavior
 - KVM guests with bridged networking on identical hardware/setup with the
same kernels also don't exhibit this behavior
 - Other physically identical nodes exhibit the same behavior
 - Just running Dom0 serving apache without running DomU doesn't appear to
exhibit this behavior
 - Running apache on Dom0 and benchmarking the system using Dom0's IP
address while DomU runs idle in the background does cause this behavior
(http://pastebin.ubuntu.com/10642311/), but the system seems to stay alive
(at least for much longer)!

Stefano suggested that this could be related to DMA cache coherency, but I'm
not sure how to investigate that further.

This is a somewhat urgent issue for us at Columbia so I would appreciate
any feedback and/or ideas and will be happy to try out any debugging steps
to get to the bottom of this.

Thanks,
-Christoffer

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Re: Xen unstability on HP Moonshot m400
  2015-03-21 12:34 Xen unstability on HP Moonshot m400 Christoffer Dall
@ 2015-03-23 12:36 ` Ian Campbell
  2015-03-23 13:00   ` Christoffer Dall
  0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2015-03-23 12:36 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Robert Ricci, Stefano Stabellini, Marc Zyngier, xen-devel, Hull,
	Jim, Pranavkumar Sawargaonkar

On Sat, 2015-03-21 at 13:34 +0100, Christoffer Dall wrote:
> Hi,
> 
> I have been experiencing a problematic crash running Xen on m400 over
> the last few days.  I already spoke to Ian and Stefano about this, but
> thought I'd summarize what I've seen so far and loop in a wider
> audience.
> 
> The basic setup is this:
>  - Two m400 nodes, one running Linux bare-metal, the other running
> Xen.
>  - The Xen node runs Dom0 and 1 DomU
>  - The m400 has a Mellanox Connectx-3 PCIe 10G ethernet card with two
> parts on it
>  - Dom0 uses NAT forwarding from Dom0's eth0 (which is connected to
> the internet) and regular bridging to eth1 which is connected to a
> private VLAN to the bare-metal node
>  - Dom0 and DomU are configured with 14GB of ram, 4 cpus each
>  - DomU runs apache2 serving the GCC manual (see
> https://github.com/chazy/kvmperf/blob/master/cmdline_tests/apache_install.sh)
> 
> The bare-metal node runs apache bench, like this: "ab -n 100000 -c 100
> http://10.10.1.120/gcc/index.html"
> 
> (10.10.1.120 is the DomU IP address of the bridged interface to eth1)
> 
> What happens now is that the entire Xen node goes down.  I see various
> errors in the kernel log, some examples:
> http://pastebin.ubuntu.com/10642148/
> http://pastebin.ubuntu.com/10642177/
> http://pastebin.ubuntu.com/10642181/
> http://pastebin.ubuntu.com/10635573/
> 
> 
> All Linux kernels are 3.18 plus some tweaks for the m400 cartridge:
> https://github.com/columbia/linux-kvm-arm/tree/columbia-armvirt-3.18

Is it worth adding
https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
to your kernel? It isn't Xen-specific, but it's possible that Xen opens the
window wider.

How confident are you in
https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb ?
(although I suppose you aren't running in ACPI mode if you are running
Xen?)

If we think the issue might be to do with coherency of foreign mappings
undergoing i/o from dom0 and we've already ruled out disk (by using a
loopback mounted rootfs) then it might be worth bodging netback to
always copy too.

Adding a call to skb_orphan_frags right before the netif_receive_skb in
drivers/net/xen-netback/netback.c:xenvif_tx_submit is a simple but
rather inefficient way of doing that (so I hope it doesn't perturb the
issue).
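
The effect of "always copy" can be pictured with a small user-space toy
model. This is not kernel code and it does not model caches or DMA; it only
illustrates the ownership change that skb_orphan_frags is being used for
here, where fragments that referenced another domain's pages are replaced by
private copies before the packet is passed on. Packet, orphan_frags, and the
page contents are hypothetical names and values.

```python
# Toy model only; Packet and orphan_frags are illustrative names and do
# not correspond to real kernel structures.
class Packet:
    def __init__(self, frags):
        # Each fragment is a bytearray standing in for a memory page.
        # Zero-copy: the packet only references the guest-owned pages.
        self.frags = list(frags)
        self.zerocopy = True

    def orphan_frags(self):
        """Replace references to foreign pages with private copies."""
        if self.zerocopy:
            self.frags = [bytearray(page) for page in self.frags]
            self.zerocopy = False

    def payload(self):
        return b"".join(bytes(page) for page in self.frags)


def demo():
    guest_page = bytearray(b"GET /gcc/index.html")
    shared = Packet([guest_page])   # still backed by the guest page
    copied = Packet([guest_page])
    copied.orphan_frags()           # now backed by a private copy

    guest_page[0:3] = b"XXX"        # the foreign page changes afterwards
    return shared.payload(), copied.payload()
```

After demo(), the zero-copy packet reflects the later change to the guest
page, while the orphaned packet still holds the original bytes. The extra
copy per fragment is the inefficiency mentioned above.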

Stefano (who is more familiar with the Linux swiotlb side of things than
me) is travelling this week, so he'll be on West Coast time. I'm not sure
when he gets off the plane or whether he's on email at all (he's at ELC
plus this ARM ACPI thing).

Ian.



* Re: Xen unstability on HP Moonshot m400
  2015-03-23 12:36 ` Ian Campbell
@ 2015-03-23 13:00   ` Christoffer Dall
  2015-03-23 23:58     ` Stefano Stabellini
  0 siblings, 1 reply; 7+ messages in thread
From: Christoffer Dall @ 2015-03-23 13:00 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Robert Ricci, Stefano Stabellini, Marc Zyngier, xen-devel,
	msalter, Hull, Jim, Pranavkumar Sawargaonkar


On Mon, Mar 23, 2015 at 1:36 PM, Ian Campbell <ian.campbell@citrix.com>
wrote:

>
> Is it worth adding
>
> https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
> to your kernel? It isn't Xen specific but it's perhaps possible that Xen
> opens the window wider.
>
> How confident are you in
>
> https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb
> ?
> (although I suppose you aren't running in ACPI mode if you are running
> Xen?)
>

I'm not confident at all, but Linux (last I checked was v3.19) doesn't boot
without it, so I'm not sure whether there's an alternative.  Mark?


>
> If we think the issue might be to do with coherency of foreign mappings
> undergoing i/o from dom0 and we've already ruled out disk (by using a
> loopback mounted rootfs) then it might be worth bodging netback to
> always copy too.
>
> Adding a call to skb_orphan_frags right before the netif_receive_skb in
> drivers/net/xen-netback/netback.c:xenvif_tx_submit is a simple but
> rather inefficient way of doing that (so I hope it doesn't perturb the
> issue).
>

I'll be happy to try this.


>
> Stefano (who is more familiar with the Linux swiotlb side of things than
> me) is travelling this week so he'll be on West coast time, not sure
> when he gets off a plane nor if he's on email anyway (he's at ELC + this
> ARM ACPI thing)
>
>
ok, we'll see what happens.

-Christoffer


* Re: Xen unstability on HP Moonshot m400
  2015-03-23 13:00   ` Christoffer Dall
@ 2015-03-23 23:58     ` Stefano Stabellini
  2015-03-24 13:54       ` Mark Salter
  0 siblings, 1 reply; 7+ messages in thread
From: Stefano Stabellini @ 2015-03-23 23:58 UTC (permalink / raw)
  To: Christoffer Dall
  Cc: Robert Ricci, Ian Campbell, Stefano Stabellini, Marc Zyngier,
	xen-devel, msalter, Hull, Jim, Pranavkumar Sawargaonkar

On Mon, 23 Mar 2015, Christoffer Dall wrote:
> On Mon, Mar 23, 2015 at 1:36 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
>       Is it worth adding
>       https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
>       to your kernel? It isn't Xen specific but it's perhaps possible that Xen opens the window wider.
> 
>       How confident are you in
>       https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb ?
>       (although I suppose you aren't running in ACPI mode if you are running
>       Xen?)
> 
> 
> I'm not confident at all, but Linux (last I checked was v3.19) doesn't boot without it, so not sure if there's an
> alternative?  Mark?

This patch is key: it doesn't look like it is setting
dev->archdata.dma_coherent appropriately, see the implementation of
set_arch_dma_coherent_ops.


> 
>       If we think the issue might be to do with coherency of foreign mappings
>       undergoing i/o from dom0 and we've already ruled out disk (by using a
>       loopback mounted rootfs) then it might be worth bodging netback to
>       always copy too.
> 
>       Adding a call to skb_orphan_frags right before the netif_receive_skb in
>       drivers/net/xen-netback/netback.c:xenvif_tx_submit is a simple but
>       rather inefficient way of doing that (so I hope it doesn't perturb the
>       issue).
> 
> 
> I'll be happy to try this.

If we are right and the problem is due to the commit above not setting
dma_coherent to true (so that the kernel thinks the network card is not
coherent), then Ian's workaround should hide the problem.




* Re: Xen unstability on HP Moonshot m400
  2015-03-23 23:58     ` Stefano Stabellini
@ 2015-03-24 13:54       ` Mark Salter
  2015-03-24 14:00         ` Mark Salter
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Salter @ 2015-03-24 13:54 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Robert Ricci, Ian Campbell, Marc Zyngier, xen-devel,
	Christoffer Dall, Hull, Jim, Pranavkumar Sawargaonkar

On Mon, 2015-03-23 at 23:58 +0000, Stefano Stabellini wrote:
> On Mon, 23 Mar 2015, Christoffer Dall wrote:
> > On Mon, Mar 23, 2015 at 1:36 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
> >       Is it worth adding
> >       https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
> >       to your kernel? It isn't Xen specific but it's perhaps possible that Xen opens the window wider.

You definitely want that one. Without it, the page table walker could
end up using a stale pointer to a page being used for something other
than page tables.

> > 
> >       How confident are you in
> >       https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb ?
> >       (although I suppose you aren't running in ACPI mode if you are running
> >       Xen?)
> > 
> > 
> > I'm not confident at all, but Linux (last I checked was v3.19) doesn't boot without it, so not sure if there's an
> > alternative?  Mark?
> 
> This patch is key: it doesn't look like it is setting
> dev->archdata.dma_coherent appropriately, see the implementation of
> set_arch_dma_coherent_ops.

You'd want this if booting with ACPI. You might also need it for
enumerated PCI devices even if booting with devicetree.



* Re: Xen unstability on HP Moonshot m400
  2015-03-24 13:54       ` Mark Salter
@ 2015-03-24 14:00         ` Mark Salter
  2015-03-24 16:51           ` Christoffer Dall
  0 siblings, 1 reply; 7+ messages in thread
From: Mark Salter @ 2015-03-24 14:00 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Robert Ricci, Ian Campbell, Marc Zyngier, xen-devel,
	Christoffer Dall, Hull, Jim, Pranavkumar Sawargaonkar

On Tue, 2015-03-24 at 09:54 -0400, Mark Salter wrote:
> On Mon, 2015-03-23 at 23:58 +0000, Stefano Stabellini wrote:
> > On Mon, 23 Mar 2015, Christoffer Dall wrote:
> > > On Mon, Mar 23, 2015 at 1:36 PM, Ian Campbell <ian.campbell@citrix.com> wrote:
> > >       Is it worth adding
> > >       https://git.kernel.org/cgit/linux/kernel/git/arm64/linux.git/commit/?id=285994a62c80f1d72c6924282bcb59608098d5ec
> > >       to your kernel? It isn't Xen specific but it's perhaps possible that Xen opens the window wider.
> 
> You definitely want that one. Without it, the page table walker could
> end up using a stale pointer to a page being used for something other
> than page tables.
> 
> > > 
> > >       How confident are you in
> > >       https://github.com/columbia/linux-kvm-arm/commit/5e29cb0478f3d90e4f568d6bea6840960331bcbb ?
> > >       (although I suppose you aren't running in ACPI mode if you are running
> > >       Xen?)
> > > 
> > > 
> > > I'm not confident at all, but Linux (last I checked was v3.19) doesn't boot without it, so not sure if there's an
> > > alternative?  Mark?
> > 
> > This patch is key: it doesn't look like it is setting
> > dev->archdata.dma_coherent appropriately, see the implementation of
> > set_arch_dma_coherent_ops.
> 
> You'd want this if booting with ACPI. You might also need it for
> enumerated PCI devices even if booting with devicetree.

There's an updated version of this patch for newer kernels in the
devel branch of git.fedorahosted.org/git/kernel-arm64.git

There is also this one in Linus' tree which may be of interest to you:

commit 7132813c384515c9dede1ae20e56f3895feb7f1e
Author: Suzuki K. Poulose <suzuki.poulose@arm.com>
Date:   Thu Mar 19 18:17:09 2015 +0000

    arm64: Honor __GFP_ZERO in dma allocations



* Re: Xen unstability on HP Moonshot m400
  2015-03-24 14:00         ` Mark Salter
@ 2015-03-24 16:51           ` Christoffer Dall
  0 siblings, 0 replies; 7+ messages in thread
From: Christoffer Dall @ 2015-03-24 16:51 UTC (permalink / raw)
  To: Mark Salter
  Cc: Robert Ricci, Ian Campbell, Stefano Stabellini, Marc Zyngier,
	xen-devel, Hull, Jim, Pranavkumar Sawargaonkar


On Tue, Mar 24, 2015 at 3:00 PM, Mark Salter <msalter@redhat.com> wrote:

>
> There's an updated version of this patch for newer kernels in the
> devel branch of git.fedorahosted.org/git/kernel-arm64.git
>
> There is also this one in Linus' tree which may be of interest to you:
>
> commit 7132813c384515c9dede1ae20e56f3895feb7f1e
> Author: Suzuki K. Poulose <suzuki.poulose@arm.com>
> Date:   Thu Mar 19 18:17:09 2015 +0000
>
>     arm64: Honor __GFP_ZERO in dma allocations


Thanks Mark!

I'll give both a try!

-Christoffer

