On 10.07.2016 at 00:29, Andreas Ziegler <ml@andreas-ziegler.de> wrote:

> In May, Ingo Jürgensmann also started experiencing this problem and
> blogged about it:
> https://blog.windfluechter.net/content/blog/2016/03/23/1721-xen-randomly-crashing-server
> https://blog.windfluechter.net/content/blog/2016/05/12/1723-xen-randomly-crashing-server-part-2

Actually, I have been suffering from this problem since April 2013. Here’s my story… ;)

Everything was working smoothly while I was still using a rootserver at Hetzner. The setup there was somewhat non-standard, as I needed eth0 as the outgoing interface without it being part of the Xen bridge. So I used a mixture of bridged and routed networking in xend-config.sxp. This setup worked for years without problems.

However, when Hetzner started to bill for every single IPv4 address, I moved to a new provider where I could get the same size of address space (/26) without being forced to pay for every IPv4 address. The server back then was a Cisco C200 M2.

Since I got my own VLAN at the new location, I was able to drop the mixed setup of routing and bridging and use only bridging, with eth0 now being part of the Xen bridge. The whole setup consists of two bridges: one for the external IP addresses (xenbr0) and one for internal traffic (xenbr1). This was already the case at Hetzner.

However, shortly after I moved to the new provider, the issues started: random crashes of the host. Together with the new provider, who was and still is very helpful, we replaced the memory, for example. The provider also reported that other Cisco C200 servers running Ubuntu LTS didn’t show this issue.

Over time, a pattern emerged that might explain the frequent crashes (sometimes several in a row, let’s say 2-10 times a day!):

My setup is this:

Debian stable with packaged Xen hypervisor and these VMs:
1) Mail, Database, Nameserver, OpenVPN
2) Webserver, Squid3
3) Login server
4) … some more servers (10 in total), e.g. Tor Relay…

IPv4 /26 network, IPv6 /48 network

From my workplace I need to log in to 3) and have a tunnel to the Squid on 2) via the internal addresses on xenbr1. Of course Squid queries the nameserver on 1), so there is some internal traffic going back and forth on the internal bridge, plus traffic originating from the external bridge (xenbr0). Using Squid, I access Roundcube on my small homebrew server, which is connected to 1) via OpenVPN. Of course the webserver on 2) also queries the database on 1).

So most crashes happen while I’m using the SSH tunnel from my workplace. If a crash happens, it’s most likely that at least two in a row will happen within a short time frame (1-2 hours), sometimes even within 10 minutes after the server came back. From time to time my impression was that the server crashes a second time the instant I try to access my Roundcube at home.

Furthermore, I switched from the Cisco C200 server to my own server with a Supermicro X9SRi-F mainboard and a Xeon E5-2630L v2, still at the same provider, and the issue remained: the new server crashes the same way the Cisco server did. On the new server we replaced the memory as well, going from 32G to 128G. So over time we have switched memory twice and hardware once. Since then I no longer assume that this is hardware-related.

In the meantime I switched from Squid on 2) to tinyproxy on 2), and also tried tinyproxy on a third-party VPS. The crashes still happen, regardless of whether I use Squid on 2) or not.

In May the server again crashed several times a week, and on some days several times a day. Really, really annoying!
So together with my provider we set up netconsole to capture more information about the crash than just the few lines from the IPMI console.
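
For anyone who wants to capture the same kind of output: the receiving side of netconsole is just a UDP listener. The following is only a minimal sketch in Python, not the setup we actually used. It assumes the crashing host sends its kernel messages to UDP port 6666 (the kernel's default netconsole target port); the function names, the bind address and the log file name are made up for illustration.

```python
import socket
from datetime import datetime, timezone


def format_line(payload: bytes, sender: str, now: datetime) -> str:
    """Turn one received datagram into a timestamped log line."""
    # Kernel messages are not guaranteed to be valid UTF-8, so replace bad bytes.
    text = payload.decode("utf-8", errors="replace").rstrip("\r\n")
    return f"{now.isoformat()} {sender} {text}"


def serve(bind_addr: str = "0.0.0.0", port: int = 6666,
          logfile: str = "netconsole.log") -> None:
    """Receive netconsole datagrams forever and append them to logfile."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(logfile, "a", encoding="utf-8") as log:
        sock.bind((bind_addr, port))
        while True:
            payload, (host, _src_port) = sock.recvfrom(65535)
            log.write(format_line(payload, host, datetime.now(timezone.utc)) + "\n")
            # Flush after every message: the interesting lines are the last
            # ones before the sender dies.
            log.flush()
```

Calling serve() on the receiving machine blocks and keeps appending to the log file until it is interrupted. The timestamps are added on the receiver, so they stay usable even if the clock on the crashing host is off.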

Trying linux-image 4.4 from backports didn’t help either. I also switched from PV to PVHVM some months ago.

> He is pretty sure, that the problem went away after disabling IPv6.

Indeed. Since I disabled IPv6 for all of my VMs (it’s still active on dom0, but not routed to the domUs anymore), not a single crash has happened.

> But: we can't say for sure, because on our server it sometimes happened
> often in a short period of time, but then it didn't for months.
> and: disabling IPv6 is no option for me at all.

I won’t claim to have an exact way of reproducing the crashes, but they happen fairly often with the usage described above.

What I can offer is:
- activate IPv6 again
- install a kernel with debugging symbols (*-dbg)
- try to provoke another crash
- send the netconsole output if a crash happens

What I cannot do:
- interpret the debug symbols
- access IPMI console from workplace (firewalled)

I agree with Andreas that disabling IPv6 is not an option.

--
Ciao...          //        http://blog.windfluechter.net
      Ingo     \X/     XMPP: ij@jabber.windfluechter.net

gpg pubkey:  http://www.juergensmann.de/ij_public_key.asc