* [Xen-devel] CPU Lockup bug with the credit2 scheduler
@ 2020-01-07 14:25 Alastair Browne
  2020-02-17 19:58 ` Sarah Newman
  0 siblings, 1 reply; 8+ messages in thread
From: Alastair Browne @ 2020-01-07 14:25 UTC (permalink / raw)
  To: xen-devel


SYMPTOMS

A Xen host is found to lock up with messages on console along the
following lines:-

NMI watchdog: BUG: soft lockup - CPU#2 stuck for 23s!

Later in the system log, the lockup message often names a specific
program that happened to be running at the time; the program named is
not constant and varies with whatever was running when the lockup
occurred.

Once the host has locked up, the only solution is a reboot. It has not
been possible to analyse the state of a locked-up machine any further,
because the command line is no longer available.

This problem has been seen on a Debian platform with the following
configuration; however, it could equally occur on other platforms.

The configuration of the host machine is as follows:-

# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 8 (jessie)"
NAME="Debian GNU/Linux"
VERSION_ID="8"
VERSION="8 (jessie)"
ID=debian
HOME_URL="http://www.debian.org/"
SUPPORT_URL="http://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

# uname -srvpio
Linux 4.9.0-11-amd64 #1 SMP Debian 4.9.189-3+deb9u2a~test (2019-12-18)
unknown unknown GNU/Linux

# xl info
host			: my-host.example.com
release			: 4.9.0-11-amd64
version			: #1 SMP Debian 4.9.189-3+deb9u2a~test (2019-12-18)
machine			: x86_64
nr_cpus			: 24
max_cpu_id		: 191
nr_nodes		: 2
cores_per_socket	: 12
threads_per_core	: 1
cpu_mhz			: 1797.920
hw_caps			:
bfebfbff:77fef3ff:2c100800:00000021:00000001:000037ab:00000000:00000100
virt_caps		: pv hvm hvm_directio pv_directio hap shadow
iommu_hap_pt_share
total_memory		: 392994
free_memory		: 265294
sharing_freed_memory	: 0
sharing_used_memory	: 0
outstanding_claims	: 0
free_cpus		: 0
xen_major		: 4
xen_minor		: 13
xen_extra		: .0-mem1-ox
xen_version		: 4.13.0-mem1-ox
xen_caps		: xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 
hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler		: credit2
xen_pagesize		: 4096
platform_params		: virt_start=0xffff800000000000
xen_changeset		: Tue Dec 17 14:19:49 2019 +0000 git:a2e84d8e42
xen_commandline		: placeholder dom0_mem=4096M,max:16384M
com1=115200,8n1 console=com1 ucode=scan smt=0 sched=credit2 
crashkernel=512M@32M
cc_compiler		: gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
cc_compile_by		: support
cc_compile_domain	: example.com
cc_compile_date		: Wed Dec 18 11:13:45 GMT 2019
build_id		: 672783467e7a60c4f8a1aa715d549cb59f00c7cf
xend_config_format	: 4

RE-CREATION

To recreate the symptoms, build a Xen host according to the above
parameters, then create at least ten Linux virtual machines within
it. The Xen host should use LVM to provision the VMs with their
storage. Each VM should have a single disk device, partitioned in the
conventional manner.
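For illustration only, one way to provision a guest's disk is shown
below (the volume group name 'virtservervg' matches the snapshot
script later in this report; the size, LV name and guest disk line are
assumptions, not part of the original setup):

# create a backing logical volume for one guest's root disk
lvcreate --size 20G --name vm01_root_fs virtservervg

# and reference it in that guest's xl configuration, e.g.
# disk = [ 'phy:/dev/virtservervg/vm01_root_fs,xvda,w' ]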

The virtual machines and the Xen host must then be loaded up as
follows:-

VIRTUAL MACHINES

Construct a program to allocate, fill and free
memory. An example of such a program is given below:-

mem-grab.C
/*
  This program will allocate and fill memory. Its purpose is to
  simulate memory use on a machine. Once it has grabbed the memory, it
  sleeps for 10 seconds, then frees it.
  If run with no arguments, the program will find out the maximum
  memory available on the machine and then will attempt to grab 75% of
  it. If run with an integer argument, this program will attempt to
  allocate that amount of memory.
  If an error occurs with the allocation, then an exception will be
  thrown and caught. An error message will then be printed on stderr.
*/
  
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <thread>
#include <chrono>
#include "mem-size.h"

#define MEM_PERCENT 0.75

using namespace std;
int main(int argc, char** argv)
{
  int *ptr;
  unsigned long long i,n;
  unsigned long long MemAvailable = 0;
  unsigned long long MemAlloc = 0;
  if (argc == 1)
    {
      // Find out the maximum memory available
      MemAvailable = get_system_memory ();
      cout << "Memory available = " << MemAvailable << endl;
      MemAlloc = MemAvailable * MEM_PERCENT;
      cout << "Memory to be allocated: " << MemAlloc << endl;
      // Divide the value by the size of an int because that's what we
      // will be filling the memory with.
      n = MemAlloc / sizeof (int);
    }
  else
    {
      n = strtoul (argv[1], NULL, 0);
      n = n / sizeof (int);
    }
  cout << "Allocating " << n * sizeof (int) << " bytes..." << endl;
  try
    {
      ptr = new int [n];
    }
  catch (exception& e)
    {
      cerr << "Failed to allocate memory: " << e.what() << endl;
      return 1;
    }
  printf("Filling int into memory.....\n");
  for (i = 0; i < n; i++)
    {
      ptr[i] = 1;
    }
  printf("Sleep 10 seconds......\n");
  this_thread::sleep_for (chrono::seconds (10));
  printf("Free memory.\n");
  delete [] ptr;   // allocated with new[], so release with delete[]
  return 0;
}


mem-size.C

#include "mem-size.h"
unsigned long long get_system_memory ()
{
  unsigned long pages = sysconf(_SC_PHYS_PAGES);
  unsigned long page_size = sysconf(_SC_PAGE_SIZE);
  return (unsigned long long) pages * page_size; // promote before multiplying to avoid overflow
}

mem-size.h

#ifndef _MEMSIZE_H
#define _MEMSIZE_H
#include <unistd.h>
extern unsigned long long get_system_memory ();
#endif


This program should be compiled as 'mem-grab'. The test load is then
driven by the shell script that follows, which runs on the Xen host.
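A build command along these lines should work (assuming g++; the exact
compiler flags are not specified in the original report):

g++ -O2 -std=c++11 -pthread -I. -o mem-grab mem-grab.C mem-size.C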

#!/bin/bash

# This script should be run with one argument: the filename of a
# file containing a list of the virtual machine names, one per
# line. Each machine name needs to correspond with the machine
# name used in the 'lvcreate' line below.

MachineList=$1
while true
do
    for machine in $(cat ${MachineList})
    do
	date
	lvcreate --size 10G --snapshot \
		 --name test_${machine}_snap \
		 /dev/virtservervg/${machine}_root_fs
    done
    date
    echo "Snapshots Created"
    sleep 2
    # Transfer the snapshots over the network using 'dd', 'gzip' and 'ssh'.
    # This requires a passwordless ssh login on the machine specified by
    # xxx.xxx.xxx.xxx
    for machine in $(cat ${MachineList})
    do
	date
	(dd if=/dev/virtservervg/test_${machine}_snap \
	    bs=2048 | gzip -1c | \
	     ssh root@xxx.xxx.xxx.xxx \
		 "cat > /dev/null"; \
	 echo "${machine} dd finished") &
	echo "Kicked off ${machine}"
	sleep 2
    done
    date
    echo "Machine snapshots kicked off"
    wait
    date
    echo "Snapshot transfers finished"
    for machine in $(cat ${MachineList})
    do
	date
	/sbin/lvremove -f /dev/virtservervg/test_${machine}_snap
    done
    date
    sleep 2
done



Set this script running and then wait for the host to lock up. The
time that this takes is not predictable, but at some point the host
will experience the CPU soft lockup.

ANALYSIS

The work described above has been carried out with several different
kernel versions and with different Xen schedulers. The lockups occur
only when the 'credit2' scheduler is in use.

The problem does not occur with the 'credit' scheduler. It is
therefore concluded that there must be a bug within credit2.

FURTHER NOTES

During the testing, we used 4 hosts running various kernel and Xen
versions.

Kernel Packages

Production: 4.9.0-9-amd64 (4.9.168-1+deb9u3a~test)

Stretch Patched: 4.9.0-11-amd64 (4.9.189-3+deb9u2a)

Buster Unpatched: 4.19.0-0.bpo.6-amd64 (4.19.67-2+deb10u2~bpo9+1)

Buster Patched: 4.19.0-0.bpo.5-amd64

Pre-MDS Patched: 4.9.0-8-amd64 (4.9.110-3+deb9u4a~test)

All of the Xen builds were compiled on a suitable machine, using a
proven shell script to do the build.

Xen Packages

Production: 4.12 (Up to xsa-297) (4.12.1-pre-mem3-ox)

Latest: 4.13 (including 3 fixes for credit2) (4.13.0-mem1-ox)

RESULTS

Please see the attached spreadsheet (Results.xlsx).


CONCLUSION

So in conclusion, the tests indicate that credit2 might be unstable.

For the time being, we are using credit as the chosen scheduler. We
are booting the kernel with a parameter "sched=credit" to ensure that
the correct scheduler is used.

After the tests, we decided to stick with 4.9.0.9 kernel and 4.12 Xen
for production use running credit1 as the default scheduler.
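For illustration, a hedged sketch of how this looks on a Debian-style
dom0 (note that sched= is a Xen hypervisor command-line option, as
seen in the xen_commandline output above; the GRUB variable comes from
the standard 20_linux_xen hook and paths may differ on other
distributions):

# /etc/default/grub -- put sched=credit on the Xen command line
GRUB_CMDLINE_XEN_DEFAULT="... smt=0 sched=credit crashkernel=512M@32M"

# regenerate the GRUB configuration, then reboot
update-grub

# after reboot, confirm which scheduler is actually in use
xl info | grep xen_scheduler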


[-- Attachment #2: Results.xlsx --]
[-- Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, Size: 5701 bytes --]


* Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
  2020-01-07 14:25 [Xen-devel] CPU Lockup bug with the credit2 scheduler Alastair Browne
@ 2020-02-17 19:58 ` Sarah Newman
  2020-02-17 23:46   ` Sander Eikelenboom
                     ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Sarah Newman @ 2020-02-17 19:58 UTC (permalink / raw)
  To: xen-devel; +Cc: Alastair Browne, Glen, Tomas Mozes, PGNet Dev

On 1/7/20 6:25 AM, Alastair Browne wrote:
> 
> CONCLUSION
> 
> So in conclusion, the tests indicate that credit2 might be unstable.
> 
> For the time being, we are using credit as the chosen scheduler. We
> are booting the kernel with a parameter "sched=credit" to ensure that
> the correct scheduler is used.
> 
> After the tests, we decided to stick with 4.9.0.9 kernel and 4.12 Xen
> for production use running credit1 as the default scheduler.

One person CC'ed appears to be having the same experience, where the credit2 scheduler leads to lockups (in this case in the domU, not the dom0) under 
relatively heavy load. It seems possible they may have the same root cause.

I don't think there are, but have there been any patches since the 4.13.0 release which might have fixed problems with credit 2 scheduler? If not, 
what would the next step be to isolating the problem - a debug build of Xen or something else?

If there are no merged or proposed fixes soon, it may be worth considering making the credit scheduler the default again until problems with the 
credit2 scheduler are resolved.

Thanks, Sarah


* Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
  2020-02-17 19:58 ` Sarah Newman
@ 2020-02-17 23:46   ` Sander Eikelenboom
  2020-02-18  0:39     ` Glen
  2020-02-19 12:01   ` Dario Faggioli
  2020-03-06 13:54   ` Dario Faggioli
  2 siblings, 1 reply; 8+ messages in thread
From: Sander Eikelenboom @ 2020-02-17 23:46 UTC (permalink / raw)
  To: Sarah Newman, xen-devel; +Cc: Alastair Browne, Glen, Tomas Mozes, PGNet Dev

On 17/02/2020 20:58, Sarah Newman wrote:
> On 1/7/20 6:25 AM, Alastair Browne wrote:
>>
>> CONCLUSION
>>
>> So in conclusion, the tests indicate that credit2 might be unstable.
>>
>> For the time being, we are using credit as the chosen scheduler. We
>> are booting the kernel with a parameter "sched=credit" to ensure that
>> the correct scheduler is used.
>>
>> After the tests, we decided to stick with 4.9.0.9 kernel and 4.12 Xen
>> for production use running credit1 as the default scheduler.
> 
> One person CC'ed appears to be having the same experience, where the credit2 scheduler leads to lockups (in this case in the domU, not the dom0) under 
> relatively heavy load. It seems possible they may have the same root cause.
> 
> I don't think there are, but have there been any patches since the 4.13.0 release which might have fixed problems with credit 2 scheduler? If not, 
> what would the next step be to isolating the problem - a debug build of Xen or something else?
> 
> If there are no merged or proposed fixes soon, it may be worth considering making the credit scheduler the default again until problems with the 
> credit2 scheduler are resolved.
> 
> Thanks, Sarah
> 
> 

Hi Sarah / Alastair,

I can only provide my n=1 (OK, I'm running a bunch of boxes, some of which are pretty over-committed CPU-wise), 
but I haven't seen any issues (lately) with credit2.

I did take a look at Alastair Browne's report you replied to (https://lists.xen.org/archives/html/xen-devel/2020-01/msg00361.html)
and I do see some differences:
    - Alastair's machine has multiple sockets, my machines don't.
    - It seems Alastair's config is using ballooning ? (dom0_mem=4096M,max:16384M), for me that has been a source of trouble in the past, so my configs don't.
    - kernels tested are quite old (4.19.67 (latest upstream is 4.19.104), 4.9.189 (latest upstream is 4.9.214)) and no really new kernel is tested
      (5.4 is available in Debian backport for buster). 
    - Alastair, are you using pv, hvm or pvh guests? The report seems to miss the Guest configs (I'm primarily using PVH, and few HVM's, no PV except for dom0) ?

Anyhow, it could be worthwhile to test without ballooning, and to test a recent kernel to rule out an issue with (missing) kernel backports.

--
Sander


* Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
  2020-02-17 23:46   ` Sander Eikelenboom
@ 2020-02-18  0:39     ` Glen
  2020-02-18  6:51       ` Jürgen Groß
  0 siblings, 1 reply; 8+ messages in thread
From: Glen @ 2020-02-18  0:39 UTC (permalink / raw)
  To: Sander Eikelenboom
  Cc: Alastair Browne, xen-devel, PGNet Dev, Tomas Mozes, Sarah Newman

Hello Sander -

If I might chime in, I'm also experiencing what we believe is the same
problem, and hope I'm not breaking any protocol by sharing a few quick
details...

On Mon, Feb 17, 2020 at 3:46 PM Sander Eikelenboom <linux@eikelenboom.it> wrote:
> On 17/02/2020 20:58, Sarah Newman wrote:
> > On 1/7/20 6:25 AM, Alastair Browne wrote:
> >> So in conclusion, the tests indicate that credit2 might be unstable.
> >> For the time being, we are using credit as the chosen scheduler. We
> > I don't think there are, but have there been any patches since the 4.13.0 release which might have fixed problems with credit 2 scheduler? If not,
> > what would the next step be to isolating the problem - a debug build of Xen or something else?
> > If there are no merged or proposed fixes soon, it may be worth considering making the credit scheduler the default again until problems with the
> > credit2 scheduler are resolved.
> I did take a look at Alastair Browne's report you replied to (https://lists.xen.org/archives/html/xen-devel/2020-01/msg00361.html)
> and I do see some differences:
>     - Alastair's machine has multiple sockets, my machines don't.
>     - It seems Alastair's config is using ballooning ? (dom0_mem=4096M,max:16384M), for me that has been a source of trouble in the past, so my configs don't.

My configuration has ballooning disabled, we do not use it, and we
still have the problem.

>     - kernels tested are quite old (4.19.67 (latest upstream is 4.19.104), 4.9.189 (latest upstream is 4.9.214)) and no really new kernel is tested
>       (5.4 is available in Debian backport for buster).
>     - Alastair, are you using pv, hvm or pvh guests? The report seems to miss the Guest configs (I'm primarily using PVH, and few HVM's, no PV except for dom0) ?

The problem appears to occur for both HVM and PV guests.

A report by Tomas
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
provides his config for his HVM setup.

My initial report
https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
contains my PV guest config.

> Anyhow, it could be worthwhile to test without ballooning, and to test a recent kernel to rule out an issue with (missing) kernel backports.

Thanks to guidance from Sarah, we've had lots of discussion on the
users lists about this, especially this past week (pasting in
https://lists.xenproject.org/archives/html/xen-users/2020-02/ just for
your clicking convenience since I'm there as I type this) and it seems
like we've been able to narrow things down a bit:

* Alastair's config is on very large machines.  Tomas can duplicate
this on a much smaller scale, and I can duplicate it on a single DomU
running as the only guest on a Dom0 host.   So overall host
size/capacity doesn't seem to be very important, nor does number of
guests on the host.

* I'm using the Linux 4.12.14 kernel on both host and guest with Xen
4.12.1. - for me, the act of just going to a previous version of Xen
(in my case to Xen 4.10) eliminates the problem.  Tomas is on
4.14.159, and he reports that even moving back just to Xen 4.11
resolves his issue, whereas the issue seems to still exist in Xen
4.13.  So changing Xen versions without changing kernel versions seems
to resolve this.

* We've had another user mention that "When I switched to openSUSE Xen
4.13.0_04 packages with KernelStable (atm, 5.5.3-25.gd654690), Guests
of all 'flavors' became *much* better behaved.", so we think maybe
something in very recent Xen 4.13 might have helped (or possibly that
latest kernel, although from our limited point of view the changing of
Xen versions back to pre-4.12 solving this without any kernel changes
seems compelling.)

* Tomas has already tested, and I am still testing, Xen 4.12 with just
the sched=credit change.  For him that has eliminated the problem as
well, I am still stress-testing my guest under Xen 4.12 sched=credit,
so I cannot report, but I am hopeful.

I believe this is why Sarah asked about patches to 4.13... it is
looking to us just on the user level like this is possibly
kernel-independent, but at least Xen-version-dependent, and likely
credit-scheduler-dependent.

I apologize if I should be doing something different here, but it is
looking like a few more of us are having what we believe to be the
same problem and, based only on what I've seen, I've already changed
over all of my production hosts (I run about 20) to sched=credit as a
precautionary measure.

Any thoughts, insights or guidance would be greatly appreciated!

Respectfully,
Glen


* Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
  2020-02-18  0:39     ` Glen
@ 2020-02-18  6:51       ` Jürgen Groß
  2020-02-18 20:57         ` Glen
  0 siblings, 1 reply; 8+ messages in thread
From: Jürgen Groß @ 2020-02-18  6:51 UTC (permalink / raw)
  To: Glen, Sander Eikelenboom
  Cc: Alastair Browne, Sarah Newman, PGNet Dev, Tomas Mozes, xen-devel

On 18.02.20 01:39, Glen wrote:
> Hello Sander -
> 
> If I might chime in, I'm also experiencing what we believe is the same
> problem, and hope I'm not breaking any protocol by sharing a few quick
> details...
> 
> On Mon, Feb 17, 2020 at 3:46 PM Sander Eikelenboom <linux@eikelenboom.it> wrote:
>> On 17/02/2020 20:58, Sarah Newman wrote:
>>> On 1/7/20 6:25 AM, Alastair Browne wrote:
>>>> So in conclusion, the tests indicate that credit2 might be unstable.
>>>> For the time being, we are using credit as the chosen scheduler. We
>>> I don't think there are, but have there been any patches since the 4.13.0 release which might have fixed problems with credit 2 scheduler? If not,
>>> what would the next step be to isolating the problem - a debug build of Xen or something else?
>>> If there are no merged or proposed fixes soon, it may be worth considering making the credit scheduler the default again until problems with the
>>> credit2 scheduler are resolved.
>> I did take a look at Alastair Browne's report you replied to (https://lists.xen.org/archives/html/xen-devel/2020-01/msg00361.html)
>> and I do see some differences:
>>      - Alastair's machine has multiple sockets, my machines don't.
>>      - It seems Alastair's config is using ballooning ? (dom0_mem=4096M,max:16384M), for me that has been a source of trouble in the past, so my configs don't.
> 
> My configuration has ballooning disabled, we do not use it, and we
> still have the problem.
> 
>>      - kernels tested are quite old (4.19.67 (latest upstream is 4.19.104), 4.9.189 (latest upstream is 4.9.214)) and no really new kernel is tested
>>        (5.4 is available in Debian backport for buster).
>>      - Alastair, are you using pv, hvm or pvh guests? The report seems to miss the Guest configs (I'm primarily using PVH, and few HVM's, no PV except for dom0) ?
> 
> The problem appears to occur for both HVM and PV guests.
> 
> A report by Tomas
> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00015.html
> provides his config for his HVM setup.
> 
> My initial report
> https://lists.xenproject.org/archives/html/xen-users/2020-02/msg00018.html
> contains my PV guest config.
> 
>> Anyhow, it could be worthwhile to test without ballooning, and to test a recent kernel to rule out an issue with (missing) kernel backports.
> 
> Thanks to guidance from Sarah, we've had lots of discussion on the
> users lists about this, especially this past week (pasting in
> https://lists.xenproject.org/archives/html/xen-users/2020-02/ just for
> your clicking convenience since I'm there as I type this) and it seems
> like we've been able to narrow things down a bit:
> 
> * Alastair's config is on very large machines.  Tomas can duplicate
> this on a much smaller scale, and I can duplicate it on a single DomU
> running as the only guest on a Dom0 host.   So overall host
> size/capacity doesn't seem to be very important, nor does number of
> guests on the host.
> 
> * I'm using the Linux 4.12.14 kernel on both host and guest with Xen
> 4.12.1. - for me, the act of just going to a previous version of Xen
> (in my case to Xen 4.10) eliminates the problem.  Tomas is on
> 4.14.159, and he reports that even moving back just to Xen 4.11
> resolves his issue, whereas the issue seems to still exist in Xen
> 4.13.  So changing Xen versions without changing kernel versions seems
> to resolve this.
> 
> * We've had another user mention that "When I switched to openSUSE Xen
> 4.13.0_04 packages with KernelStable (atm, 5.5.3-25.gd654690), Guests
> of all 'flavors' became *much* better behaved.", so we think maybe
> something in very recent Xen 4.13 might have helped (or possibly that
> latest kernel, although from our limited point of view the changing of
> Xen versions back to pre-4.12 solving this without any kernel changes
> seems compelling.)
> 
> * Tomas has already tested, and I am still testing, Xen 4.12 with just
> the sched=credit change.  For him that has eliminated the problem as
> well, I am still stress-testing my guest under Xen 4.12 sched=credit,
> so I cannot report, but I am hopeful.
> 
> I believe this is why Sarah asked about patches to 4.13... it is
> looking to us just on the user level like this is possibly
> kernel-independent, but at least Xen-version-dependent, and likely
> credit-scheduler-dependent.
> 
> I apologize if I should be doing something different here, but it is
> looking like a few more of us are having what we believe to be the
> same problem and, based only on what I've seen, I've already changed
> over all of my production hosts (I run about 20) to sched=credit as a
> precautionary measure.
> 
> Any thoughts, insights or guidance would be greatly appreciated!

Can you check whether all vcpus of a hanging guest are consuming time
(via xl vcpu-list) ?

It would be interesting to see where the vcpus are running around. Can
you please copy the domU's /boot/System.map-<kernel-version> to dom0
and then issue:

/usr/lib/xen/bin/xenctx -C -S -s <domu-system-map> <domid>

This should give a backtrace for all vcpus of <domid>. To recognize a
loop you should issue that multiple times.
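
For example, a minimal sampling loop might look like this (the number
of samples and the interval are arbitrary; placeholders as above):

xl vcpu-list <domid>
for i in 1 2 3 4 5; do
    /usr/lib/xen/bin/xenctx -C -S -s <domu-system-map> <domid>
    sleep 2
done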


Juergen


* Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
  2020-02-18  6:51       ` Jürgen Groß
@ 2020-02-18 20:57         ` Glen
  0 siblings, 0 replies; 8+ messages in thread
From: Glen @ 2020-02-18 20:57 UTC (permalink / raw)
  To: Jürgen Groß; +Cc: Alastair Browne, Sander Eikelenboom, xen-devel

Juergen -

On Mon, Feb 17, 2020 at 10:51 PM Jürgen Groß <jgross@suse.com> wrote:
> > Any thoughts, insights or guidance would be greatly appreciated!
> Can you check whether all vcpus of a hanging guest are consuming time
> (via xl vcpu-list) ?
> It would be interesting to see where the vcpus are running around. Can
> you please copy the domU's /boot/System.map-<kernel-version> to dom0
> and then issue:
> /usr/lib/xen/bin/xenctx -C -S -s <domu-system-map> <domid>
> This should give a backtrace for all vcpus of <domid>. To recognize a
> loop you should issue that multiple times.
> Juergen

I've applied the sched=credit boot option to all my production servers
at this point, in preparation for a client cutover this weekend.

Once that's done, I'm happy next week to reboot the old crashing
server to credit2, and test.  I'll save these directions and advise.

Thank you,
Glen


* Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
  2020-02-17 19:58 ` Sarah Newman
  2020-02-17 23:46   ` Sander Eikelenboom
@ 2020-02-19 12:01   ` Dario Faggioli
  2020-03-06 13:54   ` Dario Faggioli
  2 siblings, 0 replies; 8+ messages in thread
From: Dario Faggioli @ 2020-02-19 12:01 UTC (permalink / raw)
  To: Sarah Newman, xen-devel; +Cc: Alastair Browne, Tomas Mozes, Glen, PGNet Dev



On Mon, 2020-02-17 at 11:58 -0800, Sarah Newman wrote:
> On 1/7/20 6:25 AM, Alastair Browne wrote:
> > 
> > After the tests, we decided to stick with 4.9.0.9 kernel and 4.12
> > Xen
> > for production use running credit1 as the default scheduler.
> 
> One person CC'ed appears to be having the same experience, where the
> credit2 scheduler leads to lockups (in this case in the domU, not the
> dom0) under 
> relatively heavy load. It seems possible they may have the same root
> cause.
> 
Yeah, well, if booting with `sched=credit` makes the problem disappear,
then whatever the real root cause is, it seems related to Credit2.

> I don't think there are, but have there been any patches since the
> 4.13.0 release which might have fixed problems with credit 2
> scheduler? If not, 
> what would the next step be to isolating the problem - a debug build
> of Xen or something else?
> 
Yes, having a debug build of Xen running and providing, for instance,
the info that Juergen is asking for later in this thread, i.e.:

xl vcpu-list
/usr/lib/xen/bin/xenctx -C -S -s <domu-system-map> <domid>

And I'd add myself:

xl debug-keys r ; xl dmesg

And, in general, hypervisor logs when the problem occurs (I've gone
through the threads, and I don't think I have seen any, but maybe I
missed something?).

xentop

is another way to have a look, from Dom0, at whether the vCPUs are
busy (and if so, which ones and how much).
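
A rough sketch of collecting all of the above from dom0 into a single
log when the problem hits (the log file name is just a placeholder,
and xentop's batch options may vary between versions):

{
  date
  xl vcpu-list
  xl debug-keys r
  xl dmesg
  xentop -b -i 1        # one batch iteration, if supported
  /usr/lib/xen/bin/xenctx -C -S -s <domu-system-map> <domid>
} >> /var/log/credit2-lockup-diag.txt 2>&1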


> If there are no merged or proposed fixes soon, it may be worth
> considering making the credit scheduler the default again until
> problems with the 
> credit2 scheduler are resolved.
> 
Nothing similar to what is being described has happened in our testing
(or we wouldn't have switched to Credit2, of course! :-D).

I will see about trying to reproduce this myself, but this may take a
little bit. In the meantime, if you help us by sending more logs, we're
happy to try diagnosing and fixing things.

Thanks and Regards
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



* Re: [Xen-devel] CPU Lockup bug with the credit2 scheduler
  2020-02-17 19:58 ` Sarah Newman
  2020-02-17 23:46   ` Sander Eikelenboom
  2020-02-19 12:01   ` Dario Faggioli
@ 2020-03-06 13:54   ` Dario Faggioli
  2 siblings, 0 replies; 8+ messages in thread
From: Dario Faggioli @ 2020-03-06 13:54 UTC (permalink / raw)
  To: Sarah Newman, xen-devel, xen-users
  Cc: Glen, George Dunlap, PGNet Dev, Tomas Mozes, Juergen Gross,
	Alastair Browne



[Adding George, as scheduler maintainer, and Juergen as he commented, 
 later in this thread]

[Adding xen-users back, as the thread originated from there... sorry 
 for cross-posting]

On Mon, 2020-02-17 at 11:58 -0800, Sarah Newman wrote:
> If there are no merged or proposed fixes soon, it may be worth
> considering making the credit scheduler the default again until
> problems with the 
> credit2 scheduler are resolved.
> 
Just as a heads up, I finally --thanks to Jim Fehlig-- found a
machine where I could reproduce (something like) this.

I've been able to do some analysis of the situation. Basically, on the
server I'm using, I do not see stalls severe enough to cause 
NMI/watchdogs to fire, but I do see, during boot, some preliminary
signs of that.

I checked what was happening in Xen at that point in time ('r' debug-
key, which dumps the scheduler's data structures), and I found out that
there is a vCPU kind of stuck in a runqueue. In fact, the vCPU is in
there, i.e., it is ready to run *but* not running, despite there being
plenty of idle pCPUs that could possibly run it.

The reason it is not being picked up is that its credits are lower
than those of the idle vCPU.

I have a theory about how it got into that situation and, if I'm right,
a rough idea of how to fix it.

We're using this bug, which Glen kindly created, to track this issue:

https://bugzilla.opensuse.org/show_bug.cgi?id=1165206#c3

But of course I'll keep upstream MLs updated as well.

Stay tuned. :-)
-- 
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)



End of thread.

Thread overview: 8+ messages
2020-01-07 14:25 [Xen-devel] CPU Lockup bug with the credit2 scheduler Alastair Browne
2020-02-17 19:58 ` Sarah Newman
2020-02-17 23:46   ` Sander Eikelenboom
2020-02-18  0:39     ` Glen
2020-02-18  6:51       ` Jürgen Groß
2020-02-18 20:57         ` Glen
2020-02-19 12:01   ` Dario Faggioli
2020-03-06 13:54   ` Dario Faggioli
