* [Xen-devel] Stopping much Linux testing in Xen Project CI
@ 2020-03-12 16:49 Ian Jackson
  2020-03-12 16:51 ` Ian Jackson
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Ian Jackson @ 2020-03-12 16:49 UTC
  To: Konrad Rzeszutek Wilk, Stefano Stabellini,
	Oleksandr Andrushchenko, Roger Pau Monné,
	Boris Ostrovsky, Juergen Gross, Wei Liu, Paul Durrant
  Cc: xen-devel

Linux stable branches, and Linux upstream tip, are badly broken and
have been for months.  Apparently no-one is able to (or has time to)
investigate and fix.

  linux-4.4          218 days         to be suspended
  linux-4.9          134 days         to be suspended
  linux-4.14         134 days         to be suspended
  linux-4.19         134 days         to be suspended
  linux-5.4           55 days
  linux-arm-xen     up to date
  linux-linus        372 days         to be suspended

These are times since the last push - i.e., how long it has been broken.
Evidently no-one is paying any attention to this.[1]  I looked at the
reports myself and:

Nested HVM is broken on Intel in all of the 4.x branches.
Additionally:

Linux 4.4 has some intermittent guest start failure for 32-bit PV.

Linux 4.14 does not boot on 32-bit ARM.  There are also some 64-bit
x86 migration failures.

The most recent reports (last week or two) are afflicted by underlying
CI problems - what look like sticky PDU relays, or what may be
problems in the Debian mirror network (I have definitely seen problems
there), so the reports are rather noisy.  Sorry about that.  I am
trying to improve this situation but it is quite difficult [2].  But
overall it is clear that the underlying code is broken.

The repeated almost-certainly-doomed retests are using too much of
osstest's capacity.  I am going to stop testing all of these 4.x
branches, and of linux-linus, until someone tells me they think the
fix(es) are in the relevant branch(es).

This means that we will *no longer have any visibility of breakage in
much of upstream Linux*.  I think this is "fine" because right now
no-one appears to be doing anything with the information.

I didn't look at master or at 5.4.

Thanks for your attention.

Ian.

[1] osstest does not have a triage team monitoring the reports.  There
is just about half of me to run the whole system.  We are relying on
maintainers of Xen-related code noticing when things break and
proactively trying to investigate and fix them.

[2] Amongst other things, someone has managed to patent, in the USA,
the obvious[3] solution to PDU relays sticking.  PDUs which do
zero-crossing switching are available in Europe but are hens' teeth in the
US.  The patentholder doesn't actually make and sell the things of
course so we can't even solve the problem by paying them a bad-law
tax.

[3] Obvious to anyone with a reasonable electrical engineering background
like, for example, me.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel


* Re: [Xen-devel] Stopping much Linux testing in Xen Project CI
  2020-03-12 16:49 [Xen-devel] Stopping much Linux testing in Xen Project CI Ian Jackson
@ 2020-03-12 16:51 ` Ian Jackson
  2020-03-12 17:06 ` Jürgen Groß
  2020-03-12 17:55 ` Roger Pau Monné
  2 siblings, 0 replies; 6+ messages in thread
From: Ian Jackson @ 2020-03-12 16:51 UTC
  To: Konrad Rzeszutek Wilk, Stefano Stabellini,
	Oleksandr Andrushchenko, Roger Pau Monné,
	Boris Ostrovsky, Juergen Gross, Wei Liu, Paul Durrant, xen-devel

Additionally,

  linux-next   2162 days

This is obviously useless.  I am stopping it.

Ian.


* Re: [Xen-devel] Stopping much Linux testing in Xen Project CI
  2020-03-12 16:49 [Xen-devel] Stopping much Linux testing in Xen Project CI Ian Jackson
  2020-03-12 16:51 ` Ian Jackson
@ 2020-03-12 17:06 ` Jürgen Groß
  2020-03-12 17:55 ` Roger Pau Monné
  2 siblings, 0 replies; 6+ messages in thread
From: Jürgen Groß @ 2020-03-12 17:06 UTC
  To: Ian Jackson, Konrad Rzeszutek Wilk, Stefano Stabellini,
	Oleksandr Andrushchenko, Roger Pau Monné,
	Boris Ostrovsky, Wei Liu, Paul Durrant
  Cc: xen-devel

On 12.03.20 17:49, Ian Jackson wrote:
> Linux stable branches, and Linux upstream tip, are badly broken and
> have been for months.  Apparently no-one is able to (or has time to)
> investigate and fix.
> 
>    linux-4.4          218 days         to be suspended
>    linux-4.9          134 days         to be suspended
>    linux-4.14         134 days         to be suspended
>    linux-4.19         134 days         to be suspended
>    linux-5.4           55 days
>    linux-arm-xen     up to date
>    linux-linus        372 days         to be suspended
> 
> These are times since the last push - ie, how long it has been broken.
> Evidently no-one is paying any attention to this.[1]  I looked at the
> reports myself and:
> 
> Nested HVM is broken on Intel in all of the 4.x branches.

I have looked into the test failures multiple times, and always found
that problem. Honestly I don't see how this should be the kernel's
fault, so I rather quickly gave up each time.

> Additionally:
> 
> Linux 4.4 has some intermittent guest start failure for 32-bit PV.
> 
> Linux 4.14 does not boot on 32-bit ARM.  There are also some 64-bit
> x86 migration failures.
> 
> The most recent reports (last week or two) are afflicted by underlying
> CI problems - what look like sticky PDU relays, or what may be
> problems in the Debian mirror network (I have definitely seen problems
> there), so the reports are rather noisy.  Sorry about that.  I am
> trying to improve this situation but it is quite difficult [2].  But
> overall it is clear that the underlying code is broken.

I know I have said so before: I still think that having our tests rely
on the Debian servers (and their ongoing support for a selected version)
is not the optimal setup.


Juergen


* Re: [Xen-devel] Stopping much Linux testing in Xen Project CI
  2020-03-12 16:49 [Xen-devel] Stopping much Linux testing in Xen Project CI Ian Jackson
  2020-03-12 16:51 ` Ian Jackson
  2020-03-12 17:06 ` Jürgen Groß
@ 2020-03-12 17:55 ` Roger Pau Monné
  2020-03-13  9:13   ` Jan Beulich
  2 siblings, 1 reply; 6+ messages in thread
From: Roger Pau Monné @ 2020-03-12 17:55 UTC
  To: Ian Jackson
  Cc: Juergen Gross, Wei Liu, Stefano Stabellini,
	Oleksandr Andrushchenko, Konrad Rzeszutek Wilk, Paul Durrant,
	xen-devel, Boris Ostrovsky

On Thu, Mar 12, 2020 at 04:49:51PM +0000, Ian Jackson wrote:
> Linux stable branches, and Linux upstream tip, are badly broken and
> have been for months.  Apparently no-one is able to (or has time to)
> investigate and fix.
> 
>   linux-4.4          218 days         to be suspended
>   linux-4.9          134 days         to be suspended
>   linux-4.14         134 days         to be suspended
>   linux-4.19         134 days         to be suspended
>   linux-5.4           55 days
>   linux-arm-xen     up to date
>   linux-linus        372 days         to be suspended
> 
> These are times since the last push - ie, how long it has been broken.
> Evidently no-one is paying any attention to this.[1]  I looked at the
> reports myself and:
> 
> Nested HVM is broken on Intel in all of the 4.x branches.

FWIW, it's the Debian installer kernel that crashes AFAICT;
all the failures are:

[    0.000000] Linux version 4.9.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02)
[...]
[    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 30580167144 ns
[    0.000000] tsc: Fast TSC calibration failed
[    0.000000] tsc: Unable to calibrate against PIT
[    0.000000] tsc: HPET/PMTIMER calibration failed
[    0.000000] divide error: 0000 [#1] SMP
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u3
[    0.000000] Hardware name: Xen HVM domU, BIOS 4.14-unstable 03/11/2020
[    0.000000] task: ffffffffab611500 task.stack: ffffffffab600000
[    0.000000] RIP: 0010:[<ffffffffaaa59e1f>]  [<ffffffffaaa59e1f>] pvclock_tsc_khz+0xf/0x30
[    0.000000] RSP: 0000:ffffffffab603f38  EFLAGS: 00010246
[    0.000000] RAX: 000f424000000000 RBX: ffffffffffffffff RCX: 0000000000000000
[    0.000000] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffffab939020
[    0.000000] RBP: ffff93806e8f1540 R08: 000000003a637374 R09: 6f6974617262696c
[    0.000000] R10: 00000032f3af6dcd R11: 4d502f5445504820 R12: ffffffffab7dc920
[    0.000000] R13: ffffffffab7e82e0 R14: 00000000000146f0 R15: 000000000000008e
[    0.000000] FS:  0000000000000000(0000) GS:ffff93806e600000(0000) knlGS:0000000000000000
[    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    0.000000] CR2: ffff938065f3a000 CR3: 0000000025c08000 CR4: 00000000000406b0
[    0.000000] Stack:
[    0.000000]  ffffffffab74b1b6 ffff93806e8f1540 ffffffffab7dc920 ba81e537ba81e512
[    0.000000]  ffffffffffffffff ffff93806e8f1540 ffffffffab73deb6 ffffffffab7e82e0
[    0.000000]  0000000000000000 0000000000000020 0000ffffffffab73 00000000ffffffff
[    0.000000] Call Trace:
[    0.000000]  [<ffffffffab74b1b6>] ? tsc_init+0x39/0x25b
[    0.000000]  [<ffffffffab73deb6>] ? start_kernel+0x39f/0x46b
[    0.000000]  [<ffffffffab73d120>] ? early_idt_handler_array+0x120/0x120
[    0.000000]  [<ffffffffab73d408>] ? x86_64_start_kernel+0x14c/0x170
[    0.000000] Code: a6 bc 00 c0 9d a5 aa 0f 94 c0 c3 90 40 88 3d cd ea cb 00 c3 0f 1f 84 00 00 00 00 00 8b 4f 18 31 d2 48 b8 00 00 00 00 40 42 0f 00 <48> f7 f1 0f b6 57 1c 89 d1 f7 d9 48 89 c6 48 d3 e6 89 d1 48 d3 
[    0.000000] RIP  [<ffffffffaaa59e1f>] pvclock_tsc_khz+0xf/0x30
[    0.000000]  RSP <ffffffffab603f38>
[    0.000000] random: fast init done
[    0.000000] ---[ end trace 21c3bd5ec174e388 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!

On all branches it's blocked by 4.9.0-6-amd64 from Debian failing, not
the kernel under test (which could also fail, but we don't even get
there).

I have started a repro and will look into it tomorrow.

Regards, Roger.


* Re: [Xen-devel] Stopping much Linux testing in Xen Project CI
  2020-03-12 17:55 ` Roger Pau Monné
@ 2020-03-13  9:13   ` Jan Beulich
  2020-03-13 10:21     ` Jürgen Groß
  0 siblings, 1 reply; 6+ messages in thread
From: Jan Beulich @ 2020-03-13  9:13 UTC
  To: Roger Pau Monné, Juergen Gross
  Cc: Wei Liu, Stefano Stabellini, Oleksandr Andrushchenko,
	Paul Durrant, Konrad Rzeszutek Wilk, xen-devel, Ian Jackson,
	Boris Ostrovsky

On 12.03.2020 18:55, Roger Pau Monné wrote:
> On Thu, Mar 12, 2020 at 04:49:51PM +0000, Ian Jackson wrote:
>> Linux stable branches, and Linux upstream tip, are badly broken and
>> have been for months.  Apparently no-one is able to (or has time to)
>> investigate and fix.
>>
>>   linux-4.4          218 days         to be suspended
>>   linux-4.9          134 days         to be suspended
>>   linux-4.14         134 days         to be suspended
>>   linux-4.19         134 days         to be suspended
>>   linux-5.4           55 days
>>   linux-arm-xen     up to date
>>   linux-linus        372 days         to be suspended
>>
>> These are times since the last push - ie, how long it has been broken.
>> Evidently no-one is paying any attention to this.[1]  I looked at the
>> reports myself and:
>>
>> Nested HVM is broken on Intel in all of the 4.x branches.
> 
> FWIW, it's the Debian installer kernel that crashes AFAICT;
> all the failures are:
> 
> [    0.000000] Linux version 4.9.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02)
> [...]
> [    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 30580167144 ns
> [    0.000000] tsc: Fast TSC calibration failed
> [    0.000000] tsc: Unable to calibrate against PIT
> [    0.000000] tsc: HPET/PMTIMER calibration failed
> [    0.000000] divide error: 0000 [#1] SMP
> [    0.000000] Modules linked in:
> [    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u3
> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.14-unstable 03/11/2020
> [    0.000000] task: ffffffffab611500 task.stack: ffffffffab600000
> [    0.000000] RIP: 0010:[<ffffffffaaa59e1f>]  [<ffffffffaaa59e1f>] pvclock_tsc_khz+0xf/0x30

Seeing this and ...

> [    0.000000] RSP: 0000:ffffffffab603f38  EFLAGS: 00010246
> [    0.000000] RAX: 000f424000000000 RBX: ffffffffffffffff RCX: 0000000000000000
> [    0.000000] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffffab939020
> [    0.000000] RBP: ffff93806e8f1540 R08: 000000003a637374 R09: 6f6974617262696c
> [    0.000000] R10: 00000032f3af6dcd R11: 4d502f5445504820 R12: ffffffffab7dc920
> [    0.000000] R13: ffffffffab7e82e0 R14: 00000000000146f0 R15: 000000000000008e
> [    0.000000] FS:  0000000000000000(0000) GS:ffff93806e600000(0000) knlGS:0000000000000000
> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    0.000000] CR2: ffff938065f3a000 CR3: 0000000025c08000 CR4: 00000000000406b0
> [    0.000000] Stack:
> [    0.000000]  ffffffffab74b1b6 ffff93806e8f1540 ffffffffab7dc920 ba81e537ba81e512
> [    0.000000]  ffffffffffffffff ffff93806e8f1540 ffffffffab73deb6 ffffffffab7e82e0
> [    0.000000]  0000000000000000 0000000000000020 0000ffffffffab73 00000000ffffffff
> [    0.000000] Call Trace:
> [    0.000000]  [<ffffffffab74b1b6>] ? tsc_init+0x39/0x25b

... this and looking at xen_tsc_khz(), isn't it supposed to use
per_cpu(xen_vcpu, 0) instead, in case vCPU info got relocated?
(Code looks to be the same in 4.9 and 5.5. I'd also question
the hard-coded zero in there, but that's a different topic.)

Jan


* Re: [Xen-devel] Stopping much Linux testing in Xen Project CI
  2020-03-13  9:13   ` Jan Beulich
@ 2020-03-13 10:21     ` Jürgen Groß
  0 siblings, 0 replies; 6+ messages in thread
From: Jürgen Groß @ 2020-03-13 10:21 UTC
  To: Jan Beulich, Roger Pau Monné
  Cc: Wei Liu, Stefano Stabellini, Oleksandr Andrushchenko,
	Paul Durrant, Konrad Rzeszutek Wilk, xen-devel, Ian Jackson,
	Boris Ostrovsky

On 13.03.20 10:13, Jan Beulich wrote:
> On 12.03.2020 18:55, Roger Pau Monné wrote:
>> On Thu, Mar 12, 2020 at 04:49:51PM +0000, Ian Jackson wrote:
>>> Linux stable branches, and Linux upstream tip, are badly broken and
>>> have been for months.  Apparently no-one is able to (or has time to)
>>> investigate and fix.
>>>
>>>    linux-4.4          218 days         to be suspended
>>>    linux-4.9          134 days         to be suspended
>>>    linux-4.14         134 days         to be suspended
>>>    linux-4.19         134 days         to be suspended
>>>    linux-5.4           55 days
>>>    linux-arm-xen     up to date
>>>    linux-linus        372 days         to be suspended
>>>
>>> These are times since the last push - ie, how long it has been broken.
>>> Evidently no-one is paying any attention to this.[1]  I looked at the
>>> reports myself and:
>>>
>>> Nested HVM is broken on Intel in all of the 4.x branches.
>>
>> FWIW, it's the Debian installer kernel that crashes AFAICT;
>> all the failures are:
>>
>> [    0.000000] Linux version 4.9.0-6-amd64 (debian-kernel@lists.debian.org) (gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) ) #1 SMP Debian 4.9.82-1+deb9u3 (2018-03-02)
>> [...]
>> [    0.000000] clocksource: hpet: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 30580167144 ns
>> [    0.000000] tsc: Fast TSC calibration failed
>> [    0.000000] tsc: Unable to calibrate against PIT
>> [    0.000000] tsc: HPET/PMTIMER calibration failed
>> [    0.000000] divide error: 0000 [#1] SMP
>> [    0.000000] Modules linked in:
>> [    0.000000] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-6-amd64 #1 Debian 4.9.82-1+deb9u3
>> [    0.000000] Hardware name: Xen HVM domU, BIOS 4.14-unstable 03/11/2020
>> [    0.000000] task: ffffffffab611500 task.stack: ffffffffab600000
>> [    0.000000] RIP: 0010:[<ffffffffaaa59e1f>]  [<ffffffffaaa59e1f>] pvclock_tsc_khz+0xf/0x30
> 
> Seeing this and ...
> 
>> [    0.000000] RSP: 0000:ffffffffab603f38  EFLAGS: 00010246
>> [    0.000000] RAX: 000f424000000000 RBX: ffffffffffffffff RCX: 0000000000000000
>> [    0.000000] RDX: 0000000000000000 RSI: 0000000000000246 RDI: ffffffffab939020
>> [    0.000000] RBP: ffff93806e8f1540 R08: 000000003a637374 R09: 6f6974617262696c
>> [    0.000000] R10: 00000032f3af6dcd R11: 4d502f5445504820 R12: ffffffffab7dc920
>> [    0.000000] R13: ffffffffab7e82e0 R14: 00000000000146f0 R15: 000000000000008e
>> [    0.000000] FS:  0000000000000000(0000) GS:ffff93806e600000(0000) knlGS:0000000000000000
>> [    0.000000] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    0.000000] CR2: ffff938065f3a000 CR3: 0000000025c08000 CR4: 00000000000406b0
>> [    0.000000] Stack:
>> [    0.000000]  ffffffffab74b1b6 ffff93806e8f1540 ffffffffab7dc920 ba81e537ba81e512
>> [    0.000000]  ffffffffffffffff ffff93806e8f1540 ffffffffab73deb6 ffffffffab7e82e0
>> [    0.000000]  0000000000000000 0000000000000020 0000ffffffffab73 00000000ffffffff
>> [    0.000000] Call Trace:
>> [    0.000000]  [<ffffffffab74b1b6>] ? tsc_init+0x39/0x25b
> 
> ... this and looking at xen_tsc_khz(), isn't it supposed to use
> per_cpu(xen_vcpu, 0) instead, in case vCPU info got relocated?
> (Code looks to be the same in 4.9 and 5.5. I'd also question
> the hard-coded zero in there, but that's a different topic.)

It should use per_cpu(xen_vcpu, 0), but OTOH it shouldn't matter that
much if it doesn't, as the time information from the shared info page
wouldn't go away.

Seeing a zero divisor here indicates that HYPERVISOR_shared_info might
still point to the dummy shared info structure.
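
For readers following the trace, here is a minimal userspace sketch of
the arithmetic that appears to fault in pvclock_tsc_khz().  It is an
illustration only, not the kernel source: the struct and function names
(time_info_sketch, tsc_khz_sketch) are hypothetical, only the two
fields involved are modelled, and the zero check is an addition for
demonstration that the crashing code evidently does not have.  The
dividend 1000000 << 32 equals 0x000f424000000000, which matches RAX in
the register dump, and RCX (the divisor) is zero there.

```c
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the per-vCPU time info:
 * only the scaling multiplier and shift are modelled here. */
struct time_info_sketch {
    uint32_t tsc_to_system_mul;  /* 32.32 fixed-point TSC-to-ns multiplier */
    int8_t   tsc_shift;          /* pre-scaling shift applied to the TSC */
};

/* Derive the TSC frequency in kHz from the scaling parameters.
 * Returns 0 when the multiplier is zero, i.e. when the time info has
 * not been filled in (for example an all-zero dummy structure).
 * Without this guard the division traps, which on x86 is reported as
 * "divide error". */
static uint64_t tsc_khz_sketch(const struct time_info_sketch *src)
{
    uint64_t khz = 1000000ULL << 32;   /* 0x000f424000000000 */

    if (src->tsc_to_system_mul == 0)
        return 0;                      /* assumption: treat as "unknown" */

    khz /= src->tsc_to_system_mul;
    if (src->tsc_shift < 0)
        khz <<= -src->tsc_shift;
    else
        khz >>= src->tsc_shift;
    return khz;
}
```

With a multiplier of 0x80000000 and a shift of 0 this yields 2000000
kHz (a 2 GHz TSC); with an all-zero structure it returns 0 rather than
trapping.  Whether returning 0 is the right response in the real code
is a separate question; the sketch is only meant to show why a zeroed
time info structure leads to the divide error seen above.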


Juergen
