* Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
@ 2020-06-26 11:35 Steve McIntyre
  2020-06-26 13:41 ` Greg KH
  0 siblings, 1 reply; 11+ messages in thread
From: Steve McIntyre @ 2020-06-26 11:35 UTC (permalink / raw)
  To: stable; +Cc: 963493

Hi folks,

I'm the maintainer in Debian for strace. Trying to reproduce
https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've
found a repeatable hard lockup running the strace testsuite. Each time
it seems to have failed in a slightly different place in the testsuite
(suggesting it's not one particular syscall test that's triggering the
failure). I initially found this using Debian's current Buster kernel
(4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1
worked fine.

I've bisected to find the failure point along the linux-4.19.y stable
branch and what I've got to is the following commit:

e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
commit e58f543fc7c0926f31a49619c1a3648e49e8d233
Author: Jann Horn <jannh@google.com>
Date:   Thu Sep 13 18:12:09 2018 +0200

    apparmor: don't try to replace stale label in ptrace access check

    [ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ]

    As a comment above begin_current_label_crit_section() explains,
    begin_current_label_crit_section() must run in sleepable context because
    when label_is_stale() is true, aa_replace_current_label() runs, which uses
    prepare_creds(), which can sleep.
    Until now, the ptrace access check (which runs with a task lock held)
    violated this rule.

    Also add a might_sleep() assertion to begin_current_label_crit_section(),
    because asserts are less likely to be ignored than comments.

    Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
    Signed-off-by: Jann Horn <jannh@google.com>
    Signed-off-by: John Johansen <john.johansen@canonical.com>
    Signed-off-by: Sasha Levin <sashal@kernel.org>

:040000 040000 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security
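
(For reference, the bisection was the obvious one - roughly equivalent to:

    git bisect start v4.19.118 v4.19.98    # bad, then good
    # at each step: build, install, reboot, run the strace testsuite,
    # then "git bisect bad" or "git bisect good" as appropriate

on a checkout of the linux-4.19.y stable tree.)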

Considering I'm running strace build tests to provoke this bug,
finding the failure in a commit talking about ptrace changes does look
very suspicious...!

Annoyingly, I can't reproduce this on my disparate other machines
here, suggesting it's maybe(?) timing related.

Hope this helps - happy to give more information, test things, etc.

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
"Managing a volunteer open source project is a lot like herding
 kittens, except the kittens randomly appear and disappear because they
 have day jobs." -- Matt Mackall



* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 11:35 Repeatable hard lockup running strace testsuite on 4.19.98+ onwards Steve McIntyre
@ 2020-06-26 13:41 ` Greg KH
  2020-06-26 13:45   ` Steve McIntyre
  2020-06-26 14:25   ` Jann Horn
  0 siblings, 2 replies; 11+ messages in thread
From: Greg KH @ 2020-06-26 13:41 UTC (permalink / raw)
  To: Steve McIntyre; +Cc: Jann Horn, John Johansen, Sasha Levin, stable, 963493

On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
> Hi folks,
> 
> I'm the maintainer in Debian for strace. Trying to reproduce
> https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've
> found a repeatable hard lockup running the strace testsuite. Each time
> it seems to have failed in a slightly different place in the testsuite
> (suggesting it's not one particular syscall test that's triggering the
> failure). I initially found this using Debian's current Buster kernel
> (4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1
> worked fine.
> 
> I've bisected to find the failure point along the linux-4.19.y stable
> branch and what I've got to is the following commit:
> 
> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
> commit e58f543fc7c0926f31a49619c1a3648e49e8d233
> Author: Jann Horn <jannh@google.com>
> Date:   Thu Sep 13 18:12:09 2018 +0200
> 
>     apparmor: don't try to replace stale label in ptrace access check
> 
>     [ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ]
> 
>     As a comment above begin_current_label_crit_section() explains,
>     begin_current_label_crit_section() must run in sleepable context because
>     when label_is_stale() is true, aa_replace_current_label() runs, which uses
>     prepare_creds(), which can sleep.
>     Until now, the ptrace access check (which runs with a task lock held)
>     violated this rule.
> 
>     Also add a might_sleep() assertion to begin_current_label_crit_section(),
>     because asserts are less likely to be ignored than comments.
> 
>     Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
>     Signed-off-by: Jann Horn <jannh@google.com>
>     Signed-off-by: John Johansen <john.johansen@canonical.com>
>     Signed-off-by: Sasha Levin <sashal@kernel.org>
> 
> :040000 040000 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security
> 
> Considering I'm running strace build tests to provoke this bug,
> finding the failure in a commit talking about ptrace changes does look
> very suspicious...!
> 
> Annoyingly, I can't reproduce this on my disparate other machines
> here, suggesting it's maybe(?) timing related.
> 
> Hope this helps - happy to give more information, test things, etc.

So if you just revert this one patch, all works well?

I've added the authors of the patch to the cc: list...

Also, does this problem happen on Linus's tree?

thanks,

greg k-h


* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 13:41 ` Greg KH
@ 2020-06-26 13:45   ` Steve McIntyre
  2020-06-26 15:20     ` Steve McIntyre
  2020-06-26 14:25   ` Jann Horn
  1 sibling, 1 reply; 11+ messages in thread
From: Steve McIntyre @ 2020-06-26 13:45 UTC (permalink / raw)
  To: Greg KH; +Cc: Jann Horn, John Johansen, Sasha Levin, stable, 963493

Hey Greg,

On Fri, Jun 26, 2020 at 03:41:32PM +0200, Greg KH wrote:
>On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
>> 
>> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
>> commit e58f543fc7c0926f31a49619c1a3648e49e8d233
>> Author: Jann Horn <jannh@google.com>
>> Date:   Thu Sep 13 18:12:09 2018 +0200

...

>> Annoyingly, I can't reproduce this on my disparate other machines
>> here, suggesting it's maybe(?) timing related.
>> 
>> Hope this helps - happy to give more information, test things, etc.
>
>So if you just revert this one patch, all works well?

Correct.

>I've added the authors of the patch to the cc: list...
>
>Also, does this problem happen on Linus's tree?

Just building to check that now...

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
< Aardvark> I dislike C++ to start with. C++11 just seems to be
            handing rope-creating factories for users to hang multiple
            instances of themselves.



* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 13:41 ` Greg KH
  2020-06-26 13:45   ` Steve McIntyre
@ 2020-06-26 14:25   ` Jann Horn
  2020-06-26 16:50     ` Steve McIntyre
  1 sibling, 1 reply; 11+ messages in thread
From: Jann Horn @ 2020-06-26 14:25 UTC (permalink / raw)
  To: Greg KH; +Cc: Steve McIntyre, John Johansen, Sasha Levin, stable, 963493

On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
> > I'm the maintainer in Debian for strace. Trying to reproduce
> > https://bugs.debian.org/963462 on my machine (Thinkpad T470), I've
> > found a repeatable hard lockup running the strace testsuite. Each time
> > it seems to have failed in a slightly different place in the testsuite
> > (suggesting it's not one particular syscall test that's triggering the
> > failure). I initially found this using Debian's current Buster kernel
> > (4.19.118+2+deb10u1), then backtracking I found that 4.19.98+1+deb10u1
> > worked fine.
> >
> > I've bisected to find the failure point along the linux-4.19.y stable
> > branch and what I've got to is the following commit:
> >
> > e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
> > commit e58f543fc7c0926f31a49619c1a3648e49e8d233
> > Author: Jann Horn <jannh@google.com>
> > Date:   Thu Sep 13 18:12:09 2018 +0200
> >
> >     apparmor: don't try to replace stale label in ptrace access check
> >
> >     [ Upstream commit 1f8266ff58840d698a1e96d2274189de1bdf7969 ]
> >
> >     As a comment above begin_current_label_crit_section() explains,
> >     begin_current_label_crit_section() must run in sleepable context because
> >     when label_is_stale() is true, aa_replace_current_label() runs, which uses
> >     prepare_creds(), which can sleep.
> >     Until now, the ptrace access check (which runs with a task lock held)
> >     violated this rule.
> >
> >     Also add a might_sleep() assertion to begin_current_label_crit_section(),
> >     because asserts are less likely to be ignored than comments.
> >
> >     Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
> >     Signed-off-by: Jann Horn <jannh@google.com>
> >     Signed-off-by: John Johansen <john.johansen@canonical.com>
> >     Signed-off-by: Sasha Levin <sashal@kernel.org>
> >
> > :040000 040000 ca92f885a38c1747b812116f19de6967084a647e 865a227665e460e159502f21e8a16e6fa590bf50 M security
> >
> > Considering I'm running strace build tests to provoke this bug,
> > finding the failure in a commit talking about ptrace changes does look
> > very suspicious...!
> >
> > Annoyingly, I can't reproduce this on my disparate other machines
> > here, suggesting it's maybe(?) timing related.

Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
prints a warning to dmesg? If so, can you share that warning?

If you don't have any way to see console output, and you don't have a
working serial console setup or such, you may want to try re-running
those tests while the kernel is booted with netconsole enabled to log
to a different machine over UDP (see
https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
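
(For example - addresses and interface made up, adjust to taste:

    netconsole=6665@192.168.0.10/eth0,6666@192.168.0.2/00:11:22:33:44:55

on the kernel command line, with something like "nc -l -u 6666" running on
the receiving machine to capture the output.)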

You may want to try setting the sysctl kernel.sysrq=1 , then when the
system has locked up, press ALT+PRINT+L (to generate stack traces for
all active CPUs from NMI context), and maybe also ALT+PRINT+T and
ALT+PRINT+W (to collect more information about active tasks).
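
(i.e. "sysctl kernel.sysrq=1". If you still happen to have a working shell
somewhere, the same dumps can also be triggered with:

    echo l > /proc/sysrq-trigger    # backtraces for all active CPUs
    echo t > /proc/sysrq-trigger    # state of all tasks
    echo w > /proc/sysrq-trigger    # blocked (uninterruptible) tasks

but with a hard lockup the keyboard combos are probably your best bet.)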

(If you share stack traces from these things with us, it would be
helpful if you could run them through scripts/decode_stacktrace.sh
from the kernel tree first, to add line number information.)
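
(Typical invocation, assuming you still have the matching vmlinux from your
build around:

    ./scripts/decode_stacktrace.sh path/to/vmlinux < netconsole.log > decoded.log

- it reads the raw traces on stdin and writes the annotated version to stdout.)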


Trying to isolate the problem:

__end_current_label_crit_section and end_current_label_crit_section
are aliases of each other (via #define), so that line change can't
have done anything.

That leaves two possibilities AFAICS:
 - the might_sleep() call by itself is causing issues for one of the
remaining users of begin_current_label_crit_section() (because it
causes preemption to happen more often where it didn't happen on
PREEMPT_VOLUNTARY before, or because it's trying to print a warning
message with the runqueue lock held, or something like that)
 - the lack of "if (aa_replace_current_label(label) == 0)
aa_put_label(label);" in __begin_current_label_crit_section() is
somehow causing issues

You could try to see whether just adding the might_sleep(), or just
replacing the begin_current_label_crit_section() call with
__begin_current_label_crit_section(), triggers the bug.
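
(For reference, the two variants in security/apparmor/include/cred.h look
roughly like this in 4.19 after the patch - simplified and from memory, so
don't trust the details:

    static inline struct aa_label *__begin_current_label_crit_section(void)
    {
            struct aa_label *label = aa_current_raw_label();

            if (label_is_stale(label))
                    label = aa_get_newest_label(label);

            return label;
    }

    static inline struct aa_label *begin_current_label_crit_section(void)
    {
            struct aa_label *label = aa_current_raw_label();

            might_sleep();  /* the assertion added by the bisected commit */

            if (label_is_stale(label)) {
                    label = aa_get_newest_label(label);
                    /* only this variant swaps the task's label */
                    if (aa_replace_current_label(label) == 0)
                            aa_put_label(label);
            }

            return label;
    }

so the experiment above is just about separating those two differences.)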


If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if
that isn't already set in your kernel config -, that might help track
down the problem, unless it magically makes the problem stop
triggering (which I guess would be conceivable if this indeed is a
race).
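
(On a Debian kernel you can check whether it's already set with something
like "grep DEBUG_ATOMIC_SLEEP /boot/config-$(uname -r)".)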


* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 13:45   ` Steve McIntyre
@ 2020-06-26 15:20     ` Steve McIntyre
  0 siblings, 0 replies; 11+ messages in thread
From: Steve McIntyre @ 2020-06-26 15:20 UTC (permalink / raw)
  To: Greg KH; +Cc: Jann Horn, John Johansen, Sasha Levin, stable, 963493

On Fri, Jun 26, 2020 at 02:45:18PM +0100, Steve McIntyre wrote:
>Hey Greg,
>
>On Fri, Jun 26, 2020 at 03:41:32PM +0200, Greg KH wrote:
>>On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
>>> 
>>> e58f543fc7c0926f31a49619c1a3648e49e8d233 is the first bad commit
>>> commit e58f543fc7c0926f31a49619c1a3648e49e8d233
>>> Author: Jann Horn <jannh@google.com>
>>> Date:   Thu Sep 13 18:12:09 2018 +0200
>
>...
>
>>> Annoyingly, I can't reproduce this on my disparate other machines
>>> here, suggesting it's maybe(?) timing related.
>>> 
>>> Hope this helps - happy to give more information, test things, etc.
>>
>>So if you just revert this one patch, all works well?
>
>Correct.
>
>>I've added the authors of the patch to the cc: list...
>>
>>Also, does this problem happen on Linus's tree?
>
>Just building to check that now...

Current head here (3e08a95294a4fb3702bb3d35ed08028433c37fe6) works fine.

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
The two hard things in computing:
 * naming things
 * cache invalidation
 * off-by-one errors                  -- Stig Sandbeck Mathisen



* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 14:25   ` Jann Horn
@ 2020-06-26 16:50     ` Steve McIntyre
  2020-06-26 17:29       ` John Johansen
  2020-06-26 17:52       ` Steve McIntyre
  0 siblings, 2 replies; 11+ messages in thread
From: Steve McIntyre @ 2020-06-26 16:50 UTC (permalink / raw)
  To: Jann Horn; +Cc: Greg KH, John Johansen, Sasha Levin, stable, 963493

Hi Jann,

On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
>On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:

...

>> > Considering I'm running strace build tests to provoke this bug,
>> > finding the failure in a commit talking about ptrace changes does look
>> > very suspicious...!
>> >
>> > Annoyingly, I can't reproduce this on my disparate other machines
>> > here, suggesting it's maybe(?) timing related.
>
>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
>prints a warning to dmesg? If so, can you share that warning?

I mean the machine locks hard - X stops updating, the mouse/keyboard
stop responding. No pings, etc. When I reboot, there's nothing in the
logs.

>If you don't have any way to see console output, and you don't have a
>working serial console setup or such, you may want to try re-running
>those tests while the kernel is booted with netconsole enabled to log
>to a different machine over UDP (see
>https://www.kernel.org/doc/Documentation/networking/netconsole.txt).

ACK, will try that now for you.

>You may want to try setting the sysctl kernel.sysrq=1 , then when the
>system has locked up, press ALT+PRINT+L (to generate stack traces for
>all active CPUs from NMI context), and maybe also ALT+PRINT+T and
>ALT+PRINT+W (to collect more information about active tasks).

Nod.

>(If you share stack traces from these things with us, it would be
>helpful if you could run them through scripts/decode_stacktrace.sh
>from the kernel tree first, to add line number information.)

ACK.

>Trying to isolate the problem:
>
>__end_current_label_crit_section and end_current_label_crit_section
>are aliases of each other (via #define), so that line change can't
>have done anything.
>
>That leaves two possibilities AFAICS:
> - the might_sleep() call by itself is causing issues for one of the
>remaining users of begin_current_label_crit_section() (because it
>causes preemption to happen more often where it didn't happen on
>PREEMPT_VOLUNTARY before, or because it's trying to print a warning
>message with the runqueue lock held, or something like that)
> - the lack of "if (aa_replace_current_label(label) == 0)
>aa_put_label(label);" in __begin_current_label_crit_section() is
>somehow causing issues
>
>You could try to see whether just adding the might_sleep(), or just
>replacing the begin_current_label_crit_section() call with
>__begin_current_label_crit_section(), triggers the bug.
>
>
>If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if
>that isn't already set in your kernel config -, that might help track
>down the problem, unless it magically makes the problem stop
>triggering (which I guess would be conceivable if this indeed is a
>race).

OK, will try that second...

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
"I can't ever sleep on planes ... call it irrational if you like, but I'm
 afraid I'll miss my stop" -- Vivek Das Mohapatra



* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 16:50     ` Steve McIntyre
@ 2020-06-26 17:29       ` John Johansen
  2020-06-26 19:01         ` Steve McIntyre
  2020-06-26 17:52       ` Steve McIntyre
  1 sibling, 1 reply; 11+ messages in thread
From: John Johansen @ 2020-06-26 17:29 UTC (permalink / raw)
  To: Steve McIntyre, Jann Horn; +Cc: Greg KH, Sasha Levin, stable, 963493

On 6/26/20 9:50 AM, Steve McIntyre wrote:
> Hi Jann,
> 
> On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
>> On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
> 
> ...
> 
>>>> Considering I'm running strace build tests to provoke this bug,
>>>> finding the failure in a commit talking about ptrace changes does look
>>>> very suspicious...!
>>>>
>>>> Annoyingly, I can't reproduce this on my disparate other machines
>>>> here, suggesting it's maybe(?) timing related.
>>
>> Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
>> prints a warning to dmesg? If so, can you share that warning?
> 
> I mean the machine locks hard - X stops updating, the mouse/keyboard
> stop responding. No pings, etc. When I reboot, there's nothing in the
> logs.
> 
>> If you don't have any way to see console output, and you don't have a
>> working serial console setup or such, you may want to try re-running
>> those tests while the kernel is booted with netconsole enabled to log
>> to a different machine over UDP (see
>> https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
> 
> ACK, will try that now for you.
> 
>> You may want to try setting the sysctl kernel.sysrq=1 , then when the
>> system has locked up, press ALT+PRINT+L (to generate stack traces for
>> all active CPUs from NMI context), and maybe also ALT+PRINT+T and
>> ALT+PRINT+W (to collect more information about active tasks).
> 
> Nod.
> 
>> (If you share stack traces from these things with us, it would be
>> helpful if you could run them through scripts/decode_stacktrace.sh
>> from the kernel tree first, to add line number information.)
> 
> ACK.
> 
>> Trying to isolate the problem:
>>
>> __end_current_label_crit_section and end_current_label_crit_section
>> are aliases of each other (via #define), so that line change can't
>> have done anything.
>>
>> That leaves two possibilities AFAICS:
>> - the might_sleep() call by itself is causing issues for one of the
>> remaining users of begin_current_label_crit_section() (because it
>> causes preemption to happen more often where it didn't happen on
>> PREEMPT_VOLUNTARY before, or because it's trying to print a warning
>> message with the runqueue lock held, or something like that)
>> - the lack of "if (aa_replace_current_label(label) == 0)
>> aa_put_label(label);" in __begin_current_label_crit_section() is
>> somehow causing issues
>>
>> You could try to see whether just adding the might_sleep(), or just
>> replacing the begin_current_label_crit_section() call with
>> __begin_current_label_crit_section(), triggers the bug.
>>
>>
>> If you could recompile the kernel with CONFIG_DEBUG_ATOMIC_SLEEP - if
>> that isn't already set in your kernel config -, that might help track
>> down the problem, unless it magically makes the problem stop
>> triggering (which I guess would be conceivable if this indeed is a
>> race).
> 
> OK, will try that second...
> 

I have not been able to reproduce this, but looking at linux-4.19.y it
looks like
1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check

was picked without
ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check

Both of them are marked as
Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")

so I would expect them to be picked together.
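
(One way to spot companions like this is to search Linus's tree for other
commits carrying the same Fixes: tag, e.g.:

    git log --oneline --grep='Fixes: b2d09ae449ced'

which should list both of the commits above.)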


ptraceme is potentially updating the task's cred while the access check is
running.

Try building after picking
ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
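
(e.g., with Linus's tree available as a remote, "git cherry-pick ca3fde5214e1"
on top of your failing linux-4.19.y checkout - it's a small patch, so it
should apply with at most trivial fixups.)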


* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 16:50     ` Steve McIntyre
  2020-06-26 17:29       ` John Johansen
@ 2020-06-26 17:52       ` Steve McIntyre
  2020-06-26 18:27         ` Jann Horn
  1 sibling, 1 reply; 11+ messages in thread
From: Steve McIntyre @ 2020-06-26 17:52 UTC (permalink / raw)
  To: Jann Horn; +Cc: Greg KH, John Johansen, Sasha Levin, stable, 963493

[-- Attachment #1: Type: text/plain, Size: 1999 bytes --]

Hi again,

On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote:
>On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
>>On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote:
>>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
>
>...
>
>>> > Considering I'm running strace build tests to provoke this bug,
>>> > finding the failure in a commit talking about ptrace changes does look
>>> > very suspicious...!
>>> >
>>> > Annoyingly, I can't reproduce this on my disparate other machines
>>> > here, suggesting it's maybe(?) timing related.
>>
>>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
>>prints a warning to dmesg? If so, can you share that warning?
>
>I mean the machine locks hard - X stops updating, the mouse/keyboard
>stop responding. No pings, etc. When I reboot, there's nothing in the
>logs.
>
>>If you don't have any way to see console output, and you don't have a
>>working serial console setup or such, you may want to try re-running
>>those tests while the kernel is booted with netconsole enabled to log
>>to a different machine over UDP (see
>>https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
>
>ACK, will try that now for you.
>
>>You may want to try setting the sysctl kernel.sysrq=1 , then when the
>>system has locked up, press ALT+PRINT+L (to generate stack traces for
>>all active CPUs from NMI context), and maybe also ALT+PRINT+T and
>>ALT+PRINT+W (to collect more information about active tasks).
>
>Nod.
>
>>(If you share stack traces from these things with us, it would be
>>helpful if you could run them through scripts/decode_stacktrace.sh
>>from the kernel tree first, to add line number information.)
>
>ACK.

Output passed through scripts/decode_stacktrace.sh attached.

Just about to try John's suggestion next.

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
Dance like no one's watching. Encrypt like everyone is.
 - @torproject

[-- Attachment #2: tack.log.decoded.gz --]
[-- Type: application/gzip, Size: 32484 bytes --]


* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 17:52       ` Steve McIntyre
@ 2020-06-26 18:27         ` Jann Horn
  0 siblings, 0 replies; 11+ messages in thread
From: Jann Horn @ 2020-06-26 18:27 UTC (permalink / raw)
  To: Steve McIntyre; +Cc: Greg KH, John Johansen, Sasha Levin, stable, 963493

On Fri, Jun 26, 2020 at 7:52 PM Steve McIntyre <steve@einval.com> wrote:
> On Fri, Jun 26, 2020 at 05:50:00PM +0100, Steve McIntyre wrote:
> >On Fri, Jun 26, 2020 at 04:25:59PM +0200, Jann Horn wrote:
> >>On Fri, Jun 26, 2020 at 3:41 PM Greg KH <gregkh@linuxfoundation.org> wrote:
> >>> On Fri, Jun 26, 2020 at 12:35:58PM +0100, Steve McIntyre wrote:
> >
> >...
> >
> >>> > Considering I'm running strace build tests to provoke this bug,
> >>> > finding the failure in a commit talking about ptrace changes does look
> >>> > very suspicious...!
> >>> >
> >>> > Annoyingly, I can't reproduce this on my disparate other machines
> >>> > here, suggesting it's maybe(?) timing related.
> >>
> >>Does "hard lockup" mean that the HARDLOCKUP_DETECTOR infrastructure
> >>prints a warning to dmesg? If so, can you share that warning?
> >
> >I mean the machine locks hard - X stops updating, the mouse/keyboard
> >stop responding. No pings, etc. When I reboot, there's nothing in the
> >logs.
> >
> >>If you don't have any way to see console output, and you don't have a
> >>working serial console setup or such, you may want to try re-running
> >>those tests while the kernel is booted with netconsole enabled to log
> >>to a different machine over UDP (see
> >>https://www.kernel.org/doc/Documentation/networking/netconsole.txt).
> >
> >ACK, will try that now for you.
> >
> >>You may want to try setting the sysctl kernel.sysrq=1 , then when the
> >>system has locked up, press ALT+PRINT+L (to generate stack traces for
> >>all active CPUs from NMI context), and maybe also ALT+PRINT+T and
> >>ALT+PRINT+W (to collect more information about active tasks).
> >
> >Nod.
> >
> >>(If you share stack traces from these things with us, it would be
> >>helpful if you could run them through scripts/decode_stacktrace.sh
> >>from the kernel tree first, to add line number information.)
> >
> >ACK.
>
> Output passed through scripts/decode_stacktrace.sh attached.
>
> Just about to try John's suggestion next.

Okay, so this is some sort of deadlock...

Looking at the NMI backtraces, all the CPUs are blocked on spinlocks:
CPU 3 is blocked on current->sighand->siglock, in tty_open_proc_set_tty()
CPU 1 is blocked on... I'm not sure which lock, somewhere in do_wait()
CPU 2 is blocked on something, somewhere in ptrace_stop()
CPU 0 is stuck on a lock in do_exit()

So I think it's probably something like a classic deadlock, or a
sleeping-in-atomic issue, or a lock-balancing issue (or memory
corruption, that can cause all kinds of weird errors)?

If it really is a classic deadlock, CONFIG_PROVE_LOCKING=y should be
able to pinpoint the issue.
If it is a sleeping-in-atomic issue, CONFIG_DEBUG_ATOMIC_SLEEP=y should help.
If it is memory corruption, CONFIG_KASAN=y should discover it... but
that might majorly mess up the timing, so if this really is a race,
that might not work.

Maybe flip all of those on, and if it doesn't reproduce anymore, turn
off CONFIG_KASAN and try again?
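
(Something like this should do it, assuming your tree has scripts/config:

    ./scripts/config -e PROVE_LOCKING -e DEBUG_ATOMIC_SLEEP -e KASAN
    make olddefconfig

and then rebuild as usual.)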


* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 17:29       ` John Johansen
@ 2020-06-26 19:01         ` Steve McIntyre
  2020-06-26 19:57           ` Sasha Levin
  0 siblings, 1 reply; 11+ messages in thread
From: Steve McIntyre @ 2020-06-26 19:01 UTC (permalink / raw)
  To: John Johansen; +Cc: Jann Horn, Greg KH, Sasha Levin, stable, 963493

On Fri, Jun 26, 2020 at 10:29:28AM -0700, John Johansen wrote:
>On 6/26/20 9:50 AM, Steve McIntyre wrote:
>> 
>> OK, will try that second...
>> 
>
>I have not been able to reproduce this, but looking at linux-4.19.y it
>looks like
>1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check
>
>was picked without
>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
>
>Both of them are marked as
>Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
>
>so I would expect them to be picked together.
>
>ptraceme is potentially updating the task's cred while the access check is
>running.
>
>Try building after picking
>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check

Bingo! With that one change the test suite runs to completion, no lockup.

\o/

Thanks guys, I think we've found the cause here.

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
"The whole problem with the world is that fools and fanatics are
 always so certain of themselves, and wiser people so full of doubts."
   -- Bertrand Russell



* Re: Repeatable hard lockup running strace testsuite on 4.19.98+ onwards
  2020-06-26 19:01         ` Steve McIntyre
@ 2020-06-26 19:57           ` Sasha Levin
  0 siblings, 0 replies; 11+ messages in thread
From: Sasha Levin @ 2020-06-26 19:57 UTC (permalink / raw)
  To: Steve McIntyre; +Cc: John Johansen, Jann Horn, Greg KH, stable, 963493

On Fri, Jun 26, 2020 at 08:01:18PM +0100, Steve McIntyre wrote:
>On Fri, Jun 26, 2020 at 10:29:28AM -0700, John Johansen wrote:
>>On 6/26/20 9:50 AM, Steve McIntyre wrote:
>>>
>>> OK, will try that second...
>>>
>>
>>I have not been able to reproduce but
>>
>>So looking at linux-4.19.y it looks like
>>1f8266ff5884 apparmor: don't try to replace stale label in ptrace access check
>>
>>was picked without
>>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
>>
>>Both of them are marked as
>>Fixes: b2d09ae449ced ("apparmor: move ptrace checks to using labels")
>>
>>so I would expect them to be picked together.
>>
>>ptraceme is potentially updating the task's cred while the access check is
>>running.
>>
>>Try building after picking
>>ca3fde5214e1 apparmor: don't try to replace stale label in ptraceme check
>
>Bingo! With that one change the test suite runs to completion, no lockup.

I've queued ca3fde5214e1 for 4.19 and 4.14, thank you for your work!

-- 
Thanks,
Sasha

