* possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
@ 2020-10-26 11:26 Filipe Manana
  2020-10-26 11:40 ` Peter Zijlstra
  2020-11-03 14:08 ` Boqun Feng
  0 siblings, 2 replies; 26+ messages in thread
From: Filipe Manana @ 2020-10-26 11:26 UTC (permalink / raw)
  To: LKML; +Cc: peterz, Jan Kara, David Sterba

Hello,

I've recently started to hit a warning followed by tasks hanging after
attempts to freeze a filesystem. A git bisection pointed to the
following commit:

commit 4d004099a668c41522242aa146a38cc4eb59cb1e
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Oct 2 11:04:21 2020 +0200

    lockdep: Fix lockdep recursion

This happens very reliably when running all xfstests with lockdep
enabled, and the tested filesystem is btrfs (haven't tried other
filesystems, but it shouldn't matter). The warning and task hangs always
happen at either test generic/068 or test generic/390, and (oddly)
always have to run all tests for it to trigger, running those tests
individually on an infinite loop doesn't seem to trigger it (at least
for a couple hours).

The warning triggered is at fs/super.c:__sb_start_write() which always
results later in several tasks hanging on a percpu rw_sem:

https://pastebin.com/qnLvf94E

What happens is percpu_rwsem_is_held() is apparently returning a false
positive, so this makes __sb_start_write() do a
percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
is expected to always succeed, because if the calling task is holding a
freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
the semaphore at level 2, since the freeze semaphores are always
acquired by increasing level order.

But the try_lock fails, which triggers the warning at __sb_start_write().
Its caller sb_start_pagefault() then ignores the return value, and
callers such as btrfs_page_mkwrite() assume the freeze semaphore was
taken, proceed to do their stuff, and later call sb_end_pagefault(),
which will do an up_read() on the percpu_rwsem at level 2 despite never
having been able to down_read() the semaphore. This obviously corrupts
the semaphore's read_count state, and later causes any task trying to
down_write() it to hang forever.
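
To make that sequence concrete, here's a minimal sketch of the code paths
involved (simplified from the 5.9/5.10-era fs/super.c, with the levels
hardcoded and most details dropped; not the exact kernel code):

#include <linux/fs.h>
#include <linux/percpu-rwsem.h>

/* Simplified sketch only; not the exact fs/super.c code. */
static int sketch_sb_start_pagefault(struct super_block *sb)
{
        /*
         * With CONFIG_LOCKDEP, __sb_start_write() checks whether a lower
         * freeze level is already held and, if so, switches to trylock
         * mode, which is expected to always succeed.
         */
        if (percpu_rwsem_is_held(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1)) {
                /* The false positive above forces this path... */
                int ret = percpu_down_read_trylock(sb->s_writers.rw_sem +
                                                   SB_FREEZE_PAGEFAULT - 1);
                WARN_ON(!ret);  /* ...and this is the warning that fires */
                return ret;     /* callers like btrfs_page_mkwrite() ignore it */
        }
        percpu_down_read(sb->s_writers.rw_sem + SB_FREEZE_PAGEFAULT - 1);
        return 1;
}

static void sketch_sb_end_pagefault(struct super_block *sb)
{
        /*
         * Called unconditionally on the way out: when the trylock above
         * failed, this drops a reader reference that was never taken, so
         * the per-CPU read_count sum for SB_FREEZE_PAGEFAULT ends up at -1
         * and the next freezer's down_write() never completes.
         */
        percpu_up_read(sb->s_writers.rw_sem + SB_FREEZE_PAGEFAULT - 1);
}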

After such a hang I ran a drgn script to confirm it:

$ cat dump_freeze_sems.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
    reinterpret, sizeof
from drgn.helpers.linux import *

mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'

mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
    pass

if mnt is None:
    sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
    sys.exit(1)

def dump_sem(level_enum):
    level = level_enum.value_()
    sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
    print(f'freeze semaphore at level {level}, {str(level_enum)}')
    print(f'    block {sem.block.counter.value_()}')
    for i in for_each_possible_cpu(prog):
        read_count = per_cpu_ptr(sem.read_count, i)
        print(f'    read_count at cpu {i} = {read_count}')
    print()

# dump semaphore read counts for all freeze levels (fs.h)
dump_sem(prog['SB_FREEZE_WRITE'])
dump_sem(prog['SB_FREEZE_PAGEFAULT'])
dump_sem(prog['SB_FREEZE_FS'])


$ drgn dump_freeze_sems.py
freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
    block 1
    read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
    read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
    read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
    read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293

freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
    block 1
    read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
    read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
    read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
    read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0

freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
    block 0
    read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
    read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
    read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
    read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0

At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
sums to -1. The system remains like that for hours at least, with no
progress at all.
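
For reference, read_count is an unsigned 32-bit per-CPU counter, so the sums
above are taken modulo 2^32 and interpreted as signed. A trivial standalone C
check of the numbers, just for illustration:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* per-CPU read_count values from the drgn dump above */
        uint32_t level1[] = { 3, 4294967293u, 3, 4294967293u };
        uint32_t level2[] = { 0, 4294967295u, 0, 0 };
        uint32_t s1 = 0, s2 = 0;

        for (int i = 0; i < 4; i++) {
                s1 += level1[i];        /* wraps modulo 2^32 */
                s2 += level2[i];
        }
        /* prints "level 1 sum: 0, level 2 sum: -1" */
        printf("level 1 sum: %d, level 2 sum: %d\n", (int32_t)s1, (int32_t)s2);
        return 0;
}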

Is there a known regression with that lockdep commit?
Anything I can do to help debug it in case it's not obvious?

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 11:26 possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana
@ 2020-10-26 11:40 ` Peter Zijlstra
  2020-10-26 11:55   ` Jan Kara
  2020-10-26 11:56   ` Filipe Manana
  2020-11-03 14:08 ` Boqun Feng
  1 sibling, 2 replies; 26+ messages in thread
From: Peter Zijlstra @ 2020-10-26 11:40 UTC (permalink / raw)
  To: Filipe Manana; +Cc: LKML, Jan Kara, David Sterba

On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
> Hello,
> 
> I've recently started to hit a warning followed by tasks hanging after
> attempts to freeze a filesystem. A git bisection pointed to the
> following commit:
> 
> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Fri Oct 2 11:04:21 2020 +0200
> 
>     lockdep: Fix lockdep recursion
> 
> This happens very reliably when running all xfstests with lockdep
> enabled, and the tested filesystem is btrfs (haven't tried other
> filesystems, but it shouldn't matter). The warning and task hangs always
> happen at either test generic/068 or test generic/390, and (oddly)
> always have to run all tests for it to trigger, running those tests
> individually on an infinite loop doesn't seem to trigger it (at least
> for a couple hours).
> 
> The warning triggered is at fs/super.c:__sb_start_write() which always
> results later in several tasks hanging on a percpu rw_sem:
> 
> https://pastebin.com/qnLvf94E
> 
> What happens is percpu_rwsem_is_held() is apparently returning a false
> positive,

That smells like the same issue reported here:

  https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net

Make sure you have commit:

  f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")

(in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?





^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 11:40 ` Peter Zijlstra
@ 2020-10-26 11:55   ` Jan Kara
  2020-10-26 11:59     ` Filipe Manana
  2020-10-26 12:30     ` Peter Zijlstra
  2020-10-26 11:56   ` Filipe Manana
  1 sibling, 2 replies; 26+ messages in thread
From: Jan Kara @ 2020-10-26 11:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Filipe Manana, LKML, Jan Kara, David Sterba

On Mon 26-10-20 12:40:09, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
> > Hello,
> > 
> > I've recently started to hit a warning followed by tasks hanging after
> > attempts to freeze a filesystem. A git bisection pointed to the
> > following commit:
> > 
> > commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> > Author: Peter Zijlstra <peterz@infradead.org>
> > Date:   Fri Oct 2 11:04:21 2020 +0200
> > 
> >     lockdep: Fix lockdep recursion
> > 
> > This happens very reliably when running all xfstests with lockdep
> > enabled, and the tested filesystem is btrfs (haven't tried other
> > filesystems, but it shouldn't matter). The warning and task hangs always
> > happen at either test generic/068 or test generic/390, and (oddly)
> > always have to run all tests for it to trigger, running those tests
> > individually on an infinite loop doesn't seem to trigger it (at least
> > for a couple hours).
> > 
> > The warning triggered is at fs/super.c:__sb_start_write() which always
> > results later in several tasks hanging on a percpu rw_sem:
> > 
> > https://pastebin.com/qnLvf94E
> > 
> > What happens is percpu_rwsem_is_held() is apparently returning a false
> > positive,
> 
> That smells like the same issue reported here:
> 
>   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
> 
> Make sure you have commit:
> 
>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> 
> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?

Hum, I am at 5.10-rc1 now and above mentioned commit doesn't appear to be
there? Also googling for the title doesn't help...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 11:40 ` Peter Zijlstra
  2020-10-26 11:55   ` Jan Kara
@ 2020-10-26 11:56   ` Filipe Manana
  2020-10-26 12:55     ` Peter Zijlstra
  1 sibling, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2020-10-26 11:56 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Jan Kara, David Sterba



On 26/10/20 11:40, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>> Hello,
>>
>> I've recently started to hit a warning followed by tasks hanging after
>> attempts to freeze a filesystem. A git bisection pointed to the
>> following commit:
>>
>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>> Author: Peter Zijlstra <peterz@infradead.org>
>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>
>>     lockdep: Fix lockdep recursion
>>
>> This happens very reliably when running all xfstests with lockdep
>> enabled, and the tested filesystem is btrfs (haven't tried other
>> filesystems, but it shouldn't matter). The warning and task hangs always
>> happen at either test generic/068 or test generic/390, and (oddly)
>> always have to run all tests for it to trigger, running those tests
>> individually on an infinite loop doesn't seem to trigger it (at least
>> for a couple hours).
>>
>> The warning triggered is at fs/super.c:__sb_start_write() which always
>> results later in several tasks hanging on a percpu rw_sem:
>>
>> https://pastebin.com/qnLvf94E
>>
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive,
> 
> That smells like the same issue reported here:
> 
>   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
> 
> Make sure you have commit:
> 
>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> 
> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?

Yes, CONFIG_DEBUG_PREEMPT is enabled.
I'll try with that commit and let you know, however it's gonna take a
few hours to build a kernel and run all fstests (on that test box it
takes over 3 hours) to confirm that fixes the issue.

Thanks for the quick reply!

> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 11:55   ` Jan Kara
@ 2020-10-26 11:59     ` Filipe Manana
  2020-10-26 12:30     ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2020-10-26 11:59 UTC (permalink / raw)
  To: Jan Kara, Peter Zijlstra; +Cc: LKML, David Sterba



On 26/10/20 11:55, Jan Kara wrote:
> On Mon 26-10-20 12:40:09, Peter Zijlstra wrote:
>> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>>> Hello,
>>>
>>> I've recently started to hit a warning followed by tasks hanging after
>>> attempts to freeze a filesystem. A git bisection pointed to the
>>> following commit:
>>>
>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>>> Author: Peter Zijlstra <peterz@infradead.org>
>>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>>
>>>     lockdep: Fix lockdep recursion
>>>
>>> This happens very reliably when running all xfstests with lockdep
>>> enabled, and the tested filesystem is btrfs (haven't tried other
>>> filesystems, but it shouldn't matter). The warning and task hangs always
>>> happen at either test generic/068 or test generic/390, and (oddly)
>>> always have to run all tests for it to trigger, running those tests
>>> individually on an infinite loop doesn't seem to trigger it (at least
>>> for a couple hours).
>>>
>>> The warning triggered is at fs/super.c:__sb_start_write() which always
>>> results later in several tasks hanging on a percpu rw_sem:
>>>
>>> https://pastebin.com/qnLvf94E
>>>
>>> What happens is percpu_rwsem_is_held() is apparently returning a false
>>> positive,
>>
>> That smells like the same issue reported here:
>>
>>   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
>>
>> Make sure you have commit:
>>
>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>
>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> 
> Hum, I am at 5.10-rc1 now and above mentioned commit doesn't appear to be
> there? Also googling for the title doesn't help...

Same here, I can't find it in Linus' tree nor through Google.

> 
> 								Honza
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 11:55   ` Jan Kara
  2020-10-26 11:59     ` Filipe Manana
@ 2020-10-26 12:30     ` Peter Zijlstra
  1 sibling, 0 replies; 26+ messages in thread
From: Peter Zijlstra @ 2020-10-26 12:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: Filipe Manana, LKML, David Sterba

On Mon, Oct 26, 2020 at 12:55:41PM +0100, Jan Kara wrote:
> > Make sure you have commit:
> > 
> >   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> > 
> > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> 
> Hum, I am at 5.10-rc1 now and above mentioned commit doesn't appear to be
> there? Also googling for the title doesn't help...

*groan*... I seem to have forgotten to push it out to tip/locking/urgent on
Friday :/

Find it here:

  git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/urgent



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 11:56   ` Filipe Manana
@ 2020-10-26 12:55     ` Peter Zijlstra
  2020-10-26 13:06       ` Filipe Manana
                         ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Peter Zijlstra @ 2020-10-26 12:55 UTC (permalink / raw)
  To: Filipe Manana; +Cc: LKML, Jan Kara, David Sterba

On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
> > That smells like the same issue reported here:
> > 
> >   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
> > 
> > Make sure you have commit:
> > 
> >   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> > 
> > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> 
> Yes, CONFIG_DEBUG_PREEMPT is enabled.

Bummer :/

> I'll try with that commit and let you know, however it's gonna take a
> few hours to build a kernel and run all fstests (on that test box it
> takes over 3 hours) to confirm that fixes the issue.

*ouch*, 3 hours is painful. How long to make it sick with the current
kernel? quicker I would hope?

> Thanks for the quick reply!

Anyway, I don't think that commit can actually explain the issue :/

The false positive on lockdep_assert_held() happens when the recursion
count is !0, however we _should_ be having IRQs disabled when
lockdep_recursion > 0, so that should never be observable.

My hope was that DEBUG_PREEMPT would trigger on one of the
__this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
then be a clear violation.

And you're seeing this on x86, right?

Let me puzzle moar..

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 12:55     ` Peter Zijlstra
@ 2020-10-26 13:06       ` Filipe Manana
  2020-10-26 15:22       ` Peter Zijlstra
  2020-10-26 20:35       ` David Sterba
  2 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2020-10-26 13:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Jan Kara, David Sterba



On 26/10/20 12:55, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
>>> That smells like the same issue reported here:
>>>
>>>   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
>>>
>>> Make sure you have commit:
>>>
>>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>>
>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
>>
>> Yes, CONFIG_DEBUG_PREEMPT is enabled.
> 
> Bummer :/
> 
>> I'll try with that commit and let you know, however it's gonna take a
>> few hours to build a kernel and run all fstests (on that test box it
>> takes over 3 hours) to confirm that fixes the issue.
> 
> *ouch*, 3 hours is painful. How long to make it sick with the current
> kernel? quicker I would hope?

If generic/068 triggers the bug, then it's about 1 hour. If that passes,
which rarely happens, then I have to wait for the run to reach
generic/390, which is over 2 hours in.

It sucks that running those tests alone never triggers the issue, but
running all fstests (first the btrfs-specific ones, followed by the
generic ones) reliably triggers the bug, almost always at generic/068;
when that passes, it's triggered by generic/390. To confirm everything
is ok, I let all tests run (the last generic test is 612).


> 
>> Thanks for the quick reply!
> 
> Anyway, I don't think that commit can actually explain the issue :/
> 
> The false positive on lockdep_assert_held() happens when the recursion
> count is !0, however we _should_ be having IRQs disabled when
> lockdep_recursion > 0, so that should never be observable.
> 
> My hope was that DEBUG_PREEMPT would trigger on one of the
> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
> then be a clear violation.
> 
> And you're seeing this on x86, right?

Right.
It's in a qemu vm on x86, with '-cpu host' passed to qemu and kvm enabled.

Thanks.


> 
> Let me puzzle moar..
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 12:55     ` Peter Zijlstra
  2020-10-26 13:06       ` Filipe Manana
@ 2020-10-26 15:22       ` Peter Zijlstra
  2020-10-27  9:49         ` Anatoly Pugachev
                           ` (2 more replies)
  2020-10-26 20:35       ` David Sterba
  2 siblings, 3 replies; 26+ messages in thread
From: Peter Zijlstra @ 2020-10-26 15:22 UTC (permalink / raw)
  To: Filipe Manana; +Cc: LKML, Jan Kara, David Sterba, matorola, mingo

On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
> > > That smells like the same issue reported here:
> > > 
> > >   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
> > > 
> > > Make sure you have commit:
> > > 
> > >   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> > > 
> > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> > 
> > Yes, CONFIG_DEBUG_PREEMPT is enabled.
> 
> Bummer :/
> 
> > I'll try with that commit and let you know, however it's gonna take a
> > few hours to build a kernel and run all fstests (on that test box it
> > takes over 3 hours) to confirm that fixes the issue.
> 
> *ouch*, 3 hours is painful. How long to make it sick with the current
> kernel? quicker I would hope?
> 
> > Thanks for the quick reply!
> 
> Anyway, I don't think that commit can actually explain the issue :/
> 
> The false positive on lockdep_assert_held() happens when the recursion
> count is !0, however we _should_ be having IRQs disabled when
> lockdep_recursion > 0, so that should never be observable.
> 
> My hope was that DEBUG_PREEMPT would trigger on one of the
> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
> then be a clear violation.
> 
> And you're seeing this on x86, right?
> 
> Let me puzzle moar..

So I might have an explanation for the Sparc64 fail, but that can't
explain x86 :/

I initially thought raw_cpu_read() was OK, since if it is !0 we have
IRQs disabled and can't get migrated, so if we get migrated both CPUs
must have 0 and it doesn't matter which 0 we read.

And while that is true, it isn't the whole story: on pretty much all
architectures (except x86) this can result in computing the address for
one CPU, getting migrated, the old CPU continuing execution with another
task (possibly setting recursion) and then the new CPU reading the value
of the old CPU, which is no longer 0.

I already fixed a bunch of that in:

  baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")

but clearly this one got crossed.
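
Schematically, what raw_cpu_read() allows on those architectures is something
like this (an illustrative sketch, not the real asm-generic per-cpu code):

/* Illustrative sketch only; not the real asm-generic per-cpu code. */
DECLARE_PER_CPU(unsigned int, lockdep_recursion);

static unsigned int sketch_read_recursion(void)
{
        unsigned int *addr;

        /* step 1: compute the address of this CPU's copy of the variable */
        addr = raw_cpu_ptr(&lockdep_recursion);

        /*
         * With preemption enabled the task can be migrated right here; the
         * old CPU may then run another task that sets its lockdep_recursion.
         */

        /* step 2: read through the saved pointer, i.e. the *old* CPU's copy */
        return READ_ONCE(*addr);
}

this_cpu_read() doesn't leave that window open (the generic fallback brackets
the access with preempt_disable()/preempt_enable(), and on x86 the whole read
is a single %gs-relative instruction either way), which is what the patch
below switches to.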

Still, that leaves me puzzled over you seeing this on x86 :/

Anatoly, could you try linus+tip/locking/urgent and the below on your
Sparc, please?

---
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..a3041463e42d 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -84,7 +84,7 @@ static inline bool lockdep_enabled(void)
 	if (!debug_locks)
 		return false;
 
-	if (raw_cpu_read(lockdep_recursion))
+	if (this_cpu_read(lockdep_recursion))
 		return false;
 
 	if (current->lockdep_recursion)

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 12:55     ` Peter Zijlstra
  2020-10-26 13:06       ` Filipe Manana
  2020-10-26 15:22       ` Peter Zijlstra
@ 2020-10-26 20:35       ` David Sterba
  2 siblings, 0 replies; 26+ messages in thread
From: David Sterba @ 2020-10-26 20:35 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Filipe Manana, LKML, Jan Kara, David Sterba

On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
> > > That smells like the same issue reported here:
> > > 
> > >   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
> > > 
> > > Make sure you have commit:
> > > 
> > >   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> > > 
> > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> > 
> > Yes, CONFIG_DEBUG_PREEMPT is enabled.
> 
> Bummer :/

My builds don't have that enabled (CONFIG_PREEMPT_NONE=y) but I still
see the warning (same scenario as for Filipe). That is with today's
master branch + your fix from locking/urgent.

> > I'll try with that commit and let you know, however it's gonna take a
> > few hours to build a kernel and run all fstests (on that test box it
> > takes over 3 hours) to confirm that fixes the issue.
> 
> *ouch*, 3 hours is painful. How long to make it sick with the current
> kernel? quicker I would hope?
> 
> > Thanks for the quick reply!
> 
> Anyway, I don't think that commit can actually explain the issue :/
> 
> The false positive on lockdep_assert_held() happens when the recursion
> count is !0, however we _should_ be having IRQs disabled when
> lockdep_recursion > 0, so that should never be observable.
> 
> My hope was that DEBUG_PREEMPT would trigger on one of the
> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
> then be a clear violation.

I can start another round (in my case it's more than 4 hours to
reproduce it) with DEBUG_PREEMPT, unless you have something else to
test.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 15:22       ` Peter Zijlstra
@ 2020-10-27  9:49         ` Anatoly Pugachev
  2020-10-31 11:30         ` [tip: locking/urgent] locking/lockdep: Remove more raw_cpu_read() usage tip-bot2 for Peter Zijlstra
  2020-11-02 17:58         ` possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana
  2 siblings, 0 replies; 26+ messages in thread
From: Anatoly Pugachev @ 2020-10-27  9:49 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Filipe Manana, LKML, Jan Kara, David Sterba, Ingo Molnar

On Mon, Oct 26, 2020 at 6:23 PM Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
> > On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
> > > > That smells like the same issue reported here:
> > > >
> > > >   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
> > > >
> > > > Make sure you have commit:
> > > >
> > > >   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> > > >
> > > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> > >
> > > Yes, CONFIG_DEBUG_PREEMPT is enabled.
> >
> > Bummer :/
> >
> > > I'll try with that commit and let you know, however it's gonna take a
> > > few hours to build a kernel and run all fstests (on that test box it
> > > takes over 3 hours) to confirm that fixes the issue.
> >
> > *ouch*, 3 hours is painful. How long to make it sick with the current
> > kernel? quicker I would hope?
> >
> > > Thanks for the quick reply!
> >
> > Anyway, I don't think that commit can actually explain the issue :/
> >
> > The false positive on lockdep_assert_held() happens when the recursion
> > count is !0, however we _should_ be having IRQs disabled when
> > lockdep_recursion > 0, so that should never be observable.
> >
> > My hope was that DEBUG_PREEMPT would trigger on one of the
> > __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
> > then be a clear violation.
> >
> > And you're seeing this on x86, right?
> >
> > Let me puzzle moar..
>
> So I might have an explanation for the Sparc64 fail, but that can't
> explain x86 :/
>
> I initially thought raw_cpu_read() was OK, since if it is !0 we have
> IRQs disabled and can't get migrated, so if we get migrated both CPUs
> must have 0 and it doesn't matter which 0 we read.
>
> And while that is true; it isn't the whole store, on pretty much all
> architectures (except x86) this can result in computing the address for
> one CPU, getting migrated, the old CPU continuing execution with another
> task (possibly setting recursion) and then the new CPU reading the value
> of the old CPU, which is no longer 0.
>
> I already fixed a bunch of that in:
>
>   baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")
>
> but clearly this one got crossed.
>
> Still, that leaves me puzzled over you seeing this on x86 :/
>
> Anatoly, could you try linus+tip/locking/urgent and the below on your
> Sparc, please?

Peter,
let me test first. Thanks.

PS: sorry for the delay, it was the weekend and I got a bit ill ...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [tip: locking/urgent] locking/lockdep: Remove more raw_cpu_read() usage
  2020-10-26 15:22       ` Peter Zijlstra
  2020-10-27  9:49         ` Anatoly Pugachev
@ 2020-10-31 11:30         ` tip-bot2 for Peter Zijlstra
  2020-11-02 17:58         ` possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana
  2 siblings, 0 replies; 26+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-10-31 11:30 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID:     d48e3850030623e1c20785bceaaf78f916d0b1a3
Gitweb:        https://git.kernel.org/tip/d48e3850030623e1c20785bceaaf78f916d0b1a3
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 26 Oct 2020 16:22:56 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 30 Oct 2020 17:07:18 +01:00

locking/lockdep: Remove more raw_cpu_read() usage

I initially thought raw_cpu_read() was OK, since if it is !0 we have
IRQs disabled and can't get migrated, so if we get migrated both CPUs
must have 0 and it doesn't matter which 0 we read.

And while that is true; it isn't the whole store, on pretty much all
architectures (except x86) this can result in computing the address for
one CPU, getting migrated, the old CPU continuing execution with another
task (possibly setting recursion) and then the new CPU reading the value
of the old CPU, which is no longer 0.

Similar to:

  baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201026152256.GB2651@hirez.programming.kicks-ass.net
---
 kernel/locking/lockdep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index fc206ae..1102849 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -84,7 +84,7 @@ static inline bool lockdep_enabled(void)
 	if (!debug_locks)
 		return false;
 
-	if (raw_cpu_read(lockdep_recursion))
+	if (this_cpu_read(lockdep_recursion))
 		return false;
 
 	if (current->lockdep_recursion)

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 15:22       ` Peter Zijlstra
  2020-10-27  9:49         ` Anatoly Pugachev
  2020-10-31 11:30         ` [tip: locking/urgent] locking/lockdep: Remove more raw_cpu_read() usage tip-bot2 for Peter Zijlstra
@ 2020-11-02 17:58         ` Filipe Manana
  2020-11-03 10:15           ` Jan Kara
  2 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2020-11-02 17:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Jan Kara, David Sterba, matorola, mingo



On 26/10/20 15:22, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
>> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
>>>> That smells like the same issue reported here:
>>>>
>>>>   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
>>>>
>>>> Make sure you have commit:
>>>>
>>>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>>>
>>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
>>>
>>> Yes, CONFIG_DEBUG_PREEMPT is enabled.
>>
>> Bummer :/
>>
>>> I'll try with that commit and let you know, however it's gonna take a
>>> few hours to build a kernel and run all fstests (on that test box it
>>> takes over 3 hours) to confirm that fixes the issue.
>>
>> *ouch*, 3 hours is painful. How long to make it sick with the current
>> kernel? quicker I would hope?
>>
>>> Thanks for the quick reply!
>>
>> Anyway, I don't think that commit can actually explain the issue :/
>>
>> The false positive on lockdep_assert_held() happens when the recursion
>> count is !0, however we _should_ be having IRQs disabled when
>> lockdep_recursion > 0, so that should never be observable.
>>
>> My hope was that DEBUG_PREEMPT would trigger on one of the
>> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
>> then be a clear violation.
>>
>> And you're seeing this on x86, right?
>>
>> Let me puzzle moar..
> 
> So I might have an explanation for the Sparc64 fail, but that can't
> explain x86 :/
> 
> I initially thought raw_cpu_read() was OK, since if it is !0 we have
> IRQs disabled and can't get migrated, so if we get migrated both CPUs
> must have 0 and it doesn't matter which 0 we read.
> 
> And while that is true; it isn't the whole store, on pretty much all
> architectures (except x86) this can result in computing the address for
> one CPU, getting migrated, the old CPU continuing execution with another
> task (possibly setting recursion) and then the new CPU reading the value
> of the old CPU, which is no longer 0.
> 
> I already fixed a bunch of that in:
> 
>   baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")
> 
> but clearly this one got crossed.
> 
> Still, that leaves me puzzled over you seeing this on x86 :/

Hi Peter,

I still get the same issue with 5.10-rc2.
Is there any non-merged patch I should try, or anything I can help with?

Thanks.

> 
> Anatoly, could you try linus+tip/locking/urgent and the below on your
> Sparc, please?
> 
> ---
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..a3041463e42d 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -84,7 +84,7 @@ static inline bool lockdep_enabled(void)
>  	if (!debug_locks)
>  		return false;
>  
> -	if (raw_cpu_read(lockdep_recursion))
> +	if (this_cpu_read(lockdep_recursion))
>  		return false;
>  
>  	if (current->lockdep_recursion)
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-02 17:58         ` possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana
@ 2020-11-03 10:15           ` Jan Kara
  2020-11-03 10:22             ` Filipe Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Jan Kara @ 2020-11-03 10:15 UTC (permalink / raw)
  To: Filipe Manana
  Cc: Peter Zijlstra, LKML, Jan Kara, David Sterba, matorola, mingo

On Mon 02-11-20 17:58:54, Filipe Manana wrote:
> 
> 
> On 26/10/20 15:22, Peter Zijlstra wrote:
> > On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
> >> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
> >>>> That smells like the same issue reported here:
> >>>>
> >>>>   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
> >>>>
> >>>> Make sure you have commit:
> >>>>
> >>>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> >>>>
> >>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> >>>
> >>> Yes, CONFIG_DEBUG_PREEMPT is enabled.
> >>
> >> Bummer :/
> >>
> >>> I'll try with that commit and let you know, however it's gonna take a
> >>> few hours to build a kernel and run all fstests (on that test box it
> >>> takes over 3 hours) to confirm that fixes the issue.
> >>
> >> *ouch*, 3 hours is painful. How long to make it sick with the current
> >> kernel? quicker I would hope?
> >>
> >>> Thanks for the quick reply!
> >>
> >> Anyway, I don't think that commit can actually explain the issue :/
> >>
> >> The false positive on lockdep_assert_held() happens when the recursion
> >> count is !0, however we _should_ be having IRQs disabled when
> >> lockdep_recursion > 0, so that should never be observable.
> >>
> >> My hope was that DEBUG_PREEMPT would trigger on one of the
> >> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
> >> then be a clear violation.
> >>
> >> And you're seeing this on x86, right?
> >>
> >> Let me puzzle moar..
> > 
> > So I might have an explanation for the Sparc64 fail, but that can't
> > explain x86 :/
> > 
> > I initially thought raw_cpu_read() was OK, since if it is !0 we have
> > IRQs disabled and can't get migrated, so if we get migrated both CPUs
> > must have 0 and it doesn't matter which 0 we read.
> > 
> > And while that is true; it isn't the whole store, on pretty much all
> > architectures (except x86) this can result in computing the address for
> > one CPU, getting migrated, the old CPU continuing execution with another
> > task (possibly setting recursion) and then the new CPU reading the value
> > of the old CPU, which is no longer 0.
> > 
> > I already fixed a bunch of that in:
> > 
> >   baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")
> > 
> > but clearly this one got crossed.
> > 
> > Still, that leaves me puzzled over you seeing this on x86 :/
> 
> Hi Peter,
> 
> I still get the same issue with 5.10-rc2.
> Is there any non-merged patch I should try, or anything I can help with?

BTW, I've just hit the same deadlock issue with ext4 on generic/390, so I
can confirm this isn't a btrfs-specific issue (as we already knew from the
analysis, but it's still good to have that confirmed).

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-03 10:15           ` Jan Kara
@ 2020-11-03 10:22             ` Filipe Manana
  0 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2020-11-03 10:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, LKML, David Sterba, matorola, mingo, darrick.wong



On 03/11/20 10:15, Jan Kara wrote:
> On Mon 02-11-20 17:58:54, Filipe Manana wrote:
>>
>>
>> On 26/10/20 15:22, Peter Zijlstra wrote:
>>> On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
>>>> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
>>>>>> That smells like the same issue reported here:
>>>>>>
>>>>>>   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
>>>>>>
>>>>>> Make sure you have commit:
>>>>>>
>>>>>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>>>>>
>>>>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
>>>>>
>>>>> Yes, CONFIG_DEBUG_PREEMPT is enabled.
>>>>
>>>> Bummer :/
>>>>
>>>>> I'll try with that commit and let you know, however it's gonna take a
>>>>> few hours to build a kernel and run all fstests (on that test box it
>>>>> takes over 3 hours) to confirm that fixes the issue.
>>>>
>>>> *ouch*, 3 hours is painful. How long to make it sick with the current
>>>> kernel? quicker I would hope?
>>>>
>>>>> Thanks for the quick reply!
>>>>
>>>> Anyway, I don't think that commit can actually explain the issue :/
>>>>
>>>> The false positive on lockdep_assert_held() happens when the recursion
>>>> count is !0, however we _should_ be having IRQs disabled when
>>>> lockdep_recursion > 0, so that should never be observable.
>>>>
>>>> My hope was that DEBUG_PREEMPT would trigger on one of the
>>>> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
>>>> then be a clear violation.
>>>>
>>>> And you're seeing this on x86, right?
>>>>
>>>> Let me puzzle moar..
>>>
>>> So I might have an explanation for the Sparc64 fail, but that can't
>>> explain x86 :/
>>>
>>> I initially thought raw_cpu_read() was OK, since if it is !0 we have
>>> IRQs disabled and can't get migrated, so if we get migrated both CPUs
>>> must have 0 and it doesn't matter which 0 we read.
>>>
>>> And while that is true; it isn't the whole store, on pretty much all
>>> architectures (except x86) this can result in computing the address for
>>> one CPU, getting migrated, the old CPU continuing execution with another
>>> task (possibly setting recursion) and then the new CPU reading the value
>>> of the old CPU, which is no longer 0.
>>>
>>> I already fixed a bunch of that in:
>>>
>>>   baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")
>>>
>>> but clearly this one got crossed.
>>>
>>> Still, that leaves me puzzled over you seeing this on x86 :/
>>
>> Hi Peter,
>>
>> I still get the same issue with 5.10-rc2.
>> Is there any non-merged patch I should try, or anything I can help with?
> 
> BTW, I've just hit the same deadlock issue with ext4 on generic/390 so I
> confirm this isn't btrfs specific issue (as we already knew from the
> analysis but still it's good to have that confirmed).

Indeed, yesterday Darrick was mentioning on IRC that he has run into it
too with fstests on XFS (5.10-rc).

> 
> 								Honza
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 11:26 possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana
  2020-10-26 11:40 ` Peter Zijlstra
@ 2020-11-03 14:08 ` Boqun Feng
  2020-11-03 14:24   ` Filipe Manana
  2020-11-03 19:44   ` Filipe Manana
  1 sibling, 2 replies; 26+ messages in thread
From: Boqun Feng @ 2020-11-03 14:08 UTC (permalink / raw)
  To: Filipe Manana; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov

Hi Filipe,

On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
> Hello,
> 
> I've recently started to hit a warning followed by tasks hanging after
> attempts to freeze a filesystem. A git bisection pointed to the
> following commit:
> 
> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Fri Oct 2 11:04:21 2020 +0200
> 
>     lockdep: Fix lockdep recursion
> 
> This happens very reliably when running all xfstests with lockdep
> enabled, and the tested filesystem is btrfs (haven't tried other
> filesystems, but it shouldn't matter). The warning and task hangs always
> happen at either test generic/068 or test generic/390, and (oddly)
> always have to run all tests for it to trigger, running those tests
> individually on an infinite loop doesn't seem to trigger it (at least
> for a couple hours).
> 
> The warning triggered is at fs/super.c:__sb_start_write() which always
> results later in several tasks hanging on a percpu rw_sem:
> 
> https://pastebin.com/qnLvf94E
> 

In your dmesg, I see line:

	[ 9304.920151] INFO: lockdep is turned off.

, that means debug_locks is 0, which usually happens when lockdep finds a
problem (i.e. a deadlock) and turns itself off, because once a problem is
found it's pointless for lockdep to continue to run.

And I haven't found a lockdep splat in your dmesg; do you have a full
dmesg so that I can have a look?

This may be relevant because in commit 4d004099a66, we have

	@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
		unsigned long flags;
		int ret = 0;

	-       if (unlikely(current->lockdep_recursion))
	+       if (unlikely(!lockdep_enabled()))
			return 1; /* avoid false negative lockdep_assert_held() */

Before this commit, lock_is_held_type() and its friends could return
false if debug_locks == 0; after this commit, lock_is_held_type() and
its friends will always return true if debug_locks == 0. That could
cause the behavior seen here.
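
Concretely, the post-4d004099a668 logic is roughly the following (paraphrased
from kernel/locking/lockdep.c only to make the debug_locks connection
explicit; not the exact code):

/* Paraphrase of the post-commit logic; the real definitions live in
 * kernel/locking/lockdep.c. */
DECLARE_PER_CPU(unsigned int, lockdep_recursion);

static inline bool sketch_lockdep_enabled(void)
{
        if (!debug_locks)                       /* lockdep turned itself off */
                return false;
        if (this_cpu_read(lockdep_recursion))   /* per-CPU recursion guard */
                return false;
        if (current->lockdep_recursion)         /* per-task recursion guard */
                return false;
        return true;
}

/*
 * With the hunk quoted above, lock_is_held_type() starts with
 *
 *      if (unlikely(!lockdep_enabled()))
 *              return 1;
 *
 * so once debug_locks == 0, every lockdep_is_held() / percpu_rwsem_is_held()
 * query answers "held", including for freeze rwsems that were never taken.
 */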

In case I'm correct, the following "fix" may be helpful. 

Regards,
Boqun

----------8<----------
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..c0e27fb949ff 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
 	unsigned long flags;
 	int ret = 0;
 
-	if (unlikely(!lockdep_enabled()))
+	if (unlikely(debug_locks && !lockdep_enabled()))
 		return 1; /* avoid false negative lockdep_assert_held() */
 
 	raw_local_irq_save(flags);



> What happens is percpu_rwsem_is_held() is apparently returning a false
> positive, so this makes __sb_start_write() do a
> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> is expected to always succeed, because if the calling task is holding a
> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> the semaphore at level 2, since the freeze semaphores are always
> acquired by increasing level order.
> 
> But the try_lock fails, it triggers the warning at __sb_start_write(),
> then its caller sb_start_pagefault() ignores the return value and
> callers such as btrfs_page_mkwrite() make the assumption the freeze
> semaphore was taken, proceed to do their stuff, and later call
>> sb_end_pagefault(), which will do an up_read() on the percpu_rwsem
>> at level 2 despite never having been able to down_read() the
> semaphore. This obviously corrupts the semaphore's read_count state, and
> later causes any task trying to down_write() it to hang forever.
> 
> After such a hang I ran a drgn script to confirm it:
> 
> $ cat dump_freeze_sems.py
> import sys
> import drgn
> from drgn import NULL, Object, cast, container_of, execscript, \
>     reinterpret, sizeof
> from drgn.helpers.linux import *
> 
> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> 
> mnt = None
> for mnt in for_each_mount(prog, dst = mnt_path):
>     pass
> 
> if mnt is None:
>     sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>     sys.exit(1)
> 
> def dump_sem(level_enum):
>     level = level_enum.value_()
>     sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>     print(f'freeze semaphore at level {level}, {str(level_enum)}')
>     print(f'    block {sem.block.counter.value_()}')
>     for i in for_each_possible_cpu(prog):
>         read_count = per_cpu_ptr(sem.read_count, i)
>         print(f'    read_count at cpu {i} = {read_count}')
>     print()
> 
> # dump semaphore read counts for all freeze levels (fs.h)
> dump_sem(prog['SB_FREEZE_WRITE'])
> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> dump_sem(prog['SB_FREEZE_FS'])
> 
> 
> $ drgn dump_freeze_sems.py
> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>     block 1
>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
> 
> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>     block 1
>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
> 
> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>     block 0
>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
> 
> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
> sums to -1. The system remains like that for hours at least, with no
> progress at all.
> 
> Is there a known regression with that lockdep commit?
> Anything I can do to help debug it in case it's not obvious?
> 
> Thanks.

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-03 14:08 ` Boqun Feng
@ 2020-11-03 14:24   ` Filipe Manana
  2020-11-03 19:44   ` Filipe Manana
  1 sibling, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2020-11-03 14:24 UTC (permalink / raw)
  To: Boqun Feng; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov



On 03/11/20 14:08, Boqun Feng wrote:
> Hi Filipe,
> 
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>> Hello,
>>
>> I've recently started to hit a warning followed by tasks hanging after
>> attempts to freeze a filesystem. A git bisection pointed to the
>> following commit:
>>
>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>> Author: Peter Zijlstra <peterz@infradead.org>
>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>
>>     lockdep: Fix lockdep recursion
>>
>> This happens very reliably when running all xfstests with lockdep
>> enabled, and the tested filesystem is btrfs (haven't tried other
>> filesystems, but it shouldn't matter). The warning and task hangs always
>> happen at either test generic/068 or test generic/390, and (oddly)
>> always have to run all tests for it to trigger, running those tests
>> individually on an infinite loop doesn't seem to trigger it (at least
>> for a couple hours).
>>
>> The warning triggered is at fs/super.c:__sb_start_write() which always
>> results later in several tasks hanging on a percpu rw_sem:
>>
>> https://pastebin.com/qnLvf94E
>>
> 
> In your dmesg, I see line:
> 
> 	[ 9304.920151] INFO: lockdep is turned off.
> 
> , that means debug_locks is 0, that usually happens when lockdep find a
> problem (i.e. a deadlock) and it turns itself off, because a problem is
> found and it's pointless for lockdep to continue to run.
> 
> And I haven't found a lockdep splat in your dmesg, do you have a full
> dmesg so that I can have a look?

Every time I run into the issue it produces similar results. It always
starts with the WARN_ON() at __sb_start_write(), followed by several
"hung task" traces, and then, some time after that, lockdep is turned off.

> 
> This may be relevant because in commit 4d004099a66, we have
> 
> 	@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> 		unsigned long flags;
> 		int ret = 0;
> 
> 	-       if (unlikely(current->lockdep_recursion))
> 	+       if (unlikely(!lockdep_enabled()))
> 			return 1; /* avoid false negative lockdep_assert_held() */
> 
> before this commit lock_is_held_type() and its friends may return false
> if debug_locks==0, after this commit lock_is_held_type() and its friends
> will always return true if debug_locks == 0. That could cause the
> behavior here.
> 
> In case I'm correct, the following "fix" may be helpful. 

I'll try it and let you know.
Thanks!

> 
> Regards,
> Boqun
> 
> ----------8
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..c0e27fb949ff 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>  	unsigned long flags;
>  	int ret = 0;
>  
> -	if (unlikely(!lockdep_enabled()))
> +	if (unlikely(debug_locks && !lockdep_enabled()))
>  		return 1; /* avoid false negative lockdep_assert_held() */
>  
>  	raw_local_irq_save(flags);
> 
> 
> 
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive, so this makes __sb_start_write() do a
>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>> is expected to always succeed, because if the calling task is holding a
>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>> the semaphore at level 2, since the freeze semaphores are always
>> acquired by increasing level order.
>>
>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>> then its caller sb_start_pagefault() ignores the return value and
>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>> semaphore was taken, proceed to do their stuff, and later call
>> sb_end_pagefault(), which will do an up_read() on the percpu_rwsem
>> at level 2 despite never having been able to down_read() the
>> semaphore. This obviously corrupts the semaphore's read_count state, and
>> later causes any task trying to down_write() it to hang forever.
>>
>> After such a hang I ran a drgn script to confirm it:
>>
>> $ cat dump_freeze_sems.py
>> import sys
>> import drgn
>> from drgn import NULL, Object, cast, container_of, execscript, \
>>     reinterpret, sizeof
>> from drgn.helpers.linux import *
>>
>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>
>> mnt = None
>> for mnt in for_each_mount(prog, dst = mnt_path):
>>     pass
>>
>> if mnt is None:
>>     sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>>     sys.exit(1)
>>
>> def dump_sem(level_enum):
>>     level = level_enum.value_()
>>     sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>>     print(f'freeze semaphore at level {level}, {str(level_enum)}')
>>     print(f'    block {sem.block.counter.value_()}')
>>     for i in for_each_possible_cpu(prog):
>>         read_count = per_cpu_ptr(sem.read_count, i)
>>         print(f'    read_count at cpu {i} = {read_count}')
>>     print()
>>
>> # dump semaphore read counts for all freeze levels (fs.h)
>> dump_sem(prog['SB_FREEZE_WRITE'])
>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
>> dump_sem(prog['SB_FREEZE_FS'])
>>
>>
>> $ drgn dump_freeze_sems.py
>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>>     block 1
>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>>
>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>>     block 1
>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>>
>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>>     block 0
>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>>
>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
>> sums to -1. The system remains like that for hours at least, with no
>> progress at all.
>>
>> Is there a known regression with that lockdep commit?
>> Anything I can do to help debug it in case it's not obvious?
>>
>> Thanks.
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-03 14:08 ` Boqun Feng
  2020-11-03 14:24   ` Filipe Manana
@ 2020-11-03 19:44   ` Filipe Manana
  2020-11-04  2:22     ` Boqun Feng
  1 sibling, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2020-11-03 19:44 UTC (permalink / raw)
  To: Boqun Feng; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov



On 03/11/20 14:08, Boqun Feng wrote:
> Hi Filipe,
> 
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>> Hello,
>>
>> I've recently started to hit a warning followed by tasks hanging after
>> attempts to freeze a filesystem. A git bisection pointed to the
>> following commit:
>>
>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>> Author: Peter Zijlstra <peterz@infradead.org>
>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>
>>     lockdep: Fix lockdep recursion
>>
>> This happens very reliably when running all xfstests with lockdep
>> enabled, and the tested filesystem is btrfs (haven't tried other
>> filesystems, but it shouldn't matter). The warning and task hangs always
>> happen at either test generic/068 or test generic/390, and (oddly)
>> always have to run all tests for it to trigger, running those tests
>> individually on an infinite loop doesn't seem to trigger it (at least
>> for a couple hours).
>>
>> The warning triggered is at fs/super.c:__sb_start_write() which always
>> results later in several tasks hanging on a percpu rw_sem:
>>
>> https://pastebin.com/qnLvf94E
>>
> 
> In your dmesg, I see line:
> 
> 	[ 9304.920151] INFO: lockdep is turned off.
> 
> , that means debug_locks is 0, that usually happens when lockdep find a
> problem (i.e. a deadlock) and it turns itself off, because a problem is
> found and it's pointless for lockdep to continue to run.
> 
> And I haven't found a lockdep splat in your dmesg, do you have a full
> dmesg so that I can have a look?
> 
> This may be relevant because in commit 4d004099a66, we have
> 
> 	@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> 		unsigned long flags;
> 		int ret = 0;
> 
> 	-       if (unlikely(current->lockdep_recursion))
> 	+       if (unlikely(!lockdep_enabled()))
> 			return 1; /* avoid false negative lockdep_assert_held() */
> 
> before this commit lock_is_held_type() and its friends may return false
> if debug_locks==0, after this commit lock_is_held_type() and its friends
> will always return true if debug_locks == 0. That could cause the
> behavior here.
> 
> In case I'm correct, the following "fix" may be helpful. 
> 
> Regards,
> Boqun
> 
> ----------8
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..c0e27fb949ff 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>  	unsigned long flags;
>  	int ret = 0;
>  
> -	if (unlikely(!lockdep_enabled()))
> +	if (unlikely(debug_locks && !lockdep_enabled()))
>  		return 1; /* avoid false negative lockdep_assert_held() */
>  
>  	raw_local_irq_save(flags);

Boqun, the patch fixes the problem for me!
You can have Tested-by: Filipe Manana <fdmanana@suse.com>

Thanks!

> 
> 
> 
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive, so this makes __sb_start_write() do a
>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>> is expected to always succeed, because if the calling task is holding a
>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>> the semaphore at level 2, since the freeze semaphores are always
>> acquired by increasing level order.
>>
>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>> then its caller sb_start_pagefault() ignores the return value and
>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>> semaphore was taken, proceed to do their stuff, and later call
>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>> at level 2 despite not having not been able to down_read() the
>> semaphore. This obviously corrupts the semaphore's read_count state, and
>> later causes any task trying to down_write() it to hang forever.
>>
>> After such a hang I ran a drgn script to confirm it:
>>
>> $ cat dump_freeze_sems.py
>> import sys
>> import drgn
>> from drgn import NULL, Object, cast, container_of, execscript, \
>>     reinterpret, sizeof
>> from drgn.helpers.linux import *
>>
>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>
>> mnt = None
>> for mnt in for_each_mount(prog, dst = mnt_path):
>>     pass
>>
>> if mnt is None:
>>     sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>>     sys.exit(1)
>>
>> def dump_sem(level_enum):
>>     level = level_enum.value_()
>>     sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>>     print(f'freeze semaphore at level {level}, {str(level_enum)}')
>>     print(f'    block {sem.block.counter.value_()}')
>>     for i in for_each_possible_cpu(prog):
>>         read_count = per_cpu_ptr(sem.read_count, i)
>>         print(f'    read_count at cpu {i} = {read_count}')
>>     print()
>>
>> # dump semaphore read counts for all freeze levels (fs.h)
>> dump_sem(prog['SB_FREEZE_WRITE'])
>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
>> dump_sem(prog['SB_FREEZE_FS'])
>>
>>
>> $ drgn dump_freeze_sems.py
>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>>     block 1
>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>>
>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>>     block 1
>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>>
>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>>     block 0
>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>>
>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
>> sums to -1. The system remains like that for hours at least, with no
>> progress at all.
>>
>> Is there a known regression with that lockdep commit?
>> Anything I can do to help debug it in case it's not obvious?
>>
>> Thanks.
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-03 19:44   ` Filipe Manana
@ 2020-11-04  2:22     ` Boqun Feng
  2020-11-04  3:44       ` Boqun Feng
  0 siblings, 1 reply; 26+ messages in thread
From: Boqun Feng @ 2020-11-04  2:22 UTC (permalink / raw)
  To: Filipe Manana; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov

On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
> 
> 
> On 03/11/20 14:08, Boqun Feng wrote:
> > Hi Filipe,
> > 
> > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
> >> Hello,
> >>
> >> I've recently started to hit a warning followed by tasks hanging after
> >> attempts to freeze a filesystem. A git bisection pointed to the
> >> following commit:
> >>
> >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> >> Author: Peter Zijlstra <peterz@infradead.org>
> >> Date:   Fri Oct 2 11:04:21 2020 +0200
> >>
> >>     lockdep: Fix lockdep recursion
> >>
> >> This happens very reliably when running all xfstests with lockdep
> >> enabled, and the tested filesystem is btrfs (haven't tried other
> >> filesystems, but it shouldn't matter). The warning and task hangs always
> >> happen at either test generic/068 or test generic/390, and (oddly)
> >> always have to run all tests for it to trigger, running those tests
> >> individually on an infinite loop doesn't seem to trigger it (at least
> >> for a couple hours).
> >>
> >> The warning triggered is at fs/super.c:__sb_start_write() which always
> >> results later in several tasks hanging on a percpu rw_sem:
> >>
> >> https://pastebin.com/qnLvf94E
> >>
> > 
> > In your dmesg, I see line:
> > 
> > 	[ 9304.920151] INFO: lockdep is turned off.
> > 
> > , that means debug_locks is 0, that usually happens when lockdep find a
> > problem (i.e. a deadlock) and it turns itself off, because a problem is
> > found and it's pointless for lockdep to continue to run.
> > 
> > And I haven't found a lockdep splat in your dmesg, do you have a full
> > dmesg so that I can have a look?
> > 
> > This may be relevant because in commit 4d004099a66, we have
> > 
> > 	@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> > 		unsigned long flags;
> > 		int ret = 0;
> > 
> > 	-       if (unlikely(current->lockdep_recursion))
> > 	+       if (unlikely(!lockdep_enabled()))
> > 			return 1; /* avoid false negative lockdep_assert_held() */
> > 
> > before this commit lock_is_held_type() and its friends may return false
> > if debug_locks==0, after this commit lock_is_held_type() and its friends
> > will always return true if debug_locks == 0. That could cause the
> > behavior here.
> > 
> > In case I'm correct, the following "fix" may be helpful. 
> > 
> > Regards,
> > Boqun
> > 
> > ----------8
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index 3e99dfef8408..c0e27fb949ff 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> >  	unsigned long flags;
> >  	int ret = 0;
> >  
> > -	if (unlikely(!lockdep_enabled()))
> > +	if (unlikely(debug_locks && !lockdep_enabled()))
> >  		return 1; /* avoid false negative lockdep_assert_held() */
> >  
> >  	raw_local_irq_save(flags);
> 
> Boqun, the patch fixes the problem for me!
> You can have Tested-by: Filipe Manana <fdmanana@suse.com>
> 

Thanks. Although I think it still means that we have a locking issue when
running xfstests (because we don't know why debug_locks gets cleared),
I guess I will have to reproduce this myself for further analysis. Could
you share your .config?

Anyway, I think this fix still makes some sense, so I will send a proper
patch so that the problem won't block the fs folks.

Regards,
Boqun

> Thanks!
> 
> > 
> > 
> > 
> >> What happens is percpu_rwsem_is_held() is apparently returning a false
> >> positive, so this makes __sb_start_write() do a
> >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> >> is expected to always succeed, because if the calling task is holding a
> >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> >> the semaphore at level 2, since the freeze semaphores are always
> >> acquired by increasing level order.
> >>
> >> But the try_lock fails, it triggers the warning at __sb_start_write(),
> >> then its caller sb_start_pagefault() ignores the return value and
> >> callers such as btrfs_page_mkwrite() make the assumption the freeze
> >> semaphore was taken, proceed to do their stuff, and later call
> >> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
> >> at level 2 despite not having not been able to down_read() the
> >> semaphore. This obviously corrupts the semaphore's read_count state, and
> >> later causes any task trying to down_write() it to hang forever.
> >>
> >> After such a hang I ran a drgn script to confirm it:
> >>
> >> $ cat dump_freeze_sems.py
> >> import sys
> >> import drgn
> >> from drgn import NULL, Object, cast, container_of, execscript, \
> >>     reinterpret, sizeof
> >> from drgn.helpers.linux import *
> >>
> >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> >>
> >> mnt = None
> >> for mnt in for_each_mount(prog, dst = mnt_path):
> >>     pass
> >>
> >> if mnt is None:
> >>     sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> >>     sys.exit(1)
> >>
> >> def dump_sem(level_enum):
> >>     level = level_enum.value_()
> >>     sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> >>     print(f'freeze semaphore at level {level}, {str(level_enum)}')
> >>     print(f'    block {sem.block.counter.value_()}')
> >>     for i in for_each_possible_cpu(prog):
> >>         read_count = per_cpu_ptr(sem.read_count, i)
> >>         print(f'    read_count at cpu {i} = {read_count}')
> >>     print()
> >>
> >> # dump semaphore read counts for all freeze levels (fs.h)
> >> dump_sem(prog['SB_FREEZE_WRITE'])
> >> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> >> dump_sem(prog['SB_FREEZE_FS'])
> >>
> >>
> >> $ drgn dump_freeze_sems.py
> >> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
> >>     block 1
> >>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
> >>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
> >>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
> >>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
> >>
> >> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
> >>     block 1
> >>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
> >>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
> >>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
> >>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
> >>
> >> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
> >>     block 0
> >>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
> >>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
> >>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
> >>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
> >>
> >> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
> >> sums to -1. The system remains like that for hours at least, with no
> >> progress at all.
> >>
> >> Is there a known regression with that lockdep commit?
> >> Anything I can do to help debug it in case it's not obvious?
> >>
> >> Thanks.
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-04  2:22     ` Boqun Feng
@ 2020-11-04  3:44       ` Boqun Feng
  2020-11-04  9:49         ` Filipe Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Boqun Feng @ 2020-11-04  3:44 UTC (permalink / raw)
  To: Filipe Manana; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov

On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
> > 
> > 
> > On 03/11/20 14:08, Boqun Feng wrote:
> > > Hi Filipe,
> > > 
> > > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
> > >> Hello,
> > >>
> > >> I've recently started to hit a warning followed by tasks hanging after
> > >> attempts to freeze a filesystem. A git bisection pointed to the
> > >> following commit:
> > >>
> > >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
> > >> Author: Peter Zijlstra <peterz@infradead.org>
> > >> Date:   Fri Oct 2 11:04:21 2020 +0200
> > >>
> > >>     lockdep: Fix lockdep recursion
> > >>
> > >> This happens very reliably when running all xfstests with lockdep
> > >> enabled, and the tested filesystem is btrfs (haven't tried other
> > >> filesystems, but it shouldn't matter). The warning and task hangs always
> > >> happen at either test generic/068 or test generic/390, and (oddly)
> > >> always have to run all tests for it to trigger, running those tests
> > >> individually on an infinite loop doesn't seem to trigger it (at least
> > >> for a couple hours).
> > >>
> > >> The warning triggered is at fs/super.c:__sb_start_write() which always
> > >> results later in several tasks hanging on a percpu rw_sem:
> > >>
> > >> https://pastebin.com/qnLvf94E
> > >>
> > > 
> > > In your dmesg, I see line:
> > > 
> > > 	[ 9304.920151] INFO: lockdep is turned off.
> > > 
> > > , that means debug_locks is 0, that usually happens when lockdep find a
> > > problem (i.e. a deadlock) and it turns itself off, because a problem is
> > > found and it's pointless for lockdep to continue to run.
> > > 
> > > And I haven't found a lockdep splat in your dmesg, do you have a full
> > > dmesg so that I can have a look?
> > > 
> > > This may be relevant because in commit 4d004099a66, we have
> > > 
> > > 	@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> > > 		unsigned long flags;
> > > 		int ret = 0;
> > > 
> > > 	-       if (unlikely(current->lockdep_recursion))
> > > 	+       if (unlikely(!lockdep_enabled()))
> > > 			return 1; /* avoid false negative lockdep_assert_held() */
> > > 
> > > before this commit lock_is_held_type() and its friends may return false
> > > if debug_locks==0, after this commit lock_is_held_type() and its friends
> > > will always return true if debug_locks == 0. That could cause the
> > > behavior here.
> > > 
> > > In case I'm correct, the following "fix" may be helpful. 
> > > 
> > > Regards,
> > > Boqun
> > > 
> > > ----------8
> > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > > index 3e99dfef8408..c0e27fb949ff 100644
> > > --- a/kernel/locking/lockdep.c
> > > +++ b/kernel/locking/lockdep.c
> > > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> > >  	unsigned long flags;
> > >  	int ret = 0;
> > >  
> > > -	if (unlikely(!lockdep_enabled()))
> > > +	if (unlikely(debug_locks && !lockdep_enabled()))
> > >  		return 1; /* avoid false negative lockdep_assert_held() */
> > >  
> > >  	raw_local_irq_save(flags);
> > 
> > Boqun, the patch fixes the problem for me!
> > You can have Tested-by: Filipe Manana <fdmanana@suse.com>
> > 
> 
> Thanks. Although I think it still means that we have a lock issue when
> running xfstests (because we don't know why debug_locks gets cleared),

I might have found a place where we could turn lockdep off silently:

In print_circular_bug(), we turn off lockdep via
debug_locks_off_graph_unlock(), and then we try to save the trace for
the lockdep splat. However, if we have used up the stack_trace buffer
(i.e. nr_stack_trace_entries), save_trace() will return NULL and we
return silently, without ever printing the splat.
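
To make that concrete, here is a simplified sketch of the control flow I
mean (paraphrased from kernel/locking/lockdep.c, so treat the details as
an approximation rather than verbatim code):

	static noinline void print_circular_bug(struct lock_list *this, ...)
	{
		/* lockdep is turned off here: debug_locks becomes 0 */
		if (!debug_locks_off_graph_unlock() || debug_locks_silent)
			return;

		/*
		 * If nr_stack_trace_entries is already used up, save_trace()
		 * sees debug_locks == 0, so its own
		 * debug_locks_off_graph_unlock() returns 0 and it gives up
		 * without printing "MAX_STACK_TRACE_ENTRIES too low!".
		 */
		this->trace = save_trace();
		if (!this->trace)
			return;	/* silent return: no splat reaches dmesg */

		/* ... otherwise print the usual circular locking report ... */
	}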

Filipe, in order to check whether that happens, could you share your
/proc/lockdep_stats after the full set of xfstests is finished?

Alternatively, it would also be helpful if you could try the following
debug diff with the full set of xfstests:

Thanks! I'm just trying to understand the real problem.

Regards,
Boqun

-------------->8
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b71ad8d9f1c9..9ae3e089e5c0 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
 		LOCK_TRACE_SIZE_IN_LONGS;
 
 	if (max_entries <= 0) {
-		if (!debug_locks_off_graph_unlock())
+		if (!debug_locks_off_graph_unlock()) {
+			WARN_ON_ONCE(1);
 			return NULL;
+		}
 
 		print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
 		dump_stack();

> I guess I will have to reproduce this myself for further analysis, could
> you share you .config?
> 
> Anyway, I think this fix still makes a bit sense, I will send a proper
> patch so that the problem won't block fs folks.
> 
> Regards,
> Boqun
> 
> > Thanks!
> > 
> > > 
> > > 
> > > 
> > >> What happens is percpu_rwsem_is_held() is apparently returning a false
> > >> positive, so this makes __sb_start_write() do a
> > >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> > >> is expected to always succeed, because if the calling task is holding a
> > >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> > >> the semaphore at level 2, since the freeze semaphores are always
> > >> acquired by increasing level order.
> > >>
> > >> But the try_lock fails, it triggers the warning at __sb_start_write(),
> > >> then its caller sb_start_pagefault() ignores the return value and
> > >> callers such as btrfs_page_mkwrite() make the assumption the freeze
> > >> semaphore was taken, proceed to do their stuff, and later call
> > >> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
> > >> at level 2 despite not having not been able to down_read() the
> > >> semaphore. This obviously corrupts the semaphore's read_count state, and
> > >> later causes any task trying to down_write() it to hang forever.
> > >>
> > >> After such a hang I ran a drgn script to confirm it:
> > >>
> > >> $ cat dump_freeze_sems.py
> > >> import sys
> > >> import drgn
> > >> from drgn import NULL, Object, cast, container_of, execscript, \
> > >>     reinterpret, sizeof
> > >> from drgn.helpers.linux import *
> > >>
> > >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> > >>
> > >> mnt = None
> > >> for mnt in for_each_mount(prog, dst = mnt_path):
> > >>     pass
> > >>
> > >> if mnt is None:
> > >>     sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> > >>     sys.exit(1)
> > >>
> > >> def dump_sem(level_enum):
> > >>     level = level_enum.value_()
> > >>     sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> > >>     print(f'freeze semaphore at level {level}, {str(level_enum)}')
> > >>     print(f'    block {sem.block.counter.value_()}')
> > >>     for i in for_each_possible_cpu(prog):
> > >>         read_count = per_cpu_ptr(sem.read_count, i)
> > >>         print(f'    read_count at cpu {i} = {read_count}')
> > >>     print()
> > >>
> > >> # dump semaphore read counts for all freeze levels (fs.h)
> > >> dump_sem(prog['SB_FREEZE_WRITE'])
> > >> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> > >> dump_sem(prog['SB_FREEZE_FS'])
> > >>
> > >>
> > >> $ drgn dump_freeze_sems.py
> > >> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
> > >>     block 1
> > >>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
> > >>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
> > >>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
> > >>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
> > >>
> > >> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
> > >>     block 1
> > >>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
> > >>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
> > >>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
> > >>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
> > >>
> > >> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
> > >>     block 0
> > >>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
> > >>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
> > >>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
> > >>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
> > >>
> > >> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
> > >> sums to -1. The system remains like that for hours at least, with no
> > >> progress at all.
> > >>
> > >> Is there a known regression with that lockdep commit?
> > >> Anything I can do to help debug it in case it's not obvious?
> > >>
> > >> Thanks.
> > > 

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-04  3:44       ` Boqun Feng
@ 2020-11-04  9:49         ` Filipe Manana
  2020-11-04 19:54           ` Filipe Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2020-11-04  9:49 UTC (permalink / raw)
  To: Boqun Feng; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov



On 04/11/20 03:44, Boqun Feng wrote:
> On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
>> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
>>>
>>>
>>> On 03/11/20 14:08, Boqun Feng wrote:
>>>> Hi Filipe,
>>>>
>>>> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>>>>> Hello,
>>>>>
>>>>> I've recently started to hit a warning followed by tasks hanging after
>>>>> attempts to freeze a filesystem. A git bisection pointed to the
>>>>> following commit:
>>>>>
>>>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>>>>
>>>>>     lockdep: Fix lockdep recursion
>>>>>
>>>>> This happens very reliably when running all xfstests with lockdep
>>>>> enabled, and the tested filesystem is btrfs (haven't tried other
>>>>> filesystems, but it shouldn't matter). The warning and task hangs always
>>>>> happen at either test generic/068 or test generic/390, and (oddly)
>>>>> always have to run all tests for it to trigger, running those tests
>>>>> individually on an infinite loop doesn't seem to trigger it (at least
>>>>> for a couple hours).
>>>>>
>>>>> The warning triggered is at fs/super.c:__sb_start_write() which always
>>>>> results later in several tasks hanging on a percpu rw_sem:
>>>>>
>>>>> https://pastebin.com/qnLvf94E
>>>>>
>>>>
>>>> In your dmesg, I see line:
>>>>
>>>> 	[ 9304.920151] INFO: lockdep is turned off.
>>>>
>>>> , that means debug_locks is 0, that usually happens when lockdep find a
>>>> problem (i.e. a deadlock) and it turns itself off, because a problem is
>>>> found and it's pointless for lockdep to continue to run.
>>>>
>>>> And I haven't found a lockdep splat in your dmesg, do you have a full
>>>> dmesg so that I can have a look?
>>>>
>>>> This may be relevant because in commit 4d004099a66, we have
>>>>
>>>> 	@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>>>> 		unsigned long flags;
>>>> 		int ret = 0;
>>>>
>>>> 	-       if (unlikely(current->lockdep_recursion))
>>>> 	+       if (unlikely(!lockdep_enabled()))
>>>> 			return 1; /* avoid false negative lockdep_assert_held() */
>>>>
>>>> before this commit lock_is_held_type() and its friends may return false
>>>> if debug_locks==0, after this commit lock_is_held_type() and its friends
>>>> will always return true if debug_locks == 0. That could cause the
>>>> behavior here.
>>>>
>>>> In case I'm correct, the following "fix" may be helpful. 
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>>> ----------8
>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>> index 3e99dfef8408..c0e27fb949ff 100644
>>>> --- a/kernel/locking/lockdep.c
>>>> +++ b/kernel/locking/lockdep.c
>>>> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>>>>  	unsigned long flags;
>>>>  	int ret = 0;
>>>>  
>>>> -	if (unlikely(!lockdep_enabled()))
>>>> +	if (unlikely(debug_locks && !lockdep_enabled()))
>>>>  		return 1; /* avoid false negative lockdep_assert_held() */
>>>>  
>>>>  	raw_local_irq_save(flags);
>>>
>>> Boqun, the patch fixes the problem for me!
>>> You can have Tested-by: Filipe Manana <fdmanana@suse.com>
>>>
>>
>> Thanks. Although I think it still means that we have a lock issue when
>> running xfstests (because we don't know why debug_locks gets cleared),
> 
> I might find a place where we could turn lockdep off silently:
> 
> in print_circular_bug(), we turn off lockdep via
> debug_locks_off_graph_unlock(), and then we try to save the trace for
> lockdep splat, however, if we use up the stack_trace buffer (i.e.
> nr_stack_trace_entries), save_trace() will return NULL and we return
> silently.
> 
> Filipe, in order to check whethter that happens, could you share me your
> /proc/lockdep_stats after the full set of xfstests is finished?

Here it is:

$ cat /proc/lockdep_stats
 lock-classes:                         1831 [max: 8192]
 direct dependencies:                 17774 [max: 32768]
 indirect dependencies:               75662
 all direct dependencies:            325284
 dependency chains:                   34223 [max: 65536]
 dependency chain hlocks used:       158129 [max: 327680]
 dependency chain hlocks lost:            0
 in-hardirq chains:                      57
 in-softirq chains:                     658
 in-process chains:                   33508
 stack-trace entries:                160748 [max: 524288]
 number of stack traces:               9237
 number of stack hash chains:          7076
 combined max dependencies:      1280780998
 hardirq-safe locks:                     43
 hardirq-unsafe locks:                 1337
 softirq-safe locks:                    179
 softirq-unsafe locks:                 1236
 irq-safe locks:                        187
 irq-unsafe locks:                     1337
 hardirq-read-safe locks:                 2
 hardirq-read-unsafe locks:             209
 softirq-read-safe locks:                 9
 softirq-read-unsafe locks:             204
 irq-read-safe locks:                     9
 irq-read-unsafe locks:                 209
 uncategorized locks:                   274
 unused locks:                            0
 max locking depth:                      15
 max bfs queue depth:                   337
 debug_locks:                             0

 zapped classes:                       2278
 zapped lock chains:                  17915
 large chain blocks:                      1

(That's the result after running all fstests with the previous one-line
patch you sent.)

My kernel .config:  https://pastebin.com/4xEMvLJ9

I'll try the debugging patch and let you know the results. It will take
about 4 hours to get back to you with them.

Thanks!

> 
> Alternatively, it's also helpful if you can try the following debug
> diff, with teh full set of xfstests:
> 
> Thanks! Just trying to understand the real problem.
> 
> Regards,
> Boqun
> 
> -------------->8
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index b71ad8d9f1c9..9ae3e089e5c0 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
>  		LOCK_TRACE_SIZE_IN_LONGS;
>  
>  	if (max_entries <= 0) {
> -		if (!debug_locks_off_graph_unlock())
> +		if (!debug_locks_off_graph_unlock()) {
> +			WARN_ON_ONCE(1);
>  			return NULL;
> +		}
>  
>  		print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
>  		dump_stack();
> 
>> I guess I will have to reproduce this myself for further analysis, could
>> you share you .config?
>>
>> Anyway, I think this fix still makes a bit sense, I will send a proper
>> patch so that the problem won't block fs folks.
>>
>> Regards,
>> Boqun
>>
>>> Thanks!
>>>
>>>>
>>>>
>>>>
>>>>> What happens is percpu_rwsem_is_held() is apparently returning a false
>>>>> positive, so this makes __sb_start_write() do a
>>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>>>>> is expected to always succeed, because if the calling task is holding a
>>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>>>>> the semaphore at level 2, since the freeze semaphores are always
>>>>> acquired by increasing level order.
>>>>>
>>>>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>>>>> then its caller sb_start_pagefault() ignores the return value and
>>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>>>>> semaphore was taken, proceed to do their stuff, and later call
>>>>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>>>>> at level 2 despite not having not been able to down_read() the
>>>>> semaphore. This obviously corrupts the semaphore's read_count state, and
>>>>> later causes any task trying to down_write() it to hang forever.
>>>>>
>>>>> After such a hang I ran a drgn script to confirm it:
>>>>>
>>>>> $ cat dump_freeze_sems.py
>>>>> import sys
>>>>> import drgn
>>>>> from drgn import NULL, Object, cast, container_of, execscript, \
>>>>>     reinterpret, sizeof
>>>>> from drgn.helpers.linux import *
>>>>>
>>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>>>>
>>>>> mnt = None
>>>>> for mnt in for_each_mount(prog, dst = mnt_path):
>>>>>     pass
>>>>>
>>>>> if mnt is None:
>>>>>     sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>>>>>     sys.exit(1)
>>>>>
>>>>> def dump_sem(level_enum):
>>>>>     level = level_enum.value_()
>>>>>     sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>>>>>     print(f'freeze semaphore at level {level}, {str(level_enum)}')
>>>>>     print(f'    block {sem.block.counter.value_()}')
>>>>>     for i in for_each_possible_cpu(prog):
>>>>>         read_count = per_cpu_ptr(sem.read_count, i)
>>>>>         print(f'    read_count at cpu {i} = {read_count}')
>>>>>     print()
>>>>>
>>>>> # dump semaphore read counts for all freeze levels (fs.h)
>>>>> dump_sem(prog['SB_FREEZE_WRITE'])
>>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
>>>>> dump_sem(prog['SB_FREEZE_FS'])
>>>>>
>>>>>
>>>>> $ drgn dump_freeze_sems.py
>>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>>>>>     block 1
>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>>>>>
>>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>>>>>     block 1
>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>>>>>
>>>>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>>>>>     block 0
>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>>>>>
>>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
>>>>> sums to -1. The system remains like that for hours at least, with no
>>>>> progress at all.
>>>>>
>>>>> Is there a known regression with that lockdep commit?
>>>>> Anything I can do to help debug it in case it's not obvious?
>>>>>
>>>>> Thanks.
>>>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-04  9:49         ` Filipe Manana
@ 2020-11-04 19:54           ` Filipe Manana
  2020-11-05  1:10             ` Boqun Feng
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2020-11-04 19:54 UTC (permalink / raw)
  To: Boqun Feng; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov



On 04/11/20 09:49, Filipe Manana wrote:
> 
> 
> On 04/11/20 03:44, Boqun Feng wrote:
>> On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
>>> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
>>>>
>>>>
>>>> On 03/11/20 14:08, Boqun Feng wrote:
>>>>> Hi Filipe,
>>>>>
>>>>> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I've recently started to hit a warning followed by tasks hanging after
>>>>>> attempts to freeze a filesystem. A git bisection pointed to the
>>>>>> following commit:
>>>>>>
>>>>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>>>>>> Author: Peter Zijlstra <peterz@infradead.org>
>>>>>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>>>>>
>>>>>>     lockdep: Fix lockdep recursion
>>>>>>
>>>>>> This happens very reliably when running all xfstests with lockdep
>>>>>> enabled, and the tested filesystem is btrfs (haven't tried other
>>>>>> filesystems, but it shouldn't matter). The warning and task hangs always
>>>>>> happen at either test generic/068 or test generic/390, and (oddly)
>>>>>> always have to run all tests for it to trigger, running those tests
>>>>>> individually on an infinite loop doesn't seem to trigger it (at least
>>>>>> for a couple hours).
>>>>>>
>>>>>> The warning triggered is at fs/super.c:__sb_start_write() which always
>>>>>> results later in several tasks hanging on a percpu rw_sem:
>>>>>>
>>>>>> https://pastebin.com/qnLvf94E
>>>>>>
>>>>>
>>>>> In your dmesg, I see line:
>>>>>
>>>>> 	[ 9304.920151] INFO: lockdep is turned off.
>>>>>
>>>>> , that means debug_locks is 0, that usually happens when lockdep find a
>>>>> problem (i.e. a deadlock) and it turns itself off, because a problem is
>>>>> found and it's pointless for lockdep to continue to run.
>>>>>
>>>>> And I haven't found a lockdep splat in your dmesg, do you have a full
>>>>> dmesg so that I can have a look?
>>>>>
>>>>> This may be relevant because in commit 4d004099a66, we have
>>>>>
>>>>> 	@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>>>>> 		unsigned long flags;
>>>>> 		int ret = 0;
>>>>>
>>>>> 	-       if (unlikely(current->lockdep_recursion))
>>>>> 	+       if (unlikely(!lockdep_enabled()))
>>>>> 			return 1; /* avoid false negative lockdep_assert_held() */
>>>>>
>>>>> before this commit lock_is_held_type() and its friends may return false
>>>>> if debug_locks==0, after this commit lock_is_held_type() and its friends
>>>>> will always return true if debug_locks == 0. That could cause the
>>>>> behavior here.
>>>>>
>>>>> In case I'm correct, the following "fix" may be helpful. 
>>>>>
>>>>> Regards,
>>>>> Boqun
>>>>>
>>>>> ----------8
>>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>>> index 3e99dfef8408..c0e27fb949ff 100644
>>>>> --- a/kernel/locking/lockdep.c
>>>>> +++ b/kernel/locking/lockdep.c
>>>>> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>>>>>  	unsigned long flags;
>>>>>  	int ret = 0;
>>>>>  
>>>>> -	if (unlikely(!lockdep_enabled()))
>>>>> +	if (unlikely(debug_locks && !lockdep_enabled()))
>>>>>  		return 1; /* avoid false negative lockdep_assert_held() */
>>>>>  
>>>>>  	raw_local_irq_save(flags);
>>>>
>>>> Boqun, the patch fixes the problem for me!
>>>> You can have Tested-by: Filipe Manana <fdmanana@suse.com>
>>>>
>>>
>>> Thanks. Although I think it still means that we have a lock issue when
>>> running xfstests (because we don't know why debug_locks gets cleared),
>>
>> I might find a place where we could turn lockdep off silently:
>>
>> in print_circular_bug(), we turn off lockdep via
>> debug_locks_off_graph_unlock(), and then we try to save the trace for
>> lockdep splat, however, if we use up the stack_trace buffer (i.e.
>> nr_stack_trace_entries), save_trace() will return NULL and we return
>> silently.
>>
>> Filipe, in order to check whethter that happens, could you share me your
>> /proc/lockdep_stats after the full set of xfstests is finished?
> 
> Here it is:
> 
> $ cat /proc/lockdep_stats
>  lock-classes:                         1831 [max: 8192]
>  direct dependencies:                 17774 [max: 32768]
>  indirect dependencies:               75662
>  all direct dependencies:            325284
>  dependency chains:                   34223 [max: 65536]
>  dependency chain hlocks used:       158129 [max: 327680]
>  dependency chain hlocks lost:            0
>  in-hardirq chains:                      57
>  in-softirq chains:                     658
>  in-process chains:                   33508
>  stack-trace entries:                160748 [max: 524288]
>  number of stack traces:               9237
>  number of stack hash chains:          7076
>  combined max dependencies:      1280780998
>  hardirq-safe locks:                     43
>  hardirq-unsafe locks:                 1337
>  softirq-safe locks:                    179
>  softirq-unsafe locks:                 1236
>  irq-safe locks:                        187
>  irq-unsafe locks:                     1337
>  hardirq-read-safe locks:                 2
>  hardirq-read-unsafe locks:             209
>  softirq-read-safe locks:                 9
>  softirq-read-unsafe locks:             204
>  irq-read-safe locks:                     9
>  irq-read-unsafe locks:                 209
>  uncategorized locks:                   274
>  unused locks:                            0
>  max locking depth:                      15
>  max bfs queue depth:                   337
>  debug_locks:                             0
> 
>  zapped classes:                       2278
>  zapped lock chains:                  17915
>  large chain blocks:                      1
> 
> (That's the result after running all fstests with the previous one line
> patch you sent.)
> 
> My kernel .config:  https://pastebin.com/4xEMvLJ9
> 
> I'll try the debugging patch and let you know the results. It will take
> some 4 hours or so to get back with the result.

Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index b71ad8d9f1c9..b31d4ad482c7 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
                LOCK_TRACE_SIZE_IN_LONGS;

        if (max_entries <= 0) {
-               if (!debug_locks_off_graph_unlock())
+               if (!debug_locks_off_graph_unlock()) {
+                       WARN_ON_ONCE(1);
                        return NULL;
+               }

                print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
                dump_stack();
@@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
        unsigned long flags;
        int ret = 0;

-       if (unlikely(!lockdep_enabled()))
+       if (unlikely(debug_locks && !lockdep_enabled()))
                return 1; /* avoid false negative lockdep_assert_held() */

        raw_local_irq_save(flags);

With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
Unexpected, right?

Should I try something else?

Thanks.


> 
> Thanks!
> 
>>
>> Alternatively, it's also helpful if you can try the following debug
>> diff, with teh full set of xfstests:
>>
>> Thanks! Just trying to understand the real problem.
>>
>> Regards,
>> Boqun
>>
>> -------------->8
>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>> index b71ad8d9f1c9..9ae3e089e5c0 100644
>> --- a/kernel/locking/lockdep.c
>> +++ b/kernel/locking/lockdep.c
>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
>>  		LOCK_TRACE_SIZE_IN_LONGS;
>>  
>>  	if (max_entries <= 0) {
>> -		if (!debug_locks_off_graph_unlock())
>> +		if (!debug_locks_off_graph_unlock()) {
>> +			WARN_ON_ONCE(1);
>>  			return NULL;
>> +		}
>>  
>>  		print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
>>  		dump_stack();
>>
>>> I guess I will have to reproduce this myself for further analysis, could
>>> you share you .config?
>>>
>>> Anyway, I think this fix still makes a bit sense, I will send a proper
>>> patch so that the problem won't block fs folks.
>>>
>>> Regards,
>>> Boqun
>>>
>>>> Thanks!
>>>>
>>>>>
>>>>>
>>>>>
>>>>>> What happens is percpu_rwsem_is_held() is apparently returning a false
>>>>>> positive, so this makes __sb_start_write() do a
>>>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>>>>>> is expected to always succeed, because if the calling task is holding a
>>>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>>>>>> the semaphore at level 2, since the freeze semaphores are always
>>>>>> acquired by increasing level order.
>>>>>>
>>>>>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>>>>>> then its caller sb_start_pagefault() ignores the return value and
>>>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>>>>>> semaphore was taken, proceed to do their stuff, and later call
>>>>>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>>>>>> at level 2 despite not having not been able to down_read() the
>>>>>> semaphore. This obviously corrupts the semaphore's read_count state, and
>>>>>> later causes any task trying to down_write() it to hang forever.
>>>>>>
>>>>>> After such a hang I ran a drgn script to confirm it:
>>>>>>
>>>>>> $ cat dump_freeze_sems.py
>>>>>> import sys
>>>>>> import drgn
>>>>>> from drgn import NULL, Object, cast, container_of, execscript, \
>>>>>>     reinterpret, sizeof
>>>>>> from drgn.helpers.linux import *
>>>>>>
>>>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>>>>>
>>>>>> mnt = None
>>>>>> for mnt in for_each_mount(prog, dst = mnt_path):
>>>>>>     pass
>>>>>>
>>>>>> if mnt is None:
>>>>>>     sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
>>>>>>     sys.exit(1)
>>>>>>
>>>>>> def dump_sem(level_enum):
>>>>>>     level = level_enum.value_()
>>>>>>     sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
>>>>>>     print(f'freeze semaphore at level {level}, {str(level_enum)}')
>>>>>>     print(f'    block {sem.block.counter.value_()}')
>>>>>>     for i in for_each_possible_cpu(prog):
>>>>>>         read_count = per_cpu_ptr(sem.read_count, i)
>>>>>>         print(f'    read_count at cpu {i} = {read_count}')
>>>>>>     print()
>>>>>>
>>>>>> # dump semaphore read counts for all freeze levels (fs.h)
>>>>>> dump_sem(prog['SB_FREEZE_WRITE'])
>>>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
>>>>>> dump_sem(prog['SB_FREEZE_FS'])
>>>>>>
>>>>>>
>>>>>> $ drgn dump_freeze_sems.py
>>>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
>>>>>>     block 1
>>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
>>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
>>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
>>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
>>>>>>
>>>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
>>>>>>     block 1
>>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
>>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
>>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
>>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
>>>>>>
>>>>>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
>>>>>>     block 0
>>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
>>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
>>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
>>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
>>>>>>
>>>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
>>>>>> sums to -1. The system remains like that for hours at least, with no
>>>>>> progress at all.
>>>>>>
>>>>>> Is there a known regression with that lockdep commit?
>>>>>> Anything I can do to help debug it in case it's not obvious?
>>>>>>
>>>>>> Thanks.
>>>>>
>>

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-04 19:54           ` Filipe Manana
@ 2020-11-05  1:10             ` Boqun Feng
  2020-11-09  8:44               ` Boqun Feng
  0 siblings, 1 reply; 26+ messages in thread
From: Boqun Feng @ 2020-11-05  1:10 UTC (permalink / raw)
  To: Filipe Manana; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov

On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote:
[...]
> 
> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index b71ad8d9f1c9..b31d4ad482c7 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
>                 LOCK_TRACE_SIZE_IN_LONGS;
> 
>         if (max_entries <= 0) {
> -               if (!debug_locks_off_graph_unlock())
> +               if (!debug_locks_off_graph_unlock()) {
> +                       WARN_ON_ONCE(1);
>                         return NULL;
> +               }
> 
>                 print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
>                 dump_stack();
> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> lockdep_map *lock, int read)
>         unsigned long flags;
>         int ret = 0;
> 
> -       if (unlikely(!lockdep_enabled()))
> +       if (unlikely(debug_locks && !lockdep_enabled()))
>                 return 1; /* avoid false negative lockdep_assert_held() */
> 
>         raw_local_irq_save(flags);
> 
> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> Unexpected, right?
> 

Kinda, that means we still don't know why lockdep was turned off.

> Should I try something else?
> 

Thanks for trying this. Let me set up the reproducer on my side and see
if I can get more information.

Regards,
Boqun

> Thanks.
> 
> 
> > 
> > Thanks!
> > 
> >>
> >> Alternatively, it's also helpful if you can try the following debug
> >> diff, with teh full set of xfstests:
> >>
> >> Thanks! Just trying to understand the real problem.
> >>
> >> Regards,
> >> Boqun
> >>
> >> -------------->8
> >> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> >> index b71ad8d9f1c9..9ae3e089e5c0 100644
> >> --- a/kernel/locking/lockdep.c
> >> +++ b/kernel/locking/lockdep.c
> >> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> >>  		LOCK_TRACE_SIZE_IN_LONGS;
> >>  
> >>  	if (max_entries <= 0) {
> >> -		if (!debug_locks_off_graph_unlock())
> >> +		if (!debug_locks_off_graph_unlock()) {
> >> +			WARN_ON_ONCE(1);
> >>  			return NULL;
> >> +		}
> >>  
> >>  		print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> >>  		dump_stack();
> >>
> >>> I guess I will have to reproduce this myself for further analysis, could
> >>> you share you .config?
> >>>
> >>> Anyway, I think this fix still makes a bit sense, I will send a proper
> >>> patch so that the problem won't block fs folks.
> >>>
> >>> Regards,
> >>> Boqun
> >>>
> >>>> Thanks!
> >>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> What happens is percpu_rwsem_is_held() is apparently returning a false
> >>>>>> positive, so this makes __sb_start_write() do a
> >>>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> >>>>>> is expected to always succeed, because if the calling task is holding a
> >>>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> >>>>>> the semaphore at level 2, since the freeze semaphores are always
> >>>>>> acquired by increasing level order.
> >>>>>>
> >>>>>> But the try_lock fails, it triggers the warning at __sb_start_write(),
> >>>>>> then its caller sb_start_pagefault() ignores the return value and
> >>>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze
> >>>>>> semaphore was taken, proceed to do their stuff, and later call
> >>>>>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
> >>>>>> at level 2 despite not having not been able to down_read() the
> >>>>>> semaphore. This obviously corrupts the semaphore's read_count state, and
> >>>>>> later causes any task trying to down_write() it to hang forever.
> >>>>>>
> >>>>>> After such a hang I ran a drgn script to confirm it:
> >>>>>>
> >>>>>> $ cat dump_freeze_sems.py
> >>>>>> import sys
> >>>>>> import drgn
> >>>>>> from drgn import NULL, Object, cast, container_of, execscript, \
> >>>>>>     reinterpret, sizeof
> >>>>>> from drgn.helpers.linux import *
> >>>>>>
> >>>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
> >>>>>>
> >>>>>> mnt = None
> >>>>>> for mnt in for_each_mount(prog, dst = mnt_path):
> >>>>>>     pass
> >>>>>>
> >>>>>> if mnt is None:
> >>>>>>     sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
> >>>>>>     sys.exit(1)
> >>>>>>
> >>>>>> def dump_sem(level_enum):
> >>>>>>     level = level_enum.value_()
> >>>>>>     sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
> >>>>>>     print(f'freeze semaphore at level {level}, {str(level_enum)}')
> >>>>>>     print(f'    block {sem.block.counter.value_()}')
> >>>>>>     for i in for_each_possible_cpu(prog):
> >>>>>>         read_count = per_cpu_ptr(sem.read_count, i)
> >>>>>>         print(f'    read_count at cpu {i} = {read_count}')
> >>>>>>     print()
> >>>>>>
> >>>>>> # dump semaphore read counts for all freeze levels (fs.h)
> >>>>>> dump_sem(prog['SB_FREEZE_WRITE'])
> >>>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT'])
> >>>>>> dump_sem(prog['SB_FREEZE_FS'])
> >>>>>>
> >>>>>>
> >>>>>> $ drgn dump_freeze_sems.py
> >>>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
> >>>>>>     block 1
> >>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
> >>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
> >>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
> >>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293
> >>>>>>
> >>>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
> >>>>>>     block 1
> >>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
> >>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
> >>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
> >>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0
> >>>>>>
> >>>>>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
> >>>>>>     block 0
> >>>>>>     read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
> >>>>>>     read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
> >>>>>>     read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
> >>>>>>     read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0
> >>>>>>
> >>>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
> >>>>>> sums to -1. The system remains like that for hours at least, with no
> >>>>>> progress at all.
> >>>>>>
> >>>>>> Is there a known regression with that lockdep commit?
> >>>>>> Anything I can do to help debug it in case it's not obvious?
> >>>>>>
> >>>>>> Thanks.
> >>>>>
> >>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-05  1:10             ` Boqun Feng
@ 2020-11-09  8:44               ` Boqun Feng
  2020-11-09  9:57                 ` Filipe Manana
  0 siblings, 1 reply; 26+ messages in thread
From: Boqun Feng @ 2020-11-09  8:44 UTC (permalink / raw)
  To: Filipe Manana
  Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov, Qu Wenruo

Hi Filipe,

On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
> On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote:
> [...]
> > 
> > Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
> > 
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index b71ad8d9f1c9..b31d4ad482c7 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> >                 LOCK_TRACE_SIZE_IN_LONGS;
> > 
> >         if (max_entries <= 0) {
> > -               if (!debug_locks_off_graph_unlock())
> > +               if (!debug_locks_off_graph_unlock()) {
> > +                       WARN_ON_ONCE(1);
> >                         return NULL;
> > +               }
> > 
> >                 print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> >                 dump_stack();
> > @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> > lockdep_map *lock, int read)
> >         unsigned long flags;
> >         int ret = 0;
> > 
> > -       if (unlikely(!lockdep_enabled()))
> > +       if (unlikely(debug_locks && !lockdep_enabled()))
> >                 return 1; /* avoid false negative lockdep_assert_held() */
> > 
> >         raw_local_irq_save(flags);
> > 
> > With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> > Unexpected, right?
> > 
> 
> Kinda, that means we still don't know why lockdep was turned off.
> 
> > Should I try something else?
> > 
> 
> Thanks for trying this. Let me set up the reproducer on my side, and see
> if I could get more information.
> 

I could hit this with btrfs/187. When we hit it, lockdep reports the
deadlock and turns itself off, and I think this is the root cause of the
original problem you were hitting. I will add some analysis after the
lockdep splat.

[12295.973309] ============================================
[12295.974770] WARNING: possible recursive locking detected
[12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
[12295.974770] --------------------------------------------
[12295.974770] zsh/701247 is trying to acquire lock:
[12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] 
               but task is already holding lock:
[12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] 
               other info that might help us debug this:
[12295.974770]  Possible unsafe locking scenario:

[12295.974770]        CPU0
[12295.974770]        ----
[12295.974770]   lock(&eb->lock);
[12295.974770]   lock(&eb->lock);
[12295.974770] 
                *** DEADLOCK ***

[12295.974770]  May be due to missing lock nesting notation

[12295.974770] 2 locks held by zsh/701247:
[12295.974770]  #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920
[12295.974770]  #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] 
               stack backtrace:
[12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20
[12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[12295.974770] Call Trace:
[12295.974770]  dump_stack+0x8b/0xb0
[12295.974770]  __lock_acquire.cold+0x175/0x2e9
[12295.974770]  lock_acquire+0x15b/0x490
[12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  ? read_block_for_search+0xf4/0x350 [btrfs]
[12295.974770]  _raw_read_lock+0x40/0xa0
[12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  btrfs_search_slot+0x6ac/0xca0 [btrfs]
[12295.974770]  btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
[12295.974770]  btrfs_getxattr+0x67/0x130 [btrfs]
[12295.974770]  __vfs_getxattr+0x53/0x70
[12295.974770]  get_vfs_caps_from_disk+0x68/0x1a0
[12295.974770]  ? sched_clock_cpu+0x114/0x180
[12295.974770]  cap_bprm_creds_from_file+0x181/0x6c0
[12295.974770]  security_bprm_creds_from_file+0x2a/0x40
[12295.974770]  begin_new_exec+0xf4/0xc40
[12295.974770]  ? load_elf_phdrs+0x6b/0xb0
[12295.974770]  load_elf_binary+0x66b/0x1620
[12295.974770]  ? read_hv_sched_clock_tsc+0x5/0x20
[12295.974770]  ? sched_clock+0x5/0x10
[12295.974770]  ? sched_clock_local+0x12/0x80
[12295.974770]  ? sched_clock_cpu+0x114/0x180
[12295.974770]  bprm_execve+0x3ce/0x920
[12295.974770]  do_execveat_common+0x1b0/0x1f0
[12295.974770]  __x64_sys_execve+0x39/0x50
[12295.974770]  do_syscall_64+0x33/0x80
[12295.974770]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[12295.974770] RIP: 0033:0x7f6aaefc13cb
[12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48
[12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
[12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb
[12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280
[12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001
[12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280

The deadlock scenario reported by this splat is as follows:

	CPU 0					CPU 2
	=====					=====
	btrfs_search_slot():
	  btrfs_tree_read_lock_atomic():
	    read_lock(&eb->lock);		<a random writer>
						write_lock(&eb->lock);
						// waiting for the lock
	  ...
	  btrfs_tree_read_lock_atomic():
	    read_lock(&eb->lock); // blocked because of fairness.

In short, you cannot use nested read_lock() calls on the same lock.
However, I'm not sure whether this is a real problem in practice, because
I don't think btrfs_search_slot() can pick the same extent buffer twice.
I'm also copying the author of the code for more information.

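(Purely as an illustration of the rule above, and not the btrfs code, a
minimal sketch with a plain rwlock_t standing in for eb->lock:)

	#include <linux/spinlock.h>

	static DEFINE_RWLOCK(lock);	/* stands in for eb->lock */

	static void nested_reader(void)
	{
		read_lock(&lock);	/* outer read lock is taken */
		/*
		 * If another CPU now calls write_lock(&lock), that writer
		 * queues up and, for fairness, blocks any new readers ...
		 */
		read_lock(&lock);	/* ... so this nested read lock on the
					 * same lock blocks behind the queued
					 * writer and the task deadlocks */
		read_unlock(&lock);
		read_unlock(&lock);
	}
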
Regards,
Boqun

> Regards,
> Boqun
> 
> > Thanks.
> > 
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-09  8:44               ` Boqun Feng
@ 2020-11-09  9:57                 ` Filipe Manana
  2020-11-10  1:41                   ` Boqun Feng
  0 siblings, 1 reply; 26+ messages in thread
From: Filipe Manana @ 2020-11-09  9:57 UTC (permalink / raw)
  To: Boqun Feng
  Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov, Qu Wenruo,
	darrick.wong



On 09/11/20 08:44, Boqun Feng wrote:
> Hi Filipe,
> 
> On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
>> On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote:
>> [...]
>>>
>>> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
>>>
>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>> index b71ad8d9f1c9..b31d4ad482c7 100644
>>> --- a/kernel/locking/lockdep.c
>>> +++ b/kernel/locking/lockdep.c
>>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
>>>                 LOCK_TRACE_SIZE_IN_LONGS;
>>>
>>>         if (max_entries <= 0) {
>>> -               if (!debug_locks_off_graph_unlock())
>>> +               if (!debug_locks_off_graph_unlock()) {
>>> +                       WARN_ON_ONCE(1);
>>>                         return NULL;
>>> +               }
>>>
>>>                 print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
>>>                 dump_stack();
>>> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
>>> lockdep_map *lock, int read)
>>>         unsigned long flags;
>>>         int ret = 0;
>>>
>>> -       if (unlikely(!lockdep_enabled()))
>>> +       if (unlikely(debug_locks && !lockdep_enabled()))
>>>                 return 1; /* avoid false negative lockdep_assert_held() */
>>>
>>>         raw_local_irq_save(flags);
>>>
>>> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
>>> Unexpected, right?
>>>
>>
>> Kinda, that means we still don't know why lockdep was turned off.
>>
>>> Should I try something else?
>>>
>>
>> Thanks for trying this. Let me set up the reproducer on my side, and see
>> if I could get more information.
>>
> 
> I could hit this with btrfs/187. When we hit it, lockdep reports the
> deadlock and turns itself off, and I think this is the root cause of the
> original problem you hit. I will add some analysis after the lockdep
> splat.
> 
> [12295.973309] ============================================
> [12295.974770] WARNING: possible recursive locking detected
> [12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
> [12295.974770] --------------------------------------------
> [12295.974770] zsh/701247 is trying to acquire lock:
> [12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] 
>                but task is already holding lock:
> [12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] 
>                other info that might help us debug this:
> [12295.974770]  Possible unsafe locking scenario:
> 
> [12295.974770]        CPU0
> [12295.974770]        ----
> [12295.974770]   lock(&eb->lock);
> [12295.974770]   lock(&eb->lock);
> [12295.974770] 
>                 *** DEADLOCK ***
> 
> [12295.974770]  May be due to missing lock nesting notation
> 
> [12295.974770] 2 locks held by zsh/701247:
> [12295.974770]  #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920
> [12295.974770]  #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] 
>                stack backtrace:
> [12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20
> [12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
> [12295.974770] Call Trace:
> [12295.974770]  dump_stack+0x8b/0xb0
> [12295.974770]  __lock_acquire.cold+0x175/0x2e9
> [12295.974770]  lock_acquire+0x15b/0x490
> [12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]  ? read_block_for_search+0xf4/0x350 [btrfs]
> [12295.974770]  _raw_read_lock+0x40/0xa0
> [12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]  btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]  btrfs_search_slot+0x6ac/0xca0 [btrfs]
> [12295.974770]  btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
> [12295.974770]  btrfs_getxattr+0x67/0x130 [btrfs]
> [12295.974770]  __vfs_getxattr+0x53/0x70
> [12295.974770]  get_vfs_caps_from_disk+0x68/0x1a0
> [12295.974770]  ? sched_clock_cpu+0x114/0x180
> [12295.974770]  cap_bprm_creds_from_file+0x181/0x6c0
> [12295.974770]  security_bprm_creds_from_file+0x2a/0x40
> [12295.974770]  begin_new_exec+0xf4/0xc40
> [12295.974770]  ? load_elf_phdrs+0x6b/0xb0
> [12295.974770]  load_elf_binary+0x66b/0x1620
> [12295.974770]  ? read_hv_sched_clock_tsc+0x5/0x20
> [12295.974770]  ? sched_clock+0x5/0x10
> [12295.974770]  ? sched_clock_local+0x12/0x80
> [12295.974770]  ? sched_clock_cpu+0x114/0x180
> [12295.974770]  bprm_execve+0x3ce/0x920
> [12295.974770]  do_execveat_common+0x1b0/0x1f0
> [12295.974770]  __x64_sys_execve+0x39/0x50
> [12295.974770]  do_syscall_64+0x33/0x80
> [12295.974770]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [12295.974770] RIP: 0033:0x7f6aaefc13cb
> [12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48
> [12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
> [12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb
> [12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280
> [12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001
> [12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280
> 
> The deadlock scenario reported by this splat is as follows:
> 
> 	CPU 0					CPU 2
> 	=====					=====
> 	btrfs_search_slot():
> 	  btrfs_tree_read_lock_atomic():
> 	    read_lock(&eb->lock);		<a random writer>
> 						write_lock(&eb->lock);
> 						// waiting for the lock
> 	  ...
> 	  btrfs_tree_read_lock_atomic():
> 	    read_lock(&eb->lock); // blocked because of fairness.
> 
> In short, you cannot use nested read_lock() calls on the same lock.
> However, I'm not sure whether this is a real problem in practice, because
> I don't think btrfs_search_slot() can pick the same extent buffer twice.
> I'm also copying the author of the code for more information.

Ah yes. We have several lockdep splats for which the fixes are not in
5.10-rcs yet. Some may already be in the integration branch [1], and
others are not yet in any tree but can be found on the btrfs mailing list.

For that particular one, it's fixed by the following patch:

https://lore.kernel.org/linux-btrfs/1cee2922a32c305056a9559ccf7aede49777beae.1604591048.git.josef@toxicpanda.com/

(It belongs to a series)

You'll likely see more (and different) lockdep splats; this is due to an
in-progress transition from the custom btree locks to rw semaphores, and
some preparatory work for that is already in 5.10-rcs.

Btw, I could hit the fs freeze deadlock even when lockdep didn't produce
any previous splat.

Jan was hitting it too with ext4, and Darrick with xfs - though I can't
tell if they hit any lockdep splat before hitting the freeze deadlock.

Thanks for reporting it.

[1] https://github.com/kdave/btrfs-devel/commits/misc-next

> 
> Regards,
> Boqun
> 
>> Regards,
>> Boqun
>>
>>> Thanks.
>>>
>>>
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-11-09  9:57                 ` Filipe Manana
@ 2020-11-10  1:41                   ` Boqun Feng
  0 siblings, 0 replies; 26+ messages in thread
From: Boqun Feng @ 2020-11-10  1:41 UTC (permalink / raw)
  To: Filipe Manana
  Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov, Qu Wenruo,
	darrick.wong

On Mon, Nov 09, 2020 at 09:57:05AM +0000, Filipe Manana wrote:
> 
> 
> On 09/11/20 08:44, Boqun Feng wrote:
> > Hi Filipe,
> > 
> > On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
> >> On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote:
> >> [...]
> >>>
> >>> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
> >>>
> >>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> >>> index b71ad8d9f1c9..b31d4ad482c7 100644
> >>> --- a/kernel/locking/lockdep.c
> >>> +++ b/kernel/locking/lockdep.c
> >>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
> >>>                 LOCK_TRACE_SIZE_IN_LONGS;
> >>>
> >>>         if (max_entries <= 0) {
> >>> -               if (!debug_locks_off_graph_unlock())
> >>> +               if (!debug_locks_off_graph_unlock()) {
> >>> +                       WARN_ON_ONCE(1);
> >>>                         return NULL;
> >>> +               }
> >>>
> >>>                 print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
> >>>                 dump_stack();
> >>> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
> >>> lockdep_map *lock, int read)
> >>>         unsigned long flags;
> >>>         int ret = 0;
> >>>
> >>> -       if (unlikely(!lockdep_enabled()))
> >>> +       if (unlikely(debug_locks && !lockdep_enabled()))
> >>>                 return 1; /* avoid false negative lockdep_assert_held() */
> >>>
> >>>         raw_local_irq_save(flags);
> >>>
> >>> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
> >>> Unexpected, right?
> >>>
> >>
> >> Kinda, that means we still don't know why lockdep was turned off.
> >>
> >>> Should I try something else?
> >>>
> >>
> >> Thanks for trying this. Let me set up the reproducer on my side, and see
> >> if I could get more information.
> >>
> > 
> > I could hit this with btrfs/187. When we hit it, lockdep reports the
> > deadlock and turns itself off, and I think this is the root cause of the
> > original problem you hit. I will add some analysis after the lockdep
> > splat.
> > 
> > [12295.973309] ============================================
> > [12295.974770] WARNING: possible recursive locking detected
> > [12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
> > [12295.974770] --------------------------------------------
> > [12295.974770] zsh/701247 is trying to acquire lock:
> > [12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] 
> >                but task is already holding lock:
> > [12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] 
> >                other info that might help us debug this:
> > [12295.974770]  Possible unsafe locking scenario:
> > 
> > [12295.974770]        CPU0
> > [12295.974770]        ----
> > [12295.974770]   lock(&eb->lock);
> > [12295.974770]   lock(&eb->lock);
> > [12295.974770] 
> >                 *** DEADLOCK ***
> > 
> > [12295.974770]  May be due to missing lock nesting notation
> > 
> > [12295.974770] 2 locks held by zsh/701247:
> > [12295.974770]  #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920
> > [12295.974770]  #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770] 
> >                stack backtrace:
> > [12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20
> > [12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
> > [12295.974770] Call Trace:
> > [12295.974770]  dump_stack+0x8b/0xb0
> > [12295.974770]  __lock_acquire.cold+0x175/0x2e9
> > [12295.974770]  lock_acquire+0x15b/0x490
> > [12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770]  ? read_block_for_search+0xf4/0x350 [btrfs]
> > [12295.974770]  _raw_read_lock+0x40/0xa0
> > [12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770]  btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> > [12295.974770]  btrfs_search_slot+0x6ac/0xca0 [btrfs]
> > [12295.974770]  btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
> > [12295.974770]  btrfs_getxattr+0x67/0x130 [btrfs]
> > [12295.974770]  __vfs_getxattr+0x53/0x70
> > [12295.974770]  get_vfs_caps_from_disk+0x68/0x1a0
> > [12295.974770]  ? sched_clock_cpu+0x114/0x180
> > [12295.974770]  cap_bprm_creds_from_file+0x181/0x6c0
> > [12295.974770]  security_bprm_creds_from_file+0x2a/0x40
> > [12295.974770]  begin_new_exec+0xf4/0xc40
> > [12295.974770]  ? load_elf_phdrs+0x6b/0xb0
> > [12295.974770]  load_elf_binary+0x66b/0x1620
> > [12295.974770]  ? read_hv_sched_clock_tsc+0x5/0x20
> > [12295.974770]  ? sched_clock+0x5/0x10
> > [12295.974770]  ? sched_clock_local+0x12/0x80
> > [12295.974770]  ? sched_clock_cpu+0x114/0x180
> > [12295.974770]  bprm_execve+0x3ce/0x920
> > [12295.974770]  do_execveat_common+0x1b0/0x1f0
> > [12295.974770]  __x64_sys_execve+0x39/0x50
> > [12295.974770]  do_syscall_64+0x33/0x80
> > [12295.974770]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > [12295.974770] RIP: 0033:0x7f6aaefc13cb
> > [12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48
> > [12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
> > [12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb
> > [12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280
> > [12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001
> > [12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280
> > 
> > The deadlock scenario reported by this splat is as follows:
> > 
> > 	CPU 0					CPU 2
> > 	=====					=====
> > 	btrfs_search_slot():
> > 	  btrfs_tree_read_lock_atomic():
> > 	    read_lock(&eb->lock);		<a random writer>
> > 						write_lock(&eb->lock);
> > 						// waiting for the lock
> > 	  ...
> > 	  btrfs_tree_read_lock_atomic():
> > 	    read_lock(&eb->lock); // blocked because of fairness.
> > 
> > In short, you cannot use nested read_lock() calls on the same lock.
> > However, I'm not sure whether this is a real problem in practice, because
> > I don't think btrfs_search_slot() can pick the same extent buffer twice.
> > I'm also copying the author of the code for more information.
> 
> Ah yes. We have several lockdep splats for which the fixes are not in
> 5.10-rcs yet. Some may already be in the integration branch [1], and
> others are not yet in any tree but can be found on the btrfs mailing list.
> 
> For that particular one, it's fixed by the following patch:
> 
> https://lore.kernel.org/linux-btrfs/1cee2922a32c305056a9559ccf7aede49777beae.1604591048.git.josef@toxicpanda.com/
> 
> (It belongs to a series)
> 
> You'll likely see more (and different) lockdep splats; this is due to an
> in-progress transition from the custom btree locks to rw semaphores, and
> some preparatory work for that is already in 5.10-rcs.
> 

Thanks for the information!

> Btw, I could hit the fs freeze deadlock even when lockdep didn't produce
> any previous splat.
> 

And this is the weird part, which means lockdep got turned off silently.
That probably means there is a bug (either in fs code or in lockdep) that
we don't know about, and although my previous "fix" works per your
testing, it may hide the real problem...

So I just sent out another fix, which is similar to my previous one but
leaves a warning, so that once we hit the problem we are warned and don't
miss the silent lockdep turn-off.

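(To illustrate the idea, a rough sketch only, on top of the earlier debug
diff in lock_is_held_type(); the exact shape and placement of the warning
are my assumptions, not the actual patch:)

-       if (unlikely(!lockdep_enabled()))
+       /*
+        * Sketch: skip the unconditional "held" shortcut once lockdep has
+        * been turned off at runtime, and warn once so the silent turn-off
+        * is not missed.
+        */
+       if (unlikely(!debug_locks))
+               WARN_ON_ONCE(1);
+       if (unlikely(debug_locks && !lockdep_enabled()))
                return 1; /* avoid false negative lockdep_assert_held() */
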
Regards,
Boqun

> Jan was hitting it too with ext4, and Darrick with xfs - though I can't
> tell if they hit any lockdep splat before hitting the freeze deadlock.
> 
> Thanks for reporting it.
> 
> [1] https://github.com/kdave/btrfs-devel/commits/misc-next
> 
> > 
> > Regards,
> > Boqun
> > 
> >> Regards,
> >> Boqun
> >>
> >>> Thanks.
> >>>
> >>>
> > 

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-11-10  1:41 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-26 11:26 possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana
2020-10-26 11:40 ` Peter Zijlstra
2020-10-26 11:55   ` Jan Kara
2020-10-26 11:59     ` Filipe Manana
2020-10-26 12:30     ` Peter Zijlstra
2020-10-26 11:56   ` Filipe Manana
2020-10-26 12:55     ` Peter Zijlstra
2020-10-26 13:06       ` Filipe Manana
2020-10-26 15:22       ` Peter Zijlstra
2020-10-27  9:49         ` Anatoly Pugachev
2020-10-31 11:30         ` [tip: locking/urgent] locking/lockdep: Remove more raw_cpu_read() usage tip-bot2 for Peter Zijlstra
2020-11-02 17:58         ` possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana
2020-11-03 10:15           ` Jan Kara
2020-11-03 10:22             ` Filipe Manana
2020-10-26 20:35       ` David Sterba
2020-11-03 14:08 ` Boqun Feng
2020-11-03 14:24   ` Filipe Manana
2020-11-03 19:44   ` Filipe Manana
2020-11-04  2:22     ` Boqun Feng
2020-11-04  3:44       ` Boqun Feng
2020-11-04  9:49         ` Filipe Manana
2020-11-04 19:54           ` Filipe Manana
2020-11-05  1:10             ` Boqun Feng
2020-11-09  8:44               ` Boqun Feng
2020-11-09  9:57                 ` Filipe Manana
2020-11-10  1:41                   ` Boqun Feng
