* possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
@ 2020-10-26 11:26 Filipe Manana
  2020-10-26 11:40 ` Peter Zijlstra
  2020-11-03 14:08 ` Boqun Feng
  0 siblings, 2 replies; 26+ messages in thread
From: Filipe Manana @ 2020-10-26 11:26 UTC (permalink / raw)
  To: LKML; +Cc: peterz, Jan Kara, David Sterba

Hello,

I've recently started to hit a warning followed by tasks hanging after
attempts to freeze a filesystem. A git bisection pointed to the
following commit:

commit 4d004099a668c41522242aa146a38cc4eb59cb1e
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Oct 2 11:04:21 2020 +0200

    lockdep: Fix lockdep recursion

This happens very reliably when running all xfstests with lockdep
enabled, and the tested filesystem is btrfs (haven't tried other
filesystems, but it shouldn't matter). The warning and task hangs always
happen at either test generic/068 or test generic/390, and (oddly) always
have to run all tests for it to trigger, running those tests individually
on an infinite loop doesn't seem to trigger it (at least for a couple
hours).

The warning triggered is at fs/super.c:__sb_start_write() which always
results later in several tasks hanging on a percpu rw_sem:

https://pastebin.com/qnLvf94E

What happens is percpu_rwsem_is_held() is apparently returning a false
positive, so this makes __sb_start_write() do a percpu_down_read_trylock()
on a percpu_rw_sem at a higher level, which is expected to always succeed,
because if the calling task is holding a freeze percpu_rw_sem at level 1,
it's supposed to be able to try_lock the semaphore at level 2, since the
freeze semaphores are always acquired by increasing level order.
But when the try_lock fails, it triggers the warning at __sb_start_write(),
then its caller sb_start_pagefault() ignores the return value and callers
such as btrfs_page_mkwrite() make the assumption the freeze semaphore was
taken, proceed to do their stuff, and later call sb_end_pagefault(), which
will do an up_read() on the percpu_rwsem at level 2 despite not having
been able to down_read() the semaphore. This obviously corrupts the
semaphore's read_count state, and later causes any task trying to
down_write() it to hang forever.

After such a hang I ran a drgn script to confirm it:

$ cat dump_freeze_sems.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
    reinterpret, sizeof
from drgn.helpers.linux import *

mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'

mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
    pass
if mnt is None:
    sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
    sys.exit(1)

def dump_sem(level_enum):
    level = level_enum.value_()
    sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
    print(f'freeze semaphore at level {level}, {str(level_enum)}')
    print(f'  block {sem.block.counter.value_()}')
    for i in for_each_possible_cpu(prog):
        read_count = per_cpu_ptr(sem.read_count, i)
        print(f'  read_count at cpu {i} = {read_count}')
    print()

# dump semaphore read counts for all freeze levels (fs.h)
dump_sem(prog['SB_FREEZE_WRITE'])
dump_sem(prog['SB_FREEZE_PAGEFAULT'])
dump_sem(prog['SB_FREEZE_FS'])

$ drgn dump_freeze_sems.py
freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE
  block 1
  read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3
  read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293
  read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3
  read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293

freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT
  block 1
  read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0
  read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295
  read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0
  read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0

freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS
  block 0
  read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0
  read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0
  read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0
  read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0

At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
sums to -1. The system remains like that for hours at least, with no
progress at all.

Is there a known regression with that lockdep commit?
Anything I can do to help debug it in case it's not obvious?

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread
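[Editor's note: the unbalanced bookkeeping described above can be sketched with a small Python model. All names here are hypothetical illustrations, not the kernel's percpu_rwsem implementation; it only models the arithmetic of a failed trylock whose result is ignored, followed by an unconditional up_read().]

```python
# Hypothetical model of the per-CPU read_count corruption described
# above -- NOT kernel code. A failed percpu_down_read_trylock() whose
# return value is ignored, followed by sb_end_pagefault() doing an
# unconditional up_read(), leaves the per-CPU counters summing to -1.
from collections import defaultdict

class PercpuRwsemModel:
    def __init__(self, lockdep_false_positive=False):
        self.read_count = defaultdict(int)  # per-CPU reader counters
        self.false_positive = lockdep_false_positive

    def down_read_trylock(self, cpu):
        # Models __sb_start_write(): when percpu_rwsem_is_held() wrongly
        # reports the lower-level sem as held, the trylock path is taken
        # and fails, so no increment happens.
        if self.false_positive:
            return False
        self.read_count[cpu] += 1
        return True

    def up_read(self, cpu):
        # Models sb_end_pagefault(): decrements unconditionally, even
        # though the matching down_read never succeeded.
        self.read_count[cpu] -= 1

    def total(self):
        # A writer (freeze) can only proceed once the sum over all
        # CPUs reaches 0 -- with a sum of -1 it waits forever.
        return sum(self.read_count.values())

sem = PercpuRwsemModel(lockdep_false_positive=True)
acquired = sem.down_read_trylock(cpu=1)  # fails; warning is ignored
sem.up_read(cpu=1)                       # unmatched decrement
print(acquired, sem.total())             # False -1
```

This reproduces exactly the state the drgn dump shows at level 2 (SB_FREEZE_PAGEFAULT): counters summing to -1 while `block` is set, so the freezer's down_write() never completes.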
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 11:26 possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana @ 2020-10-26 11:40 ` Peter Zijlstra 2020-10-26 11:55 ` Jan Kara 2020-10-26 11:56 ` Filipe Manana 2020-11-03 14:08 ` Boqun Feng 1 sibling, 2 replies; 26+ messages in thread From: Peter Zijlstra @ 2020-10-26 11:40 UTC (permalink / raw) To: Filipe Manana; +Cc: LKML, Jan Kara, David Sterba On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: > Hello, > > I've recently started to hit a warning followed by tasks hanging after > attempts to freeze a filesystem. A git bisection pointed to the > following commit: > > commit 4d004099a668c41522242aa146a38cc4eb59cb1e > Author: Peter Zijlstra <peterz@infradead.org> > Date: Fri Oct 2 11:04:21 2020 +0200 > > lockdep: Fix lockdep recursion > > This happens very reliably when running all xfstests with lockdep > enabled, and the tested filesystem is btrfs (haven't tried other > filesystems, but it shouldn't matter). The warning and task hangs always > happen at either test generic/068 or test generic/390, and (oddly) > always have to run all tests for it to trigger, running those tests > individually on an infinite loop doesn't seem to trigger it (at least > for a couple hours). > > The warning triggered is at fs/super.c:__sb_start_write() which always > results later in several tasks hanging on a percpu rw_sem: > > https://pastebin.com/qnLvf94E > > What happens is percpu_rwsem_is_held() is apparently returning a false > positive, That smells like the same issue reported here: https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net Make sure you have commit: f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 11:40 ` Peter Zijlstra @ 2020-10-26 11:55 ` Jan Kara 2020-10-26 11:59 ` Filipe Manana 2020-10-26 12:30 ` Peter Zijlstra 2020-10-26 11:56 ` Filipe Manana 1 sibling, 2 replies; 26+ messages in thread From: Jan Kara @ 2020-10-26 11:55 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Filipe Manana, LKML, Jan Kara, David Sterba On Mon 26-10-20 12:40:09, Peter Zijlstra wrote: > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: > > Hello, > > > > I've recently started to hit a warning followed by tasks hanging after > > attempts to freeze a filesystem. A git bisection pointed to the > > following commit: > > > > commit 4d004099a668c41522242aa146a38cc4eb59cb1e > > Author: Peter Zijlstra <peterz@infradead.org> > > Date: Fri Oct 2 11:04:21 2020 +0200 > > > > lockdep: Fix lockdep recursion > > > > This happens very reliably when running all xfstests with lockdep > > enabled, and the tested filesystem is btrfs (haven't tried other > > filesystems, but it shouldn't matter). The warning and task hangs always > > happen at either test generic/068 or test generic/390, and (oddly) > > always have to run all tests for it to trigger, running those tests > > individually on an infinite loop doesn't seem to trigger it (at least > > for a couple hours). > > > > The warning triggered is at fs/super.c:__sb_start_write() which always > > results later in several tasks hanging on a percpu rw_sem: > > > > https://pastebin.com/qnLvf94E > > > > What happens is percpu_rwsem_is_held() is apparently returning a false > > positive, > > That smells like the same issue reported here: > > https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net > > Make sure you have commit: > > f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? 
Hum, I am at 5.10-rc1 now and above mentioned commit doesn't appear to be there? Also googling for the title doesn't help... Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 11:55 ` Jan Kara @ 2020-10-26 11:59 ` Filipe Manana 2020-10-26 12:30 ` Peter Zijlstra 1 sibling, 0 replies; 26+ messages in thread From: Filipe Manana @ 2020-10-26 11:59 UTC (permalink / raw) To: Jan Kara, Peter Zijlstra; +Cc: LKML, David Sterba On 26/10/20 11:55, Jan Kara wrote: > On Mon 26-10-20 12:40:09, Peter Zijlstra wrote: >> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: >>> Hello, >>> >>> I've recently started to hit a warning followed by tasks hanging after >>> attempts to freeze a filesystem. A git bisection pointed to the >>> following commit: >>> >>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e >>> Author: Peter Zijlstra <peterz@infradead.org> >>> Date: Fri Oct 2 11:04:21 2020 +0200 >>> >>> lockdep: Fix lockdep recursion >>> >>> This happens very reliably when running all xfstests with lockdep >>> enabled, and the tested filesystem is btrfs (haven't tried other >>> filesystems, but it shouldn't matter). The warning and task hangs always >>> happen at either test generic/068 or test generic/390, and (oddly) >>> always have to run all tests for it to trigger, running those tests >>> individually on an infinite loop doesn't seem to trigger it (at least >>> for a couple hours). >>> >>> The warning triggered is at fs/super.c:__sb_start_write() which always >>> results later in several tasks hanging on a percpu rw_sem: >>> >>> https://pastebin.com/qnLvf94E >>> >>> What happens is percpu_rwsem_is_held() is apparently returning a false >>> positive, >> >> That smells like the same issue reported here: >> >> https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net >> >> Make sure you have commit: >> >> f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") >> >> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? 
> > Hum, I am at 5.10-rc1 now and above mentioned commit doesn't appear to be > there? Also googling for the title doesn't help... Same here, can't find in Linus' tree neither through google. > > Honza > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 11:55 ` Jan Kara 2020-10-26 11:59 ` Filipe Manana @ 2020-10-26 12:30 ` Peter Zijlstra 1 sibling, 0 replies; 26+ messages in thread From: Peter Zijlstra @ 2020-10-26 12:30 UTC (permalink / raw) To: Jan Kara; +Cc: Filipe Manana, LKML, David Sterba On Mon, Oct 26, 2020 at 12:55:41PM +0100, Jan Kara wrote: > > Make sure you have commit: > > > > f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") > > > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? > > Hum, I am at 5.10-rc1 now and above mentioned commit doesn't appear to be > there? Also googling for the title doesn't help... *groan*... I seem to have forgotten to push it out to tip/locking/urgent on Friday :/ Find it here: git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git locking/urgent ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 11:40 ` Peter Zijlstra 2020-10-26 11:55 ` Jan Kara @ 2020-10-26 11:56 ` Filipe Manana 2020-10-26 12:55 ` Peter Zijlstra 1 sibling, 1 reply; 26+ messages in thread From: Filipe Manana @ 2020-10-26 11:56 UTC (permalink / raw) To: Peter Zijlstra; +Cc: LKML, Jan Kara, David Sterba On 26/10/20 11:40, Peter Zijlstra wrote: > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: >> Hello, >> >> I've recently started to hit a warning followed by tasks hanging after >> attempts to freeze a filesystem. A git bisection pointed to the >> following commit: >> >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e >> Author: Peter Zijlstra <peterz@infradead.org> >> Date: Fri Oct 2 11:04:21 2020 +0200 >> >> lockdep: Fix lockdep recursion >> >> This happens very reliably when running all xfstests with lockdep >> enabled, and the tested filesystem is btrfs (haven't tried other >> filesystems, but it shouldn't matter). The warning and task hangs always >> happen at either test generic/068 or test generic/390, and (oddly) >> always have to run all tests for it to trigger, running those tests >> individually on an infinite loop doesn't seem to trigger it (at least >> for a couple hours). >> >> The warning triggered is at fs/super.c:__sb_start_write() which always >> results later in several tasks hanging on a percpu rw_sem: >> >> https://pastebin.com/qnLvf94E >> >> What happens is percpu_rwsem_is_held() is apparently returning a false >> positive, > > That smells like the same issue reported here: > > https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net > > Make sure you have commit: > > f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? Yes, CONFIG_DEBUG_PREEMPT is enabled. 
I'll try with that commit and let you know, however it's gonna take a few hours to build a kernel and run all fstests (on that test box it takes over 3 hours) to confirm that fixes the issue. Thanks for the quick reply! > > > > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 11:56 ` Filipe Manana @ 2020-10-26 12:55 ` Peter Zijlstra 2020-10-26 13:06 ` Filipe Manana ` (2 more replies) 0 siblings, 3 replies; 26+ messages in thread From: Peter Zijlstra @ 2020-10-26 12:55 UTC (permalink / raw) To: Filipe Manana; +Cc: LKML, Jan Kara, David Sterba On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote: > > That smells like the same issue reported here: > > > > https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net > > > > Make sure you have commit: > > > > f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") > > > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? > > Yes, CONFIG_DEBUG_PREEMPT is enabled. Bummer :/ > I'll try with that commit and let you know, however it's gonna take a > few hours to build a kernel and run all fstests (on that test box it > takes over 3 hours) to confirm that fixes the issue. *ouch*, 3 hours is painful. How long to make it sick with the current kernel? quicker I would hope? > Thanks for the quick reply! Anyway, I don't think that commit can actually explain the issue :/ The false positive on lockdep_assert_held() happens when the recursion count is !0, however we _should_ be having IRQs disabled when lockdep_recursion > 0, so that should never be observable. My hope was that DEBUG_PREEMPT would trigger on one of the __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would then be a clear violation. And you're seeing this on x86, right? Let me puzzle moar.. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")
  2020-10-26 12:55 ` Peter Zijlstra
@ 2020-10-26 13:06 ` Filipe Manana
  2020-10-26 15:22 ` Peter Zijlstra
  2020-10-26 20:35 ` David Sterba
  2 siblings, 0 replies; 26+ messages in thread
From: Filipe Manana @ 2020-10-26 13:06 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: LKML, Jan Kara, David Sterba

On 26/10/20 12:55, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
>>> That smells like the same issue reported here:
>>>
>>>   https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net
>>>
>>> Make sure you have commit:
>>>
>>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>>
>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
>>
>> Yes, CONFIG_DEBUG_PREEMPT is enabled.
>
> Bummer :/
>
>> I'll try with that commit and let you know, however it's gonna take a
>> few hours to build a kernel and run all fstests (on that test box it
>> takes over 3 hours) to confirm that fixes the issue.
>
> *ouch*, 3 hours is painful. How long to make it sick with the current
> kernel? quicker I would hope?

If generic/068 triggers the bug, then it's about 1 hour. If that passes,
which rarely happens, I have to wait until it gets into generic/390,
which is over 2 hours.

It sucks that running those tests alone never triggers the issue, but
running all fstests (first the btrfs specific ones, followed by the
generic ones) reliably triggers the bug, almost always at generic/068;
when that passes, it's triggered by generic/390. To confirm everything
is ok, I let all tests run (the last generic test is 612).

>
>> Thanks for the quick reply!
>
> Anyway, I don't think that commit can actually explain the issue :/
>
> The false positive on lockdep_assert_held() happens when the recursion
> count is !0, however we _should_ be having IRQs disabled when
> lockdep_recursion > 0, so that should never be observable.
> > My hope was that DEBUG_PREEMPT would trigger on one of the > __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would > then be a clear violation. > > And you're seeing this on x86, right? Right. It's in a qemu vm on x86, with '-cpu host' passed to qemu and kvm enabled. Thanks. > > Let me puzzle moar.. > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 12:55 ` Peter Zijlstra 2020-10-26 13:06 ` Filipe Manana @ 2020-10-26 15:22 ` Peter Zijlstra 2020-10-27 9:49 ` Anatoly Pugachev ` (2 more replies) 2020-10-26 20:35 ` David Sterba 2 siblings, 3 replies; 26+ messages in thread From: Peter Zijlstra @ 2020-10-26 15:22 UTC (permalink / raw) To: Filipe Manana; +Cc: LKML, Jan Kara, David Sterba, matorola, mingo On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote: > On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote: > > > That smells like the same issue reported here: > > > > > > https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net > > > > > > Make sure you have commit: > > > > > > f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") > > > > > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? > > > > Yes, CONFIG_DEBUG_PREEMPT is enabled. > > Bummer :/ > > > I'll try with that commit and let you know, however it's gonna take a > > few hours to build a kernel and run all fstests (on that test box it > > takes over 3 hours) to confirm that fixes the issue. > > *ouch*, 3 hours is painful. How long to make it sick with the current > kernel? quicker I would hope? > > > Thanks for the quick reply! > > Anyway, I don't think that commit can actually explain the issue :/ > > The false positive on lockdep_assert_held() happens when the recursion > count is !0, however we _should_ be having IRQs disabled when > lockdep_recursion > 0, so that should never be observable. > > My hope was that DEBUG_PREEMPT would trigger on one of the > __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would > then be a clear violation. > > And you're seeing this on x86, right? > > Let me puzzle moar.. 
So I might have an explanation for the Sparc64 fail, but that can't
explain x86 :/

I initially thought raw_cpu_read() was OK, since if it is !0 we have
IRQs disabled and can't get migrated, so if we get migrated both CPUs
must have 0 and it doesn't matter which 0 we read.

And while that is true, it isn't the whole story: on pretty much all
architectures (except x86) this can result in computing the address for
one CPU, getting migrated, the old CPU continuing execution with another
task (possibly setting recursion) and then the new CPU reading the value
of the old CPU, which is no longer 0.

I already fixed a bunch of that in:

  baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")

but clearly this one got crossed.

Still, that leaves me puzzled over you seeing this on x86 :/

Anatoly, could you try linus+tip/locking/urgent and the below on your
Sparc, please?

---
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..a3041463e42d 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -84,7 +84,7 @@ static inline bool lockdep_enabled(void)
 	if (!debug_locks)
 		return false;

-	if (raw_cpu_read(lockdep_recursion))
+	if (this_cpu_read(lockdep_recursion))
 		return false;

 	if (current->lockdep_recursion)

^ permalink raw reply related	[flat|nested] 26+ messages in thread
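[Editor's note: the migration window Peter describes can be illustrated with a toy model. This is Python purely for illustration; the real hazard lives in arch-specific per-cpu code, and all names below are hypothetical.]

```python
# Toy model of the raw_cpu_read() hazard described above -- NOT kernel
# code. On most architectures (x86 being the exception, where the load
# is a single %gs-relative instruction) raw_cpu_read() compiles to two
# separate steps: compute this CPU's slot address, then load from it.
# A preemption + migration between the two steps makes the load target
# the OLD CPU's slot.

per_cpu_recursion = {0: 0, 1: 0}  # models per-CPU lockdep_recursion

def raw_cpu_read_model(cpu_at_addr_calc):
    # Step 1: compute the per-CPU slot address on the current CPU.
    slot = cpu_at_addr_calc
    # Step 2 (the window): the task is preempted and migrated away;
    # the old CPU picks up another task that enters lockdep and bumps
    # that CPU's recursion counter.
    per_cpu_recursion[slot] += 1
    # Step 3: the migrated task resumes on a new CPU, but the load
    # still uses the address from step 1, so it observes the OLD
    # CPU's now non-zero counter.
    return per_cpu_recursion[slot]

# lockdep_enabled() treats any non-zero value as "in recursion" and
# bails out, which is how lockdep_assert_held() can report a false
# positive for a task whose own recursion count is 0.
observed = raw_cpu_read_model(cpu_at_addr_calc=0)
print(observed)  # 1, another task's count
```

this_cpu_read() closes this window by making the address computation and the load one non-preemptible operation, which is what the patch above switches to.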
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 15:22 ` Peter Zijlstra @ 2020-10-27 9:49 ` Anatoly Pugachev 2020-10-31 11:30 ` [tip: locking/urgent] locking/lockdep: Remove more raw_cpu_read() usage tip-bot2 for Peter Zijlstra 2020-11-02 17:58 ` possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana 2 siblings, 0 replies; 26+ messages in thread From: Anatoly Pugachev @ 2020-10-27 9:49 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Filipe Manana, LKML, Jan Kara, David Sterba, Ingo Molnar On Mon, Oct 26, 2020 at 6:23 PM Peter Zijlstra <peterz@infradead.org> wrote: > On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote: > > On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote: > > > > That smells like the same issue reported here: > > > > > > > > https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net > > > > > > > > Make sure you have commit: > > > > > > > > f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") > > > > > > > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? > > > > > > Yes, CONFIG_DEBUG_PREEMPT is enabled. > > > > Bummer :/ > > > > > I'll try with that commit and let you know, however it's gonna take a > > > few hours to build a kernel and run all fstests (on that test box it > > > takes over 3 hours) to confirm that fixes the issue. > > > > *ouch*, 3 hours is painful. How long to make it sick with the current > > kernel? quicker I would hope? > > > > > Thanks for the quick reply! > > > > Anyway, I don't think that commit can actually explain the issue :/ > > > > The false positive on lockdep_assert_held() happens when the recursion > > count is !0, however we _should_ be having IRQs disabled when > > lockdep_recursion > 0, so that should never be observable. 
> > > > My hope was that DEBUG_PREEMPT would trigger on one of the > > __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would > > then be a clear violation. > > > > And you're seeing this on x86, right? > > > > Let me puzzle moar.. > > So I might have an explanation for the Sparc64 fail, but that can't > explain x86 :/ > > I initially thought raw_cpu_read() was OK, since if it is !0 we have > IRQs disabled and can't get migrated, so if we get migrated both CPUs > must have 0 and it doesn't matter which 0 we read. > > And while that is true; it isn't the whole store, on pretty much all > architectures (except x86) this can result in computing the address for > one CPU, getting migrated, the old CPU continuing execution with another > task (possibly setting recursion) and then the new CPU reading the value > of the old CPU, which is no longer 0. > > I already fixed a bunch of that in: > > baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"") > > but clearly this one got crossed. > > Still, that leaves me puzzled over you seeing this on x86 :/ > > Anatoly, could you try linus+tip/locking/urgent and the below on your > Sparc, please? Peter, let me test first. Thanks. PS: sorry for the delay, a weekend and got ill a bit ... ^ permalink raw reply [flat|nested] 26+ messages in thread
* [tip: locking/urgent] locking/lockdep: Remove more raw_cpu_read() usage
  2020-10-26 15:22 ` Peter Zijlstra
  2020-10-27  9:49 ` Anatoly Pugachev
@ 2020-10-31 11:30 ` tip-bot2 for Peter Zijlstra
  2020-11-02 17:58 ` possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana
  2 siblings, 0 replies; 26+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2020-10-31 11:30 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Peter Zijlstra (Intel), x86, LKML

The following commit has been merged into the locking/urgent branch of tip:

Commit-ID:     d48e3850030623e1c20785bceaaf78f916d0b1a3
Gitweb:        https://git.kernel.org/tip/d48e3850030623e1c20785bceaaf78f916d0b1a3
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Mon, 26 Oct 2020 16:22:56 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 30 Oct 2020 17:07:18 +01:00

locking/lockdep: Remove more raw_cpu_read() usage

I initially thought raw_cpu_read() was OK, since if it is !0 we have
IRQs disabled and can't get migrated, so if we get migrated both CPUs
must have 0 and it doesn't matter which 0 we read.

And while that is true, it isn't the whole story: on pretty much all
architectures (except x86) this can result in computing the address for
one CPU, getting migrated, the old CPU continuing execution with another
task (possibly setting recursion) and then the new CPU reading the value
of the old CPU, which is no longer 0.

Similar to:

  baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"")

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201026152256.GB2651@hirez.programming.kicks-ass.net
---
 kernel/locking/lockdep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index fc206ae..1102849 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -84,7 +84,7 @@ static inline bool lockdep_enabled(void)
 	if (!debug_locks)
 		return false;

-	if (raw_cpu_read(lockdep_recursion))
+	if (this_cpu_read(lockdep_recursion))
 		return false;

 	if (current->lockdep_recursion)

^ permalink raw reply related	[flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 15:22 ` Peter Zijlstra 2020-10-27 9:49 ` Anatoly Pugachev 2020-10-31 11:30 ` [tip: locking/urgent] locking/lockdep: Remove more raw_cpu_read() usage tip-bot2 for Peter Zijlstra @ 2020-11-02 17:58 ` Filipe Manana 2020-11-03 10:15 ` Jan Kara 2 siblings, 1 reply; 26+ messages in thread From: Filipe Manana @ 2020-11-02 17:58 UTC (permalink / raw) To: Peter Zijlstra; +Cc: LKML, Jan Kara, David Sterba, matorola, mingo On 26/10/20 15:22, Peter Zijlstra wrote: > On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote: >> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote: >>>> That smells like the same issue reported here: >>>> >>>> https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net >>>> >>>> Make sure you have commit: >>>> >>>> f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") >>>> >>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? >>> >>> Yes, CONFIG_DEBUG_PREEMPT is enabled. >> >> Bummer :/ >> >>> I'll try with that commit and let you know, however it's gonna take a >>> few hours to build a kernel and run all fstests (on that test box it >>> takes over 3 hours) to confirm that fixes the issue. >> >> *ouch*, 3 hours is painful. How long to make it sick with the current >> kernel? quicker I would hope? >> >>> Thanks for the quick reply! >> >> Anyway, I don't think that commit can actually explain the issue :/ >> >> The false positive on lockdep_assert_held() happens when the recursion >> count is !0, however we _should_ be having IRQs disabled when >> lockdep_recursion > 0, so that should never be observable. >> >> My hope was that DEBUG_PREEMPT would trigger on one of the >> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would >> then be a clear violation. >> >> And you're seeing this on x86, right? >> >> Let me puzzle moar.. 
> > So I might have an explanation for the Sparc64 fail, but that can't > explain x86 :/ > > I initially thought raw_cpu_read() was OK, since if it is !0 we have > IRQs disabled and can't get migrated, so if we get migrated both CPUs > must have 0 and it doesn't matter which 0 we read. > > And while that is true; it isn't the whole store, on pretty much all > architectures (except x86) this can result in computing the address for > one CPU, getting migrated, the old CPU continuing execution with another > task (possibly setting recursion) and then the new CPU reading the value > of the old CPU, which is no longer 0. > > I already fixed a bunch of that in: > > baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"") > > but clearly this one got crossed. > > Still, that leaves me puzzled over you seeing this on x86 :/ Hi Peter, I still get the same issue with 5.10-rc2. Is there any non-merged patch I should try, or anything I can help with? Thanks. > > Anatoly, could you try linus+tip/locking/urgent and the below on your > Sparc, please? > > --- > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 3e99dfef8408..a3041463e42d 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -84,7 +84,7 @@ static inline bool lockdep_enabled(void) > if (!debug_locks) > return false; > > - if (raw_cpu_read(lockdep_recursion)) > + if (this_cpu_read(lockdep_recursion)) > return false; > > if (current->lockdep_recursion) > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-02 17:58 ` possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana @ 2020-11-03 10:15 ` Jan Kara 2020-11-03 10:22 ` Filipe Manana 0 siblings, 1 reply; 26+ messages in thread From: Jan Kara @ 2020-11-03 10:15 UTC (permalink / raw) To: Filipe Manana Cc: Peter Zijlstra, LKML, Jan Kara, David Sterba, matorola, mingo On Mon 02-11-20 17:58:54, Filipe Manana wrote: > > > On 26/10/20 15:22, Peter Zijlstra wrote: > > On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote: > >> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote: > >>>> That smells like the same issue reported here: > >>>> > >>>> https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net > >>>> > >>>> Make sure you have commit: > >>>> > >>>> f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") > >>>> > >>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? > >>> > >>> Yes, CONFIG_DEBUG_PREEMPT is enabled. > >> > >> Bummer :/ > >> > >>> I'll try with that commit and let you know, however it's gonna take a > >>> few hours to build a kernel and run all fstests (on that test box it > >>> takes over 3 hours) to confirm that fixes the issue. > >> > >> *ouch*, 3 hours is painful. How long to make it sick with the current > >> kernel? quicker I would hope? > >> > >>> Thanks for the quick reply! > >> > >> Anyway, I don't think that commit can actually explain the issue :/ > >> > >> The false positive on lockdep_assert_held() happens when the recursion > >> count is !0, however we _should_ be having IRQs disabled when > >> lockdep_recursion > 0, so that should never be observable. > >> > >> My hope was that DEBUG_PREEMPT would trigger on one of the > >> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would > >> then be a clear violation. 
> >> > >> And you're seeing this on x86, right? > >> > >> Let me puzzle moar.. > > > > So I might have an explanation for the Sparc64 fail, but that can't > > explain x86 :/ > > > > I initially thought raw_cpu_read() was OK, since if it is !0 we have > > IRQs disabled and can't get migrated, so if we get migrated both CPUs > > must have 0 and it doesn't matter which 0 we read. > > > > And while that is true; it isn't the whole store, on pretty much all > > architectures (except x86) this can result in computing the address for > > one CPU, getting migrated, the old CPU continuing execution with another > > task (possibly setting recursion) and then the new CPU reading the value > > of the old CPU, which is no longer 0. > > > > I already fixed a bunch of that in: > > > > baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"") > > > > but clearly this one got crossed. > > > > Still, that leaves me puzzled over you seeing this on x86 :/ > > Hi Peter, > > I still get the same issue with 5.10-rc2. > Is there any non-merged patch I should try, or anything I can help with? BTW, I've just hit the same deadlock issue with ext4 on generic/390 so I confirm this isn't btrfs specific issue (as we already knew from the analysis but still it's good to have that confirmed). Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-03 10:15 ` Jan Kara @ 2020-11-03 10:22 ` Filipe Manana 0 siblings, 0 replies; 26+ messages in thread From: Filipe Manana @ 2020-11-03 10:22 UTC (permalink / raw) To: Jan Kara Cc: Peter Zijlstra, LKML, David Sterba, matorola, mingo, darrick.wong On 03/11/20 10:15, Jan Kara wrote: > On Mon 02-11-20 17:58:54, Filipe Manana wrote: >> >> >> On 26/10/20 15:22, Peter Zijlstra wrote: >>> On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote: >>>> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote: >>>>>> That smells like the same issue reported here: >>>>>> >>>>>> https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net >>>>>> >>>>>> Make sure you have commit: >>>>>> >>>>>> f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") >>>>>> >>>>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? >>>>> >>>>> Yes, CONFIG_DEBUG_PREEMPT is enabled. >>>> >>>> Bummer :/ >>>> >>>>> I'll try with that commit and let you know, however it's gonna take a >>>>> few hours to build a kernel and run all fstests (on that test box it >>>>> takes over 3 hours) to confirm that fixes the issue. >>>> >>>> *ouch*, 3 hours is painful. How long to make it sick with the current >>>> kernel? quicker I would hope? >>>> >>>>> Thanks for the quick reply! >>>> >>>> Anyway, I don't think that commit can actually explain the issue :/ >>>> >>>> The false positive on lockdep_assert_held() happens when the recursion >>>> count is !0, however we _should_ be having IRQs disabled when >>>> lockdep_recursion > 0, so that should never be observable. >>>> >>>> My hope was that DEBUG_PREEMPT would trigger on one of the >>>> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would >>>> then be a clear violation. >>>> >>>> And you're seeing this on x86, right? >>>> >>>> Let me puzzle moar.. 
>>> >>> So I might have an explanation for the Sparc64 fail, but that can't >>> explain x86 :/ >>> >>> I initially thought raw_cpu_read() was OK, since if it is !0 we have >>> IRQs disabled and can't get migrated, so if we get migrated both CPUs >>> must have 0 and it doesn't matter which 0 we read. >>> >>> And while that is true; it isn't the whole store, on pretty much all >>> architectures (except x86) this can result in computing the address for >>> one CPU, getting migrated, the old CPU continuing execution with another >>> task (possibly setting recursion) and then the new CPU reading the value >>> of the old CPU, which is no longer 0. >>> >>> I already fixed a bunch of that in: >>> >>> baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu variables"") >>> >>> but clearly this one got crossed. >>> >>> Still, that leaves me puzzled over you seeing this on x86 :/ >> >> Hi Peter, >> >> I still get the same issue with 5.10-rc2. >> Is there any non-merged patch I should try, or anything I can help with? > > BTW, I've just hit the same deadlock issue with ext4 on generic/390 so I > confirm this isn't btrfs specific issue (as we already knew from the > analysis but still it's good to have that confirmed). Indeed, yesterday Darrick was mentioning on IRC that he has run into it too with fstests on XFS (5.10-rc). > > Honza > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 12:55 ` Peter Zijlstra 2020-10-26 13:06 ` Filipe Manana 2020-10-26 15:22 ` Peter Zijlstra @ 2020-10-26 20:35 ` David Sterba 2 siblings, 0 replies; 26+ messages in thread From: David Sterba @ 2020-10-26 20:35 UTC (permalink / raw) To: Peter Zijlstra; +Cc: Filipe Manana, LKML, Jan Kara, David Sterba On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote: > On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote: > > > That smells like the same issue reported here: > > > > > > https://lkml.kernel.org/r/20201022111700.GZ2651@hirez.programming.kicks-ass.net > > > > > > Make sure you have commit: > > > > > > f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable") > > > > > > (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled? > > > > Yes, CONFIG_DEBUG_PREEMPT is enabled. > > Bummer :/ My builds don't have that enabled (CONFIG_PREEMPT_NONE=y) but I still see the warning (same scenario as for Filipe). That is with today's master branch + your fix from locking/urgent. > > I'll try with that commit and let you know, however it's gonna take a > > few hours to build a kernel and run all fstests (on that test box it > > takes over 3 hours) to confirm that fixes the issue. > > *ouch*, 3 hours is painful. How long to make it sick with the current > kernel? quicker I would hope? > > > Thanks for the quick reply! > > Anyway, I don't think that commit can actually explain the issue :/ > > The false positive on lockdep_assert_held() happens when the recursion > count is !0, however we _should_ be having IRQs disabled when > lockdep_recursion > 0, so that should never be observable. > > My hope was that DEBUG_PREEMPT would trigger on one of the > __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would > then be a clear violation. 
I can start another round (in my case it's more than 4 hours to reproduce it) with DEBUG_PREEMPT, unless you have something else to test. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-10-26 11:26 possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") Filipe Manana 2020-10-26 11:40 ` Peter Zijlstra @ 2020-11-03 14:08 ` Boqun Feng 2020-11-03 14:24 ` Filipe Manana 2020-11-03 19:44 ` Filipe Manana 1 sibling, 2 replies; 26+ messages in thread From: Boqun Feng @ 2020-11-03 14:08 UTC (permalink / raw) To: Filipe Manana; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov Hi Filipe, On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: > Hello, > > I've recently started to hit a warning followed by tasks hanging after > attempts to freeze a filesystem. A git bisection pointed to the > following commit: > > commit 4d004099a668c41522242aa146a38cc4eb59cb1e > Author: Peter Zijlstra <peterz@infradead.org> > Date: Fri Oct 2 11:04:21 2020 +0200 > > lockdep: Fix lockdep recursion > > This happens very reliably when running all xfstests with lockdep > enabled, and the tested filesystem is btrfs (haven't tried other > filesystems, but it shouldn't matter). The warning and task hangs always > happen at either test generic/068 or test generic/390, and (oddly) > always have to run all tests for it to trigger, running those tests > individually on an infinite loop doesn't seem to trigger it (at least > for a couple hours). > > The warning triggered is at fs/super.c:__sb_start_write() which always > results later in several tasks hanging on a percpu rw_sem: > > https://pastebin.com/qnLvf94E > In your dmesg, I see line: [ 9304.920151] INFO: lockdep is turned off. , that means debug_locks is 0, that usually happens when lockdep find a problem (i.e. a deadlock) and it turns itself off, because a problem is found and it's pointless for lockdep to continue to run. And I haven't found a lockdep splat in your dmesg, do you have a full dmesg so that I can have a look? 
This may be relevant because in commit 4d004099a66, we have

@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
 	unsigned long flags;
 	int ret = 0;

-	if (unlikely(current->lockdep_recursion))
+	if (unlikely(!lockdep_enabled()))
 		return 1; /* avoid false negative lockdep_assert_held() */

before this commit lock_is_held_type() and its friends may return false
if debug_locks == 0; after this commit lock_is_held_type() and its friends
will always return true if debug_locks == 0. That could cause the
behavior here.

In case I'm correct, the following "fix" may be helpful.

Regards,
Boqun

----------8<----------
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 3e99dfef8408..c0e27fb949ff 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
 	unsigned long flags;
 	int ret = 0;

-	if (unlikely(!lockdep_enabled()))
+	if (unlikely(debug_locks && !lockdep_enabled()))
 		return 1; /* avoid false negative lockdep_assert_held() */

 	raw_local_irq_save(flags);

> What happens is percpu_rwsem_is_held() is apparently returning a false
> positive, so this makes __sb_start_write() do a
> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
> is expected to always succeed, because if the calling task is holding a
> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
> the semaphore at level 2, since the freeze semaphores are always
> acquired by increasing level order.
This obviously corrupts the semaphore's read_count state, and > later causes any task trying to down_write() it to hang forever. > > After such a hang I ran a drgn script to confirm it: > > $ cat dump_freeze_sems.py > import sys > import drgn > from drgn import NULL, Object, cast, container_of, execscript, \ > reinterpret, sizeof > from drgn.helpers.linux import * > > mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1' > > mnt = None > for mnt in for_each_mount(prog, dst = mnt_path): > pass > > if mnt is None: > sys.stderr.write(f'Error: mount point {mnt_path} not found\n') > sys.exit(1) > > def dump_sem(level_enum): > level = level_enum.value_() > sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1] > print(f'freeze semaphore at level {level}, {str(level_enum)}') > print(f' block {sem.block.counter.value_()}') > for i in for_each_possible_cpu(prog): > read_count = per_cpu_ptr(sem.read_count, i) > print(f' read_count at cpu {i} = {read_count}') > print() > > # dump semaphore read counts for all freeze levels (fs.h) > dump_sem(prog['SB_FREEZE_WRITE']) > dump_sem(prog['SB_FREEZE_PAGEFAULT']) > dump_sem(prog['SB_FREEZE_FS']) > > > $ drgn dump_freeze_sems.py > freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE > block 1 > read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3 > read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293 > read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3 > read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293 > > freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT > block 1 > read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0 > read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295 > read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0 > read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0 > > freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS > block 0 > read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 
0 > read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0 > read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0 > read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0 > > At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it > sums to -1. The system remains like that for hours at least, with no > progress at all. > > Is there a known regression with that lockdep commit? > Anything I can do to help debug it in case it's not obvious? > > Thanks. ^ permalink raw reply related [flat|nested] 26+ messages in thread
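The unsigned wraparound in the drgn dump above is easy to verify: read_count is a per-cpu unsigned int, so a balanced percpu_rwsem sums to 0 modulo 2^32 across CPUs. A quick sanity check of the dumped values (plain Python, no drgn needed):

```python
U32 = 2 ** 32

def sum_read_counts(per_cpu_counts):
    """Sum per-cpu unsigned 32-bit read counts the way the kernel does:
    modulo 2^32, so a down_read on one CPU and the matching up_read on
    another CPU cancel out."""
    return sum(per_cpu_counts) % U32

# Values from the drgn dump (4294967293 == -3 and 4294967295 == -1 as u32):
level1_write     = [3, 4294967293, 3, 4294967293]   # SB_FREEZE_WRITE
level2_pagefault = [0, 4294967295, 0, 0]            # SB_FREEZE_PAGEFAULT

assert sum_read_counts(level1_write) == 0            # balanced: fine
assert sum_read_counts(level2_pagefault) == U32 - 1  # i.e. -1: one up_read
                                                     # without a down_read
```

A persistent sum of -1 at level 2 is exactly what an unmatched up_read() leaves behind, and it makes every subsequent down_write() wait forever for the readers to "drain".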
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-03 14:08 ` Boqun Feng @ 2020-11-03 14:24 ` Filipe Manana 2020-11-03 19:44 ` Filipe Manana 1 sibling, 0 replies; 26+ messages in thread From: Filipe Manana @ 2020-11-03 14:24 UTC (permalink / raw) To: Boqun Feng; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov On 03/11/20 14:08, Boqun Feng wrote: > Hi Filipe, > > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: >> Hello, >> >> I've recently started to hit a warning followed by tasks hanging after >> attempts to freeze a filesystem. A git bisection pointed to the >> following commit: >> >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e >> Author: Peter Zijlstra <peterz@infradead.org> >> Date: Fri Oct 2 11:04:21 2020 +0200 >> >> lockdep: Fix lockdep recursion >> >> This happens very reliably when running all xfstests with lockdep >> enabled, and the tested filesystem is btrfs (haven't tried other >> filesystems, but it shouldn't matter). The warning and task hangs always >> happen at either test generic/068 or test generic/390, and (oddly) >> always have to run all tests for it to trigger, running those tests >> individually on an infinite loop doesn't seem to trigger it (at least >> for a couple hours). >> >> The warning triggered is at fs/super.c:__sb_start_write() which always >> results later in several tasks hanging on a percpu rw_sem: >> >> https://pastebin.com/qnLvf94E >> > > In your dmesg, I see line: > > [ 9304.920151] INFO: lockdep is turned off. > > , that means debug_locks is 0, that usually happens when lockdep find a > problem (i.e. a deadlock) and it turns itself off, because a problem is > found and it's pointless for lockdep to continue to run. > > And I haven't found a lockdep splat in your dmesg, do you have a full > dmesg so that I can have a look? Everytime I run into the issue it produces similar results. 
It always starts with the WARN_ON() at __sb_start_write(), followed by
several "hung task" traces, and then some time after lockdep is turned off.

> This may be relevant because in commit 4d004099a66, we have
>
> @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>  	unsigned long flags;
>  	int ret = 0;
>
> -	if (unlikely(current->lockdep_recursion))
> +	if (unlikely(!lockdep_enabled()))
>  		return 1; /* avoid false negative lockdep_assert_held() */
>
> before this commit lock_is_held_type() and its friends may return false
> if debug_locks == 0, after this commit lock_is_held_type() and its friends
> will always return true if debug_locks == 0. That could cause the
> behavior here.
>
> In case I'm correct, the following "fix" may be helpful.

I'll try it and let you know. Thanks!

> Regards,
> Boqun
>
> ----------8<----------
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..c0e27fb949ff 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
>  	unsigned long flags;
>  	int ret = 0;
>
> -	if (unlikely(!lockdep_enabled()))
> +	if (unlikely(debug_locks && !lockdep_enabled()))
>  		return 1; /* avoid false negative lockdep_assert_held() */
>
>  	raw_local_irq_save(flags);
>
>
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive, so this makes __sb_start_write() do a
>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>> is expected to always succeed, because if the calling task is holding a
>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>> the semaphore at level 2, since the freeze semaphores are always
>> acquired by increasing level order.
>> >> But the try_lock fails, it triggers the warning at __sb_start_write(), >> then its caller sb_start_pagefault() ignores the return value and >> callers such as btrfs_page_mkwrite() make the assumption the freeze >> semaphore was taken, proceed to do their stuff, and later call >> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem >> at level 2 despite not having not been able to down_read() the >> semaphore. This obviously corrupts the semaphore's read_count state, and >> later causes any task trying to down_write() it to hang forever. >> >> After such a hang I ran a drgn script to confirm it: >> >> $ cat dump_freeze_sems.py >> import sys >> import drgn >> from drgn import NULL, Object, cast, container_of, execscript, \ >> reinterpret, sizeof >> from drgn.helpers.linux import * >> >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1' >> >> mnt = None >> for mnt in for_each_mount(prog, dst = mnt_path): >> pass >> >> if mnt is None: >> sys.stderr.write(f'Error: mount point {mnt_path} not found\n') >> sys.exit(1) >> >> def dump_sem(level_enum): >> level = level_enum.value_() >> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1] >> print(f'freeze semaphore at level {level}, {str(level_enum)}') >> print(f' block {sem.block.counter.value_()}') >> for i in for_each_possible_cpu(prog): >> read_count = per_cpu_ptr(sem.read_count, i) >> print(f' read_count at cpu {i} = {read_count}') >> print() >> >> # dump semaphore read counts for all freeze levels (fs.h) >> dump_sem(prog['SB_FREEZE_WRITE']) >> dump_sem(prog['SB_FREEZE_PAGEFAULT']) >> dump_sem(prog['SB_FREEZE_FS']) >> >> >> $ drgn dump_freeze_sems.py >> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE >> block 1 >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3 >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293 >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3 >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293 
>> >> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT >> block 1 >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0 >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295 >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0 >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0 >> >> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS >> block 0 >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0 >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0 >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0 >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0 >> >> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it >> sums to -1. The system remains like that for hours at least, with no >> progress at all. >> >> Is there a known regression with that lockdep commit? >> Anything I can do to help debug it in case it's not obvious? >> >> Thanks. > ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-03 14:08 ` Boqun Feng 2020-11-03 14:24 ` Filipe Manana @ 2020-11-03 19:44 ` Filipe Manana 2020-11-04 2:22 ` Boqun Feng 1 sibling, 1 reply; 26+ messages in thread From: Filipe Manana @ 2020-11-03 19:44 UTC (permalink / raw) To: Boqun Feng; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov On 03/11/20 14:08, Boqun Feng wrote: > Hi Filipe, > > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: >> Hello, >> >> I've recently started to hit a warning followed by tasks hanging after >> attempts to freeze a filesystem. A git bisection pointed to the >> following commit: >> >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e >> Author: Peter Zijlstra <peterz@infradead.org> >> Date: Fri Oct 2 11:04:21 2020 +0200 >> >> lockdep: Fix lockdep recursion >> >> This happens very reliably when running all xfstests with lockdep >> enabled, and the tested filesystem is btrfs (haven't tried other >> filesystems, but it shouldn't matter). The warning and task hangs always >> happen at either test generic/068 or test generic/390, and (oddly) >> always have to run all tests for it to trigger, running those tests >> individually on an infinite loop doesn't seem to trigger it (at least >> for a couple hours). >> >> The warning triggered is at fs/super.c:__sb_start_write() which always >> results later in several tasks hanging on a percpu rw_sem: >> >> https://pastebin.com/qnLvf94E >> > > In your dmesg, I see line: > > [ 9304.920151] INFO: lockdep is turned off. > > , that means debug_locks is 0, that usually happens when lockdep find a > problem (i.e. a deadlock) and it turns itself off, because a problem is > found and it's pointless for lockdep to continue to run. > > And I haven't found a lockdep splat in your dmesg, do you have a full > dmesg so that I can have a look? 
> > This may be relevant because in commit 4d004099a66, we have > > @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) > unsigned long flags; > int ret = 0; > > - if (unlikely(current->lockdep_recursion)) > + if (unlikely(!lockdep_enabled())) > return 1; /* avoid false negative lockdep_assert_held() */ > > before this commit lock_is_held_type() and its friends may return false > if debug_locks==0, after this commit lock_is_held_type() and its friends > will always return true if debug_locks == 0. That could cause the > behavior here. > > In case I'm correct, the following "fix" may be helpful. > > Regards, > Boqun > > ----------8 > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index 3e99dfef8408..c0e27fb949ff 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) > unsigned long flags; > int ret = 0; > > - if (unlikely(!lockdep_enabled())) > + if (unlikely(debug_locks && !lockdep_enabled())) > return 1; /* avoid false negative lockdep_assert_held() */ > > raw_local_irq_save(flags); Boqun, the patch fixes the problem for me! You can have Tested-by: Filipe Manana <fdmanana@suse.com> Thanks! > > > >> What happens is percpu_rwsem_is_held() is apparently returning a false >> positive, so this makes __sb_start_write() do a >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which >> is expected to always succeed, because if the calling task is holding a >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock >> the semaphore at level 2, since the freeze semaphores are always >> acquired by increasing level order. 
>> >> But the try_lock fails, it triggers the warning at __sb_start_write(), >> then its caller sb_start_pagefault() ignores the return value and >> callers such as btrfs_page_mkwrite() make the assumption the freeze >> semaphore was taken, proceed to do their stuff, and later call >> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem >> at level 2 despite not having not been able to down_read() the >> semaphore. This obviously corrupts the semaphore's read_count state, and >> later causes any task trying to down_write() it to hang forever. >> >> After such a hang I ran a drgn script to confirm it: >> >> $ cat dump_freeze_sems.py >> import sys >> import drgn >> from drgn import NULL, Object, cast, container_of, execscript, \ >> reinterpret, sizeof >> from drgn.helpers.linux import * >> >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1' >> >> mnt = None >> for mnt in for_each_mount(prog, dst = mnt_path): >> pass >> >> if mnt is None: >> sys.stderr.write(f'Error: mount point {mnt_path} not found\n') >> sys.exit(1) >> >> def dump_sem(level_enum): >> level = level_enum.value_() >> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1] >> print(f'freeze semaphore at level {level}, {str(level_enum)}') >> print(f' block {sem.block.counter.value_()}') >> for i in for_each_possible_cpu(prog): >> read_count = per_cpu_ptr(sem.read_count, i) >> print(f' read_count at cpu {i} = {read_count}') >> print() >> >> # dump semaphore read counts for all freeze levels (fs.h) >> dump_sem(prog['SB_FREEZE_WRITE']) >> dump_sem(prog['SB_FREEZE_PAGEFAULT']) >> dump_sem(prog['SB_FREEZE_FS']) >> >> >> $ drgn dump_freeze_sems.py >> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE >> block 1 >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3 >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293 >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3 >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293 
>> >> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT >> block 1 >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0 >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295 >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0 >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0 >> >> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS >> block 0 >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0 >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0 >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0 >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0 >> >> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it >> sums to -1. The system remains like that for hours at least, with no >> progress at all. >> >> Is there a known regression with that lockdep commit? >> Anything I can do to help debug it in case it's not obvious? >> >> Thanks. > ^ permalink raw reply [flat|nested] 26+ messages in thread
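The semantic change Boqun's patch reverts can be captured in a simplified model. The Python below is only an illustration of the C logic quoted in the thread (the real function also consults current->held_locks under IRQ-disabled sections), not kernel code:

```python
def lockdep_enabled(debug_locks, recursion):
    # Mirrors the kernel helper: lockdep is usable only while debug_locks
    # is set and we are not already inside lockdep itself.
    return bool(debug_locks) and recursion == 0

def lock_is_held_4d004099(debug_locks, recursion, really_held):
    # After 4d004099a668: once debug_locks is cleared, every "is this
    # lock held?" query answers yes, even for locks the task never took.
    if not lockdep_enabled(debug_locks, recursion):
        return True   # avoid false negative lockdep_assert_held()
    return really_held

def lock_is_held_fixed(debug_locks, recursion, really_held):
    # Boqun's fix: only short-circuit to "held" while lockdep is still on
    # but temporarily recursing; with debug_locks == 0, fall through to
    # the real (possibly stale, but usually right) held-locks check.
    if debug_locks and not lockdep_enabled(debug_locks, recursion):
        return True
    return really_held

# lockdep has turned itself off and the lower-level freeze sem is NOT
# held, so __sb_start_write() should take the blocking down_read() path:
assert lock_is_held_4d004099(0, 0, False) is True   # false positive
assert lock_is_held_fixed(0, 0, False) is False     # correct answer
```

The false positive is what flips __sb_start_write() into its trylock path during a freeze, which then fails and trips the WARN_ON().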
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-03 19:44 ` Filipe Manana @ 2020-11-04 2:22 ` Boqun Feng 2020-11-04 3:44 ` Boqun Feng 0 siblings, 1 reply; 26+ messages in thread From: Boqun Feng @ 2020-11-04 2:22 UTC (permalink / raw) To: Filipe Manana; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote: > > > On 03/11/20 14:08, Boqun Feng wrote: > > Hi Filipe, > > > > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: > >> Hello, > >> > >> I've recently started to hit a warning followed by tasks hanging after > >> attempts to freeze a filesystem. A git bisection pointed to the > >> following commit: > >> > >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e > >> Author: Peter Zijlstra <peterz@infradead.org> > >> Date: Fri Oct 2 11:04:21 2020 +0200 > >> > >> lockdep: Fix lockdep recursion > >> > >> This happens very reliably when running all xfstests with lockdep > >> enabled, and the tested filesystem is btrfs (haven't tried other > >> filesystems, but it shouldn't matter). The warning and task hangs always > >> happen at either test generic/068 or test generic/390, and (oddly) > >> always have to run all tests for it to trigger, running those tests > >> individually on an infinite loop doesn't seem to trigger it (at least > >> for a couple hours). > >> > >> The warning triggered is at fs/super.c:__sb_start_write() which always > >> results later in several tasks hanging on a percpu rw_sem: > >> > >> https://pastebin.com/qnLvf94E > >> > > > > In your dmesg, I see line: > > > > [ 9304.920151] INFO: lockdep is turned off. > > > > , that means debug_locks is 0, that usually happens when lockdep find a > > problem (i.e. a deadlock) and it turns itself off, because a problem is > > found and it's pointless for lockdep to continue to run. 
> >
> > And I haven't found a lockdep splat in your dmesg, do you have a full
> > dmesg so that I can have a look?
> >
> > This may be relevant because in commit 4d004099a66, we have
> >
> > @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> >  	unsigned long flags;
> >  	int ret = 0;
> >
> > -	if (unlikely(current->lockdep_recursion))
> > +	if (unlikely(!lockdep_enabled()))
> >  		return 1; /* avoid false negative lockdep_assert_held() */
> >
> > before this commit lock_is_held_type() and its friends may return false
> > if debug_locks == 0, after this commit lock_is_held_type() and its friends
> > will always return true if debug_locks == 0. That could cause the
> > behavior here.
> >
> > In case I'm correct, the following "fix" may be helpful.
> >
> > Regards,
> > Boqun
> >
> > ----------8<----------
> > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> > index 3e99dfef8408..c0e27fb949ff 100644
> > --- a/kernel/locking/lockdep.c
> > +++ b/kernel/locking/lockdep.c
> > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read)
> >  	unsigned long flags;
> >  	int ret = 0;
> >
> > -	if (unlikely(!lockdep_enabled()))
> > +	if (unlikely(debug_locks && !lockdep_enabled()))
> >  		return 1; /* avoid false negative lockdep_assert_held() */
> >
> >  	raw_local_irq_save(flags);
>
> Boqun, the patch fixes the problem for me!
> You can have Tested-by: Filipe Manana <fdmanana@suse.com>
>
> Thanks.

Although I think it still means that we have a lock issue when running
xfstests (because we don't know why debug_locks gets cleared), I guess
I will have to reproduce this myself for further analysis; could you
share your .config?

Anyway, I think this fix still makes sense, so I will send a proper
patch so that the problem won't block fs folks.

Regards,
Boqun

> Thanks!
> > > > > > > > >> What happens is percpu_rwsem_is_held() is apparently returning a false > >> positive, so this makes __sb_start_write() do a > >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which > >> is expected to always succeed, because if the calling task is holding a > >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock > >> the semaphore at level 2, since the freeze semaphores are always > >> acquired by increasing level order. > >> > >> But the try_lock fails, it triggers the warning at __sb_start_write(), > >> then its caller sb_start_pagefault() ignores the return value and > >> callers such as btrfs_page_mkwrite() make the assumption the freeze > >> semaphore was taken, proceed to do their stuff, and later call > >> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem > >> at level 2 despite not having not been able to down_read() the > >> semaphore. This obviously corrupts the semaphore's read_count state, and > >> later causes any task trying to down_write() it to hang forever. 
> >> > >> After such a hang I ran a drgn script to confirm it: > >> > >> $ cat dump_freeze_sems.py > >> import sys > >> import drgn > >> from drgn import NULL, Object, cast, container_of, execscript, \ > >> reinterpret, sizeof > >> from drgn.helpers.linux import * > >> > >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1' > >> > >> mnt = None > >> for mnt in for_each_mount(prog, dst = mnt_path): > >> pass > >> > >> if mnt is None: > >> sys.stderr.write(f'Error: mount point {mnt_path} not found\n') > >> sys.exit(1) > >> > >> def dump_sem(level_enum): > >> level = level_enum.value_() > >> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1] > >> print(f'freeze semaphore at level {level}, {str(level_enum)}') > >> print(f' block {sem.block.counter.value_()}') > >> for i in for_each_possible_cpu(prog): > >> read_count = per_cpu_ptr(sem.read_count, i) > >> print(f' read_count at cpu {i} = {read_count}') > >> print() > >> > >> # dump semaphore read counts for all freeze levels (fs.h) > >> dump_sem(prog['SB_FREEZE_WRITE']) > >> dump_sem(prog['SB_FREEZE_PAGEFAULT']) > >> dump_sem(prog['SB_FREEZE_FS']) > >> > >> > >> $ drgn dump_freeze_sems.py > >> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE > >> block 1 > >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3 > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293 > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3 > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293 > >> > >> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT > >> block 1 > >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0 > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295 > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0 > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0 > >> > >> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS > >> block 0 > >> read_count at cpu 0 = 
*(unsigned int *)0xffffc2ec3ee00c7c = 0 > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0 > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0 > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0 > >> > >> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it > >> sums to -1. The system remains like that for hours at least, with no > >> progress at all. > >> > >> Is there a known regression with that lockdep commit? > >> Anything I can do to help debug it in case it's not obvious? > >> > >> Thanks. > > ^ permalink raw reply [flat|nested] 26+ messages in thread
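The corruption described in the report reduces to simple per-CPU counter arithmetic. The toy model below (illustrative Python only, not the kernel implementation; the real percpu_rwsem has reader fast/slow paths and RCU-sched synchronization that are ignored here) shows why individual per-CPU read_count values such as 4294967293 (-3 as u32) are normal, why only the cross-CPU sum matters, and how a single unmatched up_read() leaves the sum at -1, the exact state the drgn dump shows for the SB_FREEZE_PAGEFAULT semaphore.

```python
# Toy model of a percpu rw_semaphore's per-CPU read_count, illustrating
# the corruption described in the report. Illustration only, not the
# kernel implementation.

class ToyPercpuRwsem:
    def __init__(self, nr_cpus):
        self.read_count = [0] * nr_cpus  # one counter per CPU

    def down_read(self, cpu):
        self.read_count[cpu] += 1

    def up_read(self, cpu):
        self.read_count[cpu] -= 1

    def readers_active(self):
        # A writer only ever looks at the *sum*: individual per-CPU
        # values may legitimately go negative when a reader locks on
        # one CPU and unlocks on another.
        return sum(self.read_count) != 0

sem = ToyPercpuRwsem(nr_cpus=4)

# Balanced lock/unlock with a CPU migration in between: the sum stays 0.
sem.down_read(cpu=0)
sem.up_read(cpu=1)
assert not sem.readers_active()

# The bug: percpu_down_read_trylock() failed (no down_read happened),
# the caller ignored the failure, and sb_end_pagefault() later did an
# up_read() anyway. The sum becomes -1 and a writer waiting for
# readers_active() to clear blocks forever.
sem.up_read(cpu=1)
print(sem.read_count, sum(sem.read_count))  # [1, -2, 0, 0] -1
```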
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-04 2:22 ` Boqun Feng @ 2020-11-04 3:44 ` Boqun Feng 2020-11-04 9:49 ` Filipe Manana 0 siblings, 1 reply; 26+ messages in thread From: Boqun Feng @ 2020-11-04 3:44 UTC (permalink / raw) To: Filipe Manana; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote: > On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote: > > > > > > On 03/11/20 14:08, Boqun Feng wrote: > > > Hi Filipe, > > > > > > On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: > > >> Hello, > > >> > > >> I've recently started to hit a warning followed by tasks hanging after > > >> attempts to freeze a filesystem. A git bisection pointed to the > > >> following commit: > > >> > > >> commit 4d004099a668c41522242aa146a38cc4eb59cb1e > > >> Author: Peter Zijlstra <peterz@infradead.org> > > >> Date: Fri Oct 2 11:04:21 2020 +0200 > > >> > > >> lockdep: Fix lockdep recursion > > >> > > >> This happens very reliably when running all xfstests with lockdep > > >> enabled, and the tested filesystem is btrfs (haven't tried other > > >> filesystems, but it shouldn't matter). The warning and task hangs always > > >> happen at either test generic/068 or test generic/390, and (oddly) > > >> always have to run all tests for it to trigger, running those tests > > >> individually on an infinite loop doesn't seem to trigger it (at least > > >> for a couple hours). > > >> > > >> The warning triggered is at fs/super.c:__sb_start_write() which always > > >> results later in several tasks hanging on a percpu rw_sem: > > >> > > >> https://pastebin.com/qnLvf94E > > >> > > > > > > In your dmesg, I see line: > > > > > > [ 9304.920151] INFO: lockdep is turned off. > > > > > > , that means debug_locks is 0, that usually happens when lockdep find a > > > problem (i.e. 
a deadlock) and it turns itself off, because a problem is > > > found and it's pointless for lockdep to continue to run. > > > > > > And I haven't found a lockdep splat in your dmesg, do you have a full > > > dmesg so that I can have a look? > > > > > > This may be relevant because in commit 4d004099a66, we have > > > > > > @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) > > > unsigned long flags; > > > int ret = 0; > > > > > > - if (unlikely(current->lockdep_recursion)) > > > + if (unlikely(!lockdep_enabled())) > > > return 1; /* avoid false negative lockdep_assert_held() */ > > > > > > before this commit lock_is_held_type() and its friends may return false > > > if debug_locks==0, after this commit lock_is_held_type() and its friends > > > will always return true if debug_locks == 0. That could cause the > > > behavior here. > > > > > > In case I'm correct, the following "fix" may be helpful. > > > > > > Regards, > > > Boqun > > > > > > ----------8 > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > > index 3e99dfef8408..c0e27fb949ff 100644 > > > --- a/kernel/locking/lockdep.c > > > +++ b/kernel/locking/lockdep.c > > > @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) > > > unsigned long flags; > > > int ret = 0; > > > > > > - if (unlikely(!lockdep_enabled())) > > > + if (unlikely(debug_locks && !lockdep_enabled())) > > > return 1; /* avoid false negative lockdep_assert_held() */ > > > > > > raw_local_irq_save(flags); > > > > Boqun, the patch fixes the problem for me! > > You can have Tested-by: Filipe Manana <fdmanana@suse.com> > > > > Thanks. 
Although I think it still means that we have a lock issue when > running xfstests (because we don't know why debug_locks gets cleared), I might find a place where we could turn lockdep off silently: in print_circular_bug(), we turn off lockdep via debug_locks_off_graph_unlock(), and then we try to save the trace for lockdep splat, however, if we use up the stack_trace buffer (i.e. nr_stack_trace_entries), save_trace() will return NULL and we return silently. Filipe, in order to check whether that happens, could you share your /proc/lockdep_stats after the full set of xfstests is finished? Alternatively, it's also helpful if you can try the following debug diff, with the full set of xfstests: Thanks! Just trying to understand the real problem. Regards, Boqun -------------->8 diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index b71ad8d9f1c9..9ae3e089e5c0 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) LOCK_TRACE_SIZE_IN_LONGS; if (max_entries <= 0) { - if (!debug_locks_off_graph_unlock()) + if (!debug_locks_off_graph_unlock()) { + WARN_ON_ONCE(1); return NULL; + } print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); dump_stack(); > I guess I will have to reproduce this myself for further analysis, could > you share your .config? > > Anyway, I think this fix still makes some sense, I will send a proper > patch so that the problem won't block fs folks. > > Regards, > Boqun > > > Thanks! 
> > > > > > > > > > > > > >> What happens is percpu_rwsem_is_held() is apparently returning a false > > >> positive, so this makes __sb_start_write() do a > > >> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which > > >> is expected to always succeed, because if the calling task is holding a > > >> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock > > >> the semaphore at level 2, since the freeze semaphores are always > > >> acquired by increasing level order. > > >> > > >> But the try_lock fails, it triggers the warning at __sb_start_write(), > > >> then its caller sb_start_pagefault() ignores the return value and > > >> callers such as btrfs_page_mkwrite() make the assumption the freeze > > >> semaphore was taken, proceed to do their stuff, and later call > > >> sb_end_pagefault(), which will do an up_read() on the percpu_rwsem > > >> at level 2 despite not having been able to down_read() the > > >> semaphore. This obviously corrupts the semaphore's read_count state, and > > >> later causes any task trying to down_write() it to hang forever. 
> > >> > > >> After such a hang I ran a drgn script to confirm it: > > >> > > >> $ cat dump_freeze_sems.py > > >> import sys > > >> import drgn > > >> from drgn import NULL, Object, cast, container_of, execscript, \ > > >> reinterpret, sizeof > > >> from drgn.helpers.linux import * > > >> > > >> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1' > > >> > > >> mnt = None > > >> for mnt in for_each_mount(prog, dst = mnt_path): > > >> pass > > >> > > >> if mnt is None: > > >> sys.stderr.write(f'Error: mount point {mnt_path} not found\n') > > >> sys.exit(1) > > >> > > >> def dump_sem(level_enum): > > >> level = level_enum.value_() > > >> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1] > > >> print(f'freeze semaphore at level {level}, {str(level_enum)}') > > >> print(f' block {sem.block.counter.value_()}') > > >> for i in for_each_possible_cpu(prog): > > >> read_count = per_cpu_ptr(sem.read_count, i) > > >> print(f' read_count at cpu {i} = {read_count}') > > >> print() > > >> > > >> # dump semaphore read counts for all freeze levels (fs.h) > > >> dump_sem(prog['SB_FREEZE_WRITE']) > > >> dump_sem(prog['SB_FREEZE_PAGEFAULT']) > > >> dump_sem(prog['SB_FREEZE_FS']) > > >> > > >> > > >> $ drgn dump_freeze_sems.py > > >> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE > > >> block 1 > > >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3 > > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293 > > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3 > > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293 > > >> > > >> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT > > >> block 1 > > >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0 > > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295 > > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0 > > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0 > > >> > > >> 
freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS > > >> block 0 > > >> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0 > > >> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0 > > >> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0 > > >> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0 > > >> > > >> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it > > >> sums to -1. The system remains like that for hours at least, with no > > >> progress at all. > > >> > > >> Is there a known regression with that lockdep commit? > > >> Anything I can do to help debug it in case it's not obvious? > > >> > > >> Thanks. > > > ^ permalink raw reply related [flat|nested] 26+ messages in thread
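Boqun's analysis of the behavior change can be condensed into a small model of the early-return logic (a sketch only: the function names mirror the kernel's but the bodies are simplified, and the `held` argument stands in for the real held-lock lookup that runs when the shortcut is not taken).

```python
# Model of the lock_is_held_type() early return before and after commit
# 4d004099a668, plus Boqun's proposed fix. Sketch only, not kernel code.

def lock_is_held_old(held, debug_locks, lockdep_recursion):
    # Before 4d004099a668: only recursion forced the "assume held"
    # shortcut, so with debug_locks == 0 the real lookup still ran and
    # could legitimately return false.
    if lockdep_recursion:
        return True
    return held

def lock_is_held_new(held, debug_locks, lockdep_recursion):
    # After 4d004099a668: lockdep_enabled() is false whenever
    # debug_locks == 0, so once lockdep turns itself off this always
    # returns true, producing the false positive seen by
    # percpu_rwsem_is_held().
    lockdep_enabled = debug_locks and not lockdep_recursion
    if not lockdep_enabled:
        return True
    return held

def lock_is_held_fixed(held, debug_locks, lockdep_recursion):
    # The fix: take the "assume held" shortcut only while lockdep is
    # still on (debug_locks != 0) but temporarily disabled (recursion).
    lockdep_enabled = debug_locks and not lockdep_recursion
    if debug_locks and not lockdep_enabled:
        return True
    return held

# The scenario from the thread: lockdep has turned itself off
# (debug_locks == 0) and the lock is NOT actually held.
print(lock_is_held_old(False, 0, 0))    # False - no false positive
print(lock_is_held_new(False, 0, 0))    # True  - false positive
print(lock_is_held_fixed(False, 0, 0))  # False - fixed
```

The recursion case still returns true under the fix, preserving the "avoid false negative lockdep_assert_held()" intent of the original shortcut.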
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-04 3:44 ` Boqun Feng @ 2020-11-04 9:49 ` Filipe Manana 2020-11-04 19:54 ` Filipe Manana 0 siblings, 1 reply; 26+ messages in thread From: Filipe Manana @ 2020-11-04 9:49 UTC (permalink / raw) To: Boqun Feng; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov On 04/11/20 03:44, Boqun Feng wrote: > On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote: >> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote: >>> >>> >>> On 03/11/20 14:08, Boqun Feng wrote: >>>> Hi Filipe, >>>> >>>> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: >>>>> Hello, >>>>> >>>>> I've recently started to hit a warning followed by tasks hanging after >>>>> attempts to freeze a filesystem. A git bisection pointed to the >>>>> following commit: >>>>> >>>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e >>>>> Author: Peter Zijlstra <peterz@infradead.org> >>>>> Date: Fri Oct 2 11:04:21 2020 +0200 >>>>> >>>>> lockdep: Fix lockdep recursion >>>>> >>>>> This happens very reliably when running all xfstests with lockdep >>>>> enabled, and the tested filesystem is btrfs (haven't tried other >>>>> filesystems, but it shouldn't matter). The warning and task hangs always >>>>> happen at either test generic/068 or test generic/390, and (oddly) >>>>> always have to run all tests for it to trigger, running those tests >>>>> individually on an infinite loop doesn't seem to trigger it (at least >>>>> for a couple hours). >>>>> >>>>> The warning triggered is at fs/super.c:__sb_start_write() which always >>>>> results later in several tasks hanging on a percpu rw_sem: >>>>> >>>>> https://pastebin.com/qnLvf94E >>>>> >>>> >>>> In your dmesg, I see line: >>>> >>>> [ 9304.920151] INFO: lockdep is turned off. >>>> >>>> , that means debug_locks is 0, that usually happens when lockdep find a >>>> problem (i.e. 
a deadlock) and it turns itself off, because a problem is >>>> found and it's pointless for lockdep to continue to run. >>>> >>>> And I haven't found a lockdep splat in your dmesg, do you have a full >>>> dmesg so that I can have a look? >>>> >>>> This may be relevant because in commit 4d004099a66, we have >>>> >>>> @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) >>>> unsigned long flags; >>>> int ret = 0; >>>> >>>> - if (unlikely(current->lockdep_recursion)) >>>> + if (unlikely(!lockdep_enabled())) >>>> return 1; /* avoid false negative lockdep_assert_held() */ >>>> >>>> before this commit lock_is_held_type() and its friends may return false >>>> if debug_locks==0, after this commit lock_is_held_type() and its friends >>>> will always return true if debug_locks == 0. That could cause the >>>> behavior here. >>>> >>>> In case I'm correct, the following "fix" may be helpful. >>>> >>>> Regards, >>>> Boqun >>>> >>>> ----------8 >>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c >>>> index 3e99dfef8408..c0e27fb949ff 100644 >>>> --- a/kernel/locking/lockdep.c >>>> +++ b/kernel/locking/lockdep.c >>>> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) >>>> unsigned long flags; >>>> int ret = 0; >>>> >>>> - if (unlikely(!lockdep_enabled())) >>>> + if (unlikely(debug_locks && !lockdep_enabled())) >>>> return 1; /* avoid false negative lockdep_assert_held() */ >>>> >>>> raw_local_irq_save(flags); >>> >>> Boqun, the patch fixes the problem for me! >>> You can have Tested-by: Filipe Manana <fdmanana@suse.com> >>> >> >> Thanks. 
Although I think it still means that we have a lock issue when >> running xfstests (because we don't know why debug_locks gets cleared), > > I might find a place where we could turn lockdep off silently: > > in print_circular_bug(), we turn off lockdep via > debug_locks_off_graph_unlock(), and then we try to save the trace for > lockdep splat, however, if we use up the stack_trace buffer (i.e. > nr_stack_trace_entries), save_trace() will return NULL and we return > silently. > > Filipe, in order to check whether that happens, could you share your > /proc/lockdep_stats after the full set of xfstests is finished? Here it is: $ cat /proc/lockdep_stats lock-classes: 1831 [max: 8192] direct dependencies: 17774 [max: 32768] indirect dependencies: 75662 all direct dependencies: 325284 dependency chains: 34223 [max: 65536] dependency chain hlocks used: 158129 [max: 327680] dependency chain hlocks lost: 0 in-hardirq chains: 57 in-softirq chains: 658 in-process chains: 33508 stack-trace entries: 160748 [max: 524288] number of stack traces: 9237 number of stack hash chains: 7076 combined max dependencies: 1280780998 hardirq-safe locks: 43 hardirq-unsafe locks: 1337 softirq-safe locks: 179 softirq-unsafe locks: 1236 irq-safe locks: 187 irq-unsafe locks: 1337 hardirq-read-safe locks: 2 hardirq-read-unsafe locks: 209 softirq-read-safe locks: 9 softirq-read-unsafe locks: 204 irq-read-safe locks: 9 irq-read-unsafe locks: 209 uncategorized locks: 274 unused locks: 0 max locking depth: 15 max bfs queue depth: 337 debug_locks: 0 zapped classes: 2278 zapped lock chains: 17915 large chain blocks: 1 (That's the result after running all fstests with the previous one line patch you sent.) My kernel .config: https://pastebin.com/4xEMvLJ9 I'll try the debugging patch and let you know the results. It will take some 4 hours or so to get back with the result. Thanks! > > Alternatively, it's also helpful if you can try the following debug > diff, with the full set of xfstests: > > Thanks! 
Just trying to understand the real problem. > > Regards, > Boqun > > -------------->8 > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index b71ad8d9f1c9..9ae3e089e5c0 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) > LOCK_TRACE_SIZE_IN_LONGS; > > if (max_entries <= 0) { > - if (!debug_locks_off_graph_unlock()) > + if (!debug_locks_off_graph_unlock()) { > + WARN_ON_ONCE(1); > return NULL; > + } > > print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); > dump_stack(); > >> I guess I will have to reproduce this myself for further analysis, could >> you share your .config? >> >> Anyway, I think this fix still makes some sense, I will send a proper >> patch so that the problem won't block fs folks. >> >> Regards, >> Boqun >> >>> Thanks! >>> >>>> >>>> >>>> >>>>> What happens is percpu_rwsem_is_held() is apparently returning a false >>>>> positive, so this makes __sb_start_write() do a >>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which >>>>> is expected to always succeed, because if the calling task is holding a >>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock >>>>> the semaphore at level 2, since the freeze semaphores are always >>>>> acquired by increasing level order. >>>>> >>>>> But the try_lock fails, it triggers the warning at __sb_start_write(), >>>>> then its caller sb_start_pagefault() ignores the return value and >>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze >>>>> semaphore was taken, proceed to do their stuff, and later call >>>>> sb_end_pagefault(), which will do an up_read() on the percpu_rwsem >>>>> at level 2 despite not having been able to down_read() the >>>>> semaphore. This obviously corrupts the semaphore's read_count state, and >>>>> later causes any task trying to down_write() it to hang forever. 
>>>>> >>>>> After such a hang I ran a drgn script to confirm it: >>>>> >>>>> $ cat dump_freeze_sems.py >>>>> import sys >>>>> import drgn >>>>> from drgn import NULL, Object, cast, container_of, execscript, \ >>>>> reinterpret, sizeof >>>>> from drgn.helpers.linux import * >>>>> >>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1' >>>>> >>>>> mnt = None >>>>> for mnt in for_each_mount(prog, dst = mnt_path): >>>>> pass >>>>> >>>>> if mnt is None: >>>>> sys.stderr.write(f'Error: mount point {mnt_path} not found\n') >>>>> sys.exit(1) >>>>> >>>>> def dump_sem(level_enum): >>>>> level = level_enum.value_() >>>>> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1] >>>>> print(f'freeze semaphore at level {level}, {str(level_enum)}') >>>>> print(f' block {sem.block.counter.value_()}') >>>>> for i in for_each_possible_cpu(prog): >>>>> read_count = per_cpu_ptr(sem.read_count, i) >>>>> print(f' read_count at cpu {i} = {read_count}') >>>>> print() >>>>> >>>>> # dump semaphore read counts for all freeze levels (fs.h) >>>>> dump_sem(prog['SB_FREEZE_WRITE']) >>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT']) >>>>> dump_sem(prog['SB_FREEZE_FS']) >>>>> >>>>> >>>>> $ drgn dump_freeze_sems.py >>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE >>>>> block 1 >>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3 >>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293 >>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3 >>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293 >>>>> >>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT >>>>> block 1 >>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0 >>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295 >>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0 >>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0 >>>>> >>>>> freeze semaphore at level 3, (enum 
<anonymous>)SB_FREEZE_FS >>>>> block 0 >>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0 >>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0 >>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0 >>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0 >>>>> >>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it >>>>> sums to -1. The system remains like that for hours at least, with no >>>>> progress at all. >>>>> >>>>> Is there a known regression with that lockdep commit? >>>>> Anything I can do to help debug it in case it's not obvious? >>>>> >>>>> Thanks. >>>> > ^ permalink raw reply [flat|nested] 26+ messages in thread
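The lockdep_stats dump above can be checked mechanically. The sketch below is a hypothetical helper (not part of any kernel tooling; the field layout is simply taken from the dump quoted in this message) that flags whether lockdep is off and whether any capped table is actually full. For this dump no table is full (e.g. stack-trace entries at 160748 of 524288), which is what makes the save_trace() exhaustion theory doubtful.

```python
# Hypothetical sanity check for a /proc/lockdep_stats dump: report the
# debug_locks value and any "value [max: cap]" field that reached its
# cap (a full table would explain a *silent* lockdep shutdown).
import re

# Abridged sample with the lines relevant to the discussion above.
SAMPLE = """\
lock-classes:                         1831 [max: 8192]
dependency chains:                   34223 [max: 65536]
stack-trace entries:                160748 [max: 524288]
debug_locks:                             0
"""

def check_lockdep_stats(text):
    exhausted = []
    debug_locks = None
    for line in text.splitlines():
        m = re.match(r'(.+?):\s+(\d+)(?:\s+\[max:\s+(\d+)\])?', line)
        if not m:
            continue
        name, val, cap = m.group(1), int(m.group(2)), m.group(3)
        if name == 'debug_locks':
            debug_locks = val
        elif cap is not None and val >= int(cap):
            exhausted.append(name)
    return debug_locks, exhausted

dl, full = check_lockdep_stats(SAMPLE)
print(dl, full)  # 0 [] -> lockdep is off, yet no capped table is full
```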
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-04 9:49 ` Filipe Manana @ 2020-11-04 19:54 ` Filipe Manana 2020-11-05 1:10 ` Boqun Feng 0 siblings, 1 reply; 26+ messages in thread From: Filipe Manana @ 2020-11-04 19:54 UTC (permalink / raw) To: Boqun Feng; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov On 04/11/20 09:49, Filipe Manana wrote: > > > On 04/11/20 03:44, Boqun Feng wrote: >> On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote: >>> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote: >>>> >>>> >>>> On 03/11/20 14:08, Boqun Feng wrote: >>>>> Hi Filipe, >>>>> >>>>> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote: >>>>>> Hello, >>>>>> >>>>>> I've recently started to hit a warning followed by tasks hanging after >>>>>> attempts to freeze a filesystem. A git bisection pointed to the >>>>>> following commit: >>>>>> >>>>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e >>>>>> Author: Peter Zijlstra <peterz@infradead.org> >>>>>> Date: Fri Oct 2 11:04:21 2020 +0200 >>>>>> >>>>>> lockdep: Fix lockdep recursion >>>>>> >>>>>> This happens very reliably when running all xfstests with lockdep >>>>>> enabled, and the tested filesystem is btrfs (haven't tried other >>>>>> filesystems, but it shouldn't matter). The warning and task hangs always >>>>>> happen at either test generic/068 or test generic/390, and (oddly) >>>>>> always have to run all tests for it to trigger, running those tests >>>>>> individually on an infinite loop doesn't seem to trigger it (at least >>>>>> for a couple hours). >>>>>> >>>>>> The warning triggered is at fs/super.c:__sb_start_write() which always >>>>>> results later in several tasks hanging on a percpu rw_sem: >>>>>> >>>>>> https://pastebin.com/qnLvf94E >>>>>> >>>>> >>>>> In your dmesg, I see line: >>>>> >>>>> [ 9304.920151] INFO: lockdep is turned off. 
>>>>> >>>>> , that means debug_locks is 0, that usually happens when lockdep find a >>>>> problem (i.e. a deadlock) and it turns itself off, because a problem is >>>>> found and it's pointless for lockdep to continue to run. >>>>> >>>>> And I haven't found a lockdep splat in your dmesg, do you have a full >>>>> dmesg so that I can have a look? >>>>> >>>>> This may be relevant because in commit 4d004099a66, we have >>>>> >>>>> @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) >>>>> unsigned long flags; >>>>> int ret = 0; >>>>> >>>>> - if (unlikely(current->lockdep_recursion)) >>>>> + if (unlikely(!lockdep_enabled())) >>>>> return 1; /* avoid false negative lockdep_assert_held() */ >>>>> >>>>> before this commit lock_is_held_type() and its friends may return false >>>>> if debug_locks==0, after this commit lock_is_held_type() and its friends >>>>> will always return true if debug_locks == 0. That could cause the >>>>> behavior here. >>>>> >>>>> In case I'm correct, the following "fix" may be helpful. >>>>> >>>>> Regards, >>>>> Boqun >>>>> >>>>> ----------8 >>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c >>>>> index 3e99dfef8408..c0e27fb949ff 100644 >>>>> --- a/kernel/locking/lockdep.c >>>>> +++ b/kernel/locking/lockdep.c >>>>> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) >>>>> unsigned long flags; >>>>> int ret = 0; >>>>> >>>>> - if (unlikely(!lockdep_enabled())) >>>>> + if (unlikely(debug_locks && !lockdep_enabled())) >>>>> return 1; /* avoid false negative lockdep_assert_held() */ >>>>> >>>>> raw_local_irq_save(flags); >>>> >>>> Boqun, the patch fixes the problem for me! >>>> You can have Tested-by: Filipe Manana <fdmanana@suse.com> >>>> >>> >>> Thanks. 
Although I think it still means that we have a lock issue when >>> running xfstests (because we don't know why debug_locks gets cleared), >> >> I might find a place where we could turn lockdep off silently: >> >> in print_circular_bug(), we turn off lockdep via >> debug_locks_off_graph_unlock(), and then we try to save the trace for >> lockdep splat, however, if we use up the stack_trace buffer (i.e. >> nr_stack_trace_entries), save_trace() will return NULL and we return >> silently. >> >> Filipe, in order to check whether that happens, could you share your >> /proc/lockdep_stats after the full set of xfstests is finished? > > Here it is: > > $ cat /proc/lockdep_stats > lock-classes: 1831 [max: 8192] > direct dependencies: 17774 [max: 32768] > indirect dependencies: 75662 > all direct dependencies: 325284 > dependency chains: 34223 [max: 65536] > dependency chain hlocks used: 158129 [max: 327680] > dependency chain hlocks lost: 0 > in-hardirq chains: 57 > in-softirq chains: 658 > in-process chains: 33508 > stack-trace entries: 160748 [max: 524288] > number of stack traces: 9237 > number of stack hash chains: 7076 > combined max dependencies: 1280780998 > hardirq-safe locks: 43 > hardirq-unsafe locks: 1337 > softirq-safe locks: 179 > softirq-unsafe locks: 1236 > irq-safe locks: 187 > irq-unsafe locks: 1337 > hardirq-read-safe locks: 2 > hardirq-read-unsafe locks: 209 > softirq-read-safe locks: 9 > softirq-read-unsafe locks: 204 > irq-read-safe locks: 9 > irq-read-unsafe locks: 209 > uncategorized locks: 274 > unused locks: 0 > max locking depth: 15 > max bfs queue depth: 337 > debug_locks: 0 > > zapped classes: 2278 > zapped lock chains: 17915 > large chain blocks: 1 > > (That's the result after running all fstests with the previous one line > patch you sent.) > > My kernel .config: https://pastebin.com/4xEMvLJ9 > > I'll try the debugging patch and let you know the results. It will take > some 4 hours or so to get back with the result. 
Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one): diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index b71ad8d9f1c9..b31d4ad482c7 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) LOCK_TRACE_SIZE_IN_LONGS; if (max_entries <= 0) { - if (!debug_locks_off_graph_unlock()) + if (!debug_locks_off_graph_unlock()) { + WARN_ON_ONCE(1); return NULL; + } print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); dump_stack(); @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct lockdep_map *lock, int read) unsigned long flags; int ret = 0; - if (unlikely(!lockdep_enabled())) + if (unlikely(debug_locks && !lockdep_enabled())) return 1; /* avoid false negative lockdep_assert_held() */ raw_local_irq_save(flags); With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered. Unexpected, right? Should I try something else? Thanks. > > Thanks! > >> >> Alternatively, it's also helpful if you can try the following debug >> diff, with the full set of xfstests: >> >> Thanks! Just trying to understand the real problem. >> >> Regards, >> Boqun >> >> -------------->8 >> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c >> index b71ad8d9f1c9..9ae3e089e5c0 100644 >> --- a/kernel/locking/lockdep.c >> +++ b/kernel/locking/lockdep.c >> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) >> LOCK_TRACE_SIZE_IN_LONGS; >> >> if (max_entries <= 0) { >> - if (!debug_locks_off_graph_unlock()) >> + if (!debug_locks_off_graph_unlock()) { >> + WARN_ON_ONCE(1); >> return NULL; >> + } >> >> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); >> dump_stack(); >> >>> I guess I will have to reproduce this myself for further analysis, could >>> you share your .config? >>> >>> Anyway, I think this fix still makes some sense, I will send a proper >>> patch so that the problem won't block fs folks. 
>>> >>> Regards, >>> Boqun >>> >>>> Thanks! >>>> >>>>> >>>>> >>>>> >>>>>> What happens is percpu_rwsem_is_held() is apparently returning a false >>>>>> positive, so this makes __sb_start_write() do a >>>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which >>>>>> is expected to always succeed, because if the calling task is holding a >>>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock >>>>>> the semaphore at level 2, since the freeze semaphores are always >>>>>> acquired by increasing level order. >>>>>> >>>>>> But the try_lock fails, it triggers the warning at __sb_start_write(), >>>>>> then its caller sb_start_pagefault() ignores the return value and >>>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze >>>>>> semaphore was taken, proceed to do their stuff, and later call >>>>>> sb_end_pagefault(), which will do an up_read() on the percpu_rwsem >>>>>> at level 2 despite not having been able to down_read() the >>>>>> semaphore. This obviously corrupts the semaphore's read_count state, and >>>>>> later causes any task trying to down_write() it to hang forever. 
>>>>>> >>>>>> After such a hang I ran a drgn script to confirm it: >>>>>> >>>>>> $ cat dump_freeze_sems.py >>>>>> import sys >>>>>> import drgn >>>>>> from drgn import NULL, Object, cast, container_of, execscript, \ >>>>>> reinterpret, sizeof >>>>>> from drgn.helpers.linux import * >>>>>> >>>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1' >>>>>> >>>>>> mnt = None >>>>>> for mnt in for_each_mount(prog, dst = mnt_path): >>>>>> pass >>>>>> >>>>>> if mnt is None: >>>>>> sys.stderr.write(f'Error: mount point {mnt_path} not found\n') >>>>>> sys.exit(1) >>>>>> >>>>>> def dump_sem(level_enum): >>>>>> level = level_enum.value_() >>>>>> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1] >>>>>> print(f'freeze semaphore at level {level}, {str(level_enum)}') >>>>>> print(f' block {sem.block.counter.value_()}') >>>>>> for i in for_each_possible_cpu(prog): >>>>>> read_count = per_cpu_ptr(sem.read_count, i) >>>>>> print(f' read_count at cpu {i} = {read_count}') >>>>>> print() >>>>>> >>>>>> # dump semaphore read counts for all freeze levels (fs.h) >>>>>> dump_sem(prog['SB_FREEZE_WRITE']) >>>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT']) >>>>>> dump_sem(prog['SB_FREEZE_FS']) >>>>>> >>>>>> >>>>>> $ drgn dump_freeze_sems.py >>>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE >>>>>> block 1 >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3 >>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293 >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3 >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293 >>>>>> >>>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT >>>>>> block 1 >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0 >>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295 >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0 >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0 >>>>>> >>>>>> 
freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS >>>>>> block 0 >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0 >>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0 >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0 >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0 >>>>>> >>>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it >>>>>> sums to -1. The system remains like that for hours at least, with no >>>>>> progress at all. >>>>>> >>>>>> Is there a known regression with that lockdep commit? >>>>>> Anything I can do to help debug it in case it's not obvious? >>>>>> >>>>>> Thanks. >>>>> >> ^ permalink raw reply related [flat|nested] 26+ messages in thread
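The failure mode the WARN_ON_ONCE() diff is probing can be illustrated in miniature (toy Python, not kernel code; in the real path save_trace() hits the MAX_STACK_TRACE_ENTRIES limit and goes through debug_locks_off_graph_unlock()): once a fixed-size trace buffer fills, lockdep turns itself off, and without the added warning nothing shows up in dmesg.

```python
# Toy sketch of the silent save_trace() shutdown discussed above: lockdep
# records stack traces into one fixed array, and on exhaustion it
# disables itself. The only user-visible symptom is debug_locks
# flipping to 0.

MAX_STACK_TRACE_ENTRIES = 8  # tiny on purpose; the real max is 524288

stack_trace = []
debug_locks = True

def save_trace(entries):
    """Append a trace, or silently disable lock debugging when full."""
    global debug_locks
    if len(stack_trace) + len(entries) > MAX_STACK_TRACE_ENTRIES:
        debug_locks = False  # the silent shutdown being hunted
        return None
    stack_trace.extend(entries)
    return entries

assert save_trace([1, 2, 3]) is not None  # 3 entries used, room left
assert save_trace([4, 5, 6]) is not None  # 6 entries used, room left
print(save_trace([7, 8, 9]), debug_locks)  # None False
```

Filipe's later result (three full xfstests runs with no WARN_ON_ONCE hit) indicates this particular path is not what cleared debug_locks in his runs.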
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-04 19:54 ` Filipe Manana @ 2020-11-05 1:10 ` Boqun Feng 2020-11-09 8:44 ` Boqun Feng 0 siblings, 1 reply; 26+ messages in thread From: Boqun Feng @ 2020-11-05 1:10 UTC (permalink / raw) To: Filipe Manana; +Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote: [...] > > Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one): > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > index b71ad8d9f1c9..b31d4ad482c7 100644 > --- a/kernel/locking/lockdep.c > +++ b/kernel/locking/lockdep.c > @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) > LOCK_TRACE_SIZE_IN_LONGS; > > if (max_entries <= 0) { > - if (!debug_locks_off_graph_unlock()) > + if (!debug_locks_off_graph_unlock()) { > + WARN_ON_ONCE(1); > return NULL; > + } > > print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); > dump_stack(); > @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct > lockdep_map *lock, int read) > unsigned long flags; > int ret = 0; > > - if (unlikely(!lockdep_enabled())) > + if (unlikely(debug_locks && !lockdep_enabled())) > return 1; /* avoid false negative lockdep_assert_held() */ > > raw_local_irq_save(flags); > > With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered. > Unexpected, right? > Kinda, that means we still don't know why lockdep was turned off. > Should I try something else? > Thanks for trying this. Let me set up the reproducer on my side, and see if I could get more information. Regards, Boqun > Thanks. > > > > > > Thanks! > > > >> > >> Alternatively, it's also helpful if you can try the following debug > >> diff, with the full set of xfstests: > >> > >> Thanks! Just trying to understand the real problem. 
> >> > >> Regards, > >> Boqun > >> > >> -------------->8 > >> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > >> index b71ad8d9f1c9..9ae3e089e5c0 100644 > >> --- a/kernel/locking/lockdep.c > >> +++ b/kernel/locking/lockdep.c > >> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) > >> LOCK_TRACE_SIZE_IN_LONGS; > >> > >> if (max_entries <= 0) { > >> - if (!debug_locks_off_graph_unlock()) > >> + if (!debug_locks_off_graph_unlock()) { > >> + WARN_ON_ONCE(1); > >> return NULL; > >> + } > >> > >> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); > >> dump_stack(); > >> > >>> I guess I will have to reproduce this myself for further analysis, could > >>> you share you .config? > >>> > >>> Anyway, I think this fix still makes a bit sense, I will send a proper > >>> patch so that the problem won't block fs folks. > >>> > >>> Regards, > >>> Boqun > >>> > >>>> Thanks! > >>>> > >>>>> > >>>>> > >>>>> > >>>>>> What happens is percpu_rwsem_is_held() is apparently returning a false > >>>>>> positive, so this makes __sb_start_write() do a > >>>>>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which > >>>>>> is expected to always succeed, because if the calling task is holding a > >>>>>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock > >>>>>> the semaphore at level 2, since the freeze semaphores are always > >>>>>> acquired by increasing level order. > >>>>>> > >>>>>> But the try_lock fails, it triggers the warning at __sb_start_write(), > >>>>>> then its caller sb_start_pagefault() ignores the return value and > >>>>>> callers such as btrfs_page_mkwrite() make the assumption the freeze > >>>>>> semaphore was taken, proceed to do their stuff, and later call > >>>>>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem > >>>>>> at level 2 despite not having not been able to down_read() the > >>>>>> semaphore. 
This obviously corrupts the semaphore's read_count state, and > >>>>>> later causes any task trying to down_write() it to hang forever. > >>>>>> > >>>>>> After such a hang I ran a drgn script to confirm it: > >>>>>> > >>>>>> $ cat dump_freeze_sems.py > >>>>>> import sys > >>>>>> import drgn > >>>>>> from drgn import NULL, Object, cast, container_of, execscript, \ > >>>>>> reinterpret, sizeof > >>>>>> from drgn.helpers.linux import * > >>>>>> > >>>>>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1' > >>>>>> > >>>>>> mnt = None > >>>>>> for mnt in for_each_mount(prog, dst = mnt_path): > >>>>>> pass > >>>>>> > >>>>>> if mnt is None: > >>>>>> sys.stderr.write(f'Error: mount point {mnt_path} not found\n') > >>>>>> sys.exit(1) > >>>>>> > >>>>>> def dump_sem(level_enum): > >>>>>> level = level_enum.value_() > >>>>>> sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1] > >>>>>> print(f'freeze semaphore at level {level}, {str(level_enum)}') > >>>>>> print(f' block {sem.block.counter.value_()}') > >>>>>> for i in for_each_possible_cpu(prog): > >>>>>> read_count = per_cpu_ptr(sem.read_count, i) > >>>>>> print(f' read_count at cpu {i} = {read_count}') > >>>>>> print() > >>>>>> > >>>>>> # dump semaphore read counts for all freeze levels (fs.h) > >>>>>> dump_sem(prog['SB_FREEZE_WRITE']) > >>>>>> dump_sem(prog['SB_FREEZE_PAGEFAULT']) > >>>>>> dump_sem(prog['SB_FREEZE_FS']) > >>>>>> > >>>>>> > >>>>>> $ drgn dump_freeze_sems.py > >>>>>> freeze semaphore at level 1, (enum <anonymous>)SB_FREEZE_WRITE > >>>>>> block 1 > >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c74 = 3 > >>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c74 = 4294967293 > >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c74 = 3 > >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c74 = 4294967293 > >>>>>> > >>>>>> freeze semaphore at level 2, (enum <anonymous>)SB_FREEZE_PAGEFAULT > >>>>>> block 1 > >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c78 = 0 > 
>>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c78 = 4294967295 > >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c78 = 0 > >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c78 = 0 > >>>>>> > >>>>>> freeze semaphore at level 3, (enum <anonymous>)SB_FREEZE_FS > >>>>>> block 0 > >>>>>> read_count at cpu 0 = *(unsigned int *)0xffffc2ec3ee00c7c = 0 > >>>>>> read_count at cpu 1 = *(unsigned int *)0xffffc2ec3f200c7c = 0 > >>>>>> read_count at cpu 2 = *(unsigned int *)0xffffc2ec3f600c7c = 0 > >>>>>> read_count at cpu 3 = *(unsigned int *)0xffffc2ec3fa00c7c = 0 > >>>>>> > >>>>>> At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it > >>>>>> sums to -1. The system remains like that for hours at least, with no > >>>>>> progress at all. > >>>>>> > >>>>>> Is there a known regression with that lockdep commit? > >>>>>> Anything I can do to help debug it in case it's not obvious? > >>>>>> > >>>>>> Thanks. > >>>>> > >> ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-05 1:10 ` Boqun Feng @ 2020-11-09 8:44 ` Boqun Feng 2020-11-09 9:57 ` Filipe Manana 0 siblings, 1 reply; 26+ messages in thread From: Boqun Feng @ 2020-11-09 8:44 UTC (permalink / raw) To: Filipe Manana Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov, Qu Wenruo Hi Filipe, On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote: > On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote: > [...] > > > > Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one): > > > > diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > > index b71ad8d9f1c9..b31d4ad482c7 100644 > > --- a/kernel/locking/lockdep.c > > +++ b/kernel/locking/lockdep.c > > @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) > > LOCK_TRACE_SIZE_IN_LONGS; > > > > if (max_entries <= 0) { > > - if (!debug_locks_off_graph_unlock()) > > + if (!debug_locks_off_graph_unlock()) { > > + WARN_ON_ONCE(1); > > return NULL; > > + } > > > > print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); > > dump_stack(); > > @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct > > lockdep_map *lock, int read) > > unsigned long flags; > > int ret = 0; > > > > - if (unlikely(!lockdep_enabled())) > > + if (unlikely(debug_locks && !lockdep_enabled())) > > return 1; /* avoid false negative lockdep_assert_held() */ > > > > raw_local_irq_save(flags); > > > > With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered. > > Unexpected, right? > > > > Kinda, that means we still don't know why lockdep was turned off. > > > Should I try something else? > > > > Thanks for trying this. Let me set up the reproducer on my side, and see > if I could get more information. 
>

I could hit this with btrfs/187, and when we hit it, lockdep will report
the deadlock and turn itself off, and I think this is the root cause of
the original problem you hit. I will add some analysis after the lockdep
splat.

[12295.973309] ============================================
[12295.974770] WARNING: possible recursive locking detected
[12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
[12295.974770] --------------------------------------------
[12295.974770] zsh/701247 is trying to acquire lock:
[12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] but task is already holding lock:
[12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] other info that might help us debug this:
[12295.974770] Possible unsafe locking scenario:
[12295.974770]  CPU0
[12295.974770]  ----
[12295.974770]  lock(&eb->lock);
[12295.974770]  lock(&eb->lock);
[12295.974770] *** DEADLOCK ***
[12295.974770] May be due to missing lock nesting notation
[12295.974770] 2 locks held by zsh/701247:
[12295.974770] #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920
[12295.974770] #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770] stack backtrace:
[12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20
[12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
[12295.974770] Call Trace:
[12295.974770]  dump_stack+0x8b/0xb0
[12295.974770]  __lock_acquire.cold+0x175/0x2e9
[12295.974770]  lock_acquire+0x15b/0x490
[12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  ? read_block_for_search+0xf4/0x350 [btrfs]
[12295.974770]  _raw_read_lock+0x40/0xa0
[12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
[12295.974770]  btrfs_search_slot+0x6ac/0xca0 [btrfs]
[12295.974770]  btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
[12295.974770]  btrfs_getxattr+0x67/0x130 [btrfs]
[12295.974770]  __vfs_getxattr+0x53/0x70
[12295.974770]  get_vfs_caps_from_disk+0x68/0x1a0
[12295.974770]  ? sched_clock_cpu+0x114/0x180
[12295.974770]  cap_bprm_creds_from_file+0x181/0x6c0
[12295.974770]  security_bprm_creds_from_file+0x2a/0x40
[12295.974770]  begin_new_exec+0xf4/0xc40
[12295.974770]  ? load_elf_phdrs+0x6b/0xb0
[12295.974770]  load_elf_binary+0x66b/0x1620
[12295.974770]  ? read_hv_sched_clock_tsc+0x5/0x20
[12295.974770]  ? sched_clock+0x5/0x10
[12295.974770]  ? sched_clock_local+0x12/0x80
[12295.974770]  ? sched_clock_cpu+0x114/0x180
[12295.974770]  bprm_execve+0x3ce/0x920
[12295.974770]  do_execveat_common+0x1b0/0x1f0
[12295.974770]  __x64_sys_execve+0x39/0x50
[12295.974770]  do_syscall_64+0x33/0x80
[12295.974770]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[12295.974770] RIP: 0033:0x7f6aaefc13cb
[12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48
[12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b
[12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb
[12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280
[12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001
[12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280

The deadlock scenario reported by this splat is as follows:

	CPU 0					CPU 2
	=====					=====
	btrfs_search_slot():
	  btrfs_tree_read_lock_atomic():
	    read_lock(&eb->lock);
						<a random writer>
						write_lock(&eb->lock);
						// waiting for the lock
	  ...
	  btrfs_tree_read_lock_atomic():
	    read_lock(&eb->lock); // blocked because of the fairness.

In short, you cannot use nested read_lock() on the same lock. However,
I'm not sure whether we have a real problem here, because I don't think
btrfs_search_slot() can pick the same extent buffer twice. Also copying
the author of the code for more information.

Regards,
Boqun

> Regards,
> Boqun
>
> > Thanks.
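The scenario above can be sketched with a toy model of a fair (queued) rwlock: once a writer is queued, every new read acquisition waits behind it, including a second read_lock() by a task that already holds the lock. This is an illustration only; the class and method names are invented, not kernel API.

```python
# Toy model of a fair rwlock. A queued writer blocks all *new* read
# acquisitions, which is what makes nested read_lock() on the same lock
# a deadlock: the first read holder blocks its own second acquisition.

class FairRWLock:
    def __init__(self):
        self.readers = 0             # tasks currently holding a read lock
        self.writer_waiting = False  # a writer is queued behind the readers

    def try_read_lock(self):
        if self.writer_waiting:      # fairness: queue behind the writer
            return False             # a real caller would block here
        self.readers += 1
        return True

    def queue_write_lock(self):
        self.writer_waiting = True   # writer waits for readers to drain

eb_lock = FairRWLock()
assert eb_lock.try_read_lock()       # CPU 0: first read_lock() succeeds
eb_lock.queue_write_lock()           # CPU 2: write_lock() queues, waits for CPU 0
assert not eb_lock.try_read_lock()   # CPU 0: nested read_lock() would block
# CPU 0 now waits on the writer and the writer waits on CPU 0: deadlock.
```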
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-09 8:44 ` Boqun Feng @ 2020-11-09 9:57 ` Filipe Manana 2020-11-10 1:41 ` Boqun Feng 0 siblings, 1 reply; 26+ messages in thread From: Filipe Manana @ 2020-11-09 9:57 UTC (permalink / raw) To: Boqun Feng Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov, Qu Wenruo, darrick.wong On 09/11/20 08:44, Boqun Feng wrote: > Hi Filipe, > > On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote: >> On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote: >> [...] >>> >>> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one): >>> >>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c >>> index b71ad8d9f1c9..b31d4ad482c7 100644 >>> --- a/kernel/locking/lockdep.c >>> +++ b/kernel/locking/lockdep.c >>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) >>> LOCK_TRACE_SIZE_IN_LONGS; >>> >>> if (max_entries <= 0) { >>> - if (!debug_locks_off_graph_unlock()) >>> + if (!debug_locks_off_graph_unlock()) { >>> + WARN_ON_ONCE(1); >>> return NULL; >>> + } >>> >>> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); >>> dump_stack(); >>> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct >>> lockdep_map *lock, int read) >>> unsigned long flags; >>> int ret = 0; >>> >>> - if (unlikely(!lockdep_enabled())) >>> + if (unlikely(debug_locks && !lockdep_enabled())) >>> return 1; /* avoid false negative lockdep_assert_held() */ >>> >>> raw_local_irq_save(flags); >>> >>> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered. >>> Unexpected, right? >>> >> >> Kinda, that means we still don't know why lockdep was turned off. >> >>> Should I try something else? >>> >> >> Thanks for trying this. Let me set up the reproducer on my side, and see >> if I could get more information. 
>> > > I could hit this with btrfs/187, and when we hit it, lockdep will report > the deadlock and turn off, and I think this is the root cause for your > hitting the original problem, I will add some analysis after the lockdep > splat. > > [12295.973309] ============================================ > [12295.974770] WARNING: possible recursive locking detected > [12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted > [12295.974770] -------------------------------------------- > [12295.974770] zsh/701247 is trying to acquire lock: > [12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > [12295.974770] > but task is already holding lock: > [12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > [12295.974770] > other info that might help us debug this: > [12295.974770] Possible unsafe locking scenario: > > [12295.974770] CPU0 > [12295.974770] ---- > [12295.974770] lock(&eb->lock); > [12295.974770] lock(&eb->lock); > [12295.974770] > *** DEADLOCK *** > > [12295.974770] May be due to missing lock nesting notation > > [12295.974770] 2 locks held by zsh/701247: > [12295.974770] #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920 > [12295.974770] #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > [12295.974770] > stack backtrace: > [12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20 > [12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019 > [12295.974770] Call Trace: > [12295.974770] dump_stack+0x8b/0xb0 > [12295.974770] __lock_acquire.cold+0x175/0x2e9 > [12295.974770] lock_acquire+0x15b/0x490 > [12295.974770] ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > [12295.974770] ? read_block_for_search+0xf4/0x350 [btrfs] > [12295.974770] _raw_read_lock+0x40/0xa0 > [12295.974770] ? 
btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > [12295.974770] btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > [12295.974770] btrfs_search_slot+0x6ac/0xca0 [btrfs] > [12295.974770] btrfs_lookup_xattr+0x7d/0xd0 [btrfs] > [12295.974770] btrfs_getxattr+0x67/0x130 [btrfs] > [12295.974770] __vfs_getxattr+0x53/0x70 > [12295.974770] get_vfs_caps_from_disk+0x68/0x1a0 > [12295.974770] ? sched_clock_cpu+0x114/0x180 > [12295.974770] cap_bprm_creds_from_file+0x181/0x6c0 > [12295.974770] security_bprm_creds_from_file+0x2a/0x40 > [12295.974770] begin_new_exec+0xf4/0xc40 > [12295.974770] ? load_elf_phdrs+0x6b/0xb0 > [12295.974770] load_elf_binary+0x66b/0x1620 > [12295.974770] ? read_hv_sched_clock_tsc+0x5/0x20 > [12295.974770] ? sched_clock+0x5/0x10 > [12295.974770] ? sched_clock_local+0x12/0x80 > [12295.974770] ? sched_clock_cpu+0x114/0x180 > [12295.974770] bprm_execve+0x3ce/0x920 > [12295.974770] do_execveat_common+0x1b0/0x1f0 > [12295.974770] __x64_sys_execve+0x39/0x50 > [12295.974770] do_syscall_64+0x33/0x80 > [12295.974770] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > [12295.974770] RIP: 0033:0x7f6aaefc13cb > [12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48 > [12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b > [12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb > [12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280 > [12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001 > [12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280 > > The deadlock senario reported by this splat is as follow: > > CPU 0 CPU 2 > ===== ===== > btrfs_search_slot(): > btrfs_tree_read_lock_atomic(): > read_lock(&eb->lock); <a random writer> > write_lock(&eb->lock); > // waiting for the lock > ... 
> 			btrfs_tree_read_lock_atomic():
> 			  read_lock(&eb->lock); // blocked because the fairness.
>
> In short, you can not use nested read_lock() on the same lock. However,
> I'm not sure whether we have the real problem here, because I don't
> think btrfs_search_slot() can pick the same extent buffer twice in
> btrfs_search_slot(). Also copy the author of the code for more
> information.

Ah yes. We have several lockdep splats for which the fixes are not in
5.10-rcs yet. Some may already be in the integration branch [1] and
others not yet in any tree, but can be found in the btrfs mailing list.

For that particular one, it's fixed by the following patch:

https://lore.kernel.org/linux-btrfs/1cee2922a32c305056a9559ccf7aede49777beae.1604591048.git.josef@toxicpanda.com/

(It belongs to a series)

You'll likely see more different lockdep splats; this is due to a
transition from custom btree locks to rw semaphores that is in progress,
and some preparatory work for that is already in 5.10-rcs.

Btw, I could hit the fs freeze deadlock even when lockdep didn't produce
any previous splat.

Jan was hitting it too with ext4, and Darrick with xfs - though I can't
tell if they hit any lockdep splat before hitting the freeze deadlock.

Thanks for reporting it.

[1] https://github.com/kdave/btrfs-devel/commits/misc-next

>
> Regards,
> Boqun
>
>> Regards,
>> Boqun
>>
>>> Thanks.
* Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion") 2020-11-09 9:57 ` Filipe Manana @ 2020-11-10 1:41 ` Boqun Feng 0 siblings, 0 replies; 26+ messages in thread From: Boqun Feng @ 2020-11-10 1:41 UTC (permalink / raw) To: Filipe Manana Cc: LKML, peterz, Jan Kara, David Sterba, Nikolay Borisov, Qu Wenruo, darrick.wong On Mon, Nov 09, 2020 at 09:57:05AM +0000, Filipe Manana wrote: > > > On 09/11/20 08:44, Boqun Feng wrote: > > Hi Filipe, > > > > On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote: > >> On Wed, Nov 04, 2020 at 07:54:40PM +0000, Filipe Manana wrote: > >> [...] > >>> > >>> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one): > >>> > >>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c > >>> index b71ad8d9f1c9..b31d4ad482c7 100644 > >>> --- a/kernel/locking/lockdep.c > >>> +++ b/kernel/locking/lockdep.c > >>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void) > >>> LOCK_TRACE_SIZE_IN_LONGS; > >>> > >>> if (max_entries <= 0) { > >>> - if (!debug_locks_off_graph_unlock()) > >>> + if (!debug_locks_off_graph_unlock()) { > >>> + WARN_ON_ONCE(1); > >>> return NULL; > >>> + } > >>> > >>> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!"); > >>> dump_stack(); > >>> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct > >>> lockdep_map *lock, int read) > >>> unsigned long flags; > >>> int ret = 0; > >>> > >>> - if (unlikely(!lockdep_enabled())) > >>> + if (unlikely(debug_locks && !lockdep_enabled())) > >>> return 1; /* avoid false negative lockdep_assert_held() */ > >>> > >>> raw_local_irq_save(flags); > >>> > >>> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered. > >>> Unexpected, right? > >>> > >> > >> Kinda, that means we still don't know why lockdep was turned off. > >> > >>> Should I try something else? > >>> > >> > >> Thanks for trying this. 
Let me set up the reproducer on my side, and see > >> if I could get more information. > >> > > > > I could hit this with btrfs/187, and when we hit it, lockdep will report > > the deadlock and turn off, and I think this is the root cause for your > > hitting the original problem, I will add some analysis after the lockdep > > splat. > > > > [12295.973309] ============================================ > > [12295.974770] WARNING: possible recursive locking detected > > [12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted > > [12295.974770] -------------------------------------------- > > [12295.974770] zsh/701247 is trying to acquire lock: > > [12295.974770] ffff92cef43480b8 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > > [12295.974770] > > but task is already holding lock: > > [12295.974770] ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > > [12295.974770] > > other info that might help us debug this: > > [12295.974770] Possible unsafe locking scenario: > > > > [12295.974770] CPU0 > > [12295.974770] ---- > > [12295.974770] lock(&eb->lock); > > [12295.974770] lock(&eb->lock); > > [12295.974770] > > *** DEADLOCK *** > > > > [12295.974770] May be due to missing lock nesting notation > > > > [12295.974770] 2 locks held by zsh/701247: > > [12295.974770] #0: ffff92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+0xaa/0x920 > > [12295.974770] #1: ffff92cef434a038 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > > [12295.974770] > > stack backtrace: > > [12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 5.10.0-rc2-btrfs-next-71 #20 > > [12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019 > > [12295.974770] Call Trace: > > [12295.974770] dump_stack+0x8b/0xb0 > > [12295.974770] __lock_acquire.cold+0x175/0x2e9 > > [12295.974770] lock_acquire+0x15b/0x490 > > [12295.974770] ? 
btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > > [12295.974770] ? read_block_for_search+0xf4/0x350 [btrfs] > > [12295.974770] _raw_read_lock+0x40/0xa0 > > [12295.974770] ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > > [12295.974770] btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs] > > [12295.974770] btrfs_search_slot+0x6ac/0xca0 [btrfs] > > [12295.974770] btrfs_lookup_xattr+0x7d/0xd0 [btrfs] > > [12295.974770] btrfs_getxattr+0x67/0x130 [btrfs] > > [12295.974770] __vfs_getxattr+0x53/0x70 > > [12295.974770] get_vfs_caps_from_disk+0x68/0x1a0 > > [12295.974770] ? sched_clock_cpu+0x114/0x180 > > [12295.974770] cap_bprm_creds_from_file+0x181/0x6c0 > > [12295.974770] security_bprm_creds_from_file+0x2a/0x40 > > [12295.974770] begin_new_exec+0xf4/0xc40 > > [12295.974770] ? load_elf_phdrs+0x6b/0xb0 > > [12295.974770] load_elf_binary+0x66b/0x1620 > > [12295.974770] ? read_hv_sched_clock_tsc+0x5/0x20 > > [12295.974770] ? sched_clock+0x5/0x10 > > [12295.974770] ? sched_clock_local+0x12/0x80 > > [12295.974770] ? 
sched_clock_cpu+0x114/0x180 > > [12295.974770] bprm_execve+0x3ce/0x920 > > [12295.974770] do_execveat_common+0x1b0/0x1f0 > > [12295.974770] __x64_sys_execve+0x39/0x50 > > [12295.974770] do_syscall_64+0x33/0x80 > > [12295.974770] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [12295.974770] RIP: 0033:0x7f6aaefc13cb > > [12295.974770] Code: 48 3d 00 f0 ff ff 76 e7 f7 d8 64 41 89 00 eb df 0f 1f 80 00 00 00 00 f7 d8 64 41 89 00 eb dc f3 0f 1e fa b8 3b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 75 4a 0f 00 f7 d8 64 89 01 48 > > [12295.974770] RSP: 002b:00007ffd33b54d58 EFLAGS: 00000207 ORIG_RAX: 000000000000003b > > [12295.974770] RAX: ffffffffffffffda RBX: 00007f6aaf28bc88 RCX: 00007f6aaefc13cb > > [12295.974770] RDX: 00007ffd33b5fd98 RSI: 00007f6aaf28bc88 RDI: 00007ffd33b55280 > > [12295.974770] RBP: 00007ffd33b54d80 R08: 00007ffd33b54ce0 R09: 0000000000000001 > > [12295.974770] R10: 0000000000000008 R11: 0000000000000207 R12: 00007ffd33b55280 > > > > The deadlock senario reported by this splat is as follow: > > > > CPU 0 CPU 2 > > ===== ===== > > btrfs_search_slot(): > > btrfs_tree_read_lock_atomic(): > > read_lock(&eb->lock); <a random writer> > > write_lock(&eb->lock); > > // waiting for the lock > > ... > > btrfs_tree_read_lock_atomic(): > > read_lock(&eb->lock); // blocked because the fairness. > > > > In short, you can not use nested read_lock() on the same lock. However, > > I'm not sure whether we have the real problem here, because I don't > > think btrfs_search_slot() can pick the same extent buffer twice in > > btrfs_search_slot(). Also copy the author of the code for more > > information. > > Ah yes. We have several lockdep splats for which the fixes are not in > 5.10-rcs yet. Some may be already in the integration branch [1] and > others not yet in any tree, but can be found in the btrfs mailing list. 
>
> For that particular one, it's fixed by the following patch:
>
> https://lore.kernel.org/linux-btrfs/1cee2922a32c305056a9559ccf7aede49777beae.1604591048.git.josef@toxicpanda.com/
>
> (It belongs to a series)
>
> You'll likely see more different lockdep splats, this is due to a
> transition from custom btree locks to rw semaphores that is in progress,
> and some preparatory work for that is already in 5.10-rcs.
>

Thanks for the information!

> Btw, I could hit the fs freeze deadlock even when lockdep didn't produce
> any previous splat.
>

And this is the weird part, which means lockdep got turned off silently.
That probably means there is a bug (either in fs code or lockdep) that we
don't know about, and my previous "fix", although it works as per your
testing, may hide the real problem... So I just sent out another fix,
which is similar to my previous one but leaves a warning, so that once we
hit the problem we can be warned and not miss the silent lockdep
turn-off.

Regards,
Boqun

> Jan was hitting it too with ext4, and Darrick with xfs - though I can't
> tell if they hit any lockdep splat before hitting the freeze deadlock.
>
> Thanks for reporting it.
>
> [1] https://github.com/kdave/btrfs-devel/commits/misc-next
>
> >
> > Regards,
> > Boqun
> >
> >> Regards,
> >> Boqun
> >>
> >>> Thanks.
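The effect of the one-line change quoted earlier in this thread can be spelled out as a truth table over the two flags involved. This is a model of just the early-return condition from that diff, where "early return" means "report the lock as held without checking", to avoid false-negative lockdep_assert_held(); the helper names here are invented for illustration.

```python
# Model of the early-return condition in lock_is_held_type(), before and
# after the quoted change. lockdep_enabled() is false either because
# debug_locks was turned off globally, or because the current task has
# lockdep disabled locally for recursion protection.

def early_return_before(debug_locks, lockdep_enabled):
    # Old: assume "held" whenever lockdep is not running, including after
    # debug_locks was turned off globally (the false-positive path that
    # made percpu_rwsem_is_held() lie to __sb_start_write()).
    return not lockdep_enabled

def early_return_after(debug_locks, lockdep_enabled):
    # New: only assume "held" while debug_locks is still on, i.e. while
    # lockdep is merely disabled locally for recursion protection.
    return bool(debug_locks and not lockdep_enabled)

# lockdep turned itself off (debug_locks False, lockdep_enabled False):
assert early_return_before(False, False)      # old code: claims "held"
assert not early_return_after(False, False)   # new code: falls through
# lockdep recursion on this task (debug_locks True, lockdep_enabled False):
assert early_return_before(True, False) and early_return_after(True, False)
```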