netdev.vger.kernel.org archive mirror
* Re: BUG: MAX_LOCKDEP_KEYS too low!
       [not found] <c099ad52-0c2c-b886-bae2-c64bd8626452@ozlabs.ru>
@ 2021-01-22  9:16 ` Dmitry Vyukov
       [not found]   ` <6af41136-4344-73da-f821-e831674be473@i-love.sakura.ne.jp>
  0 siblings, 1 reply; 4+ messages in thread
From: Dmitry Vyukov @ 2021-01-22  9:16 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: LKML, Peter Zijlstra, Ingo Molnar, Will Deacon, syzkaller,
	Tetsuo Handa, netdev

On Fri, Jan 22, 2021 at 4:43 AM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> Hi!
>
> Syzkaller found this bug and it has a repro (below). I googled and found a
> similar bug that was fixed in 2019, so this one seems new.
>
> The repro takes about half a minute to produce the message; "grep
> lock-classes /proc/lockdep_stats" then reports 8177 of 8192, whereas before
> running the repro it reports 702. This is a POWER8 box.
>
> The offender is htab->lockdep_key. If I run the repro at a slow rate, no
> problem appears; traces show lockdep_unregister_key() is called and the
> leak is quite slow.
>
> Is this something known? Any hints on how to debug this further? I'd give
> it a try since I have an easy reproducer. Thanks,

+netdev, as this also discusses net namespaces

Hi Alexey,

The reproducer only does 2 bpf syscalls, so something is slowly leaking in bpf.
My first suspect would be one of these. Since workqueues are async, they
may drain so slowly that the backlog builds up only when tasks are spawned
fast. I don't know if there is a procfs/debugfs introspection file for
monitoring workqueue lengths to verify this hypothesis.

$ grep INIT_WORK kernel/bpf/*.c
kernel/bpf/arraymap.c: INIT_WORK(&aux->work, prog_array_map_clear_deferred);
kernel/bpf/cgroup.c: INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
kernel/bpf/core.c: INIT_WORK(&aux->work, bpf_prog_free_deferred);
kernel/bpf/cpumap.c: INIT_WORK(&old_rcpu->kthread_stop_wq,
cpu_map_kthread_stop);
kernel/bpf/syscall.c: INIT_WORK(&map->work, bpf_map_free_deferred);
kernel/bpf/syscall.c: INIT_WORK(&link->work, bpf_link_put_deferred);

However, if it's indeed one of the workqueues, I am not sure how it
should be fixed.
We have a similar (even worse) problem with the async destruction of
network namespaces. Those are way slower and take lots of mutexes. I
suspect that many of the hangs on net mutexes on the syzbot dashboard
are related to that.

Unbounded async queueing is never a good idea. The classical solution
is to make the queue bounded and put back pressure on producers. In
this case that would limit the speed at which new processes are created
and make resource consumption (including the number of lockdep entries)
bounded.
The restriction probably needs to be per-callback type, at least for
global workqueues.
However, I suspect that lots of callers of schedule_work() can't block
(presumably the reason for moving the work to the background in the
first place), so the back pressure may need to be applied at a different
point, which makes things a bit more complicated.


> root@le-dbg:~# egrep "BD.*htab->lockdep_key" /proc/lockdep | wc -l
> 7449
> root@le-dbg:~# egrep "BD.*htab->lockdep_key" /proc/lockdep | tail -n 3
> (____ptrval____) FD:    1 BD:    1 ....: &htab->lockdep_key#9531
> (____ptrval____) FD:    1 BD:    1 ....: &htab->lockdep_key#9532
> (____ptrval____) FD:    1 BD:    1 ....: &htab->lockdep_key#9533
>
>
> // autogenerated by syzkaller (https://github.com/google/syzkaller)
>
> #define __unix__ 1
> #define __gnu_linux__ 1
> #define __linux__ 1
>
> #define _GNU_SOURCE
>
> #include <dirent.h>
> #include <endian.h>
> #include <errno.h>
> #include <fcntl.h>
> #include <signal.h>
> #include <stdarg.h>
> #include <stdbool.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/prctl.h>
> #include <sys/stat.h>
> #include <sys/syscall.h>
> #include <sys/types.h>
> #include <sys/wait.h>
> #include <time.h>
> #include <unistd.h>
>
> static unsigned long long procid;
>
> static void sleep_ms(uint64_t ms)
> {
>         usleep(ms * 1000);
> }
>
> static uint64_t current_time_ms(void)
> {
>         struct timespec ts;
>         if (clock_gettime(CLOCK_MONOTONIC, &ts))
>                 exit(1);
>         return (uint64_t)ts.tv_sec * 1000 + (uint64_t)ts.tv_nsec / 1000000;
> }
>
> static bool write_file(const char* file, const char* what, ...)
> {
>         char buf[1024];
>         va_list args;
>         va_start(args, what);
>         vsnprintf(buf, sizeof(buf), what, args);
>         va_end(args);
>         buf[sizeof(buf) - 1] = 0;
>         int len = strlen(buf);
>         int fd = open(file, O_WRONLY | O_CLOEXEC);
>         if (fd == -1)
>                 return false;
>         if (write(fd, buf, len) != len) {
>                 int err = errno;
>                 close(fd);
>                 errno = err;
>                 return false;
>         }
>         close(fd);
>         return true;
> }
>
> static void kill_and_wait(int pid, int* status)
> {
>         kill(-pid, SIGKILL);
>         kill(pid, SIGKILL);
>         for (int i = 0; i < 100; i++) {
>                 if (waitpid(-1, status, WNOHANG | __WALL) == pid)
>                         return;
>                 usleep(1000);
>         }
>         DIR* dir = opendir("/sys/fs/fuse/connections");
>         if (dir) {
>                 for (;;) {
>                         struct dirent* ent = readdir(dir);
>                         if (!ent)
>                                 break;
>                         if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
>                                 continue;
>                         char abort[300];
>                         snprintf(abort, sizeof(abort), "/sys/fs/fuse/connections/%s/abort",
> ent->d_name);
>                         int fd = open(abort, O_WRONLY);
>                         if (fd == -1) {
>                                 continue;
>                         }
>                         if (write(fd, abort, 1) < 0) {
>                         }
>                         close(fd);
>                 }
>                 closedir(dir);
>         } else {
>         }
>         while (waitpid(-1, status, __WALL) != pid) {
>         }
> }
>
> static void setup_test()
> {
>         prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0);
>         setpgrp();
>         write_file("/proc/self/oom_score_adj", "1000");
> }
>
> static void execute_one(void);
>
> #define WAIT_FLAGS __WALL
>
> static void loop(void)
> {
>         int iter = 0;
>         for (;; iter++) {
>                 int pid = fork();
>                 if (pid < 0)
>                         exit(1);
>                 if (pid == 0) {
>                         setup_test();
>                         execute_one();
>                         exit(0);
>                 }
>                 int status = 0;
>                 uint64_t start = current_time_ms();
>                 for (;;) {
>                         if (waitpid(-1, &status, WNOHANG | WAIT_FLAGS) == pid)
>                                 break;
>                         sleep_ms(1);
>                         if (current_time_ms() - start < 5000) {
>                                 continue;
>                         }
>                         kill_and_wait(pid, &status);
>                         break;
>                 }
>         }
> }
>
> #ifndef __NR_bpf
> #define __NR_bpf 361
> #endif
> #ifndef __NR_mmap
> #define __NR_mmap 90
> #endif
>
> uint64_t r[1] = {0xffffffffffffffff};
>
> void execute_one(void)
> {
>                 intptr_t res = 0;
> *(uint32_t*)0x20000280 = 9;
> *(uint32_t*)0x20000284 = 1;
> *(uint32_t*)0x20000288 = 6;
> *(uint32_t*)0x2000028c = 5;
> *(uint32_t*)0x20000290 = 0;
> *(uint32_t*)0x20000294 = -1;
> *(uint32_t*)0x20000298 = 0;
> *(uint8_t*)0x2000029c = 0;
> *(uint8_t*)0x2000029d = 0;
> *(uint8_t*)0x2000029e = 0;
> *(uint8_t*)0x2000029f = 0;
> *(uint8_t*)0x200002a0 = 0;
> *(uint8_t*)0x200002a1 = 0;
> *(uint8_t*)0x200002a2 = 0;
> *(uint8_t*)0x200002a3 = 0;
> *(uint8_t*)0x200002a4 = 0;
> *(uint8_t*)0x200002a5 = 0;
> *(uint8_t*)0x200002a6 = 0;
> *(uint8_t*)0x200002a7 = 0;
> *(uint8_t*)0x200002a8 = 0;
> *(uint8_t*)0x200002a9 = 0;
> *(uint8_t*)0x200002aa = 0;
> *(uint8_t*)0x200002ab = 0;
> *(uint32_t*)0x200002ac = 0;
> *(uint32_t*)0x200002b0 = -1;
> *(uint32_t*)0x200002b4 = 0;
> *(uint32_t*)0x200002b8 = 0;
> *(uint32_t*)0x200002bc = 0;
>         res = syscall(__NR_bpf, 0ul, 0x20000280ul, 0x40ul);
>         if (res != -1)
>                 r[0] = res;
> *(uint64_t*)0x20000100 = 0;
> *(uint64_t*)0x20000108 = 0;
> *(uint64_t*)0x20000110 = 0x200002c0;
> *(uint64_t*)0x20000118 = 0x20000000;
> *(uint32_t*)0x20000120 = 0x1000;
> *(uint32_t*)0x20000124 = r[0];
> *(uint64_t*)0x20000128 = 0;
> *(uint64_t*)0x20000130 = 0;
>         syscall(__NR_bpf, 0x1aul, 0x20000100ul, 0x38ul);
>
> }
> int main(void)
> {
>         syscall(__NR_mmap, 0x1fff0000ul, 0x10000ul, 0ul, 0x32ul, -1, 0ul);
>         syscall(__NR_mmap, 0x20000000ul, 0x1000000ul, 7ul, 0x32ul, -1, 0ul);
>         syscall(__NR_mmap, 0x21000000ul, 0x10000ul, 0ul, 0x32ul, -1, 0ul);
>         for (procid = 0; procid < 16; procid++) {
>                 if (fork() == 0) {
>                         loop();
>                 }
>         }
>         sleep(1000000);
>         return 0;
> }
>
>
>
>
> --
> Alexey


* BPF: unbounded bpf_map_free_deferred problem
       [not found]             ` <e4767b84-05a4-07c0-811b-b3a08cad2f43@ozlabs.ru>
@ 2021-01-23  0:39               ` Tetsuo Handa
  2021-01-23  3:27                 ` Cong Wang
  0 siblings, 1 reply; 4+ messages in thread
From: Tetsuo Handa @ 2021-01-23  0:39 UTC (permalink / raw)
  To: netdev, bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Linux Kernel Mailing List, Tejun Heo, Alexey Kardashevskiy,
	Dmitry Vyukov

Hello, BPF developers.

Alexey Kardashevskiy is reporting that system_wq gets stuck due to flooding of
unbounded bpf_map_free_deferred works. Use of a WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND
workqueue did not solve the problem. Is it possible that a refcount leak somewhere
is preventing bpf_map_free_deferred from completing? Please see
https://lkml.kernel.org/r/CACT4Y+Z+kwPM=WUzJ-e359PWeLLqmF0w4Yxp1spzZ=+J0ekrag@mail.gmail.com .

On 2021/01/23 7:53, Alexey Kardashevskiy wrote:
> 
> 
> On 23/01/2021 02:30, Tetsuo Handa wrote:
>> On 2021/01/22 22:28, Tetsuo Handa wrote:
>>> On 2021/01/22 21:10, Dmitry Vyukov wrote:
>>>> On Fri, Jan 22, 2021 at 1:03 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 22/01/2021 21:30, Tetsuo Handa wrote:
>>>>>> On 2021/01/22 18:16, Dmitry Vyukov wrote:
>>>>>>> The reproducer only does 2 bpf syscalls, so something is slowly leaking in bpf.
>>>>>>> My first suspect would be one of these. Since workqueue is async, it
>>>>>>> may cause such slow drain that happens only when tasks are spawned
>>>>>>> fast. I don't know if there is a procfs/debugfs introspection file to
>>>>>>> monitor workqueue lengths to verify this hypothesis.
>>>>>>
>>>>>> If you can reproduce locally, you can call show_workqueue_state()
>>>>>> (part of SysRq-t) when hitting the limit.
>>>>>>
>>>>>> --- a/kernel/locking/lockdep.c
>>>>>> +++ b/kernel/locking/lockdep.c
>>>>>> @@ -1277,6 +1277,7 @@ register_lock_class(struct lockdep_map *lock, unsigned int subclass, int force)
>>>>>>
>>>>>>                   print_lockdep_off("BUG: MAX_LOCKDEP_KEYS too low!");
>>>>>>                   dump_stack();
>>>>>> +               show_workqueue_state();
>>>>>>                   return NULL;
>>>>>>           }
>>>>>>           nr_lock_classes++;
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Here is the result:
>>>>> https://pastebin.com/rPn0Cytu
>>>>
>>>> Do you mind posting this publicly?
>>>> Yes, it seems that bpf_map_free_deferred is the problem (11138
>>>> outstanding callbacks).
>>>>
>>>
>>> Wow. Horribly stuck queue. I guess BPF wants to try WQ created by
>>>
>>>    alloc_workqueue("bpf_free_wq", WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
>>>
>>> rather than system_wq. You can add Tejun Heo <tj@kernel.org> for WQ.
>>>
>>> Anyway, please post your result to ML.
> 
> 
> 
> https://pastebin.com/JfrmzguK is with the patch below applied. There seems to be less output. Interestingly, just as I was about to hit "send", the OOM killer kicked in and tried killing a bunch of "maxlockdep" processes (my test case):
> 
> [  891.037315] [  31007]     0 31007      281        5    49152        0          1000 maxlockdep
> [  891.037540] [  31009]     0 31009      281        5    49152        0          1000 maxlockdep
> [  891.037760] [  31012]     0 31012      281        5    49152        0          1000 maxlockdep
> [  891.037980] [  31013]     0 31013      281        5    47104        0             0 maxlockdep
> [  891.038210] [  31014]     0 31014      281        5    49152        0          1000 maxlockdep
> [  891.038429] [  31018]     0 31018      281        5    47104        0             0 maxlockdep
> [  891.038652] [  31019]     0 31019      281        5    49152        0          1000 maxlockdep
> [  891.038874] [  31020]     0 31020      281        5    49152        0          1000 maxlockdep
> [  891.039095] [  31021]     0 31021      281        5    49152        0          1000 maxlockdep
> [  891.039317] [  31022]     0 31022      281        5    47104        0             0 maxlockdep
> 
> 
> 
> 
> And (re)adding LKML and Tejun as suggested. Thanks,
> 
> 
>>>
>>
>> Does this patch (which is only compile tested) reduce number of pending works
>> when hitting "BUG: MAX_LOCKDEP_KEYS too low!" ?
>>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index 07cb5d15e743..c6c6902090f0 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -41,6 +41,7 @@ struct bpf_local_storage_map;
>>   struct kobject;
>>   struct mem_cgroup;
>>   +extern struct workqueue_struct *bpf_free_wq;
>>   extern struct idr btf_idr;
>>   extern spinlock_t btf_idr_lock;
>>   extern struct kobject *btf_kobj;
>> diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
>> index 1f8453343bf2..8b1cf6aab089 100644
>> --- a/kernel/bpf/arraymap.c
>> +++ b/kernel/bpf/arraymap.c
>> @@ -994,7 +994,7 @@ static void prog_array_map_clear(struct bpf_map *map)
>>       struct bpf_array_aux *aux = container_of(map, struct bpf_array,
>>                            map)->aux;
>>       bpf_map_inc(map);
>> -    schedule_work(&aux->work);
>> +    queue_work(bpf_free_wq, &aux->work);
>>   }
>>     static struct bpf_map *prog_array_map_alloc(union bpf_attr *attr)
>> diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
>> index 96555a8a2c54..f272844163df 100644
>> --- a/kernel/bpf/cgroup.c
>> +++ b/kernel/bpf/cgroup.c
>> @@ -160,7 +160,7 @@ static void cgroup_bpf_release_fn(struct percpu_ref *ref)
>>       struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);
>>         INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
>> -    queue_work(system_wq, &cgrp->bpf.release_work);
>> +    queue_work(bpf_free_wq, &cgrp->bpf.release_work);
>>   }
>>     /* Get underlying bpf_prog of bpf_prog_list entry, regardless if it's through
>> diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
>> index 261f8692d0d2..9d76c0d77687 100644
>> --- a/kernel/bpf/core.c
>> +++ b/kernel/bpf/core.c
>> @@ -34,6 +34,15 @@
>>   #include <linux/log2.h>
>>   #include <asm/unaligned.h>
>>   +struct workqueue_struct *bpf_free_wq;
>> +
>> +static int __init bpf_free_wq_init(void)
>> +{
>> +    bpf_free_wq = alloc_workqueue("bpf_free_wq", WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND, 0);
>> +    return 0;
>> +}
>> +subsys_initcall(bpf_free_wq_init);
>> +
>>   /* Registers */
>>   #define BPF_R0    regs[BPF_REG_0]
>>   #define BPF_R1    regs[BPF_REG_1]
>> @@ -2152,7 +2161,7 @@ void bpf_prog_free(struct bpf_prog *fp)
>>       if (aux->dst_prog)
>>           bpf_prog_put(aux->dst_prog);
>>       INIT_WORK(&aux->work, bpf_prog_free_deferred);
>> -    schedule_work(&aux->work);
>> +    queue_work(bpf_free_wq, &aux->work);
>>   }
>>   EXPORT_SYMBOL_GPL(bpf_prog_free);
>>   diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
>> index 747313698178..6507cc8263fc 100644
>> --- a/kernel/bpf/cpumap.c
>> +++ b/kernel/bpf/cpumap.c
>> @@ -515,7 +515,7 @@ static void __cpu_map_entry_replace(struct bpf_cpu_map *cmap,
>>       if (old_rcpu) {
>>           call_rcu(&old_rcpu->rcu, __cpu_map_entry_free);
>>           INIT_WORK(&old_rcpu->kthread_stop_wq, cpu_map_kthread_stop);
>> -        schedule_work(&old_rcpu->kthread_stop_wq);
>> +        queue_work(bpf_free_wq, &old_rcpu->kthread_stop_wq);
>>       }
>>   }
>>   diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
>> index e5999d86c76e..084b903b4ee6 100644
>> --- a/kernel/bpf/syscall.c
>> +++ b/kernel/bpf/syscall.c
>> @@ -477,7 +477,7 @@ static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
>>           bpf_map_free_id(map, do_idr_lock);
>>           btf_put(map->btf);
>>           INIT_WORK(&map->work, bpf_map_free_deferred);
>> -        schedule_work(&map->work);
>> +        queue_work(bpf_free_wq, &map->work);
>>       }
>>   }
>>   @@ -2343,7 +2343,7 @@ void bpf_link_put(struct bpf_link *link)
>>         if (in_atomic()) {
>>           INIT_WORK(&link->work, bpf_link_put_deferred);
>> -        schedule_work(&link->work);
>> +        queue_work(bpf_free_wq, &link->work);
>>       } else {
>>           bpf_link_free(link);
>>       }
>>
> 



* Re: BPF: unbounded bpf_map_free_deferred problem
  2021-01-23  0:39               ` BPF: unbounded bpf_map_free_deferred problem Tetsuo Handa
@ 2021-01-23  3:27                 ` Cong Wang
       [not found]                   ` <cf17e6c4-76c7-52b9-39d5-c14946070fc4@i-love.sakura.ne.jp>
  0 siblings, 1 reply; 4+ messages in thread
From: Cong Wang @ 2021-01-23  3:27 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: netdev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Linux Kernel Mailing List, Tejun Heo,
	Alexey Kardashevskiy, Dmitry Vyukov

On Fri, Jan 22, 2021 at 4:42 PM Tetsuo Handa
<penguin-kernel@i-love.sakura.ne.jp> wrote:
>
> Hello, BPF developers.
>
> Alexey Kardashevskiy is reporting that system_wq gets stuck due to flooding of
> unbounded bpf_map_free_deferred works. Use of a WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND
> workqueue did not solve the problem. Is it possible that a refcount leak somewhere
> is preventing bpf_map_free_deferred from completing? Please see
> https://lkml.kernel.org/r/CACT4Y+Z+kwPM=WUzJ-e359PWeLLqmF0w4Yxp1spzZ=+J0ekrag@mail.gmail.com .
>

Which map does the reproducer create? And where exactly do
those works block?

Different map->ops->map_free() implementations wait for different reasons;
for example, htab_map_free() waits for in-flight htab_elem_free_rcu().
I can't immediately see how they could wait for each other, if that
is what you meant above.

Thanks.


* Re: BPF: unbounded bpf_map_free_deferred problem
       [not found]                     ` <c1aecd4e-8db7-87a5-94bf-c630f1cf0866@ozlabs.ru>
@ 2021-01-25  0:52                       ` Tetsuo Handa
  0 siblings, 0 replies; 4+ messages in thread
From: Tetsuo Handa @ 2021-01-25  0:52 UTC (permalink / raw)
  To: Cong Wang, netdev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko
  Cc: Tejun Heo, Dmitry Vyukov, Alexey Kardashevskiy

On 2021/01/25 8:48, Alexey Kardashevskiy wrote:
> 
> 
> On 23/01/2021 22:17, Tetsuo Handa wrote:
>> On 2021/01/23 12:27, Cong Wang wrote:
>>> On Fri, Jan 22, 2021 at 4:42 PM Tetsuo Handa
>>> <penguin-kernel@i-love.sakura.ne.jp> wrote:
>>>>
>>>> Hello, BPF developers.
>>>>
>>>> Alexey Kardashevskiy is reporting that system_wq gets stuck due to flooding of
>>>> unbounded bpf_map_free_deferred works. Use of a WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UNBOUND
>>>> workqueue did not solve the problem. Is it possible that a refcount leak somewhere
>>>> is preventing bpf_map_free_deferred from completing? Please see
>>>> https://lkml.kernel.org/r/CACT4Y+Z+kwPM=WUzJ-e359PWeLLqmF0w4Yxp1spzZ=+J0ekrag@mail.gmail.com .
>>>>
>>>
>>> Which map does the reproducer create? And where exactly do
>>> those works block?
>>>
>>> Different map->ops->map_free() implementations wait for different reasons;
>>> for example, htab_map_free() waits for in-flight htab_elem_free_rcu().
>>> I can't immediately see how they could wait for each other, if that
>>> is what you meant above.

I guess that the rcu_barrier() in htab_map_free() is taking longer than it
should, since Alexey said

  The offender is htab->lockdep_key. If I run the repro at a slow rate, no
  problems appear; traces show lockdep_unregister_key() is called and the
  leak is quite slow.

at https://lkml.kernel.org/r/c099ad52-0c2c-b886-bae2-c64bd8626452@ozlabs.ru , and
the pending works seem to be reduced by use of a WQ_MEM_RECLAIM | WQ_HIGHPRI |
WQ_UNBOUND workqueue
at https://lkml.kernel.org/r/e4767b84-05a4-07c0-811b-b3a08cad2f43@ozlabs.ru .

>>>
>>> Thanks.
>>>
>>
>> Alexey, please try the diff below on top of "show_workqueue_state() right after the bug message"
>> and Hillf's patch. And please capture the output several times so that we can check whether
>> sched_show_task() always points to the same function.
> 
> 
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index 9880b6c0e272..616464dd8e92 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -50,6 +50,7 @@
>>   #include <linux/uaccess.h>
>>   #include <linux/sched/isolation.h>
>>   #include <linux/nmi.h>
>> +#include <linux/sched/debug.h>
>>     #include "workqueue_internal.h"
>>   @@ -4725,6 +4726,10 @@ static void show_pwq(struct pool_workqueue *pwq)
>>               comma = true;
>>           }
>>           pr_cont("\n");
>> +        hash_for_each(pool->busy_hash, bkt, worker, hentry) {
>> +            if (worker->current_pwq == pwq)
>> +                sched_show_task(worker->task);
>> +        }
>>       }
>>         list_for_each_entry(work, &pool->worklist, entry) {
>>
> 
> Below are 3 samples; they all report:
> 
> Workqueue: events_unbound bpf_map_free_deferred
> 
> Hope that helps. I'm still getting my head around this new can of worms :) Thanks,
> 
> 
> 
> [  148.770893] hrtimer: interrupt took 1070385 ns
> [  207.882594] BUG: MAX_LOCKDEP_KEYS too low!
> [  207.886689] turning off the locking correctness validator.
> [  207.886766] CPU: 1 PID: 9448 Comm: maxlockdep Not tainted 5.11.0-rc4-le_syzkaller_a+fstn1 #64
> [  207.886882] Call Trace:
> [  207.886914] [c00000003b397690] [c000000000d112f0] dump_stack+0x13c/0x1bc (unreliable)
> [  207.887052] [c00000003b3976e0] [c0000000002edb0c] register_lock_class+0xcbc/0xe20
> [  207.887156] [c00000003b3977f0] [c0000000002e97b8] __lock_acquire+0xa8/0x21e0
> [  207.887247] [c00000003b397940] [c0000000002ec674] lock_acquire+0x2c4/0x5c0
> [  207.887328] [c00000003b397a30] [c0000000018212dc] _raw_spin_lock_irqsave+0x7c/0xb0
> [  207.887436] [c00000003b397a70] [c0000000004dc9d0] htab_lru_map_update_elem+0x3e0/0x6c0
> [  207.887537] [c00000003b397af0] [c0000000004a2b14] bpf_map_update_value.isra.24+0x734/0x820
> [  207.887631] [c00000003b397b50] [c0000000004a7d58] generic_map_update_batch+0x1b8/0x3b0
> [  207.887723] [c00000003b397bf0] [c0000000004a05e0] bpf_map_do_batch+0x1b0/0x390
> [  207.887816] [c00000003b397c50] [c0000000004aca90] __do_sys_bpf+0x13d0/0x35b0
> [  207.887905] [c00000003b397db0] [c00000000004b648] system_call_exception+0x178/0x2b0
> [  207.887998] [c00000003b397e10] [c00000000000e060] system_call_common+0xf0/0x27c
> [  207.888090] Showing busy workqueues and worker pools:
> [  207.888149] workqueue events_unbound: flags=0x2
> [  207.888206]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/512 refcnt=3
> [  207.888290]     in-flight: 7:bpf_map_free_deferred
> [  207.888357] ___K___ (1) show_pwq 4729
> [  207.888403] task:kworker/u16:0   state:D stack:10928 pid:    7 ppid:     2 flags:0x00000800
> [  207.888494] Workqueue: events_unbound bpf_map_free_deferred
> [  207.888559] Call Trace:
> [  207.888590] [c00000001343b4f0] [c000000013412b18] 0xc000000013412b18 (unreliable)
> [  207.888691] [c00000001343b6e0] [c0000000000271d0] __switch_to+0x3e0/0x700
> [  207.888763] [c00000001343b750] [c000000001817b48] __schedule+0x3c8/0xc80
> [  207.888828] [c00000001343b820] [c000000001818494] schedule+0x94/0x170
> [  207.888892] [c00000001343b850] [c00000000181f9ec] schedule_timeout+0x43c/0x650
> [  207.888966] [c00000001343b960] [c00000000181a194] wait_for_completion+0xb4/0x190
> [  207.889041] [c00000001343b9c0] [c000000000325904] __wait_rcu_gp+0x1c4/0x240
> [  207.889105] [c00000001343ba20] [c000000000339164] synchronize_rcu+0xa4/0x180
> [  207.889179] [c00000001343bac0] [c0000000002e54b0] lockdep_unregister_key+0x1e0/0x470
> [  207.889253] [c00000001343bb60] [c0000000004d942c] htab_map_free+0x20c/0x250
> [  207.889317] [c00000001343bbc0] [c0000000004a1934] bpf_map_free_deferred+0xa4/0x3e0
> [  207.889391] [c00000001343bc30] [c00000000026d508] process_one_work+0x468/0xb00
> [  207.889465] [c00000001343bd10] [c00000000026dc34] worker_thread+0x94/0x760
> [  207.889535] [c00000001343bda0] [c00000000027e380] kthread+0x1f0/0x200
> [  207.889598] [c00000001343be10] [c00000000000e2f0] ret_from_kernel_thread+0x5c/0x6c
> [  207.889674] workqueue events_power_efficient: flags=0x80
> [  207.889721]   pwq 12: cpus=6 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> [  207.889789]     pending: gc_worker
> [  207.889894] pool 16: cpus=0-7 flags=0x4 nice=0 hung=76s workers=3 idle: 81 253
> 
> 
> 
> 
> [   29.071632] maxlockdep (2535) used greatest stack depth: 7872 bytes left
> [   30.855628] hrtimer: interrupt took 1041204 ns
> [   96.748271] BUG: MAX_LOCKDEP_KEYS too low!
> [   96.780332] turning off the locking correctness validator.
> [   96.781239] CPU: 0 PID: 9443 Comm: maxlockdep Not tainted 5.11.0-rc4-le_syzkaller_a+fstn1 #64
> [   96.782008] Call Trace:
> [   96.782571] [c00000003658b690] [c000000000d112f0] dump_stack+0x13c/0x1bc (unreliable)
> [   96.783472] [c00000003658b6e0] [c0000000002edb0c] register_lock_class+0xcbc/0xe20
> [   96.784037] [c00000003658b7f0] [c0000000002e97b8] __lock_acquire+0xa8/0x21e0
> [   96.784607] [c00000003658b940] [c0000000002ec674] lock_acquire+0x2c4/0x5c0
> [   96.785067] [c00000003658ba30] [c0000000018212dc] _raw_spin_lock_irqsave+0x7c/0xb0
> [   96.785307] [c00000003658ba70] [c0000000004dc9d0] htab_lru_map_update_elem+0x3e0/0x6c0
> [   96.785541] [c00000003658baf0] [c0000000004a2b14] bpf_map_update_value.isra.24+0x734/0x820
> [   96.785749] [c00000003658bb50] [c0000000004a7d58] generic_map_update_batch+0x1b8/0x3b0
> [   96.785963] [c00000003658bbf0] [c0000000004a05e0] bpf_map_do_batch+0x1b0/0x390
> [   96.786161] [c00000003658bc50] [c0000000004aca90] __do_sys_bpf+0x13d0/0x35b0
> [   96.786392] [c00000003658bdb0] [c00000000004b648] system_call_exception+0x178/0x2b0
> [   96.786614] [c00000003658be10] [c00000000000e060] system_call_common+0xf0/0x27c
> [   96.786819] Showing busy workqueues and worker pools:
> [   96.786942] workqueue events_unbound: flags=0x2
> [   96.787062]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/512 refcnt=3
> [   96.787272]     in-flight: 81:bpf_map_free_deferred
> [   96.787441] ___K___ (0) show_pwq 4729
> [   96.787538] task:kworker/u16:1   state:D stack:10928 pid:   81 ppid:     2 flags:0x00000800
> [   96.787767] Workqueue: events_unbound bpf_map_free_deferred
> [   96.787909] Call Trace:
> [   96.787975] [c000000013b57600] [c000000013b57660] 0xc000000013b57660 (unreliable)
> [   96.788222] [c000000013b577f0] [c0000000000271d0] __switch_to+0x3e0/0x700
> [   96.788417] [c000000013b57860] [c000000001817b48] __schedule+0x3c8/0xc80
> [   96.788595] [c000000013b57930] [c000000001818494] schedule+0x94/0x170
> [   96.788772] [c000000013b57960] [c00000000181f9ec] schedule_timeout+0x43c/0x650
> [   96.789002] [c000000013b57a70] [c00000000181a194] wait_for_completion+0xb4/0x190
> [   96.789222] [c000000013b57ad0] [c000000000333e8c] rcu_barrier+0x2fc/0xc70
> [   96.789402] [c000000013b57b60] [c0000000004d9258] htab_map_free+0x38/0x250
> [   96.789580] [c000000013b57bc0] [c0000000004a1934] bpf_map_free_deferred+0xa4/0x3e0
> [   96.789804] [c000000013b57c30] [c00000000026d508] process_one_work+0x468/0xb00
> [   96.790002] [c000000013b57d10] [c00000000026dc34] worker_thread+0x94/0x760
> [   96.790177] [c000000013b57da0] [c00000000027e380] kthread+0x1f0/0x200
> [   96.790346] [c000000013b57e10] [c00000000000e2f0] ret_from_kernel_thread+0x5c/0x6c
> [   96.790559] workqueue events_power_efficient: flags=0x80
> [   96.790717]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> [   96.790956]     pending: gc_worker
> [   96.791215] pool 16: cpus=0-7 flags=0x4 nice=0 hung=76s workers=3 idle: 7 251
> 
> 
> 
> 
> 
> [   21.324859] hrtimer: interrupt took 383633 ns
> [   83.653616] BUG: MAX_LOCKDEP_KEYS too low!
> [   83.653806] turning off the locking correctness validator.
> [   83.653963] CPU: 4 PID: 9460 Comm: maxlockdep Not tainted 5.11.0-rc4-le_syzkaller_a+fstn1 #64
> [   83.654183] Call Trace:
> [   83.654250] [c00000004d53b690] [c000000000d112f0] dump_stack+0x13c/0x1bc (unreliable)
> [   83.654565] [c00000004d53b6e0] [c0000000002edb0c] register_lock_class+0xcbc/0xe20
> [   83.654808] [c00000004d53b7f0] [c0000000002e97b8] __lock_acquire+0xa8/0x21e0
> [   83.655020] [c00000004d53b940] [c0000000002ec674] lock_acquire+0x2c4/0x5c0
> [   83.655203] [c00000004d53ba30] [c0000000018212dc] _raw_spin_lock_irqsave+0x7c/0xb0
> [   83.655444] [c00000004d53ba70] [c0000000004dc9d0] htab_lru_map_update_elem+0x3e0/0x6c0
> [   83.655673] [c00000004d53baf0] [c0000000004a2b14] bpf_map_update_value.isra.24+0x734/0x820
> [   83.655886] [c00000004d53bb50] [c0000000004a7d58] generic_map_update_batch+0x1b8/0x3b0
> [   83.656088] [c00000004d53bbf0] [c0000000004a05e0] bpf_map_do_batch+0x1b0/0x390
> [   83.656297] [c00000004d53bc50] [c0000000004aca90] __do_sys_bpf+0x13d0/0x35b0
> [   83.656496] [c00000004d53bdb0] [c00000000004b648] system_call_exception+0x178/0x2b0
> [   83.656700] [c00000004d53be10] [c00000000000e060] system_call_common+0xf0/0x27c
> [   83.656903] Showing busy workqueues and worker pools:
> [   83.657044] workqueue events_unbound: flags=0x2
> [   83.657165]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=1/512 refcnt=3
> [   83.657358]     in-flight: 81:bpf_map_free_deferred
> [   83.657519] ___K___ (4) show_pwq 4729
> [   83.657620] task:kworker/u16:1   state:D stack:10928 pid:   81 ppid:     2 flags:0x00000800
> [   83.657820] Workqueue: events_unbound bpf_map_free_deferred
> [   83.657967] Call Trace:
> [   83.658033] [c000000013b574f0] [c000000013aefa98] 0xc000000013aefa98 (unreliable)
> [   83.658255] [c000000013b576e0] [c0000000000271d0] __switch_to+0x3e0/0x700
> [   83.658437] [c000000013b57750] [c000000001817b48] __schedule+0x3c8/0xc80
> [   83.658618] [c000000013b57820] [c000000001818494] schedule+0x94/0x170
> [   83.658815] [c000000013b57850] [c00000000181f9ec] schedule_timeout+0x43c/0x650
> [   83.659019] [c000000013b57960] [c00000000181a194] wait_for_completion+0xb4/0x190
> [   83.659229] [c000000013b579c0] [c000000000325904] __wait_rcu_gp+0x1c4/0x240
> [   83.659406] [c000000013b57a20] [c000000000339164] synchronize_rcu+0xa4/0x180
> [   83.659617] [c000000013b57ac0] [c0000000002e54b0] lockdep_unregister_key+0x1e0/0x470
> [   83.659825] [c000000013b57b60] [c0000000004d942c] htab_map_free+0x20c/0x250
> [   83.660002] [c000000013b57bc0] [c0000000004a1934] bpf_map_free_deferred+0xa4/0x3e0
> [   83.660212] [c000000013b57c30] [c00000000026d508] process_one_work+0x468/0xb00
> [   83.660415] [c000000013b57d10] [c00000000026dc34] worker_thread+0x94/0x760
> [   83.660598] [c000000013b57da0] [c00000000027e380] kthread+0x1f0/0x200
> [   83.660771] [c000000013b57e10] [c00000000000e2f0] ret_from_kernel_thread+0x5c/0x6c
> [   83.661114] pool 16: cpus=0-7 flags=0x4 nice=0 hung=67s workers=3 idle: 7 252
> 
> 
> 
> 


