* [PATCH v2] workqueue: Don't record workqueue stack holding raw_spin_lock
@ 2021-09-02 20:01 Shuah Khan
  2021-09-02 21:58 ` Marco Elver
  0 siblings, 1 reply; 5+ messages in thread
From: Shuah Khan @ 2021-09-02 20:01 UTC (permalink / raw)
  To: tj, jiangshanlai, akpm, elver, andreyknvl, dvyukov, walter-zh.wu
  Cc: Shuah Khan, linux-kernel

When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
it tries to allocate memory, attempting to acquire a spinlock in the
page allocation code while holding the workqueue pool's raw_spinlock.

There are several instances of this problem when the block layer tries
to __queue_work(). The call trace from one of these instances is below:

    kblockd_mod_delayed_work_on()
      mod_delayed_work_on()
        __queue_delayed_work()
          __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
            insert_work()
              kasan_record_aux_stack()
                kasan_save_stack()
                  stack_depot_save()
                    alloc_pages()
                      __alloc_pages()
                        get_page_from_freelist()
                          rmqueue()
                            rmqueue_pcplist()
                              local_lock_irqsave(&pagesets.lock, flags);
                              [ BUG: Invalid wait context triggered ]

Fix it by calling kasan_record_aux_stack() only when
CONFIG_PROVE_RAW_LOCK_NESTING is not enabled. Other options, such as
calling kasan_record_aux_stack() after releasing the pool lock, were
explored; compiling out the call is the least disruptive way to avoid
nesting a spinlock inside a raw spinlock.

=============================
 [ BUG: Invalid wait context ]
 5.14.0-rc7+ #8 Not tainted
 -----------------------------
 snap/532 is trying to lock:
 ffff888374f32ba0 (lock#2){..-.}-{3:3}, at: get_page_from_freelist (mm/page_alloc.c:3665 mm/page_alloc.c:3703 mm/page_alloc.c:4165)
 other info that might help us debug this:
 context-{5:5}
 3 locks held by snap/532:
 #0: ffff888139fa4408 (&type->i_mutex_dir_key#10){.+.+}-{4:4}, at: walk_component (fs/namei.c:1663 fs/namei.c:1959)
 #1: ffffffffab870c40 (rcu_read_lock){....}-{1:3}, at: __queue_work (./arch/x86/include/asm/preempt.h:80 ./include/linux/rcupdate.h:68 ./include/linux/rcupdate.h:685 kernel/workqueue.c:1421)
 #2: ffff888374f36cd8 (&pool->lock){-.-.}-{2:2}, at: __queue_work (kernel/workqueue.c:1466)
 stack backtrace:
 CPU: 14 PID: 532 Comm: snap Not tainted 5.14.0-rc7+ #8
 Hardware name: LENOVO 90Q30008US/3728, BIOS O4ZKT1CA 09/16/2020
 Call Trace:
 dump_stack_lvl (lib/dump_stack.c:106 (discriminator 4))
 dump_stack (lib/dump_stack.c:113)
 __lock_acquire.cold (kernel/locking/lockdep.c:4666 kernel/locking/lockdep.c:4727 kernel/locking/lockdep.c:4965)
 ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4873)
 ? lock_is_held_type (kernel/locking/lockdep.c:5368 kernel/locking/lockdep.c:5668)
 lock_acquire (kernel/locking/lockdep.c:438 kernel/locking/lockdep.c:5627 kernel/locking/lockdep.c:5590)
 ? get_page_from_freelist (mm/page_alloc.c:3665 mm/page_alloc.c:3703 mm/page_alloc.c:4165)
 ? lock_release (kernel/locking/lockdep.c:5593)
 ? __kasan_check_read (mm/kasan/shadow.c:32)
 ? __lock_acquire (kernel/locking/lockdep.c:5019)
 ? __zone_watermark_ok (./include/linux/list.h:282 ./include/linux/mmzone.h:111 mm/page_alloc.c:3908)
 get_page_from_freelist (./include/linux/local_lock_internal.h:43 mm/page_alloc.c:3665 mm/page_alloc.c:3703 mm/page_alloc.c:4165)
 ? get_page_from_freelist (mm/page_alloc.c:3665 mm/page_alloc.c:3703 mm/page_alloc.c:4165)
 ? lock_is_held_type (kernel/locking/lockdep.c:5368 kernel/locking/lockdep.c:5668)
 ? is_bpf_text_address (./arch/x86/include/asm/preempt.h:85 ./include/linux/rcupdate.h:73 ./include/linux/rcupdate.h:719 kernel/bpf/core.c:708)
 ? lock_downgrade (kernel/locking/lockdep.c:5633)
 ? __zone_watermark_ok (mm/page_alloc.c:4054)
 __alloc_pages (mm/page_alloc.c:5391)
 ? __alloc_pages_slowpath.constprop.0 (mm/page_alloc.c:5354)
 ? create_prof_cpu_mask (kernel/stacktrace.c:82)
 ? _find_first_bit (lib/find_bit.c:83)
 alloc_pages (mm/mempolicy.c:2249)
 stack_depot_save (lib/stackdepot.c:304)
 ? lock_is_held_type (kernel/locking/lockdep.c:5368 kernel/locking/lockdep.c:5668)
 kasan_save_stack (mm/kasan/common.c:41)
 ? kasan_save_stack (mm/kasan/common.c:39)
 ? kasan_record_aux_stack (mm/kasan/generic.c:348)
 ? insert_work (./include/linux/instrumented.h:71 ./include/asm-generic/bitops/instrumented-non-atomic.h:134 kernel/workqueue.c:616 kernel/workqueue.c:623 kernel/workqueue.c:1335)
 ? __queue_work (kernel/workqueue.c:1501)
 ? __queue_delayed_work (kernel/workqueue.c:1657)
 ? mod_delayed_work_on (kernel/workqueue.c:1720)
 ? kblockd_mod_delayed_work_on (block/blk-core.c:1633)
 ? __blk_mq_delay_run_hw_queue (block/blk-mq.c:1567)
 ? blk_mq_run_hw_queue (block/blk-mq.c:1610)
 ? blk_mq_sched_insert_request (block/blk-mq-sched.c:480)
 ? blk_mq_submit_bio (block/blk-mq.c:2276)

Fixes: e89a85d63fb2 ("workqueue: kasan: record workqueue stack")
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
---
Changes since v1:
-- Instead of changing when record happens, disable record
   when CONFIG_PROVE_RAW_LOCK_NESTING=y
 
 kernel/workqueue.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f148eacda55a..435970ef81ae 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1328,8 +1328,16 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
 {
 	struct worker_pool *pool = pwq->pool;
 
-	/* record the work call stack in order to print it in KASAN reports */
+	/*
+	 * Record the work call stack in order to print it in KASAN reports.
+	 * Doing this while holding the pool's raw spinlock nests the page
+	 * allocator's spinlock inside it, which PROVE_RAW_LOCK_NESTING flags.
+	 *
+	 * Avoid recording when CONFIG_PROVE_RAW_LOCK_NESTING is enabled.
+	 */
+#if !defined(CONFIG_PROVE_RAW_LOCK_NESTING)
 	kasan_record_aux_stack(work);
+#endif
 
 	/* we own @work, set data and link */
 	set_work_pwq(work, pwq, extra_flags);
-- 
2.30.2



* Re: [PATCH v2] workqueue: Don't record workqueue stack holding raw_spin_lock
  2021-09-02 20:01 [PATCH v2] workqueue: Don't record workqueue stack holding raw_spin_lock Shuah Khan
@ 2021-09-02 21:58 ` Marco Elver
  2021-09-02 23:46   ` Shuah Khan
  0 siblings, 1 reply; 5+ messages in thread
From: Marco Elver @ 2021-09-02 21:58 UTC (permalink / raw)
  To: Shuah Khan
  Cc: tj, jiangshanlai, akpm, andreyknvl, dvyukov, walter-zh.wu,
	linux-kernel, Sebastian Andrzej Siewior

On Thu, 2 Sept 2021 at 22:01, Shuah Khan <skhan@linuxfoundation.org> wrote:
>
> When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
> kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
> it tries to allocate memory, attempting to acquire a spinlock in the
> page allocation code while holding the workqueue pool's raw_spinlock.
>
> There are several instances of this problem when the block layer tries
> to __queue_work(). The call trace from one of these instances is below:
>
>     kblockd_mod_delayed_work_on()
>       mod_delayed_work_on()
>         __queue_delayed_work()
>           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
>             insert_work()
>               kasan_record_aux_stack()
>                 kasan_save_stack()
>                   stack_depot_save()
>                     alloc_pages()
>                       __alloc_pages()
>                         get_page_from_freelist()
>                           rmqueue()
>                             rmqueue_pcplist()
>                               local_lock_irqsave(&pagesets.lock, flags);
>                               [ BUG: Invalid wait context triggered ]
>
> Fix it by calling kasan_record_aux_stack() only when
> CONFIG_PROVE_RAW_LOCK_NESTING is not enabled. Other options, such as
> calling kasan_record_aux_stack() after releasing the pool lock, were
> explored; compiling out the call is the least disruptive way to avoid
> nesting a spinlock inside a raw spinlock.
>

[snip]

> Fixes: e89a85d63fb2 ("workqueue: kasan: record workqueue stack")
> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
> ---
> Changes since v1:
> -- Instead of changing when record happens, disable record
>    when CONFIG_PROVE_RAW_LOCK_NESTING=y
>
>  kernel/workqueue.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index f148eacda55a..435970ef81ae 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1328,8 +1328,16 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
>  {
>         struct worker_pool *pool = pwq->pool;
>
> -       /* record the work call stack in order to print it in KASAN reports */
> +       /*
> +        * Record the work call stack in order to print it in KASAN reports.
> +        * Doing this while holding the pool's raw spinlock nests the page
> +        * allocator's spinlock inside it, which PROVE_RAW_LOCK_NESTING flags.
> +        *
> +        * Avoid recording when CONFIG_PROVE_RAW_LOCK_NESTING is enabled.
> +        */
> +#if !defined(CONFIG_PROVE_RAW_LOCK_NESTING)

Just "if (!IS_ENABLED(CONFIG_PROVE_RAW_LOCK_NESTING))" should work
here, however...
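
For illustration, that would read as follows (an equivalent sketch;
unlike the #if, the call stays visible to the compiler even when the
option is off):

	/* record the work call stack in order to print it in KASAN reports */
	if (!IS_ENABLED(CONFIG_PROVE_RAW_LOCK_NESTING))
		kasan_record_aux_stack(work);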

... PROVE_RAW_LOCK_NESTING exists for PREEMPT_RT's benefit. I don't
think silencing the debugging tool is the solution, because the bug
still exists in a PREEMPT_RT kernel.

+Cc Sebastian for advice. I may have missed something obvious. :-)

I have a suspicion that kasan_record_aux_stack() (via
stack_depot_save()) is generally unsound on PREEMPT_RT kernels:
memory allocation is preemptible on RT, so it cannot be done within
raw-locked critical sections. Even GFP_NOWAIT/ATOMIC (which
kasan_record_aux_stack() uses) doesn't help.

It follows that if we do not know what type of locks may be held when
calling kasan_record_aux_stack() we have a bug in RT.
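
Concretely, the insert_work() path boils down to the sequence below (a
minimal sketch of the call chain from the splat, not actual workqueue
code):

	raw_spin_lock(&pool->lock);	/* raw lock: non-sleeping, even on RT */
	kasan_record_aux_stack(work);	/* -> kasan_save_stack()
					 * -> stack_depot_save(GFP_NOWAIT)
					 * -> alloc_pages()
					 * -> local_lock_irqsave(&pagesets.lock),
					 *    a sleeping lock on PREEMPT_RT */
	raw_spin_unlock(&pool->lock);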

I see 3 options:

1. Try to move kasan_record_aux_stack() where no raw lock is held.
(Seems complicated per v1 attempt?)

But ideally we make kasan_record_aux_stack() more robust on RT:

2. Make kasan_record_aux_stack() a no-op on RT (and if
PROVE_RAW_LOCK_NESTING); a rough sketch follows below. Perhaps overkill?

3. Try to not allocate memory in stackdepot. Not sure this is feasible
without telling stackdepot to preallocate the max slabs on boot if RT.
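
For option 2, the shape would be roughly the following (a hypothetical
sketch only; __kasan_record_aux_stack() stands in for the existing
recording logic and is not existing code):

	void kasan_record_aux_stack(void *addr)
	{
		/* no-op in configurations where we must not allocate here */
		if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
		    IS_ENABLED(CONFIG_PROVE_RAW_LOCK_NESTING))
			return;
		__kasan_record_aux_stack(addr);
	}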

Anything else? Because I don't think any of the options are satisfying.

Thanks,
-- Marco

>         kasan_record_aux_stack(work);
> +#endif
>
>         /* we own @work, set data and link */
>         set_work_pwq(work, pwq, extra_flags);
> --
> 2.30.2
>


* Re: [PATCH v2] workqueue: Don't record workqueue stack holding raw_spin_lock
  2021-09-02 21:58 ` Marco Elver
@ 2021-09-02 23:46   ` Shuah Khan
  2021-09-06  7:12     ` Marco Elver
  0 siblings, 1 reply; 5+ messages in thread
From: Shuah Khan @ 2021-09-02 23:46 UTC (permalink / raw)
  To: Marco Elver
  Cc: tj, jiangshanlai, akpm, andreyknvl, dvyukov, walter-zh.wu,
	linux-kernel, Sebastian Andrzej Siewior, Shuah Khan

On 9/2/21 3:58 PM, Marco Elver wrote:
> On Thu, 2 Sept 2021 at 22:01, Shuah Khan <skhan@linuxfoundation.org> wrote:
>>
>> When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
>> kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
>> it tries to allocate memory, attempting to acquire a spinlock in the
>> page allocation code while holding the workqueue pool's raw_spinlock.
>>

[snip]

>> Fix it by calling kasan_record_aux_stack() only when
>> CONFIG_PROVE_RAW_LOCK_NESTING is not enabled. Other options, such as
>> calling kasan_record_aux_stack() after releasing the pool lock, were
>> explored; compiling out the call is the least disruptive way to avoid
>> nesting a spinlock inside a raw spinlock.
>>

[snip]

>>
>> Fixes: e89a85d63fb2 ("workqueue: kasan: record workqueue stack")
>> Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
>> ---
>> Changes since v1:
>> -- Instead of changing when record happens, disable record
>>     when CONFIG_PROVE_RAW_LOCK_NESTING=y
>>
>>   kernel/workqueue.c | 10 +++++++++-
>>   1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index f148eacda55a..435970ef81ae 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -1328,8 +1328,16 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
>>   {
>>          struct worker_pool *pool = pwq->pool;
>>
>> -       /* record the work call stack in order to print it in KASAN reports */
>> +       /*
>> +        * Record the work call stack in order to print it in KASAN reports.
>> +        * Doing this while holding the pool's raw spinlock nests the page
>> +        * allocator's spinlock inside it, which PROVE_RAW_LOCK_NESTING flags.
>> +        *
>> +        * Avoid recording when CONFIG_PROVE_RAW_LOCK_NESTING is enabled.
>> +        */
>> +#if !defined(CONFIG_PROVE_RAW_LOCK_NESTING)
> 
> Just "if (!IS_ENABLED(CONFIG_PROVE_RAW_LOCK_NESTING))" should work
> here, however...
> 

Yes. That would work.

> ... PROVE_RAW_LOCK_NESTING exists for PREEMPT_RT's benefit. I don't
> think silencing the debugging tool is the solution, because the bug
> still exists in a PREEMPT_RT kernel.
> 

This silencing is limited in scope to just insert_work() and only when
PROVE_RAW_LOCK_NESTING is enabled. Please see below under your proposed
option 2.

> +Cc Sebastian for advice. I may have missed something obvious. :-)
> 

Thanks for adding Sebastian.

> I have a suspicion that kasan_record_aux_stack() (via
> stack_depot_save()) is generally unsound on PREEMPT_RT kernels:
> memory allocation is preemptible on RT, so it cannot be done within
> raw-locked critical sections. Even GFP_NOWAIT/ATOMIC (which
> kasan_record_aux_stack() uses) doesn't help.
> 
> It follows that if we do not know what type of locks may be held when
> calling kasan_record_aux_stack() we have a bug in RT.
> 
> I see 3 options:
> 
> 1. Try to move kasan_record_aux_stack() where no raw lock is held.
> (Seems complicated per v1 attempt?)
> 

Yes. kasan_record_aux_stack() is best called from insert_work() prior
to insertion, which makes moving it difficult - we don't want to
release the pool lock there.

> But ideally we make kasan_record_aux_stack() more robust on RT:
> 
> 2. Make kasan_record_aux_stack() a no-op on RT (and if
> PROVE_RAW_LOCK_NESTING); a rough sketch follows below. Perhaps overkill?
> 

I considered it and didn't go down that route because it is a big
hammer. I chose to just disable the debug code in the insert_work()
path instead. Not ideal, but it limits the disable to a narrower
scope. Disabling it inside kasan_record_aux_stack() would extend to
all other paths where kasan_record_aux_stack() is used.

> 3. Try to not allocate memory in stackdepot. Not sure this is feasible
> without telling stackdepot to preallocate the max slabs on boot if RT.
> 

We could. I have to ask, though: how many real-world cases do we need
to impact for the debug code to work?

> Anything else? Because I don't think any of the options are satisfying.
> 

One option to consider is a dry-run invalid-nesting check in
kasan_record_aux_stack(), bailing out if it detects invalid nesting.
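
Roughly along these lines (a purely hypothetical sketch --
lockdep_in_raw_locked_context() is a made-up helper, no such API
exists today):

	void kasan_record_aux_stack(void *addr)
	{
		/*
		 * Hypothetical: bail out when lockdep says the current
		 * context already holds a raw lock.
		 */
		if (lockdep_in_raw_locked_context())
			return;
		/* ... existing recording path ... */
	}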

thanks,
-- Shuah


* Re: [PATCH v2] workqueue: Don't record workqueue stack holding raw_spin_lock
  2021-09-02 23:46   ` Shuah Khan
@ 2021-09-06  7:12     ` Marco Elver
  2021-09-06  9:33       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 5+ messages in thread
From: Marco Elver @ 2021-09-06  7:12 UTC (permalink / raw)
  To: Shuah Khan
  Cc: tj, jiangshanlai, akpm, andreyknvl, dvyukov, walter-zh.wu,
	linux-kernel, Sebastian Andrzej Siewior

[-- Attachment #1: Type: text/plain, Size: 1918 bytes --]

On Thu, Sep 02, 2021 at 05:46PM -0600, Shuah Khan wrote:
[...]
> > 3. Try to not allocate memory in stackdepot. Not sure this is feasible
> > without telling stackdepot to preallocate the max slabs on boot if RT.
> > 
> 
> We could. I have to ask, though: how many real-world cases do we need
> to impact for the debug code to work?
> 
> > Anything else? Because I don't think any of the options are satisfying.
> 
> One option to consider is a dry-run invalid-nesting check in
> kasan_record_aux_stack(), bailing out if it detects invalid nesting.

Sadly, if lockdep is off, this won't work. And we need a way to
generically fix this, as otherwise we still have a bug (which may also
cause issues on RT kernels).

I propose we properly fix this and prevent stackdepot from replenishing
its "stack slab" pool if memory allocations cannot be done in the
current context. Specifically, I noticed it's technically a bug to use
either GFP_ATOMIC or GFP_NOWAIT in certain non-preemptive contexts,
including raw_spin_locks (see gfp.h and ab00db216c9c7).

This is what kasan_record_aux_stack() via stackdepot does, and it's a
bug here regardless if lockdep is on or off.

I've prepared a series (see attached draft patches) that allows telling
stackdepot to not replenish its pool if alloc_pages() cannot be called
at all (where GFP_ATOMIC/NOWAIT doesn't even work).

The only downside is that saving a stack trace may fail if: stackdepot
runs out of space AND the same stack trace has not been recorded before.
I expect this to be unlikely, and a simple experiment (boot the kernel)
didn't result in any failure to record stack trace from insert_work().
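
For callers this just means the returned handle may be 0 -- e.g. (a
sketch of the intended use, per the attached draft patches):

	depot_stack_handle_t handle;

	/* can_alloc=false: never calls alloc_pages(), may fail instead */
	handle = __stack_depot_save(entries, nr_entries, GFP_NOWAIT, false);
	if (!handle)
		return;	/* depot full and this trace not recorded before */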

I think this is a reasonable trade-off. And considering that we're
unsure if queuing work can or cannot be done from within an outer
raw_spin_lock'ed critical section, I don't see a better way.

If you agree, I'll send this series out for further review.

Thanks,
-- Marco

[-- Attachment #2: 0001-lib-stackdepot-introduce-__stack_depot_save.patch --]
[-- Type: text/x-diff, Size: 3981 bytes --]

From d765613f04b6aee4ebde71e21c2a6210fe93927d Mon Sep 17 00:00:00 2001
From: Marco Elver <elver@google.com>
Date: Sun, 5 Sep 2021 16:51:26 +0200
Subject: [PATCH 1/4] lib/stackdepot: introduce __stack_depot_save()

Add __stack_depot_save(), which provides more fine-grained control over
stackdepot's memory allocation behaviour, in case stackdepot runs out of
"stack slabs".

Normally stackdepot uses alloc_pages() in case it runs out of space;
passing can_alloc==false to __stack_depot_save() prohibits this, at the
cost of more likely failure to record a stack trace.

Signed-off-by: Marco Elver <elver@google.com>
---
 include/linux/stackdepot.h |  4 ++++
 lib/stackdepot.c           | 38 ++++++++++++++++++++++++++++++++------
 2 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/include/linux/stackdepot.h b/include/linux/stackdepot.h
index 6bb4bc1a5f54..3735b97f62be 100644
--- a/include/linux/stackdepot.h
+++ b/include/linux/stackdepot.h
@@ -13,6 +13,10 @@
 
 typedef u32 depot_stack_handle_t;
 
+depot_stack_handle_t __stack_depot_save(unsigned long *entries,
+					unsigned int nr_entries,
+					gfp_t gfp_flags, bool can_alloc);
+
 depot_stack_handle_t stack_depot_save(unsigned long *entries,
 				      unsigned int nr_entries, gfp_t gfp_flags);
 
diff --git a/lib/stackdepot.c b/lib/stackdepot.c
index c80a9f734253..ec26845b4e29 100644
--- a/lib/stackdepot.c
+++ b/lib/stackdepot.c
@@ -248,17 +248,26 @@ unsigned int stack_depot_fetch(depot_stack_handle_t handle,
 EXPORT_SYMBOL_GPL(stack_depot_fetch);
 
 /**
- * stack_depot_save - Save a stack trace from an array
+ * __stack_depot_save - Save a stack trace from an array
+ *
+ * Save a stack trace from an array. If |can_alloc| is false, avoids allocating
+ * new stack slabs if no space is left in the current stack slab.
+ *
+ * Setting |can_alloc| to false is required if alloc_pages() cannot be used from
+ * the current context; currently this is the case from contexts where not even
+ * %GFP_ATOMIC or %GFP_NOWAIT can be used (NMI, raw_spin_lock).
  *
  * @entries:		Pointer to storage array
  * @nr_entries:		Size of the storage array
  * @alloc_flags:	Allocation gfp flags
+ * @can_alloc:		Allocate stack slabs (increased chance of failure if false)
  *
- * Return: The handle of the stack struct stored in depot
+ * Return: The handle of the stack struct stored in depot (0 on failure)
  */
-depot_stack_handle_t stack_depot_save(unsigned long *entries,
-				      unsigned int nr_entries,
-				      gfp_t alloc_flags)
+depot_stack_handle_t __stack_depot_save(unsigned long *entries,
+					unsigned int nr_entries,
+					gfp_t alloc_flags,
+					bool can_alloc)
 {
 	struct stack_record *found = NULL, **bucket;
 	depot_stack_handle_t retval = 0;
@@ -291,7 +300,7 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
 	 * The smp_load_acquire() here pairs with smp_store_release() to
 	 * |next_slab_inited| in depot_alloc_stack() and init_stack_slab().
 	 */
-	if (unlikely(!smp_load_acquire(&next_slab_inited))) {
+	if (unlikely(can_alloc && !smp_load_acquire(&next_slab_inited))) {
 		/*
 		 * Zero out zone modifiers, as we don't have specific zone
 		 * requirements. Keep the flags related to allocation in atomic
@@ -339,6 +348,23 @@ depot_stack_handle_t stack_depot_save(unsigned long *entries,
 fast_exit:
 	return retval;
 }
+EXPORT_SYMBOL_GPL(__stack_depot_save);
+
+/**
+ * stack_depot_save - Save a stack trace from an array
+ *
+ * @entries:		Pointer to storage array
+ * @nr_entries:		Size of the storage array
+ * @alloc_flags:	Allocation gfp flags
+ *
+ * Return: The handle of the stack struct stored in depot (0 on failure)
+ */
+depot_stack_handle_t stack_depot_save(unsigned long *entries,
+				      unsigned int nr_entries,
+				      gfp_t alloc_flags)
+{
+	return __stack_depot_save(entries, nr_entries, alloc_flags, true);
+}
 EXPORT_SYMBOL_GPL(stack_depot_save);
 
 static inline int in_irqentry_text(unsigned long ptr)
-- 
2.33.0.153.gba50c8fa24-goog


[-- Attachment #3: 0002-kasan-common-provide-can_alloc-in-kasan_save_stack.patch --]
[-- Type: text/x-diff, Size: 2536 bytes --]

From b297b06ca93a124f450c06ea41b9b0d761ca3402 Mon Sep 17 00:00:00 2001
From: Marco Elver <elver@google.com>
Date: Sun, 5 Sep 2021 16:56:43 +0200
Subject: [PATCH 2/4] kasan: common: provide can_alloc in kasan_save_stack()

Add another argument, can_alloc, to kasan_save_stack() which is passed
as-is to __stack_depot_save().

No functional change intended.

Signed-off-by: Marco Elver <elver@google.com>
---
 mm/kasan/common.c  | 6 +++---
 mm/kasan/generic.c | 2 +-
 mm/kasan/kasan.h   | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 2baf121fb8c5..3e0999892c36 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -30,20 +30,20 @@
 #include "kasan.h"
 #include "../slab.h"
 
-depot_stack_handle_t kasan_save_stack(gfp_t flags)
+depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc)
 {
 	unsigned long entries[KASAN_STACK_DEPTH];
 	unsigned int nr_entries;
 
 	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
 	nr_entries = filter_irq_stacks(entries, nr_entries);
-	return stack_depot_save(entries, nr_entries, flags);
+	return __stack_depot_save(entries, nr_entries, flags, can_alloc);
 }
 
 void kasan_set_track(struct kasan_track *track, gfp_t flags)
 {
 	track->pid = current->pid;
-	track->stack = kasan_save_stack(flags);
+	track->stack = kasan_save_stack(flags, true);
 }
 
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index c3f5ba7a294a..2a8e59e6326d 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -345,7 +345,7 @@ void kasan_record_aux_stack(void *addr)
 		return;
 
 	alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
-	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT);
+	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT, true);
 }
 
 void kasan_set_free_info(struct kmem_cache *cache,
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index fa02c88b6948..e442d94a8f6e 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -260,7 +260,7 @@ void kasan_report_invalid_free(void *object, unsigned long ip);
 
 struct page *kasan_addr_to_page(const void *addr);
 
-depot_stack_handle_t kasan_save_stack(gfp_t flags);
+depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc);
 void kasan_set_track(struct kasan_track *track, gfp_t flags);
 void kasan_set_free_info(struct kmem_cache *cache, void *object, u8 tag);
 struct kasan_track *kasan_get_free_track(struct kmem_cache *cache,
-- 
2.33.0.153.gba50c8fa24-goog


[-- Attachment #4: 0003-kasan-generic-introduce-kasan_record_aux_stack_noall.patch --]
[-- Type: text/x-diff, Size: 2358 bytes --]

From 098d9384609748bc12e6ea75038d0fc7659459a5 Mon Sep 17 00:00:00 2001
From: Marco Elver <elver@google.com>
Date: Sun, 5 Sep 2021 17:12:33 +0200
Subject: [PATCH 3/4] kasan: generic: introduce
 kasan_record_aux_stack_noalloc()

Introduce a variant of kasan_record_aux_stack() that does not do any
memory allocation through stackdepot. This will permit using it in
contexts that cannot allocate any memory.

Signed-off-by: Marco Elver <elver@google.com>
---
 include/linux/kasan.h |  2 ++
 mm/kasan/generic.c    | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index dd874a1ee862..736d7b458996 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -370,12 +370,14 @@ static inline void kasan_unpoison_task_stack(struct task_struct *task) {}
 void kasan_cache_shrink(struct kmem_cache *cache);
 void kasan_cache_shutdown(struct kmem_cache *cache);
 void kasan_record_aux_stack(void *ptr);
+void kasan_record_aux_stack_noalloc(void *ptr);
 
 #else /* CONFIG_KASAN_GENERIC */
 
 static inline void kasan_cache_shrink(struct kmem_cache *cache) {}
 static inline void kasan_cache_shutdown(struct kmem_cache *cache) {}
 static inline void kasan_record_aux_stack(void *ptr) {}
+static inline void kasan_record_aux_stack_noalloc(void *ptr) {}
 
 #endif /* CONFIG_KASAN_GENERIC */
 
diff --git a/mm/kasan/generic.c b/mm/kasan/generic.c
index 2a8e59e6326d..84a038b07c6f 100644
--- a/mm/kasan/generic.c
+++ b/mm/kasan/generic.c
@@ -328,7 +328,7 @@ DEFINE_ASAN_SET_SHADOW(f3);
 DEFINE_ASAN_SET_SHADOW(f5);
 DEFINE_ASAN_SET_SHADOW(f8);
 
-void kasan_record_aux_stack(void *addr)
+static void __kasan_record_aux_stack(void *addr, bool can_alloc)
 {
 	struct page *page = kasan_addr_to_page(addr);
 	struct kmem_cache *cache;
@@ -345,7 +345,17 @@ void kasan_record_aux_stack(void *addr)
 		return;
 
 	alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
-	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT, true);
+	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT, can_alloc);
+}
+
+void kasan_record_aux_stack(void *addr)
+{
+	return __kasan_record_aux_stack(addr, true);
+}
+
+void kasan_record_aux_stack_noalloc(void *addr)
+{
+	return __kasan_record_aux_stack(addr, false);
 }
 
 void kasan_set_free_info(struct kmem_cache *cache,
-- 
2.33.0.153.gba50c8fa24-goog


[-- Attachment #5: 0004-workqueue-kasan-avoid-alloc_pages-when-recording-sta.patch --]
[-- Type: text/x-diff, Size: 2939 bytes --]

From 0da4fd0de4ed32ed0d7ddd8e9518ddffba31471a Mon Sep 17 00:00:00 2001
From: Marco Elver <elver@google.com>
Date: Sun, 5 Sep 2021 17:17:19 +0200
Subject: [PATCH 4/4] workqueue, kasan: avoid alloc_pages() when recording
 stack

Shuah Khan reported:

 | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
 | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
 | it tries to allocate memory, attempting to acquire a spinlock in the
 | page allocation code while holding the workqueue pool's raw_spinlock.
 |
 | There are several instances of this problem when the block layer tries
 | to __queue_work(). The call trace from one of these instances is below:
 |
 |     kblockd_mod_delayed_work_on()
 |       mod_delayed_work_on()
 |         __queue_delayed_work()
 |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
 |             insert_work()
 |               kasan_record_aux_stack()
 |                 kasan_save_stack()
 |                   stack_depot_save()
 |                     alloc_pages()
 |                       __alloc_pages()
 |                         get_page_from_freelist()
 |                           rmqueue()
 |                             rmqueue_pcplist()
 |                               local_lock_irqsave(&pagesets.lock, flags);
 |                               [ BUG: Invalid wait context triggered ]

The default kasan_record_aux_stack() calls stack_depot_save() with
GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).
In general, however, it is not even possible to use either GFP_ATOMIC
or GFP_NOWAIT in certain non-preemptive contexts, including
raw_spin_locks (see gfp.h and ab00db216c9c7).

Fix it by using kasan_record_aux_stack_noalloc(), which instructs
stackdepot to not expand its stack storage via alloc_pages() in case
it runs out.

While there is an increased risk of failing to insert the stack trace,
this is typically unlikely, especially if the same insertion had already
succeeded previously (stack depot hit). For frequent calls from the same
location, it therefore becomes extremely unlikely that
kasan_record_aux_stack_noalloc() fails.

Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Marco Elver <elver@google.com>
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 50142fc08902..0681774e6908 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1329,7 +1329,7 @@ static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
 	struct worker_pool *pool = pwq->pool;
 
 	/* record the work call stack in order to print it in KASAN reports */
-	kasan_record_aux_stack(work);
+	kasan_record_aux_stack_noalloc(work);
 
 	/* we own @work, set data and link */
 	set_work_pwq(work, pwq, extra_flags);
-- 
2.33.0.153.gba50c8fa24-goog



* Re: [PATCH v2] workqueue: Don't record workqueue stack holding raw_spin_lock
  2021-09-06  7:12     ` Marco Elver
@ 2021-09-06  9:33       ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 5+ messages in thread
From: Sebastian Andrzej Siewior @ 2021-09-06  9:33 UTC (permalink / raw)
  To: Marco Elver
  Cc: Shuah Khan, tj, jiangshanlai, akpm, andreyknvl, dvyukov,
	walter-zh.wu, linux-kernel, Thomas Gleixner

On 2021-09-06 09:12:58 [+0200], Marco Elver wrote:
> On Thu, Sep 02, 2021 at 05:46PM -0600, Shuah Khan wrote:
> [...]
> > > 3. Try to not allocate memory in stackdepot. Not sure this is feasible
> > > without telling stackdepot to preallocate the max slabs on boot if RT.
> > > 
> > 
> > We could. I have to ask, though: how many real-world cases do we need
> > to impact for the debug code to work?
> > 
> > > Anything else? Because I don't think any of the options are satisfying.
> > 
> > One option to consider is a dry-run invalid-nesting check in
> > kasan_record_aux_stack(), bailing out if it detects invalid nesting.
> 
> Sadly, if lockdep is off, this won't work. And we need a way to
> generically fix this, as otherwise we still have a bug (which may also
> cause issues on RT kernels).
> 
> I propose we properly fix this and prevent stackdepot from replenishing
> its "stack slab" pool if memory allocations cannot be done in the
> current context. Specifically, I noticed technically it's a bug to use
> either GFP_ATOMIC nor GFP_NOWAIT in certain non-preemptive contexts,
> including raw_spin_locks (see gfp.h and ab00db216c9c7).

I would say so, too. It is unlikely that we manage to remove
raw_spinlock_t from the workqueue code or that memory allocation with
disabled preemption/interrupts would be allowed on RT.
Also, as Marco pointed out, skipping this code whenever a debug switch
is enabled is not good.

> This is what kasan_record_aux_stack() via stackdepot does, and it's a
> bug here regardless if lockdep is on or off.
> 
> I've prepared a series (see attached draft patches) that allows telling
> stackdepot to not replenish its pool if alloc_pages() cannot be called
> at all (where GFP_ATOMIC/NOWAIT doesn't even work).

This sounds good. debug_objects also has a memory pool which is
replenished from time to time.

> The only downside is that saving a stack trace may fail if: stackdepot
> runs out of space AND the same stack trace has not been recorded before.
> I expect this to be unlikely, and a simple experiment (boot the kernel)
> didn't result in any failure to record stack trace from insert_work().
> 
> I think this is a reasonable trade-off. And considering that we're
> unsure if queuing work can or cannot be done from within an outer
> raw_spin_lock'ed critical section, I don't see a better way.
> 
> If you agree, I'll send this series out for further review.
> 
> Thanks,
> -- Marco

Sebastian

