stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [arm64] db410c: BUG: Invalid wait context
@ 2020-12-02  4:45 Naresh Kamboju
  2020-12-03  1:49 ` Boqun Feng
  0 siblings, 1 reply; 3+ messages in thread
From: Naresh Kamboju @ 2020-12-02  4:45 UTC (permalink / raw)
  To: linux-stable, open list, rcu, lkft-triage; +Cc: Greg Kroah-Hartman, Sasha Levin

While running kselftests on arm64 db410c platform "BUG: Invalid wait context"
noticed at different runs this specific platform running stable-rc 5.9.12-rc1.

While running these two test cases we have noticed this BUG and not easily
reproducible.

# selftests: bpf: test_xdp_redirect.sh
# selftests: net: ip6_gre_headroom.sh

[  245.694901] kauditd_printk_skb: 100 callbacks suppressed
[  245.694913] audit: type=1334 audit(251.699:25757): prog-id=12883 op=LOAD
[  245.735658] audit: type=1334 audit(251.743:25758): prog-id=12884 op=LOAD
[  245.801299] audit: type=1334 audit(251.807:25759): prog-id=12885 op=LOAD
[  245.832034] audit: type=1334 audit(251.839:25760): prog-id=12886 op=LOAD
[  245.888601]
[  245.888631] =============================
[  245.889156] [ BUG: Invalid wait context ]
[  245.893071] 5.9.12-rc1 #1 Tainted: G        W
[  245.897056] -----------------------------
[  245.902091] pool/1279 is trying to lock:
[  245.906083] ffff000032fc1218
(&child->perf_event_mutex){+.+.}-{3:3}, at:
perf_event_exit_task+0x34/0x3a8
[  245.910085] other info that might help us debug this:
[  245.919539] context-{4:4}
[  245.924484] 1 lock held by pool/1279:
[  245.927087]  #0: ffff8000127819b8 (rcu_read_lock){....}-{1:2}, at:
dput+0x54/0x460
[  245.930739] stack backtrace:
[  245.938203] CPU: 1 PID: 1279 Comm: pool Tainted: G        W
5.9.12-rc1 #1
[  245.941243] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[  245.948621] Call trace:
[  245.955390]  dump_backtrace+0x0/0x1f8
[  245.957560]  show_stack+0x2c/0x38
[  245.961382]  dump_stack+0xec/0x158
[  245.964679]  __lock_acquire+0x59c/0x15c8
[  245.967978]  lock_acquire+0x124/0x4d0
[  245.972058]  __mutex_lock+0xa4/0x970
[  245.975615]  mutex_lock_nested+0x54/0x70
[  245.979261]  perf_event_exit_task+0x34/0x3a8
[  245.983168]  do_exit+0x394/0xad8
[  245.987420]  do_group_exit+0x4c/0xa8
[  245.990633]  get_signal+0x16c/0xb40
[  245.994193]  do_notify_resume+0x2ec/0x678
[  245.997404]  work_pending+0x8/0x200

and BUG in an other run,

[ 1012.068407] audit: type=1700 audit(1018.803:25886): dev=eth0 prom=0
old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
[ 1012.250561] IPv6: ADDRCONF(NETDEV_CHANGE): swp1: link becomes ready
[ 1012.251298] IPv6: ADDRCONF(NETDEV_CHANGE): h1: link becomes ready
[ 1012.252559]
[ 1012.261892] =============================
[ 1012.263453] [ BUG: Invalid wait context ]
[ 1012.267363] 5.9.12-rc1 #1 Tainted: G        W
[ 1012.271354] -----------------------------
[ 1012.276389] systemd/454 is trying to lock:
[ 1012.280381] ffff00003985a918 (&mm->mmap_lock){++++}-{3:3}, at:
__might_fault+0x60/0xa8
[ 1012.284378] other info that might help us debug this:
[ 1012.292275] context-{4:4}
[ 1012.297396] 1 lock held by systemd/454:
[ 1012.299997]  #0: ffff8000127d1f38 (rcu_read_lock){....}-{1:2}, at:
path_init+0x40/0x718
[ 1012.303649] stack backtrace:
[ 1012.311638] CPU: 2 PID: 454 Comm: systemd Tainted: G        W
  5.9.12-rc1 #1
[ 1012.314760] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
[ 1012.322139] Call trace:
[ 1012.329084]  dump_backtrace+0x0/0x1f8
[ 1012.331254]  show_stack+0x2c/0x38
[ 1012.335075]  dump_stack+0xec/0x158
[ 1012.338371]  __lock_acquire+0x59c/0x15c8
[ 1012.341672]  lock_acquire+0x124/0x4d0
[ 1012.345751]  __might_fault+0x84/0xa8
[ 1012.349311]  cp_new_stat+0x114/0x1b8
[ 1012.352956]  __do_sys_newfstat+0x44/0x70
[ 1012.356513]  __arm64_sys_newfstat+0x24/0x30
[ 1012.358652] IPv6: ADDRCONF(NETDEV_CHANGE): swp3: link becomes ready
[ 1012.360424]  el0_svc_common.constprop.3+0x7c/0x198
[ 1012.370575]  do_el0_svc+0x34/0xa0
[ 1012.375437]  el0_sync_handler+0x16c/0x210
[ 1012.378824]  el0_sync+0x140/0x180


Full test log links,
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.9.y/build/v5.9.11-153-gfb49ad2107f4/testrun/3516257/suite/linux-log-parser/test/check-kernel-bug-1997042/log
https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.9.y/build/v5.9.11-153-gfb49ad2107f4/testrun/3515038/suite/linux-log-parser/test/check-kernel-bug-1996797/log

metadata:
  git branch: linux-5.9.y
  git commit: fb49ad2107f4b740257c84ec2991fc6afb447f53
  git describe: v5.9.11-153-gfb49ad2107f4
  git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
  kernel-config:
http://snapshots.linaro.org/openembedded/lkft/lkft/sumo/dragonboard-410c/lkft/linux-stable-rc-5.9/44/config

-- 
Linaro LKFT
https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [arm64] db410c: BUG: Invalid wait context
  2020-12-02  4:45 [arm64] db410c: BUG: Invalid wait context Naresh Kamboju
@ 2020-12-03  1:49 ` Boqun Feng
  2020-12-03 11:39   ` Mark Rutland
  0 siblings, 1 reply; 3+ messages in thread
From: Boqun Feng @ 2020-12-03  1:49 UTC (permalink / raw)
  To: Naresh Kamboju
  Cc: linux-stable, open list, rcu, lkft-triage, Greg Kroah-Hartman,
	Sasha Levin, Peter Zijlstra, Will Deacon, Mark Rutland,
	Paul E. McKenney

Hi Naresh,

On Wed, Dec 02, 2020 at 10:15:44AM +0530, Naresh Kamboju wrote:
> While running kselftests on arm64 db410c platform "BUG: Invalid wait context"
> noticed at different runs this specific platform running stable-rc 5.9.12-rc1.
> 
> While running these two test cases we have noticed this BUG and not easily
> reproducible.
> 
> # selftests: bpf: test_xdp_redirect.sh
> # selftests: net: ip6_gre_headroom.sh
> 
> [  245.694901] kauditd_printk_skb: 100 callbacks suppressed
> [  245.694913] audit: type=1334 audit(251.699:25757): prog-id=12883 op=LOAD
> [  245.735658] audit: type=1334 audit(251.743:25758): prog-id=12884 op=LOAD
> [  245.801299] audit: type=1334 audit(251.807:25759): prog-id=12885 op=LOAD
> [  245.832034] audit: type=1334 audit(251.839:25760): prog-id=12886 op=LOAD
> [  245.888601]
> [  245.888631] =============================
> [  245.889156] [ BUG: Invalid wait context ]
> [  245.893071] 5.9.12-rc1 #1 Tainted: G        W
> [  245.897056] -----------------------------
> [  245.902091] pool/1279 is trying to lock:
> [  245.906083] ffff000032fc1218
> (&child->perf_event_mutex){+.+.}-{3:3}, at:
> perf_event_exit_task+0x34/0x3a8
> [  245.910085] other info that might help us debug this:
> [  245.919539] context-{4:4}
> [  245.924484] 1 lock held by pool/1279:
> [  245.927087]  #0: ffff8000127819b8 (rcu_read_lock){....}-{1:2}, at:
> dput+0x54/0x460
> [  245.930739] stack backtrace:
> [  245.938203] CPU: 1 PID: 1279 Comm: pool Tainted: G        W
> 5.9.12-rc1 #1
> [  245.941243] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> [  245.948621] Call trace:
> [  245.955390]  dump_backtrace+0x0/0x1f8
> [  245.957560]  show_stack+0x2c/0x38
> [  245.961382]  dump_stack+0xec/0x158
> [  245.964679]  __lock_acquire+0x59c/0x15c8
> [  245.967978]  lock_acquire+0x124/0x4d0
> [  245.972058]  __mutex_lock+0xa4/0x970
> [  245.975615]  mutex_lock_nested+0x54/0x70
> [  245.979261]  perf_event_exit_task+0x34/0x3a8
> [  245.983168]  do_exit+0x394/0xad8
> [  245.987420]  do_group_exit+0x4c/0xa8
> [  245.990633]  get_signal+0x16c/0xb40
> [  245.994193]  do_notify_resume+0x2ec/0x678
> [  245.997404]  work_pending+0x8/0x200
> 

For the PoV of lockdep, this means some one tries to acquire a mutex
inside an RCU read-side critical section, which is bad, because one can
not sleep (voluntarily) inside RCU.

However I don't think it's the true case here, because 1) normally
people are very careful not putting mutex or other sleepable locks
inside RCU and 2) in the above splats, lockdep find the rcu read lock is
held at dput() while the acquiring of the mutex is at ret_to_user(),
clearly there is no call site (in the same context) from the RCU
read-side critial section of dput() to ret_to_user().

One chance of hitting this is that there is a bug in context/irq tracing
that makes the contexts of dput() and ret_to_user() as one contexts so
that lockdep gets confused and reports a false postive.

FWIW, I think this might be related to some know issues for ARM64 with
lockdep and irq tracing:

	https://lore.kernel.org/lkml/20201119225352.GA5251@willie-the-truck/

And Mark already has series to fix them:

	https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/irq-fixes

But I must defer to Mark for the latest fix ;-)

Regards,
Boqun

> and BUG in an other run,
> 
> [ 1012.068407] audit: type=1700 audit(1018.803:25886): dev=eth0 prom=0
> old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
> [ 1012.250561] IPv6: ADDRCONF(NETDEV_CHANGE): swp1: link becomes ready
> [ 1012.251298] IPv6: ADDRCONF(NETDEV_CHANGE): h1: link becomes ready
> [ 1012.252559]
> [ 1012.261892] =============================
> [ 1012.263453] [ BUG: Invalid wait context ]
> [ 1012.267363] 5.9.12-rc1 #1 Tainted: G        W
> [ 1012.271354] -----------------------------
> [ 1012.276389] systemd/454 is trying to lock:
> [ 1012.280381] ffff00003985a918 (&mm->mmap_lock){++++}-{3:3}, at:
> __might_fault+0x60/0xa8
> [ 1012.284378] other info that might help us debug this:
> [ 1012.292275] context-{4:4}
> [ 1012.297396] 1 lock held by systemd/454:
> [ 1012.299997]  #0: ffff8000127d1f38 (rcu_read_lock){....}-{1:2}, at:
> path_init+0x40/0x718
> [ 1012.303649] stack backtrace:
> [ 1012.311638] CPU: 2 PID: 454 Comm: systemd Tainted: G        W
>   5.9.12-rc1 #1
> [ 1012.314760] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> [ 1012.322139] Call trace:
> [ 1012.329084]  dump_backtrace+0x0/0x1f8
> [ 1012.331254]  show_stack+0x2c/0x38
> [ 1012.335075]  dump_stack+0xec/0x158
> [ 1012.338371]  __lock_acquire+0x59c/0x15c8
> [ 1012.341672]  lock_acquire+0x124/0x4d0
> [ 1012.345751]  __might_fault+0x84/0xa8
> [ 1012.349311]  cp_new_stat+0x114/0x1b8
> [ 1012.352956]  __do_sys_newfstat+0x44/0x70
> [ 1012.356513]  __arm64_sys_newfstat+0x24/0x30
> [ 1012.358652] IPv6: ADDRCONF(NETDEV_CHANGE): swp3: link becomes ready
> [ 1012.360424]  el0_svc_common.constprop.3+0x7c/0x198
> [ 1012.370575]  do_el0_svc+0x34/0xa0
> [ 1012.375437]  el0_sync_handler+0x16c/0x210
> [ 1012.378824]  el0_sync+0x140/0x180
> 
> 
> Full test log links,
> https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.9.y/build/v5.9.11-153-gfb49ad2107f4/testrun/3516257/suite/linux-log-parser/test/check-kernel-bug-1997042/log
> https://qa-reports.linaro.org/lkft/linux-stable-rc-linux-5.9.y/build/v5.9.11-153-gfb49ad2107f4/testrun/3515038/suite/linux-log-parser/test/check-kernel-bug-1996797/log
> 
> metadata:
>   git branch: linux-5.9.y
>   git commit: fb49ad2107f4b740257c84ec2991fc6afb447f53
>   git describe: v5.9.11-153-gfb49ad2107f4
>   git repo: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
>   kernel-config:
> http://snapshots.linaro.org/openembedded/lkft/lkft/sumo/dragonboard-410c/lkft/linux-stable-rc-5.9/44/config
> 
> -- 
> Linaro LKFT
> https://lkft.linaro.org

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [arm64] db410c: BUG: Invalid wait context
  2020-12-03  1:49 ` Boqun Feng
@ 2020-12-03 11:39   ` Mark Rutland
  0 siblings, 0 replies; 3+ messages in thread
From: Mark Rutland @ 2020-12-03 11:39 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Naresh Kamboju, linux-stable, open list, rcu, lkft-triage,
	Greg Kroah-Hartman, Sasha Levin, Peter Zijlstra, Will Deacon,
	Paul E. McKenney

Hi Naresh, Boqun,

On Thu, Dec 03, 2020 at 09:49:22AM +0800, Boqun Feng wrote:
> On Wed, Dec 02, 2020 at 10:15:44AM +0530, Naresh Kamboju wrote:
> > While running kselftests on arm64 db410c platform "BUG: Invalid wait context"
> > noticed at different runs this specific platform running stable-rc 5.9.12-rc1.
> > 
> > While running these two test cases we have noticed this BUG and not easily
> > reproducible.
> > 
> > # selftests: bpf: test_xdp_redirect.sh
> > # selftests: net: ip6_gre_headroom.sh
> > 
> > [  245.694901] kauditd_printk_skb: 100 callbacks suppressed
> > [  245.694913] audit: type=1334 audit(251.699:25757): prog-id=12883 op=LOAD
> > [  245.735658] audit: type=1334 audit(251.743:25758): prog-id=12884 op=LOAD
> > [  245.801299] audit: type=1334 audit(251.807:25759): prog-id=12885 op=LOAD
> > [  245.832034] audit: type=1334 audit(251.839:25760): prog-id=12886 op=LOAD
> > [  245.888601]
> > [  245.888631] =============================
> > [  245.889156] [ BUG: Invalid wait context ]
> > [  245.893071] 5.9.12-rc1 #1 Tainted: G        W
> > [  245.897056] -----------------------------
> > [  245.902091] pool/1279 is trying to lock:
> > [  245.906083] ffff000032fc1218
> > (&child->perf_event_mutex){+.+.}-{3:3}, at:
> > perf_event_exit_task+0x34/0x3a8
> > [  245.910085] other info that might help us debug this:
> > [  245.919539] context-{4:4}
> > [  245.924484] 1 lock held by pool/1279:
> > [  245.927087]  #0: ffff8000127819b8 (rcu_read_lock){....}-{1:2}, at:
> > dput+0x54/0x460
> > [  245.930739] stack backtrace:
> > [  245.938203] CPU: 1 PID: 1279 Comm: pool Tainted: G        W
> > 5.9.12-rc1 #1
> > [  245.941243] Hardware name: Qualcomm Technologies, Inc. APQ 8016 SBC (DT)
> > [  245.948621] Call trace:
> > [  245.955390]  dump_backtrace+0x0/0x1f8
> > [  245.957560]  show_stack+0x2c/0x38
> > [  245.961382]  dump_stack+0xec/0x158
> > [  245.964679]  __lock_acquire+0x59c/0x15c8
> > [  245.967978]  lock_acquire+0x124/0x4d0
> > [  245.972058]  __mutex_lock+0xa4/0x970
> > [  245.975615]  mutex_lock_nested+0x54/0x70
> > [  245.979261]  perf_event_exit_task+0x34/0x3a8
> > [  245.983168]  do_exit+0x394/0xad8
> > [  245.987420]  do_group_exit+0x4c/0xa8
> > [  245.990633]  get_signal+0x16c/0xb40
> > [  245.994193]  do_notify_resume+0x2ec/0x678
> > [  245.997404]  work_pending+0x8/0x200
> > 
> 
> For the PoV of lockdep, this means some one tries to acquire a mutex
> inside an RCU read-side critical section, which is bad, because one can
> not sleep (voluntarily) inside RCU.
> 
> However I don't think it's the true case here, because 1) normally
> people are very careful not putting mutex or other sleepable locks
> inside RCU and 2) in the above splats, lockdep find the rcu read lock is
> held at dput() while the acquiring of the mutex is at ret_to_user(),
> clearly there is no call site (in the same context) from the RCU
> read-side critial section of dput() to ret_to_user().
> 
> One chance of hitting this is that there is a bug in context/irq tracing
> that makes the contexts of dput() and ret_to_user() as one contexts so
> that lockdep gets confused and reports a false postive.

That sounds likely to me (but I haven't looked too deeply at the above
report).

> FWIW, I think this might be related to some know issues for ARM64 with
> lockdep and irq tracing:
> 
> 	https://lore.kernel.org/lkml/20201119225352.GA5251@willie-the-truck/
> 
> And Mark already has series to fix them:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/log/?h=arm64/irq-fixes
> 
> But I must defer to Mark for the latest fix ;-)

That went into mainline a few hours ago, and will be part of v5.10-rc7.

So if it's possible to test with mainline, that would be helpful!

Thanks,
Mark.

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-12-03 11:40 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-02  4:45 [arm64] db410c: BUG: Invalid wait context Naresh Kamboju
2020-12-03  1:49 ` Boqun Feng
2020-12-03 11:39   ` Mark Rutland

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).