From: Steffen Maier <maier@linux.ibm.com>
To: linux-scsi <linux-scsi@vger.kernel.org>,
	Bart Van Assche <bvanassche@acm.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
	"James E . J . Bottomley" <jejb@linux.ibm.com>,
	Sachin Sant <sachinp@linux.ibm.com>,
	Hannes Reinecke <hare@suse.de>, Martin Wilck <mwilck@suse.com>,
	Benjamin Block <bblock@linux.ibm.com>,
	linux-s390 <linux-s390@vger.kernel.org>
Subject: kernel BUG scsi_dh_alua sleeping from invalid context && kernel WARNING do not call blocking ops when !TASK_RUNNING
Date: Mon, 16 Jan 2023 15:59:03 +0100
Message-ID: <b49e37d5-edfb-4c56-3eeb-62c7d5855c00@linux.ibm.com>

Hi all,

for the past few days or weeks, our CI has sporadically hit the ALUA- and
sleep-related kernel BUG and WARNING shown below (the WARNING is fatal for us
because we run with panic_on_warn).

It reminds me of
[PATCH 0/2] Rework how the ALUA driver calls scsi_device_put()
https://lore.kernel.org/linux-scsi/166986602290.2101055.17397734326843853911.b4-ty@oracle.com/

which I thought was the fix, and which went into 6.2-rc(1?) on 2022-12-14 with
[GIT PULL] first round of SCSI updates for the 6.1+ merge window
https://lore.kernel.org/linux-scsi/b2e824bbd1e40da64d2d01657f2f7a67b98919fb.camel@HansenPartnership.com/T/#u

Due to limited history, I cannot tell exactly when the problems started or
whether they really correlate with the above.

The test workload consists of all kinds of coverage tests for zfcp recovery,
including SCSI device removal and/or rescan.

[ 4569.045992] BUG: sleeping function called from invalid context at 
drivers/scsi/device_handler/scsi_dh_alua.c:992
[ 4569.046003] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 0, name: 
swapper/8
[ 4569.046013] preempt_count: 101, expected: 0
[ 4569.046023] RCU nest depth: 0, expected: 0
[ 4569.046033] no locks held by swapper/8/0.
[ 4569.046042] Preemption disabled at:
[ 4569.046046] [<000000017e27ce4e>] __slab_alloc.constprop.0+0x36/0xb8
[ 4569.046072] CPU: 8 PID: 0 Comm: swapper/8 Tainted: G        W 
6.2.0-20230114.rc3.git0.46e26dd43df0.300.fc37.s390x+debug #1
[ 4569.046084] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
[ 4569.046094] Call Trace:
[ 4569.046102]  [<000000017ed21bcc>] dump_stack_lvl+0xac/0x100
[ 4569.046118]  [<000000017df9192c>] __might_resched+0x284/0x2c8
[ 4569.046131]  [<000003ff7fb9c874>] alua_rtpg_queue+0x3c/0x98 [scsi_dh_alua]
[ 4569.046146]  [<000003ff7fb9cfb2>] alua_check+0x122/0x250 [scsi_dh_alua]
[ 4569.046167]  [<000003ff7fb9d562>] alua_check_sense+0x172/0x228 [scsi_dh_alua]
[ 4569.046179]  [<000000017e96b3e2>] scsi_check_sense+0x8a/0x2e0
[ 4569.046191]  [<000000017e96e4b6>] scsi_decide_disposition+0x286/0x298
[ 4569.046201]  [<000000017e972bca>] scsi_complete+0x6a/0x108
[ 4569.046212]  [<000000017e746906>] blk_complete_reqs+0x6e/0x88
[ 4569.046227]  [<000000017ed3830e>] __do_softirq+0x13e/0x6b8
[ 4569.046238]  [<000000017df57902>] __irq_exit_rcu+0x14a/0x170
[ 4569.046264]  [<000000017df58472>] irq_exit_rcu+0x22/0x50
[ 4569.046275]  [<000000017ed2242a>] do_ext_irq+0x10a/0x1d0
[ 4569.046286]  [<000000017ed36156>] ext_int_handler+0xd6/0x110
[ 4569.046296]  [<000000017ed362e6>] psw_idle_exit+0x0/0xa
[ 4569.046307] ([<000000017defc5da>] arch_cpu_idle+0x52/0xe0)
[ 4569.046318]  [<000000017ed34744>] default_idle_call+0x84/0xd0
[ 4569.046329]  [<000000017dfbe4cc>] do_idle+0xfc/0x1b8
[ 4569.046340]  [<000000017dfbe80e>] cpu_startup_entry+0x36/0x40
[ 4569.046350]  [<000000017df11964>] smp_start_secondary+0x14c/0x160
[ 4569.046371]  [<000000017ed3658e>] restart_int_handler+0x6e/0x90
[ 4569.046381] no locks held by swapper/8/0.
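
As a reading aid for the "preempt_count: 101" line in the trace above: the
value is printed in hexadecimal and packs several counters into one word. The
sketch below decodes it. The helper name decode_preempt_count is hypothetical,
and the bit layout (8 preempt bits, 8 softirq bits, 4 hardirq bits) is an
assumption taken from include/linux/preempt.h in recent kernels; it may differ
in other versions.

```python
# Assumed layout from include/linux/preempt.h (not verified for every kernel):
# bits 0-7 preempt depth, bits 8-15 softirq count, bits 16-19 hardirq count.
PREEMPT_MASK = 0x000000FF
SOFTIRQ_MASK = 0x0000FF00
HARDIRQ_MASK = 0x000F0000


def decode_preempt_count(hex_text):
    """Split a hex preempt_count string (as printed in the log) into fields."""
    value = int(hex_text, 16)
    return {
        "preempt": value & PREEMPT_MASK,
        "softirq": (value & SOFTIRQ_MASK) >> 8,
        "hardirq": (value & HARDIRQ_MASK) >> 16,
    }


# Value copied from the "preempt_count: 101, expected: 0" line above.
fields = decode_preempt_count("101")
print(fields)
```

Under that assumed layout, a softirq field of 1 is consistent with code running
inside a softirq handler, which matches __do_softirq appearing in the call
trace below alua_check_sense.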

The above occurs a few times until the run finally ends with:

[ 4760.865496] device-mapper: multipath: 251:6: Reinstating path 8:176.
[ 4760.867398] sd 4:0:0:1083719810: Power-on or device reset occurred
[ 4760.867445] sd 4:0:0:1083719810: [sde] tag#1224 Done: ADD_TO_MLQUEUE Result: 
hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[ 4760.867469] sd 4:0:0:1083719810: [sde] tag#1224 CDB: Test Unit Ready 00 00 
00 00 00 00
[ 4760.867493] sd 4:0:0:1083719810: [sde] tag#1224 Sense Key : Unit Attention 
[current]
[ 4760.867515] sd 4:0:0:1083719810: [sde] tag#1224 Add. Sense: Power on, reset, 
or bus device reset occurred
[ 4760.878066] sd 4:0:0:1083719813: Power-on or device reset occurred
[ 4760.878096] ------------[ cut here ]------------
[ 4760.878107] do not call blocking ops when !TASK_RUNNING; state=2 set at 
[<000000017ed2c0fa>] __wait_for_common+0xa2/0x240
[ 4760.878132] WARNING: CPU: 3 PID: 165738 at kernel/sched/core.c:9908 
__might_sleep+0x7c/0x98
[ 4760.878147] Modules linked in: af_iucv kvm algif_hash af_alg nft_fib_inet 
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 
nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 
nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc vfio_ccw mdev vfio_iommu_type1 
vfio sch_fq_codel ip6_tables ip_tables x_tables configfs dm_service_time 
ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes sha512_s390 
sha256_s390 sha1_s390 sha_common zfcp scsi_transport_fc dm_mirror 
dm_region_hash dm_log scsi_dh_rdac scsi_dh_emc scsi_dh_alua pkey zcrypt 
rng_core dm_multipath autofs4
[ 4760.878456] CPU: 3 PID: 165738 Comm: kworker/3:0 Tainted: G        W 
  6.2.0-20230114.rc3.git0.46e26dd43df0.300.fc37.s390x+debug #1
[ 4760.878478] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
[ 4760.878489] Workqueue: kaluad alua_rtpg_work [scsi_dh_alua]
[ 4760.878509] Krnl PSW : 0704d00180000000 000000017df919f0 
(__might_sleep+0x80/0x98)
[ 4760.878542]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 
RI:0 EA:3
[ 4760.878560] Krnl GPRS: c0000000ffffbfff 0000000080000101 000000000000006d 
000000017f198e94
[ 4760.878573]            00000380002739f8 00000380002739f0 0000000000000000 
000000017f7bca48
[ 4760.878586]            0000000000000001 0000000000000000 00000000000003e0 
000003ff7fb9f1bc
[ 4760.878599]            00000000827eb100 0000000000000101 000000017df919ec 
0000038000273b88
[ 4760.878620] Krnl Code: 000000017df919e0: c020008c1da3	larl	%r2,000000017f115526
                           000000017df919e6: c0e5006bb91d	brasl 
%r14,000000017ed08c20
                          #000000017df919ec: af000000		mc	0,0
                          >000000017df919f0: a7490000		lghi	%r4,0
                           000000017df919f4: b904003a		lgr	%r3,%r10
                           000000017df919f8: b904002b		lgr	%r2,%r11
                           000000017df919fc: ebaff0a00004	lmg	%r10,%r15,160(%r15)
                           000000017df91a02: c0f4fffffe53	brcl	15,000000017df916a8
[ 4760.878692] Call Trace:
[ 4760.878703]  [<000000017df919f0>] __might_sleep+0x80/0x98
[ 4760.878716] ([<000000017df919ec>] __might_sleep+0x7c/0x98)
[ 4760.878728]  [<000003ff7fb9c874>] alua_rtpg_queue+0x3c/0x98 [scsi_dh_alua]
[ 4760.878743]  [<000003ff7fb9cfb2>] alua_check+0x122/0x250 [scsi_dh_alua]
[ 4760.878761]  [<000003ff7fb9d562>] alua_check_sense+0x172/0x228 [scsi_dh_alua]
[ 4760.878775]  [<000000017e96b3e2>] scsi_check_sense+0x8a/0x2e0
[ 4760.878788]  [<000000017e96e4b6>] scsi_decide_disposition+0x286/0x298
[ 4760.878802]  [<000000017e972bca>] scsi_complete+0x6a/0x108
[ 4760.878815]  [<000000017e746906>] blk_complete_reqs+0x6e/0x88
[ 4760.878837]  [<000000017ed3830e>] __do_softirq+0x13e/0x6b8
[ 4760.878852]  [<000000017df57902>] __irq_exit_rcu+0x14a/0x170
[ 4760.878866]  [<000000017df58472>] irq_exit_rcu+0x22/0x50
[ 4760.878880]  [<000000017ed223da>] do_ext_irq+0xba/0x1d0
[ 4760.878896]  [<000000017ed36156>] ext_int_handler+0xd6/0x110
[ 4760.878909]  [<000000017ed34fbe>] _raw_spin_unlock_irqrestore+0x86/0xc0
[ 4760.878928] ([<000000017ed34fae>] _raw_spin_unlock_irqrestore+0x76/0xc0)
[ 4760.878941]  [<000000017e033e66>] __mod_timer+0x2d6/0x408
[ 4760.878955]  [<000000017ed33864>] schedule_timeout+0xc4/0x168
[ 4760.878969]  [<000000017ed2ac62>] io_schedule_timeout+0x5a/0x80
[ 4760.878983]  [<000000017ed2c12e>] __wait_for_common+0xd6/0x240
[ 4760.878997]  [<000000017e7479a6>] blk_execute_rq+0x126/0x1f8
[ 4760.879011]  [<000000017e970722>] __scsi_execute+0x112/0x260
[ 4760.879024]  [<000003ff7fb9d750>] alua_rtpg+0x138/0xb10 [scsi_dh_alua]
[ 4760.879038]  [<000003ff7fb9e3e4>] alua_rtpg_work+0x2bc/0x4e0 [scsi_dh_alua]
[ 4760.879053]  [<000000017df78300>] process_one_work+0x310/0x730
[ 4760.879069]  [<000000017df78782>] worker_thread+0x62/0x420
[ 4760.879109]  [<000000017df83bc4>] kthread+0x13c/0x150
[ 4760.879124]  [<000000017defb930>] __ret_from_fork+0x40/0x58
[ 4760.879138]  [<000000017ed35eda>] ret_from_fork+0xa/0x40
[ 4760.879152] 2 locks held by kworker/3:0/165738:
[ 4760.879165]  #0: 000000008c7b5948 ((wq_completion)kaluad){+.+.}-{0:0}, at: 
process_one_work+0x232/0x730
[ 4760.879210]  #1: 0000038001177dc8 
((work_completion)(&(&pg->rtpg_work)->work)){+.+.}-{0:0}, at: 
process_one_work+0x232/0x730
[ 4760.879249] Last Breaking-Event-Address:
[ 4760.879266]  [<000000017e8c6dd0>] __s390_indirect_jump_r14+0x0/0x10
[ 4760.879283] Kernel panic - not syncing: kernel: panic_on_warn set ...
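
To make the structure of the kworker trace above easier to see, the following
sketch splits the frames at __do_softirq: everything above that frame ran in
softirq context on top of the interrupted task, everything below belongs to
the interrupt entry and the task itself. The function name split_at_softirq is
hypothetical and the frame list is abbreviated by hand from the log (symbol
names only, innermost first). For reference, state=2 in the WARNING corresponds
to TASK_UNINTERRUPTIBLE in include/linux/sched.h.

```python
def split_at_softirq(frames, marker="__do_softirq"):
    """Return (frames above the marker, frames below the marker).

    If the marker is absent, nothing is attributed to softirq context.
    """
    if marker not in frames:
        return [], list(frames)
    idx = frames.index(marker)
    return frames[:idx], frames[idx + 1:]


# Symbols copied from the kworker/3:0 trace above, innermost frame first.
trace = [
    "__might_sleep", "alua_rtpg_queue", "alua_check", "alua_check_sense",
    "scsi_check_sense", "scsi_decide_disposition", "scsi_complete",
    "blk_complete_reqs", "__do_softirq", "__irq_exit_rcu", "irq_exit_rcu",
    "do_ext_irq", "ext_int_handler", "_raw_spin_unlock_irqrestore",
    "__mod_timer", "schedule_timeout", "io_schedule_timeout",
    "__wait_for_common", "blk_execute_rq", "__scsi_execute", "alua_rtpg",
    "alua_rtpg_work",
]

in_softirq, outer = split_at_softirq(trace)
print("softirq context:", " <- ".join(in_softirq))
print("interrupted task:", " <- ".join(outer))
```

Read this way, the trace suggests that alua_rtpg_queue was reached from the
completion softirq while the kaluad worker underneath was already waiting in
__wait_for_common, which would explain both the BUG and the WARNING; this is
an interpretation of the log, not a confirmed root cause.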


-- 
Mit freundlichen Gruessen / Kind regards
Steffen Maier

Linux on IBM Z and LinuxONE

https://www.ibm.com/privacy/us/en/
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschaeftsfuehrung: David Faller
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294
