linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org,
	Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>,
	Brian King <brking@linux.vnet.ibm.com>,
	Keith Busch <keith.busch@intel.com>,
	linux-nvme@lists.infradead.org, linux-block@vger.kernel.org,
	Christoph Hellwig <hch@lst.de>, Jens Axboe <axboe@fb.com>,
	Giuliano Procida <gprocida@google.com>
Subject: [PATCH 4.4 64/86] blk-mq: Allow timeouts to run while queue is freezing
Date: Mon, 18 May 2020 19:36:35 +0200	[thread overview]
Message-ID: <20200518173503.327209969@linuxfoundation.org> (raw)
In-Reply-To: <20200518173450.254571947@linuxfoundation.org>

From: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>

commit 71f79fb3179e69b0c1448a2101a866d871c66e7f upstream.

In case a submitted request gets stuck for some reason, the block layer
can prevent the request starvation by starting the scheduled timeout work.
If this stuck request occurs at the same time another thread has started
a queue freeze, the blk_mq_timeout_work will not be able to acquire the
queue reference and will return silently, thus not issuing the timeout.
But since the request is already holding a q_usage_counter reference and
is unable to complete, it will never release its reference, preventing
the queue from completing the freeze started by first thread.  This puts
the request_queue in a hung state, forever waiting for the freeze
completion.

This was observed while running IO to a NVMe device at the same time we
toggled the CPU hotplug code. Eventually, once a request got stuck
requiring a timeout during a queue freeze, we saw the CPU Hotplug
notification code get stuck inside blk_mq_freeze_queue_wait, as shown in
the trace below.

[c000000deaf13690] [c000000deaf13738] 0xc000000deaf13738 (unreliable)
[c000000deaf13860] [c000000000015ce8] __switch_to+0x1f8/0x350
[c000000deaf138b0] [c000000000ade0e4] __schedule+0x314/0x990
[c000000deaf13940] [c000000000ade7a8] schedule+0x48/0xc0
[c000000deaf13970] [c0000000005492a4] blk_mq_freeze_queue_wait+0x74/0x110
[c000000deaf139e0] [c00000000054b6a8] blk_mq_queue_reinit_notify+0x1a8/0x2e0
[c000000deaf13a40] [c0000000000e7878] notifier_call_chain+0x98/0x100
[c000000deaf13a90] [c0000000000b8e08] cpu_notify_nofail+0x48/0xa0
[c000000deaf13ac0] [c0000000000b92f0] _cpu_down+0x2a0/0x400
[c000000deaf13b90] [c0000000000b94a8] cpu_down+0x58/0xa0
[c000000deaf13bc0] [c0000000006d5dcc] cpu_subsys_offline+0x2c/0x50
[c000000deaf13bf0] [c0000000006cd244] device_offline+0x104/0x140
[c000000deaf13c30] [c0000000006cd40c] online_store+0x6c/0xc0
[c000000deaf13c80] [c0000000006c8c78] dev_attr_store+0x68/0xa0
[c000000deaf13cc0] [c0000000003974d0] sysfs_kf_write+0x80/0xb0
[c000000deaf13d00] [c0000000003963e8] kernfs_fop_write+0x188/0x200
[c000000deaf13d50] [c0000000002e0f6c] __vfs_write+0x6c/0xe0
[c000000deaf13d90] [c0000000002e1ca0] vfs_write+0xc0/0x230
[c000000deaf13de0] [c0000000002e2cdc] SyS_write+0x6c/0x110
[c000000deaf13e30] [c000000000009204] system_call+0x38/0xb4

The fix is to allow the timeout work to execute in the window between
dropping the initial refcount reference and the release of the last
reference, which actually marks the freeze completion.  This can be
achieved with percpu_refcount_tryget, which does not require the counter
to be alive.  This way the timeout work can do it's job and terminate a
stuck request even during a freeze, returning its reference and avoiding
the deadlock.

Allowing the timeout to run is just a part of the fix, since for some
devices, we might get stuck again inside the device driver's timeout
handler, should it attempt to allocate a new request in that path -
which is a quite common action for Abort commands, which need to be sent
after a timeout.  In NVMe, for instance, we call blk_mq_alloc_request
from inside the timeout handler, which will fail during a freeze, since
it also tries to acquire a queue reference.

I considered a similar change to blk_mq_alloc_request as a generic
solution for further device driver hangs, but we can't do that, since it
would allow new requests to disturb the freeze process.  I thought about
creating a new function in the block layer to support unfreezable
requests for these occasions, but after working on it for a while, I
feel like this should be handled in a per-driver basis.  I'm now
experimenting with changes to the NVMe timeout path, but I'm open to
suggestions of ways to make this generic.

Signed-off-by: Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com>
Cc: Brian King <brking@linux.vnet.ibm.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: linux-nvme@lists.infradead.org
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Giuliano Procida <gprocida@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 block/blk-mq.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -628,7 +628,20 @@ static void blk_mq_rq_timer(unsigned lon
 	};
 	int i;
 
-	if (blk_queue_enter(q, GFP_NOWAIT))
+	/* A deadlock might occur if a request is stuck requiring a
+	 * timeout at the same time a queue freeze is waiting
+	 * completion, since the timeout code would not be able to
+	 * acquire the queue reference here.
+	 *
+	 * That's why we don't use blk_queue_enter here; instead, we use
+	 * percpu_ref_tryget directly, because we need to be able to
+	 * obtain a reference even in the short window between the queue
+	 * starting to freeze, by dropping the first reference in
+	 * blk_mq_freeze_queue_start, and the moment the last request is
+	 * consumed, marked by the instant q_usage_counter reaches
+	 * zero.
+	 */
+	if (!percpu_ref_tryget(&q->q_usage_counter))
 		return;
 
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);



  parent reply	other threads:[~2020-05-18 18:30 UTC|newest]

Thread overview: 95+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-05-18 17:35 [PATCH 4.4 00/86] 4.4.224-rc1 review Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 01/86] USB: serial: qcserial: Add DW5816e support Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 02/86] Revert "net: phy: Avoid polling PHY with PHY_IGNORE_INTERRUPTS" Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 03/86] dp83640: reverse arguments to list_add_tail Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 04/86] net/mlx4_core: Fix use of ENOSPC around mlx4_counter_alloc() Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 05/86] sch_sfq: validate silly quantum values Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 06/86] sch_choke: avoid potential panic in choke_reset() Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 07/86] Revert "ACPI / video: Add force_native quirk for HP Pavilion dv6" Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 08/86] enic: do not overwrite error code Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 09/86] ipv6: fix cleanup ordering for ip6_mr failure Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 10/86] binfmt_elf: move brk out of mmap when doing direct loader exec Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 11/86] x86/apm: Dont access __preempt_count with zeroed fs Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 12/86] Revert "IB/ipoib: Update broadcast object if PKey value was changed in index 0" Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 13/86] USB: uas: add quirk for LaCie 2Big Quadra Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 14/86] USB: serial: garmin_gps: add sanity checking for data length Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 15/86] batman-adv: fix batadv_nc_random_weight_tq Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 16/86] scripts/decodecode: fix trapping instruction formatting Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 17/86] phy: micrel: Disable auto negotiation on startup Greg Kroah-Hartman
2020-05-19  5:45   ` Henri Rosten
2020-05-19 10:53     ` Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 18/86] phy: micrel: Ensure interrupts are reenabled on resume Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 19/86] binfmt_elf: Do not move brk for INTERP-less ET_EXEC Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 20/86] ext4: add cond_resched() to ext4_protect_reserved_inode Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 21/86] net: ipv6: add net argument to ip6_dst_lookup_flow Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 22/86] net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 23/86] blktrace: Fix potential deadlock between delete & sysfs ops Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 24/86] blktrace: fix unlocked access to init/start-stop/teardown Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 25/86] blktrace: fix trace mutex deadlock Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 26/86] blktrace: Protect q->blk_trace with RCU Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 27/86] blktrace: fix dereference after null check Greg Kroah-Hartman
2020-05-18 17:35 ` [PATCH 4.4 28/86] ptp: do not explicitly set drvdata in ptp_clock_register() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 29/86] ptp: use is_visible method to hide unused attributes Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 30/86] ptp: create "pins" together with the rest of attributes Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 31/86] chardev: add helper function to register char devs with a struct device Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 32/86] ptp: Fix pass zero to ERR_PTR() in ptp_clock_register Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 33/86] ptp: fix the race between the release of ptp_clock and cdev Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 34/86] ptp: free ptp device pin descriptors properly Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 35/86] net: handle no dst on skb in icmp6_send Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 36/86] net/sonic: Fix a resource leak in an error handling path in jazz_sonic_probe() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 37/86] net: moxa: Fix a potential double free_irq() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 38/86] drop_monitor: work around gcc-10 stringop-overflow warning Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 39/86] scsi: sg: add sg_remove_request in sg_write Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 40/86] spi: spi-dw: Add lock protect dw_spi rx/tx to prevent concurrent calls Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 41/86] cifs: Check for timeout on Negotiate stage Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 42/86] cifs: Fix a race condition with cifs_echo_request Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 43/86] dmaengine: pch_dma.c: Avoid data race between probe and irq handler Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 44/86] dmaengine: mmp_tdma: Reset channel error on release Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 45/86] drm/qxl: lost qxl_bo_kunmap_atomic_page in qxl_image_init_helper() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 46/86] ipc/util.c: sysvipc_find_ipc() incorrectly updates position index Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 47/86] net: openvswitch: fix csum updates for MPLS actions Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 48/86] gre: do not keep the GRE header around in collect medata mode Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 49/86] mm/memory_hotplug.c: fix overflow in test_pages_in_a_zone() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 50/86] scsi: qla2xxx: Avoid double completion of abort command Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 51/86] i40e: avoid NVM acquire deadlock during NVM update Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 52/86] net/mlx5: Fix driver load error flow when firmware is stuck Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 53/86] netfilter: conntrack: avoid gcc-10 zero-length-bounds warning Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 54/86] IB/mlx4: Test return value of calls to ib_get_cached_pkey Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 55/86] pnp: Use list_for_each_entry() instead of open coding Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 56/86] gcc-10 warnings: fix low-hanging fruit Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 57/86] kbuild: compute false-positive -Wmaybe-uninitialized cases in Kconfig Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 58/86] Stop the ad-hoc games with -Wno-maybe-initialized Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 59/86] gcc-10: disable zero-length-bounds warning for now Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 60/86] gcc-10: disable array-bounds " Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 61/86] gcc-10: disable stringop-overflow " Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 62/86] gcc-10: disable restrict " Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 63/86] block: defer timeouts to a workqueue Greg Kroah-Hartman
2020-05-19  6:00   ` Henri Rosten
2020-05-19  7:31     ` Greg Kroah-Hartman
2020-05-19 10:53       ` Greg Kroah-Hartman
2020-05-18 17:36 ` Greg Kroah-Hartman [this message]
2020-05-18 17:36 ` [PATCH 4.4 65/86] blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 66/86] blk-mq: Allow blocking queue tag iter callbacks Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 67/86] x86/paravirt: Remove the unused irq_enable_sysexit pv op Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 68/86] gcc-10: avoid shadowing standard library free() in crypto Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 69/86] net: fix a potential recursive NETDEV_FEAT_CHANGE Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 70/86] net: ipv4: really enforce backoff for redirects Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 71/86] netlabel: cope with NULL catmap Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 72/86] ALSA: hda/realtek - Limit int mic boost for Thinkpad T530 Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 73/86] ALSA: rawmidi: Fix racy buffer resize under concurrent accesses Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 74/86] ALSA: rawmidi: Initialize allocated buffers Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 75/86] USB: gadget: fix illegal array access in binding with UDC Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 76/86] ARM: dts: imx27-phytec-phycard-s-rdk: Fix the I2C1 pinctrl entries Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 77/86] x86: Fix early boot crash on gcc-10, third try Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 78/86] exec: Move would_dump into flush_old_exec Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 79/86] usb: gadget: net2272: Fix a memory leak in an error handling path in net2272_plat_probe() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 80/86] usb: gadget: audio: Fix a missing error return value in audio_bind() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 81/86] usb: gadget: legacy: fix error return code in gncm_bind() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 82/86] usb: gadget: legacy: fix error return code in cdc_bind() Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 83/86] Revert "ALSA: hda/realtek: Fix pop noise on ALC225" Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 84/86] ARM: dts: r8a7740: Add missing extal2 to CPG node Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 85/86] KVM: x86: Fix off-by-one error in kvm_vcpu_ioctl_x86_setup_mce Greg Kroah-Hartman
2020-05-18 17:36 ` [PATCH 4.4 86/86] Makefile: disallow data races on gcc-10 as well Greg Kroah-Hartman
2020-05-19  8:29 ` [PATCH 4.4 00/86] 4.4.224-rc1 review Naresh Kamboju
2020-05-19  8:49 ` Jon Hunter
2020-05-21  7:47 ` Chris Paterson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200518173503.327209969@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=axboe@fb.com \
    --cc=brking@linux.vnet.ibm.com \
    --cc=gprocida@google.com \
    --cc=hch@lst.de \
    --cc=keith.busch@intel.com \
    --cc=krisman@linux.vnet.ibm.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).