All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jiri Slaby <jslaby@suse.cz>
To: stable@vger.kernel.org
Cc: linux-kernel@vger.kernel.org,
	Roman Pen <roman.penyaev@profitbricks.com>,
	Gioh Kim <gi-oh.kim@profitbricks.com>,
	Michael Wang <yun.wang@profitbricks.com>,
	Tejun Heo <tj@kernel.org>, Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, Jiri Slaby <jslaby@suse.cz>
Subject: [PATCH 3.12 14/76] workqueue: fix ghost PENDING flag while doing MQ IO
Date: Thu, 19 May 2016 11:07:36 +0200	[thread overview]
Message-ID: <4aa63af38d182aa029283e109b0e3b390b3fb7c3.1463648873.git.jslaby@suse.cz> (raw)
In-Reply-To: <c4ec97c341630e22e546c1a628a3664f17f7ffc8.1463648873.git.jslaby@suse.cz>
In-Reply-To: <cover.1463648873.git.jslaby@suse.cz>

From: Roman Pen <roman.penyaev@profitbricks.com>

3.12-stable review patch.  If anyone has any objections, please let me know.

===============

commit 346c09f80459a3ad97df1816d6d606169a51001a upstream.

The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list
with the following backtrace:

[  601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[  601.347574]       Tainted: G           O    4.4.5-1-storage+ #6
[  601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  601.348142] kworker/u129:5  D ffff880803077988     0  1636      2 0x00000000
[  601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[  601.348999]  ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[  601.349662]  ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[  601.350333]  ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[  601.350965] Call Trace:
[  601.351203]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.351444]  [<ffffffff815b01d5>] schedule+0x35/0x80
[  601.351709]  [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[  601.351958]  [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[  601.352208]  [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[  601.352446]  [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[  601.352688]  [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[  601.352951]  [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[  601.353196]  [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[  601.353440]  [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[  601.353689]  [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[  601.353958]  [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[  601.354200]  [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[  601.354441]  [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[  601.354688]  [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[  601.354932]  [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[  601.355193]  [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[  601.355432]  [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[  601.355679]  [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[  601.355925]  [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[  601.356164]  [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

  queue_mode    = MQ
  submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
       0        1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
        ffff8838038e2400
...

During debug it became clear that stalled request is always inserted in
the rq_list from the following path:

   save_stack_trace_tsk + 34
   blk_mq_insert_requests + 231
   blk_mq_flush_plug_list + 281
   blk_flush_plug_list + 199
   wait_on_page_bit + 192
   __filemap_fdatawait_range + 228
   filemap_fdatawait_range + 20
   filemap_write_and_wait_range + 63
   blkdev_fsync + 27
   vfs_fsync_range + 73
   blkdev_write_iter + 202
   __vfs_write + 170
   vfs_write + 169
   kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, that means that finally blk_mq_insert_requests()
offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue,
i.e. it calls kblockd_schedule_delayed_work_on().

That means, that we race with another CPU, which is about to execute
__blk_mq_run_hw_queue() work.

Further debugging shows the following traces from different CPUs:

  CPU#0                                  CPU#1
  ----------------------------------     -------------------------------
  reqeust A inserted
  STORE hctx->ctx_map[0] bit marked
  kblockd_schedule...() returns 1
  <schedule to kblockd workqueue>
                                         request B inserted
                                         STORE hctx->ctx_map[1] bit marked
                                         kblockd_schedule...() returns 0
  *** WORK PENDING bit is cleared ***
  flush_busy_ctxs() is executed, but
  bit 1, set by CPU#1, is not observed

As a result request B pended forever.

This behaviour can be explained by speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with clear of PENDING bit and executed _before_
actual STORE of bit 1 on CPU#1.

The proper fix is an explicit full barrier <mfence>, which guarantees
that clear of PENDING bit is to be executed before all possible
speculative LOADS or STORES inside actual work function.

Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Michael Wang <yun.wang@profitbricks.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index bb5f920268d7..2bc1257e420f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -622,6 +622,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
 	 */
 	smp_wmb();
 	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+	/*
+	 * The following mb guarantees that previous clear of a PENDING bit
+	 * will not be reordered with any speculative LOADS or STORES from
+	 * work->current_func, which is executed afterwards.  This possible
+	 * reordering can lead to a missed execution on attempt to qeueue
+	 * the same @work.  E.g. consider this case:
+	 *
+	 *   CPU#0                         CPU#1
+	 *   ----------------------------  --------------------------------
+	 *
+	 * 1  STORE event_indicated
+	 * 2  queue_work_on() {
+	 * 3    test_and_set_bit(PENDING)
+	 * 4 }                             set_..._and_clear_pending() {
+	 * 5                                 set_work_data() # clear bit
+	 * 6                                 smp_mb()
+	 * 7                               work->current_func() {
+	 * 8				      LOAD event_indicated
+	 *				   }
+	 *
+	 * Without an explicit full barrier speculative LOAD on line 8 can
+	 * be executed before CPU#0 does STORE on line 1.  If that happens,
+	 * CPU#0 observes the PENDING bit is still set and new execution of
+	 * a @work is not queued in a hope, that CPU#1 will eventually
+	 * finish the queued @work.  Meanwhile CPU#1 does not see
+	 * event_indicated is set, because speculative LOAD was executed
+	 * before actual STORE.
+	 */
+	smp_mb();
 }
 
 static void clear_work_data(struct work_struct *work)
-- 
2.8.2

  parent reply	other threads:[~2016-05-19  9:07 UTC|newest]

Thread overview: 86+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-05-19  9:08 [PATCH 3.12 00/76] 3.12.60-stable review Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 01/76] crypto: gcm - Fix rfc4543 decryption crash Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 02/76] ARM: OMAP2+: hwmod: Fix updating of sysconfig register Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 03/76] usb: xhci: fix wild pointers in xhci_mem_cleanup Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 04/76] usb: hcd: out of bounds access in for_each_companion Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 05/76] lib: lz4: fixed zram with lz4 on big endian machines Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 06/76] drm/qxl: fix cursor position with non-zero hotspot Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 07/76] nl80211: check netlink protocol in socket release notification Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 08/76] Input: gtco - fix crash on detecting device without endpoints Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 09/76] pinctrl: single: Fix pcs_parse_bits_in_pinctrl_entry to use __ffs than ffs Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 10/76] EDAC: i7core, sb_edac: Don't return NOTIFY_BAD from mce_decoder callback Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 11/76] ASoC: s3c24xx: use const snd_soc_component_driver pointer Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 12/76] ASoC: rt5640: Correct the digital interface data select Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 13/76] efi: Fix out-of-bounds read in variable_matches() Jiri Slaby
2016-05-19  9:07 ` Jiri Slaby [this message]
2016-05-19  9:07 ` [PATCH 3.12 15/76] paride: make 'verbose' parameter an 'int' again Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 16/76] fbdev: da8xx-fb: fix videomodes of lcd panels Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 17/76] misc/bmp085: Enable building as a module Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 18/76] rtc: vr41xx: Wire up alarm_irq_enable Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 19/76] drivers/misc/ad525x_dpot: AD5274 fix RDAC read back errors Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 20/76] serial: sh-sci: Remove cpufreq notifier to fix crash/deadlock Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 21/76] include/linux/poison.h: fix LIST_POISON{1,2} offset Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 22/76] Drivers: hv: vmbus: prevent cpu offlining on newer hypervisors Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 23/76] perf stat: Document --detailed option Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 24/76] ARM: OMAP3: Add cpuidle parameters table for omap3430 Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 25/76] bus: imx-weim: Take the 'status' property value into account Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 26/76] sunrpc/cache: drop reference when sunrpc_cache_pipe_upcall() detects a race Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 27/76] Revert "xfs: add capability check to free eofblocks ioctl" Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 28/76] mmc: sdhci: Allow for irq being shared Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 29/76] scsi: Avoid crashing if device uses DIX but adapter does not support it Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 30/76] cpuset: Fix potential deadlock w/ set_mems_allowed Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 31/76] compiler-gcc: disable -ftracer for __noclone functions Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 32/76] x86: LLVMLinux: Fix "incomplete type const struct x86cpu_device_id" Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 33/76] ipvs: correct initial offset of Call-ID header search in SIP persistence engine Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 34/76] nbd: ratelimit error msgs after socket close Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 35/76] clk: versatile: sp810: support reentrance Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 36/76] lpfc: fix misleading indentation Jiri Slaby
2016-05-19  9:07 ` [PATCH 3.12 37/76] ARM: SoCFPGA: Fix secondary CPU startup in thumb2 kernel Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 38/76] proc: prevent accessing /proc/<PID>/environ until it's ready Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 39/76] batman-adv: Check skb size before using encapsulated ETH+VLAN header Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 40/76] batman-adv: Fix broadcast/ogm queue limit on a removed interface Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 41/76] batman-adv: Reduce refcnt of removed router when updating route Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 42/76] MAINTAINERS: Remove asterisk from EFI directory names Jiri Slaby
2016-05-19  9:08   ` Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 43/76] x86/sysfb_efi: Fix valid BAR address range check Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 44/76] ACPICA: Dispatcher: Update thread ID for recursive method calls Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 45/76] USB: serial: cp210x: add ID for Link ECU Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 46/76] USB: serial: cp210x: add Straizona Focusers device ids Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 47/76] iio: ak8975: Fix NULL pointer exception on early interrupt Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 48/76] Input: ads7846 - correct the value got from SPI Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 49/76] powerpc: scan_features() updates incorrect bits for REAL_LE Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 50/76] Input: i8042 - lower log level for "no controller" message Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 51/76] mm/balloon_compaction: redesign ballooned pages management Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 52/76] mm/balloon_compaction: fix deflation when compaction is disabled Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 53/76] crypto: hash - Fix page length clamping in hash walk Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 54/76] get_rock_ridge_filename(): handle malformed NM entries Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 55/76] Input: max8997-haptic - fix NULL pointer dereference Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 56/76] asmlinkage, pnp: Make variables used from assembler code visible Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 57/76] drm/radeon: fix PLL sharing on DCE6.1 (v2) Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 58/76] drm/i915: Bail out of pipe config compute loop on LPT Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 59/76] ARM: OMAP3: Fix booting with thumb2 kernel Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 60/76] net/route: enforce hoplimit max value Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 61/76] decnet: Do not build routes to devices without decnet private data Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 62/76] route: do not cache fib route info on local routes with oif Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 63/76] packet: fix heap info leak in PACKET_DIAG_MCLIST sock_diag interface Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 64/76] atl2: Disable unimplemented scatter/gather feature Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 65/76] ipv4/fib: don't warn when primary address is missing if in_dev is dead Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 66/76] net/mlx4_en: fix spurious timestamping callbacks Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 67/76] netem: Segment GSO packets on enqueue Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 68/76] net: fix infoleak in llc Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 69/76] net: fix infoleak in rtnetlink Jiri Slaby
2016-05-20 12:04   ` Vegard Nossum
     [not found]     ` <CABEk9YxT4eRBrEhkrCNHwM9yuFKRW4bBcrAfjgW0iyS0q3v65A@mail.gmail.com>
2016-05-20 14:25       ` Vegard Nossum
2016-05-20 16:45     ` David Miller
2016-05-21  0:43       ` Hannes Frederic Sowa
2016-05-19  9:08 ` [PATCH 3.12 70/76] VSOCK: do not disconnect socket when peer has shutdown SEND only Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 71/76] net: bridge: fix old ioctl unlocked net device walk Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 72/76] net: fix a kernel infoleak in x25 module Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 73/76] ASN.1: Fix non-match detection failure on data overrun Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 74/76] KEYS: Fix ASN.1 indefinite length object parsing Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 75/76] sched: Remove lockdep check in sched_move_task() Jiri Slaby
2016-05-19  9:08 ` [PATCH 3.12 76/76] X.509: remove possible code fragility: enumeration values not handled Jiri Slaby
2016-05-19 13:52 ` [PATCH 3.12 00/76] 3.12.60-stable review Guenter Roeck
2016-05-23  9:49   ` Jiri Slaby
2016-05-24 12:58 ` Shuah Khan
2016-05-24 13:55   ` Jiri Slaby

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4aa63af38d182aa029283e109b0e3b390b3fb7c3.1463648873.git.jslaby@suse.cz \
    --to=jslaby@suse.cz \
    --cc=axboe@kernel.dk \
    --cc=gi-oh.kim@profitbricks.com \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=roman.penyaev@profitbricks.com \
    --cc=stable@vger.kernel.org \
    --cc=tj@kernel.org \
    --cc=yun.wang@profitbricks.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.