* [PATCH v10 0/2] blk-mq: Rework blk-mq timeout handling again
@ 2018-05-15 22:51 Bart Van Assche
  2018-05-15 22:51 ` [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-05-15 22:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel, Christoph Hellwig, Bart Van Assche

Hello Jens,

This is the tenth incarnation of the blk-mq timeout handling rework. All
previously posted comments should have been addressed. Please consider this
patch series for inclusion in the upstream kernel.

Bart.

Changes compared to v9:
- Addressed multiple comments related to patch 1/2: added
  CONFIG_ARCH_HAVE_CMPXCHG64 for riscv, modified
  features/locking/cmpxchg64/arch-support.txt as requested and made the
  order of the symbols in the arch/*/Kconfig alphabetical where possible.

Changes compared to v8:
- Split into two patches.
- Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into
  blk_mq_init_request().
- Fixed the deadline set by blk_add_timer().
- Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 /
  #endif.

Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied the
  behavior of the block layer does not depend on the generation number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field (see the layout sketch below this list).
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.
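
For reference, the resulting layout (copied from patch 2/2, with a few
annotating comments added) is:

  union blk_deadline_and_state {
          struct {
                  uint32_t generation:30;
                  uint32_t state:2;      /* MQ_RQ_IDLE / IN_FLIGHT / COMPLETE */
                  uint32_t deadline;     /* in jiffies */
          };
          unsigned long legacy_deadline; /* legacy block layer only */
          uint64_t das;                  /* single word for cmpxchg64() */
  };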

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling, to prevent timeout handling from
  starting while .queue_rq() is still in progress when the timeout is very
  short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise (see the sketch below this
  list).
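
A simplified sketch of that pattern, taken from blk_mq_change_rq_state() in
patch 2/2 (blk_mq_set_rq_state() wraps cmpxchg64(), with a spinlock fallback
on architectures that lack it):

  if (old_state != MQ_RQ_IN_FLIGHT) {
          /* No other context can update the state; a plain store suffices. */
          WRITE_ONCE(rq->das.das, new_val.das);
          return true;
  }
  /* The timeout code may change the state concurrently. */
  return blk_mq_set_rq_state(rq, old_val, new_val);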

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

Bart Van Assche (2):
  arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  blk-mq: Rework blk-mq timeout handling again

 .../features/locking/cmpxchg64/arch-support.txt    |  33 ++++
 arch/Kconfig                                       |   3 +
 arch/alpha/Kconfig                                 |   1 +
 arch/arm/Kconfig                                   |   1 +
 arch/arm64/Kconfig                                 |   1 +
 arch/ia64/Kconfig                                  |   1 +
 arch/m68k/Kconfig                                  |   1 +
 arch/mips/Kconfig                                  |   1 +
 arch/parisc/Kconfig                                |   1 +
 arch/powerpc/Kconfig                               |   1 +
 arch/riscv/Kconfig                                 |   1 +
 arch/s390/Kconfig                                  |   1 +
 arch/sparc/Kconfig                                 |   1 +
 arch/x86/Kconfig                                   |   1 +
 arch/xtensa/Kconfig                                |   1 +
 block/blk-core.c                                   |   6 -
 block/blk-mq-debugfs.c                             |   1 -
 block/blk-mq.c                                     | 183 ++++++---------------
 block/blk-mq.h                                     | 117 ++++++++++---
 block/blk-timeout.c                                |  95 ++++++-----
 block/blk.h                                        |  14 +-
 include/linux/blkdev.h                             |  47 +++---
 22 files changed, 279 insertions(+), 233 deletions(-)
 create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt

-- 
2.16.3


* [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  2018-05-15 22:51 [PATCH v10 0/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
@ 2018-05-15 22:51 ` Bart Van Assche
  2018-05-16  4:15   ` Michael Ellerman
  2018-05-18 20:33   ` Palmer Dabbelt
  2018-05-15 22:51 ` [PATCH v10 2/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-05-15 22:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, Christoph Hellwig, Bart Van Assche,
	Catalin Marinas, Will Deacon, Tony Luck, Fenghua Yu,
	Geert Uytterhoeven, James E.J. Bottomley, Helge Deller,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, David S . Miller,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Chris Zankel,
	Max Filippov, Arnd Bergmann, Jonathan Corbet

The next patch in this series introduces a call to cmpxchg64()
in the block layer core for those architectures on which this
functionality is available. Make it possible to test whether
cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
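
The intended use of the new symbol, shown here in abbreviated form as it
appears in blk_mq_set_rq_state() in the next patch:

  #ifdef CONFIG_ARCH_HAVE_CMPXCHG64
          /* Update the 64-bit deadline-and-state word atomically. */
          return cmpxchg64(&rq->das.das, old_val.das, new_val.das) ==
                  old_val.das;
  #else
          /* Architectures without cmpxchg64() fall back to a spinlock. */
          spin_lock_irqsave(&rq->das_lock, flags);
          ...
          spin_unlock_irqrestore(&rq->das_lock, flags);
  #endif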

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jonathan Corbet <corbet@lwn.net>
---
 .../features/locking/cmpxchg64/arch-support.txt    | 33 ++++++++++++++++++++++
 arch/Kconfig                                       |  3 ++
 arch/alpha/Kconfig                                 |  1 +
 arch/arm/Kconfig                                   |  1 +
 arch/arm64/Kconfig                                 |  1 +
 arch/ia64/Kconfig                                  |  1 +
 arch/m68k/Kconfig                                  |  1 +
 arch/mips/Kconfig                                  |  1 +
 arch/parisc/Kconfig                                |  1 +
 arch/powerpc/Kconfig                               |  1 +
 arch/riscv/Kconfig                                 |  1 +
 arch/s390/Kconfig                                  |  1 +
 arch/sparc/Kconfig                                 |  1 +
 arch/x86/Kconfig                                   |  1 +
 arch/xtensa/Kconfig                                |  1 +
 15 files changed, 49 insertions(+)
 create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt

diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt b/Documentation/features/locking/cmpxchg64/arch-support.txt
new file mode 100644
index 000000000000..84bfef7242b2
--- /dev/null
+++ b/Documentation/features/locking/cmpxchg64/arch-support.txt
@@ -0,0 +1,33 @@
+#
+# Feature name:          cmpxchg64
+#         Kconfig:       ARCH_HAVE_CMPXCHG64
+#         description:   arch supports the cmpxchg64() API
+#
+    -----------------------
+    |         arch |status|
+    -----------------------
+    |       alpha: |  ok  |
+    |         arc: |  ..  |
+    |         arm: |  ok  |
+    |       arm64: |  ok  |
+    |         c6x: |  ..  |
+    |       h8300: |  ..  |
+    |     hexagon: |  ..  |
+    |        ia64: |  ok  |
+    |        m68k: |  ok  |
+    |  microblaze: |  ..  |
+    |        mips: |  ok  |
+    |       nds32: |  ..  |
+    |       nios2: |  ..  |
+    |    openrisc: |  ..  |
+    |      parisc: |  ok  |
+    |     powerpc: |  ok  |
+    |       riscv: |  ok  |
+    |        s390: |  ok  |
+    |          sh: |  ..  |
+    |       sparc: |  ok  |
+    |          um: |  ..  |
+    |   unicore32: |  ..  |
+    |         x86: |  ok  |
+    |      xtensa: |  ok  |
+    -----------------------
diff --git a/arch/Kconfig b/arch/Kconfig
index 8e0d665c8d53..bd54eb125b15 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -358,6 +358,9 @@ config HAVE_ALIGNED_STRUCT_PAGE
 	  on a struct page for better performance. However selecting this
 	  might increase the size of a struct page by a word.
 
+config ARCH_HAVE_CMPXCHG64
+	bool
+
 config HAVE_CMPXCHG_LOCAL
 	bool
 
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index b2022885ced8..583bd32c90a7 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -15,6 +15,7 @@ config ALPHA
 	select AUTO_IRQ_AFFINITY if SMP
 	select GENERIC_IRQ_SHOW
 	select ARCH_WANT_IPC_PARSE_VERSION
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select AUDIT_ARCH
 	select GENERIC_CLOCKEVENTS
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a7f8e7f4b88f..02c75697176e 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -13,6 +13,7 @@ config ARM
 	select ARCH_HAS_STRICT_KERNEL_RWX if MMU && !XIP_KERNEL
 	select ARCH_HAS_STRICT_MODULE_RWX if MMU
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAVE_CMPXCHG64 if !THUMB2_KERNEL
 	select ARCH_HAVE_CUSTOM_GPIO_H
 	select ARCH_HAS_GCOV_PROFILE_ALL
 	select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index eb2cf4938f6d..dd18d7ca38fa 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -22,6 +22,7 @@ config ARM64
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_INLINE_READ_LOCK if !PREEMPT
 	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index bbe12a038d21..31c49e1482e2 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -41,6 +41,7 @@ config IA64
 	select GENERIC_PENDING_IRQ if SMP
 	select GENERIC_IRQ_SHOW
 	select GENERIC_IRQ_LEGACY
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select GENERIC_IOMAP
 	select GENERIC_SMP_IDLE_THREAD
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 785612b576f7..7b87cda3bbed 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -11,6 +11,7 @@ config M68K
 	select GENERIC_ATOMIC64
 	select HAVE_UID16
 	select VIRT_TO_BUS
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG if RMW_INSNS
 	select GENERIC_CPU_DEVICES
 	select GENERIC_IOMAP
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 225c95da23ce..088bca0fd9f2 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -7,6 +7,7 @@ config MIPS
 	select ARCH_DISCARD_MEMBLOCK
 	select ARCH_HAS_ELF_RANDOMIZE
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
+	select ARCH_HAVE_CMPXCHG64 if 64BIT
 	select ARCH_SUPPORTS_UPROBES
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF if 64BIT
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index fc5a574c3482..166c30865255 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -30,6 +30,7 @@ config PARISC
 	select GENERIC_ATOMIC64 if !64BIT
 	select GENERIC_IRQ_PROBE
 	select GENERIC_PCI_IOMAP
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_CPU_DEVICES
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..901365d12dcd 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
 	select ARCH_HAS_UACCESS_FLUSHCACHE	if PPC64
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_ZONE_DEVICE		if PPC_BOOK3S_64
+	select ARCH_HAVE_CMPXCHG64		if PPC64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index cd4fd85fde84..4f886a055ff6 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -8,6 +8,7 @@ config RISCV
 	select OF
 	select OF_EARLY_FLATTREE
 	select OF_IRQ
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_WANT_FRAME_POINTERS
 	select CLONE_BACKWARDS
 	select COMMON_CLK
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 199ac3e4da1d..f05f7a76ba73 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -73,6 +73,7 @@ config S390
 	select ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_INLINE_READ_LOCK
 	select ARCH_INLINE_READ_LOCK_BH
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8767e45f1b2b..e3429b78c491 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -75,6 +75,7 @@ config SPARC64
 	select HAVE_PERF_EVENTS
 	select PERF_USE_VMALLOC
 	select IRQ_PREFLOW_FASTEOI
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select HAVE_C_RECORDMCOUNT
 	select NO_BOOTMEM
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c07f492b871a..1721fcda9822 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -67,6 +67,7 @@ config X86
 	select ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_ZONE_DEVICE		if X86_64
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
 	select ARCH_MIGHT_HAVE_PC_PARPORT
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index c921e8bccdc8..0e5c77958fa3 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -4,6 +4,7 @@ config ZONE_DMA
 
 config XTENSA
 	def_bool y
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_NO_COHERENT_DMA_MMAP if !MMU
 	select ARCH_WANT_FRAME_POINTERS
 	select ARCH_WANT_IPC_PARSE_VERSION
-- 
2.16.3


* [PATCH v10 2/2] blk-mq: Rework blk-mq timeout handling again
  2018-05-15 22:51 [PATCH v10 0/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  2018-05-15 22:51 ` [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
@ 2018-05-15 22:51 ` Bart Van Assche
  2018-05-15 22:51 ` [PATCH v9 0/2] " Bart Van Assche
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-05-15 22:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, Christoph Hellwig, Bart Van Assche,
	Tejun Heo, Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has reset rq->aborted_gstate. Without this
change, if a block driver timeout handler always returns BLK_EH_RESET_TIMER,
the result is that the request never terminates. A sketch of the resulting
per-request state machine follows the notes below.

Fix this race as follows:
- Replace the __deadline member of struct request by a new member
  called das that contains the generation number, state and deadline.
  Only 32 bits are used for the deadline field such that all three
  fields occupy only 64 bits. This change reduces the maximum supported
  request timeout value from (2**63/HZ) to (2**31/HZ).
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.
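
As a reading aid (not part of the diff), the resulting request states and
transitions can be summarized as follows. Note that with the 32-bit deadline
field the largest supported timeout is 2**31/HZ seconds, i.e. roughly 24.8
days at HZ=1000.

  MQ_RQ_IDLE       -> MQ_RQ_IN_FLIGHT  blk_mq_start_request()
  MQ_RQ_IN_FLIGHT  -> MQ_RQ_COMPLETE   blk_mq_complete_request() or
                                        blk_mq_terminate_expired()
  MQ_RQ_COMPLETE   -> MQ_RQ_IN_FLIGHT  timeout handler returned
                                        BLK_EH_RESET_TIMER
  any state        -> MQ_RQ_IDLE       blk_mq_free_request() or
                                        __blk_mq_requeue_request()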

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sebastian Ott <sebott@linux.ibm.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
---
 block/blk-core.c       |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 183 ++++++++++++++-----------------------------------
 block/blk-mq.h         | 117 +++++++++++++++++++++++--------
 block/blk-timeout.c    |  95 ++++++++++++++-----------
 block/blk.h            |  14 ++--
 include/linux/blkdev.h |  47 ++++++-------
 7 files changed, 230 insertions(+), 233 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 43370faee935..cee03cad99f2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * See comment of blk_mq_init_request
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 64630caaf27e..40c9aa085613 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -318,7 +318,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->special = NULL;
 	/* tag was already set */
 	rq->extra_len = 0;
-	rq->__deadline = 0;
+	WARN_ON_ONCE(rq->das.state != MQ_RQ_IDLE);
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -494,7 +494,8 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(true);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -547,8 +548,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -593,36 +593,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -634,27 +604,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE))
 		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -681,27 +636,7 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, rq);
 	}
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
-	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -714,22 +649,19 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, rq);
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
-		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -838,8 +770,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -848,13 +778,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -868,48 +792,50 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
-
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
+	union blk_deadline_and_state das = READ_ONCE(rq->das);
+	unsigned long now = jiffies;
+	int32_t diff_jiffies = das.deadline - now;
+	int32_t diff_next = das.deadline - data->next;
+	enum mq_rq_state rq_state = das.state;
 
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
-
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
-	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
+	/*
+	 * Make sure that rq->aborted_gstate != rq->das if rq->deadline has not
+	 * yet been reached even if a request gets recycled before
+	 * blk_mq_terminate_expired() is called and the value of rq->deadline
+	 * is not modified due to the request recycling.
+	 */
+	rq->aborted_gstate = das;
+	rq->aborted_gstate.generation ^= (1UL << 29);
+	if (diff_jiffies <= 0 && rq_state == MQ_RQ_IN_FLIGHT) {
+		/* request timed out */
+		rq->aborted_gstate = das;
 		data->nr_expired++;
 		hctx->nr_expired++;
-	} else if (!data->next_set || time_after(data->next, deadline)) {
-		data->next = deadline;
+	} else if (!data->next_set) {
+		/* data->next is not yet set; set it to deadline. */
+		data->next = now + diff_jiffies;
 		data->next_set = 1;
+	} else if (diff_next < 0) {
+		/* data->next is later than deadline; reduce data->next. */
+		data->next += diff_next;
 	}
+
 }
 
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
+	union blk_deadline_and_state old_val = rq->aborted_gstate;
+	union blk_deadline_and_state new_val = old_val;
+
+	new_val.state = MQ_RQ_COMPLETE;
+
 	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+	 * calls. If rq->das has not changed that means that it
+	 * is now safe to change the request state and to handle the timeout.
 	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+	if (blk_mq_set_rq_state(rq, old_val, new_val))
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
@@ -948,10 +874,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		bool has_rcu = false;
 
 		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
+		 * For very short timeouts it can happen that
+		 * blk_mq_check_expired() modifies the state of a request
+		 * while .queue_rq() is still in progress. Hence wait until
+		 * these .queue_rq() calls have finished.
 		 */
 		queue_for_each_hw_ctx(q, hctx, i) {
 			if (!hctx->nr_expired)
@@ -967,7 +893,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		if (has_rcu)
 			synchronize_rcu();
 
-		/* terminate the ones we won */
+		/* Terminate the requests marked by blk_mq_check_expired(). */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
 	}
 
@@ -2057,21 +1983,16 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 {
 	int ret;
 
+#ifndef CONFIG_ARCH_HAVE_CMPXCHG64
+	spin_lock_init(&rq->das_lock);
+#endif
+
 	if (set->ops->init_request) {
 		ret = set->ops->init_request(set, rq, hctx_idx, node);
 		if (ret)
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * start gstate with gen 1 instead of 0, otherwise it will be equal
-	 * to aborted_gstate, and be identified timed out by
-	 * blk_mq_terminate_expired.
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
-
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e1bb420dc5d6..42320011d264 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -2,6 +2,7 @@
 #ifndef INT_BLK_MQ_H
 #define INT_BLK_MQ_H
 
+#include <asm/cmpxchg.h>
 #include "blk-stat.h"
 #include "blk-mq-tag.h"
 
@@ -30,18 +31,11 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
- */
+/* See also blk_deadline_and_state.state. */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
-
-	MQ_RQ_STATE_BITS	= 2,
-	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -103,37 +97,106 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
+static inline bool blk_mq_set_rq_state(struct request *rq,
+				       union blk_deadline_and_state old_val,
+				       union blk_deadline_and_state new_val)
+{
+#ifdef CONFIG_ARCH_HAVE_CMPXCHG64
+	return cmpxchg64(&rq->das.das, old_val.das, new_val.das) == old_val.das;
+#else
+	unsigned long flags;
+	bool res = false;
+
+	spin_lock_irqsave(&rq->das_lock, flags);
+	if (rq->das.das == old_val.das) {
+		rq->das = new_val;
+		res = true;
+	}
+	spin_unlock_irqrestore(&rq->das_lock, flags);
+
+	return res;
+#endif
+}
+
+/*
+ * If the state of request @rq equals @old_state, update deadline and request
+ * state atomically to @new_time and @new_state. blk-mq only. cmpxchg64() is only
+ * used if there could be a concurrent update attempt from another context.
+ */
+static inline bool blk_mq_rq_set_deadline(struct request *rq,
+					  unsigned long new_time,
+					  enum mq_rq_state old_state,
+					  enum mq_rq_state new_state)
+{
+	union blk_deadline_and_state old_val, new_val;
+
+	if (old_state != MQ_RQ_IN_FLIGHT) {
+		old_val = READ_ONCE(rq->das);
+		if (old_val.state != old_state)
+			return false;
+		new_val = old_val;
+		new_val.deadline = new_time;
+		new_val.state = new_state;
+		if (new_state == MQ_RQ_IN_FLIGHT)
+			new_val.generation++;
+		WRITE_ONCE(rq->das.das, new_val.das);
+		return true;
+	}
+
+	do {
+		old_val = READ_ONCE(rq->das);
+		if (old_val.state != old_state)
+			return false;
+		new_val = old_val;
+		new_val.deadline = new_time;
+		new_val.state = new_state;
+		if (new_state == MQ_RQ_IN_FLIGHT)
+			new_val.generation++;
+	} while (!blk_mq_set_rq_state(rq, old_val, new_val));
+
+	return true;
+}
+
 /**
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return rq->das.state;
 }
 
 /**
- * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
- * @rq: target request.
- * @state: new state to set.
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
  *
- * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
+ * Returns %true if and only if the old state was @old_state and the state
+ * has been changed into @new_state.
  */
-static inline void blk_mq_rq_update_state(struct request *rq,
-					  enum mq_rq_state state)
+static inline bool blk_mq_change_rq_state(struct request *rq,
+					  enum mq_rq_state old_state,
+					  enum mq_rq_state new_state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
+	union blk_deadline_and_state das = READ_ONCE(rq->das);
+	union blk_deadline_and_state old_val = das;
+	union blk_deadline_and_state new_val = das;
+
+	old_val.state = old_state;
+	new_val.state = new_state;
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		new_val.generation++;
+	/*
+	 * For transitions from state in-flight to another state cmpxchg64()
+	 * must be used. For other state transitions it is safe to use
+	 * WRITE_ONCE().
+	 */
+	if (old_state != MQ_RQ_IN_FLIGHT) {
+		WRITE_ONCE(rq->das.das, new_val.das);
+		return true;
 	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	return blk_mq_set_rq_state(rq, old_val, new_val);
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..549dfda7190c 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -162,8 +162,9 @@ void blk_abort_request(struct request *req)
 		 * immediately and that scan sees the new timeout value.
 		 * No need for fancy synchronizations.
 		 */
-		blk_rq_set_deadline(req, jiffies);
-		kblockd_schedule_work(&req->q->timeout_work);
+		if (blk_mq_rq_set_deadline(req, jiffies, MQ_RQ_IN_FLIGHT,
+					   MQ_RQ_IN_FLIGHT))
+			kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
@@ -184,52 +185,17 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req:	request that is about to start running.
- *
- * Notes:
- *    Each request has its own timer, and as it is added to the queue, we
- *    set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req, unsigned long deadline)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
-	if (!req->timeout)
-		req->timeout = q->rq_timeout;
-
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
 	/*
 	 * If the timer isn't already pending or this timeout is earlier
 	 * than an existing one, modify the timer. Round up to next nearest
 	 * second.
 	 */
-	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
-
+	expiry = blk_rq_timeout(round_jiffies_up(deadline));
 	if (!timer_pending(&q->timeout) ||
 	    time_before(expiry, q->timeout.expires)) {
 		unsigned long diff = q->timeout.expires - expiry;
@@ -244,5 +210,56 @@ void blk_add_timer(struct request *req)
 		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
 			mod_timer(&q->timeout, expiry);
 	}
+}
+
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req:	request that is about to start running.
+ *
+ * Notes:
+ *    Each request has its own timer, and as it is added to the queue, we
+ *    set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+	unsigned long deadline;
+
+	lockdep_assert_held(q->queue_lock);
+
+	if (!q->rq_timed_out_fn)
+		return;
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
 
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
+
+	return __blk_add_timer(req, deadline);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ * @old:	current request state.
+ * @new:	new request state.
+ *
+ * Sets the deadline of a request if and only if it has state @old and
+ * at the same time changes the request state from @old into @new. The caller
+ * must guarantee that the request state won't be modified while this function
+ * is in progress.
+ */
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new)
+{
+	struct request_queue *q = req->q;
+	unsigned long deadline;
+
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+	deadline = jiffies + req->timeout;
+	if (!blk_mq_rq_set_deadline(req, deadline, old, new))
+		WARN_ON_ONCE(true);
+	return __blk_add_timer(req, deadline);
 }
diff --git a/block/blk.h b/block/blk.h
index eaf1a8e87d11..984367308b11 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,8 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new);
 void blk_delete_timer(struct request *);
 
 
@@ -191,21 +193,21 @@ void blk_account_io_done(struct request *req, u64 now);
 /*
  * EH timer and IO completion will both attempt to 'grab' the request, make
  * sure that only one of them succeeds. Steal the bottom bit of the
- * __deadline field for this.
+ * das field for this.
  */
 static inline int blk_mark_rq_complete(struct request *rq)
 {
-	return test_and_set_bit(0, &rq->__deadline);
+	return test_and_set_bit(0, &rq->das.legacy_deadline);
 }
 
 static inline void blk_clear_rq_complete(struct request *rq)
 {
-	clear_bit(0, &rq->__deadline);
+	clear_bit(0, &rq->das.legacy_deadline);
 }
 
 static inline bool blk_rq_is_complete(struct request *rq)
 {
-	return test_bit(0, &rq->__deadline);
+	return test_bit(0, &rq->das.legacy_deadline);
 }
 
 /*
@@ -314,12 +316,12 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
  */
 static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
 {
-	rq->__deadline = time & ~0x1UL;
+	rq->das.legacy_deadline = time & ~0x1UL;
 }
 
 static inline unsigned long blk_rq_deadline(struct request *rq)
 {
-	return rq->__deadline & ~0x1UL;
+	return rq->das.legacy_deadline & ~0x1UL;
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..771dbdc8758c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,8 +27,6 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
-#include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -125,15 +123,23 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
+union blk_deadline_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+		uint32_t deadline;
+	};
+	unsigned long legacy_deadline;
+	uint64_t das;
+};
+
 /*
  * Try to put the fields that are referenced together in the same cacheline.
  *
@@ -236,29 +242,24 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
-	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
 	/*
 	 * ->aborted_gstate is used by the timeout to claim a specific
 	 * recycle instance of this request.  See blk_mq_timeout_work().
 	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
+	union blk_deadline_and_state aborted_gstate;
+
+#ifndef CONFIG_ARCH_HAVE_CMPXCHG64
+	spinlock_t das_lock;
+#endif
 
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
-	unsigned long __deadline;
+	/*
+	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state(),
+	 * blk_mq_change_rq_state() and blk_mq_rq_set_deadline() for
+	 * blk-mq queues.
+	 */
+	union blk_deadline_and_state das;
 
 	struct list_head timeout_list;
 
-- 
2.16.3


* [PATCH v9 0/2] blk-mq: Rework blk-mq timeout handling again
  2018-05-15 22:51 [PATCH v10 0/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  2018-05-15 22:51 ` [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
  2018-05-15 22:51 ` [PATCH v10 2/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
@ 2018-05-15 22:51 ` Bart Van Assche
  2018-05-15 22:51 ` [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
  2018-05-15 22:51 ` [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  4 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-05-15 22:51 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, linux-kernel, Christoph Hellwig, Bart Van Assche

Hello Jens,

This is the ninth incarnation of the blk-mq timeout handling rework. All
previously posted comments have been addressed. Please consider this patch
series for inclusion in the upstream kernel.

Bart.

Changes compared to v8:
- Split into two patches.
- Moved the spin_lock_init() call from blk_mq_rq_ctx_init() into
  blk_mq_init_request().
- Fixed the deadline set by blk_add_timer().
- Surrounded the das_lock member with #ifndef CONFIG_ARCH_HAVE_CMPXCHG64 /
  #endif.

Changes compared to v7:
- Fixed the generation number mechanism. Note: with this patch applied the
  behavior of the block layer does not depend on the generation number.
- Added more 32-bit architectures to the list of architectures on which
  cmpxchg64() should not be used.

Changes compared to v6:
- Used a union instead of bit manipulations to store multiple values into
  a single 64-bit field.
- Reduced the size of the timeout field from 64 to 32 bits.
- Made sure that the block layer still builds with this patch applied
  for the sh and mips architectures.
- Fixed two sparse warnings that were introduced by this patch in the
  WRITE_ONCE() calls.

Changes compared to v5:
- Restored the synchronize_rcu() call between marking a request for timeout
  handling and the actual timeout handling, to prevent timeout handling from
  starting while .queue_rq() is still in progress when the timeout is very
  short.
- Only use cmpxchg() if another context could attempt to change the request
  state concurrently. Use WRITE_ONCE() otherwise.

Changes compared to v4:
- Addressed multiple review comments from Christoph. The most important are
  that atomic_long_cmpxchg() has been changed into cmpxchg() and also that
  there is now a nice and clean split between the legacy and blk-mq versions
  of blk_add_timer().
- Changed the patch name and modified the patch description because there is
  disagreement about whether or not the v4.16 blk-mq core can complete a
  single request twice. Kept the "Cc: stable" tag because of
  https://bugzilla.kernel.org/show_bug.cgi?id=199077.

Changes compared to v3 (see also https://www.mail-archive.com/linux-block@vger.kernel.org/msg20073.html):
- Removed the spinlock again that was introduced to protect the request state.
  v4 uses atomic_long_cmpxchg() instead.
- Split __deadline into two variables - one for the legacy block layer and one
  for blk-mq.

Changes compared to v2 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18338.html):
- Rebased and retested on top of kernel v4.16.

Changes compared to v1 (https://www.mail-archive.com/linux-block@vger.kernel.org/msg18089.html):
- Removed the gstate and aborted_gstate members of struct request and used
  the __deadline member to encode both the generation and state information.

Bart Van Assche (2):
  arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  blk-mq: Rework blk-mq timeout handling again

 .../features/locking/cmpxchg64/arch-support.txt    |  31 ++++
 arch/Kconfig                                       |   3 +
 arch/alpha/Kconfig                                 |   1 +
 arch/arm/Kconfig                                   |   1 +
 arch/arm64/Kconfig                                 |   1 +
 arch/ia64/Kconfig                                  |   1 +
 arch/m68k/Kconfig                                  |   1 +
 arch/mips/Kconfig                                  |   1 +
 arch/parisc/Kconfig                                |   1 +
 arch/powerpc/Kconfig                               |   1 +
 arch/s390/Kconfig                                  |   1 +
 arch/sparc/Kconfig                                 |   1 +
 arch/x86/Kconfig                                   |   1 +
 arch/xtensa/Kconfig                                |   1 +
 block/blk-core.c                                   |   6 -
 block/blk-mq-debugfs.c                             |   1 -
 block/blk-mq.c                                     | 183 ++++++---------------
 block/blk-mq.h                                     | 117 ++++++++++---
 block/blk-timeout.c                                |  95 ++++++-----
 block/blk.h                                        |  14 +-
 include/linux/blkdev.h                             |  47 +++---
 21 files changed, 276 insertions(+), 233 deletions(-)
 create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt

-- 
2.16.3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  2018-05-15 22:51 [PATCH v10 0/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
                   ` (2 preceding siblings ...)
  2018-05-15 22:51 ` [PATCH v9 0/2] " Bart Van Assche
@ 2018-05-15 22:51 ` Bart Van Assche
  2018-05-16  6:07   ` Christoph Hellwig
  2018-05-15 22:51 ` [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
  4 siblings, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2018-05-15 22:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, Christoph Hellwig, Bart Van Assche,
	Catalin Marinas, Will Deacon, Tony Luck, Fenghua Yu,
	Geert Uytterhoeven, James E.J. Bottomley, Helge Deller,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, David S . Miller,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Chris Zankel,
	Max Filippov, Arnd Bergmann

The next patch in this series introduces a call to cmpxchg64()
in the block layer core for those architectures on which this
functionality is available. Make it possible to test whether
cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>

---
 .../features/locking/cmpxchg64/arch-support.txt    | 31 ++++++++++++++++++++++
 arch/Kconfig                                       |  3 +++
 arch/alpha/Kconfig                                 |  1 +
 arch/arm/Kconfig                                   |  1 +
 arch/arm64/Kconfig                                 |  1 +
 arch/ia64/Kconfig                                  |  1 +
 arch/m68k/Kconfig                                  |  1 +
 arch/mips/Kconfig                                  |  1 +
 arch/parisc/Kconfig                                |  1 +
 arch/powerpc/Kconfig                               |  1 +
 arch/s390/Kconfig                                  |  1 +
 arch/sparc/Kconfig                                 |  1 +
 arch/x86/Kconfig                                   |  1 +
 arch/xtensa/Kconfig                                |  1 +
 14 files changed, 46 insertions(+)
 create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt

diff --git a/Documentation/features/locking/cmpxchg64/arch-support.txt b/Documentation/features/locking/cmpxchg64/arch-support.txt
new file mode 100644
index 000000000000..65b3290ce5d5
--- /dev/null
+++ b/Documentation/features/locking/cmpxchg64/arch-support.txt
@@ -0,0 +1,31 @@
+#
+# Feature name:          cmpxchg64
+#         Kconfig:       ARCH_HAVE_CMPXCHG64
+#         description:   arch supports the cmpxchg64() API
+#
+    -----------------------
+    |         arch |status|
+    -----------------------
+    |       alpha: |  ok  |
+    |         arc: | TODO |
+    |         arm: |!thumb|
+    |       arm64: |  ok  |
+    |         c6x: | TODO |
+    |       h8300: | TODO |
+    |     hexagon: | TODO |
+    |        ia64: |  ok  |
+    |        m68k: |  ok  |
+    |  microblaze: | TODO |
+    |        mips: |64-bit|
+    |       nios2: | TODO |
+    |    openrisc: | TODO |
+    |      parisc: |  ok  |
+    |     powerpc: |64-bit|
+    |        s390: |  ok  |
+    |          sh: | TODO |
+    |       sparc: |  ok  |
+    |          um: | TODO |
+    |   unicore32: | TODO |
+    |         x86: |  ok  |
+    |      xtensa: |  ok  |
+    -----------------------
diff --git a/arch/Kconfig b/arch/Kconfig
index 8e0d665c8d53..bd54eb125b15 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -358,6 +358,9 @@ config HAVE_ALIGNED_STRUCT_PAGE
 	  on a struct page for better performance. However selecting this
 	  might increase the size of a struct page by a word.
 
+config ARCH_HAVE_CMPXCHG64
+	bool
+
 config HAVE_CMPXCHG_LOCAL
 	bool
 
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index b2022885ced8..94fc28e30ed2 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -16,6 +16,7 @@ config ALPHA
 	select GENERIC_IRQ_SHOW
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_CMPXCHG64
 	select AUDIT_ARCH
 	select GENERIC_CLOCKEVENTS
 	select GENERIC_CPU_VULNERABILITIES
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index a7f8e7f4b88f..7be06e46d329 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -21,6 +21,7 @@ config ARM
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF
+	select ARCH_HAVE_CMPXCHG64 if !THUMB2_KERNEL
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select BUILDTIME_EXTABLE_SORT if MMU
 	select CLONE_BACKWARDS
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index eb2cf4938f6d..998e454a51b5 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -23,6 +23,7 @@ config ARM64
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_TICK_BROADCAST if GENERIC_CLOCKEVENTS_BROADCAST
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_INLINE_READ_LOCK if !PREEMPT
 	select ARCH_INLINE_READ_LOCK_BH if !PREEMPT
 	select ARCH_INLINE_READ_LOCK_IRQ if !PREEMPT
diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index bbe12a038d21..b99a7fd9efec 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -42,6 +42,7 @@ config IA64
 	select GENERIC_IRQ_SHOW
 	select GENERIC_IRQ_LEGACY
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_CMPXCHG64
 	select GENERIC_IOMAP
 	select GENERIC_SMP_IDLE_THREAD
 	select ARCH_TASK_STRUCT_ON_STACK
diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 785612b576f7..9249de31414b 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -12,6 +12,7 @@ config M68K
 	select HAVE_UID16
 	select VIRT_TO_BUS
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG if RMW_INSNS
+	select ARCH_HAVE_CMPXCHG64
 	select GENERIC_CPU_DEVICES
 	select GENERIC_IOMAP
 	select GENERIC_STRNCPY_FROM_USER if MMU
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 225c95da23ce..7f453e7cb9e4 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -10,6 +10,7 @@ config MIPS
 	select ARCH_SUPPORTS_UPROBES
 	select ARCH_USE_BUILTIN_BSWAP
 	select ARCH_USE_CMPXCHG_LOCKREF if 64BIT
+	select ARCH_HAVE_CMPXCHG64 if 64BIT
 	select ARCH_USE_QUEUED_RWLOCKS
 	select ARCH_USE_QUEUED_SPINLOCKS
 	select ARCH_WANT_IPC_PARSE_VERSION
diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index fc5a574c3482..deae8720861e 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -31,6 +31,7 @@ config PARISC
 	select GENERIC_IRQ_PROBE
 	select GENERIC_PCI_IOMAP
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_CMPXCHG64
 	select GENERIC_SMP_IDLE_THREAD
 	select GENERIC_CPU_DEVICES
 	select GENERIC_STRNCPY_FROM_USER
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c32a181a7cbb..315fc1e46baa 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -150,6 +150,7 @@ config PPC
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_ZONE_DEVICE		if PPC_BOOK3S_64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
 	select ARCH_OPTIONAL_KERNEL_RWX		if ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 199ac3e4da1d..b06b34ae8b3d 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -74,6 +74,7 @@ config S390
 	select ARCH_HAS_STRICT_MODULE_RWX
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_INLINE_READ_LOCK
 	select ARCH_INLINE_READ_LOCK_BH
 	select ARCH_INLINE_READ_LOCK_IRQ
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 8767e45f1b2b..cd41bf0d5afd 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -76,6 +76,7 @@ config SPARC64
 	select PERF_USE_VMALLOC
 	select IRQ_PREFLOW_FASTEOI
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_CMPXCHG64
 	select HAVE_C_RECORDMCOUNT
 	select NO_BOOTMEM
 	select HAVE_ARCH_AUDITSYSCALL
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c07f492b871a..a5bcbb179ff2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -68,6 +68,7 @@ config X86
 	select ARCH_HAS_UBSAN_SANITIZE_ALL
 	select ARCH_HAS_ZONE_DEVICE		if X86_64
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
+	select ARCH_HAVE_CMPXCHG64
 	select ARCH_MIGHT_HAVE_ACPI_PDC		if ACPI
 	select ARCH_MIGHT_HAVE_PC_PARPORT
 	select ARCH_MIGHT_HAVE_PC_SERIO
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index c921e8bccdc8..8234278a821d 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -23,6 +23,7 @@ config XTENSA
 	select HAVE_DMA_CONTIGUOUS
 	select HAVE_EXIT_THREAD
 	select HAVE_FUNCTION_TRACER
+	select ARCH_HAVE_CMPXCHG64
 	select HAVE_FUTEX_CMPXCHG if !MMU
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS
 	select HAVE_IRQ_TIME_ACCOUNTING
-- 
2.16.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again
  2018-05-15 22:51 [PATCH v10 0/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
                   ` (3 preceding siblings ...)
  2018-05-15 22:51 ` [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
@ 2018-05-15 22:51 ` Bart Van Assche
  2018-05-16 12:51   ` Christoph Hellwig
  4 siblings, 1 reply; 18+ messages in thread
From: Bart Van Assche @ 2018-05-15 22:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-kernel, Christoph Hellwig, Bart Van Assche,
	Tejun Heo, Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

Recently the blk-mq timeout handling code was reworked. See also Tejun
Heo, "[PATCHSET v4] blk-mq: reimplement timeout handling", 08 Jan 2018
(https://www.mail-archive.com/linux-block@vger.kernel.org/msg16985.html).
This patch reworks the blk-mq timeout handling code again. The timeout
handling code is simplified by introducing a state machine per request.
This change prevents the blk-mq timeout handling code from ignoring
completions that occur after blk_mq_check_expired() has been called and
before blk_mq_rq_timed_out() has reset rq->aborted_gstate. Without it, if
a block driver timeout handler always returns BLK_EH_RESET_TIMER, such an
ignored completion means that the request never terminates.

Fix this race as follows:
- Replace the __deadline member of struct request by a new member
  called das that contains the generation number, state and deadline.
  Only 32 bits are used for the deadline field such that all three
  fields occupy only 64 bits. This change reduces the maximum supported
  request timeout value from (2**63/HZ) to (2**31/HZ).
- Remove all request member variables that became superfluous due to
  this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to
  this patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
  the RCU lock and unlock statements in blk_mq_complete_request() and
  also the synchronize_rcu() call in the timeout handler.

Notes:
- A spinlock is used to protect atomic changes of rq->das on those
  architectures that do not provide a cmpxchg64() implementation.
- Atomic instructions are only used to update the request state if
  a concurrent request state change could be in progress.
- blk_add_timer() has been split into two functions - one for the
  legacy block layer and one for blk-mq.
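
As an illustration (simplified sketch only; the real logic is in
blk_mq_rq_set_deadline() in the blk-mq.h hunk below, which also avoids the
cmpxchg64() when no concurrent update is possible), marking an idle request
in flight and setting its deadline boils down to:

  union blk_deadline_and_state old_val, new_val;

  /* Update generation, state and deadline in one atomic 64-bit operation. */
  do {
          old_val = READ_ONCE(rq->das);
          if (old_val.state != MQ_RQ_IDLE)
                  break;                  /* another context owns @rq */
          new_val = old_val;
          new_val.deadline = jiffies + rq->timeout;
          new_val.state = MQ_RQ_IN_FLIGHT;
          new_val.generation++;           /* new recycle instance */
  } while (!blk_mq_set_rq_state(rq, old_val, new_val));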

Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Sebastian Ott <sebott@linux.ibm.com>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Israel Rukshin <israelr@mellanox.com>
Cc: Max Gurtovoy <maxg@mellanox.com>
---
 block/blk-core.c       |   6 --
 block/blk-mq-debugfs.c |   1 -
 block/blk-mq.c         | 183 ++++++++++++++-----------------------------------
 block/blk-mq.h         | 117 +++++++++++++++++++++++--------
 block/blk-timeout.c    |  95 ++++++++++++++-----------
 block/blk.h            |  14 ++--
 include/linux/blkdev.h |  47 ++++++-------
 7 files changed, 230 insertions(+), 233 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 341501c5e239..a7301fcda734 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -198,12 +198,6 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->internal_tag = -1;
 	rq->start_time_ns = ktime_get_ns();
 	rq->part = NULL;
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * See comment of blk_mq_init_request
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
 }
 EXPORT_SYMBOL(blk_rq_init);
 
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 3080e18cb859..ffa622366922 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -344,7 +344,6 @@ static const char *const rqf_name[] = {
 	RQF_NAME(STATS),
 	RQF_NAME(SPECIAL_PAYLOAD),
 	RQF_NAME(ZONE_WRITE_LOCKED),
-	RQF_NAME(MQ_TIMEOUT_EXPIRED),
 	RQF_NAME(MQ_POLL_SLEPT),
 };
 #undef RQF_NAME
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 64630caaf27e..40c9aa085613 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -318,7 +318,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->special = NULL;
 	/* tag was already set */
 	rq->extra_len = 0;
-	rq->__deadline = 0;
+	WARN_ON_ONCE(rq->das.state != MQ_RQ_IDLE);
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -494,7 +494,8 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
+		WARN_ON_ONCE(true);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -547,8 +548,7 @@ static void __blk_mq_complete_request(struct request *rq)
 	bool shared = false;
 	int cpu;
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);
-	blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_COMPLETE);
 
 	if (rq->internal_tag != -1)
 		blk_mq_sched_completed_request(rq);
@@ -593,36 +593,6 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int *srcu_idx)
 		*srcu_idx = srcu_read_lock(hctx->srcu);
 }
 
-static void blk_mq_rq_update_aborted_gstate(struct request *rq, u64 gstate)
-{
-	unsigned long flags;
-
-	/*
-	 * blk_mq_rq_aborted_gstate() is used from the completion path and
-	 * can thus be called from irq context.  u64_stats_fetch in the
-	 * middle of update on the same CPU leads to lockup.  Disable irq
-	 * while updating.
-	 */
-	local_irq_save(flags);
-	u64_stats_update_begin(&rq->aborted_gstate_sync);
-	rq->aborted_gstate = gstate;
-	u64_stats_update_end(&rq->aborted_gstate_sync);
-	local_irq_restore(flags);
-}
-
-static u64 blk_mq_rq_aborted_gstate(struct request *rq)
-{
-	unsigned int start;
-	u64 aborted_gstate;
-
-	do {
-		start = u64_stats_fetch_begin(&rq->aborted_gstate_sync);
-		aborted_gstate = rq->aborted_gstate;
-	} while (u64_stats_fetch_retry(&rq->aborted_gstate_sync, start));
-
-	return aborted_gstate;
-}
-
 /**
  * blk_mq_complete_request - end I/O on a request
  * @rq:		the request being processed
@@ -634,27 +604,12 @@ static u64 blk_mq_rq_aborted_gstate(struct request *rq)
 void blk_mq_complete_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu);
-	int srcu_idx;
 
 	if (unlikely(blk_should_fake_timeout(q)))
 		return;
 
-	/*
-	 * If @rq->aborted_gstate equals the current instance, timeout is
-	 * claiming @rq and we lost.  This is synchronized through
-	 * hctx_lock().  See blk_mq_timeout_work() for details.
-	 *
-	 * Completion path never blocks and we can directly use RCU here
-	 * instead of hctx_lock() which can be either RCU or SRCU.
-	 * However, that would complicate paths which want to synchronize
-	 * against us.  Let stay in sync with the issue path so that
-	 * hctx_lock() covers both issue and completion paths.
-	 */
-	hctx_lock(hctx, &srcu_idx);
-	if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE))
 		__blk_mq_complete_request(rq);
-	hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
 
@@ -681,27 +636,7 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, rq);
 	}
 
-	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
-
-	/*
-	 * Mark @rq in-flight which also advances the generation number,
-	 * and register for timeout.  Protect with a seqcount to allow the
-	 * timeout path to read both @rq->gstate and @rq->deadline
-	 * coherently.
-	 *
-	 * This is the only place where a request is marked in-flight.  If
-	 * the timeout path reads an in-flight @rq->gstate, the
-	 * @rq->deadline it reads together under @rq->gstate_seq is
-	 * guaranteed to be the matching one.
-	 */
-	preempt_disable();
-	write_seqcount_begin(&rq->gstate_seq);
-
-	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
-	blk_add_timer(rq);
-
-	write_seqcount_end(&rq->gstate_seq);
-	preempt_enable();
+	blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -714,22 +649,19 @@ void blk_mq_start_request(struct request *rq)
 }
 EXPORT_SYMBOL(blk_mq_start_request);
 
-/*
- * When we reach here because queue is busy, it's safe to change the state
- * to IDLE without checking @rq->aborted_gstate because we should still be
- * holding the RCU read lock and thus protected against timeout.
- */
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
+	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, rq);
 
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
-		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+	if (old_state != MQ_RQ_IDLE) {
+		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+			WARN_ON_ONCE(true);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -838,8 +770,6 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
 
-	req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
-
 	if (ops->timeout)
 		ret = ops->timeout(req, reserved);
 
@@ -848,13 +778,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
-		blk_add_timer(req);
+		blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -868,48 +792,50 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	unsigned long gstate, deadline;
-	int start;
-
-	might_sleep();
-
-	if (rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED)
-		return;
+	union blk_deadline_and_state das = READ_ONCE(rq->das);
+	unsigned long now = jiffies;
+	int32_t diff_jiffies = das.deadline - now;
+	int32_t diff_next = das.deadline - data->next;
+	enum mq_rq_state rq_state = das.state;
 
-	/* read coherent snapshots of @rq->state_gen and @rq->deadline */
-	while (true) {
-		start = read_seqcount_begin(&rq->gstate_seq);
-		gstate = READ_ONCE(rq->gstate);
-		deadline = blk_rq_deadline(rq);
-		if (!read_seqcount_retry(&rq->gstate_seq, start))
-			break;
-		cond_resched();
-	}
-
-	/* if in-flight && overdue, mark for abortion */
-	if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
-	    time_after_eq(jiffies, deadline)) {
-		blk_mq_rq_update_aborted_gstate(rq, gstate);
+	/*
+	 * Make sure that rq->aborted_gstate != rq->das if rq->deadline has not
+	 * yet been reached even if a request gets recycled before
+	 * blk_mq_terminate_expired() is called and the value of rq->deadline
+	 * is not modified due to the request recycling.
+	 */
+	rq->aborted_gstate = das;
+	rq->aborted_gstate.generation ^= (1UL << 29);
+	if (diff_jiffies <= 0 && rq_state == MQ_RQ_IN_FLIGHT) {
+		/* request timed out */
+		rq->aborted_gstate = das;
 		data->nr_expired++;
 		hctx->nr_expired++;
-	} else if (!data->next_set || time_after(data->next, deadline)) {
-		data->next = deadline;
+	} else if (!data->next_set) {
+		/* data->next is not yet set; set it to deadline. */
+		data->next = now + diff_jiffies;
 		data->next_set = 1;
+	} else if (diff_next < 0) {
+		/* data->next is later than deadline; reduce data->next. */
+		data->next += diff_next;
 	}
+
 }
 
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
+	union blk_deadline_and_state old_val = rq->aborted_gstate;
+	union blk_deadline_and_state new_val = old_val;
+
+	new_val.state = MQ_RQ_COMPLETE;
+
 	/*
-	 * We marked @rq->aborted_gstate and waited for RCU.  If there were
-	 * completions that we lost to, they would have finished and
-	 * updated @rq->gstate by now; otherwise, the completion path is
-	 * now guaranteed to see @rq->aborted_gstate and yield.  If
-	 * @rq->aborted_gstate still matches @rq->gstate, @rq is ours.
+	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
+	 * calls. If rq->das has not changed that means that it
+	 * is now safe to change the request state and to handle the timeout.
 	 */
-	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
-	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
+	if (blk_mq_set_rq_state(rq, old_val, new_val))
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
@@ -948,10 +874,10 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		bool has_rcu = false;
 
 		/*
-		 * Wait till everyone sees ->aborted_gstate.  The
-		 * sequential waits for SRCUs aren't ideal.  If this ever
-		 * becomes a problem, we can add per-hw_ctx rcu_head and
-		 * wait in parallel.
+		 * For very short timeouts it can happen that
+		 * blk_mq_check_expired() modifies the state of a request
+		 * while .queue_rq() is still in progress. Hence wait until
+		 * these .queue_rq() calls have finished.
 		 */
 		queue_for_each_hw_ctx(q, hctx, i) {
 			if (!hctx->nr_expired)
@@ -967,7 +893,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
 		if (has_rcu)
 			synchronize_rcu();
 
-		/* terminate the ones we won */
+		/* Terminate the requests marked by blk_mq_check_expired(). */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
 	}
 
@@ -2057,21 +1983,16 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 {
 	int ret;
 
+#ifndef CONFIG_ARCH_HAVE_CMPXCHG64
+	spin_lock_init(&rq->das_lock);
+#endif
+
 	if (set->ops->init_request) {
 		ret = set->ops->init_request(set, rq, hctx_idx, node);
 		if (ret)
 			return ret;
 	}
 
-	seqcount_init(&rq->gstate_seq);
-	u64_stats_init(&rq->aborted_gstate_sync);
-	/*
-	 * start gstate with gen 1 instead of 0, otherwise it will be equal
-	 * to aborted_gstate, and be identified timed out by
-	 * blk_mq_terminate_expired.
-	 */
-	WRITE_ONCE(rq->gstate, MQ_RQ_GEN_INC);
-
 	return 0;
 }
 
diff --git a/block/blk-mq.h b/block/blk-mq.h
index e1bb420dc5d6..42320011d264 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -2,6 +2,7 @@
 #ifndef INT_BLK_MQ_H
 #define INT_BLK_MQ_H
 
+#include <asm/cmpxchg.h>
 #include "blk-stat.h"
 #include "blk-mq-tag.h"
 
@@ -30,18 +31,11 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/*
- * Bits for request->gstate.  The lower two bits carry MQ_RQ_* state value
- * and the upper bits the generation number.
- */
+/* See also blk_deadline_and_state.state. */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
 	MQ_RQ_COMPLETE		= 2,
-
-	MQ_RQ_STATE_BITS	= 2,
-	MQ_RQ_STATE_MASK	= (1 << MQ_RQ_STATE_BITS) - 1,
-	MQ_RQ_GEN_INC		= 1 << MQ_RQ_STATE_BITS,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -103,37 +97,106 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
+static inline bool blk_mq_set_rq_state(struct request *rq,
+				       union blk_deadline_and_state old_val,
+				       union blk_deadline_and_state new_val)
+{
+#ifdef CONFIG_ARCH_HAVE_CMPXCHG64
+	return cmpxchg64(&rq->das.das, old_val.das, new_val.das) == old_val.das;
+#else
+	unsigned long flags;
+	bool res = false;
+
+	spin_lock_irqsave(&rq->das_lock, flags);
+	if (rq->das.das == old_val.das) {
+		rq->das = new_val;
+		res = true;
+	}
+	spin_unlock_irqrestore(&rq->das_lock, flags);
+
+	return res;
+#endif
+}
+
+/*
+ * If the state of request @rq equals @old_state, update deadline and request
+ * state atomically to @time and @new_state. blk-mq only. cmpxchg64() is only
+ * used if there could be a concurrent update attempt from another context.
+ */
+static inline bool blk_mq_rq_set_deadline(struct request *rq,
+					  unsigned long new_time,
+					  enum mq_rq_state old_state,
+					  enum mq_rq_state new_state)
+{
+	union blk_deadline_and_state old_val, new_val;
+
+	if (old_state != MQ_RQ_IN_FLIGHT) {
+		old_val = READ_ONCE(rq->das);
+		if (old_val.state != old_state)
+			return false;
+		new_val = old_val;
+		new_val.deadline = new_time;
+		new_val.state = new_state;
+		if (new_state == MQ_RQ_IN_FLIGHT)
+			new_val.generation++;
+		WRITE_ONCE(rq->das.das, new_val.das);
+		return true;
+	}
+
+	do {
+		old_val = READ_ONCE(rq->das);
+		if (old_val.state != old_state)
+			return false;
+		new_val = old_val;
+		new_val.deadline = new_time;
+		new_val.state = new_state;
+		if (new_state == MQ_RQ_IN_FLIGHT)
+			new_val.generation++;
+	} while (!blk_mq_set_rq_state(rq, old_val, new_val));
+
+	return true;
+}
+
 /**
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
-static inline int blk_mq_rq_state(struct request *rq)
+static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
+	return rq->das.state;
 }
 
 /**
- * blk_mq_rq_update_state() - set the current MQ_RQ_* state of a request
- * @rq: target request.
- * @state: new state to set.
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
  *
- * Set @rq's state to @state.  The caller is responsible for ensuring that
- * there are no other updaters.  A request can transition into IN_FLIGHT
- * only from IDLE and doing so increments the generation number.
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
  */
-static inline void blk_mq_rq_update_state(struct request *rq,
-					  enum mq_rq_state state)
+static inline bool blk_mq_change_rq_state(struct request *rq,
+					  enum mq_rq_state old_state,
+					  enum mq_rq_state new_state)
 {
-	u64 old_val = READ_ONCE(rq->gstate);
-	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-	if (state == MQ_RQ_IN_FLIGHT) {
-		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-		new_val += MQ_RQ_GEN_INC;
+	union blk_deadline_and_state das = READ_ONCE(rq->das);
+	union blk_deadline_and_state old_val = das;
+	union blk_deadline_and_state new_val = das;
+
+	old_val.state = old_state;
+	new_val.state = new_state;
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		new_val.generation++;
+	/*
+	 * For transitions from state in-flight to another state cmpxchg64()
+	 * must be used. For other state transitions it is safe to use
+	 * WRITE_ONCE().
+	 */
+	if (old_state != MQ_RQ_IN_FLIGHT) {
+		WRITE_ONCE(rq->das.das, new_val.das);
+		return true;
 	}
-
-	/* avoid exposing interim values */
-	WRITE_ONCE(rq->gstate, new_val);
+	return blk_mq_set_rq_state(rq, old_val, new_val);
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 652d4d4d3e97..549dfda7190c 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -162,8 +162,9 @@ void blk_abort_request(struct request *req)
 		 * immediately and that scan sees the new timeout value.
 		 * No need for fancy synchronizations.
 		 */
-		blk_rq_set_deadline(req, jiffies);
-		kblockd_schedule_work(&req->q->timeout_work);
+		if (blk_mq_rq_set_deadline(req, jiffies, MQ_RQ_IN_FLIGHT,
+					   MQ_RQ_IN_FLIGHT))
+			kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
@@ -184,52 +185,17 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-/**
- * blk_add_timer - Start timeout timer for a single request
- * @req:	request that is about to start running.
- *
- * Notes:
- *    Each request has its own timer, and as it is added to the queue, we
- *    set up the timer. When the request completes, we cancel the timer.
- */
-void blk_add_timer(struct request *req)
+static void __blk_add_timer(struct request *req, unsigned long deadline)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
 
-	if (!q->mq_ops)
-		lockdep_assert_held(q->queue_lock);
-
-	/* blk-mq has its own handler, so we don't need ->rq_timed_out_fn */
-	if (!q->mq_ops && !q->rq_timed_out_fn)
-		return;
-
-	BUG_ON(!list_empty(&req->timeout_list));
-
-	/*
-	 * Some LLDs, like scsi, peek at the timeout to prevent a
-	 * command from being retried forever.
-	 */
-	if (!req->timeout)
-		req->timeout = q->rq_timeout;
-
-	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
-
-	/*
-	 * Only the non-mq case needs to add the request to a protected list.
-	 * For the mq case we simply scan the tag map.
-	 */
-	if (!q->mq_ops)
-		list_add_tail(&req->timeout_list, &req->q->timeout_list);
-
 	/*
 	 * If the timer isn't already pending or this timeout is earlier
 	 * than an existing one, modify the timer. Round up to next nearest
 	 * second.
 	 */
-	expiry = blk_rq_timeout(round_jiffies_up(blk_rq_deadline(req)));
-
+	expiry = blk_rq_timeout(round_jiffies_up(deadline));
 	if (!timer_pending(&q->timeout) ||
 	    time_before(expiry, q->timeout.expires)) {
 		unsigned long diff = q->timeout.expires - expiry;
@@ -244,5 +210,56 @@ void blk_add_timer(struct request *req)
 		if (!timer_pending(&q->timeout) || (diff >= HZ / 2))
 			mod_timer(&q->timeout, expiry);
 	}
+}
+
+/**
+ * blk_add_timer - Start timeout timer for a single request
+ * @req:	request that is about to start running.
+ *
+ * Notes:
+ *    Each request has its own timer, and as it is added to the queue, we
+ *    set up the timer. When the request completes, we cancel the timer.
+ */
+void blk_add_timer(struct request *req)
+{
+	struct request_queue *q = req->q;
+	unsigned long deadline;
+
+	lockdep_assert_held(q->queue_lock);
+
+	if (!q->rq_timed_out_fn)
+		return;
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
 
+	deadline = jiffies + req->timeout;
+	blk_rq_set_deadline(req, deadline);
+	list_add_tail(&req->timeout_list, &req->q->timeout_list);
+
+	return __blk_add_timer(req, deadline);
+}
+
+/**
+ * blk_mq_add_timer - set the deadline for a single request
+ * @req:	request for which to set the deadline.
+ * @old:	current request state.
+ * @new:	new request state.
+ *
+ * Sets the deadline of a request if and only if it has state @old and
+ * at the same time changes the request state from @old into @new. The caller
+ * must guarantee that the request state won't be modified while this function
+ * is in progress.
+ */
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new)
+{
+	struct request_queue *q = req->q;
+	unsigned long deadline;
+
+	if (!req->timeout)
+		req->timeout = q->rq_timeout;
+	deadline = jiffies + req->timeout;
+	if (!blk_mq_rq_set_deadline(req, deadline, old, new))
+		WARN_ON_ONCE(true);
+	return __blk_add_timer(req, deadline);
 }
diff --git a/block/blk.h b/block/blk.h
index eaf1a8e87d11..984367308b11 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,6 +170,8 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
+void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
+		      enum mq_rq_state new);
 void blk_delete_timer(struct request *);
 
 
@@ -191,21 +193,21 @@ void blk_account_io_done(struct request *req, u64 now);
 /*
  * EH timer and IO completion will both attempt to 'grab' the request, make
  * sure that only one of them succeeds. Steal the bottom bit of the
- * __deadline field for this.
+ * das field for this.
  */
 static inline int blk_mark_rq_complete(struct request *rq)
 {
-	return test_and_set_bit(0, &rq->__deadline);
+	return test_and_set_bit(0, &rq->das.legacy_deadline);
 }
 
 static inline void blk_clear_rq_complete(struct request *rq)
 {
-	clear_bit(0, &rq->__deadline);
+	clear_bit(0, &rq->das.legacy_deadline);
 }
 
 static inline bool blk_rq_is_complete(struct request *rq)
 {
-	return test_bit(0, &rq->__deadline);
+	return test_bit(0, &rq->das.legacy_deadline);
 }
 
 /*
@@ -314,12 +316,12 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
  */
 static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
 {
-	rq->__deadline = time & ~0x1UL;
+	rq->das.legacy_deadline = time & ~0x1UL;
 }
 
 static inline unsigned long blk_rq_deadline(struct request *rq)
 {
-	return rq->__deadline & ~0x1UL;
+	return rq->das.legacy_deadline & ~0x1UL;
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f3999719f828..771dbdc8758c 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -27,8 +27,6 @@
 #include <linux/percpu-refcount.h>
 #include <linux/scatterlist.h>
 #include <linux/blkzoned.h>
-#include <linux/seqlock.h>
-#include <linux/u64_stats_sync.h>
 
 struct module;
 struct scsi_ioctl_command;
@@ -125,15 +123,23 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_SPECIAL_PAYLOAD	((__force req_flags_t)(1 << 18))
 /* The per-zone write lock is held for this request */
 #define RQF_ZONE_WRITE_LOCKED	((__force req_flags_t)(1 << 19))
-/* timeout is expired */
-#define RQF_MQ_TIMEOUT_EXPIRED	((__force req_flags_t)(1 << 20))
 /* already slept for hybrid poll */
-#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 21))
+#define RQF_MQ_POLL_SLEPT	((__force req_flags_t)(1 << 20))
 
 /* flags that prevent us from merging requests: */
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
+union blk_deadline_and_state {
+	struct {
+		uint32_t generation:30;
+		uint32_t state:2;
+		uint32_t deadline;
+	};
+	unsigned long legacy_deadline;
+	uint64_t das;
+};
+
 /*
  * Try to put the fields that are referenced together in the same cacheline.
  *
@@ -236,29 +242,24 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	/*
-	 * On blk-mq, the lower bits of ->gstate (generation number and
-	 * state) carry the MQ_RQ_* state value and the upper bits the
-	 * generation number which is monotonically incremented and used to
-	 * distinguish the reuse instances.
-	 *
-	 * ->gstate_seq allows updates to ->gstate and other fields
-	 * (currently ->deadline) during request start to be read
-	 * atomically from the timeout path, so that it can operate on a
-	 * coherent set of information.
-	 */
-	seqcount_t gstate_seq;
-	u64 gstate;
-
 	/*
 	 * ->aborted_gstate is used by the timeout to claim a specific
 	 * recycle instance of this request.  See blk_mq_timeout_work().
 	 */
-	struct u64_stats_sync aborted_gstate_sync;
-	u64 aborted_gstate;
+	union blk_deadline_and_state aborted_gstate;
+
+#ifndef CONFIG_ARCH_HAVE_CMPXCHG64
+	spinlock_t das_lock;
+#endif
 
-	/* access through blk_rq_set_deadline, blk_rq_deadline */
-	unsigned long __deadline;
+	/*
+	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
+	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
+	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state(),
+	 * blk_mq_change_rq_state() and blk_mq_rq_set_deadline() for
+	 * blk-mq queues.
+	 */
+	union blk_deadline_and_state das;
 
 	struct list_head timeout_list;
 
-- 
2.16.3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  2018-05-15 22:51 ` [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
@ 2018-05-16  4:15   ` Michael Ellerman
  2018-05-18 20:33   ` Palmer Dabbelt
  1 sibling, 0 replies; 18+ messages in thread
From: Michael Ellerman @ 2018-05-16  4:15 UTC (permalink / raw)
  To: Bart Van Assche, Jens Axboe
  Cc: linux-block, linux-kernel, Christoph Hellwig, Bart Van Assche,
	Catalin Marinas, Will Deacon, Tony Luck, Fenghua Yu,
	Geert Uytterhoeven, James E.J. Bottomley, Helge Deller,
	Benjamin Herrenschmidt, Paul Mackerras, Martin Schwidefsky,
	Heiko Carstens, David S . Miller, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, Chris Zankel, Max Filippov, Arnd Bergmann,
	Jonathan Corbet

Bart Van Assche <bart.vanassche@wdc.com> writes:

> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index c32a181a7cbb..901365d12dcd 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -149,6 +149,7 @@ config PPC
>  	select ARCH_HAS_UACCESS_FLUSHCACHE	if PPC64
>  	select ARCH_HAS_UBSAN_SANITIZE_ALL
>  	select ARCH_HAS_ZONE_DEVICE		if PPC_BOOK3S_64
> +	select ARCH_HAVE_CMPXCHG64		if PPC64
>  	select ARCH_HAVE_NMI_SAFE_CMPXCHG
>  	select ARCH_MIGHT_HAVE_PC_PARPORT
>  	select ARCH_MIGHT_HAVE_PC_SERIO

LGTM. Thanks.

Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)

cheers

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  2018-05-15 22:51 ` [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
@ 2018-05-16  6:07   ` Christoph Hellwig
  0 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2018-05-16  6:07 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-kernel, Christoph Hellwig,
	Catalin Marinas, Will Deacon, Tony Luck, Fenghua Yu,
	Geert Uytterhoeven, James E.J. Bottomley, Helge Deller,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, David S . Miller,
	Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Chris Zankel,
	Max Filippov, Arnd Bergmann

> +config ARCH_HAVE_CMPXCHG64
> +	bool

64-bit architectures must support this, as long is 64 bits wide.

So this should have a

	default y if 64BIT

which also means you only need to explicitly select in on 32-bit
architectures that support 64-bit cmpxchg.
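
E.g. (sketch only; wherever ARCH_HAVE_CMPXCHG64 ends up being defined):

	config ARCH_HAVE_CMPXCHG64
		bool
		default y if 64BIT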

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again
  2018-05-15 22:51 ` [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
@ 2018-05-16 12:51   ` Christoph Hellwig
  2018-05-16 16:17       ` Bart Van Assche
  0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2018-05-16 12:51 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Jens Axboe, linux-block, linux-kernel, Christoph Hellwig,
	Tejun Heo, Jianchao Wang, Ming Lei, Sebastian Ott, Sagi Grimberg,
	Israel Rukshin, Max Gurtovoy

I've been looking at this carefully, and I don't think we need cmpxchg64
at all, and we don't need anywhere near as many cmpxchg operations either.

The only reason to include the deadline in the atomic operation is the
blk_abort_request case, as the blk_mq_add_timer never modifies the
deadline of a request that someone could be racing with.  So if we
introduce a new aborted state for use by blk_abort_request we can modify
the deadline separately (and in fact have a common field with the legacy
path).

Also anytime we call blk_mq_change_rq_state and don't care about the
previous state it is a pretty clear indicator that we aren't racing
with anyone else.  For blk_mq_free_request that obviously is the case
as the request must have completed before, and for the requeue case
we assume that.  If it isn't the case we have a deeper problem going
way back.

Below is a lightly tested patch on top of yours to implement these
suggestions, and also making sure we only access the generation/state
using WRITE_ONCE, READ_ONCE and cmpxchg to avoid any races due to
reordering.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 40c9aa085613..5960e14025e6 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -274,6 +274,40 @@ bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
 }
 EXPORT_SYMBOL(blk_mq_can_queue);
 
+/**
+ * blk_mq_change_rq_state - atomically test and set request state
+ * @rq: Request pointer.
+ * @old_state: Old request state.
+ * @new_state: New request state.
+ *
+ * Returns %true if and only if the old state was @old and if the state has
+ * been changed into @new.
+ */
+static inline bool blk_mq_change_rq_state(struct request *rq,
+		enum mq_rq_state old_state, enum mq_rq_state new_state)
+{
+	union request_sag sag = READ_ONCE(rq->sag);
+	union request_sag old = sag, new = sag;
+
+	BUILD_BUG_ON(sizeof(union request_sag) != sizeof(u32));
+
+	old.state = old_state;
+	new.state = new_state;
+	return cmpxchg(&rq->sag.val, old.val, new.val) == old.val;
+}
+
+static inline void blk_mq_set_rq_state(struct request *rq,
+		enum mq_rq_state new_state)
+{
+	union request_sag sag = READ_ONCE(rq->sag);
+
+	if (new_state == MQ_RQ_IN_FLIGHT)
+		sag.generation++;
+	sag.state = new_state;
+
+	WRITE_ONCE(rq->sag, sag);
+}
+
 static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 		unsigned int tag, unsigned int op)
 {
@@ -318,7 +352,7 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
 	rq->special = NULL;
 	/* tag was already set */
 	rq->extra_len = 0;
-	WARN_ON_ONCE(rq->das.state != MQ_RQ_IDLE);
+	WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IDLE);
 
 	INIT_LIST_HEAD(&rq->timeout_list);
 	rq->timeout = 0;
@@ -494,8 +528,7 @@ void blk_mq_free_request(struct request *rq)
 	if (blk_rq_rl(rq))
 		blk_put_rl(blk_rq_rl(rq));
 
-	if (!blk_mq_change_rq_state(rq, blk_mq_rq_state(rq), MQ_RQ_IDLE))
-		WARN_ON_ONCE(true);
+	blk_mq_set_rq_state(rq, MQ_RQ_IDLE);
 	if (rq->tag != -1)
 		blk_mq_put_tag(hctx, hctx->tags, ctx, rq->tag);
 	if (sched_tag != -1)
@@ -619,6 +652,14 @@ int blk_mq_request_started(struct request *rq)
 }
 EXPORT_SYMBOL_GPL(blk_mq_request_started);
 
+static inline void __blk_mq_start_request(struct request *rq)
+{
+	rq->__deadline = jiffies +
+		(rq->timeout ? rq->timeout : rq->q->rq_timeout);
+	__blk_add_timer(rq, rq->__deadline);
+	blk_mq_set_rq_state(rq, MQ_RQ_IN_FLIGHT);
+}
+
 void blk_mq_start_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
@@ -636,7 +677,7 @@ void blk_mq_start_request(struct request *rq)
 		wbt_issue(q->rq_wb, rq);
 	}
 
-	blk_mq_add_timer(rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT);
+	__blk_mq_start_request(rq);
 
 	if (q->dma_drain_size && blk_rq_bytes(rq)) {
 		/*
@@ -652,16 +693,18 @@ EXPORT_SYMBOL(blk_mq_start_request);
 static void __blk_mq_requeue_request(struct request *rq)
 {
 	struct request_queue *q = rq->q;
-	enum mq_rq_state old_state = blk_mq_rq_state(rq);
 
 	blk_mq_put_driver_tag(rq);
 
 	trace_block_rq_requeue(q, rq);
 	wbt_requeue(q->rq_wb, rq);
 
-	if (old_state != MQ_RQ_IDLE) {
-		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
-			WARN_ON_ONCE(true);
+	/*
+	 * Note: the driver must ensure this doesn't race with completions or
+	 * timeouts.
+	 */
+	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
+		blk_mq_set_rq_state(rq, MQ_RQ_IDLE);
 		if (q->dma_drain_size && blk_rq_bytes(rq))
 			rq->nr_phys_segments--;
 	}
@@ -778,7 +821,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved)
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		blk_mq_add_timer(req, MQ_RQ_COMPLETE, MQ_RQ_IN_FLIGHT);
+		__blk_mq_start_request(req);
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -792,23 +835,26 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
 	struct blk_mq_timeout_data *data = priv;
-	union blk_deadline_and_state das = READ_ONCE(rq->das);
 	unsigned long now = jiffies;
-	int32_t diff_jiffies = das.deadline - now;
-	int32_t diff_next = das.deadline - data->next;
-	enum mq_rq_state rq_state = das.state;
+	unsigned long deadline = READ_ONCE(rq->__deadline);
+	union request_sag sag = READ_ONCE(rq->sag);
+	union request_sag aborted_sag = sag;
+	long diff_jiffies = deadline - now;
+	long diff_next = deadline - data->next;
 
 	/*
-	 * Make sure that rq->aborted_gstate != rq->das if rq->deadline has not
+	 * Make sure that rq->aborted_sag != rq->sag if rq->deadline has not
 	 * yet been reached even if a request gets recycled before
-	 * blk_mq_terminate_expired() is called and the value of rq->deadline
+	 * blk_mq_terminate_expired() is called and the value of rq->__deadline
 	 * is not modified due to the request recycling.
 	 */
-	rq->aborted_gstate = das;
-	rq->aborted_gstate.generation ^= (1UL << 29);
-	if (diff_jiffies <= 0 && rq_state == MQ_RQ_IN_FLIGHT) {
-		/* request timed out */
-		rq->aborted_gstate = das;
+	aborted_sag.generation ^= (1u << 29);
+	WRITE_ONCE(rq->aborted_sag, aborted_sag);
+
+	if (sag.state == MQ_RQ_ABORTED ||
+	    (sag.state == MQ_RQ_IN_FLIGHT && diff_jiffies <= 0)) {
+		/* request aborted or timed out */
+		WRITE_ONCE(rq->aborted_sag, sag);
 		data->nr_expired++;
 		hctx->nr_expired++;
 	} else if (!data->next_set) {
@@ -819,23 +865,21 @@ static void blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
 		/* data->next is later than deadline; reduce data->next. */
 		data->next += diff_next;
 	}
-
 }
 
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
-	union blk_deadline_and_state old_val = rq->aborted_gstate;
-	union blk_deadline_and_state new_val = old_val;
-
-	new_val.state = MQ_RQ_COMPLETE;
+	union request_sag old = READ_ONCE(rq->aborted_sag);
+	union request_sag new = old;
 
 	/*
-	 * We marked @rq->aborted_gstate and waited for ongoing .queue_rq()
-	 * calls. If rq->das has not changed that means that it
-	 * is now safe to change the request state and to handle the timeout.
+	 * We marked @rq->aborted_sag and waited for ongoing .queue_rq() calls.
+	 * If rq->sag has not changed that means that it is now safe to change
+	 * the request state and to handle the timeout.
 	 */
-	if (blk_mq_set_rq_state(rq, old_val, new_val))
+	new.state = MQ_RQ_COMPLETE;
+	if (cmpxchg(&rq->sag.val, old.val, new.val) == old.val)
 		blk_mq_rq_timed_out(rq, reserved);
 }
 
@@ -916,6 +960,12 @@ static void blk_mq_timeout_work(struct work_struct *work)
 	blk_queue_exit(q);
 }
 
+void blk_mq_abort_request(struct request *rq)
+{
+	if (blk_mq_change_rq_state(rq, MQ_RQ_IN_FLIGHT, MQ_RQ_ABORTED))
+		kblockd_schedule_work(&rq->q->timeout_work);
+}
+
 struct flush_busy_ctx_data {
 	struct blk_mq_hw_ctx *hctx;
 	struct list_head *list;
@@ -1983,10 +2033,6 @@ static int blk_mq_init_request(struct blk_mq_tag_set *set, struct request *rq,
 {
 	int ret;
 
-#ifndef CONFIG_ARCH_HAVE_CMPXCHG64
-	spin_lock_init(&rq->das_lock);
-#endif
-
 	if (set->ops->init_request) {
 		ret = set->ops->init_request(set, rq, hctx_idx, node);
 		if (ret)
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 42320011d264..42721d6331df 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -2,7 +2,6 @@
 #ifndef INT_BLK_MQ_H
 #define INT_BLK_MQ_H
 
-#include <asm/cmpxchg.h>
 #include "blk-stat.h"
 #include "blk-mq-tag.h"
 
@@ -31,11 +30,11 @@ struct blk_mq_ctx {
 	struct kobject		kobj;
 } ____cacheline_aligned_in_smp;
 
-/* See also blk_deadline_and_state.state. */
 enum mq_rq_state {
 	MQ_RQ_IDLE		= 0,
 	MQ_RQ_IN_FLIGHT		= 1,
-	MQ_RQ_COMPLETE		= 2,
+	MQ_RQ_ABORTED		= 2,
+	MQ_RQ_COMPLETE		= 3,
 };
 
 void blk_mq_freeze_queue(struct request_queue *q);
@@ -48,6 +47,7 @@ bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
 				bool wait);
 struct request *blk_mq_dequeue_from_ctx(struct blk_mq_hw_ctx *hctx,
 					struct blk_mq_ctx *start);
+void blk_mq_abort_request(struct request *rq);
 
 /*
  * Internal helpers for allocating/freeing the request map
@@ -97,106 +97,13 @@ extern void blk_mq_hctx_kobj_init(struct blk_mq_hw_ctx *hctx);
 
 void blk_mq_release(struct request_queue *q);
 
-static inline bool blk_mq_set_rq_state(struct request *rq,
-				       union blk_deadline_and_state old_val,
-				       union blk_deadline_and_state new_val)
-{
-#ifdef CONFIG_ARCH_HAVE_CMPXCHG64
-	return cmpxchg64(&rq->das.das, old_val.das, new_val.das) == old_val.das;
-#else
-	unsigned long flags;
-	bool res = false;
-
-	spin_lock_irqsave(&rq->das_lock, flags);
-	if (rq->das.das == old_val.das) {
-		rq->das = new_val;
-		res = true;
-	}
-	spin_unlock_irqrestore(&rq->das_lock, flags);
-
-	return res;
-#endif
-}
-
-/*
- * If the state of request @rq equals @old_state, update deadline and request
- * state atomically to @time and @new_state. blk-mq only. cmpxchg64() is only
- * used if there could be a concurrent update attempt from another context.
- */
-static inline bool blk_mq_rq_set_deadline(struct request *rq,
-					  unsigned long new_time,
-					  enum mq_rq_state old_state,
-					  enum mq_rq_state new_state)
-{
-	union blk_deadline_and_state old_val, new_val;
-
-	if (old_state != MQ_RQ_IN_FLIGHT) {
-		old_val = READ_ONCE(rq->das);
-		if (old_val.state != old_state)
-			return false;
-		new_val = old_val;
-		new_val.deadline = new_time;
-		new_val.state = new_state;
-		if (new_state == MQ_RQ_IN_FLIGHT)
-			new_val.generation++;
-		WRITE_ONCE(rq->das.das, new_val.das);
-		return true;
-	}
-
-	do {
-		old_val = READ_ONCE(rq->das);
-		if (old_val.state != old_state)
-			return false;
-		new_val = old_val;
-		new_val.deadline = new_time;
-		new_val.state = new_state;
-		if (new_state == MQ_RQ_IN_FLIGHT)
-			new_val.generation++;
-	} while (!blk_mq_set_rq_state(rq, old_val, new_val));
-
-	return true;
-}
-
 /**
  * blk_mq_rq_state() - read the current MQ_RQ_* state of a request
  * @rq: target request.
  */
 static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
 {
-	return rq->das.state;
-}
-
-/**
- * blk_mq_change_rq_state - atomically test and set request state
- * @rq: Request pointer.
- * @old_state: Old request state.
- * @new_state: New request state.
- *
- * Returns %true if and only if the old state was @old and if the state has
- * been changed into @new.
- */
-static inline bool blk_mq_change_rq_state(struct request *rq,
-					  enum mq_rq_state old_state,
-					  enum mq_rq_state new_state)
-{
-	union blk_deadline_and_state das = READ_ONCE(rq->das);
-	union blk_deadline_and_state old_val = das;
-	union blk_deadline_and_state new_val = das;
-
-	old_val.state = old_state;
-	new_val.state = new_state;
-	if (new_state == MQ_RQ_IN_FLIGHT)
-		new_val.generation++;
-	/*
-	 * For transitions from state in-flight to another state cmpxchg64()
-	 * must be used. For other state transitions it is safe to use
-	 * WRITE_ONCE().
-	 */
-	if (old_state != MQ_RQ_IN_FLIGHT) {
-		WRITE_ONCE(rq->das.das, new_val.das);
-		return true;
-	}
-	return blk_mq_set_rq_state(rq, old_val, new_val);
+	return READ_ONCE(rq->sag).state;
 }
 
 static inline struct blk_mq_ctx *__blk_mq_get_ctx(struct request_queue *q,
diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index 549dfda7190c..19fca827a272 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -157,14 +157,7 @@ void blk_timeout_work(struct work_struct *work)
 void blk_abort_request(struct request *req)
 {
 	if (req->q->mq_ops) {
-		/*
-		 * All we need to ensure is that timeout scan takes place
-		 * immediately and that scan sees the new timeout value.
-		 * No need for fancy synchronizations.
-		 */
-		if (blk_mq_rq_set_deadline(req, jiffies, MQ_RQ_IN_FLIGHT,
-					   MQ_RQ_IN_FLIGHT))
-			kblockd_schedule_work(&req->q->timeout_work);
+		blk_mq_abort_request(req);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
@@ -185,7 +178,7 @@ unsigned long blk_rq_timeout(unsigned long timeout)
 	return timeout;
 }
 
-static void __blk_add_timer(struct request *req, unsigned long deadline)
+void __blk_add_timer(struct request *req, unsigned long deadline)
 {
 	struct request_queue *q = req->q;
 	unsigned long expiry;
@@ -238,28 +231,3 @@ void blk_add_timer(struct request *req)
 
 	return __blk_add_timer(req, deadline);
 }
-
-/**
- * blk_mq_add_timer - set the deadline for a single request
- * @req:	request for which to set the deadline.
- * @old:	current request state.
- * @new:	new request state.
- *
- * Sets the deadline of a request if and only if it has state @old and
- * at the same time changes the request state from @old into @new. The caller
- * must guarantee that the request state won't be modified while this function
- * is in progress.
- */
-void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
-		      enum mq_rq_state new)
-{
-	struct request_queue *q = req->q;
-	unsigned long deadline;
-
-	if (!req->timeout)
-		req->timeout = q->rq_timeout;
-	deadline = jiffies + req->timeout;
-	if (!blk_mq_rq_set_deadline(req, deadline, old, new))
-		WARN_ON_ONCE(true);
-	return __blk_add_timer(req, deadline);
-}
diff --git a/block/blk.h b/block/blk.h
index 984367308b11..81a36c96aba2 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -170,8 +170,7 @@ static inline bool bio_integrity_endio(struct bio *bio)
 void blk_timeout_work(struct work_struct *work);
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
-void blk_mq_add_timer(struct request *req, enum mq_rq_state old,
-		      enum mq_rq_state new);
+void __blk_add_timer(struct request *req, unsigned long deadline);
 void blk_delete_timer(struct request *);
 
 
@@ -193,21 +192,21 @@ void blk_account_io_done(struct request *req, u64 now);
 /*
  * EH timer and IO completion will both attempt to 'grab' the request, make
  * sure that only one of them succeeds. Steal the bottom bit of the
- * das field for this.
+ * __deadline field for this.
  */
 static inline int blk_mark_rq_complete(struct request *rq)
 {
-	return test_and_set_bit(0, &rq->das.legacy_deadline);
+	return test_and_set_bit(0, &rq->__deadline);
 }
 
 static inline void blk_clear_rq_complete(struct request *rq)
 {
-	clear_bit(0, &rq->das.legacy_deadline);
+	clear_bit(0, &rq->__deadline);
 }
 
 static inline bool blk_rq_is_complete(struct request *rq)
 {
-	return test_bit(0, &rq->das.legacy_deadline);
+	return test_bit(0, &rq->__deadline);
 }
 
 /*
@@ -316,12 +315,12 @@ static inline void req_set_nomerge(struct request_queue *q, struct request *req)
  */
 static inline void blk_rq_set_deadline(struct request *rq, unsigned long time)
 {
-	rq->das.legacy_deadline = time & ~0x1UL;
+	rq->__deadline = time & ~0x1UL;
 }
 
 static inline unsigned long blk_rq_deadline(struct request *rq)
 {
-	return rq->das.legacy_deadline & ~0x1UL;
+	return rq->__deadline & ~0x1UL;
 }
 
 /*
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 771dbdc8758c..5288e9c3fdbf 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -130,14 +130,12 @@ typedef __u32 __bitwise req_flags_t;
 #define RQF_NOMERGE_FLAGS \
 	(RQF_STARTED | RQF_SOFTBARRIER | RQF_FLUSH_SEQ | RQF_SPECIAL_PAYLOAD)
 
-union blk_deadline_and_state {
+union request_sag {
 	struct {
-		uint32_t generation:30;
-		uint32_t state:2;
-		uint32_t deadline;
+		u32	generation : 30;
+		u32	state : 2;
 	};
-	unsigned long legacy_deadline;
-	uint64_t das;
+	u32		val;
 };
 
 /*
@@ -242,24 +240,19 @@ struct request {
 
 	unsigned int extra_len;	/* length of alignment and padding */
 
-	/*
-	 * ->aborted_gstate is used by the timeout to claim a specific
-	 * recycle instance of this request.  See blk_mq_timeout_work().
-	 */
-	union blk_deadline_and_state aborted_gstate;
-
-#ifndef CONFIG_ARCH_HAVE_CMPXCHG64
-	spinlock_t das_lock;
-#endif
+	 /*
+	  * ->aborted_sag is used by the timeout to claim a specific recycle
+	  * instance of this request.  See blk_mq_timeout_work().
+	  */
+	union request_sag sag;
+	union request_sag aborted_sag;
 
 	/*
 	 * Access through blk_rq_deadline() and blk_rq_set_deadline(),
 	 * blk_mark_rq_complete(), blk_clear_rq_complete() and
-	 * blk_rq_is_complete() for legacy queues or blk_mq_rq_state(),
-	 * blk_mq_change_rq_state() and blk_mq_rq_set_deadline() for
-	 * blk-mq queues.
+	 * blk_rq_is_complete() for legacy queues.
 	 */
-	union blk_deadline_and_state das;
+	unsigned long __deadline;
 
 	struct list_head timeout_list;
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again
@ 2018-05-16 16:17       ` Bart Van Assche
  0 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-05-16 16:17 UTC (permalink / raw)
  To: hch
  Cc: linux-kernel, linux-block, israelr, sagi, sebott, ming.lei,
	axboe, jianchao.w.wang, maxg, tj

On Wed, 2018-05-16 at 14:51 +0200, Christoph Hellwig wrote:
> I've been looking at this carefully, and I don't think we need cmpxchg64
> at all, and we don't need anywhere near as many cmpxchg operations either.
> 
> The only reason to include the deadline in the atomic operation is the
> blk_abort_request case, as the blk_mq_add_timer never modifies the
> deadline of a request that someone could be racing with.  So if we
> introduce a new aborted state for use by blk_abort_request we can modify
> the deadline separately (and in fact have a common field with the legacy
> path).

There is another reason the deadline is included in the atomic operation,
namely to handle races between the BLK_EH_RESET_TIMER case in blk_mq_rq_timed_out()
and blk_mq_complete_request(). I don't think that race is addressed properly by
your patch. I will see what I can do to address that race without using 64-bit
atomic operations.

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again
  2018-05-16 16:17       ` Bart Van Assche
  (?)
@ 2018-05-16 16:24       ` hch
  2018-05-16 16:47           ` Bart Van Assche
  -1 siblings, 1 reply; 18+ messages in thread
From: hch @ 2018-05-16 16:24 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-kernel, linux-block, israelr, sagi, sebott, ming.lei,
	axboe, jianchao.w.wang, maxg, tj

On Wed, May 16, 2018 at 04:17:42PM +0000, Bart Van Assche wrote:
> There is another reason the deadline is included in the atomic operation,
> namely to handle races between the BLK_EH_RESET_TIMER case in blk_mq_rq_timed_out()
> and blk_mq_complete_request(). I don't think that race is addressed properly by
> your patch. I will see what I can do to address that race without using 64-bit
> atomic operations.

I might be missing something here, so please help me understand
what is missing.

If we restart the timer in blk_mq_rq_timed_out we also bump the
generation at the same time as we reset the deadline in your old
patch.  With this patch we only bump the generation, but that should
be enough to address the rest, or not?

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again
@ 2018-05-16 16:47           ` Bart Van Assche
  0 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-05-16 16:47 UTC (permalink / raw)
  To: hch
  Cc: linux-kernel, linux-block, israelr, sagi, sebott, axboe,
	ming.lei, jianchao.w.wang, maxg, tj

On Wed, 2018-05-16 at 18:24 +0200, hch@lst.de wrote:
> On Wed, May 16, 2018 at 04:17:42PM +0000, Bart Van Assche wrote:
> > There is another reason the deadline is included in the atomic operation,
> > namely to handle races between the BLK_EH_RESET_TIMER case in blk_mq_rq_timed_out()
> > and blk_mq_complete_request(). I don't think that race is addressed properly by
> > your patch. I will see what I can do to address that race without using 64-bit
> > atomic operations.
> 
> I might be missing something here, so please help me understand
> what is missing.
> 
> If we restart the timer in blk_mq_rq_timed_out we also bump the
> generation at the same time as we reset the deadline in your old
> patch.  With this patch we only bump the generation, but that should
> be enough to address the rest, or not?

Hello Christoph,

I think your patch changes the order of updating the request state and
calling mod_timer(). In my patch the request state and the deadline are
updated first and mod_timer() is called afterwards. I think your patch
reorders these operations as follows:
(1) __blk_mq_start_request() sets the request deadline.
(2) __blk_mq_start_request() calls __blk_add_timer() which in turn calls
    mod_timer().
(3) __blk_mq_start_request() changes the request state into MQ_RQ_IN_FLIGHT.

In the unlikely event of a significant delay between (2) and (3) it can
happen that the timer fires, examines the request and ignores it because
its state differs from MQ_RQ_IN_FLIGHT. If the request for which this
happened times out, its timeout will only be handled the next time
blk_mq_timeout_work() is called. Is this the behavior you intended?
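
To make the two orderings concrete, here is a minimal C sketch. The
set_rq_deadline() and set_rq_state() helpers are hypothetical and stand in
for however the patches actually encode the deadline and the request state;
mod_timer() and the rq->q->timeout queue timer are meant literally:

/* Ordering with state and deadline updated before the timer is armed. */
static void start_request_state_first(struct request *rq)
{
	set_rq_deadline(rq, jiffies + rq->timeout);	/* hypothetical helper */
	set_rq_state(rq, MQ_RQ_IN_FLIGHT);		/* hypothetical helper */
	mod_timer(&rq->q->timeout, jiffies + rq->timeout);
}

/*
 * Ordering described in (1)-(3) above: the timer is armed before the state
 * changes, so a timer that fires between (2) and (3) sees a request whose
 * state is not yet MQ_RQ_IN_FLIGHT and skips it.
 */
static void start_request_timer_first(struct request *rq)
{
	set_rq_deadline(rq, jiffies + rq->timeout);		/* (1) */
	mod_timer(&rq->q->timeout, jiffies + rq->timeout);	/* (2) */
	set_rq_state(rq, MQ_RQ_IN_FLIGHT);			/* (3) */
}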

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again
  2018-05-16 16:47           ` Bart Van Assche
@ 2018-05-16 17:31           ` hch
  2018-05-16 18:06               ` Bart Van Assche
  -1 siblings, 1 reply; 18+ messages in thread
From: hch @ 2018-05-16 17:31 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: hch, linux-kernel, linux-block, israelr, sagi, sebott, axboe,
	ming.lei, jianchao.w.wang, maxg, tj

On Wed, May 16, 2018 at 04:47:54PM +0000, Bart Van Assche wrote:
> I think your patch changes the order of updating the request state and
> calling mod_timer(). In my patch the request state and the deadline are
> updated first and mod_timer() is called afterwards. I think your patch
> reorders these operations as follows:
> (1) __blk_mq_start_request() sets the request deadline.
> (2) __blk_mq_start_request() calls __blk_add_timer() which in turn calls
>     mod_timer().
> (3) __blk_mq_start_request() changes the request state into MQ_RQ_IN_FLIGHT.
> 
> In the unlikely event of a significant delay between (2) and (3) it can
> happen that the timer fires, examines the request and ignores it because
> its state differs from MQ_RQ_IN_FLIGHT. If the request for which this
> happened times out, its timeout will only be handled the next time
> blk_mq_timeout_work() is called. Is this the behavior you intended?

We can easily move the timer manipulation after the state change, I think.
It would make sense to add comments explaining the ordering.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again
@ 2018-05-16 18:06               ` Bart Van Assche
  0 siblings, 0 replies; 18+ messages in thread
From: Bart Van Assche @ 2018-05-16 18:06 UTC (permalink / raw)
  To: hch
  Cc: linux-kernel, linux-block, israelr, sagi, sebott, ming.lei,
	axboe, jianchao.w.wang, maxg, tj

On Wed, 2018-05-16 at 19:31 +0200, hch@lst.de wrote:
> On Wed, May 16, 2018 at 04:47:54PM +0000, Bart Van Assche wrote:
> > I think your patch changes the order of updating the request state and
> > calling mod_timer(). In my patch the request state and the deadline are
> > updated first and mod_timer() is called afterwards. I think your patch
> > reorders these operations as follows:
> > (1) __blk_mq_start_request() sets the request deadline.
> > (2) __blk_mq_start_request() calls __blk_add_timer() which in turn calls
> >     mod_timer().
> > (3) __blk_mq_start_request() changes the request state into MQ_RQ_IN_FLIGHT.
> > 
> > In the unlikely event of a significant delay between (2) and (3) it can
> > happen that the timer fires, examines the request and ignores it because
> > its state differs from MQ_RQ_IN_FLIGHT. If the request for which this
> > happened times out, its timeout will only be handled the next time
> > blk_mq_timeout_work() is called. Is this the behavior you intended?
> 
> We can easily move the timer manipulation after the state change, I think.
> It would make sense to add comments explaining the ordering.

Hello Christoph,

I'm afraid that could lead to mod_timer() being called in a different order
than intended. If e.g. the code that handles BLK_EH_RESET_TIMER first changes
the request state to in-flight and next calls mod_timer(), then it can happen
that another context completes and restarts the request, resulting in a
concurrent mod_timer() call. I'm not sure that reordering the mod_timer()
calls would result in correct behavior.
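
A minimal sketch of the interleaving in question; set_rq_state() is again a
hypothetical helper for whatever state encoding ends up being used:

/*
 * BLK_EH_RESET_TIMER handling with the state change moved before the
 * timer manipulation.
 */
static void reset_timer_after_state_change(struct request *rq)
{
	set_rq_state(rq, MQ_RQ_IN_FLIGHT);	/* hypothetical helper */
	/*
	 * Another context can complete the request here and restart it via
	 * blk_mq_start_request(), which arms the queue timeout timer for
	 * the new incarnation of the request ...
	 */
	mod_timer(&rq->q->timeout, jiffies + rq->timeout);
	/*
	 * ... and the mod_timer() call above then races with the one issued
	 * on behalf of the restarted request, so the timer may end up
	 * programmed with the stale deadline.
	 */
}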

Bart.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64
  2018-05-15 22:51 ` [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
  2018-05-16  4:15   ` Michael Ellerman
@ 2018-05-18 20:33   ` Palmer Dabbelt
  1 sibling, 0 replies; 18+ messages in thread
From: Palmer Dabbelt @ 2018-05-18 20:33 UTC (permalink / raw)
  To: bart.vanassche
  Cc: axboe, linux-block, linux-kernel, Christoph Hellwig,
	bart.vanassche, catalin.marinas, Will Deacon, tony.luck,
	fenghua.yu, geert, jejb, deller, benh, paulus, mpe, schwidefsky,
	heiko.carstens, davem, tglx, mingo, hpa, chris, jcmvbkbc,
	Arnd Bergmann, corbet

On Tue, 15 May 2018 15:51:20 PDT (-0700), bart.vanassche@wdc.com wrote:
> The next patch in this series introduces a call to cmpxchg64()
> in the block layer core for those architectures on which this
> functionality is available. Make it possible to test whether
> cmpxchg64() is available by introducing CONFIG_ARCH_HAVE_CMPXCHG64.
>
> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will.deacon@arm.com>
> Cc: Tony Luck <tony.luck@intel.com>
> Cc: Fenghua Yu <fenghua.yu@intel.com>
> Cc: Geert Uytterhoeven <geert@linux-m68k.org>
> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
> Cc: Helge Deller <deller@gmx.de>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
> Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
> Cc: David S. Miller <davem@davemloft.net>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Chris Zankel <chris@zankel.net>
> Cc: Max Filippov <jcmvbkbc@gmail.com>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Jonathan Corbet <corbet@lwn.net>
> ---
>  .../features/locking/cmpxchg64/arch-support.txt    | 33 ++++++++++++++++++++++
>  arch/Kconfig                                       |  3 ++
>  arch/alpha/Kconfig                                 |  1 +
>  arch/arm/Kconfig                                   |  1 +
>  arch/arm64/Kconfig                                 |  1 +
>  arch/ia64/Kconfig                                  |  1 +
>  arch/m68k/Kconfig                                  |  1 +
>  arch/mips/Kconfig                                  |  1 +
>  arch/parisc/Kconfig                                |  1 +
>  arch/powerpc/Kconfig                               |  1 +
>  arch/riscv/Kconfig                                 |  1 +
>  arch/s390/Kconfig                                  |  1 +
>  arch/sparc/Kconfig                                 |  1 +
>  arch/x86/Kconfig                                   |  1 +
>  arch/xtensa/Kconfig                                |  1 +
>  15 files changed, 49 insertions(+)
>  create mode 100644 Documentation/features/locking/cmpxchg64/arch-support.txt

[...]

> --- a/arch/riscv/Kconfig
> +++ b/arch/riscv/Kconfig
> @@ -8,6 +8,7 @@ config RISCV
>  	select OF
>  	select OF_EARLY_FLATTREE
>  	select OF_IRQ
> +	select ARCH_HAVE_CMPXCHG64
>  	select ARCH_WANT_FRAME_POINTERS
>  	select CLONE_BACKWARDS
>  	select COMMON_CLK

If I understand correctly, we should only have ARCH_HAVE_CMPXCHG64 on 64-bit
RISC-V systems, so this should look something like

   select ARCH_HAVE_CMPXCHG64 if 64BIT

Of course, the RV32I port is broken right now so it's not a big deal, but we're 
working through making it less broken...

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-05-18 20:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-15 22:51 [PATCH v10 0/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
2018-05-15 22:51 ` [PATCH v10 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
2018-05-16  4:15   ` Michael Ellerman
2018-05-18 20:33   ` Palmer Dabbelt
2018-05-15 22:51 ` [PATCH v10 2/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
2018-05-15 22:51 ` [PATCH v9 0/2] " Bart Van Assche
2018-05-15 22:51 ` [PATCH v9 1/2] arch/*: Add CONFIG_ARCH_HAVE_CMPXCHG64 Bart Van Assche
2018-05-16  6:07   ` Christoph Hellwig
2018-05-15 22:51 ` [PATCH v9 2/2] blk-mq: Rework blk-mq timeout handling again Bart Van Assche
2018-05-16 12:51   ` Christoph Hellwig
2018-05-16 16:17     ` Bart Van Assche
2018-05-16 16:17       ` Bart Van Assche
2018-05-16 16:24       ` hch
2018-05-16 16:47         ` Bart Van Assche
2018-05-16 16:47           ` Bart Van Assche
2018-05-16 17:31           ` hch
2018-05-16 18:06             ` Bart Van Assche
2018-05-16 18:06               ` Bart Van Assche
