All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] tests/nvme: Add admin-passthru+reset race test
@ 2022-11-14 20:34 Jonathan Derrick
  2022-11-15  0:25 ` Chaitanya Kulkarni
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Jonathan Derrick @ 2022-11-14 20:34 UTC (permalink / raw)
  To: linux-nvme, linux-block, Shin\'ichiro Kawasaki
  Cc: Keith Busch, Omar Sandoval, Jonathan Derrick

Adds a test which runs many formats and reset_controllers in parallel.
The intent is to expose timing holes in the controller state machine
which will lead to hung task timing and the controller becoming
unavailable.

Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354

Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
---
 tests/nvme/046     | 85 ++++++++++++++++++++++++++++++++++++++++++++++
 tests/nvme/046.out |  2 ++
 2 files changed, 87 insertions(+)
 create mode 100755 tests/nvme/046
 create mode 100644 tests/nvme/046.out

diff --git a/tests/nvme/046 b/tests/nvme/046
new file mode 100755
index 0000000..4b47783
--- /dev/null
+++ b/tests/nvme/046
@@ -0,0 +1,85 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-3.0+
+# Copyright (C) 2022 Jonathan Derrick <jonathan.derrick@linux.dev>
+#
+# Test nvme reset controller during admin passthru
+#
+# Regression for issue reported by
+# https://bugzilla.kernel.org/show_bug.cgi?id=216354
+
+. tests/nvme/rc
+
+#restrict test to nvme-pci only
+nvme_trtype=pci
+
+DESCRIPTION="test nvme reset controller during admin passthru"
+QUICK=1
+CAN_BE_ZONED=1
+
+requires() {
+	_nvme_requires
+}
+
+device_requires() {
+	_require_test_dev_is_nvme
+}
+
+test_device() {
+	echo "Running ${TEST_NAME}"
+
+	local sysfs
+	local attr
+	local m
+
+	sysfs="$TEST_DEV_SYSFS/device"
+	timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2))
+
+	sleep 5
+
+	if [[ ! -d "$sysfs" ]]; then
+		echo "$sysfs doesn't exist"
+	fi
+
+	# do reset controller/format loops
+	# don't check status now because a timing race is desired
+	i=0
+	start=0
+	timing_out=false
+	while [[ $i -le 1000 ]]; do
+		start=$SECONDS
+		if [[ -f "$sysfs/reset_controller" ]]; then
+			echo 1 > "$sysfs/reset_controller" 2>/dev/null &
+			i=$((i+1))
+		fi
+		nvme format -l 0 -f $TEST_DEV 2>/dev/null &
+
+		#Assume the controller is hung and unrecoverable
+		if [[ $(($SECONDS - $start)) -gt $timeout ]]; then
+			echo "nvme controller timing out"
+			timing_out=true
+			break
+		fi
+	done
+
+	{ kill $!; wait; } &> /dev/null
+
+	# at this point it may have waited hung_task_timeout / 2 already, so
+	# only wait 25% longer for a total of about 75% of allowed timeout
+	m=0
+	while [[ $m -le $((timeout / 2)) ]]; do
+		if [[ $timing_out == true ]]; then
+			break
+		fi
+		if grep -q live "$sysfs/state"; then
+			break
+		fi
+		sleep 1
+		m=$((m+1))
+	done
+	if ! grep -q live "$sysfs/state"; then
+		echo "nvme still not live after $(($SECONDS - $start)) seconds!"
+	fi
+	udevadm settle
+
+	echo "Test complete"
+}
diff --git a/tests/nvme/046.out b/tests/nvme/046.out
new file mode 100644
index 0000000..2b5fa6a
--- /dev/null
+++ b/tests/nvme/046.out
@@ -0,0 +1,2 @@
+Running nvme/046
+Test complete
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
  2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
@ 2022-11-15  0:25 ` Chaitanya Kulkarni
  2022-11-15  8:49   ` Jonathan Derrick
  2022-11-15  0:33 ` kernel test robot
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Chaitanya Kulkarni @ 2022-11-15  0:25 UTC (permalink / raw)
  To: linux-nvme

On 11/14/22 12:34, Jonathan Derrick wrote:
> Adds a test which runs many formats and reset_controllers in parallel.
> The intent is to expose timing holes in the controller state machine
> which will lead to hung task timing and the controller becoming
> unavailable.
> 
> Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354
> 
> Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
> ---

Thanks a lot for submitting this, it is really useful...

Overall it looks good.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
  2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
  2022-11-15  0:25 ` Chaitanya Kulkarni
@ 2022-11-15  0:33 ` kernel test robot
  2022-11-15 23:07 ` Keith Busch
  2022-11-16  7:43 ` Christoph Hellwig
  3 siblings, 0 replies; 7+ messages in thread
From: kernel test robot @ 2022-11-15  0:33 UTC (permalink / raw)
  To: Jonathan Derrick, linux-nvme, linux-block, Shin'ichiro Kawasaki
  Cc: oe-kbuild-all, Keith Busch, Omar Sandoval, Jonathan Derrick

Hi Jonathan,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.1-rc5 next-20221114]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jonathan-Derrick/tests-nvme-Add-admin-passthru-reset-race-test/20221115-043752
patch link:    https://lore.kernel.org/r/20221114203412.383-1-jonathan.derrick%40linux.dev
patch subject: [PATCH] tests/nvme: Add admin-passthru+reset race test
reproduce:
        scripts/spdxcheck.py

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <lkp@intel.com>

spdxcheck warnings: (new ones prefixed by >>)
   drivers/cpufreq/amd-pstate-ut.c: 1:28 Invalid License ID: GPL-1.0-or-later
>> tests/nvme/046: 2:27 Invalid License ID: GPL-3.0+

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
  2022-11-15  0:25 ` Chaitanya Kulkarni
@ 2022-11-15  8:49   ` Jonathan Derrick
  2022-11-15 11:01     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 7+ messages in thread
From: Jonathan Derrick @ 2022-11-15  8:49 UTC (permalink / raw)
  To: linux-nvme; +Cc: Chaitanya Kulkarni, Shin'ichiro Kawasaki



On 11/14/2022 5:25 PM, Chaitanya Kulkarni wrote:
> On 11/14/22 12:34, Jonathan Derrick wrote:
>> Adds a test which runs many formats and reset_controllers in parallel.
>> The intent is to expose timing holes in the controller state machine
>> which will lead to hung task timing and the controller becoming
>> unavailable.
>>
>> Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354
>>
>> Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
>> ---
> 
> Thanks a lot for submitting this, it is really useful...
> 
> Overall it looks good.
> 
> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> 
> -ck
> 

Thanks
I have v2 coming with a bit more robustness, so let's not merge this one just yet


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
  2022-11-15  8:49   ` Jonathan Derrick
@ 2022-11-15 11:01     ` Shinichiro Kawasaki
  0 siblings, 0 replies; 7+ messages in thread
From: Shinichiro Kawasaki @ 2022-11-15 11:01 UTC (permalink / raw)
  To: Jonathan Derrick; +Cc: linux-nvme, Chaitanya Kulkarni

On Nov 15, 2022 / 01:49, Jonathan Derrick wrote:
> 
> 
> On 11/14/2022 5:25 PM, Chaitanya Kulkarni wrote:
> > On 11/14/22 12:34, Jonathan Derrick wrote:
> >> Adds a test which runs many formats and reset_controllers in parallel.
> >> The intent is to expose timing holes in the controller state machine
> >> which will lead to hung task timing and the controller becoming
> >> unavailable.
> >>
> >> Reported by https://bugzilla.kernel.org/show_bug.cgi?id=216354
> >>
> >> Signed-off-by: Jonathan Derrick <jonathan.derrick@linux.dev>
> >> ---
> > 
> > Thanks a lot for submitting this, it is really useful...
> > 
> > Overall it looks good.
> > 
> > Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> > 
> > -ck
> > 
> 
> Thanks
> I have v2 coming with a bit more robustness, so let's not merge this one just yet

This test case looks valuable. Thanks. When you post v2, would you mind
renumbering the test case from nvme/046 to nvme/047? Another new test case of
the nvme test group was posted [1]. Also, your check with "make check" command
will be appreciated. Shellcheck has some complaints:

$ make check
shellcheck -x -e SC2119 -f gcc check new common/* \
        tests/*/rc tests/*/[0-9]*[0-9]
tests/nvme/046:31:8: warning: attr appears unused. Verify use (or export if used externally). [SC2034]
tests/nvme/046:54:23: note: Double quote to prevent globbing and word splitting. [SC2086]
tests/nvme/046:57:12: note: $/${} is unnecessary on arithmetic variables. [SC2004]
tests/nvme/046:57:23: note: $/${} is unnecessary on arithmetic variables. [SC2004]
tests/nvme/046:80:38: note: $/${} is unnecessary on arithmetic variables. [SC2004]
tests/nvme/046:80:49: note: $/${} is unnecessary on arithmetic variables. [SC2004]
make: *** [Makefile:21: check] Error 1


[1] https://lore.kernel.org/linux-nvme/20221114125252.1086763-1-sagi@grimberg.me/

-- 
Shin'ichiro Kawasaki

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
  2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
  2022-11-15  0:25 ` Chaitanya Kulkarni
  2022-11-15  0:33 ` kernel test robot
@ 2022-11-15 23:07 ` Keith Busch
  2022-11-16  7:43 ` Christoph Hellwig
  3 siblings, 0 replies; 7+ messages in thread
From: Keith Busch @ 2022-11-15 23:07 UTC (permalink / raw)
  To: Jonathan Derrick
  Cc: linux-nvme, linux-block, Shin\'ichiro Kawasaki, Omar Sandoval

On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote:
> +	echo "Running ${TEST_NAME}"
> +
> +	local sysfs
> +	local attr
> +	local m
> +
> +	sysfs="$TEST_DEV_SYSFS/device"

That's not the correct directory when the device is using native
nvme-multipath.

> +	timeout=$(($(cat /proc/sys/kernel/hung_task_timeout_secs) / 2))
> +
> +	sleep 5
> +
> +	if [[ ! -d "$sysfs" ]]; then
> +		echo "$sysfs doesn't exist"
> +	fi
> +
> +	# do reset controller/format loops
> +	# don't check status now because a timing race is desired
> +	i=0
> +	start=0
> +	timing_out=false
> +	while [[ $i -le 1000 ]]; do
> +		start=$SECONDS
> +		if [[ -f "$sysfs/reset_controller" ]]; then
> +			echo 1 > "$sysfs/reset_controller" 2>/dev/null &
> +			i=$((i+1))
> +		fi
> +		nvme format -l 0 -f $TEST_DEV 2>/dev/null &
> +
> +		#Assume the controller is hung and unrecoverable
> +		if [[ $(($SECONDS - $start)) -gt $timeout ]]; then
> +			echo "nvme controller timing out"
> +			timing_out=true
> +			break
> +		fi
> +	done

If the controller is already undergoing a reset, then writing to the
reset_controller file becomes a no-op. Unless your "reset_controller"
completes near instantaneously, I find that this loop tears through 1000
iterations, forks 1000 formats, and only 1 reset_controller actually
gets through.

If I remove the upper limit, then I can also see the stalled task, but
it is only temporary and gets itself out of it after the admin timeout
(1 minute). Is that also your observation, or is it stuck forever?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] tests/nvme: Add admin-passthru+reset race test
  2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
                   ` (2 preceding siblings ...)
  2022-11-15 23:07 ` Keith Busch
@ 2022-11-16  7:43 ` Christoph Hellwig
  3 siblings, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2022-11-16  7:43 UTC (permalink / raw)
  To: Jonathan Derrick
  Cc: linux-nvme, linux-block, Shin\'ichiro Kawasaki, Keith Busch,
	Omar Sandoval

On Mon, Nov 14, 2022 at 01:34:12PM -0700, Jonathan Derrick wrote:
> Adds a test which runs many formats and reset_controllers in parallel.
> The intent is to expose timing holes in the controller state machine
> which will lead to hung task timing and the controller becoming
> unavailable.

This passes just fine for me both on qemu and a thunderbolt attached
m.2 SSD.  So it seems to be hardware / timing dependent.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-11-16  7:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-14 20:34 [PATCH] tests/nvme: Add admin-passthru+reset race test Jonathan Derrick
2022-11-15  0:25 ` Chaitanya Kulkarni
2022-11-15  8:49   ` Jonathan Derrick
2022-11-15 11:01     ` Shinichiro Kawasaki
2022-11-15  0:33 ` kernel test robot
2022-11-15 23:07 ` Keith Busch
2022-11-16  7:43 ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.