All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] scsi_debug: improve failure rates
@ 2021-07-27 20:10 Luis Chamberlain
  2021-07-27 20:10 ` [PATCH 1/4] common/config: disable udevadm settle if CONFIG_NET is disabled Luis Chamberlain
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Luis Chamberlain @ 2021-07-27 20:10 UTC (permalink / raw)
  To: fstests
  Cc: hare, dgilbert, jeyu, lucas.demarchi, linux-kernel, Luis Chamberlain

When using scsi_debug to create virtual devices we can often run into
failures which are really just false positives, and the failure was the
inability to remove the module. Addressing this is not easy. While we
can learn from what blktests folks do and use udevadm settle, that's
not sufficient by any means. This addresses the other pieces needed.

More work is also needed on the scsi_debug driver front.

Makes me wonder if a patient kmod removal option is then desirable
upstream on kmod (just the one which waits to refcnt 0), given this sort
of test case. Lucas?

Luis Chamberlain (4):
  common/config: disable udevadm settle if CONFIG_NET is disabled
  common/scsi_debug: use udevadm settle instead of sleeping
  common/module: add a patient module rmmod
  common/scsi_debug: use the patient module remover

 common/config     |  9 ++++++++-
 common/module     | 48 +++++++++++++++++++++++++++++++++++++++++++++++
 common/scsi_debug | 10 ++++++----
 3 files changed, 62 insertions(+), 5 deletions(-)

-- 
2.29.2


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/4] common/config: disable udevadm settle if CONFIG_NET is disabled
  2021-07-27 20:10 [PATCH 0/4] scsi_debug: improve failure rates Luis Chamberlain
@ 2021-07-27 20:10 ` Luis Chamberlain
  2021-07-27 20:10 ` [PATCH 2/4] common/scsi_debug: use udevadm settle instead of sleeping Luis Chamberlain
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: Luis Chamberlain @ 2021-07-27 20:10 UTC (permalink / raw)
  To: fstests
  Cc: hare, dgilbert, jeyu, lucas.demarchi, linux-kernel, Luis Chamberlain

If CONFIG_NET is disabled kobject_uevent_net_broadcast() will be a no-op
and so no uevent are sent and so 'udevadm settle' won't really do
anything for you.

We check for /proc/net to see if CONFIG_NET was enabled.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 common/config | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/common/config b/common/config
index adc16b59..005fd50a 100644
--- a/common/config
+++ b/common/config
@@ -240,7 +240,14 @@ else
 	UDEV_SETTLE_PROG="$UDEV_SETTLE_PROG settle"
 fi
 # neither command is available, use sleep 1
-if [ "$UDEV_SETTLE_PROG" == "" ]; then
+#
+# Udev events are sent via netlink to userspace through
+# kobject_uevent_net_broadcast(), and udev in userspace is in charge of
+# handling the events. The command `udevadm settle` just checks if
+# /run/udev/queue is 0, however, a kernel without CONFIG_NET will have
+# kobject_uevent_net_broadcast() be a no-op, and so /run/udev/queue may not
+# exist or always be 0. We check for /proc/net to see CONFIG_NET was enabled.
+if [[ "$UDEV_SETTLE_PROG" == "" || ! -d /proc/net ]]; then
 	UDEV_SETTLE_PROG="sleep 1"
 fi
 export UDEV_SETTLE_PROG
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 2/4] common/scsi_debug: use udevadm settle instead of sleeping
  2021-07-27 20:10 [PATCH 0/4] scsi_debug: improve failure rates Luis Chamberlain
  2021-07-27 20:10 ` [PATCH 1/4] common/config: disable udevadm settle if CONFIG_NET is disabled Luis Chamberlain
@ 2021-07-27 20:10 ` Luis Chamberlain
  2021-07-27 20:10 ` [PATCH 3/4] common/module: add a patient module rmmod Luis Chamberlain
  2021-07-27 20:10 ` [PATCH 4/4] common/scsi_debug: use the patient module remover Luis Chamberlain
  3 siblings, 0 replies; 7+ messages in thread
From: Luis Chamberlain @ 2021-07-27 20:10 UTC (permalink / raw)
  To: fstests
  Cc: hare, dgilbert, jeyu, lucas.demarchi, linux-kernel, Luis Chamberlain

The variable UDEV_SETTLE_PROG is already defined and used for
lvm to either use `udevadm settle` in case it is available, and
if not, use 'sleep 1' otherwise (ancient distros or CONFIG_NET
is missing).

Use it on scsi_debug to replace the sleep calls sprinkled in
place after module removal. The correct thing to do is to
just use udevadm settle when available, and only fall back to
calling sleep when udevadm is not available or CONFIG_NET
is disabled.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 common/scsi_debug | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/common/scsi_debug b/common/scsi_debug
index d9aa0bd2..e7988469 100644
--- a/common/scsi_debug
+++ b/common/scsi_debug
@@ -36,7 +36,7 @@ _get_scsi_debug_dev()
 	echo "scsi_debug options $opts" >> $seqres.full
 	modprobe scsi_debug $opts
 	[ $? -eq 0 ] || _fail "scsi_debug modprobe failed"
-	sleep 1
+	$UDEV_SETTLE_PROG
 	device=`grep -wl scsi_debug /sys/block/sd*/device/model | awk -F / '{print $4}'`
 	echo "/dev/$device"
 }
@@ -50,7 +50,7 @@ _put_scsi_debug_dev()
 	# modprobe is only quiet when the module is not found, not when the
 	# module is in use.
 	while [ $n -ge 0 ] && ! modprobe -nr scsi_debug >/dev/null 2>&1; do
-		sleep 1
+		$UDEV_SETTLE_PROG
 		n=$((n-1))
 	done
 	rmmod scsi_debug || _fail "Could not remove scsi_debug module"
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 3/4] common/module: add a patient module rmmod
  2021-07-27 20:10 [PATCH 0/4] scsi_debug: improve failure rates Luis Chamberlain
  2021-07-27 20:10 ` [PATCH 1/4] common/config: disable udevadm settle if CONFIG_NET is disabled Luis Chamberlain
  2021-07-27 20:10 ` [PATCH 2/4] common/scsi_debug: use udevadm settle instead of sleeping Luis Chamberlain
@ 2021-07-27 20:10 ` Luis Chamberlain
  2021-07-30  0:22   ` Luis Chamberlain
  2021-08-10 21:31   ` Luis Chamberlain
  2021-07-27 20:10 ` [PATCH 4/4] common/scsi_debug: use the patient module remover Luis Chamberlain
  3 siblings, 2 replies; 7+ messages in thread
From: Luis Chamberlain @ 2021-07-27 20:10 UTC (permalink / raw)
  To: fstests
  Cc: hare, dgilbert, jeyu, lucas.demarchi, linux-kernel, Luis Chamberlain

When we call rmmod it will fail if the refcnt is greater than 0.
This is expected, however, if using test modules such as scsi_debug,
userspace tests may expect that once userspace is done issuing out
whatever it needs to that immediately after it can remove the module.

This is not true, things can linger for a while when a test driver's
reference count is greater than 0, and so userspace has to wait before
it can reliably try rmmod.

Furthermore... experience on older kernels shows it may be a while
before we can actually remove the module even *after*
/sys/module/$module/refcnt is 0. This is only currently observed
on older kernels.

Provide a check for the refcnt before we try to remove the module.
And add a grace period to allow things quiesce before we really
try removing the module so that we don't get false positives.

An example of this is when using the scsi_debug module. Using
udevadm settle does ensure the devices are properly ready to
use after modprobe, however once you start using the module,
say testing xfs with generic/108, you will see refcnt linger
between 1 and 2, and once the test completes this eventually
goes down to 0. The refcnt can be > 0 after userspace thinks
it is done with a test, and so will fail to remove it. Likewise,
on older kernels even with a refcnt of 0 the module may sometimes
fail to be removed.

Yes, the sleep 10 is rather large, however if you want to prevent
false positives it is needed. Doug's patch recent test patch [0]
helps with these older kernel kernels not having to require such
large delays, however we have no way of telling currently if such
scsi_debug quiesce patch is applied.

[0] https://lore.kernel.org/linux-scsi/20210508230745.27923-1-dgilbert@interlog.com/

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 common/module | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/common/module b/common/module
index 39e4e793..064bc45e 100644
--- a/common/module
+++ b/common/module
@@ -81,3 +81,51 @@ _get_fs_module_param()
 {
 	cat /sys/module/${FSTYP}/parameters/${1} 2>/dev/null
 }
+
+# checks the refcount and returns 0 if we can safely remove the module. rmmod
+# does this check for us, but we can use this to also iterate checking for this
+# refcount before we even try to remove the module. This is useful when using
+# debug test modules which take a while to quiesce.
+_patient_rmmod_check_refcnt()
+{
+	local module=$1
+	local refcnt=0
+
+	if [[ -f /sys/module/$module/refcnt ]]; then
+		refcnt=$(cat /sys/module/$module/refcnt 2>/dev/null)
+		if [[ $? -ne 0 || $refcnt -eq 0 ]]; then
+			return 0
+		fi
+		return 1
+	fi
+	return 0
+}
+
+# patiently tries to wait to remove a module by ensuring first
+# the refcnt is 0. We wait for 10 seconds at most.
+_patient_rmmod()
+{
+	local module=$1
+	local max_tries=10
+
+	while [[ $max_tries != 0 ]]; do
+		_patient_rmmod_check_refcnt $module
+		if [[ $? -eq 0 ]]; then
+			break
+		fi
+		sleep 1
+		let max_tries=$max_tries-1
+	done
+
+	# give some grace time for when the refcnt is 0 as otherwise the
+	# removal can fail on older kernels. Refer to:
+	# https://bugzilla.kernel.org/show_bug.cgi?id=212337
+	sleep 10
+
+	if [[ -d /sys/module/$module ]]; then
+		modprobe -r $module
+		return $?
+	fi
+
+	return 0
+}
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH 4/4] common/scsi_debug: use the patient module remover
  2021-07-27 20:10 [PATCH 0/4] scsi_debug: improve failure rates Luis Chamberlain
                   ` (2 preceding siblings ...)
  2021-07-27 20:10 ` [PATCH 3/4] common/module: add a patient module rmmod Luis Chamberlain
@ 2021-07-27 20:10 ` Luis Chamberlain
  3 siblings, 0 replies; 7+ messages in thread
From: Luis Chamberlain @ 2021-07-27 20:10 UTC (permalink / raw)
  To: fstests
  Cc: hare, dgilbert, jeyu, lucas.demarchi, linux-kernel, Luis Chamberlain

If you try to run tests such as generic/108 in a loop
you'll eventually see a failure, but the failure can
be a false positive and the test was just unable to remove
the scsi_debug module.

We need to give some time for the refcnt to become 0. For
instance for the test generic/108 the refcnt lingers between
2 and 1. It should be 0 when we're done but a bit of time
seems to be required. The chance of us trying to run rmmod
when the refcnt is 2 or 1 is low, about 1/30 times if you
run the test in a loop on linux-next today.

Likewise, even when its 0 we just need a tiny breather before
we can remove the module (sleep 10 suffices) but this is
only required on older kernels. Otherwise removing the module
will just fail.

Some of these races are documented on the korg#212337, and
Doug Gilbert has posted at least one patch attempt to try
to help with this [1]. The patch does not resolve all the
issues though, it helps though.

[0] https://bugzilla.kernel.org/show_bug.cgi?id=212337
[1] https://lkml.kernel.org/r/20210508230745.27923-1-dgilbert@interlog.com
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 common/scsi_debug | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/common/scsi_debug b/common/scsi_debug
index e7988469..3c9cd820 100644
--- a/common/scsi_debug
+++ b/common/scsi_debug
@@ -4,11 +4,13 @@
 #
 # Functions useful for tests on unique block devices
 
+. common/module
+
 _require_scsi_debug()
 {
 	# make sure we have the module and it's not already used
 	modinfo scsi_debug 2>&1 > /dev/null || _notrun "scsi_debug module not found"
-	lsmod | grep -wq scsi_debug && (rmmod scsi_debug || _notrun "scsi_debug module in use")
+	lsmod | grep -wq scsi_debug && (_patient_rmmod scsi_debug || _notrun "scsi_debug module in use")
 	# make sure it has the features we need
 	# logical/physical sectors plus unmap support all went in together
 	modinfo scsi_debug | grep -wq sector_size || _notrun "scsi_debug too old"
@@ -53,5 +55,5 @@ _put_scsi_debug_dev()
 		$UDEV_SETTLE_PROG
 		n=$((n-1))
 	done
-	rmmod scsi_debug || _fail "Could not remove scsi_debug module"
+	_patient_rmmod scsi_debug || _fail "Could not remove scsi_debug module"
 }
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH 3/4] common/module: add a patient module rmmod
  2021-07-27 20:10 ` [PATCH 3/4] common/module: add a patient module rmmod Luis Chamberlain
@ 2021-07-30  0:22   ` Luis Chamberlain
  2021-08-10 21:31   ` Luis Chamberlain
  1 sibling, 0 replies; 7+ messages in thread
From: Luis Chamberlain @ 2021-07-30  0:22 UTC (permalink / raw)
  To: fstests; +Cc: hare, dgilbert, jeyu, lucas.demarchi, linux-kernel

On Tue, Jul 27, 2021 at 01:10:44PM -0700, Luis Chamberlain wrote:
> Furthermore... experience on older kernels shows it may be a while
> before we can actually remove the module even *after*
> /sys/module/$module/refcnt is 0. This is only currently observed
> on older kernels.

Sadly this is incorrect, after running a test without any sleep
once refcnt is 0 also fails on linux-next, just that the failure rate
is much lower, at about ~ 1/1642.

So the sleep is still relevant even if testing with linux-next.

  Luis

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH 3/4] common/module: add a patient module rmmod
  2021-07-27 20:10 ` [PATCH 3/4] common/module: add a patient module rmmod Luis Chamberlain
  2021-07-30  0:22   ` Luis Chamberlain
@ 2021-08-10 21:31   ` Luis Chamberlain
  1 sibling, 0 replies; 7+ messages in thread
From: Luis Chamberlain @ 2021-08-10 21:31 UTC (permalink / raw)
  To: fstests
  Cc: Hannes Reinecke, Douglas Gilbert, Jessica Yu, Lucas De Marchi,
	linux-kernel

I see the first two patches in this series were applied. Great!

I also see this patch and the next one is not applied yet. Please
don't apply as I'll resend a new set now with some more enhancements
after finding out this issue is very generic and requires a kmod
solution. I'll post these patches soon.

 Luis

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-08-10 21:31 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-27 20:10 [PATCH 0/4] scsi_debug: improve failure rates Luis Chamberlain
2021-07-27 20:10 ` [PATCH 1/4] common/config: disable udevadm settle if CONFIG_NET is disabled Luis Chamberlain
2021-07-27 20:10 ` [PATCH 2/4] common/scsi_debug: use udevadm settle instead of sleeping Luis Chamberlain
2021-07-27 20:10 ` [PATCH 3/4] common/module: add a patient module rmmod Luis Chamberlain
2021-07-30  0:22   ` Luis Chamberlain
2021-08-10 21:31   ` Luis Chamberlain
2021-07-27 20:10 ` [PATCH 4/4] common/scsi_debug: use the patient module remover Luis Chamberlain

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.