All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] Marginal Path Fixes and cleanups
@ 2019-01-30 22:13 Benjamin Marzinski
  2019-01-30 22:13 ` [PATCH 1/4] libmultipath: handle existing paths in marginal_path enqueue Benjamin Marzinski
                   ` (4 more replies)
  0 siblings, 5 replies; 6+ messages in thread
From: Benjamin Marzinski @ 2019-01-30 22:13 UTC (permalink / raw)
  To: device-mapper development; +Cc: Guan Junxiong, Martin Wilck, Muneendra

I've been playing around with the marginal path code, and I've run into
a couple of things that could be cleaned up. The one important issue is
that the marginal paths code tries to enqueue the path for IO error
checking when it is down.  To enqueue the path, it needs to open the
device, which can fail because the path is down.  Instead, it should
wait for the path to come back up before it queues it, and starts the IO
error checking. This is fixed in patch 0003. The other patches are
mostly cleanups. The issues they fix aren't actually breaking things. 

Benjamin Marzinski (4):
  libmultipath: handle existing paths in marginal_path enqueue
  multipathd: cleanup marginal paths checking timers
  libmultipath: fix marginal paths queueing errors
  libmultipath: fix marginal_paths nr_active check

 libmultipath/io_err_stat.c | 71 +++++++++++++++-----------------------
 libmultipath/io_err_stat.h |  2 +-
 multipathd/main.c          |  3 +-
 3 files changed, 31 insertions(+), 45 deletions(-)

-- 
2.17.2

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/4] libmultipath: handle existing paths in marginal_path enqueue
  2019-01-30 22:13 [PATCH 0/4] Marginal Path Fixes and cleanups Benjamin Marzinski
@ 2019-01-30 22:13 ` Benjamin Marzinski
  2019-01-30 22:13 ` [PATCH 2/4] multipathd: cleanup marginal paths checking timers Benjamin Marzinski
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Benjamin Marzinski @ 2019-01-30 22:13 UTC (permalink / raw)
  To: device-mapper development; +Cc: Guan Junxiong, Martin Wilck, Muneendra

If the path that enqueue_io_err_stat_by_path() is trying to add
is already on the list, just return success. There's no reason
to fail in this case.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 libmultipath/io_err_stat.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c
index 02b1453..1cb3ffe 100644
--- a/libmultipath/io_err_stat.c
+++ b/libmultipath/io_err_stat.c
@@ -254,7 +254,6 @@ static void free_io_err_pathvec(struct io_err_stat_pathvec *p)
  * return value
  * 0: enqueue OK
  * 1: fails because of internal error
- * 2: fails because of existing already
  */
 static int enqueue_io_err_stat_by_path(struct path *path)
 {
@@ -264,7 +263,7 @@ static int enqueue_io_err_stat_by_path(struct path *path)
 	p = find_err_path_by_dev(paths->pathvec, path->dev);
 	if (p) {
 		pthread_mutex_unlock(&paths->mutex);
-		return 2;
+		return 0;
 	}
 	pthread_mutex_unlock(&paths->mutex);
 
@@ -418,9 +417,8 @@ int hit_io_err_recheck_time(struct path *pp)
 			io_err_stat_log(3, "%s: enqueue fails, to recover",
 					pp->dev);
 			goto recover;
-		} else if (!r) {
+		} else
 			pp->io_err_pathfail_cnt = PATH_IO_ERR_IN_CHECKING;
-		}
 	}
 
 	return 1;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/4] multipathd: cleanup marginal paths checking timers
  2019-01-30 22:13 [PATCH 0/4] Marginal Path Fixes and cleanups Benjamin Marzinski
  2019-01-30 22:13 ` [PATCH 1/4] libmultipath: handle existing paths in marginal_path enqueue Benjamin Marzinski
@ 2019-01-30 22:13 ` Benjamin Marzinski
  2019-01-30 22:13 ` [PATCH 3/4] libmultipath: fix marginal paths queueing errors Benjamin Marzinski
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 6+ messages in thread
From: Benjamin Marzinski @ 2019-01-30 22:13 UTC (permalink / raw)
  To: device-mapper development; +Cc: Guan Junxiong, Martin Wilck, Muneendra

When a path gets recovered in hit_io_err_recheck_time(), it will
continue running in check_path(), so there is no reason to schedule
another path check as soon as possible (since one is currently
happening).

Also, there isn't much point in restarting the io err stat checking when
the path is down, so hit_io_err_recheck_time() should only be run when
the path is up. Downed marginal paths can be treated just like any other
downed path.

Finally, there is no reason to set reset pp->io_err_dis_reinstate_time
when we decide to enqueue a path. Either th enqueue will fail and the
path will get recovered, or it will succeed, and we won't check the
reinstate time again until poll_io_err_stat() marks the path as needing
a requeue.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 libmultipath/io_err_stat.c | 8 --------
 multipathd/main.c          | 3 ++-
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c
index 1cb3ffe..416e13a 100644
--- a/libmultipath/io_err_stat.c
+++ b/libmultipath/io_err_stat.c
@@ -400,13 +400,6 @@ int hit_io_err_recheck_time(struct path *pp)
 		io_err_stat_log(4, "%s: reschedule checking after %d seconds",
 				pp->dev,
 				pp->mpp->marginal_path_err_recheck_gap_time);
-		/*
-		 * to reschedule io error checking again
-		 * if the path is good enough, we claim it is good
-		 * and can be reinsated as soon as possible in the
-		 * check_path routine.
-		 */
-		pp->io_err_dis_reinstate_time = curr_time.tv_sec;
 		r = enqueue_io_err_stat_by_path(pp);
 		/*
 		 * Enqueue fails because of internal error.
@@ -426,7 +419,6 @@ int hit_io_err_recheck_time(struct path *pp)
 recover:
 	pp->io_err_pathfail_cnt = 0;
 	pp->io_err_disable_reinstate = 0;
-	pp->tick = 1;
 	return 0;
 }
 
diff --git a/multipathd/main.c b/multipathd/main.c
index 491832b..cac9050 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -2075,7 +2075,8 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 		return 1;
 	}
 
-	if (pp->io_err_disable_reinstate && hit_io_err_recheck_time(pp)) {
+	if ((newstate == PATH_UP || newstate == PATH_GHOST) &&
+	    pp->io_err_disable_reinstate && hit_io_err_recheck_time(pp)) {
 		pp->state = PATH_SHAKY;
 		/*
 		 * to reschedule as soon as possible,so that this path can
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 3/4] libmultipath: fix marginal paths queueing errors
  2019-01-30 22:13 [PATCH 0/4] Marginal Path Fixes and cleanups Benjamin Marzinski
  2019-01-30 22:13 ` [PATCH 1/4] libmultipath: handle existing paths in marginal_path enqueue Benjamin Marzinski
  2019-01-30 22:13 ` [PATCH 2/4] multipathd: cleanup marginal paths checking timers Benjamin Marzinski
@ 2019-01-30 22:13 ` Benjamin Marzinski
  2019-01-30 22:13 ` [PATCH 4/4] libmultipath: fix marginal_paths nr_active check Benjamin Marzinski
  2019-02-01 14:47 ` [PATCH 0/4] Marginal Path Fixes and cleanups Martin Wilck
  4 siblings, 0 replies; 6+ messages in thread
From: Benjamin Marzinski @ 2019-01-30 22:13 UTC (permalink / raw)
  To: device-mapper development; +Cc: Guan Junxiong, Martin Wilck, Muneendra

The current marginal paths code tries to enqueue paths for io error
checking when multipathd receives a uevent on path failure. This can run
into a couple of problems. First, this uevent could happen before or
after multipathd actually fails the path, so simply checking nr_active
doesn't tell us if this is the last active path. Also, The code to fail
the path in enqueue_io_err_stat_by_path() doesn't ever update the path
state. This can cause the path to get failed twice, temporarily leading
to incorrect nr_active counts. Further, The point when multipathd should
care if this is the last active path is when the path has come up again,
not when it goes down. Lastly, if the path is down, it is often
impossible to open the path device, causing setup_directio_ctx() to
fail, which causes multipathd to skip io error checking and mark the
path as not marginal.

Instead, multipathd should just make sure that if the path is marginal,
it gets failed in the uevent, so as not to race with the checkerloop
thread. Then, when the path comes back up, check_path() can enqueue it,
just like it does for paths that need to get rechecked. To make it
obvious that the state PATH_IO_ERR_IN_POLLING_RECHECK and the function
hit_io_err_recheck_time() now apply to paths waiting to be enqueued for
the first time as well, I've also changed their names to
PATH_IO_ERR_WAITING_TO_CHECK and need_io_err_check(), respectively.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 libmultipath/io_err_stat.c | 55 +++++++++++++++++---------------------
 libmultipath/io_err_stat.h |  2 +-
 multipathd/main.c          |  2 +-
 3 files changed, 27 insertions(+), 32 deletions(-)

diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c
index 416e13a..72aacf3 100644
--- a/libmultipath/io_err_stat.c
+++ b/libmultipath/io_err_stat.c
@@ -41,7 +41,7 @@
 #define CONCUR_NR_EVENT			32
 
 #define PATH_IO_ERR_IN_CHECKING		-1
-#define PATH_IO_ERR_IN_POLLING_RECHECK	-2
+#define PATH_IO_ERR_WAITING_TO_CHECK	-2
 
 #define io_err_stat_log(prio, fmt, args...) \
 	condlog(prio, "io error statistic: " fmt, ##args)
@@ -283,24 +283,6 @@ static int enqueue_io_err_stat_by_path(struct path *path)
 	vector_set_slot(paths->pathvec, p);
 	pthread_mutex_unlock(&paths->mutex);
 
-	if (!path->io_err_disable_reinstate) {
-		/*
-		 *fail the path in the kernel for the time of the to make
-		 *the test more reliable
-		 */
-		io_err_stat_log(3, "%s: fail dm path %s before checking",
-				path->mpp->alias, path->dev);
-		path->io_err_disable_reinstate = 1;
-		dm_fail_path(path->mpp->alias, path->dev_t);
-		update_queue_mode_del_path(path->mpp);
-
-		/*
-		 * schedule path check as soon as possible to
-		 * update path state to delayed state
-		 */
-		path->tick = 1;
-
-	}
 	io_err_stat_log(2, "%s: enqueue path %s to check",
 			path->mpp->alias, path->dev);
 	return 0;
@@ -317,7 +299,6 @@ free_ioerr_path:
 int io_err_stat_handle_pathfail(struct path *path)
 {
 	struct timespec curr_time;
-	int res;
 
 	if (uatomic_read(&io_err_thread_running) == 0)
 		return 1;
@@ -332,8 +313,6 @@ int io_err_stat_handle_pathfail(struct path *path)
 
 	if (!path->mpp)
 		return 1;
-	if (path->mpp->nr_active <= 1)
-		return 1;
 	if (path->mpp->marginal_path_double_failed_time <= 0 ||
 		path->mpp->marginal_path_err_sample_time <= 0 ||
 		path->mpp->marginal_path_err_recheck_gap_time <= 0 ||
@@ -371,17 +350,33 @@ int io_err_stat_handle_pathfail(struct path *path)
 	}
 	path->io_err_pathfail_cnt++;
 	if (path->io_err_pathfail_cnt >= FLAKY_PATHFAIL_THRESHOLD) {
-		res = enqueue_io_err_stat_by_path(path);
-		if (!res)
-			path->io_err_pathfail_cnt = PATH_IO_ERR_IN_CHECKING;
-		else
-			path->io_err_pathfail_cnt = 0;
+		path->io_err_disable_reinstate = 1;
+		path->io_err_pathfail_cnt = PATH_IO_ERR_WAITING_TO_CHECK;
+		/* enqueue path as soon as it comes up */
+		path->io_err_dis_reinstate_time = 0;
+		if (path->state != PATH_DOWN) {
+			struct config *conf;
+			int oldstate = path->state;
+			int checkint;
+
+			conf = get_multipath_config();
+			checkint = conf->checkint;
+			put_multipath_config(conf);
+			io_err_stat_log(2, "%s: mark as failed", path->dev);
+			path->mpp->stat_path_failures++;
+			path->state = PATH_DOWN;
+			path->dmstate = PSTATE_FAILED;
+			if (oldstate == PATH_UP || oldstate == PATH_GHOST)
+				update_queue_mode_del_path(path->mpp);
+			if (path->tick > checkint)
+				path->tick = checkint;
+		}
 	}
 
 	return 0;
 }
 
-int hit_io_err_recheck_time(struct path *pp)
+int need_io_err_check(struct path *pp)
 {
 	struct timespec curr_time;
 	int r;
@@ -392,7 +387,7 @@ int hit_io_err_recheck_time(struct path *pp)
 		io_err_stat_log(2, "%s: recover path early", pp->dev);
 		goto recover;
 	}
-	if (pp->io_err_pathfail_cnt != PATH_IO_ERR_IN_POLLING_RECHECK)
+	if (pp->io_err_pathfail_cnt != PATH_IO_ERR_WAITING_TO_CHECK)
 		return 1;
 	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0 ||
 	    (curr_time.tv_sec - pp->io_err_dis_reinstate_time) >
@@ -489,7 +484,7 @@ static int poll_io_err_stat(struct vectors *vecs, struct io_err_stat_path *pp)
 	} else if (path->mpp && path->mpp->nr_active > 1) {
 		io_err_stat_log(3, "%s: keep failing the dm path %s",
 				path->mpp->alias, path->dev);
-		path->io_err_pathfail_cnt = PATH_IO_ERR_IN_POLLING_RECHECK;
+		path->io_err_pathfail_cnt = PATH_IO_ERR_WAITING_TO_CHECK;
 		path->io_err_disable_reinstate = 1;
 		path->io_err_dis_reinstate_time = currtime.tv_sec;
 		io_err_stat_log(3, "%s: disable reinstating of %s",
diff --git a/libmultipath/io_err_stat.h b/libmultipath/io_err_stat.h
index bbf31b4..53d6d7d 100644
--- a/libmultipath/io_err_stat.h
+++ b/libmultipath/io_err_stat.h
@@ -10,6 +10,6 @@ extern pthread_attr_t io_err_stat_attr;
 int start_io_err_stat_thread(void *data);
 void stop_io_err_stat_thread(void);
 int io_err_stat_handle_pathfail(struct path *path);
-int hit_io_err_recheck_time(struct path *pp);
+int need_io_err_check(struct path *pp);
 
 #endif /* _IO_ERR_STAT_H */
diff --git a/multipathd/main.c b/multipathd/main.c
index cac9050..0e3ac2c 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -2076,7 +2076,7 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 	}
 
 	if ((newstate == PATH_UP || newstate == PATH_GHOST) &&
-	    pp->io_err_disable_reinstate && hit_io_err_recheck_time(pp)) {
+	    pp->io_err_disable_reinstate && need_io_err_check(pp)) {
 		pp->state = PATH_SHAKY;
 		/*
 		 * to reschedule as soon as possible,so that this path can
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 4/4] libmultipath: fix marginal_paths nr_active check
  2019-01-30 22:13 [PATCH 0/4] Marginal Path Fixes and cleanups Benjamin Marzinski
                   ` (2 preceding siblings ...)
  2019-01-30 22:13 ` [PATCH 3/4] libmultipath: fix marginal paths queueing errors Benjamin Marzinski
@ 2019-01-30 22:13 ` Benjamin Marzinski
  2019-02-01 14:47 ` [PATCH 0/4] Marginal Path Fixes and cleanups Martin Wilck
  4 siblings, 0 replies; 6+ messages in thread
From: Benjamin Marzinski @ 2019-01-30 22:13 UTC (permalink / raw)
  To: device-mapper development; +Cc: Guan Junxiong, Martin Wilck, Muneendra

Marginal paths are SHAKY, so they don't count towards the number of
active paths. poll_io_err_stat() shouldn't automatically reinstate a
marginal path if there already is an active path.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 libmultipath/io_err_stat.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c
index 72aacf3..554b777 100644
--- a/libmultipath/io_err_stat.c
+++ b/libmultipath/io_err_stat.c
@@ -481,7 +481,7 @@ static int poll_io_err_stat(struct vectors *vecs, struct io_err_stat_path *pp)
 		 */
 		path->tick = 1;
 
-	} else if (path->mpp && path->mpp->nr_active > 1) {
+	} else if (path->mpp && path->mpp->nr_active > 0) {
 		io_err_stat_log(3, "%s: keep failing the dm path %s",
 				path->mpp->alias, path->dev);
 		path->io_err_pathfail_cnt = PATH_IO_ERR_WAITING_TO_CHECK;
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 0/4] Marginal Path Fixes and cleanups
  2019-01-30 22:13 [PATCH 0/4] Marginal Path Fixes and cleanups Benjamin Marzinski
                   ` (3 preceding siblings ...)
  2019-01-30 22:13 ` [PATCH 4/4] libmultipath: fix marginal_paths nr_active check Benjamin Marzinski
@ 2019-02-01 14:47 ` Martin Wilck
  4 siblings, 0 replies; 6+ messages in thread
From: Martin Wilck @ 2019-02-01 14:47 UTC (permalink / raw)
  To: Benjamin Marzinski, device-mapper development; +Cc: Muneendra, Guan Junxiong

On Wed, 2019-01-30 at 16:13 -0600, Benjamin Marzinski wrote:
> I've been playing around with the marginal path code, and I've run
> into
> a couple of things that could be cleaned up. The one important issue
> is
> that the marginal paths code tries to enqueue the path for IO error
> checking when it is down.  To enqueue the path, it needs to open the
> device, which can fail because the path is down.  Instead, it should
> wait for the path to come back up before it queues it, and starts the
> IO
> error checking. This is fixed in patch 0003. The other patches are
> mostly cleanups. The issues they fix aren't actually breaking
> things. 
> 
> Benjamin Marzinski (4):
>   libmultipath: handle existing paths in marginal_path enqueue
>   multipathd: cleanup marginal paths checking timers
>   libmultipath: fix marginal paths queueing errors
>   libmultipath: fix marginal_paths nr_active check
> 
>  libmultipath/io_err_stat.c | 71 +++++++++++++++---------------------
> --
>  libmultipath/io_err_stat.h |  2 +-
>  multipathd/main.c          |  3 +-
>  3 files changed, 31 insertions(+), 45 deletions(-)
> 

For the set:

Reviewed-by: Martin Wilck <mwilck@suse.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2019-02-01 14:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-30 22:13 [PATCH 0/4] Marginal Path Fixes and cleanups Benjamin Marzinski
2019-01-30 22:13 ` [PATCH 1/4] libmultipath: handle existing paths in marginal_path enqueue Benjamin Marzinski
2019-01-30 22:13 ` [PATCH 2/4] multipathd: cleanup marginal paths checking timers Benjamin Marzinski
2019-01-30 22:13 ` [PATCH 3/4] libmultipath: fix marginal paths queueing errors Benjamin Marzinski
2019-01-30 22:13 ` [PATCH 4/4] libmultipath: fix marginal_paths nr_active check Benjamin Marzinski
2019-02-01 14:47 ` [PATCH 0/4] Marginal Path Fixes and cleanups Martin Wilck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.