* [PATCH V4 0/2] multipath-tools: intermittent IO error accounting to improve reliability
@ 2017-09-17  3:40 Guan Junxiong
  2017-09-17  3:40 ` [PATCH V4 1/2] " Guan Junxiong
  2017-09-17  3:40 ` [PATCH V4 2/2] multipath-tools: discard san_path_err_XXX feature Guan Junxiong
  0 siblings, 2 replies; 17+ messages in thread
From: Guan Junxiong @ 2017-09-17  3:40 UTC (permalink / raw)
  To: dm-devel, christophe.varoqui, mwilck
  Cc: guanjunxiong, chengjike.cheng, mmandala, niuhaoxin, shenhong09

Hi ALL,

This patchset adds a new method of path state checking based on accounting
IO errors. This is useful in many scenarios, such as intermittent IO errors
on a path due to network congestion, or a shaky link.

PATCH 1/2 implements the algorithm, which sends a batch of continuous IOs
at a fixed rate of 10 Hz.
PATCH 2/2 discards the original algorithm for the following reason:
the sample interval of the path checkers is so coarse that it cannot see
what happens in the middle of the sample interval. PATCH 1/2 provides a
better method.


Changes from V3:
* discard the 
* fail the path in the kernel before enqueueing the path for checking
  rather than after knowing the checking result to make it more
  reliable. (Martin)
* use posix_memalign instead of manual alignment for the direct IO buffer. (Martin)
* use PATH_MAX rather than FILE_NAME_SIZE when opening a file, to avoid
  a compiler warning. (Martin)
* discard an unnecessary sanity check when getting the block size (Martin)
* do not return 0 in send_each_async_io if io_starttime of a path is
  not set (Martin)
* wait 10 ms instead of 60 seconds if every path is down. (Martin)
* rename handle_async_io_timeout to poll_async_io_timeout and use a polling
  method, because io_getevents does not return 0 when there are both
  timed-out and normal IOs.
* rename hit_io_err_recover_time to hit_io_err_recheck_time
* update multipath.conf.5 and the commit message to keep them in sync
  with the above changes


Changes from V2:
* fix unconditional rescheduling forever
* use scripts/checkpatch.pl from Linux to clean up informal coding style
* fix "continous" and "internel" typos


Changes from V1:
* send continuous IOs instead of a single IO in a sample interval (Martin)
* when recover time expires, we reschedule the checking process (Hannes)
* use the error-rate threshold as a per-mille value instead of an IO count (Martin)
* Use a common io_context for libaio for all paths (Martin)
* Other small fixes (Martin)






Junxiong Guan (2):
  multipath-tools: intermittent IO error accounting to improve
    reliability
  multipath-tools: discard san_path_err_XXX feature

 libmultipath/Makefile      |   5 +-
 libmultipath/config.c      |   3 -
 libmultipath/config.h      |  18 +-
 libmultipath/configure.c   |   6 +-
 libmultipath/dict.c        |  74 ++---
 libmultipath/io_err_stat.c | 743 +++++++++++++++++++++++++++++++++++++++++++++
 libmultipath/io_err_stat.h |  15 +
 libmultipath/propsel.c     |  54 ++--
 libmultipath/propsel.h     |   6 +-
 libmultipath/structs.h     |  14 +-
 libmultipath/uevent.c      |  32 ++
 libmultipath/uevent.h      |   2 +
 multipath/multipath.conf.5 |  62 ++--
 multipathd/main.c          | 130 ++++----
 14 files changed, 971 insertions(+), 193 deletions(-)
 create mode 100644 libmultipath/io_err_stat.c
 create mode 100644 libmultipath/io_err_stat.h

-- 
2.11.1


* [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-17  3:40 [PATCH V4 0/2] multipath-tools: intermittent IO error accounting to improve reliability Guan Junxiong
@ 2017-09-17  3:40 ` Guan Junxiong
  2017-09-18 12:53   ` Muneendra Kumar M
  2017-09-17  3:40 ` [PATCH V4 2/2] multipath-tools: discard san_path_err_XXX feature Guan Junxiong
  1 sibling, 1 reply; 17+ messages in thread
From: Guan Junxiong @ 2017-09-17  3:40 UTC (permalink / raw)
  To: dm-devel, christophe.varoqui, mwilck
  Cc: guanjunxiong, chengjike.cheng, mmandala, niuhaoxin, shenhong09

This patch adds a new method of path state checking based on accounting
IO errors. This is useful in many scenarios, such as intermittent IO errors
on a path due to network congestion, or a shaky link.

Three parameters are added for the admin: "path_io_err_sample_time",
"path_io_err_rate_threshold" and "path_io_err_recovery_time".
If path_io_err_sample_time is set to a value no less than 120 and
path_io_err_recovery_time is set to a value greater than 0, then when
path-failing events occur twice within 60 seconds due to an IO error,
multipathd will fail the path and enqueue it into a queue, each member
of which is sent continuous direct-read asynchronous IOs at a fixed
sample rate of 10 Hz. The IO accounting process for a path lasts for
path_io_err_sample_time. If the IO error rate on a particular path is
greater than path_io_err_rate_threshold, the path will not be
reinstated for path_io_err_recovery_time seconds, unless it is the
only active path.

When path_io_err_recovery_time expires, we reschedule the IO error
checking process. If the path is good enough, we declare it good.

This helps place a path in a delayed state if it hits a lot of
intermittent IO errors due to network/target issues, isolating the
degraded path and allowing the admin to rectify the errors on it.
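For illustration, the three knobs could be set in multipath.conf like this (the values are examples only, not recommendations from this patch; 120 is the minimum sample time the code accepts, i.e. twice the 60-second IO timeout):

```
defaults {
	path_io_err_sample_time     120
	path_io_err_rate_threshold  10
	path_io_err_recovery_time   60
}
```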

Signed-off-by: Junxiong Guan <guanjunxiong@huawei.com>
---
 libmultipath/Makefile      |   5 +-
 libmultipath/config.h      |   9 +
 libmultipath/configure.c   |   3 +
 libmultipath/dict.c        |  41 +++
 libmultipath/io_err_stat.c | 743 +++++++++++++++++++++++++++++++++++++++++++++
 libmultipath/io_err_stat.h |  15 +
 libmultipath/propsel.c     |  53 ++++
 libmultipath/propsel.h     |   3 +
 libmultipath/structs.h     |   7 +
 libmultipath/uevent.c      |  32 ++
 libmultipath/uevent.h      |   2 +
 multipath/multipath.conf.5 |  65 ++++
 multipathd/main.c          |  56 ++++
 13 files changed, 1032 insertions(+), 2 deletions(-)
 create mode 100644 libmultipath/io_err_stat.c
 create mode 100644 libmultipath/io_err_stat.h

diff --git a/libmultipath/Makefile b/libmultipath/Makefile
index b3244fc7..dce73afe 100644
--- a/libmultipath/Makefile
+++ b/libmultipath/Makefile
@@ -9,7 +9,7 @@ LIBS = $(DEVLIB).$(SONAME)
 
 CFLAGS += $(LIB_CFLAGS) -I$(mpathcmddir)
 
-LIBDEPS += -lpthread -ldl -ldevmapper -ludev -L$(mpathcmddir) -lmpathcmd -lurcu
+LIBDEPS += -lpthread -ldl -ldevmapper -ludev -L$(mpathcmddir) -lmpathcmd -lurcu -laio
 
 ifdef SYSTEMD
 	CFLAGS += -DUSE_SYSTEMD=$(SYSTEMD)
@@ -42,7 +42,8 @@ OBJS = memory.o parser.o vector.o devmapper.o callout.o \
 	pgpolicies.o debug.o defaults.o uevent.o time-util.o \
 	switchgroup.o uxsock.o print.o alias.o log_pthread.o \
 	log.o configure.o structs_vec.o sysfs.o prio.o checkers.o \
-	lock.o waiter.o file.o wwids.o prioritizers/alua_rtpg.o
+	lock.o waiter.o file.o wwids.o prioritizers/alua_rtpg.o \
+	io_err_stat.o
 
 all: $(LIBS)
 
diff --git a/libmultipath/config.h b/libmultipath/config.h
index ffc69b5f..215d29e9 100644
--- a/libmultipath/config.h
+++ b/libmultipath/config.h
@@ -75,6 +75,9 @@ struct hwentry {
 	int san_path_err_threshold;
 	int san_path_err_forget_rate;
 	int san_path_err_recovery_time;
+	int path_io_err_sample_time;
+	int path_io_err_rate_threshold;
+	int path_io_err_recovery_time;
 	int skip_kpartx;
 	int max_sectors_kb;
 	char * bl_product;
@@ -106,6 +109,9 @@ struct mpentry {
 	int san_path_err_threshold;
 	int san_path_err_forget_rate;
 	int san_path_err_recovery_time;
+	int path_io_err_sample_time;
+	int path_io_err_rate_threshold;
+	int path_io_err_recovery_time;
 	int skip_kpartx;
 	int max_sectors_kb;
 	uid_t uid;
@@ -155,6 +161,9 @@ struct config {
 	int san_path_err_threshold;
 	int san_path_err_forget_rate;
 	int san_path_err_recovery_time;
+	int path_io_err_sample_time;
+	int path_io_err_rate_threshold;
+	int path_io_err_recovery_time;
 	int uxsock_timeout;
 	int strict_timing;
 	int retrigger_tries;
diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 74b6f52a..81dc97d9 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -298,6 +298,9 @@ int setup_map(struct multipath *mpp, char *params, int params_size)
 	select_san_path_err_threshold(conf, mpp);
 	select_san_path_err_forget_rate(conf, mpp);
 	select_san_path_err_recovery_time(conf, mpp);
+	select_path_io_err_sample_time(conf, mpp);
+	select_path_io_err_rate_threshold(conf, mpp);
+	select_path_io_err_recovery_time(conf, mpp);
 	select_skip_kpartx(conf, mpp);
 	select_max_sectors_kb(conf, mpp);
 
diff --git a/libmultipath/dict.c b/libmultipath/dict.c
index 9dc10904..18b1fdb1 100644
--- a/libmultipath/dict.c
+++ b/libmultipath/dict.c
@@ -1108,6 +1108,35 @@ declare_hw_handler(san_path_err_recovery_time, set_off_int_undef)
 declare_hw_snprint(san_path_err_recovery_time, print_off_int_undef)
 declare_mp_handler(san_path_err_recovery_time, set_off_int_undef)
 declare_mp_snprint(san_path_err_recovery_time, print_off_int_undef)
+declare_def_handler(path_io_err_sample_time, set_off_int_undef)
+declare_def_snprint_defint(path_io_err_sample_time, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(path_io_err_sample_time, set_off_int_undef)
+declare_ovr_snprint(path_io_err_sample_time, print_off_int_undef)
+declare_hw_handler(path_io_err_sample_time, set_off_int_undef)
+declare_hw_snprint(path_io_err_sample_time, print_off_int_undef)
+declare_mp_handler(path_io_err_sample_time, set_off_int_undef)
+declare_mp_snprint(path_io_err_sample_time, print_off_int_undef)
+declare_def_handler(path_io_err_rate_threshold, set_off_int_undef)
+declare_def_snprint_defint(path_io_err_rate_threshold, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(path_io_err_rate_threshold, set_off_int_undef)
+declare_ovr_snprint(path_io_err_rate_threshold, print_off_int_undef)
+declare_hw_handler(path_io_err_rate_threshold, set_off_int_undef)
+declare_hw_snprint(path_io_err_rate_threshold, print_off_int_undef)
+declare_mp_handler(path_io_err_rate_threshold, set_off_int_undef)
+declare_mp_snprint(path_io_err_rate_threshold, print_off_int_undef)
+declare_def_handler(path_io_err_recovery_time, set_off_int_undef)
+declare_def_snprint_defint(path_io_err_recovery_time, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(path_io_err_recovery_time, set_off_int_undef)
+declare_ovr_snprint(path_io_err_recovery_time, print_off_int_undef)
+declare_hw_handler(path_io_err_recovery_time, set_off_int_undef)
+declare_hw_snprint(path_io_err_recovery_time, print_off_int_undef)
+declare_mp_handler(path_io_err_recovery_time, set_off_int_undef)
+declare_mp_snprint(path_io_err_recovery_time, print_off_int_undef)
+
+
 static int
 def_uxsock_timeout_handler(struct config *conf, vector strvec)
 {
@@ -1443,6 +1472,9 @@ init_keywords(vector keywords)
 	install_keyword("san_path_err_threshold", &def_san_path_err_threshold_handler, &snprint_def_san_path_err_threshold);
 	install_keyword("san_path_err_forget_rate", &def_san_path_err_forget_rate_handler, &snprint_def_san_path_err_forget_rate);
 	install_keyword("san_path_err_recovery_time", &def_san_path_err_recovery_time_handler, &snprint_def_san_path_err_recovery_time);
+	install_keyword("path_io_err_sample_time", &def_path_io_err_sample_time_handler, &snprint_def_path_io_err_sample_time);
+	install_keyword("path_io_err_rate_threshold", &def_path_io_err_rate_threshold_handler, &snprint_def_path_io_err_rate_threshold);
+	install_keyword("path_io_err_recovery_time", &def_path_io_err_recovery_time_handler, &snprint_def_path_io_err_recovery_time);
 
 	install_keyword("find_multipaths", &def_find_multipaths_handler, &snprint_def_find_multipaths);
 	install_keyword("uxsock_timeout", &def_uxsock_timeout_handler, &snprint_def_uxsock_timeout);
@@ -1530,6 +1562,9 @@ init_keywords(vector keywords)
 	install_keyword("san_path_err_threshold", &hw_san_path_err_threshold_handler, &snprint_hw_san_path_err_threshold);
 	install_keyword("san_path_err_forget_rate", &hw_san_path_err_forget_rate_handler, &snprint_hw_san_path_err_forget_rate);
 	install_keyword("san_path_err_recovery_time", &hw_san_path_err_recovery_time_handler, &snprint_hw_san_path_err_recovery_time);
+	install_keyword("path_io_err_sample_time", &hw_path_io_err_sample_time_handler, &snprint_hw_path_io_err_sample_time);
+	install_keyword("path_io_err_rate_threshold", &hw_path_io_err_rate_threshold_handler, &snprint_hw_path_io_err_rate_threshold);
+	install_keyword("path_io_err_recovery_time", &hw_path_io_err_recovery_time_handler, &snprint_hw_path_io_err_recovery_time);
 	install_keyword("skip_kpartx", &hw_skip_kpartx_handler, &snprint_hw_skip_kpartx);
 	install_keyword("max_sectors_kb", &hw_max_sectors_kb_handler, &snprint_hw_max_sectors_kb);
 	install_sublevel_end();
@@ -1563,6 +1598,9 @@ init_keywords(vector keywords)
 	install_keyword("san_path_err_threshold", &ovr_san_path_err_threshold_handler, &snprint_ovr_san_path_err_threshold);
 	install_keyword("san_path_err_forget_rate", &ovr_san_path_err_forget_rate_handler, &snprint_ovr_san_path_err_forget_rate);
 	install_keyword("san_path_err_recovery_time", &ovr_san_path_err_recovery_time_handler, &snprint_ovr_san_path_err_recovery_time);
+	install_keyword("path_io_err_sample_time", &ovr_path_io_err_sample_time_handler, &snprint_ovr_path_io_err_sample_time);
+	install_keyword("path_io_err_rate_threshold", &ovr_path_io_err_rate_threshold_handler, &snprint_ovr_path_io_err_rate_threshold);
+	install_keyword("path_io_err_recovery_time", &ovr_path_io_err_recovery_time_handler, &snprint_ovr_path_io_err_recovery_time);
 
 	install_keyword("skip_kpartx", &ovr_skip_kpartx_handler, &snprint_ovr_skip_kpartx);
 	install_keyword("max_sectors_kb", &ovr_max_sectors_kb_handler, &snprint_ovr_max_sectors_kb);
@@ -1595,6 +1633,9 @@ init_keywords(vector keywords)
 	install_keyword("san_path_err_threshold", &mp_san_path_err_threshold_handler, &snprint_mp_san_path_err_threshold);
 	install_keyword("san_path_err_forget_rate", &mp_san_path_err_forget_rate_handler, &snprint_mp_san_path_err_forget_rate);
 	install_keyword("san_path_err_recovery_time", &mp_san_path_err_recovery_time_handler, &snprint_mp_san_path_err_recovery_time);
+	install_keyword("path_io_err_sample_time", &mp_path_io_err_sample_time_handler, &snprint_mp_path_io_err_sample_time);
+	install_keyword("path_io_err_rate_threshold", &mp_path_io_err_rate_threshold_handler, &snprint_mp_path_io_err_rate_threshold);
+	install_keyword("path_io_err_recovery_time", &mp_path_io_err_recovery_time_handler, &snprint_mp_path_io_err_recovery_time);
 	install_keyword("skip_kpartx", &mp_skip_kpartx_handler, &snprint_mp_skip_kpartx);
 	install_keyword("max_sectors_kb", &mp_max_sectors_kb_handler, &snprint_mp_max_sectors_kb);
 	install_sublevel_end();
diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c
new file mode 100644
index 00000000..088e3354
--- /dev/null
+++ b/libmultipath/io_err_stat.c
@@ -0,0 +1,743 @@
+/*
+ * (C) Copyright HUAWEI Technology Corp. 2017, All Rights Reserved.
+ *
+ * io_err_stat.c
+ * version 1.0
+ *
+ * IO error statistics processing for path failure events from the kernel
+ *
+ * Author(s): Guan Junxiong 2017 <guanjunxiong@huawei.com>
+ *
+ * This file is released under the GPL version 2, or any later version.
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <signal.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <libaio.h>
+#include <errno.h>
+#include <sys/mman.h>
+
+#include "vector.h"
+#include "memory.h"
+#include "checkers.h"
+#include "config.h"
+#include "structs.h"
+#include "structs_vec.h"
+#include "devmapper.h"
+#include "debug.h"
+#include "lock.h"
+#include "time-util.h"
+#include "io_err_stat.h"
+
+#define IOTIMEOUT_SEC			60
+#define TIMEOUT_NO_IO_NSEC		10000000 /*10ms = 10000000ns*/
+#define FLAKY_PATHFAIL_THRESHOLD	2
+#define FLAKY_PATHFAIL_TIME_FRAME	60
+#define CONCUR_NR_EVENT			32
+
+#define PATH_IO_ERR_IN_CHECKING		-1
+#define PATH_IO_ERR_IN_POLLING_RECHECK	-2
+
+#define io_err_stat_log(prio, fmt, args...) \
+	condlog(prio, "io error statistic: " fmt, ##args)
+
+
+struct io_err_stat_pathvec {
+	pthread_mutex_t mutex;
+	vector		pathvec;
+};
+
+struct dio_ctx {
+	struct timespec	io_starttime;
+	int		blksize;
+	void		*buf;
+	struct iocb	io;
+};
+
+struct io_err_stat_path {
+	char		devname[FILE_NAME_SIZE];
+	int		fd;
+	struct dio_ctx	*dio_ctx_array;
+	int		io_err_nr;
+	int		io_nr;
+	struct timespec	start_time;
+
+	int		total_time;
+	int		err_rate_threshold;
+};
+
+pthread_t		io_err_stat_thr;
+pthread_attr_t		io_err_stat_attr;
+
+static struct io_err_stat_pathvec *paths;
+struct vectors *vecs;
+io_context_t	ioctx;
+
+static void cancel_inflight_io(struct io_err_stat_path *pp);
+
+static void rcu_unregister(void *param)
+{
+	rcu_unregister_thread();
+}
+
+struct io_err_stat_path *find_err_path_by_dev(vector pathvec, char *dev)
+{
+	int i;
+	struct io_err_stat_path *pp;
+
+	if (!pathvec)
+		return NULL;
+	vector_foreach_slot(pathvec, pp, i)
+		if (!strcmp(pp->devname, dev))
+			return pp;
+
+	io_err_stat_log(4, "%s: not found in check queue", dev);
+
+	return NULL;
+}
+
+static int init_each_dio_ctx(struct dio_ctx *ct, int blksize,
+		unsigned long pgsize)
+{
+	ct->blksize = blksize;
+	if (posix_memalign(&ct->buf, pgsize, blksize))
+		return 1;
+	memset(ct->buf, 0, blksize);
+	ct->io_starttime.tv_sec = 0;
+	ct->io_starttime.tv_nsec = 0;
+
+	return 0;
+}
+
+static void deinit_each_dio_ctx(struct dio_ctx *ct)
+{
+	if (ct->buf)
+		free(ct->buf);
+}
+
+static int setup_directio_ctx(struct io_err_stat_path *p)
+{
+	unsigned long pgsize = getpagesize();
+	char fpath[PATH_MAX];
+	int blksize = 0;
+	int i;
+
+	if (snprintf(fpath, PATH_MAX, "/dev/%s", p->devname) >= PATH_MAX)
+		return 1;
+	if (p->fd < 0)
+		p->fd = open(fpath, O_RDONLY | O_DIRECT);
+	if (p->fd < 0)
+		return 1;
+
+	p->dio_ctx_array = MALLOC(sizeof(struct dio_ctx) * CONCUR_NR_EVENT);
+	if (!p->dio_ctx_array)
+		goto fail_close;
+
+	if (ioctl(p->fd, BLKBSZGET, &blksize) < 0) {
+		io_err_stat_log(4, "%s: cannot get blocksize, using default 512",
+				p->devname);
+		blksize = 512;
+	}
+	if (!blksize)
+		goto free_pdctx;
+
+	for (i = 0; i < CONCUR_NR_EVENT; i++) {
+		if (init_each_dio_ctx(p->dio_ctx_array + i, blksize, pgsize))
+			goto deinit;
+	}
+	return 0;
+
+deinit:
+	for (i = 0; i < CONCUR_NR_EVENT; i++)
+		deinit_each_dio_ctx(p->dio_ctx_array + i);
+free_pdctx:
+	FREE(p->dio_ctx_array);
+fail_close:
+	close(p->fd);
+
+	return 1;
+}
+
+static void destroy_directio_ctx(struct io_err_stat_path *p)
+{
+	int i;
+
+	if (!p || !p->dio_ctx_array)
+		return;
+	cancel_inflight_io(p);
+
+	for (i = 0; i < CONCUR_NR_EVENT; i++)
+		deinit_each_dio_ctx(p->dio_ctx_array + i);
+	FREE(p->dio_ctx_array);
+
+	if (p->fd > 0)
+		close(p->fd);
+}
+
+static struct io_err_stat_path *alloc_io_err_stat_path(void)
+{
+	struct io_err_stat_path *p;
+
+	p = (struct io_err_stat_path *)MALLOC(sizeof(*p));
+	if (!p)
+		return NULL;
+
+	memset(p->devname, 0, sizeof(p->devname));
+	p->io_err_nr = 0;
+	p->io_nr = 0;
+	p->total_time = 0;
+	p->start_time.tv_sec = 0;
+	p->start_time.tv_nsec = 0;
+	p->err_rate_threshold = 0;
+	p->fd = -1;
+
+	return p;
+}
+
+static void free_io_err_stat_path(struct io_err_stat_path *p)
+{
+	FREE(p);
+}
+
+static struct io_err_stat_pathvec *alloc_pathvec(void)
+{
+	struct io_err_stat_pathvec *p;
+	int r;
+
+	p = (struct io_err_stat_pathvec *)MALLOC(sizeof(*p));
+	if (!p)
+		return NULL;
+	p->pathvec = vector_alloc();
+	if (!p->pathvec)
+		goto out_free_struct_pathvec;
+	r = pthread_mutex_init(&p->mutex, NULL);
+	if (r)
+		goto out_free_member_pathvec;
+
+	return p;
+
+out_free_member_pathvec:
+	vector_free(p->pathvec);
+out_free_struct_pathvec:
+	FREE(p);
+	return NULL;
+}
+
+static void free_io_err_pathvec(struct io_err_stat_pathvec *p)
+{
+	struct io_err_stat_path *path;
+	int i;
+
+	if (!p)
+		return;
+	pthread_mutex_destroy(&p->mutex);
+	if (p->pathvec) {
+		vector_foreach_slot(p->pathvec, path, i) {
+			destroy_directio_ctx(path);
+			free_io_err_stat_path(path);
+		}
+		vector_free(p->pathvec);
+	}
+	FREE(p);
+}
+
+/*
+ * return value
+ * 0: enqueue OK
+ * 1: failure because of an internal error
+ * 2: failure because the path already exists in the queue
+ */
+static int enqueue_io_err_stat_by_path(struct path *path)
+{
+	struct io_err_stat_path *p;
+
+	pthread_mutex_lock(&paths->mutex);
+	p = find_err_path_by_dev(paths->pathvec, path->dev);
+	if (p) {
+		pthread_mutex_unlock(&paths->mutex);
+		return 2;
+	}
+	pthread_mutex_unlock(&paths->mutex);
+
+	p = alloc_io_err_stat_path();
+	if (!p)
+		return 1;
+
+	memcpy(p->devname, path->dev, sizeof(p->devname));
+	p->total_time = path->mpp->path_io_err_sample_time;
+	p->err_rate_threshold = path->mpp->path_io_err_rate_threshold;
+
+	if (setup_directio_ctx(p))
+		goto free_ioerr_path;
+	pthread_mutex_lock(&paths->mutex);
+	if (!vector_alloc_slot(paths->pathvec))
+		goto unlock_destroy;
+	vector_set_slot(paths->pathvec, p);
+	pthread_mutex_unlock(&paths->mutex);
+
+	if (!path->io_err_disable_reinstate) {
+		/*
+		 * fail the path in the kernel for the duration of the test
+		 * to make the test more reliable
+		 */
+		io_err_stat_log(3, "%s: fail dm path %s before checking",
+				path->mpp->alias, path->dev);
+		path->io_err_disable_reinstate = 1;
+		dm_fail_path(path->mpp->alias, path->dev_t);
+		update_queue_mode_del_path(path->mpp);
+
+		/*
+		 * schedule path check as soon as possible to
+		 * update path state to delayed state
+		 */
+		path->tick = 1;
+
+	}
+	io_err_stat_log(2, "%s: enqueue path %s to check",
+			path->mpp->alias, path->dev);
+	return 0;
+
+unlock_destroy:
+	pthread_mutex_unlock(&paths->mutex);
+	destroy_directio_ctx(p);
+free_ioerr_path:
+	free_io_err_stat_path(p);
+
+	return 1;
+}
+
+int io_err_stat_handle_pathfail(struct path *path)
+{
+	struct timespec curr_time;
+	int res;
+
+	if (path->io_err_disable_reinstate) {
+		io_err_stat_log(3, "%s: reinstate is already disabled",
+				path->dev);
+		return 1;
+	}
+	if (path->io_err_pathfail_cnt < 0)
+		return 1;
+
+	if (!path->mpp)
+		return 1;
+	if (path->mpp->nr_active <= 1)
+		return 1;
+	if (path->mpp->path_io_err_sample_time <= 0 ||
+		path->mpp->path_io_err_recovery_time <= 0 ||
+		path->mpp->path_io_err_rate_threshold < 0) {
+		io_err_stat_log(4, "%s: parameter not set", path->mpp->alias);
+		return 1;
+	}
+	if (path->mpp->path_io_err_sample_time < (2 * IOTIMEOUT_SEC)) {
+		io_err_stat_log(2, "%s: path_io_err_sample_time should not be less than %d",
+				path->mpp->alias, 2 * IOTIMEOUT_SEC);
+		return 1;
+	}
+	/*
+	 * The test should only be started for paths that have failed
+	 * repeatedly in a certain time frame, so that we have reason
+	 * to assume they're flaky. Without bothering the admin to configure
+	 * the repeated count threshold and time frame, we assume a path
+	 * which fails at least twice within 60 seconds is flaky.
+	 */
+	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
+		return 1;
+	if (path->io_err_pathfail_cnt == 0) {
+		path->io_err_pathfail_cnt++;
+		path->io_err_pathfail_starttime = curr_time.tv_sec;
+		io_err_stat_log(5, "%s: start path flakiness pre-checking",
+				path->dev);
+		return 0;
+	}
+	if ((curr_time.tv_sec - path->io_err_pathfail_starttime) >
+			FLAKY_PATHFAIL_TIME_FRAME) {
+		path->io_err_pathfail_cnt = 0;
+		path->io_err_pathfail_starttime = curr_time.tv_sec;
+		io_err_stat_log(5, "%s: restart path flakiness pre-checking",
+				path->dev);
+	}
+	path->io_err_pathfail_cnt++;
+	if (path->io_err_pathfail_cnt >= FLAKY_PATHFAIL_THRESHOLD) {
+		res = enqueue_io_err_stat_by_path(path);
+		if (!res)
+			path->io_err_pathfail_cnt = PATH_IO_ERR_IN_CHECKING;
+		else
+			path->io_err_pathfail_cnt = 0;
+	}
+
+	return 0;
+}
+
+int hit_io_err_recheck_time(struct path *pp)
+{
+	struct timespec curr_time;
+	int r;
+
+	if (pp->io_err_disable_reinstate == 0)
+		return 1;
+	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
+		return 1;
+	if (pp->io_err_pathfail_cnt != PATH_IO_ERR_IN_POLLING_RECHECK)
+		return 1;
+	if (pp->mpp->nr_active <= 0) {
+		io_err_stat_log(2, "%s: recover path early", pp->dev);
+		goto recover;
+	}
+	if ((curr_time.tv_sec - pp->io_err_dis_reinstate_time) >
+			pp->mpp->path_io_err_recovery_time) {
+		io_err_stat_log(4, "%s: reschedule checking after %d seconds",
+				pp->dev, pp->mpp->path_io_err_sample_time);
+		/*
+		 * Reschedule the IO error checking again.
+		 * If the path is good enough, we claim it is good
+		 * and it can be reinstated as soon as possible in the
+		 * check_path routine.
+		 */
+		pp->io_err_dis_reinstate_time = curr_time.tv_sec;
+		r = enqueue_io_err_stat_by_path(pp);
+		/*
+		 * If enqueueing fails because of an internal error,
+		 * we recover this path.
+		 * Otherwise, return 1 to set the path state to PATH_DELAYED.
+		 */
+		if (r == 1) {
+			io_err_stat_log(3, "%s: enqueue fails, to recover",
+					pp->dev);
+			goto recover;
+		}
+		else if (!r) {
+			pp->io_err_pathfail_cnt = PATH_IO_ERR_IN_CHECKING;
+		}
+	}
+
+	return 1;
+
+recover:
+	pp->io_err_pathfail_cnt = 0;
+	pp->io_err_disable_reinstate = 0;
+	pp->tick = 1;
+	return 0;
+}
+
+static int delete_io_err_stat_by_addr(struct io_err_stat_path *p)
+{
+	int i;
+
+	i = find_slot(paths->pathvec, p);
+	if (i != -1)
+		vector_del_slot(paths->pathvec, i);
+
+	destroy_directio_ctx(p);
+	free_io_err_stat_path(p);
+
+	return 0;
+}
+
+static void account_async_io_state(struct io_err_stat_path *pp, int rc)
+{
+	switch (rc) {
+	case PATH_DOWN:
+	case PATH_TIMEOUT:
+		pp->io_err_nr++;
+		break;
+	case PATH_UNCHECKED:
+	case PATH_UP:
+	case PATH_PENDING:
+		break;
+	default:
+		break;
+	}
+}
+
+static int poll_io_err_stat(struct vectors *vecs, struct io_err_stat_path *pp)
+{
+	struct timespec currtime, difftime;
+	struct path *path;
+	double err_rate;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &currtime) != 0)
+		return 1;
+	timespecsub(&currtime, &pp->start_time, &difftime);
+	if (difftime.tv_sec < pp->total_time)
+		return 0;
+
+	io_err_stat_log(4, "check end for %s", pp->devname);
+
+	err_rate = pp->io_nr == 0 ? 0 : (pp->io_err_nr * 1000.0f) / pp->io_nr;
+	io_err_stat_log(5, "%s: IO error rate (%.1f/1000)",
+			pp->devname, err_rate);
+	pthread_cleanup_push(cleanup_lock, &vecs->lock);
+	lock(&vecs->lock);
+	pthread_testcancel();
+	path = find_path_by_dev(vecs->pathvec, pp->devname);
+	if (!path) {
+		io_err_stat_log(4, "path %s not found", pp->devname);
+	} else if (err_rate <= pp->err_rate_threshold) {
+		path->io_err_pathfail_cnt = 0;
+		path->io_err_disable_reinstate = 0;
+		io_err_stat_log(4, "%s: (%d/%d) good to enable reinstating",
+				pp->devname, pp->io_err_nr, pp->io_nr);
+		/*
+		 * schedule path check as soon as possible to
+		 * update path state. Do NOT reinstate dm path here
+		 */
+		path->tick = 1;
+
+	} else if (path->mpp && path->mpp->nr_active > 1) {
+		io_err_stat_log(3, "%s: keep failing dm path %s",
+				path->mpp->alias, path->dev);
+		path->io_err_pathfail_cnt = PATH_IO_ERR_IN_POLLING_RECHECK;
+		path->io_err_disable_reinstate = 1;
+		path->io_err_dis_reinstate_time = currtime.tv_sec;
+		io_err_stat_log(3, "%s: to disable %s to reinstate",
+				path->mpp->alias, path->dev);
+	} else {
+		path->io_err_pathfail_cnt = 0;
+		path->io_err_disable_reinstate = 0;
+		io_err_stat_log(4, "%s: this is an orphan path, enable reinstating",
+				pp->devname);
+	}
+	lock_cleanup_pop(vecs->lock);
+
+	delete_io_err_stat_by_addr(pp);
+
+	return 0;
+}
+
+static int send_each_async_io(struct dio_ctx *ct, int fd, char *dev)
+{
+	int rc = -1;
+
+	if (ct->io_starttime.tv_nsec == 0 &&
+			ct->io_starttime.tv_sec == 0) {
+		struct iocb *ios[1] = { &ct->io };
+
+		if (clock_gettime(CLOCK_MONOTONIC, &ct->io_starttime) != 0) {
+			ct->io_starttime.tv_sec = 0;
+			ct->io_starttime.tv_nsec = 0;
+			return rc;
+		}
+		io_prep_pread(&ct->io, fd, ct->buf, ct->blksize, 0);
+		if (io_submit(ioctx, 1, ios) != 1) {
+			io_err_stat_log(5, "%s: io_submit error %i",
+					dev, errno);
+			return rc;
+		}
+		rc = 0;
+	}
+
+	return rc;
+}
+
+static void send_batch_async_ios(struct io_err_stat_path *pp)
+{
+	int i;
+	struct dio_ctx *ct;
+	struct timespec currtime, difftime;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &currtime) != 0)
+		return;
+	/*
+	 * Leave some spare time for all IOs to complete or time out
+	 */
+	if (pp->start_time.tv_sec != 0) {
+		timespecsub(&currtime, &pp->start_time, &difftime);
+		if (difftime.tv_sec + IOTIMEOUT_SEC >= pp->total_time)
+			return;
+	}
+
+	for (i = 0; i < CONCUR_NR_EVENT; i++) {
+		ct = pp->dio_ctx_array + i;
+		if (!send_each_async_io(ct, pp->fd, pp->devname))
+			pp->io_nr++;
+	}
+	if (pp->start_time.tv_sec == 0 && pp->start_time.tv_nsec == 0 &&
+		clock_gettime(CLOCK_MONOTONIC, &pp->start_time)) {
+		pp->start_time.tv_sec = 0;
+		pp->start_time.tv_nsec = 0;
+	}
+}
+
+static int try_to_cancel_timeout_io(struct dio_ctx *ct, struct timespec *t,
+		char *dev)
+{
+	struct timespec	difftime;
+	struct io_event	event;
+	int		rc = PATH_UNCHECKED;
+	int		r;
+
+	if (ct->io_starttime.tv_sec == 0)
+		return rc;
+	timespecsub(t, &ct->io_starttime, &difftime);
+	if (difftime.tv_sec > IOTIMEOUT_SEC) {
+		struct iocb *ios[1] = { &ct->io };
+
+		io_err_stat_log(5, "%s: abort check on timeout", dev);
+		r = io_cancel(ioctx, ios[0], &event);
+		if (r)
+			io_err_stat_log(5, "%s: io_cancel error %i",
+					dev, errno);
+		ct->io_starttime.tv_sec = 0;
+		ct->io_starttime.tv_nsec = 0;
+		rc = PATH_TIMEOUT;
+	} else {
+		rc = PATH_PENDING;
+	}
+
+	return rc;
+}
+
+static void poll_async_io_timeout(void)
+{
+	struct io_err_stat_path *pp;
+	struct timespec curr_time;
+	int		rc = PATH_UNCHECKED;
+	int		i, j;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
+		return;
+	vector_foreach_slot(paths->pathvec, pp, i) {
+		for (j = 0; j < CONCUR_NR_EVENT; j++) {
+			rc = try_to_cancel_timeout_io(pp->dio_ctx_array + j,
+					&curr_time, pp->devname);
+			account_async_io_state(pp, rc);
+		}
+	}
+}
+
+static void cancel_inflight_io(struct io_err_stat_path *pp)
+{
+	struct io_event event;
+	int i, r;
+
+	for (i = 0; i < CONCUR_NR_EVENT; i++) {
+		struct dio_ctx *ct = pp->dio_ctx_array + i;
+		struct iocb *ios[1] = { &ct->io };
+
+		if (ct->io_starttime.tv_sec == 0
+				&& ct->io_starttime.tv_nsec == 0)
+			continue;
+		io_err_stat_log(5, "%s: abort inflight io",
+				pp->devname);
+		r = io_cancel(ioctx, ios[0], &event);
+		if (r)
+			io_err_stat_log(5, "%s: io_cancel error %d, %i",
+					pp->devname, r, errno);
+		ct->io_starttime.tv_sec = 0;
+		ct->io_starttime.tv_nsec = 0;
+	}
+}
+
+static inline int handle_done_dio_ctx(struct dio_ctx *ct, struct io_event *ev)
+{
+	ct->io_starttime.tv_sec = 0;
+	ct->io_starttime.tv_nsec = 0;
+	return (ev->res == ct->blksize) ? PATH_UP : PATH_DOWN;
+}
+
+static void handle_async_io_done_event(struct io_event *io_evt)
+{
+	struct io_err_stat_path *pp;
+	struct dio_ctx *ct;
+	int rc = PATH_UNCHECKED;
+	int i, j;
+
+	vector_foreach_slot(paths->pathvec, pp, i) {
+		for (j = 0; j < CONCUR_NR_EVENT; j++) {
+			ct = pp->dio_ctx_array + j;
+			if (&ct->io == io_evt->obj) {
+				rc = handle_done_dio_ctx(ct, io_evt);
+				account_async_io_state(pp, rc);
+				return;
+			}
+		}
+	}
+}
+
+static void process_async_ios_event(int timeout_nsecs, char *dev)
+{
+	struct io_event events[CONCUR_NR_EVENT];
+	int		i, n;
+	struct timespec	timeout = { .tv_nsec = timeout_nsecs };
+
+	errno = 0;
+	n = io_getevents(ioctx, 1L, CONCUR_NR_EVENT, events, &timeout);
+	if (n < 0) {
+		io_err_stat_log(3, "%s: async io events returned %d (errno=%s)",
+				dev, n, strerror(errno));
+	} else {
+		for (i = 0; i < n; i++)
+			handle_async_io_done_event(&events[i]);
+	}
+}
+
+static void service_paths(void)
+{
+	struct io_err_stat_path *pp;
+	int i;
+
+	pthread_mutex_lock(&paths->mutex);
+	vector_foreach_slot(paths->pathvec, pp, i) {
+		send_batch_async_ios(pp);
+		process_async_ios_event(TIMEOUT_NO_IO_NSEC, pp->devname);
+		poll_async_io_timeout();
+		poll_io_err_stat(vecs, pp);
+	}
+	pthread_mutex_unlock(&paths->mutex);
+}
+
+static void *io_err_stat_loop(void *data)
+{
+	vecs = (struct vectors *)data;
+	pthread_cleanup_push(rcu_unregister, NULL);
+	rcu_register_thread();
+
+	mlockall(MCL_CURRENT | MCL_FUTURE);
+	while (1) {
+		service_paths();
+		usleep(100000);
+	}
+
+	pthread_cleanup_pop(1);
+	return NULL;
+}
+
+int start_io_err_stat_thread(void *data)
+{
+	if (io_setup(CONCUR_NR_EVENT, &ioctx) != 0) {
+		io_err_stat_log(4, "io_setup failed");
+		return 1;
+	}
+	paths = alloc_pathvec();
+	if (!paths)
+		goto destroy_ctx;
+
+	if (pthread_create(&io_err_stat_thr, &io_err_stat_attr,
+				io_err_stat_loop, data)) {
+		io_err_stat_log(0, "cannot create io_error statistic thread");
+		goto out_free;
+	}
+	io_err_stat_log(3, "thread started");
+	return 0;
+
+out_free:
+	free_io_err_pathvec(paths);
+destroy_ctx:
+	io_destroy(ioctx);
+	io_err_stat_log(0, "failed to start io_error statistic thread");
+	return 1;
+}
+
+void stop_io_err_stat_thread(void)
+{
+	pthread_cancel(io_err_stat_thr);
+	pthread_kill(io_err_stat_thr, SIGUSR2);
+	free_io_err_pathvec(paths);
+	io_destroy(ioctx);
+}
diff --git a/libmultipath/io_err_stat.h b/libmultipath/io_err_stat.h
new file mode 100644
index 00000000..bbf31b4f
--- /dev/null
+++ b/libmultipath/io_err_stat.h
@@ -0,0 +1,15 @@
+#ifndef _IO_ERR_STAT_H
+#define _IO_ERR_STAT_H
+
+#include "vector.h"
+#include "lock.h"
+
+
+extern pthread_attr_t io_err_stat_attr;
+
+int start_io_err_stat_thread(void *data);
+void stop_io_err_stat_thread(void);
+int io_err_stat_handle_pathfail(struct path *path);
+int hit_io_err_recheck_time(struct path *pp);
+
+#endif /* _IO_ERR_STAT_H */
diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index 175fbe11..9d2c3c09 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -731,6 +731,7 @@ out:
 	return 0;
 
 }
+
 int select_san_path_err_threshold(struct config *conf, struct multipath *mp)
 {
 	char *origin, buff[12];
@@ -761,6 +762,7 @@ out:
 	return 0;
 
 }
+
 int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp)
 {
 	char *origin, buff[12];
@@ -776,6 +778,57 @@ out:
 	return 0;
 
 }
+
+int select_path_io_err_sample_time(struct config *conf, struct multipath *mp)
+{
+	char *origin, buff[12];
+
+	mp_set_mpe(path_io_err_sample_time);
+	mp_set_ovr(path_io_err_sample_time);
+	mp_set_hwe(path_io_err_sample_time);
+	mp_set_conf(path_io_err_sample_time);
+	mp_set_default(path_io_err_sample_time, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, &mp->path_io_err_sample_time);
+	condlog(3, "%s: path_io_err_sample_time = %s %s", mp->alias, buff,
+			origin);
+	return 0;
+}
+
+int select_path_io_err_rate_threshold(struct config *conf, struct multipath *mp)
+{
+	char *origin, buff[12];
+
+	mp_set_mpe(path_io_err_rate_threshold);
+	mp_set_ovr(path_io_err_rate_threshold);
+	mp_set_hwe(path_io_err_rate_threshold);
+	mp_set_conf(path_io_err_rate_threshold);
+	mp_set_default(path_io_err_rate_threshold, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, &mp->path_io_err_rate_threshold);
+	condlog(3, "%s: path_io_err_rate_threshold = %s %s", mp->alias, buff,
+			origin);
+	return 0;
+
+}
+
+int select_path_io_err_recovery_time(struct config *conf, struct multipath *mp)
+{
+	char *origin, buff[12];
+
+	mp_set_mpe(path_io_err_recovery_time);
+	mp_set_ovr(path_io_err_recovery_time);
+	mp_set_hwe(path_io_err_recovery_time);
+	mp_set_conf(path_io_err_recovery_time);
+	mp_set_default(path_io_err_recovery_time, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, &mp->path_io_err_recovery_time);
+	condlog(3, "%s: path_io_err_recovery_time = %s %s", mp->alias, buff,
+			origin);
+	return 0;
+
+}
+
 int select_skip_kpartx (struct config *conf, struct multipath * mp)
 {
 	char *origin;
diff --git a/libmultipath/propsel.h b/libmultipath/propsel.h
index f8e96d85..1b2b5714 100644
--- a/libmultipath/propsel.h
+++ b/libmultipath/propsel.h
@@ -28,6 +28,9 @@ int select_max_sectors_kb (struct config *conf, struct multipath * mp);
 int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp);
 int select_san_path_err_threshold(struct config *conf, struct multipath *mp);
 int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp);
+int select_path_io_err_sample_time(struct config *conf, struct multipath *mp);
+int select_path_io_err_rate_threshold(struct config *conf, struct multipath *mp);
+int select_path_io_err_recovery_time(struct config *conf, struct multipath *mp);
 void reconcile_features_with_options(const char *id, char **features,
 				     int* no_path_retry,
 				     int *retain_hwhandler);
diff --git a/libmultipath/structs.h b/libmultipath/structs.h
index 8ea984d9..1ab8cb9b 100644
--- a/libmultipath/structs.h
+++ b/libmultipath/structs.h
@@ -235,6 +235,10 @@ struct path {
 	time_t dis_reinstate_time;
 	int disable_reinstate;
 	int san_path_err_forget_rate;
+	time_t io_err_dis_reinstate_time;
+	int io_err_disable_reinstate;
+	int io_err_pathfail_cnt;
+	time_t io_err_pathfail_starttime;
 	/* configlet pointers */
 	struct hwentry * hwe;
 };
@@ -269,6 +273,9 @@ struct multipath {
 	int san_path_err_threshold;
 	int san_path_err_forget_rate;
 	int san_path_err_recovery_time;
+	int path_io_err_sample_time;
+	int path_io_err_rate_threshold;
+	int path_io_err_recovery_time;
 	int skip_kpartx;
 	int max_sectors_kb;
 	int force_readonly;
diff --git a/libmultipath/uevent.c b/libmultipath/uevent.c
index eb44da56..e74e3dad 100644
--- a/libmultipath/uevent.c
+++ b/libmultipath/uevent.c
@@ -913,3 +913,35 @@ char *uevent_get_dm_name(struct uevent *uev)
 	}
 	return p;
 }
+
+char *uevent_get_dm_path(struct uevent *uev)
+{
+	char *p = NULL;
+	int i;
+
+	for (i = 0; uev->envp[i] != NULL; i++) {
+		if (!strncmp(uev->envp[i], "DM_PATH", 7) &&
+		    strlen(uev->envp[i]) > 8) {
+			p = MALLOC(strlen(uev->envp[i] + 8) + 1);
+			strcpy(p, uev->envp[i] + 8);
+			break;
+		}
+	}
+	return p;
+}
+
+char *uevent_get_dm_action(struct uevent *uev)
+{
+	char *p = NULL;
+	int i;
+
+	for (i = 0; uev->envp[i] != NULL; i++) {
+		if (!strncmp(uev->envp[i], "DM_ACTION", 9) &&
+		    strlen(uev->envp[i]) > 10) {
+			p = MALLOC(strlen(uev->envp[i] + 10) + 1);
+			strcpy(p, uev->envp[i] + 10);
+			break;
+		}
+	}
+	return p;
+}
diff --git a/libmultipath/uevent.h b/libmultipath/uevent.h
index 61a42071..6f5af0af 100644
--- a/libmultipath/uevent.h
+++ b/libmultipath/uevent.h
@@ -37,5 +37,7 @@ int uevent_get_major(struct uevent *uev);
 int uevent_get_minor(struct uevent *uev);
 int uevent_get_disk_ro(struct uevent *uev);
 char *uevent_get_dm_name(struct uevent *uev);
+char *uevent_get_dm_path(struct uevent *uev);
+char *uevent_get_dm_action(struct uevent *uev);
 
 #endif /* _UEVENT_H */
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index d9ac279f..f49ede66 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -849,6 +849,53 @@ The default is: \fBno\fR
 .
 .
 .TP
+.B path_io_err_sample_time
+One of the three parameters supporting path checking based on IO error
+accounting, for example for intermittent errors. If it is set to a value no
+less than 120, then whenever a path-failed event occurs twice in 60 seconds
+due to an IO error, multipathd will fail the path and enqueue it into a queue
+whose members are sent continuous direct-read AIOs at a fixed rate of 10 Hz.
+The IO accounting process for a path lasts for \fIpath_io_err_sample_time\fR
+seconds. If the rate of IO errors on a particular path is greater than
+\fIpath_io_err_rate_threshold\fR, the path will not be reinstated for
+\fIpath_io_err_recovery_time\fR seconds unless there is only one active path.
+After \fIpath_io_err_recovery_time\fR expires, the path will be requeued for
+checking. If the checking result is good enough, the path will be reinstated.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
+.B path_io_err_rate_threshold
+The IO error rate threshold as a permillage (1/1000). One of the three
+parameters supporting path checking based on IO error accounting, for
+example for intermittent errors. Refer to \fIpath_io_err_sample_time\fR.
+If the rate of IO errors on a particular path is greater than this
+parameter, the path will not be reinstated for
+\fIpath_io_err_recovery_time\fR seconds unless there is only one active path.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
+.B path_io_err_recovery_time
+One of the three parameters supporting path checking based on IO error
+accounting, for example for intermittent errors. Refer to
+\fIpath_io_err_sample_time\fR. If this parameter is set to a positive value,
+a path with too many IO errors will not be reinstated until
+\fIpath_io_err_recovery_time\fR seconds have passed. After that time expires,
+the path will be requeued for checking and reinstated if the result is good.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
 .B delay_watch_checks
 If set to a value greater than 0, multipathd will watch paths that have
 recently become valid for this many checks. If they fail again while they are
@@ -1119,6 +1166,12 @@ are taken from the \fIdefaults\fR or \fIdevices\fR section:
 .TP
 .B san_path_err_recovery_time
 .TP
+.B path_io_err_sample_time
+.TP
+.B path_io_err_rate_threshold
+.TP
+.B path_io_err_recovery_time
+.TP
 .B delay_watch_checks
 .TP
 .B delay_wait_checks
@@ -1246,6 +1299,12 @@ section:
 .TP
 .B san_path_err_recovery_time
 .TP
+.B path_io_err_sample_time
+.TP
+.B path_io_err_rate_threshold
+.TP
+.B path_io_err_recovery_time
+.TP
 .B delay_watch_checks
 .TP
 .B delay_wait_checks
@@ -1318,6 +1377,12 @@ the values are taken from the \fIdevices\fR or \fIdefaults\fR sections:
 .TP
 .B san_path_err_recovery_time
 .TP
+.B path_io_err_sample_time
+.TP
+.B path_io_err_rate_threshold
+.TP
+.B path_io_err_recovery_time
+.TP
 .B delay_watch_checks
 .TP
 .B delay_wait_checks
diff --git a/multipathd/main.c b/multipathd/main.c
index 4be2c579..38158006 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -84,6 +84,7 @@ int uxsock_timeout;
 #include "cli_handlers.h"
 #include "lock.h"
 #include "waiter.h"
+#include "io_err_stat.h"
 #include "wwids.h"
 #include "../third-party/valgrind/drd.h"
 
@@ -1050,6 +1051,41 @@ out:
 }
 
 static int
+uev_pathfail_check(struct uevent *uev, struct vectors *vecs)
+{
+	char *action = NULL, *devt = NULL;
+	struct path *pp;
+	int r;
+
+	action = uevent_get_dm_action(uev);
+	if (!action)
+		return 1;
+	if (strncmp(action, "PATH_FAILED", 11))
+		goto out;
+	devt = uevent_get_dm_path(uev);
+	if (!devt) {
+		condlog(3, "%s: No DM_PATH in uevent", uev->kernel);
+		goto out;
+	}
+
+	pthread_cleanup_push(cleanup_lock, &vecs->lock);
+	lock(&vecs->lock);
+	pthread_testcancel();
+	pp = find_path_by_devt(vecs->pathvec, devt);
+	r = pp ? io_err_stat_handle_pathfail(pp) : 1;
+	lock_cleanup_pop(vecs->lock);
+
+	if (r)
+		condlog(3, "io_err_stat: failed to enqueue %s", devt);
+	FREE(devt);
+	FREE(action);
+	return 0;
+out:
+	FREE(action);
+	return 1;
+}
+
+static int
 map_discovery (struct vectors * vecs)
 {
 	struct multipath * mpp;
@@ -1134,6 +1170,7 @@ uev_trigger (struct uevent * uev, void * trigger_data)
 	if (!strncmp(uev->kernel, "dm-", 3)) {
 		if (!strncmp(uev->action, "change", 6)) {
 			r = uev_add_map(uev, vecs);
+			uev_pathfail_check(uev, vecs);
 			goto out;
 		}
 		if (!strncmp(uev->action, "remove", 6)) {
@@ -1553,6 +1590,7 @@ static int check_path_reinstate_state(struct path * pp) {
 		condlog(2, "%s : hit error threshold. Delaying path reinstatement", pp->dev);
 		pp->dis_reinstate_time = curr_time.tv_sec;
 		pp->disable_reinstate = 1;
+
 		return 1;
 	} else {
 		return 0;
@@ -1684,6 +1722,16 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 		return 1;
 	}
 
+	if (pp->io_err_disable_reinstate && hit_io_err_recheck_time(pp)) {
+		pp->state = PATH_DELAYED;
+		/*
+	 * Reschedule as soon as possible so that this path can
+	 * be recovered in time.
+		 */
+		pp->tick = 1;
+		return 1;
+	}
+
 	if ((newstate == PATH_UP || newstate == PATH_GHOST) &&
 	     pp->wait_checks > 0) {
 		if (pp->mpp->nr_active > 0) {
@@ -2377,6 +2425,7 @@ child (void * param)
 	setup_thread_attr(&misc_attr, 64 * 1024, 0);
 	setup_thread_attr(&uevent_attr, DEFAULT_UEVENT_STACKSIZE * 1024, 0);
 	setup_thread_attr(&waiter_attr, 32 * 1024, 1);
+	setup_thread_attr(&io_err_stat_attr, 32 * 1024, 1);
 
 	if (logsink == 1) {
 		setup_thread_attr(&log_attr, 64 * 1024, 0);
@@ -2499,6 +2548,10 @@ child (void * param)
 	/*
 	 * start threads
 	 */
+	rc = start_io_err_stat_thread(vecs);
+	if (rc)
+		goto failed;
+
 	if ((rc = pthread_create(&check_thr, &misc_attr, checkerloop, vecs))) {
 		condlog(0,"failed to create checker loop thread: %d", rc);
 		goto failed;
@@ -2548,6 +2601,8 @@ child (void * param)
 	remove_maps_and_stop_waiters(vecs);
 	unlock(&vecs->lock);
 
+	stop_io_err_stat_thread();
+
 	pthread_cancel(check_thr);
 	pthread_cancel(uevent_thr);
 	pthread_cancel(uxlsnr_thr);
@@ -2593,6 +2648,7 @@ child (void * param)
 	udev_unref(udev);
 	udev = NULL;
 	pthread_attr_destroy(&waiter_attr);
+	pthread_attr_destroy(&io_err_stat_attr);
 #ifdef _DEBUG_
 	dbg_free_final(NULL);
 #endif
-- 
2.11.1
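[Editor's note] As a back-of-the-envelope illustration of the accounting the man-page text above describes (direct-read AIOs at a fixed 10 Hz for path_io_err_sample_time seconds, then the observed error rate in permillage compared against path_io_err_rate_threshold), the decision can be sketched as follows. All helper names here are hypothetical, not part of the patch's API, and the asynchronous IO plumbing is deliberately omitted:

```c
#include <assert.h>

/* Hypothetical sketch of the error-rate decision only; the real patch
 * drives asynchronous direct reads with timeouts, omitted here. */

#define SAMPLE_RATE_HZ 10	/* fixed AIO sample rate from the patch */

/* IOs issued during one sampling window of sample_time_secs seconds */
static long ios_in_window(int sample_time_secs)
{
	return (long)sample_time_secs * SAMPLE_RATE_HZ;
}

/* observed error rate as a permillage (1/1000), integer arithmetic */
static int io_err_rate_permillage(long nr_errors, long nr_ios)
{
	if (nr_ios <= 0)
		return 0;
	return (int)(nr_errors * 1000 / nr_ios);
}

/* nonzero => keep the path failed for path_io_err_recovery_time secs */
static int exceeds_rate_threshold(long nr_errors, long nr_ios,
				  int rate_threshold_permillage)
{
	return io_err_rate_permillage(nr_errors, nr_ios) >
		rate_threshold_permillage;
}
```

For example, with a path_io_err_sample_time of 120 seconds, 1200 IOs are issued; 3 failed IOs give a rate of 2 permillage, so a threshold of 1 would delay reinstatement.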

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH V4 2/2] multipath-tools: discard san_path_err_XXX feature
  2017-09-17  3:40 [PATCH V4 0/2] multipath-tools: intermittent IO error accounting to improve reliability Guan Junxiong
  2017-09-17  3:40 ` [PATCH V4 1/2] " Guan Junxiong
@ 2017-09-17  3:40 ` Guan Junxiong
  1 sibling, 0 replies; 17+ messages in thread
From: Guan Junxiong @ 2017-09-17  3:40 UTC (permalink / raw)
  To: dm-devel, christophe.varoqui, mwilck
  Cc: guanjunxiong, chengjike.cheng, mmandala, niuhaoxin, shenhong09

Even when san_path_err_threshold, san_path_err_forget_rate and
san_path_err_recovery_time are turned on, the sample interval of the
path checkers is so coarse that it does not see what happens in the
middle of the sample interval.

The previous patch introduced a new method of detecting the path state
from accounted IO errors, especially intermittent IO errors.

Therefore, discard the original commit "c3705a12b893cc302a89587c4d37".

Cc: M Muneendra Kumar <mmandala@brocade.com>
Signed-off-by: Junxiong Guan <guanjunxiong@huawei.com>
---
 libmultipath/config.c      |  3 --
 libmultipath/config.h      |  9 -----
 libmultipath/configure.c   |  3 --
 libmultipath/dict.c        | 39 ---------------------
 libmultipath/propsel.c     | 47 --------------------------
 libmultipath/propsel.h     |  3 --
 libmultipath/structs.h     |  7 ----
 multipath/multipath.conf.5 | 57 -------------------------------
 multipathd/main.c          | 84 ----------------------------------------------
 9 files changed, 252 deletions(-)

diff --git a/libmultipath/config.c b/libmultipath/config.c
index b21a3aa1..d85ba7cc 100644
--- a/libmultipath/config.c
+++ b/libmultipath/config.c
@@ -351,9 +351,6 @@ merge_hwe (struct hwentry * dst, struct hwentry * src)
 	merge_num(delay_wait_checks);
 	merge_num(skip_kpartx);
 	merge_num(max_sectors_kb);
-	merge_num(san_path_err_threshold);
-	merge_num(san_path_err_forget_rate);
-	merge_num(san_path_err_recovery_time);
 
 	snprintf(id, sizeof(id), "%s/%s", dst->vendor, dst->product);
 	reconcile_features_with_options(id, &dst->features,
diff --git a/libmultipath/config.h b/libmultipath/config.h
index 215d29e9..4aa944d0 100644
--- a/libmultipath/config.h
+++ b/libmultipath/config.h
@@ -72,9 +72,6 @@ struct hwentry {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
-	int san_path_err_threshold;
-	int san_path_err_forget_rate;
-	int san_path_err_recovery_time;
 	int path_io_err_sample_time;
 	int path_io_err_rate_threshold;
 	int path_io_err_recovery_time;
@@ -106,9 +103,6 @@ struct mpentry {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
-	int san_path_err_threshold;
-	int san_path_err_forget_rate;
-	int san_path_err_recovery_time;
 	int path_io_err_sample_time;
 	int path_io_err_rate_threshold;
 	int path_io_err_recovery_time;
@@ -158,9 +152,6 @@ struct config {
 	int processed_main_config;
 	int delay_watch_checks;
 	int delay_wait_checks;
-	int san_path_err_threshold;
-	int san_path_err_forget_rate;
-	int san_path_err_recovery_time;
 	int path_io_err_sample_time;
 	int path_io_err_rate_threshold;
 	int path_io_err_recovery_time;
diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 81dc97d9..44e61864 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -295,9 +295,6 @@ int setup_map(struct multipath *mpp, char *params, int params_size)
 	select_deferred_remove(conf, mpp);
 	select_delay_watch_checks(conf, mpp);
 	select_delay_wait_checks(conf, mpp);
-	select_san_path_err_threshold(conf, mpp);
-	select_san_path_err_forget_rate(conf, mpp);
-	select_san_path_err_recovery_time(conf, mpp);
 	select_path_io_err_sample_time(conf, mpp);
 	select_path_io_err_rate_threshold(conf, mpp);
 	select_path_io_err_recovery_time(conf, mpp);
diff --git a/libmultipath/dict.c b/libmultipath/dict.c
index 18b1fdb1..d42c7ba9 100644
--- a/libmultipath/dict.c
+++ b/libmultipath/dict.c
@@ -1081,33 +1081,6 @@ declare_hw_handler(delay_wait_checks, set_off_int_undef)
 declare_hw_snprint(delay_wait_checks, print_off_int_undef)
 declare_mp_handler(delay_wait_checks, set_off_int_undef)
 declare_mp_snprint(delay_wait_checks, print_off_int_undef)
-declare_def_handler(san_path_err_threshold, set_off_int_undef)
-declare_def_snprint_defint(san_path_err_threshold, print_off_int_undef,
-			   DEFAULT_ERR_CHECKS)
-declare_ovr_handler(san_path_err_threshold, set_off_int_undef)
-declare_ovr_snprint(san_path_err_threshold, print_off_int_undef)
-declare_hw_handler(san_path_err_threshold, set_off_int_undef)
-declare_hw_snprint(san_path_err_threshold, print_off_int_undef)
-declare_mp_handler(san_path_err_threshold, set_off_int_undef)
-declare_mp_snprint(san_path_err_threshold, print_off_int_undef)
-declare_def_handler(san_path_err_forget_rate, set_off_int_undef)
-declare_def_snprint_defint(san_path_err_forget_rate, print_off_int_undef,
-			   DEFAULT_ERR_CHECKS)
-declare_ovr_handler(san_path_err_forget_rate, set_off_int_undef)
-declare_ovr_snprint(san_path_err_forget_rate, print_off_int_undef)
-declare_hw_handler(san_path_err_forget_rate, set_off_int_undef)
-declare_hw_snprint(san_path_err_forget_rate, print_off_int_undef)
-declare_mp_handler(san_path_err_forget_rate, set_off_int_undef)
-declare_mp_snprint(san_path_err_forget_rate, print_off_int_undef)
-declare_def_handler(san_path_err_recovery_time, set_off_int_undef)
-declare_def_snprint_defint(san_path_err_recovery_time, print_off_int_undef,
-			   DEFAULT_ERR_CHECKS)
-declare_ovr_handler(san_path_err_recovery_time, set_off_int_undef)
-declare_ovr_snprint(san_path_err_recovery_time, print_off_int_undef)
-declare_hw_handler(san_path_err_recovery_time, set_off_int_undef)
-declare_hw_snprint(san_path_err_recovery_time, print_off_int_undef)
-declare_mp_handler(san_path_err_recovery_time, set_off_int_undef)
-declare_mp_snprint(san_path_err_recovery_time, print_off_int_undef)
 declare_def_handler(path_io_err_sample_time, set_off_int_undef)
 declare_def_snprint_defint(path_io_err_sample_time, print_off_int_undef,
 			   DEFAULT_ERR_CHECKS)
@@ -1469,9 +1442,6 @@ init_keywords(vector keywords)
 	install_keyword("config_dir", &def_config_dir_handler, &snprint_def_config_dir);
 	install_keyword("delay_watch_checks", &def_delay_watch_checks_handler, &snprint_def_delay_watch_checks);
 	install_keyword("delay_wait_checks", &def_delay_wait_checks_handler, &snprint_def_delay_wait_checks);
-	install_keyword("san_path_err_threshold", &def_san_path_err_threshold_handler, &snprint_def_san_path_err_threshold);
-	install_keyword("san_path_err_forget_rate", &def_san_path_err_forget_rate_handler, &snprint_def_san_path_err_forget_rate);
-	install_keyword("san_path_err_recovery_time", &def_san_path_err_recovery_time_handler, &snprint_def_san_path_err_recovery_time);
 	install_keyword("path_io_err_sample_time", &def_path_io_err_sample_time_handler, &snprint_def_path_io_err_sample_time);
 	install_keyword("path_io_err_rate_threshold", &def_path_io_err_rate_threshold_handler, &snprint_def_path_io_err_rate_threshold);
 	install_keyword("path_io_err_recovery_time", &def_path_io_err_recovery_time_handler, &snprint_def_path_io_err_recovery_time);
@@ -1559,9 +1529,6 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &hw_deferred_remove_handler, &snprint_hw_deferred_remove);
 	install_keyword("delay_watch_checks", &hw_delay_watch_checks_handler, &snprint_hw_delay_watch_checks);
 	install_keyword("delay_wait_checks", &hw_delay_wait_checks_handler, &snprint_hw_delay_wait_checks);
-	install_keyword("san_path_err_threshold", &hw_san_path_err_threshold_handler, &snprint_hw_san_path_err_threshold);
-	install_keyword("san_path_err_forget_rate", &hw_san_path_err_forget_rate_handler, &snprint_hw_san_path_err_forget_rate);
-	install_keyword("san_path_err_recovery_time", &hw_san_path_err_recovery_time_handler, &snprint_hw_san_path_err_recovery_time);
 	install_keyword("path_io_err_sample_time", &hw_path_io_err_sample_time_handler, &snprint_hw_path_io_err_sample_time);
 	install_keyword("path_io_err_rate_threshold", &hw_path_io_err_rate_threshold_handler, &snprint_hw_path_io_err_rate_threshold);
 	install_keyword("path_io_err_recovery_time", &hw_path_io_err_recovery_time_handler, &snprint_hw_path_io_err_recovery_time);
@@ -1595,9 +1562,6 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &ovr_deferred_remove_handler, &snprint_ovr_deferred_remove);
 	install_keyword("delay_watch_checks", &ovr_delay_watch_checks_handler, &snprint_ovr_delay_watch_checks);
 	install_keyword("delay_wait_checks", &ovr_delay_wait_checks_handler, &snprint_ovr_delay_wait_checks);
-	install_keyword("san_path_err_threshold", &ovr_san_path_err_threshold_handler, &snprint_ovr_san_path_err_threshold);
-	install_keyword("san_path_err_forget_rate", &ovr_san_path_err_forget_rate_handler, &snprint_ovr_san_path_err_forget_rate);
-	install_keyword("san_path_err_recovery_time", &ovr_san_path_err_recovery_time_handler, &snprint_ovr_san_path_err_recovery_time);
 	install_keyword("path_io_err_sample_time", &ovr_path_io_err_sample_time_handler, &snprint_ovr_path_io_err_sample_time);
 	install_keyword("path_io_err_rate_threshold", &ovr_path_io_err_rate_threshold_handler, &snprint_ovr_path_io_err_rate_threshold);
 	install_keyword("path_io_err_recovery_time", &ovr_path_io_err_recovery_time_handler, &snprint_ovr_path_io_err_recovery_time);
@@ -1630,9 +1594,6 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &mp_deferred_remove_handler, &snprint_mp_deferred_remove);
 	install_keyword("delay_watch_checks", &mp_delay_watch_checks_handler, &snprint_mp_delay_watch_checks);
 	install_keyword("delay_wait_checks", &mp_delay_wait_checks_handler, &snprint_mp_delay_wait_checks);
-	install_keyword("san_path_err_threshold", &mp_san_path_err_threshold_handler, &snprint_mp_san_path_err_threshold);
-	install_keyword("san_path_err_forget_rate", &mp_san_path_err_forget_rate_handler, &snprint_mp_san_path_err_forget_rate);
-	install_keyword("san_path_err_recovery_time", &mp_san_path_err_recovery_time_handler, &snprint_mp_san_path_err_recovery_time);
 	install_keyword("path_io_err_sample_time", &mp_path_io_err_sample_time_handler, &snprint_mp_path_io_err_sample_time);
 	install_keyword("path_io_err_rate_threshold", &mp_path_io_err_rate_threshold_handler, &snprint_mp_path_io_err_rate_threshold);
 	install_keyword("path_io_err_recovery_time", &mp_path_io_err_recovery_time_handler, &snprint_mp_path_io_err_recovery_time);
diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index 9d2c3c09..9aab9805 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -732,53 +732,6 @@ out:
 
 }
 
-int select_san_path_err_threshold(struct config *conf, struct multipath *mp)
-{
-	char *origin, buff[12];
-
-	mp_set_mpe(san_path_err_threshold);
-	mp_set_ovr(san_path_err_threshold);
-	mp_set_hwe(san_path_err_threshold);
-	mp_set_conf(san_path_err_threshold);
-	mp_set_default(san_path_err_threshold, DEFAULT_ERR_CHECKS);
-out:
-	print_off_int_undef(buff, 12, &mp->san_path_err_threshold);
-	condlog(3, "%s: san_path_err_threshold = %s %s", mp->alias, buff, origin);
-	return 0;
-}
-
-int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp)
-{
-	char *origin, buff[12];
-
-	mp_set_mpe(san_path_err_forget_rate);
-	mp_set_ovr(san_path_err_forget_rate);
-	mp_set_hwe(san_path_err_forget_rate);
-	mp_set_conf(san_path_err_forget_rate);
-	mp_set_default(san_path_err_forget_rate, DEFAULT_ERR_CHECKS);
-out:
-	print_off_int_undef(buff, 12, &mp->san_path_err_forget_rate);
-	condlog(3, "%s: san_path_err_forget_rate = %s %s", mp->alias, buff, origin);
-	return 0;
-
-}
-
-int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp)
-{
-	char *origin, buff[12];
-
-	mp_set_mpe(san_path_err_recovery_time);
-	mp_set_ovr(san_path_err_recovery_time);
-	mp_set_hwe(san_path_err_recovery_time);
-	mp_set_conf(san_path_err_recovery_time);
-	mp_set_default(san_path_err_recovery_time, DEFAULT_ERR_CHECKS);
-out:
-	print_off_int_undef(buff, 12, &mp->san_path_err_recovery_time);
-	condlog(3, "%s: san_path_err_recovery_time = %s %s", mp->alias, buff, origin);
-	return 0;
-
-}
-
 int select_path_io_err_sample_time(struct config *conf, struct multipath *mp)
 {
 	char *origin, buff[12];
diff --git a/libmultipath/propsel.h b/libmultipath/propsel.h
index 1b2b5714..16a890eb 100644
--- a/libmultipath/propsel.h
+++ b/libmultipath/propsel.h
@@ -25,9 +25,6 @@ int select_delay_watch_checks (struct config *conf, struct multipath * mp);
 int select_delay_wait_checks (struct config *conf, struct multipath * mp);
 int select_skip_kpartx (struct config *conf, struct multipath * mp);
 int select_max_sectors_kb (struct config *conf, struct multipath * mp);
-int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp);
-int select_san_path_err_threshold(struct config *conf, struct multipath *mp);
-int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp);
 int select_path_io_err_sample_time(struct config *conf, struct multipath *mp);
 int select_path_io_err_rate_threshold(struct config *conf, struct multipath *mp);
 int select_path_io_err_recovery_time(struct config *conf, struct multipath *mp);
diff --git a/libmultipath/structs.h b/libmultipath/structs.h
index 1ab8cb9b..9da33d5a 100644
--- a/libmultipath/structs.h
+++ b/libmultipath/structs.h
@@ -231,10 +231,6 @@ struct path {
 	int initialized;
 	int retriggers;
 	int wwid_changed;
-	unsigned int path_failures;
-	time_t dis_reinstate_time;
-	int disable_reinstate;
-	int san_path_err_forget_rate;
 	time_t io_err_dis_reinstate_time;
 	int io_err_disable_reinstate;
 	int io_err_pathfail_cnt;
@@ -270,9 +266,6 @@ struct multipath {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
-	int san_path_err_threshold;
-	int san_path_err_forget_rate;
-	int san_path_err_recovery_time;
 	int path_io_err_sample_time;
 	int path_io_err_rate_threshold;
 	int path_io_err_recovery_time;
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index f49ede66..41f698c0 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -810,45 +810,6 @@ The default is: \fB/etc/multipath/conf.d/\fR
 .
 .
 .TP
-.B san_path_err_threshold
-If set to a value greater than 0, multipathd will watch paths and check how many
-times a path has been failed due to errors.If the number of failures on a particular
-path is greater then the san_path_err_threshold then the path will not  reinstante
-till san_path_err_recovery_time.These path failures should occur within a
-san_path_err_forget_rate checks, if not we will consider the path is good enough
-to reinstantate.
-.RS
-.TP
-The default is: \fBno\fR
-.RE
-.
-.
-.TP
-.B san_path_err_forget_rate
-If set to a value greater than 0, multipathd will check whether the path failures
-has exceeded  the san_path_err_threshold within this many checks i.e
-san_path_err_forget_rate . If so we will not reinstante the path till
-san_path_err_recovery_time.
-.RS
-.TP
-The default is: \fBno\fR
-.RE
-.
-.
-.TP
-.B san_path_err_recovery_time
-If set to a value greater than 0, multipathd will make sure that when path failures
-has exceeded the san_path_err_threshold within san_path_err_forget_rate then the path
-will be placed in failed state for san_path_err_recovery_time duration.Once san_path_err_recovery_time
-has timeout  we will reinstante the failed path .
-san_path_err_recovery_time value should be in secs.
-.RS
-.TP
-The default is: \fBno\fR
-.RE
-.
-.
-.TP
 .B path_io_err_sample_time
 One of the three parameters of supporting path check based on accounting IO
 error such as intermittent error. If it is set to a value no less than 120,
@@ -1160,12 +1121,6 @@ are taken from the \fIdefaults\fR or \fIdevices\fR section:
 .TP
 .B deferred_remove
 .TP
-.B san_path_err_threshold
-.TP
-.B san_path_err_forget_rate
-.TP
-.B san_path_err_recovery_time
-.TP
 .B path_io_err_sample_time
 .TP
 .B path_io_err_rate_threshold
@@ -1293,12 +1248,6 @@ section:
 .TP
 .B deferred_remove
 .TP
-.B san_path_err_threshold
-.TP
-.B san_path_err_forget_rate
-.TP
-.B san_path_err_recovery_time
-.TP
 .B path_io_err_sample_time
 .TP
 .B path_io_err_rate_threshold
@@ -1371,12 +1320,6 @@ the values are taken from the \fIdevices\fR or \fIdefaults\fR sections:
 .TP
 .B deferred_remove
 .TP
-.B san_path_err_threshold
-.TP
-.B san_path_err_forget_rate
-.TP
-.B san_path_err_recovery_time
-.TP
 .B path_io_err_sample_time
 .TP
 .B path_io_err_rate_threshold
diff --git a/multipathd/main.c b/multipathd/main.c
index 38158006..001a03e2 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -1525,84 +1525,6 @@ void repair_path(struct path * pp)
 	LOG_MSG(1, checker_message(&pp->checker));
 }
 
-static int check_path_reinstate_state(struct path * pp) {
-	struct timespec curr_time;
-	if (!((pp->mpp->san_path_err_threshold > 0) &&
-				(pp->mpp->san_path_err_forget_rate > 0) &&
-				(pp->mpp->san_path_err_recovery_time >0))) {
-		return 0;
-	}
-
-	if (pp->disable_reinstate) {
-		/* If we don't know how much time has passed, automatically
-		 * reinstate the path, just to be safe. Also, if there are
-		 * no other usable paths, reinstate the path
-		 */
-		if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0 ||
-				pp->mpp->nr_active == 0) {
-			condlog(2, "%s : reinstating path early", pp->dev);
-			goto reinstate_path;
-		}
-		if ((curr_time.tv_sec - pp->dis_reinstate_time ) > pp->mpp->san_path_err_recovery_time) {
-			condlog(2,"%s : reinstate the path after err recovery time", pp->dev);
-			goto reinstate_path;
-		}
-		return 1;
-	}
-	/* forget errors on a working path */
-	if ((pp->state == PATH_UP || pp->state == PATH_GHOST) &&
-			pp->path_failures > 0) {
-		if (pp->san_path_err_forget_rate > 0){
-			pp->san_path_err_forget_rate--;
-		} else {
-			/* for every san_path_err_forget_rate number of
-			 * successful path checks decrement path_failures by 1
-			 */
-			pp->path_failures--;
-			pp->san_path_err_forget_rate = pp->mpp->san_path_err_forget_rate;
-		}
-		return 0;
-	}
-
-	/* If the path isn't recovering from a failed state, do nothing */
-	if (pp->state != PATH_DOWN && pp->state != PATH_SHAKY &&
-			pp->state != PATH_TIMEOUT)
-		return 0;
-
-	if (pp->path_failures == 0)
-		pp->san_path_err_forget_rate = pp->mpp->san_path_err_forget_rate;
-
-	pp->path_failures++;
-
-	/* if we don't know the currently time, we don't know how long to
-	 * delay the path, so there's no point in checking if we should
-	 */
-
-	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
-		return 0;
-	/* when path failures has exceeded the san_path_err_threshold
-	 * place the path in delayed state till san_path_err_recovery_time
-	 * so that the cutomer can rectify the issue within this time. After
-	 * the completion of san_path_err_recovery_time it should
-	 * automatically reinstate the path
-	 */
-	if (pp->path_failures > pp->mpp->san_path_err_threshold) {
-		condlog(2, "%s : hit error threshold. Delaying path reinstatement", pp->dev);
-		pp->dis_reinstate_time = curr_time.tv_sec;
-		pp->disable_reinstate = 1;
-
-		return 1;
-	} else {
-		return 0;
-	}
-
-reinstate_path:
-	pp->path_failures = 0;
-	pp->disable_reinstate = 0;
-	pp->san_path_err_forget_rate = 0;
-	return 0;
-}
-
 /*
  * Returns '1' if the path has been checked, '-1' if it was blacklisted
  * and '0' otherwise
@@ -1716,12 +1638,6 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 	if (!pp->mpp)
 		return 0;
 
-	if ((newstate == PATH_UP || newstate == PATH_GHOST) &&
-			check_path_reinstate_state(pp)) {
-		pp->state = PATH_DELAYED;
-		return 1;
-	}
-
 	if (pp->io_err_disable_reinstate && hit_io_err_recheck_time(pp)) {
 		pp->state = PATH_DELAYED;
 		/*
-- 
2.11.1

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-17  3:40 ` [PATCH V4 1/2] " Guan Junxiong
@ 2017-09-18 12:53   ` Muneendra Kumar M
  2017-09-18 14:36     ` Guan Junxiong
  0 siblings, 1 reply; 17+ messages in thread
From: Muneendra Kumar M @ 2017-09-18 12:53 UTC (permalink / raw)
  To: Guan Junxiong, dm-devel, christophe.varoqui, mwilck
  Cc: chengjike.cheng, niuhaoxin, shenhong09

Hi Guan,
This is a good effort at detecting intermittent IO errors and accounting for them to improve reliability.
Your new algorithm is mutually exclusive with san_path_err_XXX.
It resolves the issue below, which you mentioned:
>>Even the san_path_err_threshold , san_path_err_forget_rate and san_path_err_recovery_time is turned on,
>>the detect sample interval of that path checkers is so big/coarse that it doesn't see what happens in the middle of the sample interval.

But I have some concerns.

Correct me if my understanding of the lines below is wrong:
>>On a particular path when a path failing events occur twice in 60 second due to an IO error, multipathd will fail the path and enqueue 
>>this path into a queue of which each member is sent a couple of continuous direct reading asynchronous io at a fixed sample rate of 10HZ. 

Once we hit the above condition (2 errors in 60 secs), we keep injecting asynchronous IOs at a fixed sample rate of 10 Hz for path_io_err_sample_time.
And during this path_io_err_sample_time, if we hit the path_io_err_rate_threshold, then we will not reinstate this path for path_io_err_recovery_time.
Is this understanding correct?

If the above understanding is correct, then my concern is:
1) On a particular path, if we are seeing continuous errors, but not within a 60-second window (maybe one every 120 secs), how do we handle this? This is still a shaky link.
This is what our customers are pointing out.
And if I am not wrong, the new algorithm only comes into play if path failing events occur twice in 60 seconds.

Then this will not solve the intermittent IO error issue we are seeing, as data still flows over the shaky path.
I think this is the place where we need to pull in san_path_err_forget_rate.

Our main intention in bringing in the san_path_err_XXX patch was: if the IO errors on a path exceed san_path_err_threshold within san_path_err_forget_rate, then
we are not supposed to reinstate the path for san_path_err_recovery_time.


path_io_err_sample_time should be a sub-window of san_path_err_forget_rate.
Even if the errors are not happening within a 60-second window, we still need to keep track of the number of errors, and if the error threshold is hit within san_path_err_forget_rate, then the path should not be reinstated for recover_time seconds.
With the combination of these two we can find the shaky path within path_io_err_sample_time / san_path_err_forget_rate.

Regards,
Muneendra.



-----Original Message-----
From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
Sent: Sunday, September 17, 2017 9:11 AM
To: dm-devel@redhat.com; christophe.varoqui@opensvc.com; mwilck@suse.com
Cc: Muneendra Kumar M <mmandala@Brocade.com>; shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com; guanjunxiong@huawei.com
Subject: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability

This patch adds a new method of path state checking based on accounting IO errors. This is useful in many scenarios, such as intermittent IO errors on a path due to network congestion, or a shaky link.

Three parameters are added for the admin: "path_io_err_sample_time", "path_io_err_rate_threshold" and "path_io_err_recovery_time".
If path_io_err_sample_time is set to no less than 120 and path_io_err_recovery_time is set to a value greater than 0, then when path failing events occur twice in 60 seconds due to IO errors, multipathd will fail the path and enqueue it into a queue in which each member is sent a stream of continuous direct-read asynchronous IOs at a fixed sample rate of 10 Hz. The IO accounting process for a path lasts for path_io_err_sample_time. If the IO error rate on a particular path is greater than path_io_err_rate_threshold, the path will not be reinstated for path_io_err_recovery_time seconds unless there is only one active path.

If path_io_err_recovery_time expires, we will reschedule this IO error checking process. If the path proves good enough, we will claim it good.

This helps us place the path in a delayed state if we hit a lot of intermittent IO errors on a particular path due to network/target issues, isolate such a degraded path, and allow the admin to rectify the errors on it.
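For illustration, the new knobs would be enabled with something like the following in multipath.conf (values are examples only, not recommendations):

```
defaults {
	path_io_err_sample_time    120
	path_io_err_rate_threshold 10
	path_io_err_recovery_time  60
}
```

Per the dict.c changes below, the same keywords are also accepted in the devices, overrides and multipaths sections.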

Signed-off-by: Junxiong Guan <guanjunxiong@huawei.com>
---
 libmultipath/Makefile      |   5 +-
 libmultipath/config.h      |   9 +
 libmultipath/configure.c   |   3 +
 libmultipath/dict.c        |  41 +++
 libmultipath/io_err_stat.c | 743 +++++++++++++++++++++++++++++++++++++++++++++
 libmultipath/io_err_stat.h |  15 +
 libmultipath/propsel.c     |  53 ++++
 libmultipath/propsel.h     |   3 +
 libmultipath/structs.h     |   7 +
 libmultipath/uevent.c      |  32 ++
 libmultipath/uevent.h      |   2 +
 multipath/multipath.conf.5 |  65 ++++
 multipathd/main.c          |  56 ++++
 13 files changed, 1032 insertions(+), 2 deletions(-)  create mode 100644 libmultipath/io_err_stat.c  create mode 100644 libmultipath/io_err_stat.h

diff --git a/libmultipath/Makefile b/libmultipath/Makefile
index b3244fc7..dce73afe 100644
--- a/libmultipath/Makefile
+++ b/libmultipath/Makefile
@@ -9,7 +9,7 @@ LIBS = $(DEVLIB).$(SONAME)
 
 CFLAGS += $(LIB_CFLAGS) -I$(mpathcmddir)
 
-LIBDEPS += -lpthread -ldl -ldevmapper -ludev -L$(mpathcmddir) -lmpathcmd -lurcu
-LIBDEPS += -lpthread -ldl -ldevmapper -ludev -L$(mpathcmddir) -lmpathcmd -lurcu
+LIBDEPS += -lpthread -ldl -ldevmapper -ludev -L$(mpathcmddir) -lmpathcmd -lurcu -laio
 
 ifdef SYSTEMD
 	CFLAGS += -DUSE_SYSTEMD=$(SYSTEMD)
@@ -42,7 +42,8 @@ OBJS = memory.o parser.o vector.o devmapper.o callout.o \
 	pgpolicies.o debug.o defaults.o uevent.o time-util.o \
 	switchgroup.o uxsock.o print.o alias.o log_pthread.o \
 	log.o configure.o structs_vec.o sysfs.o prio.o checkers.o \
-	lock.o waiter.o file.o wwids.o prioritizers/alua_rtpg.o
+	lock.o waiter.o file.o wwids.o prioritizers/alua_rtpg.o \
+	io_err_stat.o
 
 all: $(LIBS)
 
diff --git a/libmultipath/config.h b/libmultipath/config.h
index ffc69b5f..215d29e9 100644
--- a/libmultipath/config.h
+++ b/libmultipath/config.h
@@ -75,6 +75,9 @@ struct hwentry {
 	int san_path_err_threshold;
 	int san_path_err_forget_rate;
 	int san_path_err_recovery_time;
+	int path_io_err_sample_time;
+	int path_io_err_rate_threshold;
+	int path_io_err_recovery_time;
 	int skip_kpartx;
 	int max_sectors_kb;
 	char * bl_product;
@@ -106,6 +109,9 @@ struct mpentry {
 	int san_path_err_threshold;
 	int san_path_err_forget_rate;
 	int san_path_err_recovery_time;
+	int path_io_err_sample_time;
+	int path_io_err_rate_threshold;
+	int path_io_err_recovery_time;
 	int skip_kpartx;
 	int max_sectors_kb;
 	uid_t uid;
@@ -155,6 +161,9 @@ struct config {
 	int san_path_err_threshold;
 	int san_path_err_forget_rate;
 	int san_path_err_recovery_time;
+	int path_io_err_sample_time;
+	int path_io_err_rate_threshold;
+	int path_io_err_recovery_time;
 	int uxsock_timeout;
 	int strict_timing;
 	int retrigger_tries;
diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 74b6f52a..81dc97d9 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -298,6 +298,9 @@ int setup_map(struct multipath *mpp, char *params, int params_size)
 	select_san_path_err_threshold(conf, mpp);
 	select_san_path_err_forget_rate(conf, mpp);
 	select_san_path_err_recovery_time(conf, mpp);
+	select_path_io_err_sample_time(conf, mpp);
+	select_path_io_err_rate_threshold(conf, mpp);
+	select_path_io_err_recovery_time(conf, mpp);
 	select_skip_kpartx(conf, mpp);
 	select_max_sectors_kb(conf, mpp);
 
diff --git a/libmultipath/dict.c b/libmultipath/dict.c
index 9dc10904..18b1fdb1 100644
--- a/libmultipath/dict.c
+++ b/libmultipath/dict.c
@@ -1108,6 +1108,35 @@ declare_hw_handler(san_path_err_recovery_time, set_off_int_undef)
 declare_hw_snprint(san_path_err_recovery_time, print_off_int_undef)
 declare_mp_handler(san_path_err_recovery_time, set_off_int_undef)
 declare_mp_snprint(san_path_err_recovery_time, print_off_int_undef)
+declare_def_handler(path_io_err_sample_time, set_off_int_undef) 
+declare_def_snprint_defint(path_io_err_sample_time, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(path_io_err_sample_time, set_off_int_undef) 
+declare_ovr_snprint(path_io_err_sample_time, print_off_int_undef) 
+declare_hw_handler(path_io_err_sample_time, set_off_int_undef) 
+declare_hw_snprint(path_io_err_sample_time, print_off_int_undef) 
+declare_mp_handler(path_io_err_sample_time, set_off_int_undef) 
+declare_mp_snprint(path_io_err_sample_time, print_off_int_undef) 
+declare_def_handler(path_io_err_rate_threshold, set_off_int_undef) 
+declare_def_snprint_defint(path_io_err_rate_threshold, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(path_io_err_rate_threshold, set_off_int_undef) 
+declare_ovr_snprint(path_io_err_rate_threshold, print_off_int_undef) 
+declare_hw_handler(path_io_err_rate_threshold, set_off_int_undef) 
+declare_hw_snprint(path_io_err_rate_threshold, print_off_int_undef) 
+declare_mp_handler(path_io_err_rate_threshold, set_off_int_undef) 
+declare_mp_snprint(path_io_err_rate_threshold, print_off_int_undef) 
+declare_def_handler(path_io_err_recovery_time, set_off_int_undef) 
+declare_def_snprint_defint(path_io_err_recovery_time, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(path_io_err_recovery_time, set_off_int_undef) 
+declare_ovr_snprint(path_io_err_recovery_time, print_off_int_undef) 
+declare_hw_handler(path_io_err_recovery_time, set_off_int_undef) 
+declare_hw_snprint(path_io_err_recovery_time, print_off_int_undef) 
+declare_mp_handler(path_io_err_recovery_time, set_off_int_undef) 
+declare_mp_snprint(path_io_err_recovery_time, print_off_int_undef)
+
+
 static int
 def_uxsock_timeout_handler(struct config *conf, vector strvec)
 {
@@ -1443,6 +1472,9 @@ init_keywords(vector keywords)
 	install_keyword("san_path_err_threshold", &def_san_path_err_threshold_handler, &snprint_def_san_path_err_threshold);
 	install_keyword("san_path_err_forget_rate", &def_san_path_err_forget_rate_handler, &snprint_def_san_path_err_forget_rate);
 	install_keyword("san_path_err_recovery_time", &def_san_path_err_recovery_time_handler, &snprint_def_san_path_err_recovery_time);
+	install_keyword("path_io_err_sample_time", &def_path_io_err_sample_time_handler, &snprint_def_path_io_err_sample_time);
+	install_keyword("path_io_err_rate_threshold", &def_path_io_err_rate_threshold_handler, &snprint_def_path_io_err_rate_threshold);
+	install_keyword("path_io_err_recovery_time", &def_path_io_err_recovery_time_handler, &snprint_def_path_io_err_recovery_time);
 
 	install_keyword("find_multipaths", &def_find_multipaths_handler, &snprint_def_find_multipaths);
 	install_keyword("uxsock_timeout", &def_uxsock_timeout_handler, &snprint_def_uxsock_timeout);
@@ -1530,6 +1562,9 @@ init_keywords(vector keywords)
 	install_keyword("san_path_err_threshold", &hw_san_path_err_threshold_handler, &snprint_hw_san_path_err_threshold);
 	install_keyword("san_path_err_forget_rate", &hw_san_path_err_forget_rate_handler, &snprint_hw_san_path_err_forget_rate);
 	install_keyword("san_path_err_recovery_time", &hw_san_path_err_recovery_time_handler, &snprint_hw_san_path_err_recovery_time);
+	install_keyword("path_io_err_sample_time", &hw_path_io_err_sample_time_handler, &snprint_hw_path_io_err_sample_time);
+	install_keyword("path_io_err_rate_threshold", &hw_path_io_err_rate_threshold_handler, &snprint_hw_path_io_err_rate_threshold);
+	install_keyword("path_io_err_recovery_time", &hw_path_io_err_recovery_time_handler, &snprint_hw_path_io_err_recovery_time);
 	install_keyword("skip_kpartx", &hw_skip_kpartx_handler, &snprint_hw_skip_kpartx);
 	install_keyword("max_sectors_kb", &hw_max_sectors_kb_handler, &snprint_hw_max_sectors_kb);
 	install_sublevel_end();
@@ -1563,6 +1598,9 @@ init_keywords(vector keywords)
 	install_keyword("san_path_err_threshold", &ovr_san_path_err_threshold_handler, &snprint_ovr_san_path_err_threshold);
 	install_keyword("san_path_err_forget_rate", &ovr_san_path_err_forget_rate_handler, &snprint_ovr_san_path_err_forget_rate);
 	install_keyword("san_path_err_recovery_time", &ovr_san_path_err_recovery_time_handler, &snprint_ovr_san_path_err_recovery_time);
+	install_keyword("path_io_err_sample_time", &ovr_path_io_err_sample_time_handler, &snprint_ovr_path_io_err_sample_time);
+	install_keyword("path_io_err_rate_threshold", &ovr_path_io_err_rate_threshold_handler, &snprint_ovr_path_io_err_rate_threshold);
+	install_keyword("path_io_err_recovery_time", &ovr_path_io_err_recovery_time_handler, &snprint_ovr_path_io_err_recovery_time);
 
 	install_keyword("skip_kpartx", &ovr_skip_kpartx_handler, &snprint_ovr_skip_kpartx);
 	install_keyword("max_sectors_kb", &ovr_max_sectors_kb_handler, &snprint_ovr_max_sectors_kb);
@@ -1595,6 +1633,9 @@ init_keywords(vector keywords)
 	install_keyword("san_path_err_threshold", &mp_san_path_err_threshold_handler, &snprint_mp_san_path_err_threshold);
 	install_keyword("san_path_err_forget_rate", &mp_san_path_err_forget_rate_handler, &snprint_mp_san_path_err_forget_rate);
 	install_keyword("san_path_err_recovery_time", &mp_san_path_err_recovery_time_handler, &snprint_mp_san_path_err_recovery_time);
+	install_keyword("path_io_err_sample_time", &mp_path_io_err_sample_time_handler, &snprint_mp_path_io_err_sample_time);
+	install_keyword("path_io_err_rate_threshold", &mp_path_io_err_rate_threshold_handler, &snprint_mp_path_io_err_rate_threshold);
+	install_keyword("path_io_err_recovery_time", &mp_path_io_err_recovery_time_handler, &snprint_mp_path_io_err_recovery_time);
 	install_keyword("skip_kpartx", &mp_skip_kpartx_handler, &snprint_mp_skip_kpartx);
 	install_keyword("max_sectors_kb", &mp_max_sectors_kb_handler, &snprint_mp_max_sectors_kb);
 	install_sublevel_end();
diff --git a/libmultipath/io_err_stat.c b/libmultipath/io_err_stat.c
new file mode 100644
index 00000000..088e3354
--- /dev/null
+++ b/libmultipath/io_err_stat.c
@@ -0,0 +1,743 @@
+/*
+ * (C) Copyright HUAWEI Technology Corp. 2017, All Rights Reserved.
+ *
+ * io_err_stat.c
+ * version 1.0
+ *
+ * IO error stream statistic process for path failure event from kernel
+ *
+ * Author(s): Guan Junxiong 2017 <guanjunxiong@huawei.com>
+ *
+ * This file is released under the GPL version 2, or any later version.
+ */
+
+#include <unistd.h>
+#include <pthread.h>
+#include <signal.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/ioctl.h>
+#include <linux/fs.h>
+#include <libaio.h>
+#include <errno.h>
+#include <sys/mman.h>
+
+#include "vector.h"
+#include "memory.h"
+#include "checkers.h"
+#include "config.h"
+#include "structs.h"
+#include "structs_vec.h"
+#include "devmapper.h"
+#include "debug.h"
+#include "lock.h"
+#include "time-util.h"
+#include "io_err_stat.h"
+
+#define IOTIMEOUT_SEC			60
+#define TIMEOUT_NO_IO_NSEC		10000000 /*10ms = 10000000ns*/
+#define FLAKY_PATHFAIL_THRESHOLD	2
+#define FLAKY_PATHFAIL_TIME_FRAME	60
+#define CONCUR_NR_EVENT			32
+
+#define PATH_IO_ERR_IN_CHECKING		-1
+#define PATH_IO_ERR_IN_POLLING_RECHECK	-2
+
+#define io_err_stat_log(prio, fmt, args...) \
+	condlog(prio, "io error statistic: " fmt, ##args)
+
+
+struct io_err_stat_pathvec {
+	pthread_mutex_t mutex;
+	vector		pathvec;
+};
+
+struct dio_ctx {
+	struct timespec	io_starttime;
+	int		blksize;
+	void		*buf;
+	struct iocb	io;
+};
+
+struct io_err_stat_path {
+	char		devname[FILE_NAME_SIZE];
+	int		fd;
+	struct dio_ctx	*dio_ctx_array;
+	int		io_err_nr;
+	int		io_nr;
+	struct timespec	start_time;
+
+	int		total_time;
+	int		err_rate_threshold;
+};
+
+pthread_t		io_err_stat_thr;
+pthread_attr_t		io_err_stat_attr;
+
+static struct io_err_stat_pathvec *paths;
+struct vectors *vecs;
+io_context_t	ioctx;
+
+static void cancel_inflight_io(struct io_err_stat_path *pp);
+
+static void rcu_unregister(void *param)
+{
+	rcu_unregister_thread();
+}
+
+struct io_err_stat_path *find_err_path_by_dev(vector pathvec, char *dev)
+{
+	int i;
+	struct io_err_stat_path *pp;
+
+	if (!pathvec)
+		return NULL;
+	vector_foreach_slot(pathvec, pp, i)
+		if (!strcmp(pp->devname, dev))
+			return pp;
+
+	io_err_stat_log(4, "%s: not found in check queue", dev);
+
+	return NULL;
+}
+
+static int init_each_dio_ctx(struct dio_ctx *ct, int blksize,
+		unsigned long pgsize)
+{
+	ct->blksize = blksize;
+	if (posix_memalign(&ct->buf, pgsize, blksize))
+		return 1;
+	memset(ct->buf, 0, blksize);
+	ct->io_starttime.tv_sec = 0;
+	ct->io_starttime.tv_nsec = 0;
+
+	return 0;
+}
+
+static void deinit_each_dio_ctx(struct dio_ctx *ct)
+{
+	if (ct->buf)
+		free(ct->buf);
+}
+
+static int setup_directio_ctx(struct io_err_stat_path *p)
+{
+	unsigned long pgsize = getpagesize();
+	char fpath[PATH_MAX];
+	int blksize = 0;
+	int i;
+
+	if (snprintf(fpath, PATH_MAX, "/dev/%s", p->devname) >= PATH_MAX)
+		return 1;
+	if (p->fd < 0)
+		p->fd = open(fpath, O_RDONLY | O_DIRECT);
+	if (p->fd < 0)
+		return 1;
+
+	p->dio_ctx_array = MALLOC(sizeof(struct dio_ctx) * CONCUR_NR_EVENT);
+	if (!p->dio_ctx_array)
+		goto fail_close;
+
+	if (ioctl(p->fd, BLKBSZGET, &blksize) < 0) {
+		io_err_stat_log(4, "%s:cannot get blocksize, set default 512",
+				p->devname);
+		blksize = 512;
+	}
+	if (!blksize)
+		goto free_pdctx;
+
+	for (i = 0; i < CONCUR_NR_EVENT; i++) {
+		if (init_each_dio_ctx(p->dio_ctx_array + i, blksize, pgsize))
+			goto deinit;
+	}
+	return 0;
+
+deinit:
+	for (i = 0; i < CONCUR_NR_EVENT; i++)
+		deinit_each_dio_ctx(p->dio_ctx_array + i);
+free_pdctx:
+	FREE(p->dio_ctx_array);
+fail_close:
+	close(p->fd);
+
+	return 1;
+}
+
+static void destroy_directio_ctx(struct io_err_stat_path *p)
+{
+	int i;
+
+	if (!p || !p->dio_ctx_array)
+		return;
+	cancel_inflight_io(p);
+
+	for (i = 0; i < CONCUR_NR_EVENT; i++)
+		deinit_each_dio_ctx(p->dio_ctx_array + i);
+	FREE(p->dio_ctx_array);
+
+	if (p->fd > 0)
+		close(p->fd);
+}
+
+static struct io_err_stat_path *alloc_io_err_stat_path(void)
+{
+	struct io_err_stat_path *p;
+
+	p = (struct io_err_stat_path *)MALLOC(sizeof(*p));
+	if (!p)
+		return NULL;
+
+	memset(p->devname, 0, sizeof(p->devname));
+	p->io_err_nr = 0;
+	p->io_nr = 0;
+	p->total_time = 0;
+	p->start_time.tv_sec = 0;
+	p->start_time.tv_nsec = 0;
+	p->err_rate_threshold = 0;
+	p->fd = -1;
+
+	return p;
+}
+
+static void free_io_err_stat_path(struct io_err_stat_path *p)
+{
+	FREE(p);
+}
+
+static struct io_err_stat_pathvec *alloc_pathvec(void)
+{
+	struct io_err_stat_pathvec *p;
+	int r;
+
+	p = (struct io_err_stat_pathvec *)MALLOC(sizeof(*p));
+	if (!p)
+		return NULL;
+	p->pathvec = vector_alloc();
+	if (!p->pathvec)
+		goto out_free_struct_pathvec;
+	r = pthread_mutex_init(&p->mutex, NULL);
+	if (r)
+		goto out_free_member_pathvec;
+
+	return p;
+
+out_free_member_pathvec:
+	vector_free(p->pathvec);
+out_free_struct_pathvec:
+	FREE(p);
+	return NULL;
+}
+
+static void free_io_err_pathvec(struct io_err_stat_pathvec *p)
+{
+	struct io_err_stat_path *path;
+	int i;
+
+	if (!p)
+		return;
+	pthread_mutex_destroy(&p->mutex);
+	if (p->pathvec) {
+		vector_foreach_slot(p->pathvec, path, i) {
+			destroy_directio_ctx(path);
+			free_io_err_stat_path(path);
+		}
+		vector_free(p->pathvec);
+	}
+	FREE(p);
+}
+
+/*
+ * return value
+ * 0: enqueue OK
+ * 1: fails because of internal error
+ * 2: fails because the path is already enqueued
+ */
+static int enqueue_io_err_stat_by_path(struct path *path)
+{
+	struct io_err_stat_path *p;
+
+	pthread_mutex_lock(&paths->mutex);
+	p = find_err_path_by_dev(paths->pathvec, path->dev);
+	if (p) {
+		pthread_mutex_unlock(&paths->mutex);
+		return 2;
+	}
+	pthread_mutex_unlock(&paths->mutex);
+
+	p = alloc_io_err_stat_path();
+	if (!p)
+		return 1;
+
+	memcpy(p->devname, path->dev, sizeof(p->devname));
+	p->total_time = path->mpp->path_io_err_sample_time;
+	p->err_rate_threshold = path->mpp->path_io_err_rate_threshold;
+
+	if (setup_directio_ctx(p))
+		goto free_ioerr_path;
+	pthread_mutex_lock(&paths->mutex);
+	if (!vector_alloc_slot(paths->pathvec))
+		goto unlock_destroy;
+	vector_set_slot(paths->pathvec, p);
+	pthread_mutex_unlock(&paths->mutex);
+
+	if (!path->io_err_disable_reinstate) {
+		/*
+		 * fail the path in the kernel for the duration of the
+		 * test to make the test more reliable
+		 */
+		io_err_stat_log(3, "%s: fail dm path %s before checking",
+				path->mpp->alias, path->dev);
+		path->io_err_disable_reinstate = 1;
+		dm_fail_path(path->mpp->alias, path->dev_t);
+		update_queue_mode_del_path(path->mpp);
+
+		/*
+		 * schedule path check as soon as possible to
+		 * update path state to delayed state
+		 */
+		path->tick = 1;
+
+	}
+	io_err_stat_log(2, "%s: enqueue path %s to check",
+			path->mpp->alias, path->dev);
+	return 0;
+
+unlock_destroy:
+	pthread_mutex_unlock(&paths->mutex);
+	destroy_directio_ctx(p);
+free_ioerr_path:
+	free_io_err_stat_path(p);
+
+	return 1;
+}
+
+int io_err_stat_handle_pathfail(struct path *path)
+{
+	struct timespec curr_time;
+	int res;
+
+	if (path->io_err_disable_reinstate) {
+		io_err_stat_log(3, "%s: reinstate is already disabled",
+				path->dev);
+		return 1;
+	}
+	if (path->io_err_pathfail_cnt < 0)
+		return 1;
+
+	if (!path->mpp)
+		return 1;
+	if (path->mpp->nr_active <= 1)
+		return 1;
+	if (path->mpp->path_io_err_sample_time <= 0 ||
+		path->mpp->path_io_err_recovery_time <= 0 ||
+		path->mpp->path_io_err_rate_threshold < 0) {
+		io_err_stat_log(4, "%s: parameter not set", path->mpp->alias);
+		return 1;
+	}
+	if (path->mpp->path_io_err_sample_time < (2 * IOTIMEOUT_SEC)) {
+		io_err_stat_log(2, "%s: path_io_err_sample_time should not be less than %d",
+				path->mpp->alias, 2 * IOTIMEOUT_SEC);
+		return 1;
+	}
+	/*
+	 * The test should only be started for paths that have failed
+	 * repeatedly in a certain time frame, so that we have reason
+	 * to assume they're flaky. Without bothering the admin to configure
+	 * the repeat count threshold and time frame, we assume a path
+	 * which fails at least twice within 60 seconds is flaky.
+	 */
+	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
+		return 1;
+	if (path->io_err_pathfail_cnt == 0) {
+		path->io_err_pathfail_cnt++;
+		path->io_err_pathfail_starttime = curr_time.tv_sec;
+		io_err_stat_log(5, "%s: start path flakiness pre-checking",
+				path->dev);
+		return 0;
+	}
+	if ((curr_time.tv_sec - path->io_err_pathfail_starttime) >
+			FLAKY_PATHFAIL_TIME_FRAME) {
+		path->io_err_pathfail_cnt = 0;
+		path->io_err_pathfail_starttime = curr_time.tv_sec;
+		io_err_stat_log(5, "%s: restart path flakiness pre-checking",
+				path->dev);
+	}
+	path->io_err_pathfail_cnt++;
+	if (path->io_err_pathfail_cnt >= FLAKY_PATHFAIL_THRESHOLD) {
+		res = enqueue_io_err_stat_by_path(path);
+		if (!res)
+			path->io_err_pathfail_cnt = PATH_IO_ERR_IN_CHECKING;
+		else
+			path->io_err_pathfail_cnt = 0;
+	}
+
+	return 0;
+}
+
+int hit_io_err_recheck_time(struct path *pp)
+{
+	struct timespec curr_time;
+	int r;
+
+	if (pp->io_err_disable_reinstate == 0)
+		return 1;
+	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
+		return 1;
+	if (pp->io_err_pathfail_cnt != PATH_IO_ERR_IN_POLLING_RECHECK)
+		return 1;
+	if (pp->mpp->nr_active <= 0) {
+		io_err_stat_log(2, "%s: recover path early", pp->dev);
+		goto recover;
+	}
+	if ((curr_time.tv_sec - pp->io_err_dis_reinstate_time) >
+			pp->mpp->path_io_err_recovery_time) {
+		io_err_stat_log(4, "%s: reschedule checking after %d seconds",
+				pp->dev, pp->mpp->path_io_err_sample_time);
+		/*
+		 * Reschedule IO error checking. If the path then proves
+		 * good enough, we claim it is good and it can be
+		 * reinstated as soon as possible in the check_path
+		 * routine.
+		 */
+		pp->io_err_dis_reinstate_time = curr_time.tv_sec;
+		r = enqueue_io_err_stat_by_path(pp);
+		/*
+		 * If the enqueue fails because of an internal error,
+		 * recover this path. Otherwise, return 1 to set the
+		 * path state to PATH_DELAYED.
+		 */
+		if (r == 1) {
+			io_err_stat_log(3, "%s: enqueue fails, to recover",
+					pp->dev);
+			goto recover;
+		} else if (!r) {
+			pp->io_err_pathfail_cnt = PATH_IO_ERR_IN_CHECKING;
+		}
+	}
+
+	return 1;
+
+recover:
+	pp->io_err_pathfail_cnt = 0;
+	pp->io_err_disable_reinstate = 0;
+	pp->tick = 1;
+	return 0;
+}
+
+static int delete_io_err_stat_by_addr(struct io_err_stat_path *p)
+{
+	int i;
+
+	i = find_slot(paths->pathvec, p);
+	if (i != -1)
+		vector_del_slot(paths->pathvec, i);
+
+	destroy_directio_ctx(p);
+	free_io_err_stat_path(p);
+
+	return 0;
+}
+
+static void account_async_io_state(struct io_err_stat_path *pp, int rc) 
+{
+	switch (rc) {
+	case PATH_DOWN:
+	case PATH_TIMEOUT:
+		pp->io_err_nr++;
+		break;
+	case PATH_UNCHECKED:
+	case PATH_UP:
+	case PATH_PENDING:
+		break;
+	default:
+		break;
+	}
+}
+
+static int poll_io_err_stat(struct vectors *vecs, struct io_err_stat_path *pp)
+{
+	struct timespec currtime, difftime;
+	struct path *path;
+	double err_rate;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &currtime) != 0)
+		return 1;
+	timespecsub(&currtime, &pp->start_time, &difftime);
+	if (difftime.tv_sec < pp->total_time)
+		return 0;
+
+	io_err_stat_log(4, "check end for %s", pp->devname);
+
+	err_rate = pp->io_nr == 0 ? 0 : (pp->io_err_nr * 1000.0f) / pp->io_nr;
+	io_err_stat_log(5, "%s: IO error rate (%.1f/1000)",
+			pp->devname, err_rate);
+	pthread_cleanup_push(cleanup_lock, &vecs->lock);
+	lock(&vecs->lock);
+	pthread_testcancel();
+	path = find_path_by_dev(vecs->pathvec, pp->devname);
+	if (!path) {
+		io_err_stat_log(4, "path %s not found", pp->devname);
+	} else if (err_rate <= pp->err_rate_threshold) {
+		path->io_err_pathfail_cnt = 0;
+		path->io_err_disable_reinstate = 0;
+		io_err_stat_log(4, "%s: (%d/%d) good to enable reinstating",
+				pp->devname, pp->io_err_nr, pp->io_nr);
+		/*
+		 * schedule path check as soon as possible to
+		 * update path state. Do NOT reinstate dm path here
+		 */
+		path->tick = 1;
+
+	} else if (path->mpp && path->mpp->nr_active > 1) {
+		io_err_stat_log(3, "%s: keep failing dm path %s",
+				path->mpp->alias, path->dev);
+		path->io_err_pathfail_cnt = PATH_IO_ERR_IN_POLLING_RECHECK;
+		path->io_err_disable_reinstate = 1;
+		path->io_err_dis_reinstate_time = currtime.tv_sec;
+		io_err_stat_log(3, "%s: to disable %s to reinstate",
+				path->mpp->alias, path->dev);
+	} else {
+		path->io_err_pathfail_cnt = 0;
+		path->io_err_disable_reinstate = 0;
+		io_err_stat_log(4, "%s: orphan path, enable reinstating",
+				pp->devname);
+	}
+	lock_cleanup_pop(vecs->lock);
+
+	delete_io_err_stat_by_addr(pp);
+
+	return 0;
+}
+
+static int send_each_async_io(struct dio_ctx *ct, int fd, char *dev)
+{
+	int rc = -1;
+
+	if (ct->io_starttime.tv_nsec == 0 &&
+			ct->io_starttime.tv_sec == 0) {
+		struct iocb *ios[1] = { &ct->io };
+
+		if (clock_gettime(CLOCK_MONOTONIC, &ct->io_starttime) != 0) {
+			ct->io_starttime.tv_sec = 0;
+			ct->io_starttime.tv_nsec = 0;
+			return rc;
+		}
+		io_prep_pread(&ct->io, fd, ct->buf, ct->blksize, 0);
+		if (io_submit(ioctx, 1, ios) != 1) {
+			io_err_stat_log(5, "%s: io_submit error %i",
+					dev, errno);
+			return rc;
+		}
+		rc = 0;
+	}
+
+	return rc;
+}
+
+static void send_batch_async_ios(struct io_err_stat_path *pp)
+{
+	int i;
+	struct dio_ctx *ct;
+	struct timespec currtime, difftime;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &currtime) != 0)
+		return;
+	/*
+	 * Leave free time for all IO to complete or time out
+	 */
+	if (pp->start_time.tv_sec != 0) {
+		timespecsub(&currtime, &pp->start_time, &difftime);
+		if (difftime.tv_sec + IOTIMEOUT_SEC >= pp->total_time)
+			return;
+	}
+
+	for (i = 0; i < CONCUR_NR_EVENT; i++) {
+		ct = pp->dio_ctx_array + i;
+		if (!send_each_async_io(ct, pp->fd, pp->devname))
+			pp->io_nr++;
+	}
+	if (pp->start_time.tv_sec == 0 && pp->start_time.tv_nsec == 0 &&
+		clock_gettime(CLOCK_MONOTONIC, &pp->start_time)) {
+		pp->start_time.tv_sec = 0;
+		pp->start_time.tv_nsec = 0;
+	}
+}
+
+static int try_to_cancel_timeout_io(struct dio_ctx *ct, struct timespec *t,
+		char *dev)
+{
+	struct timespec	difftime;
+	struct io_event	event;
+	int		rc = PATH_UNCHECKED;
+	int		r;
+
+	if (ct->io_starttime.tv_sec == 0)
+		return rc;
+	timespecsub(t, &ct->io_starttime, &difftime);
+	if (difftime.tv_sec > IOTIMEOUT_SEC) {
+		struct iocb *ios[1] = { &ct->io };
+
+		io_err_stat_log(5, "%s: abort check on timeout", dev);
+		r = io_cancel(ioctx, ios[0], &event);
+		if (r)
+			io_err_stat_log(5, "%s: io_cancel error %i",
+					dev, errno);
+		ct->io_starttime.tv_sec = 0;
+		ct->io_starttime.tv_nsec = 0;
+		rc = PATH_TIMEOUT;
+	} else {
+		rc = PATH_PENDING;
+	}
+
+	return rc;
+}
+
+static void poll_async_io_timeout(void)
+{
+	struct io_err_stat_path *pp;
+	struct timespec curr_time;
+	int		rc = PATH_UNCHECKED;
+	int		i, j;
+
+	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
+		return;
+	vector_foreach_slot(paths->pathvec, pp, i) {
+		for (j = 0; j < CONCUR_NR_EVENT; j++) {
+			rc = try_to_cancel_timeout_io(pp->dio_ctx_array + j,
+					&curr_time, pp->devname);
+			account_async_io_state(pp, rc);
+		}
+	}
+}
+
+static void cancel_inflight_io(struct io_err_stat_path *pp)
+{
+	struct io_event event;
+	int i, r;
+
+	for (i = 0; i < CONCUR_NR_EVENT; i++) {
+		struct dio_ctx *ct = pp->dio_ctx_array + i;
+		struct iocb *ios[1] = { &ct->io };
+
+		if (ct->io_starttime.tv_sec == 0
+				&& ct->io_starttime.tv_nsec == 0)
+			continue;
+		io_err_stat_log(5, "%s: abort inflight io",
+				pp->devname);
+		r = io_cancel(ioctx, ios[0], &event);
+		if (r)
+			io_err_stat_log(5, "%s: io_cancel error %d, %i",
+					pp->devname, r, errno);
+		ct->io_starttime.tv_sec = 0;
+		ct->io_starttime.tv_nsec = 0;
+	}
+}
+
+static inline int handle_done_dio_ctx(struct dio_ctx *ct, struct io_event *ev)
+{
+	ct->io_starttime.tv_sec = 0;
+	ct->io_starttime.tv_nsec = 0;
+	return (ev->res == ct->blksize) ? PATH_UP : PATH_DOWN;
+}
+
+static void handle_async_io_done_event(struct io_event *io_evt)
+{
+	struct io_err_stat_path *pp;
+	struct dio_ctx *ct;
+	int rc = PATH_UNCHECKED;
+	int i, j;
+
+	vector_foreach_slot(paths->pathvec, pp, i) {
+		for (j = 0; j < CONCUR_NR_EVENT; j++) {
+			ct = pp->dio_ctx_array + j;
+			if (&ct->io == io_evt->obj) {
+				rc = handle_done_dio_ctx(ct, io_evt);
+				account_async_io_state(pp, rc);
+				return;
+			}
+		}
+	}
+}
+
+static void process_async_ios_event(int timeout_nsecs, char *dev)
+{
+	struct io_event events[CONCUR_NR_EVENT];
+	int		i, n;
+	struct timespec	timeout = { .tv_nsec = timeout_nsecs };
+
+	errno = 0;
+	n = io_getevents(ioctx, 1L, CONCUR_NR_EVENT, events, &timeout);
+	if (n < 0) {
+		io_err_stat_log(3, "%s: async io events returned %d (errno=%s)",
+				dev, n, strerror(errno));
+	} else {
+		for (i = 0; i < n; i++)
+			handle_async_io_done_event(&events[i]);
+	}
+}
+
+static void service_paths(void)
+{
+	struct io_err_stat_path *pp;
+	int i;
+
+	pthread_mutex_lock(&paths->mutex);
+	vector_foreach_slot(paths->pathvec, pp, i) {
+		send_batch_async_ios(pp);
+		process_async_ios_event(TIMEOUT_NO_IO_NSEC, pp->devname);
+		poll_async_io_timeout();
+		poll_io_err_stat(vecs, pp);
+	}
+	pthread_mutex_unlock(&paths->mutex);
+}
+
+static void *io_err_stat_loop(void *data)
+{
+	vecs = (struct vectors *)data;
+	pthread_cleanup_push(rcu_unregister, NULL);
+	rcu_register_thread();
+
+	mlockall(MCL_CURRENT | MCL_FUTURE);
+	while (1) {
+		service_paths();
+		usleep(100000);
+	}
+
+	pthread_cleanup_pop(1);
+	return NULL;
+}
+
+int start_io_err_stat_thread(void *data)
+{
+	if (io_setup(CONCUR_NR_EVENT, &ioctx) != 0) {
+		io_err_stat_log(4, "io_setup failed");
+		return 1;
+	}
+	paths = alloc_pathvec();
+	if (!paths)
+		goto destroy_ctx;
+
+	if (pthread_create(&io_err_stat_thr, &io_err_stat_attr,
+				io_err_stat_loop, data)) {
+		io_err_stat_log(0, "cannot create io_error statistic thread");
+		goto out_free;
+	}
+	io_err_stat_log(3, "thread started");
+	return 0;
+
+out_free:
+	free_io_err_pathvec(paths);
+destroy_ctx:
+	io_destroy(ioctx);
+	io_err_stat_log(0, "failed to start io_error statistic thread");
+	return 1;
+}
+
+void stop_io_err_stat_thread(void)
+{
+	pthread_cancel(io_err_stat_thr);
+	pthread_kill(io_err_stat_thr, SIGUSR2);
+	free_io_err_pathvec(paths);
+	io_destroy(ioctx);
+}
diff --git a/libmultipath/io_err_stat.h b/libmultipath/io_err_stat.h
new file mode 100644
index 00000000..bbf31b4f
--- /dev/null
+++ b/libmultipath/io_err_stat.h
@@ -0,0 +1,15 @@
+#ifndef _IO_ERR_STAT_H
+#define _IO_ERR_STAT_H
+
+#include "vector.h"
+#include "lock.h"
+
+
+extern pthread_attr_t io_err_stat_attr;
+
+int start_io_err_stat_thread(void *data);
+void stop_io_err_stat_thread(void);
+int io_err_stat_handle_pathfail(struct path *path);
+int hit_io_err_recheck_time(struct path *pp);
+
+#endif /* _IO_ERR_STAT_H */
diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index 175fbe11..9d2c3c09 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -731,6 +731,7 @@ out:
 	return 0;
 
 }
+
 int select_san_path_err_threshold(struct config *conf, struct multipath *mp)
 {
 	char *origin, buff[12];
@@ -761,6 +762,7 @@ out:
 	return 0;
 
 }
+
 int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp)
 {
 	char *origin, buff[12];
@@ -776,6 +778,57 @@ out:
 	return 0;
 
 }
+
+int select_path_io_err_sample_time(struct config *conf, struct multipath *mp)
+{
+	char *origin, buff[12];
+
+	mp_set_mpe(path_io_err_sample_time);
+	mp_set_ovr(path_io_err_sample_time);
+	mp_set_hwe(path_io_err_sample_time);
+	mp_set_conf(path_io_err_sample_time);
+	mp_set_default(path_io_err_sample_time, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, &mp->path_io_err_sample_time);
+	condlog(3, "%s: path_io_err_sample_time = %s %s", mp->alias, buff,
+			origin);
+	return 0;
+}
+
+int select_path_io_err_rate_threshold(struct config *conf, struct multipath *mp)
+{
+	char *origin, buff[12];
+
+	mp_set_mpe(path_io_err_rate_threshold);
+	mp_set_ovr(path_io_err_rate_threshold);
+	mp_set_hwe(path_io_err_rate_threshold);
+	mp_set_conf(path_io_err_rate_threshold);
+	mp_set_default(path_io_err_rate_threshold, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, &mp->path_io_err_rate_threshold);
+	condlog(3, "%s: path_io_err_rate_threshold = %s %s", mp->alias, buff,
+			origin);
+	return 0;
+
+}
+
+int select_path_io_err_recovery_time(struct config *conf, struct multipath *mp)
+{
+	char *origin, buff[12];
+
+	mp_set_mpe(path_io_err_recovery_time);
+	mp_set_ovr(path_io_err_recovery_time);
+	mp_set_hwe(path_io_err_recovery_time);
+	mp_set_conf(path_io_err_recovery_time);
+	mp_set_default(path_io_err_recovery_time, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, &mp->path_io_err_recovery_time);
+	condlog(3, "%s: path_io_err_recovery_time = %s %s", mp->alias, buff,
+			origin);
+	return 0;
+
+}
+
 int select_skip_kpartx (struct config *conf, struct multipath * mp)
 {
 	char *origin;
diff --git a/libmultipath/propsel.h b/libmultipath/propsel.h
index f8e96d85..1b2b5714 100644
--- a/libmultipath/propsel.h
+++ b/libmultipath/propsel.h
@@ -28,6 +28,9 @@ int select_max_sectors_kb (struct config *conf, struct multipath * mp);
 int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp);
 int select_san_path_err_threshold(struct config *conf, struct multipath *mp);
 int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp);
+int select_path_io_err_sample_time(struct config *conf, struct multipath *mp);
+int select_path_io_err_rate_threshold(struct config *conf, struct multipath *mp);
+int select_path_io_err_recovery_time(struct config *conf, struct multipath *mp);
 void reconcile_features_with_options(const char *id, char **features,
 				     int* no_path_retry,
 				     int *retain_hwhandler);
diff --git a/libmultipath/structs.h b/libmultipath/structs.h
index 8ea984d9..1ab8cb9b 100644
--- a/libmultipath/structs.h
+++ b/libmultipath/structs.h
@@ -235,6 +235,10 @@ struct path {
 	time_t dis_reinstate_time;
 	int disable_reinstate;
 	int san_path_err_forget_rate;
+	time_t io_err_dis_reinstate_time;
+	int io_err_disable_reinstate;
+	int io_err_pathfail_cnt;
+	int io_err_pathfail_starttime;
 	/* configlet pointers */
 	struct hwentry * hwe;
 };
@@ -269,6 +273,9 @@ struct multipath {
 	int san_path_err_threshold;
 	int san_path_err_forget_rate;
 	int san_path_err_recovery_time;
+	int path_io_err_sample_time;
+	int path_io_err_rate_threshold;
+	int path_io_err_recovery_time;
 	int skip_kpartx;
 	int max_sectors_kb;
 	int force_readonly;
diff --git a/libmultipath/uevent.c b/libmultipath/uevent.c
index eb44da56..e74e3dad 100644
--- a/libmultipath/uevent.c
+++ b/libmultipath/uevent.c
@@ -913,3 +913,35 @@ char *uevent_get_dm_name(struct uevent *uev)
 	}
 	return p;
 }
+
+char *uevent_get_dm_path(struct uevent *uev)
+{
+	char *p = NULL;
+	int i;
+
+	for (i = 0; uev->envp[i] != NULL; i++) {
+		if (!strncmp(uev->envp[i], "DM_PATH", 7) &&
+		    strlen(uev->envp[i]) > 8) {
+			p = MALLOC(strlen(uev->envp[i] + 8) + 1);
+			strcpy(p, uev->envp[i] + 8);
+			break;
+		}
+	}
+	return p;
+}
+
+char *uevent_get_dm_action(struct uevent *uev)
+{
+	char *p = NULL;
+	int i;
+
+	for (i = 0; uev->envp[i] != NULL; i++) {
+		if (!strncmp(uev->envp[i], "DM_ACTION", 9) &&
+		    strlen(uev->envp[i]) > 10) {
+			p = MALLOC(strlen(uev->envp[i] + 10) + 1);
+			strcpy(p, uev->envp[i] + 10);
+			break;
+		}
+	}
+	return p;
+}
diff --git a/libmultipath/uevent.h b/libmultipath/uevent.h
index 61a42071..6f5af0af 100644
--- a/libmultipath/uevent.h
+++ b/libmultipath/uevent.h
@@ -37,5 +37,7 @@ int uevent_get_major(struct uevent *uev);
 int uevent_get_minor(struct uevent *uev);
 int uevent_get_disk_ro(struct uevent *uev);
 char *uevent_get_dm_name(struct uevent *uev);
+char *uevent_get_dm_path(struct uevent *uev);
+char *uevent_get_dm_action(struct uevent *uev);
 
 #endif /* _UEVENT_H */
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index d9ac279f..f49ede66 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -849,6 +849,53 @@ The default is: \fBno\fR
 .
 .
 .TP
+.B path_io_err_sample_time
+One of the three parameters supporting path checking based on IO error
+accounting, intended for intermittent errors such as those caused by a
+shaky link. If this is set to a value no less than 120, then whenever a
+path-fail event occurs twice within 60 seconds due to an IO error,
+multipathd will fail the path and enqueue it on a queue whose members
+are sent a series of continuous direct-read asynchronous IOs at a fixed
+sample rate of 10 Hz. The IO accounting process for a path lasts for
+\fIpath_io_err_sample_time\fR seconds.
+If the rate of IO errors on a particular path is greater than
+\fIpath_io_err_rate_threshold\fR, the path will not be reinstated for
+\fIpath_io_err_recovery_time\fR seconds unless it is the only active path.
+After \fIpath_io_err_recovery_time\fR expires, the path will be requeued
+for checking. If the result of that check is good enough, the path will
+be reinstated.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
+.B path_io_err_rate_threshold
+The error rate threshold as a permillage (1/1000). One of the three
+parameters supporting path checking based on IO error accounting,
+intended for intermittent errors. Refer to \fIpath_io_err_sample_time\fR.
+If the rate of IO errors on a particular path is greater than this
+parameter, the path will not be reinstated for
+\fIpath_io_err_recovery_time\fR seconds unless it is the only active path.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
+.B path_io_err_recovery_time
+One of the three parameters supporting path checking based on IO error
+accounting, intended for intermittent errors. Refer to
+\fIpath_io_err_sample_time\fR. If this parameter is set to a positive
+value, a path that has shown too many errors will not be reinstated for
+\fIpath_io_err_recovery_time\fR seconds. After that time expires, the
+path will be requeued for checking. If the result of that check is good
+enough, the path will be reinstated.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
 .B delay_watch_checks
 If set to a value greater than 0, multipathd will watch paths that have
 recently become valid for this many checks. If they fail again while they are
@@ -1119,6 +1166,12 @@ are taken from the \fIdefaults\fR or \fIdevices\fR section:
 .TP
 .B san_path_err_recovery_time
 .TP
+.B path_io_err_sample_time
+.TP
+.B path_io_err_rate_threshold
+.TP
+.B path_io_err_recovery_time
+.TP
 .B delay_watch_checks
 .TP
 .B delay_wait_checks
@@ -1246,6 +1299,12 @@ section:
 .TP
 .B san_path_err_recovery_time
 .TP
+.B path_io_err_sample_time
+.TP
+.B path_io_err_rate_threshold
+.TP
+.B path_io_err_recovery_time
+.TP
 .B delay_watch_checks
 .TP
 .B delay_wait_checks
@@ -1318,6 +1377,12 @@ the values are taken from the \fIdevices\fR or \fIdefaults\fR sections:
 .TP
 .B san_path_err_recovery_time
 .TP
+.B path_io_err_sample_time
+.TP
+.B path_io_err_rate_threshold
+.TP
+.B path_io_err_recovery_time
+.TP
 .B delay_watch_checks
 .TP
 .B delay_wait_checks
diff --git a/multipathd/main.c b/multipathd/main.c
index 4be2c579..38158006 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -84,6 +84,7 @@ int uxsock_timeout;
 #include "cli_handlers.h"
 #include "lock.h"
 #include "waiter.h"
+#include "io_err_stat.h"
 #include "wwids.h"
 #include "../third-party/valgrind/drd.h"
 
@@ -1050,6 +1051,41 @@ out:
 }
 
 static int
+uev_pathfail_check(struct uevent *uev, struct vectors *vecs)
+{
+	char *action = NULL, *devt = NULL;
+	struct path *pp;
+	int r;
+
+	action = uevent_get_dm_action(uev);
+	if (!action)
+		return 1;
+	if (strncmp(action, "PATH_FAILED", 11))
+		goto out;
+	devt = uevent_get_dm_path(uev);
+	if (!devt) {
+		condlog(3, "%s: No DM_PATH in uevent", uev->kernel);
+		goto out;
+	}
+
+	pthread_cleanup_push(cleanup_lock, &vecs->lock);
+	lock(&vecs->lock);
+	pthread_testcancel();
+	pp = find_path_by_devt(vecs->pathvec, devt);
+	/* pp may be NULL if the path is unknown to us */
+	r = pp ? io_err_stat_handle_pathfail(pp) : 1;
+	lock_cleanup_pop(vecs->lock);
+
+	if (r)
+		condlog(3, "io_err_stat: failed to enqueue %s", devt);
+	FREE(devt);
+	FREE(action);
+	return 0;
+out:
+	FREE(action);
+	return 1;
+}
+
+static int
 map_discovery (struct vectors * vecs)
 {
 	struct multipath * mpp;
@@ -1134,6 +1170,7 @@ uev_trigger (struct uevent * uev, void * trigger_data)
 	if (!strncmp(uev->kernel, "dm-", 3)) {
 		if (!strncmp(uev->action, "change", 6)) {
 			r = uev_add_map(uev, vecs);
+			uev_pathfail_check(uev, vecs);
 			goto out;
 		}
 		if (!strncmp(uev->action, "remove", 6)) {
@@ -1553,6 +1590,7 @@ static int check_path_reinstate_state(struct path * pp) {
 		condlog(2, "%s : hit error threshold. Delaying path reinstatement", pp->dev);
 		pp->dis_reinstate_time = curr_time.tv_sec;
 		pp->disable_reinstate = 1;
+
 		return 1;
 	} else {
 		return 0;
@@ -1684,6 +1722,16 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 		return 1;
 	}
 
+	if (pp->io_err_disable_reinstate && hit_io_err_recheck_time(pp)) {
+		pp->state = PATH_DELAYED;
+		/*
+		 * to reschedule as soon as possible, so that this path can
+		 * be recovered in time
+		 */
+		pp->tick = 1;
+		return 1;
+	}
+
 	if ((newstate == PATH_UP || newstate == PATH_GHOST) &&
 	     pp->wait_checks > 0) {
 		if (pp->mpp->nr_active > 0) {
@@ -2377,6 +2425,7 @@ child (void * param)
 	setup_thread_attr(&misc_attr, 64 * 1024, 0);
 	setup_thread_attr(&uevent_attr, DEFAULT_UEVENT_STACKSIZE * 1024, 0);
 	setup_thread_attr(&waiter_attr, 32 * 1024, 1);
+	setup_thread_attr(&io_err_stat_attr, 32 * 1024, 1);
 
 	if (logsink == 1) {
 		setup_thread_attr(&log_attr, 64 * 1024, 0);
@@ -2499,6 +2548,10 @@ child (void * param)
 	/*
 	 * start threads
 	 */
+	rc = start_io_err_stat_thread(vecs);
+	if (rc)
+		goto failed;
+
 	if ((rc = pthread_create(&check_thr, &misc_attr, checkerloop, vecs))) {
 		condlog(0,"failed to create checker loop thread: %d", rc);
 		goto failed;
@@ -2548,6 +2601,8 @@ child (void * param)
 	remove_maps_and_stop_waiters(vecs);
 	unlock(&vecs->lock);
 
+	stop_io_err_stat_thread();
+
 	pthread_cancel(check_thr);
 	pthread_cancel(uevent_thr);
 	pthread_cancel(uxlsnr_thr);
@@ -2593,6 +2648,7 @@ child (void * param)
 	udev_unref(udev);
 	udev = NULL;
 	pthread_attr_destroy(&waiter_attr);
+	pthread_attr_destroy(&io_err_stat_attr);
 #ifdef _DEBUG_
 	dbg_free_final(NULL);
 #endif
--
2.11.1

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-18 12:53   ` Muneendra Kumar M
@ 2017-09-18 14:36     ` Guan Junxiong
  2017-09-18 19:51       ` Martin Wilck
  0 siblings, 1 reply; 17+ messages in thread
From: Guan Junxiong @ 2017-09-18 14:36 UTC (permalink / raw)
  To: Muneendra Kumar M, dm-devel, christophe.varoqui, mwilck
  Cc: chengjike.cheng, niuhaoxin, shenhong09

Hi Muneendra,

Thanks for your feedback.  My comments are inline below.

On 2017/9/18 20:53, Muneendra Kumar M wrote:
> Hi Guan,
> This a good effort for detecting the intermittent IO error accounting to improve reliability.
> Your new algorithm is  mutually exclusive with san_path_err_XXX.
> It resolved the below issue which you have mentioned .
>>> Even the san_path_err_threshold , san_path_err_forget_rate and san_path_err_recovery_time is turned on,
>>> the detect sample interval of that path checkers is so big/coarse that it doesn't see what happens in the middle of the sample interval.
> 
> But I have some concerns.
> 
> Correct me if my understanding on the below line is correct
>>> On a particular path when a path failing events occur twice in 60 second due to an IO error, multipathd will fail the path and enqueue 
>>> this path into a queue of which each member is sent a couple of continuous direct reading asynchronous io at a fixed sample rate of 10HZ. 
> 
> Once we hit the above condition (2 errors in 60 secs) for a path_io_err_sample_time we keeps on injecting the asynchronous io at a fixed sample rate of 10HZ.
> And during this path_io_err_sample_time if we hit the the path_io_err_rate_threshold then we will not reinstantate this path for a path_io_err_recovery_time.
> Is this understanding correct?
>

Partially correct.
If we hit the above condition (2 errors in 60 secs), we will fail the path first, before injecting a couple of asynchronous IOs, so that the testing is not affected by other IO.
And after this path_io_err_sample_time:
(1) if we hit the path_io_err_rate_threshold, the failed path will remain unchanged, and then after the path_io_err_recovery_time
(which is confusing, sorry, I will rename it to "recheck") we will reschedule this IO error checking process again.
(2) if we do NOT hit the path_io_err_rate_threshold, the failed path will be reinstated by the path-checking thread within a tick (1 second) ASAP.


> If the above understanding is correct then my concern is :
> 1) On a particular path if we are seeing continuous errors but not within 60 secs (may be for every 120 secs) of duration how do we handle this. Still this a shaky link.
> This is what our customers are pointing out.
> And if i am not wrong the new algorithm will comes into place only  if a path failing events occur twice in 60 seconds.
> 
> Then this will not solve the intermittent IO error issue which we are seeing as the data is still going on the shaky path .
> I think this is the place where we need to pull in  in san_path_err_forget_rate .
> 

Yes.  I have thought about using some adjustable parameters such as san_path_err_pre_check_time and san_path_err_threshold to cover ALL the scenarios the user encounters.
In the above fixed example, san_path_err_pre_check_time is set to 60 seconds and san_path_err_threshold is set to 2.
However, if I adopt this, we have 5 parameters (san_path_err_pre_check_time and san_path_err_threshold + 3 path_io_err_XXXs) to support this feature. You know, multipath.conf
configuration is becoming more and more daunting, as Martin pointed out in V1 of this patch.

But now, maybe it is acceptable for users to use the 5 parameters if we set san_path_err_pre_check_time and san_path_err_threshold to some default values, such as 60 seconds and 2 respectively.
**Martin**, **Muneendra**, how about this small compromise?  If it is OK, I will update it in the next version of the patch.
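For concreteness, a defaults-section fragment with all five knobs might look like the following. Note that san_path_err_pre_check_time is only the name proposed in this thread, not an existing option, and the values are purely illustrative:

```
defaults {
	# proposed pre-check: 2 path-fail events within 60 s arm the test
	san_path_err_pre_check_time	60
	san_path_err_threshold		2
	# IO error accounting phase from this patch set
	path_io_err_sample_time		120
	path_io_err_rate_threshold	10
	path_io_err_recovery_time	300
}
```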


> Our main intention to bring the san_path_err_XXX patch was ,if we are hitting   i/o errors on a path which are exceeding san_path_err_threshold within a san_path_err_forget_rate then 
> We are not supposed to reinstate the path for san_path_err_recovery_time.
> 
> 
> path_io_err_sample_time should be a  sub window of san_path_err_forget_rate.

No, path_io_err_sample_time takes effect after san_path_err_forget_rate (equal to the above san_path_err_pre_check_time). They apply conditionally, in sequence.


> If the errors are not happening within 60 secs duration, still  we need to keep track of  the number of errors and if the error threshold is hit within san_path_err_forget_rate  then the path will not reinstate for recover_time seconds.
> With the combination of these two we can find the shaky path within path_io_err_sample_time / san_path_err_forget_rate.
> 
> Regards,
> Muneendra.
> 

san_path_err_forget_rate is hard to understand, shall we use san_path_err_pre_check_time instead?

Best Wishes
Guan


> 
> -----Original Message-----
> From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
> Sent: Sunday, September 17, 2017 9:11 AM
> To: dm-devel@redhat.com; christophe.varoqui@opensvc.com; mwilck@suse.com
> Cc: Muneendra Kumar M <mmandala@Brocade.com>; shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com; guanjunxiong@huawei.com
> Subject: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
> 
> This patch adds a new method of path state checking based on accounting IO error. This is useful in many scenarios such as intermittent IO error an a path due to network congestion, or a shaky link.
> 
> Three parameters are added for the admin: "path_io_err_sample_time", "path_io_err_rate_threshold" and "path_io_err_recovery_time".
> If path_io_err_sample_time are set no less than 120 and path_io_err_recovery_time are set to a value greater than 0, when path failing events occur twice in 60 second due to an IO error, multipathd will fail the path and enqueue this path into a queue of which each member is sent a couple of continuous direct reading asynchronous io at a fixed sample rate of 10HZ. The IO accounting process for a path will last for path_io_err_sample_time. If the IO error rate on a particular path is greater than the path_io_err_rate_threshold, then the path will not reinstate for recover_time seconds unless there is only one active path.
> 
> If recover_time expires, we will reschedule this IO error checking process. If the path is good enough, we will claim it good.
> 
> This helps us place the path in delayed state if we hit a lot of intermittent IO errors on a particular path due to network/target issues and isolate such degraded path and allow the admin to rectify the errors on a path.
> 
> Signed-off-by: Junxiong Guan <guanjunxiong@huawei.com>
> ---
>  libmultipath/Makefile      |   5 +-
>  libmultipath/config.h      |   9 +
>  libmultipath/configure.c   |   3 +
>  libmultipath/dict.c        |  41 +++
>  libmultipath/io_err_stat.c | 743 +++++++++++++++++++++++++++++++++++++++++++++
>  libmultipath/io_err_stat.h |  15 +
>  libmultipath/propsel.c     |  53 ++++
>  libmultipath/propsel.h     |   3 +
>  libmultipath/structs.h     |   7 +
>  libmultipath/uevent.c      |  32 ++
>  libmultipath/uevent.h      |   2 +
>  multipath/multipath.conf.5 |  65 ++++
>  multipathd/main.c          |  56 ++++
>  13 files changed, 1032 insertions(+), 2 deletions(-)  create mode 100644 libmultipath/io_err_stat.c  create mode 100644 libmultipath/io_err_stat.h
> 
> diff --git a/libmultipath/Makefile b/libmultipath/Makefile index b3244fc7..dce73afe 100644
> --- a/libmultipath/Makefile
> +++ b/libmultipath/Makefile

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-18 14:36     ` Guan Junxiong
@ 2017-09-18 19:51       ` Martin Wilck
  2017-09-19  1:32         ` Guan Junxiong
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Wilck @ 2017-09-18 19:51 UTC (permalink / raw)
  To: Guan Junxiong, Muneendra Kumar M, dm-devel, christophe.varoqui
  Cc: chengjike.cheng, niuhaoxin, shenhong09

On Mon, 2017-09-18 at 22:36 +0800, Guan Junxiong wrote:
> Hi Muneendra,
> 
> Thanks for you feedback.  My comments are incline below.
> 
> On 2017/9/18 20:53, Muneendra Kumar M wrote:
> > Hi Guan,
> > This a good effort for detecting the intermittent IO error
> > accounting to improve reliability.
> > Your new algorithm is  mutually exclusive with san_path_err_XXX.
> > It resolved the below issue which you have mentioned .
> > > > Even the san_path_err_threshold , san_path_err_forget_rate and
> > > > san_path_err_recovery_time is turned on,
> > > > the detect sample interval of that path checkers is so
> > > > big/coarse that it doesn't see what happens in the middle of
> > > > the sample interval.
> > 
> > But I have some concerns.
> > 
> > Correct me if my understanding on the below line is correct
> > > > On a particular path when a path failing events occur twice in
> > > > 60 second due to an IO error, multipathd will fail the path and
> > > > enqueue 
> > > > this path into a queue of which each member is sent a couple of
> > > > continuous direct reading asynchronous io at a fixed sample
> > > > rate of 10HZ. 
> > 
> > Once we hit the above condition (2 errors in 60 secs) for a
> > path_io_err_sample_time we keeps on injecting the asynchronous io
> > at a fixed sample rate of 10HZ.
> > And during this path_io_err_sample_time if we hit the the
> > path_io_err_rate_threshold then we will not reinstantate this path
> > for a path_io_err_recovery_time.
> > Is this understanding correct?
> > 
> 
> Partial correct.
> If we hit the above condition (2 errors in 60 secs), we will fail the
> path first before injecting a couple of asynchronous IOs to keep the
> testing not affected by other IOs.
> And after this path_io_err_sample_time :
> (1) if we hit the the path_io_err_rate_threshold, the failed path
> will keep unchanged  and then after the path_io_err_recovery_time
> (which is confusing, sorry, I will rename it to "recheck"), we will
> reschedule this IO error checking process again.
> (2) if we do NOT hit the path_io_err_rate_threshold, the failed path
> will reinstated by path checking thread in a tick (1 second) ASAP.
> 
> 
> > If the above understanding is correct then my concern is :
> > 1) On a particular path if we are seeing continuous errors but not
> > within 60 secs (may be for every 120 secs) of duration how do we
> > handle this. Still this a shaky link.
> > This is what our customers are pointing out.
> > And if i am not wrong the new algorithm will comes into place
> > only  if a path failing events occur twice in 60 seconds.
> > 
> > Then this will not solve the intermittent IO error issue which we
> > are seeing as the data is still going on the shaky path .
> > I think this is the place where we need to pull in  in
> > san_path_err_forget_rate .
> > 
> 
> Yes .  I have thought about using some adjustable parameters such as
> san_path_err_pre_check_time and  san_path_err_threshold to cover ALL
> the scenarios the user encounters.
> In the above fixed example,san_path_err_pre_check_time is set to 60
> seconds, san_path_err_threshold is set 2.
> However, if I adopt this, we have 5 parameters
> (san_path_err_pre_check_time and  san_path_err_threshold + 3
> path_io_err_XXXs ) to support this feature. You know, mulitpath.conf 
> configuration is becoming more and more daunting as Martin pointed in
> the V1 of this patch.
> 
> But now, maybe it is acceptable for users to use the 5 parameters if
> we set san_path_err_pre_check_time and  san_path_err_threshold to
> some default values such as 60 second and 2 respectively.
> **Martin** , **Muneendra**, how about this a little compromising
> method?  If it is OK , I will update in next version of patch.

Hm, that sounds a lot like san_path_err_threshold and
san_path_err_forget_rate, which you were about to remove.

Maybe we can simplify the algorithm by checking paths which fail in a
given time interval after they've been reinstated? That would be one
less additional parameter.

The big question is: how do administrators derive appropriate values
for these parameters for their environment? IIUC the values don't
depend on the storage array, but rather on the environment as a whole;
all kinds of things like switches, cabling, or even network load can
affect the behavior, so multipathd's hwtable will not help us provide
good defaults. Yet we have to assume that a very high percentage of
installations will just use default or vendor-recommended values. Even
if the documentation of the algorithm and its parameters was perfect
(which it currently isn't), most admins won't have a clue how to set
them. AFAICS we don't even have a test procedure to derive the optimal
settings experimentally, thus guesswork is going to be applied, with
questionable odds for success.

IOW: the whole stuff is basically useless without good default values.
It would be up to you hardware guys to come up with them.

> san_path_err_forget_rate is hard to understand, shall we use
> san_path_err_pre_check_time instead?

A 'rate' would be something which is measured in Hz, which is not the
case here. Calling it a 'time' is more accurate. If we go with my
proposal above, we might call it "san_path_double_fault_time".

Regards
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-18 19:51       ` Martin Wilck
@ 2017-09-19  1:32         ` Guan Junxiong
  2017-09-19 10:59           ` Muneendra Kumar M
  0 siblings, 1 reply; 17+ messages in thread
From: Guan Junxiong @ 2017-09-19  1:32 UTC (permalink / raw)
  To: Martin Wilck, Muneendra Kumar M, dm-devel, christophe.varoqui
  Cc: chengjike.cheng, niuhaoxin, shenhong09



On 2017/9/19 3:51, Martin Wilck wrote:
> On Mon, 2017-09-18 at 22:36 +0800, Guan Junxiong wrote:
>> Hi Muneendra,
>>
>> Thanks for you feedback.  My comments are incline below.
>>
>> On 2017/9/18 20:53, Muneendra Kumar M wrote:
>>> Hi Guan,
>>> This a good effort for detecting the intermittent IO error
>>> accounting to improve reliability.
>>> Your new algorithm is  mutually exclusive with san_path_err_XXX.
>>> It resolved the below issue which you have mentioned .
>>>>> Even the san_path_err_threshold , san_path_err_forget_rate and
>>>>> san_path_err_recovery_time is turned on,
>>>>> the detect sample interval of that path checkers is so
>>>>> big/coarse that it doesn't see what happens in the middle of
>>>>> the sample interval.
>>>
>>> But I have some concerns.
>>>
>>> Correct me if my understanding on the below line is correct
>>>>> On a particular path when a path failing events occur twice in
>>>>> 60 second due to an IO error, multipathd will fail the path and
>>>>> enqueue 
>>>>> this path into a queue of which each member is sent a couple of
>>>>> continuous direct reading asynchronous io at a fixed sample
>>>>> rate of 10HZ. 
>>>
>>> Once we hit the above condition (2 errors in 60 secs) for a
>>> path_io_err_sample_time we keeps on injecting the asynchronous io
>>> at a fixed sample rate of 10HZ.
>>> And during this path_io_err_sample_time if we hit the the
>>> path_io_err_rate_threshold then we will not reinstantate this path
>>> for a path_io_err_recovery_time.
>>> Is this understanding correct?
>>>
>>
>> Partial correct.
>> If we hit the above condition (2 errors in 60 secs), we will fail the
>> path first before injecting a couple of asynchronous IOs to keep the
>> testing not affected by other IOs.
>> And after this path_io_err_sample_time :
>> (1) if we hit the the path_io_err_rate_threshold, the failed path
>> will keep unchanged  and then after the path_io_err_recovery_time
>> (which is confusing, sorry, I will rename it to "recheck"), we will
>> reschedule this IO error checking process again.
>> (2) if we do NOT hit the path_io_err_rate_threshold, the failed path
>> will reinstated by path checking thread in a tick (1 second) ASAP.
>>
>>
>>> If the above understanding is correct then my concern is :
>>> 1) On a particular path if we are seeing continuous errors but not
>>> within 60 secs (may be for every 120 secs) of duration how do we
>>> handle this. Still this a shaky link.
>>> This is what our customers are pointing out.
>>> And if i am not wrong the new algorithm will comes into place
>>> only  if a path failing events occur twice in 60 seconds.
>>>
>>> Then this will not solve the intermittent IO error issue which we
>>> are seeing as the data is still going on the shaky path .
>>> I think this is the place where we need to pull in  in
>>> san_path_err_forget_rate .
>>>
>>
>> Yes .  I have thought about using some adjustable parameters such as
>> san_path_err_pre_check_time and  san_path_err_threshold to cover ALL
>> the scenarios the user encounters.
>> In the above fixed example,san_path_err_pre_check_time is set to 60
>> seconds, san_path_err_threshold is set 2.
>> However, if I adopt this, we have 5 parameters
>> (san_path_err_pre_check_time and  san_path_err_threshold + 3
>> path_io_err_XXXs ) to support this feature. You know, mulitpath.conf 
>> configuration is becoming more and more daunting as Martin pointed in
>> the V1 of this patch.
>>
>> But now, maybe it is acceptable for users to use the 5 parameters if
>> we set san_path_err_pre_check_time and  san_path_err_threshold to
>> some default values such as 60 second and 2 respectively.
>> **Martin** , **Muneendra**, how about this a little compromising
>> method?  If it is OK , I will update in next version of patch.
> 
> Hm, that sounds a lot like san_path_err_threshold and
> san_path_err_forget_rate, which you were about to remove.
> 
> Maybe we can simplify the algorithm by checking paths which fail in a
> given time interval after they've been reinstated? That would be one
> less additional parameter.
> 

"san_path_double_fault_time" is great.  One less additional parameter while
still covering most scenarios is appreciated.


> The big question is: how do administrators derive appropriate values
> for these parameters for their environment? IIUC the values don't
> depend on the storage array, but rather on the environment as a whole;
> all kinds of things like switches, cabling, or even network load can
> affect the behavior, so multipathd's hwtable will not help us provide
> good defaults. Yet we have to assume that a very high percentage of
> installations will just use default or vendor-recommended values. Even
> if the documentation of the algorithm and its parameters was perfect
> (which it currently isn't), most admins won't have a clue how to set
> them. AFAICS we don't even have a test procedure to derive the optimal
> settings experimentally, thus guesswork is going to be applied, with
> questionable odds for success.
> 
> IOW: the whole stuff is basically useless without good default values.
> It would be up to you hardware guys to come up with them.
> 

I agree.  So let users come up with those values. What we can do is
log the test results, such as path_io_err_rate, over the given sample
time.

>> san_path_err_forget_rate is hard to understand; shall we use
>> san_path_err_pre_check_time instead?
> 
> A 'rate' would be something which is measured in Hz, which is not the
> case here. Calling it a 'time' is more accurate. If we go with my
> proposal above, we might call it "san_path_double_fault_time".
> 
> Regards
> Martin
> 

Regards
Guan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-19  1:32         ` Guan Junxiong
@ 2017-09-19 10:59           ` Muneendra Kumar M
  2017-09-19 12:53             ` Guan Junxiong
  0 siblings, 1 reply; 17+ messages in thread
From: Muneendra Kumar M @ 2017-09-19 10:59 UTC (permalink / raw)
  To: Guan Junxiong, Martin Wilck, dm-devel, christophe.varoqui
  Cc: chengjike.cheng, niuhaoxin, shenhong09

Hi Guan/Martin,
Below are my points.

>>> "san_path_double_fault_time" is great. One less additional parameter while still covering most scenarios is appreciated.

This looks good and I completely agree with Guan.

One question: is san_path_double_fault_time the time between two failed states (failed-active-failed)?
If so, this holds good.

Instead of san_path_double_fault_time, can we call it san_path_double_failed_time, since that name suggests the time between two failed states? Is this OK?

In SAN topologies (FC, NVMe, SCSI), transient intermittent network errors turn ITL paths into marginal paths.

So instead of calling them "path_io_err_sample_time", "path_io_err_rate_threshold" and "path_io_err_recovery_time",
can we name them "marginal_path_err_detection_time", "marginal_path_err_rate_threshold" and "marginal_path_err_recovery_time"?

Other names would also be fine, as "io_path" is too general a term in my view.
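As an illustration only, the renamed parameters would make a multipath.conf stanza look along these lines (the values are placeholders chosen to show the naming scheme, not recommended defaults):

```
defaults {
	san_path_double_failed_time        60    # window for two failures (s)
	marginal_path_err_detection_time   120   # duration of 10 Hz test reads (s)
	marginal_path_err_rate_threshold   10    # failed test IOs per thousand
	marginal_path_err_recovery_time    300   # wait before rechecking (s)
}
```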

If we agree on this, there is one more thing I would like to add as part of this patch.

Whenever a path is within XXX_io_error_recovery_time and the user runs the multipath -ll command, the state of the path is shown as failed, as below.

	| `- 6:0:0:0 sdb 8:16  failed ready  running

Can we add a new state, "marginal", so that when the admin runs the multipath command and sees that the path is marginal, he can quickly tell that this is a marginal path that needs to be recovered? If we keep the state as failed, the admin cannot tell how long the device has been in the failed state.


Regards,
Muneendra.


-----Original Message-----
From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
Sent: Tuesday, September 19, 2017 7:03 AM
To: Martin Wilck <mwilck@suse.com>; Muneendra Kumar M <mmandala@Brocade.com>; dm-devel@redhat.com; christophe.varoqui@opensvc.com
Cc: shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com
Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability



On 2017/9/19 3:51, Martin Wilck wrote:
> On Mon, 2017-09-18 at 22:36 +0800, Guan Junxiong wrote:
>> Hi Muneendra,
>>
>> Thanks for your feedback.  My comments are inline below.
>>
>> On 2017/9/18 20:53, Muneendra Kumar M wrote:
>>> Hi Guan,
>>> This is a good effort at intermittent IO error accounting
>>> to improve reliability.
>>> Your new algorithm is mutually exclusive with san_path_err_XXX.
>>> It resolves the issue you mentioned below:
>>>>> Even when san_path_err_threshold, san_path_err_forget_rate and
>>>>> san_path_err_recovery_time are turned on, the detection sample
>>>>> interval of the path checkers is so big/coarse that it doesn't
>>>>> see what happens in the middle of the sample interval.
>>>
>>> But I have some concerns.
>>>
>>> Correct me if my understanding of the below lines is correct:
>>>>> On a particular path when a path failing events occur twice in
>>>>> 60 second due to an IO error, multipathd will fail the path and 
>>>>> enqueue this path into a queue of which each member is sent a 
>>>>> couple of continuous direct reading asynchronous io at a fixed 
>>>>> sample rate of 10HZ.
>>>
>>> Once we hit the above condition (2 errors in 60 secs), for a
>>> path_io_err_sample_time we keep injecting asynchronous IO at
>>> a fixed sample rate of 10 Hz.
>>> And during this path_io_err_sample_time, if we hit the
>>> path_io_err_rate_threshold, then we will not reinstate this path
>>> for a path_io_err_recovery_time.
>>> Is this understanding correct?
>>>
>>
>> Partially correct.
>> If we hit the above condition (2 errors in 60 secs), we will fail the
>> path first before injecting a couple of asynchronous IOs, so that the
>> testing is not affected by other IOs.
>> And after this path_io_err_sample_time:
>> (1) if we hit the path_io_err_rate_threshold, the failed path
>> will remain unchanged, and then after the path_io_err_recovery_time
>> (which is confusing, sorry, I will rename it to "recheck"), we will
>> reschedule this IO error checking process again.
>> (2) if we do NOT hit the path_io_err_rate_threshold, the failed path
>> will be reinstated by the path checking thread in a tick (1 second) ASAP.
>>
>>
>>> If the above understanding is correct, then my concern is:
>>> 1) On a particular path, if we are seeing continuous errors but not
>>> within 60 secs (maybe every 120 secs), how do we
>>> handle this? This is still a shaky link.
>>> This is what our customers are pointing out.
>>> And if I am not wrong, the new algorithm will come into play only
>>> if path-failing events occur twice in 60 seconds.
>>>
>>> Then this will not solve the intermittent IO error issue which we
>>> are seeing, as the data is still going over the shaky path.
>>> I think this is the place where we need to pull in
>>> san_path_err_forget_rate.
>>>
>>
>> Yes. I have thought about using some adjustable parameters such as
>> san_path_err_pre_check_time and san_path_err_threshold to cover ALL
>> the scenarios the user encounters.
>> In the above fixed example, san_path_err_pre_check_time is set to 60
>> seconds and san_path_err_threshold is set to 2.
>> However, if I adopt this, we have 5 parameters
>> (san_path_err_pre_check_time and san_path_err_threshold + the 3
>> path_io_err_XXXs) to support this feature. You know, the multipath.conf
>> configuration is becoming more and more daunting, as Martin pointed
>> out in V1 of this patch.
>>
>> But now, maybe it is acceptable for users to use the 5 parameters if
>> we set san_path_err_pre_check_time and san_path_err_threshold to
>> default values such as 60 seconds and 2 respectively.
>> **Martin**, **Muneendra**, how about this slightly compromising
>> method? If it is OK, I will update the next version of the patch.
> 
> Hm, that sounds a lot like san_path_err_threshold and 
> san_path_err_forget_rate, which you were about to remove.
> 
> Maybe we can simplify the algorithm by checking paths which fail in a 
> given time interval after they've been reinstated? That would be one 
> less additional parameter.
> 

"san_path_double_fault_time" is great. One less additional parameter while still covering most scenarios is appreciated.


> The big question is: how do administrators derive appropriate values 
> for these parameters for their environment? IIUC the values don't 
> depend on the storage array, but rather on the environment as a whole; 
> all kinds of things like switches, cabling, or even network load can 
> affect the behavior, so multipathd's hwtable will not help us provide 
> good defaults. Yet we have to assume that a very high percentage of 
> installations will just use default or vendor-recommended values. Even 
> if the documentation of the algorithm and its parameters was perfect 
> (which it currently isn't), most admins won't have a clue how to set 
> them. AFAICS we don't even have a test procedure to derive the optimal 
> settings experimentally, thus guesswork is going to be applied, with 
> questionable odds for success.
> 
> IOW: the whole stuff is basically useless without good default values.
> It would be up to you hardware guys to come up with them.
> 

I agree.  So let users come up with those values. What we can do is log the test results, such as path_io_err_rate, over the given sample time.

>> san_path_err_forget_rate is hard to understand; shall we use
>> san_path_err_pre_check_time instead?
> 
> A 'rate' would be something which is measured in Hz, which is not the 
> case here. Calling it a 'time' is more accurate. If we go with my 
> proposal above, we might call it "san_path_double_fault_time".
> 
> Regards
> Martin
> 

Regards
Guan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-19 10:59           ` Muneendra Kumar M
@ 2017-09-19 12:53             ` Guan Junxiong
  2017-09-20 12:58               ` Muneendra Kumar M
  0 siblings, 1 reply; 17+ messages in thread
From: Guan Junxiong @ 2017-09-19 12:53 UTC (permalink / raw)
  To: Muneendra Kumar M, Martin Wilck, dm-devel, christophe.varoqui
  Cc: chengjike.cheng, niuhaoxin, shenhong09

Hi Muneendra,
Thanks for your suggestion. My comments are inline.

On 2017/9/19 18:59, Muneendra Kumar M wrote:
> Hi Guan/Martin,
> Below are my points.
> 
>>>> "san_path_double_fault_time" is great. One less additional parameter while still covering most scenarios is appreciated.
> This looks good and I completely agree with Guan.
> 
> One question: is san_path_double_fault_time the time between two failed states (failed-active-failed)?
> If so, this holds good.
> 
> Instead of san_path_double_fault_time, can we call it san_path_double_failed_time, since that name suggests the time between two failed states? Is this OK?
> 

Both names are fine for me.

> In SAN topologies (FC, NVMe, SCSI), transient intermittent network errors turn ITL paths into marginal paths.
>
>
> So instead of calling them "path_io_err_sample_time", "path_io_err_rate_threshold" and "path_io_err_recovery_time",
> can we name them "marginal_path_err_detection_time", "marginal_path_err_rate_threshold" and "marginal_path_err_recovery_time"?
> 
> Other names would also be fine, as "io_path" is too general a term in my view.
>

Can you explain "marginal paths" in detail? Can users easily grasp the meaning of marginal paths?
IMO, the path_io_err_XXX names are easy to understand.

> If we agree on this, there is one more thing I would like to add as part of this patch.
> 
> Whenever a path is within XXX_io_error_recovery_time and the user runs the multipath -ll command, the state of the path is shown as failed, as below.
> 
> 	| `- 6:0:0:0 sdb 8:16  failed ready  running
> 
> Can we add a new state, "marginal", so that when the admin runs the multipath command and sees that the path is marginal, he can quickly tell that this is a marginal path that needs to be recovered? If we keep the state as failed, the admin cannot tell how long the device has been in the failed state.
> 
>

If the user uses multipathd -k and then inputs "show paths", multipathd will show the path in the "delayed" state.
But we can't tell the exact reason for the delayed state, because other features such as path waiting use it as well.
Shall we use the existing PATH_SHAKY?

Regards.
Guan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-19 12:53             ` Guan Junxiong
@ 2017-09-20 12:58               ` Muneendra Kumar M
  2017-09-21 10:04                 ` Guan Junxiong
  0 siblings, 1 reply; 17+ messages in thread
From: Muneendra Kumar M @ 2017-09-20 12:58 UTC (permalink / raw)
  To: Guan Junxiong, Martin Wilck, dm-devel, christophe.varoqui
  Cc: chengjike.cheng, niuhaoxin, shenhong09

Hi Guan,
>>> Shall we use the existing PATH_SHAKY?
As PATH_SHAKY indicates a path that is not available for "normal" operations, we can use this state. That's a good idea.

Regarding marginal paths, below is my explanation. Brocade is publishing a couple of white papers on the same topic to educate SAN administrators and the SAN community.

Marginal path:

A host, target, LUN (ITL path) flow goes through a SAN. Note that each I/O request that reaches the SCSI layer becomes a single SCSI exchange. In a single SAN, there are typically multiple SAN network paths for an ITL flow/path. Each SCSI exchange can take any of the network paths available for the ITL path. A SAN can be based on Ethernet, FC, or InfiniBand physical networks to carry block storage traffic (SCSI, NVMe, etc.).

There are typically two types of SAN network problems that are categorized as marginal issues. These issues are not permanent by nature; they come and go over time.
1) Switches in the SAN can have intermittent frame drops or intermittent frame corruption due to a bad optic cable (SFP) or similar worn port issues. This causes ITL flows that go through the faulty switch/port to intermittently experience frame drops.
2) There exist SAN topologies where switch ports in the fabric become the only conduit for many different ITL flows across multiple hosts. These single network paths are essentially shared across multiple ITL flows. Under these conditions, if the port link bandwidth cannot handle the net sum of the shared ITL flow bandwidth going through the single path, we can see intermittent network congestion problems. This condition is called network oversubscription. The intermittent congestion can delay SCSI exchange completion time (an increase in I/O latency is observed).

To overcome the above network issues and many more such target issues, there are frame-level retries done in HBA device firmware and I/O retries in the SCSI layer. These retries might succeed for one of three reasons:
1) The intermittent switch/port issue is not observed
2) The retry I/O is a new SCSI exchange. This SCSI exchange can take an alternate SAN path for the ITL flow, if such a SAN path exists.
3) Network congestion disappears momentarily because the net I/O bandwidth coming from multiple ITL flows on the single shared network path is something the path can handle

However, in some cases we have seen that I/O retries don't succeed because the retry I/Os hit a SAN network path that has an intermittent switch/port issue and/or network congestion.

On the host we thus see configurations with two or more ITL paths sharing the same target/LUN through two or more HBA ports. These HBA ports are connected through two or more SANs to the same target/LUN.
If an I/O fails at the multipath layer, the ITL path is put into the Failed state. Because of the marginal nature of the network, the next health-check command sent from the multipath layer might succeed, which puts the ITL path back into the Active state. You end up seeing the DM path state cycling through Active, Failed, Active transitions. This results in an overall reduction in application I/O throughput and sometimes in application I/O failures (because of timing constraints). All of this can happen because of I/O retries and I/O requests moving across multiple paths of the DM device. Note that on the host, I/O retries on a single path and I/O movement across multiple paths slow down the forward progress of new application I/O, because the re-queue actions are given higher priority than the newer I/O requests coming from the application.

The above condition of the ITL path is hence called "marginal".

What we desire is for DM to deterministically categorize an ITL path as "marginal" and move all pending I/Os from the marginal path to an active path. This will help in meeting application I/O timing constraints. We also want the capability to automatically reinstate the marginal path to Active once the marginal condition in the network is fixed.

Based on the above explanation, I want to rename the parameters to marginal_path_XXXX, irrespective of the storage network type.

Regards,
Muneendra.



-----Original Message-----
From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
Sent: Tuesday, September 19, 2017 6:23 PM
To: Muneendra Kumar M <mmandala@Brocade.com>; Martin Wilck <mwilck@suse.com>; dm-devel@redhat.com; christophe.varoqui@opensvc.com
Cc: shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com
Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability

Hi Muneendra ,
Thanks for your suggestion. My comments inline.

On 2017/9/19 18:59, Muneendra Kumar M wrote:
> Hi Guan/Martin,
> Below are my points.
> 
>>>> "san_path_double_fault_time" is great. One less additional parameter while still covering most scenarios is appreciated.
> This looks good and I completely agree with Guan.
> 
> One question: is san_path_double_fault_time the time between two failed states (failed-active-failed)?
> If so, this holds good.
> 
> Instead of san_path_double_fault_time, can we call it san_path_double_failed_time, since that name suggests the time between two failed states? Is this OK?
> 

Both names are fine for me.

> In SAN topologies (FC, NVMe, SCSI), transient intermittent network errors turn ITL paths into marginal paths.
>
>
> So instead of calling them "path_io_err_sample_time", "path_io_err_rate_threshold" and "path_io_err_recovery_time",
> can we name them "marginal_path_err_detection_time", "marginal_path_err_rate_threshold" and "marginal_path_err_recovery_time"?
> 
> Other names would also be fine, as "io_path" is too general a term in my view.
> >

Can you explain "marginal paths" in detail? Can users easily grasp the meaning of marginal paths?
IMO, the path_io_err_XXX names are easy to understand.

> If we agree on this, there is one more thing I would like to add as part of this patch.
> 
> Whenever a path is within XXX_io_error_recovery_time and the user runs the multipath -ll command, the state of the path is shown as failed, as below.
> 
> 	| `- 6:0:0:0 sdb 8:16  failed ready  running
> 
> Can we add a new state, "marginal", so that when the admin runs the
> multipath command and sees that the path is marginal, he can quickly tell that this is a marginal path that needs to be recovered? If we keep the state as failed, the admin cannot tell how long the device has been in the failed state.
> 
>

If the user uses multipathd -k and then inputs "show paths", multipathd will show the path in the "delayed" state.
But we can't tell the exact reason for the delayed state, because other features such as path waiting use it as well.
Shall we use the existing PATH_SHAKY?

Regards.
Guan



--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-20 12:58               ` Muneendra Kumar M
@ 2017-09-21 10:04                 ` Guan Junxiong
  2017-09-21 10:10                   ` Muneendra Kumar M
       [not found]                   ` <615cdd5a955944e49986dca01bf406a5@BRMWP-EXMB12.corp.brocade.com>
  0 siblings, 2 replies; 17+ messages in thread
From: Guan Junxiong @ 2017-09-21 10:04 UTC (permalink / raw)
  To: Muneendra Kumar M, Martin Wilck, dm-devel, christophe.varoqui
  Cc: chengjike.cheng, niuhaoxin, shenhong09

Hi, Muneendra

  Thanks for your clarification. I have adopted this renaming. If it is convenient for you, please review the V5 patch that I sent out 2 hours ago.

Regards,
Guan

On 2017/9/20 20:58, Muneendra Kumar M wrote:
> Hi Guan,
>>>> Shall we use existing PATH_SHAKY ?
> As PATH_SHAKY indicates a path that is not available for "normal" operations, we can use this state. That's a good idea.
> 
> Regarding marginal paths, below is my explanation. Brocade is publishing a couple of white papers on the same topic to educate SAN administrators and the SAN community.
> 
> Marginal path:
> 
> A host, target, LUN (ITL path) flow goes through a SAN. Note that each I/O request that reaches the SCSI layer becomes a single SCSI exchange. In a single SAN, there are typically multiple SAN network paths for an ITL flow/path. Each SCSI exchange can take any of the network paths available for the ITL path. A SAN can be based on Ethernet, FC, or InfiniBand physical networks to carry block storage traffic (SCSI, NVMe, etc.).
> 
> There are typically two types of SAN network problems that are categorized as marginal issues. These issues are not permanent by nature; they come and go over time.
> 1) Switches in the SAN can have intermittent frame drops or intermittent frame corruption due to a bad optic cable (SFP) or similar worn port issues. This causes ITL flows that go through the faulty switch/port to intermittently experience frame drops.
> 2) There exist SAN topologies where switch ports in the fabric become the only conduit for many different ITL flows across multiple hosts. These single network paths are essentially shared across multiple ITL flows. Under these conditions, if the port link bandwidth cannot handle the net sum of the shared ITL flow bandwidth going through the single path, we can see intermittent network congestion problems. This condition is called network oversubscription. The intermittent congestion can delay SCSI exchange completion time (an increase in I/O latency is observed).
> 
> To overcome the above network issues and many more such target issues, there are frame-level retries done in HBA device firmware and I/O retries in the SCSI layer. These retries might succeed for one of three reasons:
> 1) The intermittent switch/port issue is not observed
> 2) The retry I/O is a new SCSI exchange. This SCSI exchange can take an alternate SAN path for the ITL flow, if such a SAN path exists.
> 3) Network congestion disappears momentarily because the net I/O bandwidth coming from multiple ITL flows on the single shared network path is something the path can handle
> 
> However, in some cases we have seen that I/O retries don't succeed because the retry I/Os hit a SAN network path that has an intermittent switch/port issue and/or network congestion.
> 
> On the host we thus see configurations with two or more ITL paths sharing the same target/LUN through two or more HBA ports. These HBA ports are connected through two or more SANs to the same target/LUN.
> If an I/O fails at the multipath layer, the ITL path is put into the Failed state. Because of the marginal nature of the network, the next health-check command sent from the multipath layer might succeed, which puts the ITL path back into the Active state. You end up seeing the DM path state cycling through Active, Failed, Active transitions. This results in an overall reduction in application I/O throughput and sometimes in application I/O failures (because of timing constraints). All of this can happen because of I/O retries and I/O requests moving across multiple paths of the DM device. Note that on the host, I/O retries on a single path and I/O movement across multiple paths slow down the forward progress of new application I/O, because the re-queue actions are given higher priority than the newer I/O requests coming from the application.
> 
> The above condition of the ITL path is hence called "marginal".
> 
> What we desire is for DM to deterministically categorize an ITL path as "marginal" and move all pending I/Os from the marginal path to an active path. This will help in meeting application I/O timing constraints. We also want the capability to automatically reinstate the marginal path to Active once the marginal condition in the network is fixed.
> 
> 
> Based on the above explanation, I want to rename the parameters to marginal_path_XXXX, irrespective of the storage network type.
> 
> Regards,
> Muneendra.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-09-21 10:04                 ` Guan Junxiong
@ 2017-09-21 10:10                   ` Muneendra Kumar M
       [not found]                   ` <615cdd5a955944e49986dca01bf406a5@BRMWP-EXMB12.corp.brocade.com>
  1 sibling, 0 replies; 17+ messages in thread
From: Muneendra Kumar M @ 2017-09-21 10:10 UTC (permalink / raw)
  To: Guan Junxiong, Martin Wilck, dm-devel, christophe.varoqui
  Cc: chengjike.cheng, niuhaoxin, shenhong09

Hi Guan,
Thanks for adopting the naming convention.
Instead of marginal_path_err_recheck_gap_time, marginal_path_recovery_time looks more reasonable. Could you please take another look?

I will review the code within a day.

Regards,
Muneendra.

-----Original Message-----
From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
Sent: Thursday, September 21, 2017 3:35 PM
To: Muneendra Kumar M <mmandala@Brocade.com>; Martin Wilck <mwilck@suse.com>; dm-devel@redhat.com; christophe.varoqui@opensvc.com
Cc: shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com
Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability

Hi, Muneendra

  Thanks for your clarification. I have adopted this renaming. If it is convenient for you, please review the V5 patch that I sent out 2 hours ago.

Regards,
Guan

On 2017/9/20 20:58, Muneendra Kumar M wrote:
> Hi Guan,
>>>> Shall we use existing PATH_SHAKY ?
> As PATH_SHAKY indicates a path that is not available for "normal" operations, we can use this state. That's a good idea.
> 
> Regarding marginal paths, below is my explanation. Brocade is publishing a couple of white papers on the same topic to educate SAN administrators and the SAN community.
> 
> Marginal path:
> 
> A host, target, LUN (ITL path) flow goes through a SAN. Note that each I/O request that reaches the SCSI layer becomes a single SCSI exchange. In a single SAN, there are typically multiple SAN network paths for an ITL flow/path. Each SCSI exchange can take any of the network paths available for the ITL path. A SAN can be based on Ethernet, FC, or InfiniBand physical networks to carry block storage traffic (SCSI, NVMe, etc.).
> 
> There are typically two types of SAN network problems that are categorized as marginal issues. These issues are not permanent by nature; they come and go over time.
> 1) Switches in the SAN can have intermittent frame drops or intermittent frame corruption due to a bad optic cable (SFP) or similar worn port issues. This causes ITL flows that go through the faulty switch/port to intermittently experience frame drops.
> 2) There exist SAN topologies where switch ports in the fabric become the only conduit for many different ITL flows across multiple hosts. These single network paths are essentially shared across multiple ITL flows. Under these conditions, if the port link bandwidth cannot handle the net sum of the shared ITL flow bandwidth going through the single path, we can see intermittent network congestion problems. This condition is called network oversubscription. The intermittent congestion can delay SCSI exchange completion time (an increase in I/O latency is observed).
> 
> To overcome the above network issues and many more such target issues, there are frame-level retries done in HBA device firmware and I/O retries in the SCSI layer. These retries might succeed for one of three reasons:
> 1) The intermittent switch/port issue is not observed
> 2) The retry I/O is a new SCSI exchange. This SCSI exchange can take an alternate SAN path for the ITL flow, if such a SAN path exists.
> 3) Network congestion disappears momentarily because the net I/O bandwidth coming from multiple ITL flows on the single shared network path is something the path can handle
> 
> However, in some cases we have seen that I/O retries don't succeed because the retry I/Os hit a SAN network path that has an intermittent switch/port issue and/or network congestion.
> 
> On the host we thus see configurations with two or more ITL paths sharing the same target/LUN through two or more HBA ports. These HBA ports are connected through two or more SANs to the same target/LUN.
> If an I/O fails at the multipath layer, the ITL path is put into the Failed state. Because of the marginal nature of the network, the next health-check command sent from the multipath layer might succeed, which puts the ITL path back into the Active state. You end up seeing the DM path state cycling through Active, Failed, Active transitions. This results in an overall reduction in application I/O throughput and sometimes in application I/O failures (because of timing constraints). All of this can happen because of I/O retries and I/O requests moving across multiple paths of the DM device. Note that on the host, I/O retries on a single path and I/O movement across multiple paths slow down the forward progress of new application I/O, because the re-queue actions are given higher priority than the newer I/O requests coming from the application.
> 
> The above condition of the ITL path is hence called "marginal".
> 
> What we desire is for DM to deterministically categorize an ITL path as "marginal" and move all pending I/Os from the marginal path to an active path. This will help in meeting application I/O timing constraints. We also want the capability to automatically reinstate the marginal path to Active once the marginal condition in the network is fixed.
> 
> 
> Based on the above explanation, I want to rename the parameters to marginal_path_XXXX, irrespective of the storage network type.
> 
> Regards,
> Muneendra.


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
       [not found]                   ` <615cdd5a955944e49986dca01bf406a5@BRMWP-EXMB12.corp.brocade.com>
@ 2017-10-09  0:42                     ` Guan Junxiong
  2017-10-09 11:39                       ` Muneendra Kumar M
  2017-10-12  6:35                       ` Muneendra Kumar M
  0 siblings, 2 replies; 17+ messages in thread
From: Guan Junxiong @ 2017-10-09  0:42 UTC (permalink / raw)
  To: Muneendra Kumar M; +Cc: dm-devel, niuhaoxin, Shenhong (C), Martin Wilck

Hi Muneendra,
Sorry for the late reply because of the National Holiday.

On 2017/10/6 13:54, Muneendra Kumar M wrote:
> Hi Guan,
> Did you push the patch to mainline.
> If so can you just provide me those details.
> If not can you just let me know the status.
> 

Yes, I pushed version 6 of the patch to the mailing list, but it hasn't been merged yet.
It is still waiting for review.
You can find it at this link:
https://www.redhat.com/archives/dm-devel/2017-September/msg00296.html

> As couple of our clients are already using the previous patch(san_path_XX).
> If your patch is pushed then I can give them the updated patch and test the same.
> 

If the patch is OK for you, may I add your Reviewed-by tag to it?

Regards,
Guan

> Regards,
> Muneendra.
> 
> 
> -----Original Message-----
> From: Muneendra Kumar M 
> Sent: Thursday, September 21, 2017 3:41 PM
> To: 'Guan Junxiong' <guanjunxiong@huawei.com>; Martin Wilck <mwilck@suse.com>; dm-devel@redhat.com; christophe.varoqui@opensvc.com
> Cc: shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com
> Subject: RE: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
> 
> Hi Guan,
> Thanks for adopting the naming convention. 
> Instead of marginal_path_err_recheck_gap_time, marginal_path_recovery_time looks more reasonable. Could you please take another look?
> 
> I will review the code within a day.
> 
> Regards,
> Muneendra.
> 
> -----Original Message-----
> From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
> Sent: Thursday, September 21, 2017 3:35 PM
> To: Muneendra Kumar M <mmandala@Brocade.com>; Martin Wilck <mwilck@suse.com>; dm-devel@redhat.com; christophe.varoqui@opensvc.com
> Cc: shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com
> Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
> 
> Hi, Muneendra
> 
>   Thanks for your clarification. I will adopt this renaming. If it is convenient for you, please review the V5 patch that I sent out 2 hours ago.
> 
> Regards,
> Guan
> 
> On 2017/9/20 20:58, Muneendra Kumar M wrote:
>> Hi Guan,
>>>>> Shall we use existing PATH_SHAKY ?
>> As PATH_SHAKY indicates a path not available for "normal" operations, we can use this state. That's a good idea.
>>
>> Regarding marginal paths, below is my explanation. Brocade is publishing a couple of white papers on the same topic to educate SAN administrators and the SAN community.
>>
>> Marginal path:
>>
>> A host, target, LUN (ITL path) flow goes through a SAN. Note that each I/O request that reaches the SCSI layer is transformed into a single SCSI exchange. In a single SAN there are typically multiple SAN network paths for an ITL flow/path, and each SCSI exchange can take any one of the network paths available for the ITL path. A SAN can be based on Ethernet, FC, or InfiniBand physical networks to carry block storage traffic (SCSI, NVMe, etc.).
>>
>> There are typically two types of SAN network problems that are categorized as marginal issues. By nature these issues are not permanent; they come and go over time.
>> 1) Switches in the SAN can have intermittent frame drops or frame corruption due to a bad optical cable (SFP) or similar wear-and-tear port issues. This causes ITL flows that go through the faulty switch/port to intermittently experience frame drops.
>> 2) There exist SAN topologies where a switch port in the fabric becomes the only conduit for many different ITL flows across multiple hosts; such a single network path is essentially shared across multiple ITL flows. Under these conditions, if the port link bandwidth cannot handle the net sum of the shared ITL flows' bandwidth going through the single path, we can see intermittent network congestion problems. This condition is called network oversubscription. The intermittent congestion can delay SCSI exchange completion time (an increase in I/O latency is observed).
>>
>> To overcome the above network issues and many similar target-side issues, there are frame-level retries done in HBA device firmware and I/O retries in the SCSI layer. These retries might succeed for three reasons:
>> 1) The intermittent switch/port issue is not observed.
>> 2) The retry I/O is a new SCSI exchange, which can take an alternate SAN path for the ITL flow, if such a path exists.
>> 3) The network congestion disappears momentarily because the net I/O bandwidth coming from multiple ITL flows on the single shared network path is something the path can handle.
>>
>> However, in some cases we have seen I/O retries fail because the retry I/Os hit a SAN network path that has an intermittent switch/port issue and/or network congestion.
>>
>> On the host we thus see configurations with two or more ITL paths sharing the same target/LUN through two or more HBA ports, with the HBA ports connected through two or more SANs to the same target/LUN.
>> If the I/O fails at the multipath layer, the ITL path is moved into the Failed state. Because of the marginal nature of the network, the next health-check command sent from the multipath layer might succeed, moving the ITL path back into the Active state. You end up seeing the DM path state cycling through Active, Failed, Active transitions. This results in an overall reduction in application I/O throughput and sometimes in application I/O failures (because of timing constraints). All of this can happen because of I/O retries and I/O requests moving across multiple paths of the DM device. Note that on the host, both I/O retries on a single path and I/O movement across multiple paths slow the forward progress of new application I/O, because these re-queue actions are given higher priority than newer I/O requests coming from the application.
>>
>> The above condition of the ITL path is hence called “marginal”.
>>
>> What we desire is for DM to deterministically categorize an ITL path as “marginal” and move all pending I/Os from the marginal path to an active path. This will help in meeting application I/O timing constraints. We also desire the capability to automatically reinstate the marginal path as Active once the marginal condition in the network is fixed.
>>
>>
>> Based on the above explanation, I want to rename the options to marginal_path_XXXX; this applies irrespective of the storage network type.
>>
>> Regards,
>> Muneendra.
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-10-09  0:42                     ` Guan Junxiong
@ 2017-10-09 11:39                       ` Muneendra Kumar M
  2017-10-12  6:35                       ` Muneendra Kumar M
  1 sibling, 0 replies; 17+ messages in thread
From: Muneendra Kumar M @ 2017-10-09 11:39 UTC (permalink / raw)
  To: Guan Junxiong; +Cc: dm-devel, niuhaoxin, Shenhong (C), Martin Wilck

Hi Guan,
Thanks for the info.
The changes look fine.
Instead of marginal_path_err_recheck_gap_time, marginal_path_recovery_time looks more reasonable to me.
This is just my input.

Regards,
Muneendra.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-10-09  0:42                     ` Guan Junxiong
  2017-10-09 11:39                       ` Muneendra Kumar M
@ 2017-10-12  6:35                       ` Muneendra Kumar M
  2017-10-12  6:46                         ` Guan Junxiong
  1 sibling, 1 reply; 17+ messages in thread
From: Muneendra Kumar M @ 2017-10-12  6:35 UTC (permalink / raw)
  To: Guan Junxiong; +Cc: dm-devel, niuhaoxin, Shenhong (C), Martin Wilck

Hi Guan,
>>If the patch is OK for you, can I add your Reviewed-by tag into this patch?
	The patch is ok for me.

Regards,
Muneendra.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-10-12  6:35                       ` Muneendra Kumar M
@ 2017-10-12  6:46                         ` Guan Junxiong
  2017-10-12  6:59                           ` Muneendra Kumar M
  0 siblings, 1 reply; 17+ messages in thread
From: Guan Junxiong @ 2017-10-12  6:46 UTC (permalink / raw)
  To: Muneendra Kumar M; +Cc: dm-devel, niuhaoxin, Shenhong (C), Martin Wilck

Hi Muneendra,

On 2017/10/12 14:35, Muneendra Kumar M wrote:
> Hi Guan,
>>> If the patch is OK for you, can I add your Reviewed-by tag into this patch?
> 	The patch is ok for me.
> 
> 
OK, I will add your Reviewed-by tag in version 7 ASAP.
BTW, have your clients given any feedback to improve the feature?

Regards,
Guan



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
  2017-10-12  6:46                         ` Guan Junxiong
@ 2017-10-12  6:59                           ` Muneendra Kumar M
  0 siblings, 0 replies; 17+ messages in thread
From: Muneendra Kumar M @ 2017-10-12  6:59 UTC (permalink / raw)
  To: Guan Junxiong; +Cc: dm-devel, niuhaoxin, Shenhong (C), Martin Wilck

Hi Guan,
I am waiting for feedback from the clients.

Regards,
Muneendra.

-----Original Message-----
From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
Sent: Thursday, October 12, 2017 12:16 PM
To: Muneendra Kumar M <mmandala@Brocade.com>
Cc: Shenhong (C) <shenhong09@huawei.com>; niuhaoxin <niuhaoxin@huawei.com>; Martin Wilck <mwilck@suse.com>; Christophe Varoqui <christophe.varoqui@opensvc.com>; dm-devel@redhat.com
Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability

Hi Muneendra,

On 2017/10/12 14:35, Muneendra Kumar M wrote:
> Hi Guan,
>>> If the patch if OK for you, can I add your Reviewed-by tag into this patch?
> 	The patch is ok for me.
> 
> 
> The patch is ok for me.
> 
OK, I will add your Reviewed-by tag in the version 7 ASAP.
BTW, do your clients give any feedback to improve the feature?

Regards
Guan


> -----Original Message-----
> From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
> Sent: Monday, October 09, 2017 6:13 AM
> To: Muneendra Kumar M <mmandala@Brocade.com>
> Cc: Shenhong (C) <shenhong09@huawei.com>; niuhaoxin <niuhaoxin@huawei.com>; Martin Wilck <mwilck@suse.com>; Christophe Varoqui <christophe.varoqui@opensvc.com>; dm-devel@redhat.com
> Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
> 
> Hi Muneendra,
> Sorry for the late reply because of the National Holiday.
> 
> On 2017/10/6 13:54, Muneendra Kumar M wrote:
>> Hi Guan,
>> Did you push the patch to mainline?
>> If so, can you provide me the details?
>> If not, can you let me know the status?
>>
> 
> Yes, I pushed Version 6 of the patch to the mailing list, but it hasn't been merged yet.
> It is still waiting for review.
> You can find it at this link:
> https://www.redhat.com/archives/dm-devel/2017-September/msg00296.html
> 
>> A couple of our clients are already using the previous patch (san_path_XX).
>> Once your patch is pushed, I can give them the updated patch and they can test it.
>>
> 
> If the patch is OK for you, can I add your Reviewed-by tag into this patch?
> 
> Regards,
> Guan
> 
>> Regards,
>> Muneendra.
>>
>>
>> -----Original Message-----
>> From: Muneendra Kumar M 
>> Sent: Thursday, September 21, 2017 3:41 PM
>> To: 'Guan Junxiong' <guanjunxiong@huawei.com>; Martin Wilck <mwilck@suse.com>; dm-devel@redhat.com; christophe.varoqui@opensvc.com
>> Cc: shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com
>> Subject: RE: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
>>
>> Hi Guan,
>> Thanks for adopting the naming convention.
>> Instead of marginal_path_err_recheck_gap_time, marginal_path_recovery_time looks more reasonable. Could you please take another look at it?
>>
>> I will review the code within a day.
>>
>> Regards,
>> Muneendra.
>>
>> -----Original Message-----
>> From: Guan Junxiong [mailto:guanjunxiong@huawei.com] 
>> Sent: Thursday, September 21, 2017 3:35 PM
>> To: Muneendra Kumar M <mmandala@Brocade.com>; Martin Wilck <mwilck@suse.com>; dm-devel@redhat.com; christophe.varoqui@opensvc.com
>> Cc: shenhong09@huawei.com; niuhaoxin@huawei.com; chengjike.cheng@huawei.com
>> Subject: Re: [PATCH V4 1/2] multipath-tools: intermittent IO error accounting to improve reliability
>>
>> Hi, Muneendra
>>
>>   Thanks for your clarification. I will adopt this renaming. If it is convenient for you, please review the V5 patch that I sent out two hours ago.
>>
>> Regards,
>> Guan
>>
>> On 2017/9/20 20:58, Muneendra Kumar M wrote:
>>> Hi Guan,
>>>>>> Shall we use existing PATH_SHAKY ?
>>> As path_shaky indicates a path not available for "normal" operations, we can use this state. That's a good idea.
>>>
>>> Regarding marginal paths, below is my explanation. Brocade is publishing a couple of white papers on this topic to educate SAN administrators and the SAN community.
>>>
>>> Marginal path:
>>>
>>> A host, target, LUN (ITL path) flow goes through a SAN. Note that each I/O request that goes to the SCSI layer transforms into a single SCSI exchange. In a single SAN, there are typically multiple SAN network paths for an ITL flow/path. Each SCSI exchange can take one of the various network paths available for the ITL path. A SAN can be based on Ethernet, FC, or InfiniBand physical networks to carry block storage traffic (SCSI, NVMe, etc.).
>>>
>>> There are typically two types of SAN network problems that are categorized as marginal issues. These issues are by nature not permanent and come and go over time.
>>> 1) Switches in the SAN can have intermittent frame drops or intermittent frame corruption due to a bad optics cable (SFP) or similar wear-and-tear port issues. This causes ITL flows that go through the faulty switch/port to intermittently experience frame drops.
>>> 2) There exist SAN topologies where switch ports in the fabric become the only conduit for many different ITL flows across multiple hosts. These single network paths are essentially shared across multiple ITL flows. Under these conditions, if the port link bandwidth cannot handle the net sum of the shared ITL flows' bandwidth going through the single path, we can see intermittent network congestion. This condition is called network oversubscription. The intermittent congestion can delay SCSI exchange completion (an increase in I/O latency is observed).
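The oversubscription condition in 2) above is just arithmetic over shared link bandwidth. A tiny illustrative check (the numbers are made up, not taken from any real fabric):

```python
# Oversubscription sketch: a switch port becomes a congestion point when the
# aggregate peak bandwidth of the ITL flows sharing it exceeds the link rate.
def is_oversubscribed(link_gbps, flow_peaks_gbps):
    """True if the shared link cannot carry all flows at their peak rates."""
    return sum(flow_peaks_gbps) > link_gbps

# Three hosts each pushing up to 4 Gb/s through one shared 8 Gb/s port:
print(is_oversubscribed(8, [4, 4, 4]))  # True -> intermittent congestion possible
print(is_oversubscribed(8, [2, 2, 2]))  # False -> link has headroom
```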
>>>
>>> To overcome the above network issues and many similar target issues, there are frame-level retries done in the HBA device firmware and I/O retries in the SCSI layer. These retries might succeed for one of the following reasons:
>>> 1) The intermittent switch/port issue is not hit again
>>> 2) The retry I/O is a new SCSI exchange, which can take an alternate SAN path for the ITL flow, if such a path exists
>>> 3) The network congestion disappears momentarily because the net I/O bandwidth coming from multiple ITL flows on the single shared network path is something the path can handle
>>>
>>> However, in some cases we have seen I/O retries fail because the retry I/Os hit a SAN network path that has an intermittent switch/port issue and/or network congestion.
>>>
>>> On the host we thus see configurations with two or more ITL paths sharing the same target/LUN through two or more HBA ports. These HBA ports are connected over two or more SANs to the same target/LUN.
>>> If an I/O fails at the multipath layer, the ITL path is put into the Failed state. Because of the marginal nature of the network, the next health-check command sent from the multipath layer might succeed, which moves the ITL path back into the Active state. You end up seeing the DM path state flapping through Active, Failed, Active transitions. This results in an overall reduction in application I/O throughput and sometimes application I/O failures (because of timing constraints). All of this can happen because of I/O retries and I/O requests moving across multiple paths of the DM device. Note that on the host, all I/O retries on a single path and I/O movement across multiple paths slow down the forward progress of new application I/O, because these I/O re-queue actions are given higher priority than newer I/O requests coming from the application.
>>>
>>> The above condition of the ITL path is hence called “marginal”.
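The error-accounting idea behind detecting such a marginal path (as opposed to a one-shot health check that sees Active/Failed flapping) can be sketched roughly as follows. This is a hypothetical simplification for illustration only, not the actual multipath-tools implementation; the function name and 10% threshold are invented here:

```python
# Sketch: sample a path with repeated short test I/Os over a window and flag
# it "marginal" when the error *rate* crosses a threshold, instead of
# flipping state on every individual failure/success.
def classify_path(io_results, err_rate_threshold=0.1):
    """io_results: list of booleans, True = test I/O succeeded.
    Returns 'marginal' if intermittent failures exceed the threshold."""
    if not io_results:
        return 'good'  # nothing sampled yet, assume healthy
    errors = sum(1 for ok in io_results if not ok)
    return 'marginal' if errors / len(io_results) > err_rate_threshold else 'good'

# A shaky link: mostly fine, but 3 of 20 test I/Os fail (15% error rate).
shaky = [True] * 17 + [False] * 3
stable = [True] * 20
print(classify_path(shaky))   # marginal
print(classify_path(stable))  # good
```

A one-shot checker would have reported the shaky path as healthy 17 times out of 20; accounting over the whole window is what makes the classification deterministic.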
>>>
>>> What we desire is for the DM to deterministically categorize an ITL path as “marginal” and move all pending I/Os from the marginal path to an active path. This will help in meeting application I/O timing constraints. We also want the capability to automatically reinstate the marginal path as active once the marginal condition in the network is fixed.
>>>
>>>
>>> Based on the above explanation, I want to rename the options to marginal_path_XXXX, and this is irrespective of the storage network type.
>>>
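For concreteness, a multipath.conf fragment using the marginal_path_XXXX naming discussed above might look like the following. The option names and values here are illustrative of the proposal only; check multipath.conf(5) for the option names and defaults your multipath-tools version actually supports:

```
defaults {
	# Fail the path if it fails twice within this many seconds
	marginal_path_double_failed_time    60
	# Then probe it with test I/Os for this many seconds
	marginal_path_err_sample_time       120
	# Error rate (errors per 1000 I/Os) above which the path is marginal
	marginal_path_err_rate_threshold    10
	# Wait this long before rechecking a marginal path
	marginal_path_err_recheck_gap_time  300
}
```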
>>> Regards,
>>> Muneendra.
>>
> 




end of thread, other threads:[~2017-10-12  6:59 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-17  3:40 [PATCH V4 0/2] multipath-tools: intermittent IO error accounting to improve reliability Guan Junxiong
2017-09-17  3:40 ` [PATCH V4 1/2] " Guan Junxiong
2017-09-18 12:53   ` Muneendra Kumar M
2017-09-18 14:36     ` Guan Junxiong
2017-09-18 19:51       ` Martin Wilck
2017-09-19  1:32         ` Guan Junxiong
2017-09-19 10:59           ` Muneendra Kumar M
2017-09-19 12:53             ` Guan Junxiong
2017-09-20 12:58               ` Muneendra Kumar M
2017-09-21 10:04                 ` Guan Junxiong
2017-09-21 10:10                   ` Muneendra Kumar M
     [not found]                   ` <615cdd5a955944e49986dca01bf406a5@BRMWP-EXMB12.corp.brocade.com>
2017-10-09  0:42                     ` Guan Junxiong
2017-10-09 11:39                       ` Muneendra Kumar M
2017-10-12  6:35                       ` Muneendra Kumar M
2017-10-12  6:46                         ` Guan Junxiong
2017-10-12  6:59                           ` Muneendra Kumar M
2017-09-17  3:40 ` [PATCH V4 2/2] multipath-tools: discard san_path_err_XXX feature Guan Junxiong
