* [PATCH 00/19] san_path_err & multipath ANA support
@ 2018-12-18 23:19 Martin Wilck
  2018-12-18 23:19 ` [PATCH 01/19] libmultipath: Increase SERIAL_SIZE to 128 bytes Martin Wilck
                   ` (20 more replies)
  0 siblings, 21 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

Hi Christophe,

this series consists of 3 parts. The first part improves the documentation on
the current approaches to "shaky" or "marginal" path detection, and
re-introduces the previously removed "san_path_err_xy" approach, which has
been prematurely removed IMO. At the time, I thought that it was superseded by
the "marginal path" algorithm, but I have my issues with the latter (hopefully
subject of a follow-up series), and I believe the "medium" complexity of the
san_path_err code actually has its merits. But to be honest, my strongest
reason to re-add it is that I have to continue to support it in SLES for some
time to come.

The second part accumulates a few bug fixes.

The third part introduces NVMe ANA support to multipath-tools, based on the
original patch from Li Jie of Huawei (#14). Instead of copy/pasting some code
from nvme-cli, as Li Jie did, I decided to copy some nvme-cli code unmodified
to our repo, and create a small wrapper around it. I took care not to bloat
the generated binaries with code we don't need. I added detect_prio on top of
it, and also added ANA support for the "foreign" code for native NVMe
multipath. BTW: Instead of applying patch #12, it would probably be possible
to simply add https://github.com/linux-nvme/nvme-cli as a submodule to multipath-
tools. I haven't tried that yet.

One thing to note: in dm-multipath mode, multipathd can now read the ANA
properties and derive prio values. But it can't react to updates from the
storage so far, because the kernel doesn't generate events to user space
if this happens. I haven't decided how to tackle this problem yet. Hints
and comments are welcome.

Cheers,
Martin

Kyle Mahlkuch (1):
  libmultipath: Increase SERIAL_SIZE to 128 bytes

lijie (1):
  multipath-tools: add ANA support for NVMe device

Martin Wilck (17):
  multipath.conf.5: explain "shaky" path detection
  libmultipath: propsel: don't print undefined values
  Revert "multipath-tools: discard san_path_err_XXX feature"
  multipathd: marginal_path overrides san_path_err
  multipath.conf.5: man page fixes for san_path_err_xy
  setup_map: wait for pending path checkers to finish
  libmultipath: add ARRAY_SIZE helper
  libmultipath: make close_fd() a common helper
  libmultipath: restore PG prio in update_multipath_strings
  multipathd: don't check foreign paths every tick
  libmultipath: add files from nvme-cli for NVMe support
  libmultipath: add wrapper library for nvme ioctls
  libmultipath: ANA prioritzer: use nvme wrapper library
  libmultipath: detect_prio: try ANA for NVMe
  libmultipath/foreign/nvme: use failover topology
  libmultipath/foreign/nvme: show ANA state
  libmultipath/foreign/nvme: indicate ANA support

 libmultipath/Makefile              |   18 +-
 libmultipath/config.c              |    3 +
 libmultipath/config.h              |    9 +
 libmultipath/configure.c           |   86 +-
 libmultipath/dict.c                |   39 +
 libmultipath/foreign/Makefile      |    2 +-
 libmultipath/foreign/nvme.c        |  180 +++-
 libmultipath/nvme-lib.c            |   49 +
 libmultipath/nvme-lib.h            |   39 +
 libmultipath/nvme/argconfig.h      |   99 ++
 libmultipath/nvme/json.h           |   87 ++
 libmultipath/nvme/linux/nvme.h     | 1450 ++++++++++++++++++++++++++++
 libmultipath/nvme/nvme-ioctl.c     |  869 +++++++++++++++++
 libmultipath/nvme/nvme-ioctl.h     |  139 +++
 libmultipath/nvme/nvme.h           |  163 ++++
 libmultipath/nvme/plugin.h         |   36 +
 libmultipath/prio.h                |    1 +
 libmultipath/prioritizers/Makefile |    5 +
 libmultipath/prioritizers/ana.c    |  201 ++++
 libmultipath/prioritizers/ana.h    |  221 +++++
 libmultipath/propsel.c             |  151 ++-
 libmultipath/propsel.h             |    3 +
 libmultipath/structs.h             |   30 +-
 libmultipath/structs_vec.c         |    8 +
 libmultipath/sysfs.c               |    5 -
 libmultipath/util.c                |    5 +
 libmultipath/util.h                |    3 +
 multipath/main.c                   |    4 -
 multipath/multipath.conf.5         |  141 ++-
 multipathd/main.c                  |  105 +-
 tests/hwtable.c                    |    2 +-
 31 files changed, 4051 insertions(+), 102 deletions(-)
 create mode 100644 libmultipath/nvme-lib.c
 create mode 100644 libmultipath/nvme-lib.h
 create mode 100644 libmultipath/nvme/argconfig.h
 create mode 100644 libmultipath/nvme/json.h
 create mode 100644 libmultipath/nvme/linux/nvme.h
 create mode 100644 libmultipath/nvme/nvme-ioctl.c
 create mode 100644 libmultipath/nvme/nvme-ioctl.h
 create mode 100644 libmultipath/nvme/nvme.h
 create mode 100644 libmultipath/nvme/plugin.h
 create mode 100644 libmultipath/prioritizers/ana.c
 create mode 100644 libmultipath/prioritizers/ana.h

-- 
2.19.2

* [PATCH 01/19] libmultipath: Increase SERIAL_SIZE to 128 bytes
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 02/19] multipath.conf.5: explain "shaky" path detection Martin Wilck
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, Kyle Mahlkuch, dm-devel

From: Kyle Mahlkuch <kmahlkuc@linux.vnet.ibm.com>

Certain IBM FlashSystem LUNs can return up to 85 bytes of serial
number in the Unit Serial Number VPD page, which is larger than
the current SERIAL_SIZE definition of 65 bytes. Since the max
size of this field does not appear to be defined in SPC, increasing
to 128 bytes should hopefully prevent us from hitting this
in future.

This is an example of a serial number from a FlashSystem:
Unit serial number VPD page:
Unit serial number: 3321360050764008101AB300000000000012204214503IBMfcp

Before this patch multipath returns the error:
Jul 17 11:24:58 | vpd pg80 overflow, 85/65 bytes required

After the patch is applied the error no longer occurs.

Signed-off-by: Kyle Mahlkuch <kmahlkuc@linux.vnet.ibm.com>
Reviewed-by: Martin Wilck <mwilck@suse.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
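For illustration, the kind of bounds check that produces an "overflow,
required/available" message could look like the sketch below. This is not
the libmultipath code: demo_copy_serial() and DEMO_SERIAL_SIZE are
hypothetical names, and how the required byte count is computed (payload
plus terminating NUL) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical buffer size, mirroring the new SERIAL_SIZE value. */
#define DEMO_SERIAL_SIZE 128

/*
 * Copy `len` bytes of serial number data into `dst` (capacity `dst_size`)
 * and NUL-terminate the result.
 *
 * Returns 0 on success. If the data does not fit, `dst` is left untouched
 * and the number of bytes that would be required (len + 1) is returned, so
 * the caller can report "required/available".
 */
static size_t demo_copy_serial(char *dst, size_t dst_size,
			       const char *src, size_t len)
{
	assert(dst != NULL && src != NULL);

	if (len + 1 > dst_size)
		return len + 1;

	memcpy(dst, src, len);
	dst[len] = '\0';
	return 0;
}
```

Under these assumptions, an 84 character serial number is rejected by a
65 byte buffer and accepted by a 128 byte one.
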
 libmultipath/structs.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libmultipath/structs.h b/libmultipath/structs.h
index 0a2623a0..d8961164 100644
--- a/libmultipath/structs.h
+++ b/libmultipath/structs.h
@@ -9,7 +9,7 @@
 #include "generic.h"
 
 #define WWID_SIZE		128
-#define SERIAL_SIZE		65
+#define SERIAL_SIZE		128
 #define NODE_NAME_SIZE		224
 #define PATH_STR_SIZE		16
 #define PARAMS_SIZE		4096
-- 
2.19.2

* [PATCH 02/19] multipath.conf.5: explain "shaky" path detection
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
  2018-12-18 23:19 ` [PATCH 01/19] libmultipath: Increase SERIAL_SIZE to 128 bytes Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 03/19] libmultipath: propsel: don't print undefined values Martin Wilck
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui
  Cc: Guan Junxiong, dm-devel, M Muneendra Kumar, Martin Wilck

Explain the "shaky path" detection algorithms, and how they
relate to each other.

Cc: Guan Junxiong <guanjunxiong@huawei.com>
Cc: M Muneendra Kumar <mmandala@brocade.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
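For illustration, the "delay_checks" method described in the new man page
section can be sketched as a small per-path state machine driven by checker
ticks. This is a simplified sketch, not multipathd's implementation:
struct demo_path, demo_tick() and the exact counting rules are assumptions
made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-path bookkeeping for the sketch. */
struct demo_path {
	int watch_left;	/* ticks left in the watch window, 0 = not watched */
	int wait_left;	/* good ticks still required before reuse, 0 = none */
};

/*
 * Process one path checker tick.
 * `up` is nonzero if the checker reports the path as healthy.
 * `watch_checks` and `wait_checks` play the role of delay_watch_checks
 * and delay_wait_checks.
 * Returns nonzero if the path may be used after this tick.
 */
static int demo_tick(struct demo_path *p, int up,
		     int watch_checks, int wait_checks)
{
	assert(p != NULL);

	if (!up) {
		/*
		 * A failure inside the watch window marks the path as shaky.
		 * A failure while waiting restarts the wait, because the
		 * path must stay good for the whole interval.
		 */
		if (p->watch_left > 0 || p->wait_left > 0)
			p->wait_left = wait_checks;
		p->watch_left = watch_checks;
		return 0;
	}

	if (p->watch_left > 0)
		p->watch_left--;

	if (p->wait_left > 0) {
		p->wait_left--;
		return p->wait_left == 0;
	}
	return 1;
}
```

With watch_checks = 3 and wait_checks = 2, a single failure is followed by
immediate reinstatement on the next good tick, while a second failure within
the watch window delays reinstatement until two consecutive good ticks have
been seen.
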
 multipath/multipath.conf.5 | 59 ++++++++++++++++++++++++++++++++++----
 1 file changed, 53 insertions(+), 6 deletions(-)

diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index 63333669..68119baa 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -898,7 +898,7 @@ error such as intermittent error. When a path failed event occurs twice in
 other three parameters are set, multipathd will fail the path and enqueue
 this path into a queue of which members are sent a couple of continuous
 direct reading asynchronous IOs at a fixed sample rate of 10HZ to start IO
-error accounting process.
+error accounting process. See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -920,7 +920,7 @@ If the rate of IO error on a particular path is greater than the
 \fImarginal_path_err_recheck_gap_time\fR seconds unless there is only one
 active path. After \fImarginal_path_err_recheck_gap_time\fR expires, the path
 will be requeueed for rechecking. If checking result is good enough, the
-path will be reinstated.
+path will be reinstated. See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -934,7 +934,7 @@ of supporting path check based on accounting IO error such as intermittent
 error. Refer to \fImarginal_path_err_sample_time\fR. If the rate of IO errors
 on a particular path is greater than this parameter, then the path will not
 reinstate for \fImarginal_path_err_recheck_gap_time\fR seconds unless there is
-only one active path.
+only one active path. See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -951,7 +951,7 @@ value, the failed path of  which the IO error rate is larger than
 \fImarginal_path_err_recheck_gap_time\fR seconds. When
 \fImarginal_path_err_recheck_gap_time\fR seconds expires, the path will be
 requeueed for checking. If checking result is good enough, the path will be
-reinstated, or else it will keep failed.
+reinstated, or else it will keep failed. See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -963,7 +963,7 @@ The default is: \fBno\fR
 If set to a value greater than 0, multipathd will watch paths that have
 recently become valid for this many checks. If they fail again while they are
 being watched, when they next become valid, they will not be used until they
-have stayed up for \fIdelay_wait_checks\fR checks.
+have stayed up for \fIdelay_wait_checks\fR checks. See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -975,7 +975,7 @@ The default is: \fBno\fR
 If set to a value greater than 0, when a device that has recently come back
 online fails again within \fIdelay_watch_checks\fR checks, the next time it
 comes back online, it will marked and delayed, and not used until it has passed
-\fIdelay_wait_checks\fR checks.
+\fIdelay_wait_checks\fR checks. See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -1578,6 +1578,53 @@ are present multipath will try to use the sysfs attribute
 .
 .
 .\" ----------------------------------------------------------------------------
+.SH "Shaky paths detection"
+.\" ----------------------------------------------------------------------------
+.
+A common problem in SAN setups is the occurence of intermittent errors: a
+path is unreachable, then reachable again for a short time, disappears again,
+and so forth. This happens typically on unstable interconnects. It is
+undesirable to switch pathgroups unnecessarily on such frequent, unreliable
+events. \fImultipathd\fR supports two different methods for detecting this
+situation and dealing with it. All methods share the same basic mode of
+operation: If a path is found to be \(dqshaky\(dq or \(dqflipping\(dq,
+and appears to be in healthy status, it is not reinstated (put back to use)
+immediately. Instead, it is watched for some time, and only reinstated
+if the healthy state appears to be stable. The logic of determining
+\(dqshaky\(dq condition, as well as the logic when to reinstate,
+differs between the methods.
+.TP 8
+.B \(dqdelay_checks\(dq failure tracking
+If a path fails again within a
+\fIdelay_watch_checks\fR interval after a failure, don't
+reinstate it until it passes a \fIdelay_wait_checks\fR interval
+in always good status.
+The intervals are measured in \(dqticks\(dq, i.e. the
+time between path checks by multipathd, which is variable and controlled by the
+\fIpolling_interval\fR and \fImax_polling_interval\fR parameters.
+.TP
+.B \(dqmarginal_path\(dq failure tracking
+If a second failure event (good->bad transition) occurs within
+\fImarginal_path_double_failed_time\fR seconds after a failure, high-frequency
+monitoring is started for the affected path: I/O is sent at a rate of 10 per
+second. This is done for \fImarginal_path_err_sample_time\fR seconds. During
+this period, the path is not reinstated. If the
+rate of errors remains below \fImarginal_path_err_rate_threshold\fR during the
+monitoring period, the path is reinstated. Otherwise, it
+is kept in failed state for \fImarginal_path_err_recheck_gap_time\fR, and
+after that, it is monitored again. For this method, time intervals are measured
+in seconds.
+.
+.RE
+.LP
+.
+See the documentation of the individual options above for details.
+It is \fBstrongly discouraged\fR to use more than one of these methods for any
+given multipath map, because the two concurrent methods may interact in
+unpredictable ways.
+.
+.
+.\" ----------------------------------------------------------------------------
 .SH "KNOWN ISSUES"
 .\" ----------------------------------------------------------------------------
 .
-- 
2.19.2

* [PATCH 03/19] libmultipath: propsel: don't print undefined values
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
  2018-12-18 23:19 ` [PATCH 01/19] libmultipath: Increase SERIAL_SIZE to 128 bytes Martin Wilck
  2018-12-18 23:19 ` [PATCH 02/19] multipath.conf.5: explain "shaky" path detection Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature" Martin Wilck
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

print_off_int_undef() may return 0 if passed NU_UNDEF,
in which case the buffer contents are undefined.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
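For illustration, the pattern applied by this patch (use the buffer only if
the print helper reports that it wrote something) can be reduced to the
sketch below. The helpers are hypothetical stand-ins, not the libmultipath
functions; the DEMO_* values and the text printed for the "no" case are
assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sentinel values for the sketch. */
enum { DEMO_NO = -1, DEMO_UNDEF = 0 };

/*
 * Stand-in for a printer in the style of print_off_int_undef():
 * returns the number of characters written, or 0 WITHOUT touching `buf`
 * when the value is undefined.
 */
static int demo_print_off_int_undef(char *buf, int len, int v)
{
	assert(buf != NULL && len > 0);

	if (v == DEMO_UNDEF)
		return 0;	/* buf is left uninitialized */
	if (v == DEMO_NO)
		return snprintf(buf, (size_t)len, "no");
	return snprintf(buf, (size_t)len, "%i", v);
}

/*
 * Format "name = value" into `out` only if the value could be printed.
 * Returns 1 if `out` was written, 0 if the value was undefined.
 */
static int demo_log_setting(const char *name, int v, char *out, int outlen)
{
	char buff[12];

	assert(name != NULL && out != NULL && outlen > 0);

	if (demo_print_off_int_undef(buff, (int)sizeof(buff), v) == 0)
		return 0;	/* never read buff in this case */

	snprintf(out, (size_t)outlen, "%s = %s", name, buff);
	return 1;
}
```

Without the return value check, the formatted message would be built from
whatever bytes happen to be in buff.
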
 libmultipath/propsel.c | 42 ++++++++++++++++++++++++------------------
 1 file changed, 24 insertions(+), 18 deletions(-)

diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index 970a3b5c..7b19fed0 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -855,8 +855,9 @@ int select_delay_watch_checks(struct config *conf, struct multipath *mp)
 	mp_set_conf(delay_watch_checks);
 	mp_set_default(delay_watch_checks, DEFAULT_DELAY_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->delay_watch_checks);
-	condlog(3, "%s: delay_watch_checks = %s %s", mp->alias, buff, origin);
+	if (print_off_int_undef(buff, 12, mp->delay_watch_checks) != 0)
+		condlog(3, "%s: delay_watch_checks = %s %s",
+			mp->alias, buff, origin);
 	return 0;
 }
 
@@ -871,8 +872,9 @@ int select_delay_wait_checks(struct config *conf, struct multipath *mp)
 	mp_set_conf(delay_wait_checks);
 	mp_set_default(delay_wait_checks, DEFAULT_DELAY_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->delay_wait_checks);
-	condlog(3, "%s: delay_wait_checks = %s %s", mp->alias, buff, origin);
+	if (print_off_int_undef(buff, 12, mp->delay_wait_checks) != 0)
+		condlog(3, "%s: delay_wait_checks = %s %s",
+			mp->alias, buff, origin);
 	return 0;
 
 }
@@ -888,9 +890,10 @@ int select_marginal_path_err_sample_time(struct config *conf, struct multipath *
 	mp_set_conf(marginal_path_err_sample_time);
 	mp_set_default(marginal_path_err_sample_time, DEFAULT_ERR_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->marginal_path_err_sample_time);
-	condlog(3, "%s: marginal_path_err_sample_time = %s %s", mp->alias, buff,
-			origin);
+	if (print_off_int_undef(buff, 12, mp->marginal_path_err_sample_time)
+	    != 0)
+		condlog(3, "%s: marginal_path_err_sample_time = %s %s",
+			mp->alias, buff, origin);
 	return 0;
 }
 
@@ -905,9 +908,10 @@ int select_marginal_path_err_rate_threshold(struct config *conf, struct multipat
 	mp_set_conf(marginal_path_err_rate_threshold);
 	mp_set_default(marginal_path_err_rate_threshold, DEFAULT_ERR_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->marginal_path_err_rate_threshold);
-	condlog(3, "%s: marginal_path_err_rate_threshold = %s %s", mp->alias, buff,
-			origin);
+	if (print_off_int_undef(buff, 12, mp->marginal_path_err_rate_threshold)
+	    != 0)
+		condlog(3, "%s: marginal_path_err_rate_threshold = %s %s",
+			mp->alias, buff, origin);
 	return 0;
 }
 
@@ -922,9 +926,10 @@ int select_marginal_path_err_recheck_gap_time(struct config *conf, struct multip
 	mp_set_conf(marginal_path_err_recheck_gap_time);
 	mp_set_default(marginal_path_err_recheck_gap_time, DEFAULT_ERR_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->marginal_path_err_recheck_gap_time);
-	condlog(3, "%s: marginal_path_err_recheck_gap_time = %s %s", mp->alias, buff,
-			origin);
+	if (print_off_int_undef(buff, 12,
+				mp->marginal_path_err_recheck_gap_time) != 0)
+		condlog(3, "%s: marginal_path_err_recheck_gap_time = %s %s",
+			mp->alias, buff, origin);
 	return 0;
 }
 
@@ -939,9 +944,10 @@ int select_marginal_path_double_failed_time(struct config *conf, struct multipat
 	mp_set_conf(marginal_path_double_failed_time);
 	mp_set_default(marginal_path_double_failed_time, DEFAULT_ERR_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->marginal_path_double_failed_time);
-	condlog(3, "%s: marginal_path_double_failed_time = %s %s", mp->alias, buff,
-			origin);
+	if (print_off_int_undef(buff, 12, mp->marginal_path_double_failed_time)
+	    != 0)
+		condlog(3, "%s: marginal_path_double_failed_time = %s %s",
+			mp->alias, buff, origin);
 	return 0;
 }
 
@@ -993,8 +999,8 @@ int select_ghost_delay (struct config *conf, struct multipath * mp)
 	mp_set_conf(ghost_delay);
 	mp_set_default(ghost_delay, DEFAULT_GHOST_DELAY);
 out:
-	print_off_int_undef(buff, 12, mp->ghost_delay);
-	condlog(3, "%s: ghost_delay = %s %s", mp->alias, buff, origin);
+	if (print_off_int_undef(buff, 12, mp->ghost_delay) != 0)
+		condlog(3, "%s: ghost_delay = %s %s", mp->alias, buff, origin);
 	return 0;
 }
 
-- 
2.19.2

* [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (2 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 03/19] libmultipath: propsel: don't print undefined values Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-19 11:32   ` Muneendra Kumar M
  2018-12-18 23:19 ` [PATCH 05/19] multipathd: marginal_path overrides san_path_err Martin Wilck
                   ` (16 subsequent siblings)
  20 siblings, 1 reply; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui
  Cc: Guan Junxiong, dm-devel, M Muneendra Kumar, Martin Wilck

This reverts commit 9cf6a48f18a291982af34b4fb0110654b94e591c.
We removed this functionality prematurely. I am not convinced
that the "marginal_path" code really replaces it. Let customers
evaluate the different options, and vote with their feet.

Cc: Guan Junxiong <guanjunxiong@huawei.com>
Cc: M Muneendra Kumar <mmandala@brocade.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/config.c      |  3 ++
 libmultipath/config.h      |  9 ++++
 libmultipath/configure.c   |  3 ++
 libmultipath/dict.c        | 39 ++++++++++++++++++
 libmultipath/propsel.c     | 53 ++++++++++++++++++++++++
 libmultipath/propsel.h     |  3 ++
 libmultipath/structs.h     |  7 ++++
 multipath/multipath.conf.5 | 57 ++++++++++++++++++++++++++
 multipathd/main.c          | 84 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 258 insertions(+)

diff --git a/libmultipath/config.c b/libmultipath/config.c
index 5af7af58..24d71aed 100644
--- a/libmultipath/config.c
+++ b/libmultipath/config.c
@@ -369,6 +369,9 @@ merge_hwe (struct hwentry * dst, struct hwentry * src)
 	merge_num(max_sectors_kb);
 	merge_num(ghost_delay);
 	merge_num(all_tg_pt);
+	merge_num(san_path_err_threshold);
+	merge_num(san_path_err_forget_rate);
+	merge_num(san_path_err_recovery_time);
 
 	snprintf(id, sizeof(id), "%s/%s", dst->vendor, dst->product);
 	reconcile_features_with_options(id, &dst->features,
diff --git a/libmultipath/config.h b/libmultipath/config.h
index 7d0cd9a6..b938c26c 100644
--- a/libmultipath/config.h
+++ b/libmultipath/config.h
@@ -76,6 +76,9 @@ struct hwentry {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
+	int san_path_err_threshold;
+	int san_path_err_forget_rate;
+	int san_path_err_recovery_time;
 	int marginal_path_err_sample_time;
 	int marginal_path_err_rate_threshold;
 	int marginal_path_err_recheck_gap_time;
@@ -112,6 +115,9 @@ struct mpentry {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
+	int san_path_err_threshold;
+	int san_path_err_forget_rate;
+	int san_path_err_recovery_time;
 	int marginal_path_err_sample_time;
 	int marginal_path_err_rate_threshold;
 	int marginal_path_err_recheck_gap_time;
@@ -162,6 +168,9 @@ struct config {
 	int processed_main_config;
 	int delay_watch_checks;
 	int delay_wait_checks;
+	int san_path_err_threshold;
+	int san_path_err_forget_rate;
+	int san_path_err_recovery_time;
 	int marginal_path_err_sample_time;
 	int marginal_path_err_rate_threshold;
 	int marginal_path_err_recheck_gap_time;
diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 84ae5f56..60a98873 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -309,6 +309,9 @@ int setup_map(struct multipath *mpp, char *params, int params_size,
 	select_deferred_remove(conf, mpp);
 	select_delay_watch_checks(conf, mpp);
 	select_delay_wait_checks(conf, mpp);
+	select_san_path_err_threshold(conf, mpp);
+	select_san_path_err_forget_rate(conf, mpp);
+	select_san_path_err_recovery_time(conf, mpp);
 	select_marginal_path_err_sample_time(conf, mpp);
 	select_marginal_path_err_rate_threshold(conf, mpp);
 	select_marginal_path_err_recheck_gap_time(conf, mpp);
diff --git a/libmultipath/dict.c b/libmultipath/dict.c
index a81c051f..fd29abca 100644
--- a/libmultipath/dict.c
+++ b/libmultipath/dict.c
@@ -1217,6 +1217,33 @@ declare_hw_handler(delay_wait_checks, set_off_int_undef)
 declare_hw_snprint(delay_wait_checks, print_off_int_undef)
 declare_mp_handler(delay_wait_checks, set_off_int_undef)
 declare_mp_snprint(delay_wait_checks, print_off_int_undef)
+declare_def_handler(san_path_err_threshold, set_off_int_undef)
+declare_def_snprint_defint(san_path_err_threshold, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(san_path_err_threshold, set_off_int_undef)
+declare_ovr_snprint(san_path_err_threshold, print_off_int_undef)
+declare_hw_handler(san_path_err_threshold, set_off_int_undef)
+declare_hw_snprint(san_path_err_threshold, print_off_int_undef)
+declare_mp_handler(san_path_err_threshold, set_off_int_undef)
+declare_mp_snprint(san_path_err_threshold, print_off_int_undef)
+declare_def_handler(san_path_err_forget_rate, set_off_int_undef)
+declare_def_snprint_defint(san_path_err_forget_rate, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(san_path_err_forget_rate, set_off_int_undef)
+declare_ovr_snprint(san_path_err_forget_rate, print_off_int_undef)
+declare_hw_handler(san_path_err_forget_rate, set_off_int_undef)
+declare_hw_snprint(san_path_err_forget_rate, print_off_int_undef)
+declare_mp_handler(san_path_err_forget_rate, set_off_int_undef)
+declare_mp_snprint(san_path_err_forget_rate, print_off_int_undef)
+declare_def_handler(san_path_err_recovery_time, set_off_int_undef)
+declare_def_snprint_defint(san_path_err_recovery_time, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(san_path_err_recovery_time, set_off_int_undef)
+declare_ovr_snprint(san_path_err_recovery_time, print_off_int_undef)
+declare_hw_handler(san_path_err_recovery_time, set_off_int_undef)
+declare_hw_snprint(san_path_err_recovery_time, print_off_int_undef)
+declare_mp_handler(san_path_err_recovery_time, set_off_int_undef)
+declare_mp_snprint(san_path_err_recovery_time, print_off_int_undef)
 declare_def_handler(marginal_path_err_sample_time, set_off_int_undef)
 declare_def_snprint_defint(marginal_path_err_sample_time, print_off_int_undef,
 			   DEFAULT_ERR_CHECKS)
@@ -1620,6 +1647,9 @@ init_keywords(vector keywords)
 	install_keyword("config_dir", &def_config_dir_handler, &snprint_def_config_dir);
 	install_keyword("delay_watch_checks", &def_delay_watch_checks_handler, &snprint_def_delay_watch_checks);
 	install_keyword("delay_wait_checks", &def_delay_wait_checks_handler, &snprint_def_delay_wait_checks);
+	install_keyword("san_path_err_threshold", &def_san_path_err_threshold_handler, &snprint_def_san_path_err_threshold);
+	install_keyword("san_path_err_forget_rate", &def_san_path_err_forget_rate_handler, &snprint_def_san_path_err_forget_rate);
+	install_keyword("san_path_err_recovery_time", &def_san_path_err_recovery_time_handler, &snprint_def_san_path_err_recovery_time);
 	install_keyword("marginal_path_err_sample_time", &def_marginal_path_err_sample_time_handler, &snprint_def_marginal_path_err_sample_time);
 	install_keyword("marginal_path_err_rate_threshold", &def_marginal_path_err_rate_threshold_handler, &snprint_def_marginal_path_err_rate_threshold);
 	install_keyword("marginal_path_err_recheck_gap_time", &def_marginal_path_err_recheck_gap_time_handler, &snprint_def_marginal_path_err_recheck_gap_time);
@@ -1714,6 +1744,9 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &hw_deferred_remove_handler, &snprint_hw_deferred_remove);
 	install_keyword("delay_watch_checks", &hw_delay_watch_checks_handler, &snprint_hw_delay_watch_checks);
 	install_keyword("delay_wait_checks", &hw_delay_wait_checks_handler, &snprint_hw_delay_wait_checks);
+	install_keyword("san_path_err_threshold", &hw_san_path_err_threshold_handler, &snprint_hw_san_path_err_threshold);
+	install_keyword("san_path_err_forget_rate", &hw_san_path_err_forget_rate_handler, &snprint_hw_san_path_err_forget_rate);
+	install_keyword("san_path_err_recovery_time", &hw_san_path_err_recovery_time_handler, &snprint_hw_san_path_err_recovery_time);
 	install_keyword("marginal_path_err_sample_time", &hw_marginal_path_err_sample_time_handler, &snprint_hw_marginal_path_err_sample_time);
 	install_keyword("marginal_path_err_rate_threshold", &hw_marginal_path_err_rate_threshold_handler, &snprint_hw_marginal_path_err_rate_threshold);
 	install_keyword("marginal_path_err_recheck_gap_time", &hw_marginal_path_err_recheck_gap_time_handler, &snprint_hw_marginal_path_err_recheck_gap_time);
@@ -1750,6 +1783,9 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &ovr_deferred_remove_handler, &snprint_ovr_deferred_remove);
 	install_keyword("delay_watch_checks", &ovr_delay_watch_checks_handler, &snprint_ovr_delay_watch_checks);
 	install_keyword("delay_wait_checks", &ovr_delay_wait_checks_handler, &snprint_ovr_delay_wait_checks);
+	install_keyword("san_path_err_threshold", &ovr_san_path_err_threshold_handler, &snprint_ovr_san_path_err_threshold);
+	install_keyword("san_path_err_forget_rate", &ovr_san_path_err_forget_rate_handler, &snprint_ovr_san_path_err_forget_rate);
+	install_keyword("san_path_err_recovery_time", &ovr_san_path_err_recovery_time_handler, &snprint_ovr_san_path_err_recovery_time);
 	install_keyword("marginal_path_err_sample_time", &ovr_marginal_path_err_sample_time_handler, &snprint_ovr_marginal_path_err_sample_time);
 	install_keyword("marginal_path_err_rate_threshold", &ovr_marginal_path_err_rate_threshold_handler, &snprint_ovr_marginal_path_err_rate_threshold);
 	install_keyword("marginal_path_err_recheck_gap_time", &ovr_marginal_path_err_recheck_gap_time_handler, &snprint_ovr_marginal_path_err_recheck_gap_time);
@@ -1785,6 +1821,9 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &mp_deferred_remove_handler, &snprint_mp_deferred_remove);
 	install_keyword("delay_watch_checks", &mp_delay_watch_checks_handler, &snprint_mp_delay_watch_checks);
 	install_keyword("delay_wait_checks", &mp_delay_wait_checks_handler, &snprint_mp_delay_wait_checks);
+	install_keyword("san_path_err_threshold", &mp_san_path_err_threshold_handler, &snprint_mp_san_path_err_threshold);
+	install_keyword("san_path_err_forget_rate", &mp_san_path_err_forget_rate_handler, &snprint_mp_san_path_err_forget_rate);
+	install_keyword("san_path_err_recovery_time", &mp_san_path_err_recovery_time_handler, &snprint_mp_san_path_err_recovery_time);
 	install_keyword("marginal_path_err_sample_time", &mp_marginal_path_err_sample_time_handler, &snprint_mp_marginal_path_err_sample_time);
 	install_keyword("marginal_path_err_rate_threshold", &mp_marginal_path_err_rate_threshold_handler, &snprint_mp_marginal_path_err_rate_threshold);
 	install_keyword("marginal_path_err_recheck_gap_time", &mp_marginal_path_err_recheck_gap_time_handler, &snprint_mp_marginal_path_err_recheck_gap_time);
diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index 7b19fed0..a4d114c0 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -879,6 +879,59 @@ out:
 
 }
 
+int select_san_path_err_threshold(struct config *conf, struct multipath *mp)
+{
+	const char *origin;
+	char buff[12];
+
+	mp_set_mpe(san_path_err_threshold);
+	mp_set_ovr(san_path_err_threshold);
+	mp_set_hwe(san_path_err_threshold);
+	mp_set_conf(san_path_err_threshold);
+	mp_set_default(san_path_err_threshold, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, mp->san_path_err_threshold);
+	condlog(3, "%s: san_path_err_threshold = %s %s", mp->alias, buff,
+		origin);
+	return 0;
+}
+
+int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp)
+{
+	const char *origin;
+	char buff[12];
+
+	mp_set_mpe(san_path_err_forget_rate);
+	mp_set_ovr(san_path_err_forget_rate);
+	mp_set_hwe(san_path_err_forget_rate);
+	mp_set_conf(san_path_err_forget_rate);
+	mp_set_default(san_path_err_forget_rate, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, mp->san_path_err_forget_rate);
+	condlog(3, "%s: san_path_err_forget_rate = %s %s", mp->alias,
+		buff, origin);
+	return 0;
+
+}
+
+int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp)
+{
+	const char *origin;
+	char buff[12];
+
+	mp_set_mpe(san_path_err_recovery_time);
+	mp_set_ovr(san_path_err_recovery_time);
+	mp_set_hwe(san_path_err_recovery_time);
+	mp_set_conf(san_path_err_recovery_time);
+	mp_set_default(san_path_err_recovery_time, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, mp->san_path_err_recovery_time);
+	condlog(3, "%s: san_path_err_recovery_time = %s %s", mp->alias,
+		buff, origin);
+	return 0;
+
+}
+
 int select_marginal_path_err_sample_time(struct config *conf, struct multipath *mp)
 {
 	const char *origin;
diff --git a/libmultipath/propsel.h b/libmultipath/propsel.h
index ae99b927..b352c16a 100644
--- a/libmultipath/propsel.h
+++ b/libmultipath/propsel.h
@@ -26,6 +26,9 @@ int select_delay_watch_checks (struct config *conf, struct multipath * mp);
 int select_delay_wait_checks (struct config *conf, struct multipath * mp);
 int select_skip_kpartx (struct config *conf, struct multipath * mp);
 int select_max_sectors_kb (struct config *conf, struct multipath * mp);
+int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp);
+int select_san_path_err_threshold(struct config *conf, struct multipath *mp);
+int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp);
 int select_marginal_path_err_sample_time(struct config *conf, struct multipath *mp);
 int select_marginal_path_err_rate_threshold(struct config *conf, struct multipath *mp);
 int select_marginal_path_err_recheck_gap_time(struct config *conf, struct multipath *mp);
diff --git a/libmultipath/structs.h b/libmultipath/structs.h
index d8961164..96df8c8a 100644
--- a/libmultipath/structs.h
+++ b/libmultipath/structs.h
@@ -280,6 +280,10 @@ struct path {
 	int initialized;
 	int retriggers;
 	int wwid_changed;
+	unsigned int path_failures;
+	time_t dis_reinstate_time;
+	int disable_reinstate;
+	int san_path_err_forget_rate;
 	time_t io_err_dis_reinstate_time;
 	int io_err_disable_reinstate;
 	int io_err_pathfail_cnt;
@@ -318,6 +322,9 @@ struct multipath {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
+	int san_path_err_threshold;
+	int san_path_err_forget_rate;
+	int san_path_err_recovery_time;
 	int marginal_path_err_sample_time;
 	int marginal_path_err_rate_threshold;
 	int marginal_path_err_recheck_gap_time;
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index 68119baa..35e6d37c 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -891,6 +891,45 @@ The default is: \fB/etc/multipath/conf.d/\fR
 .
 .
 .TP
+.B san_path_err_threshold
+If set to a value greater than 0, multipathd will watch paths and check how many
+times a path has been failed due to errors.If the number of failures on a particular
+path is greater then the san_path_err_threshold then the path will not  reinstante
+till san_path_err_recovery_time.These path failures should occur within a
+san_path_err_forget_rate checks, if not we will consider the path is good enough
+to reinstantate.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
+.B san_path_err_forget_rate
+If set to a value greater than 0, multipathd will check whether the path failures
+has exceeded  the san_path_err_threshold within this many checks i.e
+san_path_err_forget_rate . If so we will not reinstante the path till
+san_path_err_recovery_time.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
+.B san_path_err_recovery_time
+If set to a value greater than 0, multipathd will make sure that when path failures
+has exceeded the san_path_err_threshold within san_path_err_forget_rate then the path
+will be placed in failed state for san_path_err_recovery_time duration.Once san_path_err_recovery_time
+has timeout  we will reinstante the failed path .
+san_path_err_recovery_time value should be in secs.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
 .B marginal_path_double_failed_time
 One of the four parameters of supporting path check based on accounting IO
 error such as intermittent error. When a path failed event occurs twice in
@@ -1297,6 +1336,12 @@ section:
 .TP
 .B deferred_remove
 .TP
+.B san_path_err_threshold
+.TP
+.B san_path_err_forget_rate
+.TP
+.B san_path_err_recovery_time
+.TP
 .B marginal_path_err_sample_time
 .TP
 .B marginal_path_err_rate_threshold
@@ -1448,6 +1493,12 @@ section:
 .TP
 .B deferred_remove
 .TP
+.B san_path_err_threshold
+.TP
+.B san_path_err_forget_rate
+.TP
+.B san_path_err_recovery_time
+.TP
 .B marginal_path_err_sample_time
 .TP
 .B marginal_path_err_rate_threshold
@@ -1524,6 +1575,12 @@ the values are taken from the \fIdevices\fR or \fIdefaults\fR sections:
 .TP
 .B deferred_remove
 .TP
+.B san_path_err_threshold
+.TP
+.B san_path_err_forget_rate
+.TP
+.B san_path_err_recovery_time
+.TP
 .B marginal_path_err_sample_time
 .TP
 .B marginal_path_err_rate_threshold
diff --git a/multipathd/main.c b/multipathd/main.c
index 99145293..57bb7143 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -1833,6 +1833,84 @@ int update_path_groups(struct multipath *mpp, struct vectors *vecs, int refresh)
 	return 0;
 }
 
+static int check_path_reinstate_state(struct path * pp) {
+	struct timespec curr_time;
+	if (!((pp->mpp->san_path_err_threshold > 0) &&
+				(pp->mpp->san_path_err_forget_rate > 0) &&
+				(pp->mpp->san_path_err_recovery_time >0))) {
+		return 0;
+	}
+
+	if (pp->disable_reinstate) {
+		/* If we don't know how much time has passed, automatically
+		 * reinstate the path, just to be safe. Also, if there are
+		 * no other usable paths, reinstate the path
+		 */
+		if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0 ||
+				pp->mpp->nr_active == 0) {
+			condlog(2, "%s : reinstating path early", pp->dev);
+			goto reinstate_path;
+		}
+		if ((curr_time.tv_sec - pp->dis_reinstate_time ) > pp->mpp->san_path_err_recovery_time) {
+			condlog(2,"%s : reinstate the path after err recovery time", pp->dev);
+			goto reinstate_path;
+		}
+		return 1;
+	}
+	/* forget errors on a working path */
+	if ((pp->state == PATH_UP || pp->state == PATH_GHOST) &&
+			pp->path_failures > 0) {
+		if (pp->san_path_err_forget_rate > 0){
+			pp->san_path_err_forget_rate--;
+		} else {
+			/* for every san_path_err_forget_rate number of
+			 * successful path checks decrement path_failures by 1
+			 */
+			pp->path_failures--;
+			pp->san_path_err_forget_rate = pp->mpp->san_path_err_forget_rate;
+		}
+		return 0;
+	}
+
+	/* If the path isn't recovering from a failed state, do nothing */
+	if (pp->state != PATH_DOWN && pp->state != PATH_SHAKY &&
+			pp->state != PATH_TIMEOUT)
+		return 0;
+
+	if (pp->path_failures == 0)
+		pp->san_path_err_forget_rate = pp->mpp->san_path_err_forget_rate;
+
+	pp->path_failures++;
+
+	/* if we don't know the current time, we don't know how long to
+	 * delay the path, so there's no point in checking if we should
+	 */
+
+	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
+		return 0;
+	/* when path failures has exceeded the san_path_err_threshold
+	 * place the path in delayed state till san_path_err_recovery_time
+	 * so that the customer can rectify the issue within this time. After
+	 * the completion of san_path_err_recovery_time it should
+	 * automatically reinstate the path
+	 */
+	if (pp->path_failures > pp->mpp->san_path_err_threshold) {
+		condlog(2, "%s : hit error threshold. Delaying path reinstatement", pp->dev);
+		pp->dis_reinstate_time = curr_time.tv_sec;
+		pp->disable_reinstate = 1;
+
+		return 1;
+	} else {
+		return 0;
+	}
+
+reinstate_path:
+	pp->path_failures = 0;
+	pp->disable_reinstate = 0;
+	pp->san_path_err_forget_rate = 0;
+	return 0;
+}
+
 /*
  * Returns '1' if the path has been checked, '-1' if it was blacklisted
  * and '0' otherwise
@@ -1980,6 +2058,12 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 	if (!pp->mpp)
 		return 0;
 
+	if ((newstate == PATH_UP || newstate == PATH_GHOST) &&
+			check_path_reinstate_state(pp)) {
+		pp->state = PATH_DELAYED;
+		return 1;
+	}
+
 	if (pp->io_err_disable_reinstate && hit_io_err_recheck_time(pp)) {
 		pp->state = PATH_SHAKY;
 		/*
-- 
2.19.2


* [PATCH 05/19] multipathd: marginal_path overrides san_path_err
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (3 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature" Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 06/19] multipath.conf.5: man page fixes for san_path_err_xy Martin Wilck
                   ` (15 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui
  Cc: Guan Junxiong, dm-devel, M Muneendra Kumar, Martin Wilck

Disable san_path_err_XY if marginal path checking is
enabled. Also warn that san_path_err_XY is deprecated,
and warn if either of the two methods is used in combination
with delay_XY_checks.

Add some minor fixes to the san_path_err code, and a comment
that explains a part of the code that was not immediately obvious
to me.

Cc: Guan Junxiong <guanjunxiong@huawei.com>
Cc: M Muneendra Kumar <mmandala@brocade.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/configure.c | 24 ++++++++++++++------
 libmultipath/propsel.c   | 49 ++++++++++++++++++++++++++++++++--------
 libmultipath/structs.h   | 21 +++++++++++++++++
 multipathd/main.c        | 10 ++++++++
 4 files changed, 88 insertions(+), 16 deletions(-)
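
Reviewer note (not part of the commit): the precedence rules this patch
introduces can be modelled in isolation roughly as below. This is only a
sketch under stated assumptions: struct sketch_map is a hypothetical
subset of struct multipath, SKETCH_NU_NO is an assumed stand-in for NU_NO,
and sketch_resolve() merely counts the combinations for which the patch
logs a warning. It does not call any libmultipath code.

```c
#include <stdbool.h>

/* Assumed stand-in for NU_NO; the real definition lives in libmultipath. */
#define SKETCH_NU_NO (-1)

/* Hypothetical subset of struct multipath, reduced to the fields used here. */
struct sketch_map {
	int marginal_path_double_failed_time;
	int marginal_path_err_sample_time;
	int marginal_path_err_recheck_gap_time;
	int marginal_path_err_rate_threshold;
	int san_path_err_threshold;
	int san_path_err_forget_rate;
	int san_path_err_recovery_time;
	int delay_watch_checks;
	int delay_wait_checks;
};

/* Same conditions as marginal_path_check_enabled() in the patch. */
bool sketch_marginal_enabled(const struct sketch_map *m)
{
	return m->marginal_path_double_failed_time > 0 &&
		m->marginal_path_err_sample_time > 0 &&
		m->marginal_path_err_recheck_gap_time > 0 &&
		m->marginal_path_err_rate_threshold >= 0;
}

/* Same conditions as san_path_check_enabled() in the patch. */
bool sketch_san_enabled(const struct sketch_map *m)
{
	return m->san_path_err_threshold > 0 &&
		m->san_path_err_forget_rate > 0 &&
		m->san_path_err_recovery_time > 0;
}

/* Same conditions as delay_check_enabled() in the patch. */
bool sketch_delay_enabled(const struct sketch_map *m)
{
	return m->delay_watch_checks != SKETCH_NU_NO ||
		m->delay_wait_checks != SKETCH_NU_NO;
}

/*
 * Apply the precedence: marginal_path switches san_path_err off.
 * Returns the number of method combinations that would be warned about.
 */
int sketch_resolve(struct sketch_map *m)
{
	int warnings = 0;

	if (sketch_marginal_enabled(m)) {
		m->san_path_err_threshold = SKETCH_NU_NO;
		m->san_path_err_forget_rate = SKETCH_NU_NO;
		m->san_path_err_recovery_time = SKETCH_NU_NO;
		if (sketch_delay_enabled(m))
			warnings++;
	}
	if (sketch_san_enabled(m) && sketch_delay_enabled(m))
		warnings++;
	return warnings;
}
```

Because the san_path_err values are cleared before they are tested, a map
with both methods configured can never report the second conflict, which
is presumably why the select_san_path_err_* calls are moved after the
select_marginal_path_* calls in setup_map().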

diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 60a98873..5af4a189 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -309,13 +309,13 @@ int setup_map(struct multipath *mpp, char *params, int params_size,
 	select_deferred_remove(conf, mpp);
 	select_delay_watch_checks(conf, mpp);
 	select_delay_wait_checks(conf, mpp);
-	select_san_path_err_threshold(conf, mpp);
-	select_san_path_err_forget_rate(conf, mpp);
-	select_san_path_err_recovery_time(conf, mpp);
 	select_marginal_path_err_sample_time(conf, mpp);
 	select_marginal_path_err_rate_threshold(conf, mpp);
 	select_marginal_path_err_recheck_gap_time(conf, mpp);
 	select_marginal_path_double_failed_time(conf, mpp);
+	select_san_path_err_threshold(conf, mpp);
+	select_san_path_err_forget_rate(conf, mpp);
+	select_san_path_err_recovery_time(conf, mpp);
 	select_skip_kpartx(conf, mpp);
 	select_max_sectors_kb(conf, mpp);
 	select_ghost_delay(conf, mpp);
@@ -324,11 +324,21 @@ int setup_map(struct multipath *mpp, char *params, int params_size,
 	sysfs_set_scsi_tmo(mpp, conf->checkint);
 	pthread_cleanup_pop(1);
 
-	if (mpp->marginal_path_double_failed_time > 0 &&
-	    mpp->marginal_path_err_sample_time > 0 &&
-	    mpp->marginal_path_err_recheck_gap_time > 0 &&
-	    mpp->marginal_path_err_rate_threshold >= 0)
+	if (marginal_path_check_enabled(mpp)) {
+		if (delay_check_enabled(mpp)) {
+			condlog(1, "%s: WARNING: both marginal_path and delay_checks error detection selected",
+				mpp->alias);
+			condlog(0, "%s: unexpected behavior may occur!",
+				mpp->alias);
+		}
 		start_io_err_stat_thread(vecs);
+	}
+	if (san_path_check_enabled(mpp) && delay_check_enabled(mpp)) {
+		condlog(1, "%s: WARNING: both san_path_err and delay_checks error detection selected",
+			mpp->alias);
+		condlog(0, "%s: unexpected behavior may occur!",
+			mpp->alias);
+	}
 	/*
 	 * assign paths to path groups -- start with no groups and all paths
 	 * in mpp->paths
diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index a4d114c0..f5d87786 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -74,6 +74,8 @@ static const char cmdline_origin[] =
 	"(setting: multipath command line [-p] flag)";
 static const char autodetect_origin[] =
 	"(setting: storage device autodetected)";
+static const char marginal_path_origin[] =
+	"(setting: implied by marginal_path check)";
 
 #define do_default(dest, value)						\
 do {									\
@@ -879,20 +881,37 @@ out:
 
 }
 
+static int san_path_deprecated_warned;
+#define warn_san_path_deprecated(v, x)					\
+	do {								\
+		if (v->x > 0 && !san_path_deprecated_warned) {		\
+		san_path_deprecated_warned = 1;				\
+		condlog(1, "WARNING: option %s is deprecated, "		\
+			"please use marginal_path options instead",	\
+			#x);						\
+		}							\
+	} while(0)
+
 int select_san_path_err_threshold(struct config *conf, struct multipath *mp)
 {
 	const char *origin;
 	char buff[12];
 
+	if (marginal_path_check_enabled(mp)) {
+		mp->san_path_err_threshold = NU_NO;
+		origin = marginal_path_origin;
+		goto out;
+	}
 	mp_set_mpe(san_path_err_threshold);
 	mp_set_ovr(san_path_err_threshold);
 	mp_set_hwe(san_path_err_threshold);
 	mp_set_conf(san_path_err_threshold);
 	mp_set_default(san_path_err_threshold, DEFAULT_ERR_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->san_path_err_threshold);
-	condlog(3, "%s: san_path_err_threshold = %s %s", mp->alias, buff,
-		origin);
+	if (print_off_int_undef(buff, 12, mp->san_path_err_threshold) != 0)
+		condlog(3, "%s: san_path_err_threshold = %s %s",
+			mp->alias, buff, origin);
+	warn_san_path_deprecated(mp, san_path_err_threshold);
 	return 0;
 }
 
@@ -901,15 +920,21 @@ int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp)
 	const char *origin;
 	char buff[12];
 
+	if (marginal_path_check_enabled(mp)) {
+		mp->san_path_err_forget_rate = NU_NO;
+		origin = marginal_path_origin;
+		goto out;
+	}
 	mp_set_mpe(san_path_err_forget_rate);
 	mp_set_ovr(san_path_err_forget_rate);
 	mp_set_hwe(san_path_err_forget_rate);
 	mp_set_conf(san_path_err_forget_rate);
 	mp_set_default(san_path_err_forget_rate, DEFAULT_ERR_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->san_path_err_forget_rate);
-	condlog(3, "%s: san_path_err_forget_rate = %s %s", mp->alias,
-		buff, origin);
+	if (print_off_int_undef(buff, 12, mp->san_path_err_forget_rate) != 0)
+		condlog(3, "%s: san_path_err_forget_rate = %s %s", mp->alias,
+			buff, origin);
+	warn_san_path_deprecated(mp, san_path_err_forget_rate);
 	return 0;
 
 }
@@ -919,15 +944,21 @@ int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp)
 	const char *origin;
 	char buff[12];
 
+	if (marginal_path_check_enabled(mp)) {
+		mp->san_path_err_recovery_time = NU_NO;
+		origin = marginal_path_origin;
+		goto out;
+	}
 	mp_set_mpe(san_path_err_recovery_time);
 	mp_set_ovr(san_path_err_recovery_time);
 	mp_set_hwe(san_path_err_recovery_time);
 	mp_set_conf(san_path_err_recovery_time);
 	mp_set_default(san_path_err_recovery_time, DEFAULT_ERR_CHECKS);
 out:
-	print_off_int_undef(buff, 12, mp->san_path_err_recovery_time);
-	condlog(3, "%s: san_path_err_recovery_time = %s %s", mp->alias,
-		buff, origin);
+	if (print_off_int_undef(buff, 12, mp->san_path_err_recovery_time) != 0)
+		condlog(3, "%s: san_path_err_recovery_time = %s %s", mp->alias,
+			buff, origin);
+	warn_san_path_deprecated(mp, san_path_err_recovery_time);
 	return 0;
 
 }
diff --git a/libmultipath/structs.h b/libmultipath/structs.h
index 96df8c8a..375c7284 100644
--- a/libmultipath/structs.h
+++ b/libmultipath/structs.h
@@ -377,6 +377,27 @@ struct multipath {
 	struct gen_multipath generic_mp;
 };
 
+static inline int marginal_path_check_enabled(const struct multipath *mpp)
+{
+	return mpp->marginal_path_double_failed_time > 0 &&
+		mpp->marginal_path_err_sample_time > 0 &&
+		mpp->marginal_path_err_recheck_gap_time > 0 &&
+		mpp->marginal_path_err_rate_threshold >= 0;
+}
+
+static inline int san_path_check_enabled(const struct multipath *mpp)
+{
+	return mpp->san_path_err_threshold > 0 &&
+		mpp->san_path_err_forget_rate > 0 &&
+		mpp->san_path_err_recovery_time > 0;
+}
+
+static inline int delay_check_enabled(const struct multipath *mpp)
+{
+	return mpp->delay_watch_checks != NU_NO ||
+		mpp->delay_wait_checks != NU_NO;
+}
+
 struct pathgroup {
 	long id;
 	int status;
diff --git a/multipathd/main.c b/multipathd/main.c
index 57bb7143..aac32ac8 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -1835,6 +1835,16 @@ int update_path_groups(struct multipath *mpp, struct vectors *vecs, int refresh)
 
 static int check_path_reinstate_state(struct path * pp) {
 	struct timespec curr_time;
+
+	/*
+	 * This function is only called when the path state changes
+	 * from "bad" to "good". pp->state reflects the *previous* state.
+	 * If this was "bad", we know that a failure must have occurred
+	 * beforehand, and count that.
+	 * Note that we count path state _changes_ this way. If a path
+	 * remains in "bad" state, failure count is not increased.
+	 */
+
 	if (!((pp->mpp->san_path_err_threshold > 0) &&
 				(pp->mpp->san_path_err_forget_rate > 0) &&
 				(pp->mpp->san_path_err_recovery_time >0))) {
-- 
2.19.2


* [PATCH 06/19] multipath.conf.5: man page fixes for san_path_err_xy
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (4 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 05/19] multipathd: marginal_path overrides san_path_err Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 07/19] setup_map: wait for pending path checkers to finish Martin Wilck
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui
  Cc: Guan Junxiong, dm-devel, M Muneendra Kumar, Martin Wilck

This adds a paragraph about the san_path_err algorithm to the
"shaky paths" section of the man page.

Cc: Guan Junxiong <guanjunxiong@huawei.com>
Cc: M Muneendra Kumar <mmandala@brocade.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 multipath/multipath.conf.5 | 35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)
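
Reviewer note (not part of the commit): for readers who prefer code to
prose, the counting scheme described in the new man page paragraph can be
sketched as follows. This is a simplified model of
check_path_reinstate_state() from patch 04, not the function itself. The
struct and function names are hypothetical, the "no other usable paths"
and clock-failure shortcuts are left out, and the recovery time is
compared in seconds, as the code in patch 04 does.

```c
/* Hypothetical per-path bookkeeping, mirroring the fields added to struct path. */
struct sketch_tracker {
	unsigned int failures;   /* counted bad-to-good transitions */
	int forget_countdown;    /* good checks left before one failure is forgotten */
	int delayed;             /* nonzero while reinstatement is held back */
	long delayed_since;      /* timestamp in seconds */
};

/* Hypothetical per-map settings (the three san_path_err_* options). */
struct sketch_params {
	int threshold;
	int forget_rate;
	int recovery_time;
};

/*
 * Call once per check in which the checker reports a good state.
 * prev_bad says whether the previous state was a failed one.
 * Returns 1 if the path should stay delayed, 0 if it may be reinstated.
 */
int sketch_on_good_check(struct sketch_tracker *t,
			 const struct sketch_params *p,
			 int prev_bad, long now)
{
	if (t->delayed) {
		if (now - t->delayed_since > p->recovery_time) {
			/* recovery time is over: start from scratch */
			t->failures = 0;
			t->delayed = 0;
			t->forget_countdown = 0;
			return 0;
		}
		return 1;
	}
	if (!prev_bad) {
		/* working path: forget one failure per forget_rate checks */
		if (t->failures > 0) {
			if (t->forget_countdown > 0) {
				t->forget_countdown--;
			} else {
				t->failures--;
				t->forget_countdown = p->forget_rate;
			}
		}
		return 0;
	}
	/* the path just came back from a failed state: count it */
	if (t->failures == 0)
		t->forget_countdown = p->forget_rate;
	t->failures++;
	if (t->failures > (unsigned int)p->threshold) {
		t->delayed = 1;
		t->delayed_since = now;
		return 1;
	}
	return 0;
}
```

With threshold 2, the third counted failure puts the path into the delayed
state, and it stays there until more than recovery_time seconds have
passed.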

diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index 35e6d37c..c7f59147 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -894,10 +894,10 @@ The default is: \fB/etc/multipath/conf.d/\fR
 .B san_path_err_threshold
 If set to a value greater than 0, multipathd will watch paths and check how many
 times a path has been failed due to errors.If the number of failures on a particular
-path is greater then the san_path_err_threshold then the path will not  reinstante
-till san_path_err_recovery_time.These path failures should occur within a
+path is greater than the san_path_err_threshold, then the path will not reinstate
+till san_path_err_recovery_time. These path failures should occur within a
 san_path_err_forget_rate checks, if not we will consider the path is good enough
-to reinstantate.
+to reinstate. See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -909,7 +909,7 @@ The default is: \fBno\fR
 If set to a value greater than 0, multipathd will check whether the path failures
 has exceeded  the san_path_err_threshold within this many checks i.e
 san_path_err_forget_rate . If so we will not reinstante the path till
-san_path_err_recovery_time.
+san_path_err_recovery_time. See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -923,6 +923,7 @@ has exceeded the san_path_err_threshold within san_path_err_forget_rate then the
 will be placed in failed state for san_path_err_recovery_time duration.Once san_path_err_recovery_time
 has timeout  we will reinstante the failed path .
 san_path_err_recovery_time value should be in secs.
+See "Shaky paths detection" below.
 .RS
 .TP
 The default is: \fBno\fR
@@ -1642,14 +1643,14 @@ A common problem in SAN setups is the occurence of intermittent errors: a
 path is unreachable, then reachable again for a short time, disappears again,
 and so forth. This happens typically on unstable interconnects. It is
 undesirable to switch pathgroups unnecessarily on such frequent, unreliable
-events. \fImultipathd\fR supports two different methods for detecting this
+events. \fImultipathd\fR supports three different methods for detecting this
 situation and dealing with it. All methods share the same basic mode of
 operation: If a path is found to be \(dqshaky\(dq or \(dqflipping\(dq,
 and appears to be in healthy status, it is not reinstated (put back to use)
 immediately. Instead, it is watched for some time, and only reinstated
 if the healthy state appears to be stable. The logic of determining
 \(dqshaky\(dq condition, as well as the logic when to reinstate,
-differs between the methods.
+differs between the three methods.
 .TP 8
 .B \(dqdelay_checks\(dq failure tracking
 If a path fails again within a
@@ -1671,14 +1672,30 @@ monitoring period, the path is reinstated. Otherwise, it
 is kept in failed state for \fImarginal_path_err_recheck_gap_time\fR, and
 after that, it is monitored again. For this method, time intervals are measured
 in seconds.
+.TP
+.B \(dqsan_path_err\(dq failure tracking
+multipathd counts path failures for each path. Once the number of failures
+exceeds the value given by \fIsan_path_err_threshold\fR, the path is not
+reinstated for \fIsan_path_err_recovery_time\fR ticks. While counting
+failures, multipathd \(dqforgets\(dq one past failure every
+\(dqsan_path_err_forget_rate\(dq ticks; thus if errors don't occur more
+often than once in the forget rate interval, the failure count doesn't
+increase and the threshold is never reached. As for the \fIdelay_xy\fR method,
+intervals are measured in \(dqticks\(dq.
+.
+.RS 8
+.LP
+This method is \fBdeprecated\fR in favor of the \(dqmarginal_path\(dq failure
+tracking method, and only offered for backward compatibility.
 .
 .RE
 .LP
-.
-See the documentation of the individual options above for details.
+See the documentation
+of the individual options above for details.
 It is \fBstrongly discouraged\fR to use more than one of these methods for any
 given multipath map, because the two concurrent methods may interact in
-unpredictable ways.
+unpredictable ways. If the \(dqmarginal_path\(dq method is active, the
+\(dqsan_path_err\(dq parameters are implicitly set to 0.
 .
 .
 .\" ----------------------------------------------------------------------------
-- 
2.19.2


* [PATCH 07/19] setup_map: wait for pending path checkers to finish
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (5 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 06/19] multipath.conf.5: man page fixes for san_path_err_xy Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 08/19] libmultipath: add ARRAY_SIZE helper Martin Wilck
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

The timeout for synchronous checks is only 1ms when async path
checking is used, which may easily be missed on busy systems.
That's not a problem for the path checker, because it simply
repeats the check one tick later. However, when maps are
set up (e.g. when new paths are detected), it's desirable
to get the number of active paths right. Therefore, wait a bit
longer if path checkers are found pending in setup_map().

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/configure.c | 65 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 63 insertions(+), 2 deletions(-)
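
Reviewer note (not part of the commit): the waiting strategy is a bounded
poll loop with a 1 ms sleep per round. A minimal stand-alone sketch of
that pattern is shown below. The callback type and both function names are
hypothetical; the real wait_for_pending_paths() walks the path groups and
calls get_state() instead of a callback.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Hypothetical callback: returns how many items finished since the last call. */
typedef int (*sketch_poll_fn)(void *ctx);

/*
 * Poll until at most `goal` items are still pending or `wait_ms` rounds
 * of about 1 ms each have passed. Returns 0 once the goal is reached,
 * otherwise the number of items still pending.
 */
int sketch_wait_until_goal(sketch_poll_fn poll, void *ctx,
			   int n_pending, int goal, int wait_ms)
{
	static const struct timespec millisec =
		{ .tv_sec = 0, .tv_nsec = 1000 * 1000 };
	struct timespec ts;

	do {
		n_pending -= poll(ctx);
		if (n_pending <= goal)
			return 0;
		ts = millisec;
		/* restart the sleep with the remaining time if interrupted */
		while (nanosleep(&ts, &ts) != 0 && errno == EINTR)
			/* nothing */;
	} while (--wait_ms > 0);

	return n_pending;
}

/* Example poller for experiments: one item finishes on every call. */
int sketch_finish_one_per_poll(void *ctx)
{
	int *calls = ctx;

	++*calls;
	return 1;
}
```

As in the patch, reaching the goal returns 0 even when the goal itself is
nonzero, so the caller then treats the map as having nothing left to wait
for.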

diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 5af4a189..39d2a956 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -44,6 +44,10 @@
 #include "sysfs.h"
 #include "io_err_stat.h"
 
+/* Time in ms to wait for pending checkers in setup_map() */
+#define WAIT_CHECKERS_PENDING_MS 10
+#define WAIT_ALL_CHECKERS_PENDING_MS 90
+
 /* group paths in pg by host adapter
  */
 int group_by_host_adapter(struct pathgroup *pgp, vector adapters)
@@ -257,12 +261,43 @@ int rr_optimize_path_order(struct pathgroup *pgp)
 	return 0;
 }
 
+static int wait_for_pending_paths(struct multipath *mpp,
+				  struct config *conf,
+				  int n_pending, int goal, int wait_ms)
+{
+	static const struct timespec millisec =
+		{ .tv_sec = 0, .tv_nsec = 1000*1000 };
+	int i, j;
+	struct path *pp;
+	struct pathgroup *pgp;
+	struct timespec ts;
+
+	do {
+		vector_foreach_slot(mpp->pg, pgp, i) {
+			vector_foreach_slot(pgp->paths, pp, j) {
+				if (pp->state != PATH_PENDING)
+					continue;
+				pp->state = get_state(pp, conf,
+						      0, PATH_PENDING);
+				if (pp->state != PATH_PENDING &&
+				    --n_pending <= goal)
+					return 0;
+			}
+		}
+		ts = millisec;
+		while (nanosleep(&ts, &ts) != 0 && errno == EINTR)
+			/* nothing */;
+	} while (--wait_ms > 0);
+
+	return n_pending;
+}
+
 int setup_map(struct multipath *mpp, char *params, int params_size,
 	      struct vectors *vecs)
 {
 	struct pathgroup * pgp;
 	struct config *conf;
-	int i;
+	int i, n_paths;
 
 	/*
 	 * don't bother if devmap size is unknown
@@ -339,7 +374,9 @@ int setup_map(struct multipath *mpp, char *params, int params_size,
 		condlog(0, "%s: unexpected behavior may occur!",
 			mpp->alias);
 	}
-	/*
+
+	n_paths = VECTOR_SIZE(mpp->paths);
+	/*
 	 * assign paths to path groups -- start with no groups and all paths
 	 * in mpp->paths
 	 */
@@ -353,6 +390,30 @@ int setup_map(struct multipath *mpp, char *params, int params_size,
 	if (mpp->pgpolicyfn && mpp->pgpolicyfn(mpp))
 		return 1;
 
+	/*
+	 * If async state detection is used, see if pending state checks
+	 * have finished, to get nr_active right. We can't wait until the
+	 * checkers time out, as that may take 30s or more, and we are
+	 * holding the vecs lock.
+	 */
+	if (conf->force_sync == 0 && n_paths > 0) {
+		int n_pending = pathcount(mpp, PATH_PENDING);
+
+		if (n_pending > 0)
+			n_pending = wait_for_pending_paths(
+				mpp, conf, n_pending, 0,
+				WAIT_CHECKERS_PENDING_MS);
+		/* ALL paths pending - wait some more, but be satisfied
+		   with only some paths finished */
+		if (n_pending == n_paths)
+			n_pending = wait_for_pending_paths(
+				mpp, conf, n_pending,
+				n_paths >= 4 ? 2 : 1,
+				WAIT_ALL_CHECKERS_PENDING_MS);
+		if (n_pending > 0)
+			condlog(2, "%s: setting up map with %d/%d path checkers pending",
+				mpp->alias, n_pending, n_paths);
+	}
 	mpp->nr_active = pathcount(mpp, PATH_UP) + pathcount(mpp, PATH_GHOST);
 
 	/*
-- 
2.19.2


* [PATCH 08/19] libmultipath: add ARRAY_SIZE helper
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (6 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 07/19] setup_map: wait for pending path checkers to finish Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 09/19] libmultipath: make close_fd() a common helper Martin Wilck
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/util.h | 1 +
 tests/hwtable.c     | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/libmultipath/util.h b/libmultipath/util.h
index 1f13c913..dea3fa04 100644
--- a/libmultipath/util.h
+++ b/libmultipath/util.h
@@ -25,6 +25,7 @@ int safe_write(int fd, const void *buf, size_t count);
 void set_max_fds(int max_fds);
 
 #define KERNEL_VERSION(maj, min, ptc) ((((maj) * 256) + (min)) * 256 + (ptc))
+#define ARRAY_SIZE(x) (sizeof(x)/sizeof((x)[0]))
 
 #define safe_sprintf(var, format, args...)	\
 	snprintf(var, sizeof(var), format, ##args) >= sizeof(var)
diff --git a/tests/hwtable.c b/tests/hwtable.c
index 789481ff..ad863b08 100644
--- a/tests/hwtable.c
+++ b/tests/hwtable.c
@@ -24,8 +24,8 @@
 #include "pgpolicies.h"
 #include "test-lib.h"
 #include "print.h"
+#include "util.h"
 
-#define ARRAY_SIZE(x) (sizeof(x)/sizeof((x)[0]))
 #define N_CONF_FILES 2
 
 static const char tmplate[] = "/tmp/hwtable-XXXXXX";
-- 
2.19.2


* [PATCH 09/19] libmultipath: make close_fd() a common helper
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (7 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 08/19] libmultipath: add ARRAY_SIZE helper Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 10/19] libmultipath: restore PG prio in update_multipath_strings Martin Wilck
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

Move close_fd() into util.c, so that libmultipath/sysfs.c and
multipath/main.c no longer each carry their own copy.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/sysfs.c | 5 -----
 libmultipath/util.c  | 5 +++++
 libmultipath/util.h  | 2 ++
 multipath/main.c     | 4 ----
 4 files changed, 7 insertions(+), 9 deletions(-)
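
Reviewer note (not part of the commit): close_fd() takes a void * because
it is meant to be used as a pthread cleanup handler. A minimal usage
sketch follows. sketch_read_first_byte() is a hypothetical caller made up
for illustration; only close_fd() itself is taken from the patch.

```c
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>

/* Same body as the helper moved into util.c by this patch. */
void close_fd(void *arg)
{
	close((long)arg);
}

/*
 * Hypothetical caller: read one byte from `path`. The descriptor is
 * closed by the cleanup handler, both on normal return and if the
 * thread is cancelled while blocked in read().
 * Returns 0 on success, -1 on error.
 */
int sketch_read_first_byte(const char *path, char *out)
{
	long fd = open(path, O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;

	pthread_cleanup_push(close_fd, (void *)fd);
	n = read(fd, out, 1);
	/* a nonzero argument runs the handler now, closing fd */
	pthread_cleanup_pop(1);

	return n == 1 ? 0 : -1;
}
```

pthread_cleanup_push() and pthread_cleanup_pop() must appear as a pair in
the same lexical scope, which is why the sketch has no early return
between them.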

diff --git a/libmultipath/sysfs.c b/libmultipath/sysfs.c
index 558c8d6a..65904d7b 100644
--- a/libmultipath/sysfs.c
+++ b/libmultipath/sysfs.c
@@ -295,11 +295,6 @@ static int select_dm_devs(const struct dirent *di)
 	return fnmatch("dm-*", di->d_name, FNM_FILE_NAME) == 0;
 }
 
-static void close_fd(void *arg)
-{
-	close((long)arg);
-}
-
 bool sysfs_is_multipathed(const struct path *pp)
 {
 	char pathbuf[PATH_MAX];
diff --git a/libmultipath/util.c b/libmultipath/util.c
index 28eb7577..944c632e 100644
--- a/libmultipath/util.c
+++ b/libmultipath/util.c
@@ -506,3 +506,8 @@ void free_scandir_result(struct scandir_result *res)
 		FREE(res->di[i]);
 	FREE(res->di);
 }
+
+void close_fd(void *arg)
+{
+	close((long)arg);
+}
diff --git a/libmultipath/util.h b/libmultipath/util.h
index dea3fa04..1e0d832c 100644
--- a/libmultipath/util.h
+++ b/libmultipath/util.h
@@ -35,6 +35,8 @@ void set_max_fds(int max_fds);
 #define pthread_cleanup_push_cast(f, arg)		\
 	pthread_cleanup_push(((void (*)(void *))&f), (arg))
 
+void close_fd(void *arg);
+
 struct scandir_result {
 	struct dirent **di;
 	int n;
diff --git a/multipath/main.c b/multipath/main.c
index f40c179b..a25e1b4f 100644
--- a/multipath/main.c
+++ b/multipath/main.c
@@ -388,10 +388,6 @@ enum {
 };
 
 static const char shm_find_mp_dir[] = MULTIPATH_SHM_BASE "find_multipaths";
-static void close_fd(void *arg)
-{
-	close((long)arg);
-}
 
 /**
  * find_multipaths_check_timeout(wwid, tmo)
-- 
2.19.2


* [PATCH 10/19] libmultipath: restore PG prio in update_multipath_strings
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (8 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 09/19] libmultipath: make close_fd() a common helper Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 11/19] multipathd: don't check foreign paths every tick Martin Wilck
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

update_multipath_strings() destroys and recreates the
pathgroup vector. This wipes information previously
stored. Restore the path group priorities.

Fixes: efc7407bed65 "libmultipath: don't update path groups when
 printing"
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/structs_vec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/libmultipath/structs_vec.c b/libmultipath/structs_vec.c
index 03e2b978..db5d19da 100644
--- a/libmultipath/structs_vec.c
+++ b/libmultipath/structs_vec.c
@@ -18,6 +18,7 @@
 #include "configure.h"
 #include "libdevmapper.h"
 #include "io_err_stat.h"
+#include "switchgroup.h"
 
 /*
  * creates or updates mpp->paths reading mpp->pg
@@ -261,6 +262,9 @@ void sync_paths(struct multipath *mpp, vector pathvec)
 int
 update_multipath_strings(struct multipath *mpp, vector pathvec, int is_daemon)
 {
+	struct pathgroup *pgp;
+	int i;
+
 	if (!mpp)
 		return 1;
 
@@ -278,6 +282,10 @@ update_multipath_strings(struct multipath *mpp, vector pathvec, int is_daemon)
 	if (update_multipath_status(mpp))
 		return 1;
 
+	vector_foreach_slot(mpp->pg, pgp, i)
+		if (pgp->paths)
+			path_group_prio_update(pgp);
+
 	return 0;
 }
 
-- 
2.19.2


* [PATCH 11/19] multipathd: don't check foreign paths every tick
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (9 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 10/19] libmultipath: restore PG prio in update_multipath_strings Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 12/19] libmultipath: add files from nvme-cli for NVMe support Martin Wilck
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

We don't check native paths on every tick, so don't do it for
foreign paths either. Instead, always use max_checkint as the
check interval for foreign paths.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 multipathd/main.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
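
Reader's note, not part of the patch: the countdown added to checkerloop()
can be modelled in isolation as shown below. struct foreign_ticker and
foreign_tick_step() are hypothetical names used only for this sketch. The
counter logic follows the hunk in this patch, while the check itself is
replaced by a simple call counter.

```c
#include <assert.h>

/* Hypothetical model of the state kept across checkerloop() iterations. */
struct foreign_ticker {
	int foreign_tick;	/* ticks left until the next check, 0 = reload */
	int checks;		/* how often the foreign check would have run */
};

/*
 * One loop iteration. Returns 1 if the foreign check would run on this
 * tick, 0 otherwise. max_checkint must be positive.
 */
static int foreign_tick_step(struct foreign_ticker *t, int max_checkint)
{
	assert(t != 0);
	assert(max_checkint > 0);

	/* Reload the countdown after it has expired (or on first use). */
	if (t->foreign_tick == 0)
		t->foreign_tick = max_checkint;

	/* Run the check only when the countdown reaches zero. */
	if (--t->foreign_tick == 0) {
		t->checks++;
		return 1;
	}
	return 0;
}
```

Starting from a zeroed ticker and max_checkint = 3, the check fires on every
third tick. With max_checkint = 1 it fires on every tick, which matches the
behaviour before this patch.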

diff --git a/multipathd/main.c b/multipathd/main.c
index aac32ac8..c981d437 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -2271,6 +2271,7 @@ checkerloop (void *ap)
 	unsigned int i;
 	struct timespec last_time;
 	struct config *conf;
+	int foreign_tick = 0;
 
 	pthread_cleanup_push(rcu_unregister, NULL);
 	rcu_register_thread();
@@ -2368,7 +2369,15 @@ checkerloop (void *ap)
 						diff_time.tv_sec);
 			}
 		}
-		check_foreign();
+
+		if (foreign_tick == 0) {
+			conf = get_multipath_config();
+			foreign_tick = conf->max_checkint;
+			put_multipath_config(conf);
+		}
+		if (--foreign_tick == 0)
+			check_foreign();
+
 		post_config_state(DAEMON_IDLE);
 		conf = get_multipath_config();
 		strict_timing = conf->strict_timing;
-- 
2.19.2


* [PATCH 12/19] libmultipath: add files from nvme-cli for NVMe support
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (10 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 11/19] multipathd: don't check foreign paths every tick Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 13/19] libmultipath: add wrapper library for nvme ioctls Martin Wilck
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, lijie, dm-devel

Add code from https://github.com/linux-nvme/nvme-cli in the
"libmultipath/nvme" subdirectory.

This code is licensed under GPL v2.0.

The idea is to add this code unmodified and to put a simple
libmultipath wrapper around it, so that upstream nvme-cli bug
fixes can be integrated easily.

Cc: lijie <lijie34@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/nvme/argconfig.h  |   99 +++
 libmultipath/nvme/json.h       |   87 ++
 libmultipath/nvme/linux/nvme.h | 1450 ++++++++++++++++++++++++++++++++
 libmultipath/nvme/nvme-ioctl.c |  869 +++++++++++++++++++
 libmultipath/nvme/nvme-ioctl.h |  139 +++
 libmultipath/nvme/nvme.h       |  163 ++++
 libmultipath/nvme/plugin.h     |   36 +
 7 files changed, 2843 insertions(+)
 create mode 100644 libmultipath/nvme/argconfig.h
 create mode 100644 libmultipath/nvme/json.h
 create mode 100644 libmultipath/nvme/linux/nvme.h
 create mode 100644 libmultipath/nvme/nvme-ioctl.c
 create mode 100644 libmultipath/nvme/nvme-ioctl.h
 create mode 100644 libmultipath/nvme/nvme.h
 create mode 100644 libmultipath/nvme/plugin.h
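
Reader's note, not part of the patch: the ANA definitions in linux/nvme.h
(struct nvme_ana_rsp_hdr, struct nvme_ana_group_desc, enum nvme_ana_state)
describe a log page made of a 16-byte header followed by variable-length
group descriptors. The sketch below walks such a buffer by byte offsets
derived from those struct layouts. ana_state_for_nsid() and the get_le*()
helpers are hypothetical names and not part of nvme-cli or libmultipath.
Masking the state byte to its low four bits is an assumption based on the
NVMe specification treating the upper bits as reserved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ANA_HDR_LEN	16	/* chgcnt (8) + ngrps (2) + reserved (6) */
#define ANA_DESC_LEN	32	/* grpid, nnsids, chgcnt, state, reserved */

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Returns the ANA state of the group that lists nsid, 0 if no group
 * lists it, or -1 if the buffer is too short for what it claims to hold.
 */
static int ana_state_for_nsid(const uint8_t *log, size_t len, uint32_t nsid)
{
	size_t off = ANA_HDR_LEN;
	uint16_t ngrps, g;

	assert(log != NULL);
	if (len < ANA_HDR_LEN)
		return -1;
	ngrps = get_le16(log + 8);

	for (g = 0; g < ngrps; g++) {
		uint32_t nnsids, n;
		uint8_t state;

		if (len - off < ANA_DESC_LEN)
			return -1;
		nnsids = get_le32(log + off + 4);
		state = log[off + 16] & 0x0f;	/* assumed: low nibble only */
		off += ANA_DESC_LEN;

		/* Each descriptor is followed by nnsids 32-bit NSIDs. */
		if ((len - off) / 4 < nnsids)
			return -1;
		for (n = 0; n < nnsids; n++)
			if (get_le32(log + off + 4 * (size_t)n) == nsid)
				return state;
		off += 4 * (size_t)nnsids;
	}
	return 0;
}
```

The return value can be compared against the enum nvme_ana_state constants,
for example NVME_ANA_OPTIMIZED (0x01) or NVME_ANA_NONOPTIMIZED (0x02). If the
log was requested with the RGO flag, the descriptors carry no NSIDs and this
lookup finds nothing.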

diff --git a/libmultipath/nvme/argconfig.h b/libmultipath/nvme/argconfig.h
new file mode 100644
index 00000000..adb192b6
--- /dev/null
+++ b/libmultipath/nvme/argconfig.h
@@ -0,0 +1,99 @@
+////////////////////////////////////////////////////////////////////////
+//
+// Copyright 2014 PMC-Sierra, Inc.
+//
+// This program is free software; you can redistribute it and/or
+// modify it under the terms of the GNU General Public License
+// as published by the Free Software Foundation; either version 2
+// of the License, or (at your option) any later version.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with this program; if not, write to the Free Software
+// Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, USA.
+//
+////////////////////////////////////////////////////////////////////////
+
+////////////////////////////////////////////////////////////////////////
+//
+//   Author: Logan Gunthorpe <logang@deltatee.com>
+//           Logan Gunthorpe
+//
+//   Date:   Oct 23 2014
+//
+//   Description:
+//     Header file for argconfig.c
+//
+////////////////////////////////////////////////////////////////////////
+
+#ifndef argconfig_H
+#define argconfig_H
+
+#include <string.h>
+#include <getopt.h>
+#include <stdarg.h>
+
+enum argconfig_types {
+	CFG_NONE,
+	CFG_STRING,
+	CFG_INT,
+	CFG_SIZE,
+	CFG_LONG,
+	CFG_LONG_SUFFIX,
+	CFG_DOUBLE,
+	CFG_BOOL,
+	CFG_BYTE,
+	CFG_SHORT,
+	CFG_POSITIVE,
+	CFG_INCREMENT,
+	CFG_SUBOPTS,
+	CFG_FILE_A,
+	CFG_FILE_W,
+	CFG_FILE_R,
+	CFG_FILE_AP,
+	CFG_FILE_WP,
+	CFG_FILE_RP,
+};
+
+struct argconfig_commandline_options {
+	const char *option;
+	const char short_option;
+	const char *meta;
+	enum argconfig_types config_type;
+	void *default_value;
+	int argument_type;
+	const char *help;
+};
+
+#define CFG_MAX_SUBOPTS 500
+#define MAX_HELP_FUNC 20
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+typedef void argconfig_help_func();
+void argconfig_append_usage(const char *str);
+void argconfig_print_help(const char *program_desc,
+			  const struct argconfig_commandline_options *options);
+int argconfig_parse(int argc, char *argv[], const char *program_desc,
+		    const struct argconfig_commandline_options *options,
+		    void *config_out, size_t config_size);
+int argconfig_parse_subopt_string(char *string, char **options,
+				  size_t max_options);
+unsigned argconfig_parse_comma_sep_array(char *string, int *ret,
+					 unsigned max_length);
+unsigned argconfig_parse_comma_sep_array_long(char *string,
+					      unsigned long long *ret,
+					      unsigned max_length);
+void argconfig_register_help_func(argconfig_help_func * f);
+
+void print_word_wrapped(const char *s, int indent, int start);
+#ifdef __cplusplus
+}
+#endif
+#endif
diff --git a/libmultipath/nvme/json.h b/libmultipath/nvme/json.h
new file mode 100644
index 00000000..c4ea5316
--- /dev/null
+++ b/libmultipath/nvme/json.h
@@ -0,0 +1,87 @@
+#ifndef __JSON__H
+#define __JSON__H
+
+struct json_object;
+struct json_array;
+struct json_pair;
+
+#define JSON_TYPE_STRING 0
+#define JSON_TYPE_INTEGER 1
+#define JSON_TYPE_FLOAT 2
+#define JSON_TYPE_OBJECT 3
+#define JSON_TYPE_ARRAY 4
+#define JSON_TYPE_UINT 5
+#define JSON_PARENT_TYPE_PAIR 0
+#define JSON_PARENT_TYPE_ARRAY 1
+struct json_value {
+	int type;
+	union {
+		long long integer_number;
+		unsigned long long uint_number;
+		long double float_number;
+		char *string;
+		struct json_object *object;
+		struct json_array *array;
+	};
+	int parent_type;
+	union {
+		struct json_pair *parent_pair;
+		struct json_array *parent_array;
+	};
+};
+
+struct json_array {
+	struct json_value **values;
+	int value_cnt;
+	struct json_value *parent;
+};
+
+struct json_object {
+	struct json_pair **pairs;
+	int pair_cnt;
+	struct json_value *parent;
+};
+
+struct json_pair {
+	char *name;
+	struct json_value *value;
+	struct json_object *parent;
+};
+
+struct json_object *json_create_object(void);
+struct json_array *json_create_array(void);
+
+void json_free_object(struct json_object *obj);
+
+int json_object_add_value_type(struct json_object *obj, const char *name, int type, ...);
+#define json_object_add_value_int(obj, name, val) \
+	json_object_add_value_type((obj), name, JSON_TYPE_INTEGER, (long long) (val))
+#define json_object_add_value_uint(obj, name, val) \
+	json_object_add_value_type((obj), name, JSON_TYPE_UINT, (unsigned long long) (val))
+#define json_object_add_value_float(obj, name, val) \
+	json_object_add_value_type((obj), name, JSON_TYPE_FLOAT, (val))
+#define json_object_add_value_string(obj, name, val) \
+	json_object_add_value_type((obj), name, JSON_TYPE_STRING, (val))
+#define json_object_add_value_object(obj, name, val) \
+	json_object_add_value_type((obj), name, JSON_TYPE_OBJECT, (val))
+#define json_object_add_value_array(obj, name, val) \
+	json_object_add_value_type((obj), name, JSON_TYPE_ARRAY, (val))
+int json_array_add_value_type(struct json_array *array, int type, ...);
+#define json_array_add_value_int(obj, val) \
+	json_array_add_value_type((obj), JSON_TYPE_INTEGER, (val))
+#define json_array_add_value_uint(obj, val) \
+	json_array_add_value_type((obj), JSON_TYPE_UINT, (val))
+#define json_array_add_value_float(obj, val) \
+	json_array_add_value_type((obj), JSON_TYPE_FLOAT, (val))
+#define json_array_add_value_string(obj, val) \
+	json_array_add_value_type((obj), JSON_TYPE_STRING, (val))
+#define json_array_add_value_object(obj, val) \
+	json_array_add_value_type((obj), JSON_TYPE_OBJECT, (val))
+#define json_array_add_value_array(obj, val) \
+	json_array_add_value_type((obj), JSON_TYPE_ARRAY, (val))
+
+#define json_array_last_value_object(obj) \
+	(obj->values[obj->value_cnt - 1]->object)
+
+void json_print_object(struct json_object *obj, void *);
+#endif
diff --git a/libmultipath/nvme/linux/nvme.h b/libmultipath/nvme/linux/nvme.h
new file mode 100644
index 00000000..68000eb8
--- /dev/null
+++ b/libmultipath/nvme/linux/nvme.h
@@ -0,0 +1,1450 @@
+/*
+ * Definitions for the NVM Express interface
+ * Copyright (c) 2011-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _LINUX_NVME_H
+#define _LINUX_NVME_H
+
+#include <linux/types.h>
+#include <linux/uuid.h>
+
+/* NQN names in commands fields specified one size */
+#define NVMF_NQN_FIELD_LEN	256
+
+/* However the max length of a qualified name is another size */
+#define NVMF_NQN_SIZE		223
+
+#define NVMF_TRSVCID_SIZE	32
+#define NVMF_TRADDR_SIZE	256
+#define NVMF_TSAS_SIZE		256
+
+#define NVME_DISC_SUBSYS_NAME	"nqn.2014-08.org.nvmexpress.discovery"
+
+#define NVME_RDMA_IP_PORT	4420
+
+#define NVME_NSID_ALL		0xffffffff
+
+enum nvme_subsys_type {
+	NVME_NQN_DISC	= 1,		/* Discovery type target subsystem */
+	NVME_NQN_NVME	= 2,		/* NVME type target subsystem */
+};
+
+/* Address Family codes for Discovery Log Page entry ADRFAM field */
+enum {
+	NVMF_ADDR_FAMILY_PCI	= 0,	/* PCIe */
+	NVMF_ADDR_FAMILY_IP4	= 1,	/* IP4 */
+	NVMF_ADDR_FAMILY_IP6	= 2,	/* IP6 */
+	NVMF_ADDR_FAMILY_IB	= 3,	/* InfiniBand */
+	NVMF_ADDR_FAMILY_FC	= 4,	/* Fibre Channel */
+};
+
+/* Transport Type codes for Discovery Log Page entry TRTYPE field */
+enum {
+	NVMF_TRTYPE_RDMA	= 1,	/* RDMA */
+	NVMF_TRTYPE_FC		= 2,	/* Fibre Channel */
+	NVMF_TRTYPE_TCP		= 3,	/* TCP */
+	NVMF_TRTYPE_LOOP	= 254,	/* Reserved for host usage */
+	NVMF_TRTYPE_MAX,
+};
+
+/* Transport Requirements codes for Discovery Log Page entry TREQ field */
+enum {
+	NVMF_TREQ_NOT_SPECIFIED	= 0,		/* Not specified */
+	NVMF_TREQ_REQUIRED	= 1,		/* Required */
+	NVMF_TREQ_NOT_REQUIRED	= 2,		/* Not Required */
+	NVMF_TREQ_DISABLE_SQFLOW = (1 << 2),	/* SQ flow control disable supported */
+};
+
+/* RDMA QP Service Type codes for Discovery Log Page entry TSAS
+ * RDMA_QPTYPE field
+ */
+enum {
+	NVMF_RDMA_QPTYPE_CONNECTED	= 1, /* Reliable Connected */
+	NVMF_RDMA_QPTYPE_DATAGRAM	= 2, /* Reliable Datagram */
+};
+
+/* RDMA Provider Type codes for Discovery Log Page entry TSAS
+ * RDMA_PRTYPE field
+ */
+enum {
+	NVMF_RDMA_PRTYPE_NOT_SPECIFIED	= 1, /* No Provider Specified */
+	NVMF_RDMA_PRTYPE_IB		= 2, /* InfiniBand */
+	NVMF_RDMA_PRTYPE_ROCE		= 3, /* InfiniBand RoCE */
+	NVMF_RDMA_PRTYPE_ROCEV2		= 4, /* InfiniBand RoCEV2 */
+	NVMF_RDMA_PRTYPE_IWARP		= 5, /* IWARP */
+};
+
+/* RDMA Connection Management Service Type codes for Discovery Log Page
+ * entry TSAS RDMA_CMS field
+ */
+enum {
+	NVMF_RDMA_CMS_RDMA_CM	= 1, /* Sockets based endpoint addressing */
+};
+
+/* TCP port security type for  Discovery Log Page entry TSAS
+ */
+enum {
+	NVMF_TCP_SECTYPE_NONE	= 0, /* No Security */
+	NVMF_TCP_SECTYPE_TLS	= 1, /* Transport Layer Security */
+};
+
+#define NVME_AQ_DEPTH		32
+#define NVME_NR_AEN_COMMANDS	1
+#define NVME_AQ_BLK_MQ_DEPTH	(NVME_AQ_DEPTH - NVME_NR_AEN_COMMANDS)
+
+/*
+ * Subtract one to leave an empty queue entry for 'Full Queue' condition. See
+ * NVM-Express 1.2 specification, section 4.1.2.
+ */
+#define NVME_AQ_MQ_TAG_DEPTH	(NVME_AQ_BLK_MQ_DEPTH - 1)
+
+enum {
+	NVME_REG_CAP	= 0x0000,	/* Controller Capabilities */
+	NVME_REG_VS	= 0x0008,	/* Version */
+	NVME_REG_INTMS	= 0x000c,	/* Interrupt Mask Set */
+	NVME_REG_INTMC	= 0x0010,	/* Interrupt Mask Clear */
+	NVME_REG_CC	= 0x0014,	/* Controller Configuration */
+	NVME_REG_CSTS	= 0x001c,	/* Controller Status */
+	NVME_REG_NSSR	= 0x0020,	/* NVM Subsystem Reset */
+	NVME_REG_AQA	= 0x0024,	/* Admin Queue Attributes */
+	NVME_REG_ASQ	= 0x0028,	/* Admin SQ Base Address */
+	NVME_REG_ACQ	= 0x0030,	/* Admin CQ Base Address */
+	NVME_REG_CMBLOC = 0x0038,	/* Controller Memory Buffer Location */
+	NVME_REG_CMBSZ	= 0x003c,	/* Controller Memory Buffer Size */
+	NVME_REG_BPINFO	= 0x0040,	/* Boot Partition Information */
+	NVME_REG_BPRSEL	= 0x0044,	/* Boot Partition Read Select */
+	NVME_REG_BPMBL	= 0x0048,	/* Boot Partition Memory Buffer Location */
+	NVME_REG_DBS	= 0x1000,	/* SQ 0 Tail Doorbell */
+};
+
+#define NVME_CAP_MQES(cap)	((cap) & 0xffff)
+#define NVME_CAP_TIMEOUT(cap)	(((cap) >> 24) & 0xff)
+#define NVME_CAP_STRIDE(cap)	(((cap) >> 32) & 0xf)
+#define NVME_CAP_NSSRC(cap)	(((cap) >> 36) & 0x1)
+#define NVME_CAP_MPSMIN(cap)	(((cap) >> 48) & 0xf)
+#define NVME_CAP_MPSMAX(cap)	(((cap) >> 52) & 0xf)
+
+#define NVME_CMB_BIR(cmbloc)	((cmbloc) & 0x7)
+#define NVME_CMB_OFST(cmbloc)	(((cmbloc) >> 12) & 0xfffff)
+#define NVME_CMB_SZ(cmbsz)	(((cmbsz) >> 12) & 0xfffff)
+#define NVME_CMB_SZU(cmbsz)	(((cmbsz) >> 8) & 0xf)
+
+#define NVME_CMB_WDS(cmbsz)	((cmbsz) & 0x10)
+#define NVME_CMB_RDS(cmbsz)	((cmbsz) & 0x8)
+#define NVME_CMB_LISTS(cmbsz)	((cmbsz) & 0x4)
+#define NVME_CMB_CQS(cmbsz)	((cmbsz) & 0x2)
+#define NVME_CMB_SQS(cmbsz)	((cmbsz) & 0x1)
+
+/*
+ * Submission and Completion Queue Entry Sizes for the NVM command set.
+ * (In bytes and specified as a power of two (2^n)).
+ */
+#define NVME_NVM_IOSQES		6
+#define NVME_NVM_IOCQES		4
+
+enum {
+	NVME_CC_ENABLE		= 1 << 0,
+	NVME_CC_CSS_NVM		= 0 << 4,
+	NVME_CC_EN_SHIFT	= 0,
+	NVME_CC_CSS_SHIFT	= 4,
+	NVME_CC_MPS_SHIFT	= 7,
+	NVME_CC_AMS_SHIFT	= 11,
+	NVME_CC_SHN_SHIFT	= 14,
+	NVME_CC_IOSQES_SHIFT	= 16,
+	NVME_CC_IOCQES_SHIFT	= 20,
+	NVME_CC_AMS_RR		= 0 << NVME_CC_AMS_SHIFT,
+	NVME_CC_AMS_WRRU	= 1 << NVME_CC_AMS_SHIFT,
+	NVME_CC_AMS_VS		= 7 << NVME_CC_AMS_SHIFT,
+	NVME_CC_SHN_NONE	= 0 << NVME_CC_SHN_SHIFT,
+	NVME_CC_SHN_NORMAL	= 1 << NVME_CC_SHN_SHIFT,
+	NVME_CC_SHN_ABRUPT	= 2 << NVME_CC_SHN_SHIFT,
+	NVME_CC_SHN_MASK	= 3 << NVME_CC_SHN_SHIFT,
+	NVME_CC_IOSQES		= NVME_NVM_IOSQES << NVME_CC_IOSQES_SHIFT,
+	NVME_CC_IOCQES		= NVME_NVM_IOCQES << NVME_CC_IOCQES_SHIFT,
+	NVME_CSTS_RDY		= 1 << 0,
+	NVME_CSTS_CFS		= 1 << 1,
+	NVME_CSTS_NSSRO		= 1 << 4,
+	NVME_CSTS_PP		= 1 << 5,
+	NVME_CSTS_SHST_NORMAL	= 0 << 2,
+	NVME_CSTS_SHST_OCCUR	= 1 << 2,
+	NVME_CSTS_SHST_CMPLT	= 2 << 2,
+	NVME_CSTS_SHST_MASK	= 3 << 2,
+};
+
+struct nvme_id_power_state {
+	__le16			max_power;	/* centiwatts */
+	__u8			rsvd2;
+	__u8			flags;
+	__le32			entry_lat;	/* microseconds */
+	__le32			exit_lat;	/* microseconds */
+	__u8			read_tput;
+	__u8			read_lat;
+	__u8			write_tput;
+	__u8			write_lat;
+	__le16			idle_power;
+	__u8			idle_scale;
+	__u8			rsvd19;
+	__le16			active_power;
+	__u8			active_work_scale;
+	__u8			rsvd23[9];
+};
+
+enum {
+	NVME_PS_FLAGS_MAX_POWER_SCALE	= 1 << 0,
+	NVME_PS_FLAGS_NON_OP_STATE	= 1 << 1,
+};
+
+struct nvme_id_ctrl {
+	__le16			vid;
+	__le16			ssvid;
+	char			sn[20];
+	char			mn[40];
+	char			fr[8];
+	__u8			rab;
+	__u8			ieee[3];
+	__u8			cmic;
+	__u8			mdts;
+	__le16			cntlid;
+	__le32			ver;
+	__le32			rtd3r;
+	__le32			rtd3e;
+	__le32			oaes;
+	__le32			ctratt;
+	__le16			rrls;
+	__u8			rsvd102[154];
+	__le16			oacs;
+	__u8			acl;
+	__u8			aerl;
+	__u8			frmw;
+	__u8			lpa;
+	__u8			elpe;
+	__u8			npss;
+	__u8			avscc;
+	__u8			apsta;
+	__le16			wctemp;
+	__le16			cctemp;
+	__le16			mtfa;
+	__le32			hmpre;
+	__le32			hmmin;
+	__u8			tnvmcap[16];
+	__u8			unvmcap[16];
+	__le32			rpmbs;
+	__le16			edstt;
+	__u8			dsto;
+	__u8			fwug;
+	__le16			kas;
+	__le16			hctma;
+	__le16			mntmt;
+	__le16			mxtmt;
+	__le32			sanicap;
+	__le32			hmminds;
+	__le16			hmmaxd;
+	__le16			nsetidmax;
+	__u8			rsvd340[2];
+	__u8			anatt;
+	__u8			anacap;
+	__le32			anagrpmax;
+	__le32			nanagrpid;
+	__u8			rsvd352[160];
+	__u8			sqes;
+	__u8			cqes;
+	__le16			maxcmd;
+	__le32			nn;
+	__le16			oncs;
+	__le16			fuses;
+	__u8			fna;
+	__u8			vwc;
+	__le16			awun;
+	__le16			awupf;
+	__u8			nvscc;
+	__u8			nwpc;
+	__le16			acwu;
+	__u8			rsvd534[2];
+	__le32			sgls;
+	__le32			mnan;
+	__u8			rsvd544[224];
+	char			subnqn[256];
+	__u8			rsvd1024[768];
+	__le32			ioccsz;
+	__le32			iorcsz;
+	__le16			icdoff;
+	__u8			ctrattr;
+	__u8			msdbd;
+	__u8			rsvd1804[244];
+	struct nvme_id_power_state	psd[32];
+	__u8			vs[1024];
+};
+
+enum {
+	NVME_CTRL_ONCS_COMPARE			= 1 << 0,
+	NVME_CTRL_ONCS_WRITE_UNCORRECTABLE	= 1 << 1,
+	NVME_CTRL_ONCS_DSM			= 1 << 2,
+	NVME_CTRL_ONCS_WRITE_ZEROES		= 1 << 3,
+	NVME_CTRL_ONCS_TIMESTAMP		= 1 << 6,
+	NVME_CTRL_VWC_PRESENT			= 1 << 0,
+	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
+	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
+	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 8,
+	NVME_CTRL_LPA_CMD_EFFECTS_LOG		= 1 << 1,
+	NVME_CTRL_CTRATT_128_ID			= 1 << 0,
+	NVME_CTRL_CTRATT_NON_OP_PSP		= 1 << 1,
+	NVME_CTRL_CTRATT_NVM_SETS		= 1 << 2,
+	NVME_CTRL_CTRATT_READ_RECV_LVLS		= 1 << 3,
+	NVME_CTRL_CTRATT_ENDURANCE_GROUPS	= 1 << 4,
+	NVME_CTRL_CTRATT_PREDICTABLE_LAT	= 1 << 5,
+};
+
+struct nvme_lbaf {
+	__le16			ms;
+	__u8			ds;
+	__u8			rp;
+};
+
+struct nvme_id_ns {
+	__le64			nsze;
+	__le64			ncap;
+	__le64			nuse;
+	__u8			nsfeat;
+	__u8			nlbaf;
+	__u8			flbas;
+	__u8			mc;
+	__u8			dpc;
+	__u8			dps;
+	__u8			nmic;
+	__u8			rescap;
+	__u8			fpi;
+	__u8			dlfeat;
+	__le16			nawun;
+	__le16			nawupf;
+	__le16			nacwu;
+	__le16			nabsn;
+	__le16			nabo;
+	__le16			nabspf;
+	__le16			noiob;
+	__u8			nvmcap[16];
+	__u8			rsvd64[28];
+	__le32			anagrpid;
+	__u8			rsvd96[3];
+	__u8			nsattr;
+	__le16			nvmsetid;
+	__le16			endgid;
+	__u8			nguid[16];
+	__u8			eui64[8];
+	struct nvme_lbaf	lbaf[16];
+	__u8			rsvd192[192];
+	__u8			vs[3712];
+};
+
+enum {
+	NVME_ID_CNS_NS			= 0x00,
+	NVME_ID_CNS_CTRL		= 0x01,
+	NVME_ID_CNS_NS_ACTIVE_LIST	= 0x02,
+	NVME_ID_CNS_NS_DESC_LIST	= 0x03,
+	NVME_ID_CNS_NVMSET_LIST		= 0x04,
+	NVME_ID_CNS_NS_PRESENT_LIST	= 0x10,
+	NVME_ID_CNS_NS_PRESENT		= 0x11,
+	NVME_ID_CNS_CTRL_NS_LIST	= 0x12,
+	NVME_ID_CNS_CTRL_LIST		= 0x13,
+};
+
+enum {
+	NVME_DIR_IDENTIFY		= 0x00,
+	NVME_DIR_STREAMS		= 0x01,
+	NVME_DIR_SND_ID_OP_ENABLE	= 0x01,
+	NVME_DIR_SND_ST_OP_REL_ID	= 0x01,
+	NVME_DIR_SND_ST_OP_REL_RSC	= 0x02,
+	NVME_DIR_RCV_ID_OP_PARAM	= 0x01,
+	NVME_DIR_RCV_ST_OP_PARAM	= 0x01,
+	NVME_DIR_RCV_ST_OP_STATUS	= 0x02,
+	NVME_DIR_RCV_ST_OP_RESOURCE	= 0x03,
+	NVME_DIR_ENDIR			= 0x01,
+};
+
+enum {
+	NVME_NS_FEAT_THIN	= 1 << 0,
+	NVME_NS_FLBAS_LBA_MASK	= 0xf,
+	NVME_NS_FLBAS_META_EXT	= 0x10,
+	NVME_LBAF_RP_BEST	= 0,
+	NVME_LBAF_RP_BETTER	= 1,
+	NVME_LBAF_RP_GOOD	= 2,
+	NVME_LBAF_RP_DEGRADED	= 3,
+	NVME_NS_DPC_PI_LAST	= 1 << 4,
+	NVME_NS_DPC_PI_FIRST	= 1 << 3,
+	NVME_NS_DPC_PI_TYPE3	= 1 << 2,
+	NVME_NS_DPC_PI_TYPE2	= 1 << 1,
+	NVME_NS_DPC_PI_TYPE1	= 1 << 0,
+	NVME_NS_DPS_PI_FIRST	= 1 << 3,
+	NVME_NS_DPS_PI_MASK	= 0x7,
+	NVME_NS_DPS_PI_TYPE1	= 1,
+	NVME_NS_DPS_PI_TYPE2	= 2,
+	NVME_NS_DPS_PI_TYPE3	= 3,
+};
+
+struct nvme_ns_id_desc {
+	__u8 nidt;
+	__u8 nidl;
+	__le16 reserved;
+};
+
+#define NVME_NIDT_EUI64_LEN	8
+#define NVME_NIDT_NGUID_LEN	16
+#define NVME_NIDT_UUID_LEN	16
+
+enum {
+	NVME_NIDT_EUI64		= 0x01,
+	NVME_NIDT_NGUID		= 0x02,
+	NVME_NIDT_UUID		= 0x03,
+};
+
+#define NVME_MAX_NVMSET		31
+
+struct nvme_nvmset_attr_entry {
+	__le16			id;
+	__le16			endurance_group_id;
+	__u8			rsvd4[4];
+	__le32			random_4k_read_typical;
+	__le32			opt_write_size;
+	__u8			total_nvmset_cap[16];
+	__u8			unalloc_nvmset_cap[16];
+	__u8			rsvd48[80];
+};
+
+struct nvme_id_nvmset {
+	__u8				nid;
+	__u8				rsvd1[127];
+	struct nvme_nvmset_attr_entry	ent[NVME_MAX_NVMSET];
+};
+
+/* Derived from 1.3a Figure 101: Get Log Page – Telemetry Host
+ * -Initiated Log (Log Identifier 07h)
+ */
+struct nvme_telemetry_log_page_hdr {
+	__u8    lpi; /* Log page identifier */
+	__u8    rsvd[4];
+	__u8    iee_oui[3];
+	__u16   dalb1; /* Data area 1 last block */
+	__u16   dalb2; /* Data area 2 last block */
+	__u16   dalb3; /* Data area 3 last block */
+	__u8    rsvd1[368]; /* TODO verify */
+	__u8    ctrlavail; /* Controller initiated data avail?*/
+	__u8    ctrldgn; /* Controller initiated telemetry Data Gen # */
+	__u8    rsnident[128];
+	/* We'll have to double fetch so we can get the header,
+	 * parse dalb1->3 determine how much size we need for the
+	 * log then alloc below. Or just do a secondary non-struct
+	 * allocation.
+	 */
+	__u8    telemetry_dataarea[0];
+};
+
+struct nvme_endurance_group_log {
+	__u32	rsvd0;
+	__u8	avl_spare_threshold;
+	__u8	percent_used;
+	__u8	rsvd6[26];
+	__u8	endurance_estimate[16];
+	__u8	data_units_read[16];
+	__u8	data_units_written[16];
+	__u8	media_units_written[16];
+	__u8	rsvd96[416];
+};
+
+struct nvme_smart_log {
+	__u8			critical_warning;
+	__u8			temperature[2];
+	__u8			avail_spare;
+	__u8			spare_thresh;
+	__u8			percent_used;
+	__u8			rsvd6[26];
+	__u8			data_units_read[16];
+	__u8			data_units_written[16];
+	__u8			host_reads[16];
+	__u8			host_writes[16];
+	__u8			ctrl_busy_time[16];
+	__u8			power_cycles[16];
+	__u8			power_on_hours[16];
+	__u8			unsafe_shutdowns[16];
+	__u8			media_errors[16];
+	__u8			num_err_log_entries[16];
+	__le32			warning_temp_time;
+	__le32			critical_comp_time;
+	__le16			temp_sensor[8];
+	__le32			thm_temp1_trans_count;
+	__le32			thm_temp2_trans_count;
+	__le32			thm_temp1_total_time;
+	__le32			thm_temp2_total_time;
+	__u8			rsvd232[280];
+};
+
+struct nvme_self_test_res {
+	__u8 			device_self_test_status;
+	__u8			segment_num;
+	__u8			valid_diagnostic_info;
+	__u8			rsvd;
+	__le64			power_on_hours;
+	__le32			nsid;
+	__le64			failing_lba;
+	__u8			status_code_type;
+	__u8			status_code;
+	__u8			vendor_specific[2];
+} __attribute__((packed));
+
+struct nvme_self_test_log {
+	__u8                      crnt_dev_selftest_oprn;
+	__u8                      crnt_dev_selftest_compln;
+	__u8                      rsvd[2];
+	struct nvme_self_test_res result[20];
+} __attribute__((packed));
+
+struct nvme_fw_slot_info_log {
+	__u8			afi;
+	__u8			rsvd1[7];
+	__le64			frs[7];
+	__u8			rsvd64[448];
+};
+
+/* NVMe Namespace Write Protect State */
+enum {
+	NVME_NS_NO_WRITE_PROTECT = 0,
+	NVME_NS_WRITE_PROTECT,
+	NVME_NS_WRITE_PROTECT_POWER_CYCLE,
+	NVME_NS_WRITE_PROTECT_PERMANENT,
+};
+
+#define NVME_MAX_CHANGED_NAMESPACES     1024
+
+struct nvme_changed_ns_list_log {
+	__le32			log[NVME_MAX_CHANGED_NAMESPACES];
+};
+
+enum {
+	NVME_CMD_EFFECTS_CSUPP		= 1 << 0,
+	NVME_CMD_EFFECTS_LBCC		= 1 << 1,
+	NVME_CMD_EFFECTS_NCC		= 1 << 2,
+	NVME_CMD_EFFECTS_NIC		= 1 << 3,
+	NVME_CMD_EFFECTS_CCC		= 1 << 4,
+	NVME_CMD_EFFECTS_CSE_MASK	= 3 << 16,
+};
+
+struct nvme_effects_log {
+	__le32 acs[256];
+	__le32 iocs[256];
+	__u8   resv[2048];
+};
+
+enum nvme_ana_state {
+	NVME_ANA_OPTIMIZED		= 0x01,
+	NVME_ANA_NONOPTIMIZED		= 0x02,
+	NVME_ANA_INACCESSIBLE		= 0x03,
+	NVME_ANA_PERSISTENT_LOSS	= 0x04,
+	NVME_ANA_CHANGE			= 0x0f,
+};
+
+struct nvme_ana_group_desc {
+	__le32  grpid;
+	__le32  nnsids;
+	__le64  chgcnt;
+	__u8    state;
+	__u8    rsvd17[15];
+	__le32  nsids[];
+};
+
+/* flag for the log specific field of the ANA log */
+#define NVME_ANA_LOG_RGO   (1 << 0)
+
+struct nvme_ana_rsp_hdr {
+	__le64  chgcnt;
+	__le16  ngrps;
+	__le16  rsvd10[3];
+};
+
+enum {
+	NVME_SMART_CRIT_SPARE		= 1 << 0,
+	NVME_SMART_CRIT_TEMPERATURE	= 1 << 1,
+	NVME_SMART_CRIT_RELIABILITY	= 1 << 2,
+	NVME_SMART_CRIT_MEDIA		= 1 << 3,
+	NVME_SMART_CRIT_VOLATILE_MEMORY	= 1 << 4,
+};
+
+enum {
+	NVME_AER_ERROR			= 0,
+	NVME_AER_SMART			= 1,
+	NVME_AER_CSS			= 6,
+	NVME_AER_VS			= 7,
+	NVME_AER_NOTICE_NS_CHANGED	= 0x0002,
+	NVME_AER_NOTICE_ANA		= 0x0003,
+	NVME_AER_NOTICE_FW_ACT_STARTING = 0x0102,
+};
+
+struct nvme_lba_range_type {
+	__u8			type;
+	__u8			attributes;
+	__u8			rsvd2[14];
+	__u64			slba;
+	__u64			nlb;
+	__u8			guid[16];
+	__u8			rsvd48[16];
+};
+
+enum {
+	NVME_LBART_TYPE_FS	= 0x01,
+	NVME_LBART_TYPE_RAID	= 0x02,
+	NVME_LBART_TYPE_CACHE	= 0x03,
+	NVME_LBART_TYPE_SWAP	= 0x04,
+
+	NVME_LBART_ATTRIB_TEMP	= 1 << 0,
+	NVME_LBART_ATTRIB_HIDE	= 1 << 1,
+};
+
+struct nvme_plm_config {
+	__u16	enable_event;
+	__u8	rsvd2[30];
+	__u64	dtwin_reads_thresh;
+	__u64	dtwin_writes_thresh;
+	__u64	dtwin_time_thresh;
+	__u8	rsvd56[456];
+};
+
+struct nvme_reservation_status {
+	__le32	gen;
+	__u8	rtype;
+	__u8	regctl[2];
+	__u8	resv5[2];
+	__u8	ptpls;
+	__u8	resv10[13];
+	struct {
+		__le16	cntlid;
+		__u8	rcsts;
+		__u8	resv3[5];
+		__le64	hostid;
+		__le64	rkey;
+	} regctl_ds[];
+};
+
+struct nvme_reservation_status_ext {
+	__le32	gen;
+	__u8	rtype;
+	__u8	regctl[2];
+	__u8	resv5[2];
+	__u8	ptpls;
+	__u8	resv10[14];
+	__u8	resv24[40];
+	struct {
+		__le16	cntlid;
+		__u8	rcsts;
+		__u8	resv3[5];
+		__le64	rkey;
+		__u8	hostid[16];
+		__u8	resv32[32];
+	} regctl_eds[];
+};
+
+enum nvme_async_event_type {
+	NVME_AER_TYPE_ERROR	= 0,
+	NVME_AER_TYPE_SMART	= 1,
+	NVME_AER_TYPE_NOTICE	= 2,
+};
+
+/* I/O commands */
+
+enum nvme_opcode {
+	nvme_cmd_flush		= 0x00,
+	nvme_cmd_write		= 0x01,
+	nvme_cmd_read		= 0x02,
+	nvme_cmd_write_uncor	= 0x04,
+	nvme_cmd_compare	= 0x05,
+	nvme_cmd_write_zeroes	= 0x08,
+	nvme_cmd_dsm		= 0x09,
+	nvme_cmd_resv_register	= 0x0d,
+	nvme_cmd_resv_report	= 0x0e,
+	nvme_cmd_resv_acquire	= 0x11,
+	nvme_cmd_resv_release	= 0x15,
+};
+
+/*
+ * Descriptor subtype - lower 4 bits of nvme_(keyed_)sgl_desc identifier
+ *
+ * @NVME_SGL_FMT_ADDRESS:     absolute address of the data block
+ * @NVME_SGL_FMT_OFFSET:      relative offset of the in-capsule data block
+ * @NVME_SGL_FMT_TRANSPORT_A: transport defined format, value 0xA
+ * @NVME_SGL_FMT_INVALIDATE:  RDMA transport specific remote invalidation
+ *                            request subtype
+ */
+enum {
+	NVME_SGL_FMT_ADDRESS		= 0x00,
+	NVME_SGL_FMT_OFFSET		= 0x01,
+	NVME_SGL_FMT_TRANSPORT_A	= 0x0A,
+	NVME_SGL_FMT_INVALIDATE		= 0x0f,
+};
+
+/*
+ * Descriptor type - upper 4 bits of nvme_(keyed_)sgl_desc identifier
+ *
+ * For struct nvme_sgl_desc:
+ *   @NVME_SGL_FMT_DATA_DESC:		data block descriptor
+ *   @NVME_SGL_FMT_SEG_DESC:		sgl segment descriptor
+ *   @NVME_SGL_FMT_LAST_SEG_DESC:	last sgl segment descriptor
+ *
+ * For struct nvme_keyed_sgl_desc:
+ *   @NVME_KEY_SGL_FMT_DATA_DESC:	keyed data block descriptor
+ *
+ * Transport-specific SGL types:
+ *   @NVME_TRANSPORT_SGL_DATA_DESC:	Transport SGL data block descriptor
+ */
+enum {
+	NVME_SGL_FMT_DATA_DESC		= 0x00,
+	NVME_SGL_FMT_SEG_DESC		= 0x02,
+	NVME_SGL_FMT_LAST_SEG_DESC	= 0x03,
+	NVME_KEY_SGL_FMT_DATA_DESC	= 0x04,
+	NVME_TRANSPORT_SGL_DATA_DESC	= 0x05,
+};
+
+struct nvme_sgl_desc {
+	__le64	addr;
+	__le32	length;
+	__u8	rsvd[3];
+	__u8	type;
+};
+
+struct nvme_keyed_sgl_desc {
+	__le64	addr;
+	__u8	length[3];
+	__u8	key[4];
+	__u8	type;
+};
+
+union nvme_data_ptr {
+	struct {
+		__le64	prp1;
+		__le64	prp2;
+	};
+	struct nvme_sgl_desc	sgl;
+	struct nvme_keyed_sgl_desc ksgl;
+};
+
+/*
+ * Lowest two bits of our flags field (FUSE field in the spec):
+ *
+ * @NVME_CMD_FUSE_FIRST:   Fused Operation, first command
+ * @NVME_CMD_FUSE_SECOND:  Fused Operation, second command
+ *
+ * Highest two bits in our flags field (PSDT field in the spec):
+ *
+ * @NVME_CMD_PSDT_SGL_METABUF:	Use SGLS for this transfer,
+ *	If used, MPTR contains addr of single physical buffer (byte aligned).
+ * @NVME_CMD_PSDT_SGL_METASEG:	Use SGLS for this transfer,
+ *	If used, MPTR contains an address of an SGL segment containing
+ *	exactly 1 SGL descriptor (qword aligned).
+ */
+enum {
+	NVME_CMD_FUSE_FIRST	= (1 << 0),
+	NVME_CMD_FUSE_SECOND	= (1 << 1),
+
+	NVME_CMD_SGL_METABUF	= (1 << 6),
+	NVME_CMD_SGL_METASEG	= (1 << 7),
+	NVME_CMD_SGL_ALL	= NVME_CMD_SGL_METABUF | NVME_CMD_SGL_METASEG,
+};
+
+struct nvme_common_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__le32			cdw2[2];
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__le32			cdw10[6];
+};
+
+struct nvme_rw_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2;
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__le64			slba;
+	__le16			length;
+	__le16			control;
+	__le32			dsmgmt;
+	__le32			reftag;
+	__le16			apptag;
+	__le16			appmask;
+};
+
+enum {
+	NVME_RW_LR			= 1 << 15,
+	NVME_RW_FUA			= 1 << 14,
+	NVME_RW_DEAC			= 1 << 9,
+	NVME_RW_DSM_FREQ_UNSPEC		= 0,
+	NVME_RW_DSM_FREQ_TYPICAL	= 1,
+	NVME_RW_DSM_FREQ_RARE		= 2,
+	NVME_RW_DSM_FREQ_READS		= 3,
+	NVME_RW_DSM_FREQ_WRITES		= 4,
+	NVME_RW_DSM_FREQ_RW		= 5,
+	NVME_RW_DSM_FREQ_ONCE		= 6,
+	NVME_RW_DSM_FREQ_PREFETCH	= 7,
+	NVME_RW_DSM_FREQ_TEMP		= 8,
+	NVME_RW_DSM_LATENCY_NONE	= 0 << 4,
+	NVME_RW_DSM_LATENCY_IDLE	= 1 << 4,
+	NVME_RW_DSM_LATENCY_NORM	= 2 << 4,
+	NVME_RW_DSM_LATENCY_LOW		= 3 << 4,
+	NVME_RW_DSM_SEQ_REQ		= 1 << 6,
+	NVME_RW_DSM_COMPRESSED		= 1 << 7,
+	NVME_RW_PRINFO_PRCHK_REF	= 1 << 10,
+	NVME_RW_PRINFO_PRCHK_APP	= 1 << 11,
+	NVME_RW_PRINFO_PRCHK_GUARD	= 1 << 12,
+	NVME_RW_PRINFO_PRACT		= 1 << 13,
+	NVME_RW_DTYPE_STREAMS		= 1 << 4,
+};
+
+struct nvme_dsm_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__le32			nr;
+	__le32			attributes;
+	__u32			rsvd12[4];
+};
+
+enum {
+	NVME_DSMGMT_IDR		= 1 << 0,
+	NVME_DSMGMT_IDW		= 1 << 1,
+	NVME_DSMGMT_AD		= 1 << 2,
+};
+
+#define NVME_DSM_MAX_RANGES	256
+
+struct nvme_dsm_range {
+	__le32			cattr;
+	__le32			nlb;
+	__le64			slba;
+};
+
+struct nvme_write_zeroes_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2;
+	__le64			metadata;
+	union nvme_data_ptr	dptr;
+	__le64			slba;
+	__le16			length;
+	__le16			control;
+	__le32			dsmgmt;
+	__le32			reftag;
+	__le16			apptag;
+	__le16			appmask;
+};
+
+/* Features */
+
+struct nvme_feat_auto_pst {
+	__le64 entries[32];
+};
+
+enum {
+	NVME_HOST_MEM_ENABLE	= (1 << 0),
+	NVME_HOST_MEM_RETURN	= (1 << 1),
+};
+
+/* Admin commands */
+
+enum nvme_admin_opcode {
+	nvme_admin_delete_sq		= 0x00,
+	nvme_admin_create_sq		= 0x01,
+	nvme_admin_get_log_page		= 0x02,
+	nvme_admin_delete_cq		= 0x04,
+	nvme_admin_create_cq		= 0x05,
+	nvme_admin_identify		= 0x06,
+	nvme_admin_abort_cmd		= 0x08,
+	nvme_admin_set_features		= 0x09,
+	nvme_admin_get_features		= 0x0a,
+	nvme_admin_async_event		= 0x0c,
+	nvme_admin_ns_mgmt		= 0x0d,
+	nvme_admin_activate_fw		= 0x10,
+	nvme_admin_download_fw		= 0x11,
+	nvme_admin_dev_self_test	= 0x14,
+	nvme_admin_ns_attach		= 0x15,
+	nvme_admin_keep_alive		= 0x18,
+	nvme_admin_directive_send	= 0x19,
+	nvme_admin_directive_recv	= 0x1a,
+	nvme_admin_virtual_mgmt		= 0x1c,
+	nvme_admin_nvme_mi_send		= 0x1d,
+	nvme_admin_nvme_mi_recv		= 0x1e,
+	nvme_admin_dbbuf		= 0x7C,
+	nvme_admin_format_nvm		= 0x80,
+	nvme_admin_security_send	= 0x81,
+	nvme_admin_security_recv	= 0x82,
+	nvme_admin_sanitize_nvm		= 0x84,
+};
+
+enum {
+	NVME_QUEUE_PHYS_CONTIG	= (1 << 0),
+	NVME_CQ_IRQ_ENABLED	= (1 << 1),
+	NVME_SQ_PRIO_URGENT	= (0 << 1),
+	NVME_SQ_PRIO_HIGH	= (1 << 1),
+	NVME_SQ_PRIO_MEDIUM	= (2 << 1),
+	NVME_SQ_PRIO_LOW	= (3 << 1),
+	NVME_FEAT_ARBITRATION	= 0x01,
+	NVME_FEAT_POWER_MGMT	= 0x02,
+	NVME_FEAT_LBA_RANGE	= 0x03,
+	NVME_FEAT_TEMP_THRESH	= 0x04,
+	NVME_FEAT_ERR_RECOVERY	= 0x05,
+	NVME_FEAT_VOLATILE_WC	= 0x06,
+	NVME_FEAT_NUM_QUEUES	= 0x07,
+	NVME_FEAT_IRQ_COALESCE	= 0x08,
+	NVME_FEAT_IRQ_CONFIG	= 0x09,
+	NVME_FEAT_WRITE_ATOMIC	= 0x0a,
+	NVME_FEAT_ASYNC_EVENT	= 0x0b,
+	NVME_FEAT_AUTO_PST	= 0x0c,
+	NVME_FEAT_HOST_MEM_BUF	= 0x0d,
+	NVME_FEAT_TIMESTAMP	= 0x0e,
+	NVME_FEAT_KATO		= 0x0f,
+	NVME_FEAT_HCTM		= 0x10,
+	NVME_FEAT_NOPSC		= 0x11,
+	NVME_FEAT_RRL		= 0x12,
+	NVME_FEAT_PLM_CONFIG	= 0x13,
+	NVME_FEAT_PLM_WINDOW	= 0x14,
+	NVME_FEAT_SW_PROGRESS	= 0x80,
+	NVME_FEAT_HOST_ID	= 0x81,
+	NVME_FEAT_RESV_MASK	= 0x82,
+	NVME_FEAT_RESV_PERSIST	= 0x83,
+	NVME_FEAT_WRITE_PROTECT	= 0x84,
+	NVME_LOG_ERROR		= 0x01,
+	NVME_LOG_SMART		= 0x02,
+	NVME_LOG_FW_SLOT	= 0x03,
+	NVME_LOG_CHANGED_NS	= 0x04,
+	NVME_LOG_CMD_EFFECTS	= 0x05,
+	NVME_LOG_DEVICE_SELF_TEST = 0x06,
+	NVME_LOG_TELEMETRY_HOST = 0x07,
+	NVME_LOG_TELEMETRY_CTRL = 0x08,
+	NVME_LOG_ENDURANCE_GROUP = 0x09,
+	NVME_LOG_ANA		= 0x0c,
+	NVME_LOG_DISC		= 0x70,
+	NVME_LOG_RESERVATION	= 0x80,
+	NVME_LOG_SANITIZE	= 0x81,
+	NVME_FWACT_REPL		= (0 << 3),
+	NVME_FWACT_REPL_ACTV	= (1 << 3),
+	NVME_FWACT_ACTV		= (2 << 3),
+};
+
+enum {
+	NVME_NO_LOG_LSP       = 0x0,
+	NVME_NO_LOG_LPO       = 0x0,
+	NVME_LOG_ANA_LSP_RGO  = 0x1,
+	NVME_TELEM_LSP_CREATE = 0x1,
+};
+
+/* Sanitize and Sanitize Monitor/Log */
+enum {
+	/* Sanitize */
+	NVME_SANITIZE_NO_DEALLOC	= 0x00000200,
+	NVME_SANITIZE_OIPBP		= 0x00000100,
+	NVME_SANITIZE_OWPASS_SHIFT	= 0x00000004,
+	NVME_SANITIZE_AUSE		= 0x00000008,
+	NVME_SANITIZE_ACT_CRYPTO_ERASE	= 0x00000004,
+	NVME_SANITIZE_ACT_OVERWRITE	= 0x00000003,
+	NVME_SANITIZE_ACT_BLOCK_ERASE	= 0x00000002,
+	NVME_SANITIZE_ACT_EXIT		= 0x00000001,
+
+	/* Sanitize Monitor/Log */
+	NVME_SANITIZE_LOG_DATA_LEN		= 0x0014,
+	NVME_SANITIZE_LOG_GLOBAL_DATA_ERASED	= 0x0100,
+	NVME_SANITIZE_LOG_NUM_CMPLTED_PASS_MASK	= 0x00F8,
+	NVME_SANITIZE_LOG_STATUS_MASK		= 0x0007,
+	NVME_SANITIZE_LOG_NEVER_SANITIZED	= 0x0000,
+	NVME_SANITIZE_LOG_COMPLETED_SUCCESS	= 0x0001,
+	NVME_SANITIZE_LOG_IN_PROGESS		= 0x0002,
+	NVME_SANITIZE_LOG_COMPLETED_FAILED	= 0x0003,
+};
+
+enum {
+	/* Self-test log Validation bits */
+	NVME_SELF_TEST_VALID_NSID	= 1 << 0,
+	NVME_SELF_TEST_VALID_FLBA	= 1 << 1,
+	NVME_SELF_TEST_VALID_SCT	= 1 << 2,
+	NVME_SELF_TEST_VALID_SC		= 1 << 3,
+	NVME_SELF_TEST_REPORTS		= 20,
+};
+
+struct nvme_identify {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			cns;
+	__u8			rsvd3;
+	__le16			ctrlid;
+	__u32			rsvd11[5];
+};
+
+#define NVME_IDENTIFY_DATA_SIZE 4096
+
+struct nvme_features {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__le32			fid;
+	__le32			dword11;
+	__le32                  dword12;
+	__le32                  dword13;
+	__le32                  dword14;
+	__le32                  dword15;
+};
+
+struct nvme_host_mem_buf_desc {
+	__le64			addr;
+	__le32			size;
+	__u32			rsvd;
+};
+
+struct nvme_create_cq {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__u64			rsvd8;
+	__le16			cqid;
+	__le16			qsize;
+	__le16			cq_flags;
+	__le16			irq_vector;
+	__u32			rsvd12[4];
+};
+
+struct nvme_create_sq {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__u64			rsvd8;
+	__le16			sqid;
+	__le16			qsize;
+	__le16			sq_flags;
+	__le16			cqid;
+	__u32			rsvd12[4];
+};
+
+struct nvme_delete_queue {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[9];
+	__le16			qid;
+	__u16			rsvd10;
+	__u32			rsvd11[5];
+};
+
+struct nvme_abort_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[9];
+	__le16			sqid;
+	__u16			cid;
+	__u32			rsvd11[5];
+};
+
+struct nvme_download_firmware {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	union nvme_data_ptr	dptr;
+	__le32			numd;
+	__le32			offset;
+	__u32			rsvd12[4];
+};
+
+struct nvme_format_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[4];
+	__le32			cdw10;
+	__u32			rsvd11[5];
+};
+
+struct nvme_get_log_page_command {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__u8			lid;
+	__u8			lsp;
+	__le16			numdl;
+	__le16			numdu;
+	__u16			rsvd11;
+	__le32			lpol;
+	__le32			lpou;
+	__u32			rsvd14[2];
+};
+
+struct nvme_directive_cmd {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__le32			nsid;
+	__u64			rsvd2[2];
+	union nvme_data_ptr	dptr;
+	__le32			numd;
+	__u8			doper;
+	__u8			dtype;
+	__le16			dspec;
+	__u8			endir;
+	__u8			tdtype;
+	__u16			rsvd15;
+
+	__u32			rsvd16[3];
+};
+
+/* Sanitize Log Page */
+struct nvme_sanitize_log_page {
+	__le16			progress;
+	__le16			status;
+	__le32			cdw10_info;
+	__le32			est_ovrwrt_time;
+	__le32			est_blk_erase_time;
+	__le32			est_crypto_erase_time;
+};
+
+/*
+ * Fabrics subcommands.
+ */
+enum nvmf_fabrics_opcode {
+	nvme_fabrics_command		= 0x7f,
+};
+
+enum nvmf_capsule_command {
+	nvme_fabrics_type_property_set	= 0x00,
+	nvme_fabrics_type_connect	= 0x01,
+	nvme_fabrics_type_property_get	= 0x04,
+};
+
+struct nvmf_common_command {
+	__u8	opcode;
+	__u8	resv1;
+	__u16	command_id;
+	__u8	fctype;
+	__u8	resv2[35];
+	__u8	ts[24];
+};
+
+/*
+ * The legal cntlid range an NVMe Target will provide.
+ * Note that cntlid of value 0 is considered illegal in the fabrics world.
+ * Devices based on earlier specs did not have the subsystem concept;
+ * therefore, those devices had their cntlid value set to 0 as a result.
+ */
+#define NVME_CNTLID_MIN		1
+#define NVME_CNTLID_MAX		0xffef
+#define NVME_CNTLID_DYNAMIC	0xffff
+
+#define MAX_DISC_LOGS	255
+
+/* Discovery log page entry */
+struct nvmf_disc_rsp_page_entry {
+	__u8		trtype;
+	__u8		adrfam;
+	__u8		subtype;
+	__u8		treq;
+	__le16		portid;
+	__le16		cntlid;
+	__le16		asqsz;
+	__u8		resv8[22];
+	char		trsvcid[NVMF_TRSVCID_SIZE];
+	__u8		resv64[192];
+	char		subnqn[NVMF_NQN_FIELD_LEN];
+	char		traddr[NVMF_TRADDR_SIZE];
+	union tsas {
+		char		common[NVMF_TSAS_SIZE];
+		struct rdma {
+			__u8	qptype;
+			__u8	prtype;
+			__u8	cms;
+			__u8	resv3[5];
+			__u16	pkey;
+			__u8	resv10[246];
+		} rdma;
+		struct tcp {
+			__u8	sectype;
+		} tcp;
+	} tsas;
+};
+
+/* Discovery log page header */
+struct nvmf_disc_rsp_page_hdr {
+	__le64		genctr;
+	__le64		numrec;
+	__le16		recfmt;
+	__u8		resv14[1006];
+	struct nvmf_disc_rsp_page_entry entries[0];
+};
+
+struct nvmf_connect_command {
+	__u8		opcode;
+	__u8		resv1;
+	__u16		command_id;
+	__u8		fctype;
+	__u8		resv2[19];
+	union nvme_data_ptr dptr;
+	__le16		recfmt;
+	__le16		qid;
+	__le16		sqsize;
+	__u8		cattr;
+	__u8		resv3;
+	__le32		kato;
+	__u8		resv4[12];
+};
+
+struct nvmf_connect_data {
+	uuid_t		hostid;
+	__le16		cntlid;
+	char		resv4[238];
+	char		subsysnqn[NVMF_NQN_FIELD_LEN];
+	char		hostnqn[NVMF_NQN_FIELD_LEN];
+	char		resv5[256];
+};
+
+struct nvmf_property_set_command {
+	__u8		opcode;
+	__u8		resv1;
+	__u16		command_id;
+	__u8		fctype;
+	__u8		resv2[35];
+	__u8		attrib;
+	__u8		resv3[3];
+	__le32		offset;
+	__le64		value;
+	__u8		resv4[8];
+};
+
+struct nvmf_property_get_command {
+	__u8		opcode;
+	__u8		resv1;
+	__u16		command_id;
+	__u8		fctype;
+	__u8		resv2[35];
+	__u8		attrib;
+	__u8		resv3[3];
+	__le32		offset;
+	__u8		resv4[16];
+};
+
+struct nvme_dbbuf {
+	__u8			opcode;
+	__u8			flags;
+	__u16			command_id;
+	__u32			rsvd1[5];
+	__le64			prp1;
+	__le64			prp2;
+	__u32			rsvd12[6];
+};
+
+struct streams_directive_params {
+	__le16	msl;
+	__le16	nssa;
+	__le16	nsso;
+	__u8	rsvd[10];
+	__le32	sws;
+	__le16	sgs;
+	__le16	nsa;
+	__le16	nso;
+	__u8	rsvd2[6];
+};
+
+struct nvme_command {
+	union {
+		struct nvme_common_command common;
+		struct nvme_rw_command rw;
+		struct nvme_identify identify;
+		struct nvme_features features;
+		struct nvme_create_cq create_cq;
+		struct nvme_create_sq create_sq;
+		struct nvme_delete_queue delete_queue;
+		struct nvme_download_firmware dlfw;
+		struct nvme_format_cmd format;
+		struct nvme_dsm_cmd dsm;
+		struct nvme_write_zeroes_cmd write_zeroes;
+		struct nvme_abort_cmd abort;
+		struct nvme_get_log_page_command get_log_page;
+		struct nvmf_common_command fabrics;
+		struct nvmf_connect_command connect;
+		struct nvmf_property_set_command prop_set;
+		struct nvmf_property_get_command prop_get;
+		struct nvme_dbbuf dbbuf;
+		struct nvme_directive_cmd directive;
+	};
+};
+
+static inline bool nvme_is_write(struct nvme_command *cmd)
+{
+	/*
+	 * What a mess...
+	 *
+	 * Why can't we simply have a Fabrics In and Fabrics out command?
+	 */
+	if (unlikely(cmd->common.opcode == nvme_fabrics_command))
+		return cmd->fabrics.fctype & 1;
+	return cmd->common.opcode & 1;
+}
+
+enum {
+	/*
+	 * Generic Command Status:
+	 */
+	NVME_SC_SUCCESS			= 0x0,
+	NVME_SC_INVALID_OPCODE		= 0x1,
+	NVME_SC_INVALID_FIELD		= 0x2,
+	NVME_SC_CMDID_CONFLICT		= 0x3,
+	NVME_SC_DATA_XFER_ERROR		= 0x4,
+	NVME_SC_POWER_LOSS		= 0x5,
+	NVME_SC_INTERNAL		= 0x6,
+	NVME_SC_ABORT_REQ		= 0x7,
+	NVME_SC_ABORT_QUEUE		= 0x8,
+	NVME_SC_FUSED_FAIL		= 0x9,
+	NVME_SC_FUSED_MISSING		= 0xa,
+	NVME_SC_INVALID_NS		= 0xb,
+	NVME_SC_CMD_SEQ_ERROR		= 0xc,
+	NVME_SC_SGL_INVALID_LAST	= 0xd,
+	NVME_SC_SGL_INVALID_COUNT	= 0xe,
+	NVME_SC_SGL_INVALID_DATA	= 0xf,
+	NVME_SC_SGL_INVALID_METADATA	= 0x10,
+	NVME_SC_SGL_INVALID_TYPE	= 0x11,
+
+	NVME_SC_SGL_INVALID_OFFSET	= 0x16,
+	NVME_SC_SGL_INVALID_SUBTYPE	= 0x17,
+
+	NVME_SC_SANITIZE_FAILED		= 0x1C,
+	NVME_SC_SANITIZE_IN_PROGRESS	= 0x1D,
+
+	NVME_SC_NS_WRITE_PROTECTED	= 0x20,
+
+	NVME_SC_LBA_RANGE		= 0x80,
+	NVME_SC_CAP_EXCEEDED		= 0x81,
+	NVME_SC_NS_NOT_READY		= 0x82,
+	NVME_SC_RESERVATION_CONFLICT	= 0x83,
+
+	/*
+	 * Command Specific Status:
+	 */
+	NVME_SC_CQ_INVALID		= 0x100,
+	NVME_SC_QID_INVALID		= 0x101,
+	NVME_SC_QUEUE_SIZE		= 0x102,
+	NVME_SC_ABORT_LIMIT		= 0x103,
+	NVME_SC_ABORT_MISSING		= 0x104,
+	NVME_SC_ASYNC_LIMIT		= 0x105,
+	NVME_SC_FIRMWARE_SLOT		= 0x106,
+	NVME_SC_FIRMWARE_IMAGE		= 0x107,
+	NVME_SC_INVALID_VECTOR		= 0x108,
+	NVME_SC_INVALID_LOG_PAGE	= 0x109,
+	NVME_SC_INVALID_FORMAT		= 0x10a,
+	NVME_SC_FW_NEEDS_CONV_RESET	= 0x10b,
+	NVME_SC_INVALID_QUEUE		= 0x10c,
+	NVME_SC_FEATURE_NOT_SAVEABLE	= 0x10d,
+	NVME_SC_FEATURE_NOT_CHANGEABLE	= 0x10e,
+	NVME_SC_FEATURE_NOT_PER_NS	= 0x10f,
+	NVME_SC_FW_NEEDS_SUBSYS_RESET	= 0x110,
+	NVME_SC_FW_NEEDS_RESET		= 0x111,
+	NVME_SC_FW_NEEDS_MAX_TIME	= 0x112,
+	NVME_SC_FW_ACIVATE_PROHIBITED	= 0x113,
+	NVME_SC_OVERLAPPING_RANGE	= 0x114,
+	NVME_SC_NS_INSUFFICENT_CAP	= 0x115,
+	NVME_SC_NS_ID_UNAVAILABLE	= 0x116,
+	NVME_SC_NS_ALREADY_ATTACHED	= 0x118,
+	NVME_SC_NS_IS_PRIVATE		= 0x119,
+	NVME_SC_NS_NOT_ATTACHED		= 0x11a,
+	NVME_SC_THIN_PROV_NOT_SUPP	= 0x11b,
+	NVME_SC_CTRL_LIST_INVALID	= 0x11c,
+	NVME_SC_BP_WRITE_PROHIBITED	= 0x11e,
+
+	/*
+	 * I/O Command Set Specific - NVM commands:
+	 */
+	NVME_SC_BAD_ATTRIBUTES		= 0x180,
+	NVME_SC_INVALID_PI		= 0x181,
+	NVME_SC_READ_ONLY		= 0x182,
+	NVME_SC_ONCS_NOT_SUPPORTED	= 0x183,
+
+	/*
+	 * I/O Command Set Specific - Fabrics commands:
+	 */
+	NVME_SC_CONNECT_FORMAT		= 0x180,
+	NVME_SC_CONNECT_CTRL_BUSY	= 0x181,
+	NVME_SC_CONNECT_INVALID_PARAM	= 0x182,
+	NVME_SC_CONNECT_RESTART_DISC	= 0x183,
+	NVME_SC_CONNECT_INVALID_HOST	= 0x184,
+
+	NVME_SC_DISCOVERY_RESTART	= 0x190,
+	NVME_SC_AUTH_REQUIRED		= 0x191,
+
+	/*
+	 * Media and Data Integrity Errors:
+	 */
+	NVME_SC_WRITE_FAULT		= 0x280,
+	NVME_SC_READ_ERROR		= 0x281,
+	NVME_SC_GUARD_CHECK		= 0x282,
+	NVME_SC_APPTAG_CHECK		= 0x283,
+	NVME_SC_REFTAG_CHECK		= 0x284,
+	NVME_SC_COMPARE_FAILED		= 0x285,
+	NVME_SC_ACCESS_DENIED		= 0x286,
+	NVME_SC_UNWRITTEN_BLOCK		= 0x287,
+
+	/*
+	 * Path-related Errors:
+	 */
+	NVME_SC_ANA_PERSISTENT_LOSS	= 0x301,
+	NVME_SC_ANA_INACCESSIBLE	= 0x302,
+	NVME_SC_ANA_TRANSITION		= 0x303,
+
+	NVME_SC_DNR			= 0x4000,
+};
+
+struct nvme_completion {
+	/*
+	 * Used by Admin and Fabrics commands to return data:
+	 */
+	union nvme_result {
+		__le16	u16;
+		__le32	u32;
+		__le64	u64;
+	} result;
+	__le16	sq_head;	/* how much of this queue may be reclaimed */
+	__le16	sq_id;		/* submission queue that generated this entry */
+	__u16	command_id;	/* of the command which completed */
+	__le16	status;		/* did the command fail, and if so, why? */
+};
+
+#define NVME_VS(major, minor, tertiary) \
+	(((major) << 16) | ((minor) << 8) | (tertiary))
+
+#define NVME_MAJOR(ver)		((ver) >> 16)
+#define NVME_MINOR(ver)		(((ver) >> 8) & 0xff)
+#define NVME_TERTIARY(ver)	((ver) & 0xff)
+
+#endif /* _LINUX_NVME_H */
diff --git a/libmultipath/nvme/nvme-ioctl.c b/libmultipath/nvme/nvme-ioctl.c
new file mode 100644
index 00000000..70a16ced
--- /dev/null
+++ b/libmultipath/nvme/nvme-ioctl.c
@@ -0,0 +1,869 @@
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+
+#include <errno.h>
+#include <getopt.h>
+#include <fcntl.h>
+#include <inttypes.h>
+#include <locale.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <math.h>
+
+#include "nvme-ioctl.h"
+
+static int nvme_verify_chr(int fd)
+{
+	static struct stat nvme_stat;
+	int err = fstat(fd, &nvme_stat);
+
+	if (err < 0) {
+		perror("fstat");
+		return errno;
+	}
+	if (!S_ISCHR(nvme_stat.st_mode)) {
+		fprintf(stderr,
+			"Error: requesting reset on non-controller handle\n");
+		return ENOTBLK;
+	}
+	return 0;
+}
+
+int nvme_subsystem_reset(int fd)
+{
+	int ret;
+
+	ret = nvme_verify_chr(fd);
+	if (ret)
+		return ret;
+	return ioctl(fd, NVME_IOCTL_SUBSYS_RESET);
+}
+
+int nvme_reset_controller(int fd)
+{
+	int ret;
+
+	ret = nvme_verify_chr(fd);
+	if (ret)
+		return ret;
+	return ioctl(fd, NVME_IOCTL_RESET);
+}
+
+int nvme_ns_rescan(int fd)
+{
+	int ret;
+
+	ret = nvme_verify_chr(fd);
+	if (ret)
+		return ret;
+	return ioctl(fd, NVME_IOCTL_RESCAN);
+}
+
+int nvme_get_nsid(int fd)
+{
+	static struct stat nvme_stat;
+	int err = fstat(fd, &nvme_stat);
+
+	if (err < 0)
+		return -errno;
+
+	if (!S_ISBLK(nvme_stat.st_mode)) {
+		fprintf(stderr,
+			"Error: requesting namespace-id from non-block device\n");
+		errno = ENOTBLK;
+		return -errno;
+	}
+	return ioctl(fd, NVME_IOCTL_ID);
+}
+
+int nvme_submit_passthru(int fd, unsigned long ioctl_cmd,
+			 struct nvme_passthru_cmd *cmd)
+{
+	return ioctl(fd, ioctl_cmd, cmd);
+}
+
+static int nvme_submit_admin_passthru(int fd, struct nvme_passthru_cmd *cmd)
+{
+	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, cmd);
+}
+
+static int nvme_submit_io_passthru(int fd, struct nvme_passthru_cmd *cmd)
+{
+	return ioctl(fd, NVME_IOCTL_IO_CMD, cmd);
+}
+
+int nvme_passthru(int fd, unsigned long ioctl_cmd, __u8 opcode,
+		  __u8 flags, __u16 rsvd,
+		  __u32 nsid, __u32 cdw2, __u32 cdw3, __u32 cdw10, __u32 cdw11,
+		  __u32 cdw12, __u32 cdw13, __u32 cdw14, __u32 cdw15,
+		  __u32 data_len, void *data, __u32 metadata_len,
+		  void *metadata, __u32 timeout_ms, __u32 *result)
+{
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= opcode,
+		.flags		= flags,
+		.rsvd1		= rsvd,
+		.nsid		= nsid,
+		.cdw2		= cdw2,
+		.cdw3		= cdw3,
+		.metadata	= (__u64)(uintptr_t) metadata,
+		.addr		= (__u64)(uintptr_t) data,
+		.metadata_len	= metadata_len,
+		.data_len	= data_len,
+		.cdw10		= cdw10,
+		.cdw11		= cdw11,
+		.cdw12		= cdw12,
+		.cdw13		= cdw13,
+		.cdw14		= cdw14,
+		.cdw15		= cdw15,
+		.timeout_ms	= timeout_ms,
+		.result		= 0,
+	};
+	int err;
+
+	err = nvme_submit_passthru(fd, ioctl_cmd, &cmd);
+	if (!err && result)
+		*result = cmd.result;
+	return err;
+}
+
+int nvme_io(int fd, __u8 opcode, __u64 slba, __u16 nblocks, __u16 control,
+	    __u32 dsmgmt, __u32 reftag, __u16 apptag, __u16 appmask, void *data,
+	    void *metadata)
+{
+	struct nvme_user_io io = {
+		.opcode		= opcode,
+		.flags		= 0,
+		.control	= control,
+		.nblocks	= nblocks,
+		.rsvd		= 0,
+		.metadata	= (__u64)(uintptr_t) metadata,
+		.addr		= (__u64)(uintptr_t) data,
+		.slba		= slba,
+		.dsmgmt		= dsmgmt,
+		.reftag		= reftag,
+		.appmask	= appmask,
+		.apptag		= apptag,
+	};
+	return ioctl(fd, NVME_IOCTL_SUBMIT_IO, &io);
+}
+
+int nvme_read(int fd, __u64 slba, __u16 nblocks, __u16 control, __u32 dsmgmt,
+	      __u32 reftag, __u16 apptag, __u16 appmask, void *data,
+	      void *metadata)
+{
+	return nvme_io(fd, nvme_cmd_read, slba, nblocks, control, dsmgmt,
+		       reftag, apptag, appmask, data, metadata);
+}
+
+int nvme_write(int fd, __u64 slba, __u16 nblocks, __u16 control, __u32 dsmgmt,
+	       __u32 reftag, __u16 apptag, __u16 appmask, void *data,
+	       void *metadata)
+{
+	return nvme_io(fd, nvme_cmd_write, slba, nblocks, control, dsmgmt,
+		       reftag, apptag, appmask, data, metadata);
+}
+
+int nvme_compare(int fd, __u64 slba, __u16 nblocks, __u16 control, __u32 dsmgmt,
+		 __u32 reftag, __u16 apptag, __u16 appmask, void *data,
+		 void *metadata)
+{
+	return nvme_io(fd, nvme_cmd_compare, slba, nblocks, control, dsmgmt,
+		       reftag, apptag, appmask, data, metadata);
+}
+
+int nvme_passthru_io(int fd, __u8 opcode, __u8 flags, __u16 rsvd,
+		     __u32 nsid, __u32 cdw2, __u32 cdw3, __u32 cdw10,
+		     __u32 cdw11, __u32 cdw12, __u32 cdw13, __u32 cdw14,
+		     __u32 cdw15, __u32 data_len, void *data,
+		     __u32 metadata_len, void *metadata, __u32 timeout_ms)
+{
+	return nvme_passthru(fd, NVME_IOCTL_IO_CMD, opcode, flags, rsvd, nsid,
+			     cdw2, cdw3, cdw10, cdw11, cdw12, cdw13, cdw14,
+			     cdw15, data_len, data, metadata_len, metadata,
+			     timeout_ms, NULL);
+}
+
+int nvme_write_zeros(int fd, __u32 nsid, __u64 slba, __u16 nlb,
+		     __u16 control, __u32 reftag, __u16 apptag, __u16 appmask)
+{
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= nvme_cmd_write_zeroes,
+		.nsid		= nsid,
+		.cdw10		= slba & 0xffffffff,
+		.cdw11		= slba >> 32,
+		.cdw12		= nlb | (control << 16),
+		.cdw14		= reftag,
+		.cdw15		= apptag | (appmask << 16),
+	};
+
+	return nvme_submit_io_passthru(fd, &cmd);
+}
+
+int nvme_write_uncorrectable(int fd, __u32 nsid, __u64 slba, __u16 nlb)
+{
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= nvme_cmd_write_uncor,
+		.nsid		= nsid,
+		.cdw10		= slba & 0xffffffff,
+		.cdw11		= slba >> 32,
+		.cdw12		= nlb,
+	};
+
+	return nvme_submit_io_passthru(fd, &cmd);
+}
+
+int nvme_flush(int fd, __u32 nsid)
+{
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= nvme_cmd_flush,
+		.nsid		= nsid,
+	};
+
+	return nvme_submit_io_passthru(fd, &cmd);
+}
+
+int nvme_dsm(int fd, __u32 nsid, __u32 cdw11, struct nvme_dsm_range *dsm,
+	     __u16 nr_ranges)
+{
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= nvme_cmd_dsm,
+		.nsid		= nsid,
+		.addr		= (__u64)(uintptr_t) dsm,
+		.data_len	= nr_ranges * sizeof(*dsm),
+		.cdw10		= nr_ranges - 1,
+		.cdw11		= cdw11,
+	};
+
+	return nvme_submit_io_passthru(fd, &cmd);
+}
+
+struct nvme_dsm_range *nvme_setup_dsm_range(__u32 *ctx_attrs, __u32 *llbas,
+					    __u64 *slbas, __u16 nr_ranges)
+{
+	int i;
+	struct nvme_dsm_range *dsm = malloc(nr_ranges * sizeof(*dsm));
+
+	if (!dsm) {
+		fprintf(stderr, "malloc: %s\n", strerror(errno));
+		return NULL;
+	}
+	for (i = 0; i < nr_ranges; i++) {
+		dsm[i].cattr = cpu_to_le32(ctx_attrs[i]);
+		dsm[i].nlb = cpu_to_le32(llbas[i]);
+		dsm[i].slba = cpu_to_le64(slbas[i]);
+	}
+	return dsm;
+}
+
+int nvme_resv_acquire(int fd, __u32 nsid, __u8 rtype, __u8 racqa,
+		      bool iekey, __u64 crkey, __u64 nrkey)
+{
+	__le64 payload[2] = { cpu_to_le64(crkey), cpu_to_le64(nrkey) };
+	__u32 cdw10 = (racqa & 0x7) | (iekey ? 1 << 3 : 0) | rtype << 8;
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= nvme_cmd_resv_acquire,
+		.nsid		= nsid,
+		.cdw10		= cdw10,
+		.addr		= (__u64)(uintptr_t) (payload),
+		.data_len	= sizeof(payload),
+	};
+
+	return nvme_submit_io_passthru(fd, &cmd);
+}
+
+int nvme_resv_register(int fd, __u32 nsid, __u8 rrega, __u8 cptpl,
+		       bool iekey, __u64 crkey, __u64 nrkey)
+{
+	__le64 payload[2] = { cpu_to_le64(crkey), cpu_to_le64(nrkey) };
+	__u32 cdw10 = (rrega & 0x7) | (iekey ? 1 << 3 : 0) | (__u32)cptpl << 30;
+
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= nvme_cmd_resv_register,
+		.nsid		= nsid,
+		.cdw10		= cdw10,
+		.addr		= (__u64)(uintptr_t) (payload),
+		.data_len	= sizeof(payload),
+	};
+
+	return nvme_submit_io_passthru(fd, &cmd);
+}
+
+int nvme_resv_release(int fd, __u32 nsid, __u8 rtype, __u8 rrela,
+		      bool iekey, __u64 crkey)
+{
+	__le64 payload[1] = { cpu_to_le64(crkey) };
+	__u32 cdw10 = (rrela & 0x7) | (iekey ? 1 << 3 : 0) | rtype << 8;
+
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= nvme_cmd_resv_release,
+		.nsid		= nsid,
+		.cdw10		= cdw10,
+		.addr		= (__u64)(uintptr_t) (payload),
+		.data_len	= sizeof(payload),
+	};
+
+	return nvme_submit_io_passthru(fd, &cmd);
+}
+
+int nvme_resv_report(int fd, __u32 nsid, __u32 numd, __u32 cdw11, void *data)
+{
+	struct nvme_passthru_cmd cmd = {
+		.opcode		= nvme_cmd_resv_report,
+		.nsid		= nsid,
+		.cdw10		= numd,
+		.cdw11		= cdw11,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= (numd + 1) << 2,
+	};
+
+	return nvme_submit_io_passthru(fd, &cmd);
+}
+
+int nvme_identify13(int fd, __u32 nsid, __u32 cdw10, __u32 cdw11, void *data)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_identify,
+		.nsid		= nsid,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= NVME_IDENTIFY_DATA_SIZE,
+		.cdw10		= cdw10,
+		.cdw11		= cdw11,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_identify(int fd, __u32 nsid, __u32 cdw10, void *data)
+{
+	return nvme_identify13(fd, nsid, cdw10, 0, data);
+}
+
+int nvme_identify_ctrl(int fd, void *data)
+{
+	return nvme_identify(fd, 0, 1, data);
+}
+
+int nvme_identify_ns(int fd, __u32 nsid, bool present, void *data)
+{
+	int cns = present ? NVME_ID_CNS_NS_PRESENT : NVME_ID_CNS_NS;
+
+	return nvme_identify(fd, nsid, cns, data);
+}
+
+int nvme_identify_ns_list(int fd, __u32 nsid, bool all, void *data)
+{
+	int cns = all ? NVME_ID_CNS_NS_PRESENT_LIST : NVME_ID_CNS_NS_ACTIVE_LIST;
+
+	return nvme_identify(fd, nsid, cns, data);
+}
+
+int nvme_identify_ctrl_list(int fd, __u32 nsid, __u16 cntid, void *data)
+{
+	int cns = nsid ? NVME_ID_CNS_CTRL_NS_LIST : NVME_ID_CNS_CTRL_LIST;
+
+	return nvme_identify(fd, nsid, (cntid << 16) | cns, data);
+}
+
+int nvme_identify_ns_descs(int fd, __u32 nsid, void *data)
+{
+
+	return nvme_identify(fd, nsid, NVME_ID_CNS_NS_DESC_LIST, data);
+}
+
+int nvme_identify_nvmset(int fd, __u16 nvmset_id, void *data)
+{
+	return nvme_identify13(fd, 0, NVME_ID_CNS_NVMSET_LIST, nvmset_id, data);
+}
+
+int nvme_get_log13(int fd, __u32 nsid, __u8 log_id, __u8 lsp, __u64 lpo,
+		   __u16 lsi, bool rae, __u32 data_len, void *data)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_get_log_page,
+		.nsid		= nsid,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= data_len,
+	};
+	__u32 numd = (data_len >> 2) - 1;
+	__u16 numdu = numd >> 16, numdl = numd & 0xffff;
+
+	cmd.cdw10 = log_id | (numdl << 16) | (rae ? 1 << 15 : 0);
+	if (lsp)
+		cmd.cdw10 |= lsp << 8;
+
+	cmd.cdw11 = numdu | (lsi << 16);
+	cmd.cdw12 = lpo;
+	cmd.cdw13 = (lpo >> 32);
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+
+}
+
+int nvme_get_log(int fd, __u32 nsid, __u8 log_id, bool rae,
+		 __u32 data_len, void *data)
+{
+	void *ptr = data;
+	__u32 offset = 0, xfer_len = data_len;
+	int ret;
+
+	/*
+	 * 4k is the smallest possible transfer unit, so by
+	 * restricting ourselves to 4k transfers we avoid having
+	 * to check the MDTS value of the controller.
+	 */
+	do {
+		xfer_len = data_len - offset;
+		if (xfer_len > 4096)
+			xfer_len = 4096;
+
+		ret = nvme_get_log13(fd, nsid, log_id, NVME_NO_LOG_LSP,
+				     offset, 0, rae, xfer_len, ptr);
+		if (ret)
+			return ret;
+
+		offset += xfer_len;
+		ptr += xfer_len;
+	} while (offset < data_len);
+
+	return 0;
+}
+
+int nvme_get_telemetry_log(int fd, void *lp, int generate_report,
+			   int ctrl_init, size_t log_page_size, __u64 offset)
+{
+	if (ctrl_init)
+		return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_TELEMETRY_CTRL,
+				      NVME_NO_LOG_LSP, offset,
+				      0, 1, log_page_size, lp);
+	if (generate_report)
+		return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_TELEMETRY_HOST,
+				      NVME_TELEM_LSP_CREATE, offset,
+				      0, 1, log_page_size, lp);
+	else
+		return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_TELEMETRY_HOST,
+				      NVME_NO_LOG_LSP, offset,
+				      0, 1, log_page_size, lp);
+}
+
+int nvme_fw_log(int fd, struct nvme_firmware_log_page *fw_log)
+{
+	return nvme_get_log(fd, NVME_NSID_ALL, NVME_LOG_FW_SLOT, true,
+			sizeof(*fw_log), fw_log);
+}
+
+int nvme_changed_ns_list_log(int fd, struct nvme_changed_ns_list_log *changed_ns_list_log)
+{
+	return nvme_get_log(fd, 0, NVME_LOG_CHANGED_NS, true,
+			sizeof(changed_ns_list_log->log),
+			changed_ns_list_log->log);
+}
+
+int nvme_error_log(int fd, int entries, struct nvme_error_log_page *err_log)
+{
+	return nvme_get_log(fd, NVME_NSID_ALL, NVME_LOG_ERROR, false,
+			entries * sizeof(*err_log), err_log);
+}
+
+int nvme_endurance_log(int fd, __u16 group_id, struct nvme_endurance_group_log *endurance_log)
+{
+	return nvme_get_log13(fd, 0, NVME_LOG_ENDURANCE_GROUP, 0, 0, group_id, 0,
+			sizeof(*endurance_log), endurance_log);
+}
+
+int nvme_smart_log(int fd, __u32 nsid, struct nvme_smart_log *smart_log)
+{
+	return nvme_get_log(fd, nsid, NVME_LOG_SMART, false,
+			sizeof(*smart_log), smart_log);
+}
+
+int nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo)
+{
+	__u64 lpo = 0;
+
+	return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_ANA, rgo, lpo, 0,
+			true, ana_log_len, ana_log);
+}
+
+int nvme_self_test_log(int fd, struct nvme_self_test_log *self_test_log)
+{
+	return nvme_get_log(fd, NVME_NSID_ALL, NVME_LOG_DEVICE_SELF_TEST, false,
+		sizeof(*self_test_log), self_test_log);
+}
+
+int nvme_effects_log(int fd, struct nvme_effects_log_page *effects_log)
+{
+	return nvme_get_log(fd, 0, NVME_LOG_CMD_EFFECTS, false,
+			sizeof(*effects_log), effects_log);
+}
+
+int nvme_discovery_log(int fd, struct nvmf_disc_rsp_page_hdr *log, __u32 size)
+{
+	return nvme_get_log(fd, 0, NVME_LOG_DISC, false, size, log);
+}
+
+int nvme_sanitize_log(int fd, struct nvme_sanitize_log_page *sanitize_log)
+{
+	return nvme_get_log(fd, 0, NVME_LOG_SANITIZE, false,
+			sizeof(*sanitize_log), sanitize_log);
+}
+
+int nvme_feature(int fd, __u8 opcode, __u32 nsid, __u32 cdw10, __u32 cdw11,
+		 __u32 cdw12, __u32 data_len, void *data, __u32 *result)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= opcode,
+		.nsid		= nsid,
+		.cdw10		= cdw10,
+		.cdw11		= cdw11,
+		.cdw12		= cdw12,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= data_len,
+	};
+	int err;
+
+	err = nvme_submit_admin_passthru(fd, &cmd);
+	if (!err && result)
+		*result = cmd.result;
+	return err;
+}
+
+int nvme_set_feature(int fd, __u32 nsid, __u8 fid, __u32 value, __u32 cdw12,
+		     bool save, __u32 data_len, void *data, __u32 *result)
+{
+	__u32 cdw10 = fid | (save ? 1U << 31 : 0);
+
+	return nvme_feature(fd, nvme_admin_set_features, nsid, cdw10, value,
+			    cdw12, data_len, data, result);
+}
+
+static int nvme_property(int fd, __u8 fctype, __le32 off, __le64 *value, __u8 attrib)
+{
+	int err;
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_fabrics_command,
+		.cdw10		= attrib,
+		.cdw11		= off,
+	};
+
+	if (!value) {
+		errno = EINVAL;
+		return -errno;
+	}
+
+	if (fctype == nvme_fabrics_type_property_get) {
+		cmd.nsid = nvme_fabrics_type_property_get;
+	} else if (fctype == nvme_fabrics_type_property_set) {
+		cmd.nsid = nvme_fabrics_type_property_set;
+		cmd.cdw12 = *value;
+	} else {
+		errno = EINVAL;
+		return -errno;
+	}
+
+	err = nvme_submit_admin_passthru(fd, &cmd);
+	if (!err && fctype == nvme_fabrics_type_property_get)
+		*value = cpu_to_le64(cmd.result);
+	return err;
+}
+
+static int get_property_helper(int fd, int offset, void *value, int *advance)
+{
+	__le64 value64;
+	int err = -EINVAL;
+
+	switch (offset) {
+	case NVME_REG_CAP:
+	case NVME_REG_ASQ:
+	case NVME_REG_ACQ:
+		*advance = 8;
+		break;
+	default:
+		*advance = 4;
+	}
+
+	if (!value)
+		return err;
+
+	err = nvme_property(fd, nvme_fabrics_type_property_get,
+			cpu_to_le32(offset), &value64, (*advance == 8));
+
+	if (!err) {
+		if (*advance == 8)
+			*((uint64_t *)value) = le64_to_cpu(value64);
+		else
+			*((uint32_t *)value) = le32_to_cpu(value64);
+	}
+
+	return err;
+}
+
+int nvme_get_property(int fd, int offset, uint64_t *value)
+{
+	int advance;
+	return get_property_helper(fd, offset, value, &advance);
+}
+
+int nvme_get_properties(int fd, void **pbar)
+{
+	int offset, advance;
+	int err, ret = -EINVAL;
+	int size = getpagesize();
+
+	*pbar = malloc(size);
+	if (!*pbar) {
+		fprintf(stderr, "malloc: %s\n", strerror(errno));
+		return -ENOMEM;
+	}
+
+	memset(*pbar, 0xff, size);
+	for (offset = NVME_REG_CAP; offset <= NVME_REG_CMBSZ; offset += advance) {
+		err = get_property_helper(fd, offset, *pbar + offset, &advance);
+		if (!err)
+			ret = 0;
+	}
+
+	return ret;
+}
+
+int nvme_set_property(int fd, int offset, int value)
+{
+	__le64 val = cpu_to_le64(value);
+	__le32 off = cpu_to_le32(offset);
+	bool is64bit;
+
+	switch (offset) {
+	case NVME_REG_CAP:
+	case NVME_REG_ASQ:
+	case NVME_REG_ACQ:
+		is64bit = true;
+		break;
+	default:
+		is64bit = false;
+	}
+
+	return nvme_property(fd, nvme_fabrics_type_property_set,
+			off, &val, is64bit ? 1: 0);
+}
+
+int nvme_get_feature(int fd, __u32 nsid, __u8 fid, __u8 sel, __u32 cdw11,
+		     __u32 data_len, void *data, __u32 *result)
+{
+	__u32 cdw10 = fid | sel << 8;
+
+	return nvme_feature(fd, nvme_admin_get_features, nsid, cdw10, cdw11,
+			    0, data_len, data, result);
+}
+
+int nvme_format(int fd, __u32 nsid, __u8 lbaf, __u8 ses, __u8 pi,
+		__u8 pil, __u8 ms, __u32 timeout)
+{
+	__u32 cdw10 = lbaf | ms << 4 | pi << 5 | pil << 8 | ses << 9;
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_format_nvm,
+		.nsid		= nsid,
+		.cdw10		= cdw10,
+		.timeout_ms	= timeout,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_ns_create(int fd, __u64 nsze, __u64 ncap, __u8 flbas,
+		   __u8 dps, __u8 nmic, __u32 *result)
+{
+	struct nvme_id_ns ns = {
+		.nsze		= cpu_to_le64(nsze),
+		.ncap		= cpu_to_le64(ncap),
+		.flbas		= flbas,
+		.dps		= dps,
+		.nmic		= nmic,
+	};
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_ns_mgmt,
+		.addr		= (__u64)(uintptr_t) ((void *)&ns),
+		.cdw10		= 0,
+		.data_len	= 0x1000,
+	};
+	int err;
+
+	err = nvme_submit_admin_passthru(fd, &cmd);
+	if (!err && result)
+		*result = cmd.result;
+	return err;
+}
+
+int nvme_ns_delete(int fd, __u32 nsid)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_ns_mgmt,
+		.nsid		= nsid,
+		.cdw10		= 1,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_ns_attachment(int fd, __u32 nsid, __u16 num_ctrls, __u16 *ctrlist,
+		       bool attach)
+{
+	int i;
+	__u8 buf[0x1000];
+	struct nvme_controller_list *cntlist =
+					(struct nvme_controller_list *)buf;
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_ns_attach,
+		.nsid		= nsid,
+		.addr		= (__u64)(uintptr_t) cntlist,
+		.cdw10		= attach ? 0 : 1,
+		.data_len	= 0x1000,
+	};
+
+	memset(buf, 0, sizeof(buf));
+	cntlist->num = cpu_to_le16(num_ctrls);
+	for (i = 0; i < num_ctrls; i++)
+		cntlist->identifier[i] = cpu_to_le16(ctrlist[i]);
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_ns_attach_ctrls(int fd, __u32 nsid, __u16 num_ctrls, __u16 *ctrlist)
+{
+	return nvme_ns_attachment(fd, nsid, num_ctrls, ctrlist, true);
+}
+
+int nvme_ns_detach_ctrls(int fd, __u32 nsid, __u16 num_ctrls, __u16 *ctrlist)
+{
+	return nvme_ns_attachment(fd, nsid, num_ctrls, ctrlist, false);
+}
+
+int nvme_fw_download(int fd, __u32 offset, __u32 data_len, void *data)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_download_fw,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= data_len,
+		.cdw10		= (data_len >> 2) - 1,
+		.cdw11		= offset >> 2,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_fw_commit(int fd, __u8 slot, __u8 action, __u8 bpid)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_activate_fw,
+		.cdw10		= ((__u32)bpid << 31) | (action << 3) | slot,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_sec_send(int fd, __u32 nsid, __u8 nssf, __u16 spsp,
+		  __u8 secp, __u32 tl, __u32 data_len, void *data, __u32 *result)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_security_send,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= data_len,
+		.nsid		= nsid,
+		.cdw10		= secp << 24 | spsp << 8 | nssf,
+		.cdw11		= tl,
+	};
+	int err;
+
+	err = nvme_submit_admin_passthru(fd, &cmd);
+	if (!err && result)
+		*result = cmd.result;
+	return err;
+}
+
+int nvme_sec_recv(int fd, __u32 nsid, __u8 nssf, __u16 spsp,
+		  __u8 secp, __u32 al, __u32 data_len, void *data, __u32 *result)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_security_recv,
+		.nsid		= nsid,
+		.cdw10		= secp << 24 | spsp << 8 | nssf,
+		.cdw11		= al,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= data_len,
+	};
+	int err;
+
+	err = nvme_submit_admin_passthru(fd, &cmd);
+	if (!err && result)
+		*result = cmd.result;
+	return err;
+}
+
+int nvme_dir_send(int fd, __u32 nsid, __u16 dspec, __u8 dtype, __u8 doper,
+                  __u32 data_len, __u32 dw12, void *data, __u32 *result)
+{
+        struct nvme_admin_cmd cmd = {
+                .opcode         = nvme_admin_directive_send,
+                .addr           = (__u64)(uintptr_t) data,
+                .data_len       = data_len,
+                .nsid           = nsid,
+                .cdw10          = data_len? (data_len >> 2) - 1 : 0,
+                .cdw11          = dspec << 16 | dtype << 8 | doper,
+                .cdw12          = dw12,
+        };
+        int err;
+
+        err = nvme_submit_admin_passthru(fd, &cmd);
+        if (!err && result)
+                *result = cmd.result;
+        return err;
+}
+
+int nvme_dir_recv(int fd, __u32 nsid, __u16 dspec, __u8 dtype, __u8 doper,
+                  __u32 data_len, __u32 dw12, void *data, __u32 *result)
+{
+        struct nvme_admin_cmd cmd = {
+                .opcode         = nvme_admin_directive_recv,
+                .addr           = (__u64)(uintptr_t) data,
+                .data_len       = data_len,
+                .nsid           = nsid,
+                .cdw10          = data_len? (data_len >> 2) - 1 : 0,
+                .cdw11          = dspec << 16 | dtype << 8 | doper,
+                .cdw12          = dw12,
+        };
+        int err;
+
+        err = nvme_submit_admin_passthru(fd, &cmd);
+        if (!err && result)
+                *result = cmd.result;
+        return err;
+}
+
+int nvme_sanitize(int fd, __u8 sanact, __u8 ause, __u8 owpass, __u8 oipbp,
+		  __u8 no_dealloc, __u32 ovrpat)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_sanitize_nvm,
+		.cdw10		= no_dealloc << 9 | oipbp << 8 |
+				  owpass << NVME_SANITIZE_OWPASS_SHIFT |
+				  ause << 3 | sanact,
+		.cdw11		= ovrpat,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_self_test_start(int fd, __u32 nsid, __u32 cdw10)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode = nvme_admin_dev_self_test,
+		.nsid = nsid,
+		.cdw10 = cdw10,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
diff --git a/libmultipath/nvme/nvme-ioctl.h b/libmultipath/nvme/nvme-ioctl.h
new file mode 100644
index 00000000..3fb740c3
--- /dev/null
+++ b/libmultipath/nvme/nvme-ioctl.h
@@ -0,0 +1,139 @@
+#ifndef _NVME_LIB_H
+#define _NVME_LIB_H
+
+#include <linux/types.h>
+#include <stdbool.h>
+#include "linux/nvme_ioctl.h"
+#include "nvme.h"
+
+int nvme_get_nsid(int fd);
+
+/* Generic passthrough */
+int nvme_submit_passthru(int fd, unsigned long ioctl_cmd,
+			 struct nvme_passthru_cmd *cmd);
+
+int nvme_passthru(int fd, unsigned long ioctl_cmd, __u8 opcode, __u8 flags,
+		  __u16 rsvd, __u32 nsid, __u32 cdw2, __u32 cdw3,
+		  __u32 cdw10, __u32 cdw11, __u32 cdw12,
+		  __u32 cdw13, __u32 cdw14, __u32 cdw15,
+		  __u32 data_len, void *data, __u32 metadata_len,
+		  void *metadata, __u32 timeout_ms, __u32 *result);
+
+/* NVME_SUBMIT_IO */
+int nvme_io(int fd, __u8 opcode, __u64 slba, __u16 nblocks, __u16 control,
+	      __u32 dsmgmt, __u32 reftag, __u16 apptag,
+	      __u16 appmask, void *data, void *metadata);
+
+int nvme_read(int fd, __u64 slba, __u16 nblocks, __u16 control,
+	      __u32 dsmgmt, __u32 reftag, __u16 apptag,
+	      __u16 appmask, void *data, void *metadata);
+
+int nvme_write(int fd, __u64 slba, __u16 nblocks, __u16 control,
+	       __u32 dsmgmt, __u32 reftag, __u16 apptag,
+	       __u16 appmask, void *data, void *metadata);
+
+int nvme_compare(int fd, __u64 slba, __u16 nblocks, __u16 control,
+		 __u32 dsmgmt, __u32 reftag, __u16 apptag,
+		 __u16 appmask, void *data, void *metadata);
+
+/* NVME_IO_CMD */
+int nvme_passthru_io(int fd, __u8 opcode, __u8 flags, __u16 rsvd,
+		     __u32 nsid, __u32 cdw2, __u32 cdw3,
+		     __u32 cdw10, __u32 cdw11, __u32 cdw12,
+		     __u32 cdw13, __u32 cdw14, __u32 cdw15,
+		     __u32 data_len, void *data, __u32 metadata_len,
+		     void *metadata, __u32 timeout);
+
+int nvme_write_zeros(int fd, __u32 nsid, __u64 slba, __u16 nlb,
+		     __u16 control, __u32 reftag, __u16 apptag, __u16 appmask);
+
+int nvme_write_uncorrectable(int fd, __u32 nsid, __u64 slba, __u16 nlb);
+
+int nvme_flush(int fd, __u32 nsid);
+
+int nvme_dsm(int fd, __u32 nsid, __u32 cdw11, struct nvme_dsm_range *dsm,
+	     __u16 nr_ranges);
+struct nvme_dsm_range *nvme_setup_dsm_range(__u32 *ctx_attrs,
+					    __u32 *llbas, __u64 *slbas,
+					    __u16 nr_ranges);
+
+int nvme_resv_acquire(int fd, __u32 nsid, __u8 rtype, __u8 racqa,
+		      bool iekey, __u64 crkey, __u64 nrkey);
+int nvme_resv_register(int fd, __u32 nsid, __u8 rrega, __u8 cptpl,
+		       bool iekey, __u64 crkey, __u64 nrkey);
+int nvme_resv_release(int fd, __u32 nsid, __u8 rtype, __u8 rrela,
+		      bool iekey, __u64 crkey);
+int nvme_resv_report(int fd, __u32 nsid, __u32 numd, __u32 cdw11, void *data);
+
+int nvme_identify13(int fd, __u32 nsid, __u32 cdw10, __u32 cdw11, void *data);
+int nvme_identify(int fd, __u32 nsid, __u32 cdw10, void *data);
+int nvme_identify_ctrl(int fd, void *data);
+int nvme_identify_ns(int fd, __u32 nsid, bool present, void *data);
+int nvme_identify_ns_list(int fd, __u32 nsid, bool all, void *data);
+int nvme_identify_ctrl_list(int fd, __u32 nsid, __u16 cntid, void *data);
+int nvme_identify_ns_descs(int fd, __u32 nsid, void *data);
+int nvme_identify_nvmset(int fd, __u16 nvmset_id, void *data);
+int nvme_get_log13(int fd, __u32 nsid, __u8 log_id, __u8 lsp, __u64 lpo,
+		   __u16 group_id, bool rae, __u32 data_len, void *data);
+int nvme_get_log(int fd, __u32 nsid, __u8 log_id, bool rae,
+		 __u32 data_len, void *data);
+
+
+int nvme_get_telemetry_log(int fd, void *lp, int generate_report,
+			   int ctrl_gen, size_t log_page_size, __u64 offset);
+int nvme_fw_log(int fd, struct nvme_firmware_log_page *fw_log);
+int nvme_changed_ns_list_log(int fd,
+		struct nvme_changed_ns_list_log *changed_ns_list_log);
+int nvme_error_log(int fd, int entries, struct nvme_error_log_page *err_log);
+int nvme_smart_log(int fd, __u32 nsid, struct nvme_smart_log *smart_log);
+int nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo);
+int nvme_effects_log(int fd, struct nvme_effects_log_page *effects_log);
+int nvme_discovery_log(int fd, struct nvmf_disc_rsp_page_hdr *log, __u32 size);
+int nvme_sanitize_log(int fd, struct nvme_sanitize_log_page *sanitize_log);
+int nvme_endurance_log(int fd, __u16 group_id,
+		       struct nvme_endurance_group_log *endurance_log);
+
+int nvme_feature(int fd, __u8 opcode, __u32 nsid, __u32 cdw10,
+		 __u32 cdw11, __u32 cdw12, __u32 data_len, void *data,
+		 __u32 *result);
+int nvme_set_feature(int fd, __u32 nsid, __u8 fid, __u32 value, __u32 cdw12,
+		     bool save, __u32 data_len, void *data, __u32 *result);
+int nvme_get_feature(int fd, __u32 nsid, __u8 fid, __u8 sel,
+		     __u32 cdw11, __u32 data_len, void *data, __u32 *result);
+
+int nvme_format(int fd, __u32 nsid, __u8 lbaf, __u8 ses, __u8 pi,
+		__u8 pil, __u8 ms, __u32 timeout);
+
+int nvme_ns_create(int fd, __u64 nsze, __u64 ncap, __u8 flbas,
+		   __u8 dps, __u8 nmic, __u32 *result);
+int nvme_ns_delete(int fd, __u32 nsid);
+
+int nvme_ns_attachment(int fd, __u32 nsid, __u16 num_ctrls,
+		       __u16 *ctrlist, bool attach);
+int nvme_ns_attach_ctrls(int fd, __u32 nsid, __u16 num_ctrls, __u16 *ctrlist);
+int nvme_ns_detach_ctrls(int fd, __u32 nsid, __u16 num_ctrls, __u16 *ctrlist);
+
+int nvme_fw_download(int fd, __u32 offset, __u32 data_len, void *data);
+int nvme_fw_commit(int fd, __u8 slot, __u8 action, __u8 bpid);
+
+int nvme_sec_send(int fd, __u32 nsid, __u8 nssf, __u16 spsp,
+		  __u8 secp, __u32 tl, __u32 data_len, void *data, __u32 *result);
+int nvme_sec_recv(int fd, __u32 nsid, __u8 nssf, __u16 spsp,
+		  __u8 secp, __u32 al, __u32 data_len, void *data, __u32 *result);
+
+int nvme_subsystem_reset(int fd);
+int nvme_reset_controller(int fd);
+int nvme_ns_rescan(int fd);
+
+int nvme_dir_send(int fd, __u32 nsid, __u16 dspec, __u8 dtype, __u8 doper,
+		  __u32 data_len, __u32 dw12, void *data, __u32 *result);
+int nvme_dir_recv(int fd, __u32 nsid, __u16 dspec, __u8 dtype, __u8 doper,
+		  __u32 data_len, __u32 dw12, void *data, __u32 *result);
+int nvme_get_properties(int fd, void **pbar);
+int nvme_set_property(int fd, int offset, int value);
+int nvme_get_property(int fd, int offset, uint64_t *value);
+int nvme_sanitize(int fd, __u8 sanact, __u8 ause, __u8 owpass, __u8 oipbp,
+		  __u8 no_dealloc, __u32 ovrpat);
+int nvme_self_test_start(int fd, __u32 nsid, __u32 cdw10);
+int nvme_self_test_log(int fd, struct nvme_self_test_log *self_test_log);
+#endif				/* _NVME_LIB_H */
diff --git a/libmultipath/nvme/nvme.h b/libmultipath/nvme/nvme.h
new file mode 100644
index 00000000..685d1799
--- /dev/null
+++ b/libmultipath/nvme/nvme.h
@@ -0,0 +1,163 @@
+/*
+ * Definitions for the NVM Express interface
+ * Copyright (c) 2011-2014, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#ifndef _NVME_H
+#define _NVME_H
+
+#include <stdbool.h>
+#include <stdint.h>
+#include <endian.h>
+#include "plugin.h"
+#include "json.h"
+
+#define unlikely(x) x
+
+#ifdef LIBUUID
+#include <uuid/uuid.h>
+#else
+typedef struct {
+	uint8_t b[16];
+} uuid_t;
+#endif
+
+#include "linux/nvme.h"
+
+struct nvme_effects_log_page {
+	__le32 acs[256];
+	__le32 iocs[256];
+	__u8   resv[2048];
+};
+
+struct nvme_error_log_page {
+	__u64	error_count;
+	__u16	sqid;
+	__u16	cmdid;
+	__u16	status_field;
+	__u16	parm_error_location;
+	__u64	lba;
+	__u32	nsid;
+	__u8	vs;
+	__u8	resv[3];
+	__u64	cs;
+	__u8	resv2[24];
+};
+
+struct nvme_firmware_log_page {
+	__u8	afi;
+	__u8	resv[7];
+	__u64	frs[7];
+	__u8	resv2[448];
+};
+
+/* idle and active power scales occupy the last 2 bits of the field */
+#define POWER_SCALE(s) ((s) >> 6)
+
+struct nvme_host_mem_buffer {
+	__u32			hsize;
+	__u32			hmdlal;
+	__u32			hmdlau;
+	__u32			hmdlec;
+	__u8			rsvd16[4080];
+};
+
+struct nvme_auto_pst {
+	__u32	data;
+	__u32	rsvd32;
+};
+
+struct nvme_timestamp {
+	__u8 timestamp[6];
+	__u8 attr;
+	__u8 rsvd;
+};
+
+struct nvme_controller_list {
+	__le16 num;
+	__le16 identifier[];
+};
+
+struct nvme_bar_cap {
+	__u16	mqes;
+	__u8	ams_cqr;
+	__u8	to;
+	__u16	bps_css_nssrs_dstrd;
+	__u8	mpsmax_mpsmin;
+	__u8	reserved;
+};
+
+#ifdef __CHECKER__
+#define __force       __attribute__((force))
+#else
+#define __force
+#endif
+
+#define cpu_to_le16(x) \
+	((__force __le16)htole16(x))
+#define cpu_to_le32(x) \
+	((__force __le32)htole32(x))
+#define cpu_to_le64(x) \
+	((__force __le64)htole64(x))
+
+#define le16_to_cpu(x) \
+	le16toh((__force __u16)(x))
+#define le32_to_cpu(x) \
+	le32toh((__force __u32)(x))
+#define le64_to_cpu(x) \
+	le64toh((__force __u64)(x))
+
+#define MAX_LIST_ITEMS 256
+struct list_item {
+	char                node[1024];
+	struct nvme_id_ctrl ctrl;
+	int                 nsid;
+	struct nvme_id_ns   ns;
+	unsigned            block;
+};
+
+struct ctrl_list_item {
+	char *name;
+	char *address;
+	char *transport;
+	char *state;
+	char *ana_state;
+};
+
+struct subsys_list_item {
+	char *name;
+	char *subsysnqn;
+	int nctrls;
+	struct ctrl_list_item *ctrls;
+};
+
+enum {
+	NORMAL,
+	JSON,
+	BINARY,
+};
+
+void register_extension(struct plugin *plugin);
+
+#include "argconfig.h"
+int parse_and_open(int argc, char **argv, const char *desc,
+	const struct argconfig_commandline_options *clo, void *cfg, size_t size);
+
+extern const char *devicename;
+
+int __id_ctrl(int argc, char **argv, struct command *cmd, struct plugin *plugin, void (*vs)(__u8 *vs, struct json_object *root));
+int	validate_output_format(char *format);
+
+struct subsys_list_item *get_subsys_list(int *subcnt, char *subsysnqn, __u32 nsid);
+void free_subsys_list(struct subsys_list_item *slist, int n);
+char *nvme_char_from_block(char *block);
+#endif /* _NVME_H */
diff --git a/libmultipath/nvme/plugin.h b/libmultipath/nvme/plugin.h
new file mode 100644
index 00000000..91079fbe
--- /dev/null
+++ b/libmultipath/nvme/plugin.h
@@ -0,0 +1,36 @@
+#ifndef PLUGIN_H
+#define PLUGIN_H
+
+#include <stdbool.h>
+
+struct program {
+	const char *name;
+	const char *version;
+	const char *usage;
+	const char *desc;
+	const char *more;
+	struct command **commands;
+	struct plugin *extensions;
+};
+
+struct plugin {
+	const char *name;
+	const char *desc;
+	struct command **commands;
+	struct program *parent;
+	struct plugin *next;
+	struct plugin *tail;
+};
+
+struct command {
+	char *name;
+	char *help;
+	int (*fn)(int argc, char **argv, struct command *command, struct plugin *plugin);
+	char *alias;
+};
+
+void usage(struct plugin *plugin);
+void general_help(struct plugin *plugin);
+int handle_plugin(int argc, char **argv, struct plugin *plugin);
+
+#endif
-- 
2.19.2

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 13/19] libmultipath: add wrapper library for nvme ioctls
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (11 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 12/19] libmultipath: add files from nvme-cli for NVMe support Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 14/19] multipath-tools: add ANA support for NVMe device Martin Wilck
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, lijie, dm-devel

Create a small wrapper around the code from nvme-cli to provide
the necessary functionality (and only that) for libmultipath.

libmultipath code should include "nvme-lib.h" and possibly
"nvme.h" (the latter with "-Invme"). At build time, a copy of
the nvme-cli code is generated in which all functions have
static linkage. This copy is included by nvme-lib.c, so that
only those functions that are actually exported via nvme-lib.c
become part of libmultipath.

This allows us to include the nvme-cli code without modifications,
and at the same time not carry around binary code for stuff we
don't need.

When additional functionality from nvme-cli is needed, more
wrappers need to be added to nvme-lib.[hc].
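
As a rough illustration of the static-inclusion idea described above
(a sketch only, not code from this series): the names backend_sum(),
backend_unused() and libmp_backend_sum() are hypothetical, and the two
static functions stand in for what would really arrive via
#include "nvme-ioctl.c". The pragma plays the role of the
-Wno-unused-function flag used in the Makefile rule.

```c
/* Sketch under stated assumptions; every identifier here is made up. */
#pragma GCC diagnostic ignored "-Wunused-function"

/*
 * Stand-in for the generated copy of the backend source, in which
 * every function has been given static linkage.
 */
static int backend_sum(int a, int b)
{
	return a + b;
}

/* Never referenced by a wrapper, so nothing outside this file can see it. */
static int backend_unused(int a)
{
	return a * 2;
}

/* The only symbol with external linkage: a thin exported wrapper. */
int libmp_backend_sum(int a, int b)
{
	return backend_sum(a, b);
}
```

Only libmp_backend_sum() is visible to other translation units here, and
the compiler can discard backend_unused() because nothing refers to it.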

Cc: lijie <lijie34@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/Makefile   | 17 ++++++++++++++++-
 libmultipath/nvme-lib.c | 36 ++++++++++++++++++++++++++++++++++++
 libmultipath/nvme-lib.h | 33 +++++++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)
 create mode 100644 libmultipath/nvme-lib.c
 create mode 100644 libmultipath/nvme-lib.h

diff --git a/libmultipath/Makefile b/libmultipath/Makefile
index 33f52691..7d27ea7f 100644
--- a/libmultipath/Makefile
+++ b/libmultipath/Makefile
@@ -45,8 +45,23 @@ OBJS = memory.o parser.o vector.o devmapper.o callout.o \
 	lock.o file.o wwids.o prioritizers/alua_rtpg.o prkey.o \
 	io_err_stat.o dm-generic.o generic.o foreign.o
 
+ifneq ($(call check_file,/usr/include/linux/nvme_ioctl.h),0)
+	OBJS += nvme-lib.o
+endif
+
 all: $(LIBS)
 
+nvme-lib.o: nvme-lib.c nvme-ioctl.c nvme-ioctl.h
+	$(CC) $(CFLAGS) -Wno-unused-function -I. -Invme -c -o $@ $<
+
+make_static = $(shell sed '/^static/!s/^\([a-z]\{1,\} \)/static \1/' <$1 >$2)
+
+nvme-ioctl.c: nvme/nvme-ioctl.c
+	$(call make_static,$<,$@)
+
+nvme-ioctl.h: nvme/nvme-ioctl.h
+	$(call make_static,$<,$@)
+
 $(LIBS): $(OBJS)
 	$(CC) $(LDFLAGS) $(SHARED_FLAGS) -Wl,-soname=$@ -o $@ $(OBJS) $(LIBDEPS)
 	$(LN) $@ $(DEVLIB)
@@ -62,7 +77,7 @@ uninstall:
 	$(RM) $(DESTDIR)$(syslibdir)/$(DEVLIB)
 
 clean: dep_clean
-	$(RM) core *.a *.o *.so *.so.* *.gz
+	$(RM) core *.a *.o *.so *.so.* *.gz nvme-ioctl.c nvme-ioctl.h
 
 include $(wildcard $(OBJS:.o=.d))
 
diff --git a/libmultipath/nvme-lib.c b/libmultipath/nvme-lib.c
new file mode 100644
index 00000000..9c32f369
--- /dev/null
+++ b/libmultipath/nvme-lib.c
@@ -0,0 +1,36 @@
+#include <sys/types.h>
+/* avoid inclusion of standard API */
+#define _NVME_LIB_C 1
+#include "nvme-lib.h"
+#include "nvme-ioctl.c"
+#include "debug.h"
+
+int log_nvme_errcode(int err, const char *dev, const char *msg)
+{
+	if (err > 0)
+		condlog(3, "%s: %s: NVMe status %d", dev, msg, err);
+	else if (err < 0)
+		condlog(3, "%s: %s: %s", dev, msg, strerror(errno));
+	return err;
+}
+
+int libmp_nvme_get_nsid(int fd)
+{
+	return nvme_get_nsid(fd);
+}
+
+int libmp_nvme_identify_ctrl(int fd, struct nvme_id_ctrl *ctrl)
+{
+	return nvme_identify_ctrl(fd, ctrl);
+}
+
+int libmp_nvme_identify_ns(int fd, __u32 nsid, bool present,
+			   struct nvme_id_ns *ns)
+{
+	return nvme_identify_ns(fd, nsid, present, ns);
+}
+
+int libmp_nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo)
+{
+	return nvme_ana_log(fd, ana_log, ana_log_len, rgo);
+}
diff --git a/libmultipath/nvme-lib.h b/libmultipath/nvme-lib.h
new file mode 100644
index 00000000..445c4f46
--- /dev/null
+++ b/libmultipath/nvme-lib.h
@@ -0,0 +1,33 @@
+#ifndef NVME_LIB_H
+#define NVME_LIB_H
+
+#include "nvme.h"
+
+int log_nvme_errcode(int err, const char *dev, const char *msg);
+int libmp_nvme_get_nsid(int fd);
+int libmp_nvme_identify_ctrl(int fd, struct nvme_id_ctrl *ctrl);
+int libmp_nvme_identify_ns(int fd, __u32 nsid, bool present,
+			   struct nvme_id_ns *ns);
+int libmp_nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo);
+
+#ifndef _NVME_LIB_C
+/*
+ * In all files except nvme-lib.c, the nvme functions can be called
+ * by their usual name.
+ */
+#define nvme_get_nsid libmp_nvme_get_nsid
+#define nvme_identify_ctrl libmp_nvme_identify_ctrl
+#define nvme_identify_ns libmp_nvme_identify_ns
+#define nvme_ana_log libmp_nvme_ana_log
+/*
+ * Undefine these to avoid clashes with libmultipath's byteorder.h
+ */
+#undef cpu_to_le16
+#undef cpu_to_le32
+#undef cpu_to_le64
+#undef le16_to_cpu
+#undef le32_to_cpu
+#undef le64_to_cpu
+#endif
+
+#endif /* NVME_LIB_H */
-- 
2.19.2

* [PATCH 14/19] multipath-tools: add ANA support for NVMe device
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (12 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 13/19] libmultipath: add wrapper library for nvme ioctls Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-20 15:17   ` Hannes Reinecke
  2018-12-18 23:19 ` [PATCH 15/19] libmultipath: ANA prioritzer: use nvme wrapper library Martin Wilck
                   ` (6 subsequent siblings)
  20 siblings, 1 reply; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, lijie, dm-devel

From: lijie <lijie34@huawei.com>

Add support for Asymmetric Namespace Access as specified in NVMe 1.3
TP 4004. The states are updated by reading the ANA log page.

By default, native NVMe multipath takes over the NVMe device.
To use multipath-tools instead, set the 'multipath' parameter of the
nvme-core.ko module to false.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/prio.h                |   1 +
 libmultipath/prioritizers/Makefile |   1 +
 libmultipath/prioritizers/ana.c    | 292 +++++++++++++++++++++++++++++
 libmultipath/prioritizers/ana.h    | 221 ++++++++++++++++++++++
 multipath/multipath.conf.5         |   8 +
 5 files changed, 523 insertions(+)
 create mode 100644 libmultipath/prioritizers/ana.c
 create mode 100644 libmultipath/prioritizers/ana.h

diff --git a/libmultipath/prio.h b/libmultipath/prio.h
index aa587ccd..599d1d88 100644
--- a/libmultipath/prio.h
+++ b/libmultipath/prio.h
@@ -30,6 +30,7 @@ struct path;
 #define PRIO_WEIGHTED_PATH	"weightedpath"
 #define PRIO_SYSFS		"sysfs"
 #define PRIO_PATH_LATENCY	"path_latency"
+#define PRIO_ANA		"ana"
 
 /*
  * Value used to mark the fact prio was not defined
diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile
index ab7bc075..15afaba3 100644
--- a/libmultipath/prioritizers/Makefile
+++ b/libmultipath/prioritizers/Makefile
@@ -19,6 +19,7 @@ LIBS = \
 	libpriordac.so \
 	libprioweightedpath.so \
 	libpriopath_latency.so \
+	libprioana.so \
 	libpriosysfs.so
 
 all: $(LIBS)
diff --git a/libmultipath/prioritizers/ana.c b/libmultipath/prioritizers/ana.c
new file mode 100644
index 00000000..c5aaa5fb
--- /dev/null
+++ b/libmultipath/prioritizers/ana.c
@@ -0,0 +1,292 @@
+/*
+ * (C) Copyright HUAWEI Technology Corp. 2017   All Rights Reserved.
+ *
+ * ana.c
+ * Version 1.00
+ *
+ * Tool to make use of an NVMe feature called Asymmetric Namespace Access.
+ * It determines the ANA state of a device and prints a priority value to stdout.
+ *
+ * Author(s): Cheng Jike <chengjike.cheng@huawei.com>
+ *            Li Jie <lijie34@huawei.com>
+ *
+ * This file is released under the GPL version 2, or any later version.
+ */
+#include <stdio.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <stdbool.h>
+
+#include "debug.h"
+#include "prio.h"
+#include "structs.h"
+#include "ana.h"
+
+enum {
+	ANA_PRIO_OPTIMIZED		= 50,
+	ANA_PRIO_NONOPTIMIZED		= 10,
+	ANA_PRIO_INACCESSIBLE		= 5,
+	ANA_PRIO_PERSISTENT_LOSS	= 1,
+	ANA_PRIO_CHANGE			= 0,
+	ANA_PRIO_RESERVED		= 0,
+	ANA_PRIO_GETCTRL_FAILED		= -1,
+	ANA_PRIO_NOT_SUPPORTED		= -2,
+	ANA_PRIO_GETANAS_FAILED		= -3,
+	ANA_PRIO_GETANALOG_FAILED	= -4,
+	ANA_PRIO_GETNSID_FAILED		= -5,
+	ANA_PRIO_GETNS_FAILED		= -6,
+	ANA_PRIO_NO_MEMORY		= -7,
+	ANA_PRIO_NO_INFORMATION		= -8,
+};
+
+static const char * anas_string[] = {
+	[NVME_ANA_OPTIMIZED]			= "ANA Optimized State",
+	[NVME_ANA_NONOPTIMIZED]			= "ANA Non-Optimized State",
+	[NVME_ANA_INACCESSIBLE]			= "ANA Inaccessible State",
+	[NVME_ANA_PERSISTENT_LOSS]		= "ANA Persistent Loss State",
+	[NVME_ANA_CHANGE]			= "ANA Change state",
+	[NVME_ANA_RESERVED]			= "Invalid namespace group state!",
+};
+
+static const char *aas_print_string(int rc)
+{
+	rc &= 0xff;
+
+	switch(rc) {
+	case NVME_ANA_OPTIMIZED:
+	case NVME_ANA_NONOPTIMIZED:
+	case NVME_ANA_INACCESSIBLE:
+	case NVME_ANA_PERSISTENT_LOSS:
+	case NVME_ANA_CHANGE:
+		return anas_string[rc];
+	default:
+		return anas_string[NVME_ANA_RESERVED];
+	}
+
+	return anas_string[NVME_ANA_RESERVED];
+}
+
+static int nvme_get_nsid(int fd, unsigned *nsid)
+{
+	static struct stat nvme_stat;
+	int err = fstat(fd, &nvme_stat);
+	if (err < 0)
+		return 1;
+
+	if (!S_ISBLK(nvme_stat.st_mode)) {
+		condlog(0, "Error: requesting namespace-id from non-block device\n");
+		return 1;
+	}
+
+	*nsid = ioctl(fd, NVME_IOCTL_ID);
+	return 0;
+}
+
+static int nvme_submit_admin_passthru(int fd, struct nvme_passthru_cmd *cmd)
+{
+	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, cmd);
+}
+
+int nvme_get_log13(int fd, __u32 nsid, __u8 log_id, __u8 lsp, __u64 lpo,
+                 __u16 lsi, bool rae, __u32 data_len, void *data)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_get_log_page,
+		.nsid		= nsid,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= data_len,
+	};
+	__u32 numd = (data_len >> 2) - 1;
+	__u16 numdu = numd >> 16, numdl = numd & 0xffff;
+
+	cmd.cdw10 = log_id | (numdl << 16) | (rae ? 1 << 15 : 0);
+	if (lsp)
+		cmd.cdw10 |= lsp << 8;
+
+	cmd.cdw11 = numdu | (lsi << 16);
+	cmd.cdw12 = lpo;
+	cmd.cdw13 = (lpo >> 32);
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+
+}
+
+int nvme_identify13(int fd, __u32 nsid, __u32 cdw10, __u32 cdw11, void *data)
+{
+	struct nvme_admin_cmd cmd = {
+		.opcode		= nvme_admin_identify,
+		.nsid		= nsid,
+		.addr		= (__u64)(uintptr_t) data,
+		.data_len	= NVME_IDENTIFY_DATA_SIZE,
+		.cdw10		= cdw10,
+		.cdw11		= cdw11,
+	};
+
+	return nvme_submit_admin_passthru(fd, &cmd);
+}
+
+int nvme_identify(int fd, __u32 nsid, __u32 cdw10, void *data)
+{
+	return nvme_identify13(fd, nsid, cdw10, 0, data);
+}
+
+int nvme_identify_ctrl(int fd, void *data)
+{
+	return nvme_identify(fd, 0, NVME_ID_CNS_CTRL, data);
+}
+
+int nvme_identify_ns(int fd, __u32 nsid, void *data)
+{
+	return nvme_identify(fd, nsid, NVME_ID_CNS_NS, data);
+}
+
+int nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo)
+{
+	__u64 lpo = 0;
+
+	return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_ANA, rgo, lpo, 0,
+			true, ana_log_len, ana_log);
+}
+
+static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log)
+{
+	int	rc = ANA_PRIO_GETANAS_FAILED;
+	void *base = ana_log;
+	struct nvme_ana_rsp_hdr *hdr = base;
+	struct nvme_ana_group_desc *ana_desc;
+	int offset = sizeof(struct nvme_ana_rsp_hdr);
+	__u32 nr_nsids;
+	size_t nsid_buf_size;
+	int i, j;
+
+	for (i = 0; i < le16_to_cpu(hdr->ngrps); i++) {
+		ana_desc = base + offset;
+		nr_nsids = le32_to_cpu(ana_desc->nnsids);
+		nsid_buf_size = nr_nsids * sizeof(__le32);
+
+		offset += sizeof(*ana_desc);
+
+		for (j = 0; j < nr_nsids; j++) {
+			if (nsid == le32_to_cpu(ana_desc->nsids[j]))
+				return ana_desc->state;
+		}
+
+		if (anagrpid != 0 && anagrpid == le32_to_cpu(ana_desc->grpid))
+			rc = ana_desc->state;
+
+		offset += nsid_buf_size;
+	}
+
+	return rc;
+}
+
+int get_ana_info(struct path * pp, unsigned int timeout)
+{
+	int	rc;
+	__u32 nsid;
+	struct nvme_id_ctrl ctrl;
+	struct nvme_id_ns ns;
+	void *ana_log;
+	size_t ana_log_len;
+
+	rc = nvme_identify_ctrl(pp->fd, &ctrl);
+	if (rc)
+		return ANA_PRIO_GETCTRL_FAILED;
+
+	if(!(ctrl.cmic & (1 << 3)))
+		return ANA_PRIO_NOT_SUPPORTED;
+
+	rc = nvme_get_nsid(pp->fd, &nsid);
+	if (rc)
+		return ANA_PRIO_GETNSID_FAILED;
+
+	rc = nvme_identify_ns(pp->fd, nsid, &ns);
+	if (rc)
+		return ANA_PRIO_GETNS_FAILED;
+
+	ana_log_len = sizeof(struct nvme_ana_rsp_hdr) +
+		le32_to_cpu(ctrl.nanagrpid) * sizeof(struct nvme_ana_group_desc);
+	if (!(ctrl.anacap & (1 << 6)))
+		ana_log_len += le32_to_cpu(ctrl.mnan) * sizeof(__le32);
+
+	ana_log = malloc(ana_log_len);
+	if (!ana_log)
+		return ANA_PRIO_NO_MEMORY;
+
+	rc = nvme_ana_log(pp->fd, ana_log, ana_log_len,
+		(ctrl.anacap & (1 << 6)) ? NVME_ANA_LOG_RGO : 0);
+	if (rc) {
+		free(ana_log);
+		return ANA_PRIO_GETANALOG_FAILED;
+	}
+
+	rc = get_ana_state(nsid, le32_to_cpu(ns.anagrpid), ana_log);
+	if (rc < 0){
+		free(ana_log);
+		return ANA_PRIO_GETANAS_FAILED;
+	}
+
+	free(ana_log);
+	condlog(3, "%s: ana state = %02x [%s]", pp->dev, rc, aas_print_string(rc));
+
+	return rc;
+}
+
+int getprio(struct path * pp, char * args, unsigned int timeout)
+{
+	int rc;
+
+	if (pp->fd < 0)
+		return ANA_PRIO_NO_INFORMATION;
+
+	rc = get_ana_info(pp, timeout);
+	if (rc >= 0) {
+		rc &= 0x0f;
+		switch(rc) {
+		case NVME_ANA_OPTIMIZED:
+			rc = ANA_PRIO_OPTIMIZED;
+			break;
+		case NVME_ANA_NONOPTIMIZED:
+			rc = ANA_PRIO_NONOPTIMIZED;
+			break;
+		case NVME_ANA_INACCESSIBLE:
+			rc = ANA_PRIO_INACCESSIBLE;
+			break;
+		case NVME_ANA_PERSISTENT_LOSS:
+			rc = ANA_PRIO_PERSISTENT_LOSS;
+			break;
+		case NVME_ANA_CHANGE:
+			rc = ANA_PRIO_CHANGE;
+			break;
+		default:
+			rc = ANA_PRIO_RESERVED;
+		}
+	} else {
+		switch(rc) {
+		case ANA_PRIO_GETCTRL_FAILED:
+			condlog(0, "%s: couldn't get ctrl info", pp->dev);
+			break;
+		case ANA_PRIO_NOT_SUPPORTED:
+			condlog(0, "%s: ana not supported", pp->dev);
+			break;
+		case ANA_PRIO_GETANAS_FAILED:
+			condlog(0, "%s: couldn't get ana state", pp->dev);
+			break;
+		case ANA_PRIO_GETANALOG_FAILED:
+			condlog(0, "%s: couldn't get ana log", pp->dev);
+			break;
+		case ANA_PRIO_GETNS_FAILED:
+			condlog(0, "%s: couldn't get namespace", pp->dev);
+			break;
+		case ANA_PRIO_GETNSID_FAILED:
+			condlog(0, "%s: couldn't get namespace id", pp->dev);
+			break;
+		case ANA_PRIO_NO_MEMORY:
+			condlog(0, "%s: couldn't alloc memory", pp->dev);
+			break;
+		}
+	}
+	return rc;
+}
+
diff --git a/libmultipath/prioritizers/ana.h b/libmultipath/prioritizers/ana.h
new file mode 100644
index 00000000..92cfa9e3
--- /dev/null
+++ b/libmultipath/prioritizers/ana.h
@@ -0,0 +1,221 @@
+#ifndef _ANA_H
+#define _ANA_H
+
+#include <linux/types.h>
+
+#define NVME_NSID_ALL			0xffffffff
+#define NVME_IDENTIFY_DATA_SIZE 	4096
+
+#define NVME_LOG_ANA			0x0c
+
+/* Admin commands */
+enum nvme_admin_opcode {
+	nvme_admin_get_log_page		= 0x02,
+	nvme_admin_identify		= 0x06,
+};
+
+enum {
+	NVME_ID_CNS_NS			= 0x00,
+	NVME_ID_CNS_CTRL		= 0x01,
+};
+
+/* nvme ioctl start */
+struct nvme_passthru_cmd {
+	__u8	opcode;
+	__u8	flags;
+	__u16	rsvd1;
+	__u32	nsid;
+	__u32	cdw2;
+	__u32	cdw3;
+	__u64	metadata;
+	__u64	addr;
+	__u32	metadata_len;
+	__u32	data_len;
+	__u32	cdw10;
+	__u32	cdw11;
+	__u32	cdw12;
+	__u32	cdw13;
+	__u32	cdw14;
+	__u32	cdw15;
+	__u32	timeout_ms;
+	__u32	result;
+};
+
+#define nvme_admin_cmd nvme_passthru_cmd
+
+#define NVME_IOCTL_ID		_IO('N', 0x40)
+#define NVME_IOCTL_ADMIN_CMD	_IOWR('N', 0x41, struct nvme_admin_cmd)
+/* nvme ioctl end */
+
+/* nvme id ctrl start */
+struct nvme_id_power_state {
+	__le16			max_power;	/* centiwatts */
+	__u8			rsvd2;
+	__u8			flags;
+	__le32			entry_lat;	/* microseconds */
+	__le32			exit_lat;	/* microseconds */
+	__u8			read_tput;
+	__u8			read_lat;
+	__u8			write_tput;
+	__u8			write_lat;
+	__le16			idle_power;
+	__u8			idle_scale;
+	__u8			rsvd19;
+	__le16			active_power;
+	__u8			active_work_scale;
+	__u8			rsvd23[9];
+};
+
+struct nvme_id_ctrl {
+	__le16			vid;
+	__le16			ssvid;
+	char			sn[20];
+	char			mn[40];
+	char			fr[8];
+	__u8			rab;
+	__u8			ieee[3];
+	__u8			cmic;
+	__u8			mdts;
+	__le16			cntlid;
+	__le32			ver;
+	__le32			rtd3r;
+	__le32			rtd3e;
+	__le32			oaes;
+	__le32			ctratt;
+	__u8			rsvd100[156];
+	__le16			oacs;
+	__u8			acl;
+	__u8			aerl;
+	__u8			frmw;
+	__u8			lpa;
+	__u8			elpe;
+	__u8			npss;
+	__u8			avscc;
+	__u8			apsta;
+	__le16			wctemp;
+	__le16			cctemp;
+	__le16			mtfa;
+	__le32			hmpre;
+	__le32			hmmin;
+	__u8			tnvmcap[16];
+	__u8			unvmcap[16];
+	__le32			rpmbs;
+	__le16			edstt;
+	__u8			dsto;
+	__u8			fwug;
+	__le16			kas;
+	__le16			hctma;
+	__le16			mntmt;
+	__le16			mxtmt;
+	__le32			sanicap;
+	__le32			hmminds;
+	__le16			hmmaxd;
+	__u8			rsvd338[4];
+	__u8			anatt;
+	__u8			anacap;
+	__le32			anagrpmax;
+	__le32			nanagrpid;
+	__u8			rsvd352[160];
+	__u8			sqes;
+	__u8			cqes;
+	__le16			maxcmd;
+	__le32			nn;
+	__le16			oncs;
+	__le16			fuses;
+	__u8			fna;
+	__u8			vwc;
+	__le16			awun;
+	__le16			awupf;
+	__u8			nvscc;
+	__u8			nwpc;
+	__le16			acwu;
+	__u8			rsvd534[2];
+	__le32			sgls;
+	__le32			mnan;
+	__u8			rsvd544[224];
+	char			subnqn[256];
+	__u8			rsvd1024[768];
+	__le32			ioccsz;
+	__le32			iorcsz;
+	__le16			icdoff;
+	__u8			ctrattr;
+	__u8			msdbd;
+	__u8			rsvd1804[244];
+	struct nvme_id_power_state	psd[32];
+	__u8			vs[1024];
+};
+/* nvme id ctrl end */
+
+/* nvme id ns start */
+struct nvme_lbaf {
+	__le16			ms;
+	__u8			ds;
+	__u8			rp;
+};
+
+struct nvme_id_ns {
+	__le64			nsze;
+	__le64			ncap;
+	__le64			nuse;
+	__u8			nsfeat;
+	__u8			nlbaf;
+	__u8			flbas;
+	__u8			mc;
+	__u8			dpc;
+	__u8			dps;
+	__u8			nmic;
+	__u8			rescap;
+	__u8			fpi;
+	__u8			rsvd33;
+	__le16			nawun;
+	__le16			nawupf;
+	__le16			nacwu;
+	__le16			nabsn;
+	__le16			nabo;
+	__le16			nabspf;
+	__le16			noiob;
+	__u8			nvmcap[16];
+	__u8			rsvd64[28];
+	__le32			anagrpid;
+	__u8			rsvd96[3];
+	__u8			nsattr;
+	__u8			rsvd100[4];
+	__u8			nguid[16];
+	__u8			eui64[8];
+	struct nvme_lbaf	lbaf[16];
+	__u8			rsvd192[192];
+	__u8			vs[3712];
+};
+/* nvme id ns end */
+
+/* nvme ana start */
+enum nvme_ana_state {
+	NVME_ANA_OPTIMIZED		= 0x01,
+	NVME_ANA_NONOPTIMIZED		= 0x02,
+	NVME_ANA_INACCESSIBLE		= 0x03,
+	NVME_ANA_PERSISTENT_LOSS	= 0x04,
+	NVME_ANA_CHANGE			= 0x0f,
+	NVME_ANA_RESERVED		= 0x05,
+};
+
+struct nvme_ana_rsp_hdr {
+	__le64	chgcnt;
+	__le16	ngrps;
+	__le16	rsvd10[3];
+};
+
+struct nvme_ana_group_desc {
+	__le32	grpid;
+	__le32	nnsids;
+	__le64	chgcnt;
+	__u8	state;
+	__u8	rsvd17[15];
+	__le32	nsids[];
+};
+
+/* flag for the log specific field of the ANA log */
+#define NVME_ANA_LOG_RGO	(1 << 0)
+
+/* nvme ana end */
+
+#endif
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index c7f59147..88b8edd0 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -334,6 +334,10 @@ priority provided as argument. Requires prio_args keyword.
 Generate the path priority based on a latency algorithm.
 Requires prio_args keyword.
 .TP
+.I ana
+(Hardware-dependent)
+Generate the path priority based on the NVMe ANA settings.
+.TP
 .I datacore
 (Hardware-dependent)
 Generate the path priority for some DataCore storage arrays. Requires prio_args
@@ -1437,6 +1441,10 @@ Active/Standby mode exclusively.
 .I 1 alua
 (Hardware-dependent)
 Hardware handler for SCSI-3 ALUA compatible arrays.
+.TP
+.I 1 ana
+(Hardware-dependent)
+Hardware handler for NVMe ANA compatible arrays.
 .PP
 The default is: \fB<unset>\fR
 .PP
-- 
2.19.2

* [PATCH 15/19] libmultipath: ANA prioritizer: use nvme wrapper library
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (13 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 14/19] multipath-tools: add ANA support for NVMe device Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-20 22:58   ` Benjamin Marzinski
  2018-12-18 23:19 ` [PATCH 16/19] libmultipath: detect_prio: try ANA for NVMe Martin Wilck
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, lijie, dm-devel

Use the previously introduced NVMe wrapper library for
the passthrough commands issued by the ANA prioritizer, and
remove the code duplicated from nvme-cli from ana.c itself.

Furthermore, make additional cleanups in the ANA prioritizer:

 - don't use the same enum for priorities and error codes
 - use char* arrays for error messages and state names
 - return -1 prio to libmultipath for all error cases
 - check if a device is NVMe before trying ioctl
 - check for overflow in get_ana_state()
 - get_ana_info(): improve readability with is_anagrpid_const
 - priorities: PERSISTENT_LOSS state is worse than INACCESSIBLE
 and CHANGE

Cc: lijie <lijie34@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
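
For illustration only (not part of the patch): the table-driven priority
lookup from the cleanup list can be sketched standalone as follows. The
state values and priorities mirror the ones in the diff; the names
prio_table, PRIO_UNDEF and state_to_prio() are hypothetical and do not
exist in libmultipath.

```c
#include <stddef.h>

/* ANA states, same numeric values as enum nvme_ana_state in the patch */
enum ana_state {
	ANA_OPTIMIZED       = 0x01,
	ANA_NONOPTIMIZED    = 0x02,
	ANA_INACCESSIBLE    = 0x03,
	ANA_PERSISTENT_LOSS = 0x04,
	ANA_CHANGE          = 0x0f,
};

#define PRIO_UNDEF (-1)	/* hypothetical: returned for every error case */

/* Slots not listed here are implicitly 0, which is treated as "invalid" */
static const int prio_table[] = {
	[ANA_OPTIMIZED]       = 50,
	[ANA_NONOPTIMIZED]    = 10,
	[ANA_INACCESSIBLE]    =  5,
	[ANA_PERSISTENT_LOSS] =  1,
	[ANA_CHANGE]          =  5,
};

#define TABLE_SIZE (sizeof(prio_table) / sizeof(prio_table[0]))

/* Hypothetical helper: map a raw state (or a negative error) to a prio */
static int state_to_prio(int state)
{
	if (state >= 0 && (size_t)state < TABLE_SIZE && prio_table[state] != 0)
		return prio_table[state];
	return PRIO_UNDEF;
}
```

With this layout, reserved state values such as 0x05 fall into an
implicitly zeroed slot and end up as PRIO_UNDEF, just like real errors.
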
 libmultipath/prioritizers/Makefile |   6 +-
 libmultipath/prioritizers/ana.c    | 305 ++++++++++-------------------
 2 files changed, 113 insertions(+), 198 deletions(-)
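
For illustration only (not part of the patch): a minimal sketch of a
bounds-checked walk over an ANA log buffer, the idea behind the overflow
check added to get_ana_state(). It parses raw bytes instead of using the
nvme-cli structs, and it compares against the remaining length rather
than incrementing the offset first; both are choices of this sketch. The
names find_ana_state(), ERR_OVERFLOW and ERR_NOTFOUND are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ANA_HDR_LEN   16	/* chgcnt(8) + ngrps(2) + reserved(6) */
#define ANA_DESC_LEN  32	/* grpid(4) + nnsids(4) + chgcnt(8) + state(1) + reserved(15) */

#define ERR_OVERFLOW  (-1)	/* hypothetical error codes */
#define ERR_NOTFOUND  (-2)

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Return the state of the ANA group that lists nsid, or of the group
 * whose ID equals anagrpid if anagrpid != 0. Negative value on error.
 */
static int find_ana_state(const uint8_t *log, size_t len,
			  uint32_t nsid, uint32_t anagrpid)
{
	size_t offset = ANA_HDR_LEN;
	uint16_t ngrps, i;

	if (len < ANA_HDR_LEN)
		return ERR_OVERFLOW;
	ngrps = get_le16(log + 8);

	for (i = 0; i < ngrps; i++) {
		const uint8_t *desc = log + offset;
		uint32_t nnsids, j;

		/* The fixed part of the descriptor must fit ... */
		if (len - offset < ANA_DESC_LEN)
			return ERR_OVERFLOW;
		offset += ANA_DESC_LEN;

		/* ... and so must the NSID list that follows it */
		nnsids = get_le32(desc + 4);
		if ((len - offset) / 4 < nnsids)
			return ERR_OVERFLOW;
		offset += (size_t)nnsids * 4;

		if (anagrpid != 0 && get_le32(desc) == anagrpid)
			return desc[16];
		for (j = 0; j < nnsids; j++)
			if (get_le32(desc + ANA_DESC_LEN + 4 * j) == nsid)
				return desc[16];
	}
	return ERR_NOTFOUND;
}
```

A truncated buffer yields ERR_OVERFLOW before any descriptor field
outside the buffer is read.
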

diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile
index 15afaba3..4d80c20c 100644
--- a/libmultipath/prioritizers/Makefile
+++ b/libmultipath/prioritizers/Makefile
@@ -19,9 +19,13 @@ LIBS = \
 	libpriordac.so \
 	libprioweightedpath.so \
 	libpriopath_latency.so \
-	libprioana.so \
 	libpriosysfs.so
 
+ifneq ($(call check_file,/usr/include/linux/nvme_ioctl.h),0)
+	LIBS += libprioana.so
+	CFLAGS += -I../nvme
+endif
+
 all: $(LIBS)
 
 libprioalua.so: alua.o alua_rtpg.o
diff --git a/libmultipath/prioritizers/ana.c b/libmultipath/prioritizers/ana.c
index c5aaa5fb..88edb224 100644
--- a/libmultipath/prioritizers/ana.c
+++ b/libmultipath/prioritizers/ana.c
@@ -17,155 +17,91 @@
 #include <sys/stat.h>
 #include <sys/types.h>
 #include <stdbool.h>
+#include <libudev.h>
 
 #include "debug.h"
+#include "nvme-lib.h"
 #include "prio.h"
+#include "util.h"
 #include "structs.h"
-#include "ana.h"
 
 enum {
-	ANA_PRIO_OPTIMIZED		= 50,
-	ANA_PRIO_NONOPTIMIZED		= 10,
-	ANA_PRIO_INACCESSIBLE		= 5,
-	ANA_PRIO_PERSISTENT_LOSS	= 1,
-	ANA_PRIO_CHANGE			= 0,
-	ANA_PRIO_RESERVED		= 0,
-	ANA_PRIO_GETCTRL_FAILED		= -1,
-	ANA_PRIO_NOT_SUPPORTED		= -2,
-	ANA_PRIO_GETANAS_FAILED		= -3,
-	ANA_PRIO_GETANALOG_FAILED	= -4,
-	ANA_PRIO_GETNSID_FAILED		= -5,
-	ANA_PRIO_GETNS_FAILED		= -6,
-	ANA_PRIO_NO_MEMORY		= -7,
-	ANA_PRIO_NO_INFORMATION		= -8,
+	ANA_ERR_GETCTRL_FAILED		= 1,
+	ANA_ERR_NOT_NVME,
+	ANA_ERR_NOT_SUPPORTED,
+	ANA_ERR_GETANAS_OVERFLOW,
+	ANA_ERR_GETANAS_NOTFOUND,
+	ANA_ERR_GETANALOG_FAILED,
+	ANA_ERR_GETNSID_FAILED,
+	ANA_ERR_GETNS_FAILED,
+	ANA_ERR_NO_MEMORY,
+	ANA_ERR_NO_INFORMATION,
 };
 
-static const char * anas_string[] = {
+static const char *ana_errmsg[] = {
+	[ANA_ERR_GETCTRL_FAILED]	= "couldn't get ctrl info",
+	[ANA_ERR_NOT_NVME]		= "not an NVMe device",
+	[ANA_ERR_NOT_SUPPORTED]		= "ANA not supported",
+	[ANA_ERR_GETANAS_OVERFLOW]	= "buffer overflow in ANA log",
+	[ANA_ERR_GETANAS_NOTFOUND]	= "NSID or ANAGRPID not found",
+	[ANA_ERR_GETANALOG_FAILED]	= "couldn't get ana log",
+	[ANA_ERR_GETNSID_FAILED]	= "couldn't get NSID",
+	[ANA_ERR_GETNS_FAILED]		= "couldn't get namespace info",
+	[ANA_ERR_NO_MEMORY]		= "out of memory",
+	[ANA_ERR_NO_INFORMATION]	= "invalid fd",
+};
+
+/* Use the implicit initialization: value 0 is "invalid" */
+static const int ana_prio [] = {
+	[NVME_ANA_OPTIMIZED]		= 50,
+	[NVME_ANA_NONOPTIMIZED]		= 10,
+	[NVME_ANA_INACCESSIBLE]		=  5,
+	[NVME_ANA_PERSISTENT_LOSS]	=  1,
+	[NVME_ANA_CHANGE]		=  5,
+};
+
+static const char *anas_string[] = {
 	[NVME_ANA_OPTIMIZED]			= "ANA Optimized State",
 	[NVME_ANA_NONOPTIMIZED]			= "ANA Non-Optimized State",
 	[NVME_ANA_INACCESSIBLE]			= "ANA Inaccessible State",
 	[NVME_ANA_PERSISTENT_LOSS]		= "ANA Persistent Loss State",
 	[NVME_ANA_CHANGE]			= "ANA Change state",
-	[NVME_ANA_RESERVED]			= "Invalid namespace group state!",
 };
 
 static const char *aas_print_string(int rc)
 {
 	rc &= 0xff;
-
-	switch(rc) {
-	case NVME_ANA_OPTIMIZED:
-	case NVME_ANA_NONOPTIMIZED:
-	case NVME_ANA_INACCESSIBLE:
-	case NVME_ANA_PERSISTENT_LOSS:
-	case NVME_ANA_CHANGE:
+	if (rc >= 0 && rc < ARRAY_SIZE(anas_string) &&
+	    anas_string[rc] != NULL)
 		return anas_string[rc];
-	default:
-		return anas_string[NVME_ANA_RESERVED];
-	}
-
-	return anas_string[NVME_ANA_RESERVED];
-}
-
-static int nvme_get_nsid(int fd, unsigned *nsid)
-{
-	static struct stat nvme_stat;
-	int err = fstat(fd, &nvme_stat);
-	if (err < 0)
-		return 1;
-
-	if (!S_ISBLK(nvme_stat.st_mode)) {
-		condlog(0, "Error: requesting namespace-id from non-block device\n");
-		return 1;
-	}
-
-	*nsid = ioctl(fd, NVME_IOCTL_ID);
-	return 0;
-}
-
-static int nvme_submit_admin_passthru(int fd, struct nvme_passthru_cmd *cmd)
-{
-	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, cmd);
-}
-
-int nvme_get_log13(int fd, __u32 nsid, __u8 log_id, __u8 lsp, __u64 lpo,
-                 __u16 lsi, bool rae, __u32 data_len, void *data)
-{
-	struct nvme_admin_cmd cmd = {
-		.opcode		= nvme_admin_get_log_page,
-		.nsid		= nsid,
-		.addr		= (__u64)(uintptr_t) data,
-		.data_len	= data_len,
-	};
-	__u32 numd = (data_len >> 2) - 1;
-	__u16 numdu = numd >> 16, numdl = numd & 0xffff;
-
-	cmd.cdw10 = log_id | (numdl << 16) | (rae ? 1 << 15 : 0);
-	if (lsp)
-		cmd.cdw10 |= lsp << 8;
-
-	cmd.cdw11 = numdu | (lsi << 16);
-	cmd.cdw12 = lpo;
-	cmd.cdw13 = (lpo >> 32);
-
-	return nvme_submit_admin_passthru(fd, &cmd);
-
-}
-
-int nvme_identify13(int fd, __u32 nsid, __u32 cdw10, __u32 cdw11, void *data)
-{
-	struct nvme_admin_cmd cmd = {
-		.opcode		= nvme_admin_identify,
-		.nsid		= nsid,
-		.addr		= (__u64)(uintptr_t) data,
-		.data_len	= NVME_IDENTIFY_DATA_SIZE,
-		.cdw10		= cdw10,
-		.cdw11		= cdw11,
-	};
-
-	return nvme_submit_admin_passthru(fd, &cmd);
-}
-
-int nvme_identify(int fd, __u32 nsid, __u32 cdw10, void *data)
-{
-	return nvme_identify13(fd, nsid, cdw10, 0, data);
-}
 
-int nvme_identify_ctrl(int fd, void *data)
-{
-	return nvme_identify(fd, 0, NVME_ID_CNS_CTRL, data);
-}
-
-int nvme_identify_ns(int fd, __u32 nsid, void *data)
-{
-	return nvme_identify(fd, nsid, NVME_ID_CNS_NS, data);
-}
-
-int nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo)
-{
-	__u64 lpo = 0;
-
-	return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_ANA, rgo, lpo, 0,
-			true, ana_log_len, ana_log);
+	return "invalid ANA state";
 }
 
-static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log)
+static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log,
+			 size_t ana_log_len)
 {
-	int	rc = ANA_PRIO_GETANAS_FAILED;
 	void *base = ana_log;
 	struct nvme_ana_rsp_hdr *hdr = base;
 	struct nvme_ana_group_desc *ana_desc;
-	int offset = sizeof(struct nvme_ana_rsp_hdr);
+	size_t offset = sizeof(struct nvme_ana_rsp_hdr);
 	__u32 nr_nsids;
 	size_t nsid_buf_size;
 	int i, j;
 
 	for (i = 0; i < le16_to_cpu(hdr->ngrps); i++) {
 		ana_desc = base + offset;
+
+		offset += sizeof(*ana_desc);
+		if (offset > ana_log_len)
+			return -ANA_ERR_GETANAS_OVERFLOW;
+
 		nr_nsids = le32_to_cpu(ana_desc->nnsids);
 		nsid_buf_size = nr_nsids * sizeof(__le32);
 
-		offset += sizeof(*ana_desc);
+		offset += nsid_buf_size;
+		if (offset > ana_log_len)
+			return -ANA_ERR_GETANAS_OVERFLOW;
 
 		for (j = 0; j < nr_nsids; j++) {
 			if (nsid == le32_to_cpu(ana_desc->nsids[j]))
@@ -173,12 +109,10 @@ static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log)
 		}
 
 		if (anagrpid != 0 && anagrpid == le32_to_cpu(ana_desc->grpid))
-			rc = ana_desc->state;
+			return ana_desc->state;
 
-		offset += nsid_buf_size;
 	}
-
-	return rc;
+	return -ANA_ERR_GETANAS_NOTFOUND;
 }
 
 int get_ana_info(struct path * pp, unsigned int timeout)
@@ -189,104 +123,81 @@ int get_ana_info(struct path * pp, unsigned int timeout)
 	struct nvme_id_ns ns;
 	void *ana_log;
 	size_t ana_log_len;
+	bool is_anagrpid_const;
 
 	rc = nvme_identify_ctrl(pp->fd, &ctrl);
-	if (rc)
-		return ANA_PRIO_GETCTRL_FAILED;
+	if (rc < 0) {
+		log_nvme_errcode(rc, pp->dev, "nvme_identify_ctrl");
+		return -ANA_ERR_GETCTRL_FAILED;
+	}
 
 	if(!(ctrl.cmic & (1 << 3)))
-		return ANA_PRIO_NOT_SUPPORTED;
-
-	rc = nvme_get_nsid(pp->fd, &nsid);
-	if (rc)
-		return ANA_PRIO_GETNSID_FAILED;
+		return -ANA_ERR_NOT_SUPPORTED;
 
-	rc = nvme_identify_ns(pp->fd, nsid, &ns);
-	if (rc)
-		return ANA_PRIO_GETNS_FAILED;
+	nsid = nvme_get_nsid(pp->fd);
+	if (nsid <= 0) {
+		log_nvme_errcode(rc, pp->dev, "nvme_get_nsid");
+		return -ANA_ERR_GETNSID_FAILED;
+	}
+	is_anagrpid_const = ctrl.anacap & (1 << 6);
 
+	/*
+	 * Code copied from nvme-cli/nvme.c. We don't need to allocate an
+	 * [nanagrpid*mnan] array of NSIDs because each NSID can occur at most
+	 * in one ANA group.
+	 */
 	ana_log_len = sizeof(struct nvme_ana_rsp_hdr) +
-		le32_to_cpu(ctrl.nanagrpid) * sizeof(struct nvme_ana_group_desc);
-	if (!(ctrl.anacap & (1 << 6)))
+		le32_to_cpu(ctrl.nanagrpid)
+		* sizeof(struct nvme_ana_group_desc);
+
+	if (is_anagrpid_const) {
+		rc = nvme_identify_ns(pp->fd, nsid, 0, &ns);
+		if (rc) {
+			log_nvme_errcode(rc, pp->dev, "nvme_identify_ns");
+			return -ANA_ERR_GETNS_FAILED;
+		}
+	} else
 		ana_log_len += le32_to_cpu(ctrl.mnan) * sizeof(__le32);
 
 	ana_log = malloc(ana_log_len);
 	if (!ana_log)
-		return ANA_PRIO_NO_MEMORY;
-
+		return -ANA_ERR_NO_MEMORY;
+	pthread_cleanup_push(free, ana_log);
 	rc = nvme_ana_log(pp->fd, ana_log, ana_log_len,
-		(ctrl.anacap & (1 << 6)) ? NVME_ANA_LOG_RGO : 0);
+			  is_anagrpid_const ? NVME_ANA_LOG_RGO : 0);
 	if (rc) {
-		free(ana_log);
-		return ANA_PRIO_GETANALOG_FAILED;
-	}
-
-	rc = get_ana_state(nsid, le32_to_cpu(ns.anagrpid), ana_log);
-	if (rc < 0){
-		free(ana_log);
-		return ANA_PRIO_GETANAS_FAILED;
-	}
-
-	free(ana_log);
-	condlog(3, "%s: ana state = %02x [%s]", pp->dev, rc, aas_print_string(rc));
-
+		log_nvme_errcode(rc, pp->dev, "nvme_ana_log");
+		rc = -ANA_ERR_GETANALOG_FAILED;
+	} else
+		rc = get_ana_state(nsid,
+				   is_anagrpid_const ?
+				   le32_to_cpu(ns.anagrpid) : 0,
+				   ana_log, ana_log_len);
+	pthread_cleanup_pop(1);
+	if (rc >= 0)
+		condlog(3, "%s: ana state = %02x [%s]", pp->dev, rc,
+			aas_print_string(rc));
 	return rc;
 }
 
-int getprio(struct path * pp, char * args, unsigned int timeout)
+int getprio(struct path *pp, char *args, unsigned int timeout)
 {
 	int rc;
 
 	if (pp->fd < 0)
-		return ANA_PRIO_NO_INFORMATION;
-
-	rc = get_ana_info(pp, timeout);
-	if (rc >= 0) {
-		rc &= 0x0f;
-		switch(rc) {
-		case NVME_ANA_OPTIMIZED:
-			rc = ANA_PRIO_OPTIMIZED;
-			break;
-		case NVME_ANA_NONOPTIMIZED:
-			rc = ANA_PRIO_NONOPTIMIZED;
-			break;
-		case NVME_ANA_INACCESSIBLE:
-			rc = ANA_PRIO_INACCESSIBLE;
-			break;
-		case NVME_ANA_PERSISTENT_LOSS:
-			rc = ANA_PRIO_PERSISTENT_LOSS;
-			break;
-		case NVME_ANA_CHANGE:
-			rc = ANA_PRIO_CHANGE;
-			break;
-		default:
-			rc = ANA_PRIO_RESERVED;
-		}
-	} else {
-		switch(rc) {
-		case ANA_PRIO_GETCTRL_FAILED:
-			condlog(0, "%s: couldn't get ctrl info", pp->dev);
-			break;
-		case ANA_PRIO_NOT_SUPPORTED:
-			condlog(0, "%s: ana not supported", pp->dev);
-			break;
-		case ANA_PRIO_GETANAS_FAILED:
-			condlog(0, "%s: couldn't get ana state", pp->dev);
-			break;
-		case ANA_PRIO_GETANALOG_FAILED:
-			condlog(0, "%s: couldn't get ana log", pp->dev);
-			break;
-		case ANA_PRIO_GETNS_FAILED:
-			condlog(0, "%s: couldn't get namespace", pp->dev);
-			break;
-		case ANA_PRIO_GETNSID_FAILED:
-			condlog(0, "%s: couldn't get namespace id", pp->dev);
-			break;
-		case ANA_PRIO_NO_MEMORY:
-			condlog(0, "%s: couldn't alloc memory", pp->dev);
-			break;
-		}
+		rc = -ANA_ERR_NO_INFORMATION;
+	else if (udev_device_get_parent_with_subsystem_devtype(pp->udev,
+							       "nvme", NULL)
+		 == NULL)
+		rc = -ANA_ERR_NOT_NVME;
+	else {
+		rc = get_ana_info(pp, timeout);
+		if (rc >= 0 && rc < ARRAY_SIZE(ana_prio) && ana_prio[rc] != 0)
+			return ana_prio[rc];
 	}
-	return rc;
+	if (rc < 0 && -rc < ARRAY_SIZE(ana_errmsg))
+		condlog(2, "%s: ANA error: %s", pp->dev, ana_errmsg[-rc]);
+	else
+		condlog(1, "%s: invalid ANA rc code %d", pp->dev, rc);
+	return -1;
 }
-
-- 
2.19.2

* [PATCH 16/19] libmultipath: detect_prio: try ANA for NVMe
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (14 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 15/19] libmultipath: ANA prioritizer: use nvme wrapper library Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 17/19] libmultipath/foreign/nvme: use failover topology Martin Wilck
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, lijie, dm-devel

Check whether an NVMe device supports ANA and, if so, use
the ANA prioritizer for it. This patch moves the ANA detection
code from the ANA prioritizer into generic code and uses it
from detect_prio().

Cc: lijie <lijie34@huawei.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/Makefile           |  1 +
 libmultipath/nvme-lib.c         | 13 +++++++++++++
 libmultipath/nvme-lib.h         |  6 ++++++
 libmultipath/prioritizers/ana.c |  6 ++----
 libmultipath/propsel.c          | 25 +++++++++++++++++++------
 5 files changed, 41 insertions(+), 10 deletions(-)
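
For illustration only (not part of the patch): a simplified sketch of the
per-bus selection that detect_prio() performs after this change. The
function pick_prio(), the enum bus_type and the parameter names are
hypothetical stand-ins; the sketch leaves out the RDAC check and takes
the CMIC byte and the sysfs result as plain arguments instead of reading
them from the device.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the bus type stored in the path */
enum bus_type { BUS_OTHER, BUS_SCSI, BUS_NVME };

/* CMIC bit 3 in the Identify Controller data: ANA reporting supported */
#define CMIC_ANA_REPORTING (1 << 3)

/*
 * Return the name of the prioritizer to auto-select, or NULL to keep
 * whatever is configured. The parameters are simplified stand-ins for
 * state that the real code reads from the path and from sysfs.
 */
static const char *pick_prio(enum bus_type bus, unsigned char cmic,
			     int tpgs, bool aas_in_sysfs)
{
	switch (bus) {
	case BUS_NVME:
		return (cmic & CMIC_ANA_REPORTING) ? "ana" : NULL;
	case BUS_SCSI:
		if (tpgs <= 0)
			return NULL;	/* no ALUA support reported */
		return aas_in_sysfs ? "sysfs" : "alua";
	default:
		return NULL;
	}
}
```
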

diff --git a/libmultipath/Makefile b/libmultipath/Makefile
index 7d27ea7f..78cca5a8 100644
--- a/libmultipath/Makefile
+++ b/libmultipath/Makefile
@@ -47,6 +47,7 @@ OBJS = memory.o parser.o vector.o devmapper.o callout.o \
 
 ifneq ($(call check_file,/usr/include/linux/nvme_ioctl.h),0)
 	OBJS += nvme-lib.o
+	CFLAGS += -Invme
 endif
 
 all: $(LIBS)
diff --git a/libmultipath/nvme-lib.c b/libmultipath/nvme-lib.c
index 9c32f369..f30e7698 100644
--- a/libmultipath/nvme-lib.c
+++ b/libmultipath/nvme-lib.c
@@ -34,3 +34,16 @@ int libmp_nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo)
 {
 	return nvme_ana_log(fd, ana_log, ana_log_len, rgo);
 }
+
+int nvme_id_ctrl_ana(int fd, struct nvme_id_ctrl *ctrl)
+{
+	int rc;
+	struct nvme_id_ctrl c;
+
+	rc = nvme_identify_ctrl(fd, &c);
+	if (rc < 0)
+		return rc;
+	if (ctrl)
+		*ctrl = c;
+	return c.cmic & (1 << 3) ? 1 : 0;
+}
diff --git a/libmultipath/nvme-lib.h b/libmultipath/nvme-lib.h
index 445c4f46..448dd993 100644
--- a/libmultipath/nvme-lib.h
+++ b/libmultipath/nvme-lib.h
@@ -9,6 +9,12 @@ int libmp_nvme_identify_ctrl(int fd, struct nvme_id_ctrl *ctrl);
 int libmp_nvme_identify_ns(int fd, __u32 nsid, bool present,
 			   struct nvme_id_ns *ns);
 int libmp_nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo);
+/*
+ * Identify controller, and return true if ANA is supported
+ * ctrl will be filled in if controller is identified, even w/o ANA
+ * ctrl may be NULL
+ */
+int nvme_id_ctrl_ana(int fd, struct nvme_id_ctrl *ctrl);
 
 #ifndef _NVME_LIB_C
 /*
diff --git a/libmultipath/prioritizers/ana.c b/libmultipath/prioritizers/ana.c
index 88edb224..b22e7b4a 100644
--- a/libmultipath/prioritizers/ana.c
+++ b/libmultipath/prioritizers/ana.c
@@ -125,13 +125,11 @@ int get_ana_info(struct path * pp, unsigned int timeout)
 	size_t ana_log_len;
 	bool is_anagrpid_const;
 
-	rc = nvme_identify_ctrl(pp->fd, &ctrl);
+	rc = nvme_id_ctrl_ana(pp->fd, &ctrl);
 	if (rc < 0) {
 		log_nvme_errcode(rc, pp->dev, "nvme_identify_ctrl");
 		return -ANA_ERR_GETCTRL_FAILED;
-	}
-
-	if(!(ctrl.cmic & (1 << 3)))
+	} else if (rc == 0)
 		return -ANA_ERR_NOT_SUPPORTED;
 
 	nsid = nvme_get_nsid(pp->fd);
diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index f5d87786..98068f34 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -5,6 +5,7 @@
  */
 #include <stdio.h>
 
+#include "nvme-lib.h"
 #include "checkers.h"
 #include "memory.h"
 #include "vector.h"
@@ -550,13 +551,25 @@ detect_prio(struct config *conf, struct path * pp)
 {
 	struct prio *p = &pp->prio;
 	char buff[512];
-	char *default_prio = PRIO_ALUA;
-
-	if (pp->tpgs <= 0)
-		return;
-	if (pp->tpgs == 2 || !check_rdac(pp)) {
-		if (sysfs_get_asymmetric_access_state(pp, buff, 512) >= 0)
+	char *default_prio;
+
+	switch(pp->bus) {
+	case SYSFS_BUS_NVME:
+		if (nvme_id_ctrl_ana(pp->fd, NULL) == 0)
+			return;
+		default_prio = PRIO_ANA;
+		break;
+	case SYSFS_BUS_SCSI:
+		if (pp->tpgs <= 0)
+			return;
+		if ((pp->tpgs == 2 || !check_rdac(pp)) &&
+		    sysfs_get_asymmetric_access_state(pp, buff, 512) >= 0)
 			default_prio = PRIO_SYSFS;
+		else
+			default_prio = PRIO_ALUA;
+		break;
+	default:
+		return;
 	}
 	prio_get(conf->multipath_dir, p, default_prio, DEFAULT_PRIO_ARGS);
 }
-- 
2.19.2

* [PATCH 17/19] libmultipath/foreign/nvme: use failover topology
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (15 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 16/19] libmultipath: detect_prio: try ANA for NVMe Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 18/19] libmultipath/foreign/nvme: show ANA state Martin Wilck
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

The native multipath driver does not use a multibus policy
as the current topology output of the NVMe foreign code
indicates. Rather, it uses failover policy, queueing all
IO to the current path until a failure occurs.

Change the data structures of the nvme foreign library
accordingly.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/foreign/nvme.c | 92 +++++++++++++++++++------------------
 1 file changed, 48 insertions(+), 44 deletions(-)
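
For illustration only (not part of the patch): a stripped-down sketch of
the "one path group per path" layout. Each path embeds its own group, and
the group's single slot points back to the path, so conversion works in
both directions without extra allocations. The struct and function names
are hypothetical and much simpler than the vector-based types in nvme.c.

```c
#include <stddef.h>

struct pathgroup {
	void *slot[1];		/* failover: exactly one path per group */
};

struct path {
	const char *name;
	struct pathgroup pg;	/* embedded: the group lives and dies with the path */
};

static void path_init(struct path *p, const char *name)
{
	p->name = name;
	p->pg.slot[0] = p;	/* the group's only member is its owner */
}

static struct path *pg_to_path(const struct pathgroup *pg)
{
	return pg->slot[0];
}

static struct pathgroup *path_to_pg(struct path *p)
{
	return &p->pg;
}
```

Because the group is embedded, freeing the path also releases its group,
which is why the map only needs to keep a vector of groups.
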

diff --git a/libmultipath/foreign/nvme.c b/libmultipath/foreign/nvme.c
index c753a747..11849889 100644
--- a/libmultipath/foreign/nvme.c
+++ b/libmultipath/foreign/nvme.c
@@ -40,17 +40,22 @@ static const char N_A[] = "n/a";
 const char *THIS;
 
 struct nvme_map;
+struct nvme_pathgroup {
+	struct gen_pathgroup gen;
+	struct _vector pathvec;
+};
+
 struct nvme_path {
 	struct gen_path gen;
 	struct udev_device *udev;
 	struct udev_device *ctl;
 	struct nvme_map *map;
 	bool seen;
-};
-
-struct nvme_pathgroup {
-	struct gen_pathgroup gen;
-	vector pathvec;
+	/*
+	 * The kernel works in failover mode.
+	 * Each path has a separate path group.
+	 */
+	struct nvme_pathgroup pg;
 };
 
 struct nvme_map {
@@ -58,11 +63,7 @@ struct nvme_map {
 	struct udev_device *udev;
 	struct udev_device *subsys;
 	dev_t devt;
-	/* Just one static pathgroup for NVMe for now */
-	struct nvme_pathgroup pg;
-	struct gen_pathgroup *gpg;
 	struct _vector pgvec;
-	vector pathvec;
 	int nr_live;
 };
 
@@ -76,29 +77,33 @@ struct nvme_map {
 #define const_gen_path_to_nvme(g) ((const struct nvme_path*)(g))
 #define gen_path_to_nvme(g) ((struct nvme_path*)(g))
 #define nvme_path_to_gen(n) &((n)->gen)
+#define nvme_pg_to_path(x) (VECTOR_SLOT(&((x)->pathvec), 0))
+#define nvme_path_to_pg(x) &((x)->pg)
 
 static void cleanup_nvme_path(struct nvme_path *path)
 {
 	condlog(5, "%s: %p %p", __func__, path, path->udev);
 	if (path->udev)
 		udev_device_unref(path->udev);
+	vector_reset(&path->pg.pathvec);
+
 	/* ctl is implicitly referenced by udev, no need to unref */
 	free(path);
 }
 
 static void cleanup_nvme_map(struct nvme_map *map)
 {
-	if (map->pathvec) {
-		struct nvme_path *path;
-		int i;
+	struct nvme_pathgroup *pg;
+	struct nvme_path *path;
+	int i;
 
-		vector_foreach_slot_backwards(map->pathvec, path, i) {
-			condlog(5, "%s: %d %p", __func__, i, path);
-			cleanup_nvme_path(path);
-			vector_del_slot(map->pathvec, i);
-		}
+	vector_foreach_slot_backwards(&map->pgvec, pg, i) {
+		path = nvme_pg_to_path(pg);
+		condlog(5, "%s: %d %p", __func__, i, path);
+		cleanup_nvme_path(path);
+		vector_del_slot(&map->pgvec, i);
 	}
-	vector_free(map->pathvec);
+	vector_reset(&map->pgvec);
 	if (map->udev)
 		udev_device_unref(map->udev);
 	/* subsys is implicitly referenced by udev, no need to unref */
@@ -190,7 +195,7 @@ nvme_pg_get_paths(const struct gen_pathgroup *gpg) {
 	const struct nvme_pathgroup *gp = const_gen_pg_to_nvme(gpg);
 
 	/* This is all used under the lock, no need to copy */
-	return gp->pathvec;
+	return &gp->pathvec;
 }
 
 static void
@@ -432,7 +437,7 @@ static struct nvme_map *_find_nvme_map_by_devt(const struct context *ctx,
 static struct nvme_path *
 _find_path_by_syspath(struct nvme_map *map, const char *syspath)
 {
-	struct nvme_path *path;
+	struct nvme_pathgroup *pg;
 	char real[PATH_MAX];
 	const char *ppath;
 	int i;
@@ -443,7 +448,9 @@ _find_path_by_syspath(struct nvme_map *map, const char *syspath)
 		ppath = syspath;
 	}
 
-	vector_foreach_slot(map->pathvec, path, i) {
+	vector_foreach_slot(&map->pgvec, pg, i) {
+		struct nvme_path *path = nvme_pg_to_path(pg);
+
 		if (!strcmp(ppath,
 			    udev_device_get_syspath(path->udev)))
 			return path;
@@ -537,14 +544,17 @@ static void _find_controllers(struct context *ctx, struct nvme_map *map)
 	struct dirent **di = NULL;
 	struct scandir_result sr;
 	struct udev_device *subsys;
+	struct nvme_pathgroup *pg;
 	struct nvme_path *path;
 	int r, i, n;
 
 	if (map == NULL || map->udev == NULL)
 		return;
 
-	vector_foreach_slot(map->pathvec, path, i)
+	vector_foreach_slot(&map->pgvec, pg, i) {
+		path = nvme_pg_to_path(pg);
 		path->seen = false;
+	}
 
 	subsys = udev_device_get_parent_with_subsystem_devtype(map->udev,
 							       "nvme-subsystem",
@@ -606,7 +616,8 @@ static void _find_controllers(struct context *ctx, struct nvme_map *map)
 		if (udev == NULL)
 			continue;
 
-		path = _find_path_by_syspath(map, udev_device_get_syspath(udev));
+		path = _find_path_by_syspath(map,
+					     udev_device_get_syspath(udev));
 		if (path != NULL) {
 			path->seen = true;
 			condlog(4, "%s: %s already known",
@@ -630,24 +641,30 @@ static void _find_controllers(struct context *ctx, struct nvme_map *map)
 			cleanup_nvme_path(path);
 			continue;
 		}
-
-		if (vector_alloc_slot(map->pathvec) == NULL) {
+		path->pg.gen.ops = &nvme_pg_ops;
+		if (vector_alloc_slot(&path->pg.pathvec) == NULL) {
 			cleanup_nvme_path(path);
 			continue;
 		}
+		vector_set_slot(&path->pg.pathvec, path);
+		if (vector_alloc_slot(&map->pgvec) == NULL) {
+			cleanup_nvme_path(path);
+			continue;
+		}
+		vector_set_slot(&map->pgvec, &path->pg);
 		condlog(3, "%s: %s: new path %s added to %s",
 			__func__, THIS, udev_device_get_sysname(udev),
 			udev_device_get_sysname(map->udev));
-		vector_set_slot(map->pathvec, path);
 	}
 	pthread_cleanup_pop(1);
 
 	map->nr_live = 0;
-	vector_foreach_slot_backwards(map->pathvec, path, i) {
+	vector_foreach_slot_backwards(&map->pgvec, pg, i) {
+		path = nvme_pg_to_path(pg);
 		if (!path->seen) {
 			condlog(1, "path %d not found in %s any more",
 				i, udev_device_get_sysname(map->udev));
-			vector_del_slot(map->pathvec, i);
+			vector_del_slot(&map->pgvec, i);
 			cleanup_nvme_path(path);
 		} else {
 			static const char live_state[] = "live";
@@ -661,7 +678,7 @@ static void _find_controllers(struct context *ctx, struct nvme_map *map)
 	}
 	condlog(3, "%s: %s: map %s has %d/%d live paths", __func__, THIS,
 		udev_device_get_sysname(map->udev), map->nr_live,
-		VECTOR_SIZE(map->pathvec));
+		VECTOR_SIZE(&map->pgvec));
 }
 
 static int _add_map(struct context *ctx, struct udev_device *ud,
@@ -686,19 +703,6 @@ static int _add_map(struct context *ctx, struct udev_device *ud,
 	map->subsys = subsys;
 	map->gen.ops = &nvme_map_ops;
 
-	map->pathvec = vector_alloc();
-	if (map->pathvec == NULL) {
-		cleanup_nvme_map(map);
-		return FOREIGN_ERR;
-	}
-
-	map->pg.gen.ops = &nvme_pg_ops;
-	map->pg.pathvec = map->pathvec;
-	map->gpg = nvme_pg_to_gen(&map->pg);
-
-	map->pgvec.allocated = 1;
-	map->pgvec.slot = (void**)&map->gpg;
-
 	if (vector_alloc_slot(ctx->mpvec) == NULL) {
 		cleanup_nvme_map(map);
 		return FOREIGN_ERR;
@@ -842,8 +846,8 @@ const struct _vector * get_paths(const struct context *ctx)
 	condlog(5, "%s called for \"%s\"", __func__, THIS);
 	vector_foreach_slot(ctx->mpvec, gm, i) {
 		const struct nvme_map *nm = const_gen_mp_to_nvme(gm);
-		paths = vector_convert(paths, nm->pathvec,
-				       struct gen_path, identity);
+		paths = vector_convert(paths, &nm->pgvec,
+				       struct nvme_pathgroup, nvme_pg_to_path);
 	}
 	return paths;
 }
-- 
2.19.2

* [PATCH 18/19] libmultipath/foreign/nvme: show ANA state
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (16 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 17/19] libmultipath/foreign/nvme: use failover topology Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-18 23:19 ` [PATCH 19/19] libmultipath/foreign/nvme: indicate ANA support Martin Wilck
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

Obtain the ana_state attribute from the kernel and
use it to display information about path state and
"priority" of native NVMe multipath.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/foreign/nvme.c | 43 +++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 7 deletions(-)
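
For illustration only (not part of the patch): a standalone sketch of the
mapping from the ana_state sysfs string to a display "priority". The
values 50, 10 and 0 are the ones used in the diff; the helper names
strip_trailing() and ana_state_str_to_prio() are hypothetical.

```c
#include <ctype.h>
#include <string.h>

/* Remove trailing whitespace (sysfs attribute values end in '\n') */
static void strip_trailing(char *s)
{
	size_t n = strlen(s);

	while (n > 0 && isspace((unsigned char)s[n - 1]))
		s[--n] = '\0';
}

/* Map the content of the ana_state attribute to a priority value */
static int ana_state_str_to_prio(char *state)
{
	strip_trailing(state);
	if (!strcmp(state, "optimized"))
		return 50;
	if (!strcmp(state, "non-optimized"))
		return 10;
	return 0;	/* every other state, e.g. "inaccessible" */
}
```
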

diff --git a/libmultipath/foreign/nvme.c b/libmultipath/foreign/nvme.c
index 11849889..bda9bcc4 100644
--- a/libmultipath/foreign/nvme.c
+++ b/libmultipath/foreign/nvme.c
@@ -204,12 +204,6 @@ nvme_pg_rel_paths(const struct gen_pathgroup *gpg, const struct _vector *v)
 	/* empty */
 }
 
-static int snprint_nvme_pg(const struct gen_pathgroup *gmp,
-			   char *buff, int len, char wildcard)
-{
-	return snprintf(buff, len, N_A);
-}
-
 static int snprint_hcil(const struct nvme_path *np, char *buf, int len)
 {
 	unsigned int nvmeid, ctlid, nsid;
@@ -249,6 +243,23 @@ static int snprint_nvme_path(const struct gen_path *gp,
 	case 'o':
 		sysfs_attr_get_value(np->ctl, "state", fld, sizeof(fld));
 		return snprintf(buff, len, "%s", fld);
+	case 'T':
+		if (sysfs_attr_get_value(np->udev, "ana_state", fld,
+					 sizeof(fld)) > 0)
+			return snprintf(buff, len, "%s", fld);
+		break;
+	case 'p':
+		if (sysfs_attr_get_value(np->udev, "ana_state", fld,
+					 sizeof(fld)) > 0) {
+			rstrip(fld);
+			if (!strcmp(fld, "optimized"))
+				return snprintf(buff, len, "%d", 50);
+			else if (!strcmp(fld, "non-optimized"))
+				return snprintf(buff, len, "%d", 10);
+			else
+				return snprintf(buff, len, "%d", 0);
+		}
+		break;
 	case 's':
 		snprintf(fld, sizeof(fld), "%s",
 			 udev_device_get_sysattr_value(np->ctl,
@@ -286,12 +297,30 @@ static int snprint_nvme_path(const struct gen_path *gp,
 					udev_device_get_sysname(pci));
 		/* fall through */
 	default:
-		return snprintf(buff, len, "%s", N_A);
 		break;
 	}
+	return snprintf(buff, len, "%s", N_A);
 	return 0;
 }
 
+static int snprint_nvme_pg(const struct gen_pathgroup *gmp,
+			   char *buff, int len, char wildcard)
+{
+	const struct nvme_pathgroup *pg = const_gen_pg_to_nvme(gmp);
+	const struct nvme_path *path = nvme_pg_to_path(pg);
+
+	switch (wildcard) {
+	case 't':
+		return snprint_nvme_path(nvme_path_to_gen(path),
+					 buff, len, 'T');
+	case 'p':
+		return snprint_nvme_path(nvme_path_to_gen(path),
+					 buff, len, 'p');
+	default:
+		return snprintf(buff, len, N_A);
+	}
+}
+
 static int nvme_style(const struct gen_multipath* gm,
 		      char *buf, int len, int verbosity)
 {
-- 
2.19.2

* [PATCH 19/19] libmultipath/foreign/nvme: indicate ANA support
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (17 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 18/19] libmultipath/foreign/nvme: show ANA state Martin Wilck
@ 2018-12-18 23:19 ` Martin Wilck
  2018-12-20 23:24 ` [PATCH 00/19] san_path_err & multipath " Benjamin Marzinski
  2018-12-21 16:06 ` Benjamin Marzinski
  20 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-18 23:19 UTC (permalink / raw)
  To: Christophe Varoqui; +Cc: Martin Wilck, dm-devel

Indicate ANA support in the "hwhandler" output field.

Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/foreign/Makefile |  2 +-
 libmultipath/foreign/nvme.c   | 47 +++++++++++++++++++++++++++++++++--
 2 files changed, 46 insertions(+), 3 deletions(-)
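
For illustration only (not part of the patch): a sketch of the tri-state
bookkeeping behind the "hwhandler" field. The first successful probe
decides, failed probes leave the state undefined so that a later path can
retry. The names tristate, map_state, record_probe() and
hwhandler_field() are hypothetical; in the patch the early return happens
before the controller device is opened at all.

```c
enum tristate { TS_UNDEF, TS_NO, TS_YES };

struct map_state {
	enum tristate ana_supported;
};

/* rc follows the probe convention: <0 error, 0 no ANA, 1 ANA supported */
static void record_probe(struct map_state *m, int rc)
{
	if (m->ana_supported != TS_UNDEF)
		return;		/* already decided, keep the first answer */
	if (rc < 0)
		return;		/* probe failed: stay undefined, retry later */
	m->ana_supported = (rc == 1) ? TS_YES : TS_NO;
}

/* Text for the "hwhandler" output field */
static const char *hwhandler_field(const struct map_state *m)
{
	return m->ana_supported == TS_YES ? "ANA" : "n/a";
}
```
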

diff --git a/libmultipath/foreign/Makefile b/libmultipath/foreign/Makefile
index fe98ddf7..713762cb 100644
--- a/libmultipath/foreign/Makefile
+++ b/libmultipath/foreign/Makefile
@@ -3,7 +3,7 @@
 #
 include ../../Makefile.inc
 
-CFLAGS += $(LIB_CFLAGS) -I..
+CFLAGS += $(LIB_CFLAGS) -I.. -I../nvme
 
 # If you add or remove a checker also update multipath/multipath.conf.5
 LIBS= \
diff --git a/libmultipath/foreign/nvme.c b/libmultipath/foreign/nvme.c
index bda9bcc4..838e450e 100644
--- a/libmultipath/foreign/nvme.c
+++ b/libmultipath/foreign/nvme.c
@@ -15,6 +15,8 @@
   along with this program.  If not, see <https://www.gnu.org/licenses/>.
 */
 
+#include "nvme-lib.h"
+#include <sys/types.h>
 #include <sys/sysmacros.h>
 #include <libudev.h>
 #include <stdio.h>
@@ -27,6 +29,7 @@
 #include <dirent.h>
 #include <errno.h>
 #include <ctype.h>
+#include <fcntl.h>
 #include "util.h"
 #include "vector.h"
 #include "generic.h"
@@ -65,6 +68,7 @@ struct nvme_map {
 	dev_t devt;
 	struct _vector pgvec;
 	int nr_live;
+	int ana_supported;
 };
 
 #define NAME_LEN 64 /* buffer length for temp attributes */
@@ -183,11 +187,14 @@ static int snprint_nvme_map(const struct gen_multipath *gmp,
 			return snprintf(buff, len, "%s", "rw");
 	case 'G':
 		return snprintf(buff, len, "%s", THIS);
+	case 'h':
+		if (nvm->ana_supported == YNU_YES)
+			return snprintf(buff, len, "ANA");
 	default:
-		return snprintf(buff, len, N_A);
 		break;
 	}
-	return 0;
+
+	return snprintf(buff, len, N_A);
 }
 
 static const struct _vector*
@@ -567,6 +574,40 @@ out:
 	return blkdev;
 }
 
+static void test_ana_support(struct nvme_map *map, struct udev_device *ctl)
+{
+	const char *dev_t;
+	char sys_path[64];
+	long fd;
+	int rc;
+
+	if (map->ana_supported != YNU_UNDEF)
+		return;
+
+	dev_t = udev_device_get_sysattr_value(ctl, "dev");
+	if (snprintf(sys_path, sizeof(sys_path), "/dev/char/%s", dev_t)
+	    >= sizeof(sys_path))
+		return;
+
+	fd = open(sys_path, O_RDONLY);
+	if (fd == -1) {
+		condlog(2, "%s: error opening %s", __func__, sys_path);
+		return;
+	}
+
+	pthread_cleanup_push(close_fd, (void *)fd);
+	rc = nvme_id_ctrl_ana(fd, NULL);
+	if (rc < 0)
+		condlog(2, "%s: error in nvme_id_ctrl: %s", __func__,
+			strerror(errno));
+	else {
+		map->ana_supported = (rc == 1 ? YNU_YES : YNU_NO);
+		condlog(3, "%s: NVMe ctrl %s: ANA %s supported", __func__, dev_t,
+			rc == 1 ? "is" : "is not");
+	}
+	pthread_cleanup_pop(1);
+}
+
 static void _find_controllers(struct context *ctx, struct nvme_map *map)
 {
 	char pathbuf[PATH_MAX], realbuf[PATH_MAX];
@@ -670,6 +711,8 @@ static void _find_controllers(struct context *ctx, struct nvme_map *map)
 			cleanup_nvme_path(path);
 			continue;
 		}
+		test_ana_support(map, path->ctl);
+
 		path->pg.gen.ops = &nvme_pg_ops;
 		if (vector_alloc_slot(&path->pg.pathvec) == NULL) {
 			cleanup_nvme_path(path);
-- 
2.19.2

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"
  2018-12-18 23:19 ` [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature" Martin Wilck
@ 2018-12-19 11:32   ` Muneendra Kumar M
  2018-12-19 12:02     ` Martin Wilck
  0 siblings, 1 reply; 36+ messages in thread
From: Muneendra Kumar M @ 2018-12-19 11:32 UTC (permalink / raw)
  To: Martin Wilck, Christophe Varoqui
  Cc: Guan Junxiong, Muneendra Kumar M, dm-devel, M Muneendra Kumar

Hi Martin,
In the cover letter, "[PATCH 00/19] san_path_err & multipath ANA support",
you mentioned that san_path_err_XXX has some merits
over marginal_path_err_XXX.

Is this understanding correct? If so, could you please explain the
use case in which san_path_err_XXX is better?

I would say marginal_path_err_XXX is a superset of san_path_err_XXX.

If we need both san_path_err_XXX and marginal_path_err_XXX, then so many
configuration options will really confuse customers.


Regards,
Muneendra.


-----Original Message-----
From: Martin Wilck [mailto:mwilck@suse.com]
Sent: Wednesday, December 19, 2018 4:49 AM
To: Christophe Varoqui <christophe.varoqui@opensvc.com>
Cc: Benjamin Marzinski <bmarzins@redhat.com>; dm-devel@redhat.com; Hannes
Reinecke <hare@suse.de>; Martin Wilck <mwilck@suse.com>; Guan Junxiong
<guanjunxiong@huawei.com>; M Muneendra Kumar <mmandala@brocade.com>
Subject: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX
feature"

This reverts commit 9cf6a48f18a291982af34b4fb0110654b94e591c.
We removed this functionality prematurely. I am not convinced that the
"marginal_path" code really replaces it. Let customers evaluate the
different options, and vote with their feet.

Cc: Guan Junxiong <guanjunxiong@huawei.com>
Cc: M Muneendra Kumar <mmandala@brocade.com>
Signed-off-by: Martin Wilck <mwilck@suse.com>
---
 libmultipath/config.c      |  3 ++
 libmultipath/config.h      |  9 ++++
 libmultipath/configure.c   |  3 ++
 libmultipath/dict.c        | 39 ++++++++++++++++++
 libmultipath/propsel.c     | 53 ++++++++++++++++++++++++
 libmultipath/propsel.h     |  3 ++
 libmultipath/structs.h     |  7 ++++
 multipath/multipath.conf.5 | 57 ++++++++++++++++++++++++++
 multipathd/main.c          | 84 ++++++++++++++++++++++++++++++++++++++
 9 files changed, 258 insertions(+)

diff --git a/libmultipath/config.c b/libmultipath/config.c
index 5af7af58..24d71aed 100644
--- a/libmultipath/config.c
+++ b/libmultipath/config.c
@@ -369,6 +369,9 @@ merge_hwe (struct hwentry * dst, struct hwentry * src)
 	merge_num(max_sectors_kb);
 	merge_num(ghost_delay);
 	merge_num(all_tg_pt);
+	merge_num(san_path_err_threshold);
+	merge_num(san_path_err_forget_rate);
+	merge_num(san_path_err_recovery_time);

 	snprintf(id, sizeof(id), "%s/%s", dst->vendor, dst->product);
 	reconcile_features_with_options(id, &dst->features,
diff --git a/libmultipath/config.h b/libmultipath/config.h
index 7d0cd9a6..b938c26c 100644
--- a/libmultipath/config.h
+++ b/libmultipath/config.h
@@ -76,6 +76,9 @@ struct hwentry {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
+	int san_path_err_threshold;
+	int san_path_err_forget_rate;
+	int san_path_err_recovery_time;
 	int marginal_path_err_sample_time;
 	int marginal_path_err_rate_threshold;
 	int marginal_path_err_recheck_gap_time;
@@ -112,6 +115,9 @@ struct mpentry {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
+	int san_path_err_threshold;
+	int san_path_err_forget_rate;
+	int san_path_err_recovery_time;
 	int marginal_path_err_sample_time;
 	int marginal_path_err_rate_threshold;
 	int marginal_path_err_recheck_gap_time;
@@ -162,6 +168,9 @@ struct config {
 	int processed_main_config;
 	int delay_watch_checks;
 	int delay_wait_checks;
+	int san_path_err_threshold;
+	int san_path_err_forget_rate;
+	int san_path_err_recovery_time;
 	int marginal_path_err_sample_time;
 	int marginal_path_err_rate_threshold;
 	int marginal_path_err_recheck_gap_time;
diff --git a/libmultipath/configure.c b/libmultipath/configure.c
index 84ae5f56..60a98873 100644
--- a/libmultipath/configure.c
+++ b/libmultipath/configure.c
@@ -309,6 +309,9 @@ int setup_map(struct multipath *mpp, char *params, int params_size,
 	select_deferred_remove(conf, mpp);
 	select_delay_watch_checks(conf, mpp);
 	select_delay_wait_checks(conf, mpp);
+	select_san_path_err_threshold(conf, mpp);
+	select_san_path_err_forget_rate(conf, mpp);
+	select_san_path_err_recovery_time(conf, mpp);
 	select_marginal_path_err_sample_time(conf, mpp);
 	select_marginal_path_err_rate_threshold(conf, mpp);
 	select_marginal_path_err_recheck_gap_time(conf, mpp);
diff --git a/libmultipath/dict.c b/libmultipath/dict.c
index a81c051f..fd29abca 100644
--- a/libmultipath/dict.c
+++ b/libmultipath/dict.c
@@ -1217,6 +1217,33 @@ declare_hw_handler(delay_wait_checks, set_off_int_undef)
 declare_hw_snprint(delay_wait_checks, print_off_int_undef)
 declare_mp_handler(delay_wait_checks, set_off_int_undef)
 declare_mp_snprint(delay_wait_checks, print_off_int_undef)
+declare_def_handler(san_path_err_threshold, set_off_int_undef)
+declare_def_snprint_defint(san_path_err_threshold, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(san_path_err_threshold, set_off_int_undef)
+declare_ovr_snprint(san_path_err_threshold, print_off_int_undef)
+declare_hw_handler(san_path_err_threshold, set_off_int_undef)
+declare_hw_snprint(san_path_err_threshold, print_off_int_undef)
+declare_mp_handler(san_path_err_threshold, set_off_int_undef)
+declare_mp_snprint(san_path_err_threshold, print_off_int_undef)
+declare_def_handler(san_path_err_forget_rate, set_off_int_undef)
+declare_def_snprint_defint(san_path_err_forget_rate, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(san_path_err_forget_rate, set_off_int_undef)
+declare_ovr_snprint(san_path_err_forget_rate, print_off_int_undef)
+declare_hw_handler(san_path_err_forget_rate, set_off_int_undef)
+declare_hw_snprint(san_path_err_forget_rate, print_off_int_undef)
+declare_mp_handler(san_path_err_forget_rate, set_off_int_undef)
+declare_mp_snprint(san_path_err_forget_rate, print_off_int_undef)
+declare_def_handler(san_path_err_recovery_time, set_off_int_undef)
+declare_def_snprint_defint(san_path_err_recovery_time, print_off_int_undef,
+			   DEFAULT_ERR_CHECKS)
+declare_ovr_handler(san_path_err_recovery_time, set_off_int_undef)
+declare_ovr_snprint(san_path_err_recovery_time, print_off_int_undef)
+declare_hw_handler(san_path_err_recovery_time, set_off_int_undef)
+declare_hw_snprint(san_path_err_recovery_time, print_off_int_undef)
+declare_mp_handler(san_path_err_recovery_time, set_off_int_undef)
+declare_mp_snprint(san_path_err_recovery_time, print_off_int_undef)
 declare_def_handler(marginal_path_err_sample_time, set_off_int_undef)
 declare_def_snprint_defint(marginal_path_err_sample_time, print_off_int_undef,
 			   DEFAULT_ERR_CHECKS)
@@ -1620,6 +1647,9 @@ init_keywords(vector keywords)
 	install_keyword("config_dir", &def_config_dir_handler, &snprint_def_config_dir);
 	install_keyword("delay_watch_checks", &def_delay_watch_checks_handler, &snprint_def_delay_watch_checks);
 	install_keyword("delay_wait_checks", &def_delay_wait_checks_handler, &snprint_def_delay_wait_checks);
+	install_keyword("san_path_err_threshold", &def_san_path_err_threshold_handler, &snprint_def_san_path_err_threshold);
+	install_keyword("san_path_err_forget_rate", &def_san_path_err_forget_rate_handler, &snprint_def_san_path_err_forget_rate);
+	install_keyword("san_path_err_recovery_time", &def_san_path_err_recovery_time_handler, &snprint_def_san_path_err_recovery_time);
 	install_keyword("marginal_path_err_sample_time", &def_marginal_path_err_sample_time_handler, &snprint_def_marginal_path_err_sample_time);
 	install_keyword("marginal_path_err_rate_threshold", &def_marginal_path_err_rate_threshold_handler, &snprint_def_marginal_path_err_rate_threshold);
 	install_keyword("marginal_path_err_recheck_gap_time", &def_marginal_path_err_recheck_gap_time_handler, &snprint_def_marginal_path_err_recheck_gap_time);
@@ -1714,6 +1744,9 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &hw_deferred_remove_handler, &snprint_hw_deferred_remove);
 	install_keyword("delay_watch_checks", &hw_delay_watch_checks_handler, &snprint_hw_delay_watch_checks);
 	install_keyword("delay_wait_checks", &hw_delay_wait_checks_handler, &snprint_hw_delay_wait_checks);
+	install_keyword("san_path_err_threshold", &hw_san_path_err_threshold_handler, &snprint_hw_san_path_err_threshold);
+	install_keyword("san_path_err_forget_rate", &hw_san_path_err_forget_rate_handler, &snprint_hw_san_path_err_forget_rate);
+	install_keyword("san_path_err_recovery_time", &hw_san_path_err_recovery_time_handler, &snprint_hw_san_path_err_recovery_time);
 	install_keyword("marginal_path_err_sample_time", &hw_marginal_path_err_sample_time_handler, &snprint_hw_marginal_path_err_sample_time);
 	install_keyword("marginal_path_err_rate_threshold", &hw_marginal_path_err_rate_threshold_handler, &snprint_hw_marginal_path_err_rate_threshold);
 	install_keyword("marginal_path_err_recheck_gap_time", &hw_marginal_path_err_recheck_gap_time_handler, &snprint_hw_marginal_path_err_recheck_gap_time);
@@ -1750,6 +1783,9 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &ovr_deferred_remove_handler, &snprint_ovr_deferred_remove);
 	install_keyword("delay_watch_checks", &ovr_delay_watch_checks_handler, &snprint_ovr_delay_watch_checks);
 	install_keyword("delay_wait_checks", &ovr_delay_wait_checks_handler, &snprint_ovr_delay_wait_checks);
+	install_keyword("san_path_err_threshold", &ovr_san_path_err_threshold_handler, &snprint_ovr_san_path_err_threshold);
+	install_keyword("san_path_err_forget_rate", &ovr_san_path_err_forget_rate_handler, &snprint_ovr_san_path_err_forget_rate);
+	install_keyword("san_path_err_recovery_time", &ovr_san_path_err_recovery_time_handler, &snprint_ovr_san_path_err_recovery_time);
 	install_keyword("marginal_path_err_sample_time", &ovr_marginal_path_err_sample_time_handler, &snprint_ovr_marginal_path_err_sample_time);
 	install_keyword("marginal_path_err_rate_threshold", &ovr_marginal_path_err_rate_threshold_handler, &snprint_ovr_marginal_path_err_rate_threshold);
 	install_keyword("marginal_path_err_recheck_gap_time", &ovr_marginal_path_err_recheck_gap_time_handler, &snprint_ovr_marginal_path_err_recheck_gap_time);
@@ -1785,6 +1821,9 @@ init_keywords(vector keywords)
 	install_keyword("deferred_remove", &mp_deferred_remove_handler, &snprint_mp_deferred_remove);
 	install_keyword("delay_watch_checks", &mp_delay_watch_checks_handler, &snprint_mp_delay_watch_checks);
 	install_keyword("delay_wait_checks", &mp_delay_wait_checks_handler, &snprint_mp_delay_wait_checks);
+	install_keyword("san_path_err_threshold", &mp_san_path_err_threshold_handler, &snprint_mp_san_path_err_threshold);
+	install_keyword("san_path_err_forget_rate", &mp_san_path_err_forget_rate_handler, &snprint_mp_san_path_err_forget_rate);
+	install_keyword("san_path_err_recovery_time", &mp_san_path_err_recovery_time_handler, &snprint_mp_san_path_err_recovery_time);
 	install_keyword("marginal_path_err_sample_time", &mp_marginal_path_err_sample_time_handler, &snprint_mp_marginal_path_err_sample_time);
 	install_keyword("marginal_path_err_rate_threshold", &mp_marginal_path_err_rate_threshold_handler, &snprint_mp_marginal_path_err_rate_threshold);
 	install_keyword("marginal_path_err_recheck_gap_time", &mp_marginal_path_err_recheck_gap_time_handler, &snprint_mp_marginal_path_err_recheck_gap_time);
diff --git a/libmultipath/propsel.c b/libmultipath/propsel.c
index 7b19fed0..a4d114c0 100644
--- a/libmultipath/propsel.c
+++ b/libmultipath/propsel.c
@@ -879,6 +879,59 @@ out:

 }

+int select_san_path_err_threshold(struct config *conf, struct multipath *mp)
+{
+	const char *origin;
+	char buff[12];
+
+	mp_set_mpe(san_path_err_threshold);
+	mp_set_ovr(san_path_err_threshold);
+	mp_set_hwe(san_path_err_threshold);
+	mp_set_conf(san_path_err_threshold);
+	mp_set_default(san_path_err_threshold, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, mp->san_path_err_threshold);
+	condlog(3, "%s: san_path_err_threshold = %s %s", mp->alias, buff,
+		origin);
+	return 0;
+}
+
+int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp)
+{
+	const char *origin;
+	char buff[12];
+
+	mp_set_mpe(san_path_err_forget_rate);
+	mp_set_ovr(san_path_err_forget_rate);
+	mp_set_hwe(san_path_err_forget_rate);
+	mp_set_conf(san_path_err_forget_rate);
+	mp_set_default(san_path_err_forget_rate, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, mp->san_path_err_forget_rate);
+	condlog(3, "%s: san_path_err_forget_rate = %s %s", mp->alias,
+		buff, origin);
+	return 0;
+
+}
+
+int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp)
+{
+	const char *origin;
+	char buff[12];
+
+	mp_set_mpe(san_path_err_recovery_time);
+	mp_set_ovr(san_path_err_recovery_time);
+	mp_set_hwe(san_path_err_recovery_time);
+	mp_set_conf(san_path_err_recovery_time);
+	mp_set_default(san_path_err_recovery_time, DEFAULT_ERR_CHECKS);
+out:
+	print_off_int_undef(buff, 12, mp->san_path_err_recovery_time);
+	condlog(3, "%s: san_path_err_recovery_time = %s %s", mp->alias,
+		buff, origin);
+	return 0;
+
+}
+
 int select_marginal_path_err_sample_time(struct config *conf, struct multipath *mp)
 {
 	const char *origin;
diff --git a/libmultipath/propsel.h b/libmultipath/propsel.h
index ae99b927..b352c16a 100644
--- a/libmultipath/propsel.h
+++ b/libmultipath/propsel.h
@@ -26,6 +26,9 @@ int select_delay_watch_checks (struct config *conf, struct multipath * mp);
 int select_delay_wait_checks (struct config *conf, struct multipath * mp);
 int select_skip_kpartx (struct config *conf, struct multipath * mp);
 int select_max_sectors_kb (struct config *conf, struct multipath * mp);
+int select_san_path_err_forget_rate(struct config *conf, struct multipath *mp);
+int select_san_path_err_threshold(struct config *conf, struct multipath *mp);
+int select_san_path_err_recovery_time(struct config *conf, struct multipath *mp);
 int select_marginal_path_err_sample_time(struct config *conf, struct multipath *mp);
 int select_marginal_path_err_rate_threshold(struct config *conf, struct multipath *mp);
 int select_marginal_path_err_recheck_gap_time(struct config *conf, struct multipath *mp);
diff --git a/libmultipath/structs.h b/libmultipath/structs.h
index d8961164..96df8c8a 100644
--- a/libmultipath/structs.h
+++ b/libmultipath/structs.h
@@ -280,6 +280,10 @@ struct path {
 	int initialized;
 	int retriggers;
 	int wwid_changed;
+	unsigned int path_failures;
+	time_t dis_reinstate_time;
+	int disable_reinstate;
+	int san_path_err_forget_rate;
 	time_t io_err_dis_reinstate_time;
 	int io_err_disable_reinstate;
 	int io_err_pathfail_cnt;
@@ -318,6 +322,9 @@ struct multipath {
 	int deferred_remove;
 	int delay_watch_checks;
 	int delay_wait_checks;
+	int san_path_err_threshold;
+	int san_path_err_forget_rate;
+	int san_path_err_recovery_time;
 	int marginal_path_err_sample_time;
 	int marginal_path_err_rate_threshold;
 	int marginal_path_err_recheck_gap_time;
diff --git a/multipath/multipath.conf.5 b/multipath/multipath.conf.5
index 68119baa..35e6d37c 100644
--- a/multipath/multipath.conf.5
+++ b/multipath/multipath.conf.5
@@ -891,6 +891,45 @@ The default is: \fB/etc/multipath/conf.d/\fR
 .
 .
 .TP
+.B san_path_err_threshold
+If set to a value greater than 0, multipathd will watch paths and check
+how many times a path has been failed due to errors.If the number of
+failures on a particular path is greater then the
+san_path_err_threshold then the path will not  reinstante till
+san_path_err_recovery_time.These path failures should occur within a
+san_path_err_forget_rate checks, if not we will consider the path is good enough to reinstantate.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
+.B san_path_err_forget_rate
+If set to a value greater than 0, multipathd will check whether the
+path failures has exceeded  the san_path_err_threshold within this many
+checks i.e san_path_err_forget_rate . If so we will not reinstante the
+path till san_path_err_recovery_time.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
+.B san_path_err_recovery_time
+If set to a value greater than 0, multipathd will make sure that when
+path failures has exceeded the san_path_err_threshold within
+san_path_err_forget_rate then the path will be placed in failed state
+for san_path_err_recovery_time duration.Once san_path_err_recovery_time has timeout  we will reinstante the failed path .
+san_path_err_recovery_time value should be in secs.
+.RS
+.TP
+The default is: \fBno\fR
+.RE
+.
+.
+.TP
 .B marginal_path_double_failed_time
 One of the four parameters of supporting path check based on accounting IO
 error such as intermittent error. When a path failed event occurs twice in
@@ -1297,6 +1336,12 @@ section:
 .TP
 .B deferred_remove
 .TP
+.B san_path_err_threshold
+.TP
+.B san_path_err_forget_rate
+.TP
+.B san_path_err_recovery_time
+.TP
 .B marginal_path_err_sample_time
 .TP
 .B marginal_path_err_rate_threshold
@@ -1448,6 +1493,12 @@ section:
 .TP
 .B deferred_remove
 .TP
+.B san_path_err_threshold
+.TP
+.B san_path_err_forget_rate
+.TP
+.B san_path_err_recovery_time
+.TP
 .B marginal_path_err_sample_time
 .TP
 .B marginal_path_err_rate_threshold
@@ -1524,6 +1575,12 @@ the values are taken from the \fIdevices\fR or \fIdefaults\fR sections:
 .TP
 .B deferred_remove
 .TP
+.B san_path_err_threshold
+.TP
+.B san_path_err_forget_rate
+.TP
+.B san_path_err_recovery_time
+.TP
 .B marginal_path_err_sample_time
 .TP
 .B marginal_path_err_rate_threshold
diff --git a/multipathd/main.c b/multipathd/main.c
index 99145293..57bb7143 100644
--- a/multipathd/main.c
+++ b/multipathd/main.c
@@ -1833,6 +1833,84 @@ int update_path_groups(struct multipath *mpp, struct vectors *vecs, int refresh)
 	return 0;
 }

+static int check_path_reinstate_state(struct path * pp) {
+	struct timespec curr_time;
+	if (!((pp->mpp->san_path_err_threshold > 0) &&
+				(pp->mpp->san_path_err_forget_rate > 0) &&
+				(pp->mpp->san_path_err_recovery_time >0))) {
+		return 0;
+	}
+
+	if (pp->disable_reinstate) {
+		/* If we don't know how much time has passed, automatically
+		 * reinstate the path, just to be safe. Also, if there are
+		 * no other usable paths, reinstate the path
+		 */
+		if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0 ||
+				pp->mpp->nr_active == 0) {
+			condlog(2, "%s : reinstating path early", pp->dev);
+			goto reinstate_path;
+		}
+		if ((curr_time.tv_sec - pp->dis_reinstate_time ) > pp->mpp->san_path_err_recovery_time) {
+			condlog(2,"%s : reinstate the path after err recovery time", pp->dev);
+			goto reinstate_path;
+		}
+		return 1;
+	}
+	/* forget errors on a working path */
+	if ((pp->state == PATH_UP || pp->state == PATH_GHOST) &&
+			pp->path_failures > 0) {
+		if (pp->san_path_err_forget_rate > 0){
+			pp->san_path_err_forget_rate--;
+		} else {
+			/* for every san_path_err_forget_rate number of
+			 * successful path checks decrement path_failures by 1
+			 */
+			pp->path_failures--;
+			pp->san_path_err_forget_rate = pp->mpp->san_path_err_forget_rate;
+		}
+		return 0;
+	}
+
+	/* If the path isn't recovering from a failed state, do nothing */
+	if (pp->state != PATH_DOWN && pp->state != PATH_SHAKY &&
+			pp->state != PATH_TIMEOUT)
+		return 0;
+
+	if (pp->path_failures == 0)
+		pp->san_path_err_forget_rate = pp->mpp->san_path_err_forget_rate;
+
+	pp->path_failures++;
+
+	/* if we don't know the currently time, we don't know how long to
+	 * delay the path, so there's no point in checking if we should
+	 */
+
+	if (clock_gettime(CLOCK_MONOTONIC, &curr_time) != 0)
+		return 0;
+	/* when path failures has exceeded the san_path_err_threshold
+	 * place the path in delayed state till san_path_err_recovery_time
+	 * so that the cutomer can rectify the issue within this time. After
+	 * the completion of san_path_err_recovery_time it should
+	 * automatically reinstate the path
+	 */
+	if (pp->path_failures > pp->mpp->san_path_err_threshold) {
+		condlog(2, "%s : hit error threshold. Delaying path reinstatement", pp->dev);
+		pp->dis_reinstate_time = curr_time.tv_sec;
+		pp->disable_reinstate = 1;
+
+		return 1;
+	} else {
+		return 0;
+	}
+
+reinstate_path:
+	pp->path_failures = 0;
+	pp->disable_reinstate = 0;
+	pp->san_path_err_forget_rate = 0;
+	return 0;
+}
+
 /*
  * Returns '1' if the path has been checked, '-1' if it was blacklisted
  * and '0' otherwise
@@ -1980,6 +2058,12 @@ check_path (struct vectors * vecs, struct path * pp, int ticks)
 	if (!pp->mpp)
 		return 0;

+	if ((newstate == PATH_UP || newstate == PATH_GHOST) &&
+			check_path_reinstate_state(pp)) {
+		pp->state = PATH_DELAYED;
+		return 1;
+	}
+
 	if (pp->io_err_disable_reinstate && hit_io_err_recheck_time(pp)) {
 		pp->state = PATH_SHAKY;
 		/*
--
2.19.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"
  2018-12-19 11:32   ` Muneendra Kumar M
@ 2018-12-19 12:02     ` Martin Wilck
  2018-12-20 10:41       ` Muneendra Kumar M
  0 siblings, 1 reply; 36+ messages in thread
From: Martin Wilck @ 2018-12-19 12:02 UTC (permalink / raw)
  To: Muneendra Kumar M, Christophe Varoqui, mwilck+gmail
  Cc: Guan Junxiong, M Muneendra Kumar, dm-devel

On Wed, 2018-12-19 at 17:02 +0530, Muneendra Kumar M wrote:
> Hi Martin,
> In the cover letter, "[PATCH 00/19] san_path_err & multipath ANA
> support",
> you mentioned that san_path_err_XXX has some merits
> over marginal_path_err_XXX.
> 
> Is this understanding correct? If so, could you please explain the
> use case in which san_path_err_XXX is better?
> 
> I would say marginal_path_err_XXX is a superset of san_path_err_XXX.

If you think so, please explain how. Imagine a user who has configured

  san_path_err_threshold     X
  san_path_err_forget_rate   Y
  san_path_err_recovery_time Z

Now this user is supposed to migrate to marginal_path settings.

  marginal_path_double_failed_time   A
  marginal_path_err_sample_time      B
  marginal_path_err_rate_threshold   C 
  marginal_path_err_recheck_gap_time D

Can you provide a formula to calculate A, B, C, D such that the system
behaves the same way as (or "better" than) it did previously with X, Y, Z?

I have pondered this for a while and concluded that I can't.
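
For reference, the way X, Y and Z interact in the reverted code can be modelled roughly as below. This is a simplified sketch of check_path_reinstate_state() from patch 04/19, not the function itself: the struct and function names (model_path, model_conf, model_check) are made up for illustration, the clock is passed in as a plain number of seconds, and the "no other usable path" and clock-failure shortcuts are left out.

```c
#include <stdbool.h>

/* Illustrative per-path state; field meanings follow the patch. */
struct model_path {
	unsigned int path_failures;	/* failures not yet forgotten */
	int forget_countdown;		/* good checks left before forgetting one */
	bool disable_reinstate;		/* path is currently held back */
	long dis_reinstate_time;	/* when the hold-back started (secs) */
};

/* X, Y, Z from the mail above. */
struct model_conf {
	int threshold;		/* san_path_err_threshold */
	int forget_rate;	/* san_path_err_forget_rate */
	int recovery_time;	/* san_path_err_recovery_time */
};

/*
 * Called when the checker reports the path as usable. "was_up" tells
 * whether the path was already up before this check. Returns true if
 * reinstatement has to be delayed.
 */
static bool model_check(struct model_path *p, const struct model_conf *c,
			bool was_up, long now)
{
	if (p->disable_reinstate) {
		if (now - p->dis_reinstate_time > c->recovery_time) {
			/* recovery time is over: start from scratch */
			p->path_failures = 0;
			p->disable_reinstate = false;
			p->forget_countdown = 0;
			return false;
		}
		return true;
	}
	if (was_up) {
		/* forget one failure after forget_rate good checks */
		if (p->path_failures > 0) {
			if (p->forget_countdown > 0) {
				p->forget_countdown--;
			} else {
				p->path_failures--;
				p->forget_countdown = c->forget_rate;
			}
		}
		return false;
	}
	/* the path comes back from a failed state: count one failure */
	if (p->path_failures == 0)
		p->forget_countdown = c->forget_rate;
	p->path_failures++;
	if (p->path_failures > c->threshold) {
		p->dis_reinstate_time = now;
		p->disable_reinstate = true;
		return true;
	}
	return false;
}
```

With threshold 2 and recovery time 100, the third recovery in a row is the one that gets delayed, and the path is released on the first check after more than 100 seconds have passed.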

> If we need both san_path_err_XXX and marginal_path_err_XXX, then so many
> configuration options will really confuse customers.

True, the many different options are confusing. However, I don't think
it becomes much worse by offering both methods. Neither method is
easy to understand by itself. Once users understand that these two
parameter sets are mutually exclusive, I think they can deal with that.

What we really need is easier set-up of either method (think of 2-3
sets of reasonable pre-set parameter values for different scenarios).
I believe most admins are so intimidated by the complexity of the
parameters and their interaction that they give up and use
delay_xx_checks instead, or nothing at all.

Unfortunately, this is all guesswork; we, at least, have no data on
whether users are trying these parameters and, if so, which values they use.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"
  2018-12-19 12:02     ` Martin Wilck
@ 2018-12-20 10:41       ` Muneendra Kumar M
  2018-12-20 21:26         ` Martin Wilck
  0 siblings, 1 reply; 36+ messages in thread
From: Muneendra Kumar M @ 2018-12-20 10:41 UTC (permalink / raw)
  To: Martin Wilck, Christophe Varoqui, mwilck+gmail
  Cc: Guan Junxiong, M Muneendra Kumar, dm-devel

Hi Martin,
I completely agree with you: we cannot derive a direct formula relating
the two parameter sets unless we know the IOPS on a particular path.

The IOPS differ between the two cases during the detection of a shaky
path.
In the marginal_path_XX case the IOPS are fixed, i.e. 100 (at a sample rate of
10 Hz), whereas in the san_path_XX case the IOPS are not fixed (they depend on
the application).

But there are a lot of ways to derive the IOPS on a particular path, and if we
can get that, then we can derive the values as shown below, IMO.

To calculate these, we need to derive the error threshold as a percentage
of IOPS, and the percentage should not be less than 1 (as most of the Brocade
SAN customers are using this configuration).
I.e. san_path_err_threshold and marginal_path_err_rate_threshold need to
be computed as a percentage of the IOPS over a given number of secs (derived
from san_path_err_forget_rate / marginal_path_err_sample_time).

For example, if 1000 IOPS are happening on a particular path, and we take a
percentage factor of 1 and a sample time of 60 secs, the configuration will be
as below:

	san_path_err_threshold     600 (1 percent of 60*1000)
	san_path_err_forget_rate   60
	san_path_err_recovery_time 100

Now this user is supposed to migrate to marginal_path settings.
(IOPS in this case is fixed to 100 during the shaky path detection)
	marginal_path_err_rate_threshold   60 (1 percent of 60*100)
	marginal_path_err_sample_time      60
	marginal_path_err_recheck_gap_time 100



In this case san_path_err_forget_rate should be the same as
marginal_path_err_sample_time, and
san_path_err_recovery_time should be the same as
marginal_path_err_recheck_gap_time.
The only variable factors are san_path_err_threshold and
marginal_path_err_rate_threshold, which change with the number
of errors taken as a percentage of IOPS over a given number of secs.

The only extra parameter in the marginal case is
marginal_path_double_failed_time, which we need to configure for suspecting
a marginal path.
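
The percentage rule in the example above can be written out as a small C sketch. The helper names err_threshold and print_example are hypothetical and not part of multipath-tools; the code only reproduces the arithmetic from this mail (1 percent of IOPS times the sample time) and does not model how multipathd interprets the resulting values.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of multipath-tools): error threshold
 * expressed as a percentage of the I/Os seen during the sample window.
 */
static unsigned long err_threshold(unsigned long iops,
				   unsigned long sample_secs,
				   unsigned long percent)
{
	return iops * sample_secs * percent / 100;
}

/* Print both thresholds for the example discussed above. */
static void print_example(void)
{
	/* san_path_err: IOPS depend on the application, 1000 assumed here */
	printf("san_path_err_threshold           %lu\n",
	       err_threshold(1000, 60, 1));
	/* marginal_path: IOPS fixed at 100 during shaky path detection */
	printf("marginal_path_err_rate_threshold %lu\n",
	       err_threshold(100, 60, 1));
}
```

The forget rate / sample time (60) and the recovery / recheck gap time (100) carry over unchanged, so only the threshold needs to be rescaled with the IOPS.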

As we still see some merits in the san_path_XX approach, as you mentioned
earlier,
and we need both san_path_err_xx and marginal_path_err_xx, I am thinking of
the approach below so that customers can have a common configuration
for both.
In terms of functionality, the following pairs are equivalent:
san_path_err_forget_rate and marginal_path_err_sample_time,
san_path_err_recovery_time and marginal_path_err_recheck_gap_time,
san_path_err_threshold and marginal_path_err_rate_threshold.

So we can use the common configuration names marginal_path_err_XX
for both approaches, and the deciding factor should be
marginal_path_double_failed_time.
If marginal_path_double_failed_time is not defined, go with the san_path_err
approach; otherwise go with the marginal_path_err approach to detect the shaky path.
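
As a rough sketch, the selection rule proposed above could look like the C fragment below. All names (shaky_mode, shaky_conf, select_shaky_mode) are hypothetical and do not exist in multipath-tools. Treating a value <= 0 as "not defined", and switching detection off when any of the three common parameters is unset, are assumptions borrowed from the "> 0" checks in the san_path_err code.

```c
/* Hypothetical model of the proposal; names are illustrative only. */
enum shaky_mode { SHAKY_OFF, SHAKY_SAN_PATH_ERR, SHAKY_MARGINAL_PATH };

struct shaky_conf {
	int double_failed_time;		/* <= 0 means "not defined" */
	int err_sample_time;		/* forget rate / sample time */
	int err_rate_threshold;		/* error threshold */
	int err_recheck_gap_time;	/* recovery / recheck gap time */
};

static enum shaky_mode select_shaky_mode(const struct shaky_conf *c)
{
	/* Assumption: without the three common parameters, do nothing. */
	if (c->err_sample_time <= 0 || c->err_rate_threshold <= 0 ||
	    c->err_recheck_gap_time <= 0)
		return SHAKY_OFF;
	/* double_failed_time decides which algorithm interprets them. */
	return c->double_failed_time > 0 ? SHAKY_MARGINAL_PATH
					 : SHAKY_SAN_PATH_ERR;
}
```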



Regards,
Muneendra.






^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 14/19] multipath-tools: add ANA support for NVMe device
  2018-12-18 23:19 ` [PATCH 14/19] multipath-tools: add ANA support for NVMe device Martin Wilck
@ 2018-12-20 15:17   ` Hannes Reinecke
  2018-12-20 23:45     ` Martin Wilck
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Reinecke @ 2018-12-20 15:17 UTC (permalink / raw)
  To: Martin Wilck, Christophe Varoqui; +Cc: dm-devel, lijie

On 12/19/18 12:19 AM, Martin Wilck wrote:
> From: lijie <lijie34@huawei.com>
> 
> Add support for Asymmetric Namespace Access as specified in NVMe 1.3
> TP 4004. The states are updated through reading the ANA log page.
> 
> By default, the native nvme multipath takes over the nvme device.
> We can pass a false to the parameter 'multipath' of the nvme-core.ko
> module when we want to use multipath-tools.
> 
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>   libmultipath/prio.h                |   1 +
>   libmultipath/prioritizers/Makefile |   1 +
>   libmultipath/prioritizers/ana.c    | 292 +++++++++++++++++++++++++++++
>   libmultipath/prioritizers/ana.h    | 221 ++++++++++++++++++++++
>   multipath/multipath.conf.5         |   8 +
>   5 files changed, 523 insertions(+)
>   create mode 100644 libmultipath/prioritizers/ana.c
>   create mode 100644 libmultipath/prioritizers/ana.h
> 
> diff --git a/libmultipath/prio.h b/libmultipath/prio.h
> index aa587ccd..599d1d88 100644
> --- a/libmultipath/prio.h
> +++ b/libmultipath/prio.h
> @@ -30,6 +30,7 @@ struct path;
>   #define PRIO_WEIGHTED_PATH	"weightedpath"
>   #define PRIO_SYSFS		"sysfs"
>   #define PRIO_PATH_LATENCY	"path_latency"
> +#define PRIO_ANA		"ana"
>   
>   /*
>    * Value used to mark the fact prio was not defined
> diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile
> index ab7bc075..15afaba3 100644
> --- a/libmultipath/prioritizers/Makefile
> +++ b/libmultipath/prioritizers/Makefile
> @@ -19,6 +19,7 @@ LIBS = \
>   	libpriordac.so \
>   	libprioweightedpath.so \
>   	libpriopath_latency.so \
> +	libprioana.so \
>   	libpriosysfs.so
>   
>   all: $(LIBS)
> diff --git a/libmultipath/prioritizers/ana.c b/libmultipath/prioritizers/ana.c
> new file mode 100644
> index 00000000..c5aaa5fb
> --- /dev/null
> +++ b/libmultipath/prioritizers/ana.c
> @@ -0,0 +1,292 @@
> +/*
> + * (C) Copyright HUAWEI Technology Corp. 2017   All Rights Reserved.
> + *
> + * ana.c
> + * Version 1.00
> + *
> + * Tool to make use of a NVMe-feature called  Asymmetric Namespace Access.
> + * It determines the ANA state of a device and prints a priority value to stdout.
> + *
> + * Author(s): Cheng Jike <chengjike.cheng@huawei.com>
> + *            Li Jie <lijie34@huawei.com>
> + *
> + * This file is released under the GPL version 2, or any later version.
> + */
> +#include <stdio.h>
> +#include <sys/ioctl.h>
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <stdbool.h>
> +
> +#include "debug.h"
> +#include "prio.h"
> +#include "structs.h"
> +#include "ana.h"
> +
> +enum {
> +	ANA_PRIO_OPTIMIZED		= 50,
> +	ANA_PRIO_NONOPTIMIZED		= 10,
> +	ANA_PRIO_INACCESSIBLE		= 5,
> +	ANA_PRIO_PERSISTENT_LOSS	= 1,
> +	ANA_PRIO_CHANGE			= 0,
> +	ANA_PRIO_RESERVED		= 0,
> +	ANA_PRIO_GETCTRL_FAILED		= -1,
> +	ANA_PRIO_NOT_SUPPORTED		= -2,
> +	ANA_PRIO_GETANAS_FAILED		= -3,
> +	ANA_PRIO_GETANALOG_FAILED	= -4,
> +	ANA_PRIO_GETNSID_FAILED		= -5,
> +	ANA_PRIO_GETNS_FAILED		= -6,
> +	ANA_PRIO_NO_MEMORY		= -7,
> +	ANA_PRIO_NO_INFORMATION		= -8,
> +};

Please model the priorities according to the ALUA handler; ANA state 
'persistent loss' maps onto ALUA 'unavailable' (and hence should have a 
priority of '0'), and ANA state 'inaccessible' is roughly similar to 
ALUA 'standby', hence should have a priority of '1'.
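
For illustration, a minimal sketch of the mapping suggested above. The
function name is hypothetical. The state codes are the ones defined for ANA
in the NVMe specification, the values 50 and 10 are taken from the patch,
and the value for the "change" state, which the suggestion does not cover,
is simply kept from the patch as well. Unlike a zero-initialized table, a
switch can tell a legitimate priority of 0 from an unknown state:

```c
/* ANA state codes as defined by the NVMe specification */
enum ana_state {
	ANA_OPTIMIZED		= 0x01,
	ANA_NONOPTIMIZED	= 0x02,
	ANA_INACCESSIBLE	= 0x03,
	ANA_PERSISTENT_LOSS	= 0x04,
	ANA_CHANGE		= 0x0f,
};

/*
 * Hypothetical helper: map an ANA state to a priority modeled on the
 * ALUA handler. Returns -1 for states that are not known.
 */
int ana_state_to_prio(int state)
{
	switch (state) {
	case ANA_OPTIMIZED:
		return 50;
	case ANA_NONOPTIMIZED:
		return 10;
	case ANA_INACCESSIBLE:
		return 1;	/* roughly ALUA 'standby' */
	case ANA_PERSISTENT_LOSS:
		return 0;	/* maps onto ALUA 'unavailable' */
	case ANA_CHANGE:
		return 0;	/* assumption: value kept from the patch */
	default:
		return -1;	/* reserved or invalid state */
	}
}
```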

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


* Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"
  2018-12-20 10:41       ` Muneendra Kumar M
@ 2018-12-20 21:26         ` Martin Wilck
  2018-12-21 11:03           ` Muneendra Kumar M
  0 siblings, 1 reply; 36+ messages in thread
From: Martin Wilck @ 2018-12-20 21:26 UTC (permalink / raw)
  To: Muneendra Kumar M, Christophe Varoqui, mwilck
  Cc: Guan Junxiong, M Muneendra Kumar, dm-devel

Hello Muneendra,

On Thu, 2018-12-20 at 16:11 +0530, Muneendra Kumar M wrote:
> Hi Martin,
> I completely agree with you as we cannot derive a direct formula
> behind
> these two unless we don't know the IOPS on a particular path.
>
> As the IOPS in both the cases are different during the detection of
> Shaky
> path.
> In marginal_path_XX case the IOPS are fixed i.e 100 (at a sample rate
> of
> 10HZ) ,Similarly in san_path_xx case the IOPS are not fixed(as it
> depends on
> the application).
>
> But there are lot of ways to derive the IOPS on a particular path if
> we can
> get that then we can derive the values  like below IMO.
>
> And to calculate these we need to derive error threshold as the
> percentage
> of IOPS and the percentage should not be less than 1(as most of the
> Brocade
> SAN customers are using this configuration).
> i.e  san_path_errr_threshold and
> marginal_path_err_rate_threshold   needs to
> be computed as percentage of  IOPS for a given number of secs(derived
> from
> san_path_err_forget_rate/ marginal_path_err_sample_time).

You make me curious - are Brocade customers using our upstream
multipath code? Do you have insights about if, and how, they apply
marginal path checking in multipath-tools, and what parameter values
they are applying?

If yes, it would be very valuable for the community if you could share
some of these insights. So far I'm gathering that you recommend to
consider paths as shaky if they have an error rate of more than 1%.

>
> For example if  1000 IOPS are happening on a particular path and
> making the
> percentage factor as 1 and sample time as 60 secs the configuration
> will be
> as below
>
>       san_path_err_threshold     =600 (1 percentage of 60*1000)
>       san_path_err_forget_rate   =60
>       san_path_err_recovery_time 100

Hm, I understand it differently. In the san_path_err model, if you have
an error rate of 1% and the settings above, IMO you will *never* reach
the threshold. The failure count will increase by one (on average) every
100 ticks, but it will decrease by one every 60 ticks, resulting in a
negative first derivative (more precisely, a stochastic process where the
overall trend goes towards 0, not upwards towards the threshold).

In the san_path_err model, the maximum tolerable failure rate is
basically the reciprocal of the san_path_err_forget_rate parameter. 

The error threshold has a different effect, acting rather as a "delay"
until the algorithm really considers the path shaky. The closer the
failure rate is to the forget rate, the longer it takes. For example, if
you have an error rate of 1/30 (3.3%), the failure count will increase
by one every 60 ticks (1/30-1/60 = 1/60), and it will take 60*600 =
36000 (!) ticks, or 10h at best, until the path is considered shaky.
OTOH, with an error rate of 10%, the threshold is reached in 7200
ticks, and at an error rate of 50%, in about 1200 ticks.
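
The arithmetic above can be sketched as follows. This is a simplified
continuous model of my reading of the algorithm, not the multipathd code:
it assumes the failure count grows by the error rate per tick and shrinks
by 1/forget_rate per tick. The function name is made up for illustration.

```c
#include <assert.h>

/*
 * Expected number of ticks until the failure count reaches "threshold",
 * given an average error rate per tick and a forget rate in ticks.
 * Returns -1.0 if the count never trends upwards, i.e. if the error
 * rate does not exceed 1/forget_rate.
 */
double ticks_until_shaky(double err_rate, double forget_rate,
			 double threshold)
{
	double net_growth;

	assert(forget_rate > 0);
	net_growth = err_rate - 1.0 / forget_rate;
	if (net_growth <= 0)
		return -1.0;
	return threshold / net_growth;
}
```

With threshold 600 and forget rate 60, an error rate of 1% yields -1.0
(never), 1/30 yields 36000 ticks, and 10% yields 7200 ticks.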

For your scenario, I'd use something like

   san_path_err_threshold 4
   san_path_err_forget_rate 100
   san_path_err_recovery_time 100 

At least that's how I understand the algorithm. Am I wrong?

Btw, are you aware that the san_path_err algorithm, at least in the
form that was merged upstream, only counts good->bad transitions?
Especially with high error rates, this is quite different from an
overall error rate (failures / overall I/Os), because several
subsequent failures are only counted as one.
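
To make the difference concrete, here is a small sketch that counts both
quantities over a series of checker results. The function names and the
representation of the results (1 = failed, 0 = good) are assumptions made
for this example, and the path is assumed to be good before the first
sample.

```c
#include <stddef.h>

/* Count every failed check in the series. */
unsigned int count_failures(const int *failed, size_t n)
{
	unsigned int count = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (failed[i])
			count++;
	return count;
}

/* Count only good->bad transitions: a run of failures counts once. */
unsigned int count_transitions(const int *failed, size_t n)
{
	unsigned int count = 0;
	int prev_failed = 0;	/* assumption: path was good before */
	size_t i;

	for (i = 0; i < n; i++) {
		if (failed[i] && !prev_failed)
			count++;
		prev_failed = failed[i];
	}
	return count;
}
```

For the series 0 1 1 1 0 1 0 0 1 1, the first function counts 6 failures,
while the second counts only 3 transitions.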

>
> Now this user is supposed to migrate to marginal_path settings.
> (IOPS in this case is fixed to 100 during the shaky path detection)
>       marginal_path_err_rate_threshold   60 (1 percentage of 60*100)
>       marginal_path_err_sample_time      60
>       marginal_path_err_recheck_gap_time 100
>
>
>
> And in this case  san_path_err_forget_rate  should be same as
> marginal_path_err_sample_time    and
> san_path_err_recovery_time should be same as
> marginal_path_err_recheck_gap_time  .
> only the variable factor is san_path_err_threshold  and
> marginal_path_err_rate_threshold   which keeps changing based on the
> number
> of errors as a percentage of IOPS for a given number of secs.
>
> The only parameter that is extra in marginal case is
> marginal_path_double_failed_time   which we need to configure for
> suspecting
> a marginal path.

I don't think these parameters will have the same behavior as the
san_path_err parameters above; see my argument above.

Note that marginal_path_err_sample_time 60 is invalid (the marginal
path code requires at least 120s), and that the error threshold is
always given as a "permillage" (should be set to 10 for 1%).
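
As a small sketch of these two constraints (the helper names and the macro
are hypothetical; the 120s minimum and the permille unit are the ones
stated above):

```c
/* Minimum sample time in seconds, as stated above (hypothetical name). */
#define MIN_SAMPLE_TIME 120

/*
 * Convert a tolerated error rate in percent to the permille value
 * used for marginal_path_err_rate_threshold.
 */
int percent_to_permille(int percent)
{
	return percent * 10;
}

/* Nonzero if the sample time satisfies the minimum stated above. */
int sample_time_ok(int seconds)
{
	return seconds >= MIN_SAMPLE_TIME;
}
```

So a 1% error rate corresponds to a threshold of 10, and a sample time of
60 fails the check.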

>
> As we still see some merits in the san_path_XX approach as you
> mentioned
> earlier
> and we need both san_path_err_xx and marginal_path_err_xx  I am
> thinking of
> the below approach so that the customers can have the common
> configuration
> for both.
> From the functionality wise san_path_err_forget_rate  ,
> marginal_path_err_sample_time    and
> san_path_err_recovery_time ,marginal_path_err_recheck_gap_time  and
> san_path_err_threshold  , marginal_path_err_rate_threshold are same.
>
> So we can have the common configuration name as marginal_path_err_XX
> (parameters) for both approaches and the deriving factor should be
> marginal_path_double_failed_time   .
> If marginal_path_double_failed_time   is not  defined go with
> san_path_err
> approach else go with marginal_path_err approach to detect the Shaky
> path.

I'm not sure about that. It's important that users are able to
understand the effect that each parameter has. If we use the same
parameter name for different parameters of different algorithms, even
bigger confusion might arise than we have now.
"san_path_err_recovery_time" and "marginal_path_err_recheck_gap_time"
obviously have very similar effects, but for the other parameters I
don't see 1:1 equivalence.

Best regards,
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



* Re: [PATCH 15/19] libmultipath: ANA prioritzer: use nvme wrapper library
  2018-12-18 23:19 ` [PATCH 15/19] libmultipath: ANA prioritzer: use nvme wrapper library Martin Wilck
@ 2018-12-20 22:58   ` Benjamin Marzinski
  0 siblings, 0 replies; 36+ messages in thread
From: Benjamin Marzinski @ 2018-12-20 22:58 UTC (permalink / raw)
  To: Martin Wilck; +Cc: dm-devel, lijie

On Wed, Dec 19, 2018 at 12:19:27AM +0100, Martin Wilck wrote:
> Use the previously introduced NVME wrapper library for
> the passthrough commands from the ANA prioritizer. Discard
> code duplicated from nvme-cli from the ana code itself.
> 
> Furthermore, make additional cleanups in the ANA prioritizer:
> 
>  - don't use the same enum for priorities and error codes
>  - use char* arrays for error messages and state names
>  - return -1 prio to libmultipath for all error cases
>  - check if a device is NVMe before trying ioctl

It's not a big deal, but I do wonder what the point of this is.
Presumably the nvme prioritizer was configured for this device. Also,
the first thing we do is get the identify controller data structure via
an NVMe ioctl. This should fail if the device isn't an NVMe device. Doing
this check does give us more informative error messages, but we don't
do this kind of double-checking for the ALUA prioritizer, for instance.

>  - check for overflow in check_ana_state()
>  - get_ana_info(): improve readability with is_anagrpid_const
>  - priorities: PERSISTENT_LOSS state is worse than INACCESSIBLE
>  and CHANGE
> 
> Cc: lijie <lijie34@huawei.com>
> Signed-off-by: Martin Wilck <mwilck@suse.com>
> ---
>  libmultipath/prioritizers/Makefile |   6 +-
>  libmultipath/prioritizers/ana.c    | 305 ++++++++++-------------------

Is there a reason why this patch doesn't remove ana.h?

Otherwise, this looks good.

-Ben

>  2 files changed, 113 insertions(+), 198 deletions(-)
> 
> diff --git a/libmultipath/prioritizers/Makefile b/libmultipath/prioritizers/Makefile
> index 15afaba3..4d80c20c 100644
> --- a/libmultipath/prioritizers/Makefile
> +++ b/libmultipath/prioritizers/Makefile
> @@ -19,9 +19,13 @@ LIBS = \
>  	libpriordac.so \
>  	libprioweightedpath.so \
>  	libpriopath_latency.so \
> -	libprioana.so \
>  	libpriosysfs.so
>  
> +ifneq ($(call check_file,/usr/include/linux/nvme_ioctl.h),0)
> +	LIBS += libprioana.so
> +	CFLAGS += -I../nvme
> +endif
> +
>  all: $(LIBS)
>  
>  libprioalua.so: alua.o alua_rtpg.o
> diff --git a/libmultipath/prioritizers/ana.c b/libmultipath/prioritizers/ana.c
> index c5aaa5fb..88edb224 100644
> --- a/libmultipath/prioritizers/ana.c
> +++ b/libmultipath/prioritizers/ana.c
> @@ -17,155 +17,91 @@
>  #include <sys/stat.h>
>  #include <sys/types.h>
>  #include <stdbool.h>
> +#include <libudev.h>
>  
>  #include "debug.h"
> +#include "nvme-lib.h"
>  #include "prio.h"
> +#include "util.h"
>  #include "structs.h"
> -#include "ana.h"
>  
>  enum {
> -	ANA_PRIO_OPTIMIZED		= 50,
> -	ANA_PRIO_NONOPTIMIZED		= 10,
> -	ANA_PRIO_INACCESSIBLE		= 5,
> -	ANA_PRIO_PERSISTENT_LOSS	= 1,
> -	ANA_PRIO_CHANGE			= 0,
> -	ANA_PRIO_RESERVED		= 0,
> -	ANA_PRIO_GETCTRL_FAILED		= -1,
> -	ANA_PRIO_NOT_SUPPORTED		= -2,
> -	ANA_PRIO_GETANAS_FAILED		= -3,
> -	ANA_PRIO_GETANALOG_FAILED	= -4,
> -	ANA_PRIO_GETNSID_FAILED		= -5,
> -	ANA_PRIO_GETNS_FAILED		= -6,
> -	ANA_PRIO_NO_MEMORY		= -7,
> -	ANA_PRIO_NO_INFORMATION		= -8,
> +	ANA_ERR_GETCTRL_FAILED		= 1,
> +	ANA_ERR_NOT_NVME,
> +	ANA_ERR_NOT_SUPPORTED,
> +	ANA_ERR_GETANAS_OVERFLOW,
> +	ANA_ERR_GETANAS_NOTFOUND,
> +	ANA_ERR_GETANALOG_FAILED,
> +	ANA_ERR_GETNSID_FAILED,
> +	ANA_ERR_GETNS_FAILED,
> +	ANA_ERR_NO_MEMORY,
> +	ANA_ERR_NO_INFORMATION,
>  };
>  
> -static const char * anas_string[] = {
> +static const char *ana_errmsg[] = {
> +	[ANA_ERR_GETCTRL_FAILED]	= "couldn't get ctrl info",
> +	[ANA_ERR_NOT_NVME]		= "not an NVMe device",
> +	[ANA_ERR_NOT_SUPPORTED]		= "ANA not supported",
> +	[ANA_ERR_GETANAS_OVERFLOW]	= "buffer overflow in ANA log",
> +	[ANA_ERR_GETANAS_NOTFOUND]	= "NSID or ANAGRPID not found",
> +	[ANA_ERR_GETANALOG_FAILED]	= "couldn't get ana log",
> +	[ANA_ERR_GETNSID_FAILED]	= "couldn't get NSID",
> +	[ANA_ERR_GETNS_FAILED]		= "couldn't get namespace info",
> +	[ANA_ERR_NO_MEMORY]		= "out of memory",
> +	[ANA_ERR_NO_INFORMATION]	= "invalid fd",
> +};
> +
> +/* Use the implicit initialization: value 0 is "invalid" */
> +static const int ana_prio [] = {
> +	[NVME_ANA_OPTIMIZED]		= 50,
> +	[NVME_ANA_NONOPTIMIZED]		= 10,
> +	[NVME_ANA_INACCESSIBLE]		=  5,
> +	[NVME_ANA_PERSISTENT_LOSS]	=  1,
> +	[NVME_ANA_CHANGE]		=  5,
> +};
> +
> +static const char *anas_string[] = {
>  	[NVME_ANA_OPTIMIZED]			= "ANA Optimized State",
>  	[NVME_ANA_NONOPTIMIZED]			= "ANA Non-Optimized State",
>  	[NVME_ANA_INACCESSIBLE]			= "ANA Inaccessible State",
>  	[NVME_ANA_PERSISTENT_LOSS]		= "ANA Persistent Loss State",
>  	[NVME_ANA_CHANGE]			= "ANA Change state",
> -	[NVME_ANA_RESERVED]			= "Invalid namespace group state!",
>  };
>  
>  static const char *aas_print_string(int rc)
>  {
>  	rc &= 0xff;
> -
> -	switch(rc) {
> -	case NVME_ANA_OPTIMIZED:
> -	case NVME_ANA_NONOPTIMIZED:
> -	case NVME_ANA_INACCESSIBLE:
> -	case NVME_ANA_PERSISTENT_LOSS:
> -	case NVME_ANA_CHANGE:
> +	if (rc >= 0 && rc < ARRAY_SIZE(anas_string) &&
> +	    anas_string[rc] != NULL)
>  		return anas_string[rc];
> -	default:
> -		return anas_string[NVME_ANA_RESERVED];
> -	}
> -
> -	return anas_string[NVME_ANA_RESERVED];
> -}
> -
> -static int nvme_get_nsid(int fd, unsigned *nsid)
> -{
> -	static struct stat nvme_stat;
> -	int err = fstat(fd, &nvme_stat);
> -	if (err < 0)
> -		return 1;
> -
> -	if (!S_ISBLK(nvme_stat.st_mode)) {
> -		condlog(0, "Error: requesting namespace-id from non-block device\n");
> -		return 1;
> -	}
> -
> -	*nsid = ioctl(fd, NVME_IOCTL_ID);
> -	return 0;
> -}
> -
> -static int nvme_submit_admin_passthru(int fd, struct nvme_passthru_cmd *cmd)
> -{
> -	return ioctl(fd, NVME_IOCTL_ADMIN_CMD, cmd);
> -}
> -
> -int nvme_get_log13(int fd, __u32 nsid, __u8 log_id, __u8 lsp, __u64 lpo,
> -                 __u16 lsi, bool rae, __u32 data_len, void *data)
> -{
> -	struct nvme_admin_cmd cmd = {
> -		.opcode		= nvme_admin_get_log_page,
> -		.nsid		= nsid,
> -		.addr		= (__u64)(uintptr_t) data,
> -		.data_len	= data_len,
> -	};
> -	__u32 numd = (data_len >> 2) - 1;
> -	__u16 numdu = numd >> 16, numdl = numd & 0xffff;
> -
> -	cmd.cdw10 = log_id | (numdl << 16) | (rae ? 1 << 15 : 0);
> -	if (lsp)
> -		cmd.cdw10 |= lsp << 8;
> -
> -	cmd.cdw11 = numdu | (lsi << 16);
> -	cmd.cdw12 = lpo;
> -	cmd.cdw13 = (lpo >> 32);
> -
> -	return nvme_submit_admin_passthru(fd, &cmd);
> -
> -}
> -
> -int nvme_identify13(int fd, __u32 nsid, __u32 cdw10, __u32 cdw11, void *data)
> -{
> -	struct nvme_admin_cmd cmd = {
> -		.opcode		= nvme_admin_identify,
> -		.nsid		= nsid,
> -		.addr		= (__u64)(uintptr_t) data,
> -		.data_len	= NVME_IDENTIFY_DATA_SIZE,
> -		.cdw10		= cdw10,
> -		.cdw11		= cdw11,
> -	};
> -
> -	return nvme_submit_admin_passthru(fd, &cmd);
> -}
> -
> -int nvme_identify(int fd, __u32 nsid, __u32 cdw10, void *data)
> -{
> -	return nvme_identify13(fd, nsid, cdw10, 0, data);
> -}
>  
> -int nvme_identify_ctrl(int fd, void *data)
> -{
> -	return nvme_identify(fd, 0, NVME_ID_CNS_CTRL, data);
> -}
> -
> -int nvme_identify_ns(int fd, __u32 nsid, void *data)
> -{
> -	return nvme_identify(fd, nsid, NVME_ID_CNS_NS, data);
> -}
> -
> -int nvme_ana_log(int fd, void *ana_log, size_t ana_log_len, int rgo)
> -{
> -	__u64 lpo = 0;
> -
> -	return nvme_get_log13(fd, NVME_NSID_ALL, NVME_LOG_ANA, rgo, lpo, 0,
> -			true, ana_log_len, ana_log);
> +	return "invalid ANA state";
>  }
>  
> -static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log)
> +static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log,
> +			 size_t ana_log_len)
>  {
> -	int	rc = ANA_PRIO_GETANAS_FAILED;
>  	void *base = ana_log;
>  	struct nvme_ana_rsp_hdr *hdr = base;
>  	struct nvme_ana_group_desc *ana_desc;
> -	int offset = sizeof(struct nvme_ana_rsp_hdr);
> +	size_t offset = sizeof(struct nvme_ana_rsp_hdr);
>  	__u32 nr_nsids;
>  	size_t nsid_buf_size;
>  	int i, j;
>  
>  	for (i = 0; i < le16_to_cpu(hdr->ngrps); i++) {
>  		ana_desc = base + offset;
> +
> +		offset += sizeof(*ana_desc);
> +		if (offset > ana_log_len)
> +			return -ANA_ERR_GETANAS_OVERFLOW;
> +
>  		nr_nsids = le32_to_cpu(ana_desc->nnsids);
>  		nsid_buf_size = nr_nsids * sizeof(__le32);
>  
> -		offset += sizeof(*ana_desc);
> +		offset += nsid_buf_size;
> +		if (offset > ana_log_len)
> +			return -ANA_ERR_GETANAS_OVERFLOW;
>  
>  		for (j = 0; j < nr_nsids; j++) {
>  			if (nsid == le32_to_cpu(ana_desc->nsids[j]))
> @@ -173,12 +109,10 @@ static int get_ana_state(__u32 nsid, __u32 anagrpid, void *ana_log)
>  		}
>  
>  		if (anagrpid != 0 && anagrpid == le32_to_cpu(ana_desc->grpid))
> -			rc = ana_desc->state;
> +			return ana_desc->state;
>  
> -		offset += nsid_buf_size;
>  	}
> -
> -	return rc;
> +	return -ANA_ERR_GETANAS_NOTFOUND;
>  }
>  
>  int get_ana_info(struct path * pp, unsigned int timeout)
> @@ -189,104 +123,81 @@ int get_ana_info(struct path * pp, unsigned int timeout)
>  	struct nvme_id_ns ns;
>  	void *ana_log;
>  	size_t ana_log_len;
> +	bool is_anagrpid_const;
>  
>  	rc = nvme_identify_ctrl(pp->fd, &ctrl);
> -	if (rc)
> -		return ANA_PRIO_GETCTRL_FAILED;
> +	if (rc < 0) {
> +		log_nvme_errcode(rc, pp->dev, "nvme_identify_ctrl");
> +		return -ANA_ERR_GETCTRL_FAILED;
> +	}
>  
>  	if(!(ctrl.cmic & (1 << 3)))
> -		return ANA_PRIO_NOT_SUPPORTED;
> -
> -	rc = nvme_get_nsid(pp->fd, &nsid);
> -	if (rc)
> -		return ANA_PRIO_GETNSID_FAILED;
> +		return -ANA_ERR_NOT_SUPPORTED;
>  
> -	rc = nvme_identify_ns(pp->fd, nsid, &ns);
> -	if (rc)
> -		return ANA_PRIO_GETNS_FAILED;
> +	nsid = nvme_get_nsid(pp->fd);
> +	if (nsid <= 0) {
> +		log_nvme_errcode(rc, pp->dev, "nvme_get_nsid");
> +		return -ANA_ERR_GETNSID_FAILED;
> +	}
> +	is_anagrpid_const = ctrl.anacap & (1 << 6);
>  
> +	/*
> +	 * Code copied from nvme-cli/nvme.c. We don't need to allocate an
> +	 * [nanagrpid*mnan] array of NSIDs because each NSID can occur at most
> +	 * in one ANA group.
> +	 */
>  	ana_log_len = sizeof(struct nvme_ana_rsp_hdr) +
> -		le32_to_cpu(ctrl.nanagrpid) * sizeof(struct nvme_ana_group_desc);
> -	if (!(ctrl.anacap & (1 << 6)))
> +		le32_to_cpu(ctrl.nanagrpid)
> +		* sizeof(struct nvme_ana_group_desc);
> +
> +	if (is_anagrpid_const) {
> +		rc = nvme_identify_ns(pp->fd, nsid, 0, &ns);
> +		if (rc) {
> +			log_nvme_errcode(rc, pp->dev, "nvme_identify_ns");
> +			return -ANA_ERR_GETNS_FAILED;
> +		}
> +	} else
>  		ana_log_len += le32_to_cpu(ctrl.mnan) * sizeof(__le32);
>  
>  	ana_log = malloc(ana_log_len);
>  	if (!ana_log)
> -		return ANA_PRIO_NO_MEMORY;
> -
> +		return -ANA_ERR_NO_MEMORY;
> +	pthread_cleanup_push(free, ana_log);
>  	rc = nvme_ana_log(pp->fd, ana_log, ana_log_len,
> -		(ctrl.anacap & (1 << 6)) ? NVME_ANA_LOG_RGO : 0);
> +			  is_anagrpid_const ? NVME_ANA_LOG_RGO : 0);
>  	if (rc) {
> -		free(ana_log);
> -		return ANA_PRIO_GETANALOG_FAILED;
> -	}
> -
> -	rc = get_ana_state(nsid, le32_to_cpu(ns.anagrpid), ana_log);
> -	if (rc < 0){
> -		free(ana_log);
> -		return ANA_PRIO_GETANAS_FAILED;
> -	}
> -
> -	free(ana_log);
> -	condlog(3, "%s: ana state = %02x [%s]", pp->dev, rc, aas_print_string(rc));
> -
> +		log_nvme_errcode(rc, pp->dev, "nvme_ana_log");
> +		rc = -ANA_ERR_GETANALOG_FAILED;
> +	} else
> +		rc = get_ana_state(nsid,
> +				   is_anagrpid_const ?
> +				   le32_to_cpu(ns.anagrpid) : 0,
> +				   ana_log, ana_log_len);
> +	pthread_cleanup_pop(1);
> +	if (rc >= 0)
> +		condlog(3, "%s: ana state = %02x [%s]", pp->dev, rc,
> +			aas_print_string(rc));
>  	return rc;
>  }
>  
> -int getprio(struct path * pp, char * args, unsigned int timeout)
> +int getprio(struct path *pp, char *args, unsigned int timeout)
>  {
>  	int rc;
>  
>  	if (pp->fd < 0)
> -		return ANA_PRIO_NO_INFORMATION;
> -
> -	rc = get_ana_info(pp, timeout);
> -	if (rc >= 0) {
> -		rc &= 0x0f;
> -		switch(rc) {
> -		case NVME_ANA_OPTIMIZED:
> -			rc = ANA_PRIO_OPTIMIZED;
> -			break;
> -		case NVME_ANA_NONOPTIMIZED:
> -			rc = ANA_PRIO_NONOPTIMIZED;
> -			break;
> -		case NVME_ANA_INACCESSIBLE:
> -			rc = ANA_PRIO_INACCESSIBLE;
> -			break;
> -		case NVME_ANA_PERSISTENT_LOSS:
> -			rc = ANA_PRIO_PERSISTENT_LOSS;
> -			break;
> -		case NVME_ANA_CHANGE:
> -			rc = ANA_PRIO_CHANGE;
> -			break;
> -		default:
> -			rc = ANA_PRIO_RESERVED;
> -		}
> -	} else {
> -		switch(rc) {
> -		case ANA_PRIO_GETCTRL_FAILED:
> -			condlog(0, "%s: couldn't get ctrl info", pp->dev);
> -			break;
> -		case ANA_PRIO_NOT_SUPPORTED:
> -			condlog(0, "%s: ana not supported", pp->dev);
> -			break;
> -		case ANA_PRIO_GETANAS_FAILED:
> -			condlog(0, "%s: couldn't get ana state", pp->dev);
> -			break;
> -		case ANA_PRIO_GETANALOG_FAILED:
> -			condlog(0, "%s: couldn't get ana log", pp->dev);
> -			break;
> -		case ANA_PRIO_GETNS_FAILED:
> -			condlog(0, "%s: couldn't get namespace", pp->dev);
> -			break;
> -		case ANA_PRIO_GETNSID_FAILED:
> -			condlog(0, "%s: couldn't get namespace id", pp->dev);
> -			break;
> -		case ANA_PRIO_NO_MEMORY:
> -			condlog(0, "%s: couldn't alloc memory", pp->dev);
> -			break;
> -		}
> +		rc = -ANA_ERR_NO_INFORMATION;
> +	else if (udev_device_get_parent_with_subsystem_devtype(pp->udev,
> +							       "nvme", NULL)
> +		 == NULL)
> +		rc = -ANA_ERR_NOT_NVME;
> +	else {
> +		rc = get_ana_info(pp, timeout);
> +		if (rc >= 0 && rc < ARRAY_SIZE(ana_prio) && ana_prio[rc] != 0)
> +			return ana_prio[rc];
>  	}
> -	return rc;
> +	if (rc < 0 && -rc < ARRAY_SIZE(ana_errmsg))
> +		condlog(2, "%s: ANA error: %s", pp->dev, ana_errmsg[-rc]);
> +	else
> +		condlog(1, "%s: invalid ANA rc code %d", pp->dev, rc);
> +	return -1;
>  }
> -
> -- 
> 2.19.2


* Re: [PATCH 00/19] san_path_err & multipath ANA support
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (18 preceding siblings ...)
  2018-12-18 23:19 ` [PATCH 19/19] libmultipath/foreign/nvme: indicate ANA support Martin Wilck
@ 2018-12-20 23:24 ` Benjamin Marzinski
  2018-12-21 16:06 ` Benjamin Marzinski
  20 siblings, 0 replies; 36+ messages in thread
From: Benjamin Marzinski @ 2018-12-20 23:24 UTC (permalink / raw)
  To: Martin Wilck; +Cc: dm-devel

On Wed, Dec 19, 2018 at 12:19:12AM +0100, Martin Wilck wrote:
> Hi Christophe,
> 
> this series consists of 3 parts. The first part improves the documentation on
> the current approaches to "shaky" or "marginal" path detection, and
> re-introduces the previously removed "san_path_err_xy" approach, which has
> been prematurely removed IMO. At the time, I thought that it was superseded by
> the "marginal path" algorithm, but I have my issues with latter (hopefully
> subject of a follow-up series), and I believe the "medium" complexity of the
> san_path_err code actually has its merits. But to be honest, my strongest
> reason to re-add it is that I have to continue to support it in SLES for some
> time to come.
> 
> The second part accumulates a few bug fixes.
> 
> The third part introduces NVMe ANA support to multipath-tools, based on the
> original patch from Li Jie of Huawei (#14). Instead of copy/pasting some code
> from nvme-cli, as Li Jie did, I decided to copy some nvme-cli code unmodified
> to our repo, and create a small wrapper around it.  I took care not increase
> the generated binaries with code we don't need. I added detect_prio on top of
> it, and also added ANA support for the "foreign" code for native NVMe
> multipath. BTW: Instead of applying patch #12, it would probably be possible
> to simply add https://github.com/linux-nvme/nvme-cli as a submodule to multipath-
> tools. I haven't tried that yet.

I'm fine with not going the submodule route.  If this actually causes
maintenance issues it can always be changed later. But that seems
unlikely, since the code that multipath needs from nvme-cli is pretty
well defined by the NVMe spec (or at least it will be, when the ANA part
is officially added to the spec).
 
> One thing to note: in dm-multipath mode, multipathd can now read the ANA
> properties and derive prio values. But it can't react on updates from the
> storage so far, because the kernel doesn't generate events to user space
> if this happens. I haven't decided how to tackle this problem yet. Hints
> and comments are welcome.

Even if users were using native NVMe multipathing, I can see them being
interested in being notified when the ANA state changes, even if it's
just to correlate these changes with errors or performance changes on
their system. So it seems perfectly reasonable for the kernel to
generate a uevent in these instances; however, I have no idea how a
patch along those lines would be received.

At any rate, aside from my questions about patch 15, ACK for the set.

-Ben

> Cheers,
> Martin
> 
> Kyle Mahlkuch (1):
>   libmultipath: Increase SERIAL_SIZE to 128 bytes
> 
> lijie (1):
>   multipath-tools: add ANA support for NVMe device
> 
> Martin Wilck (17):
>   multipath.conf.5: explain "shaky" path detection
>   libmultipath: propsel: don't print undefined values
>   Revert "multipath-tools: discard san_path_err_XXX feature"
>   multipathd: marginal_path overrides san_path_err
>   multipath.conf.5: man page fixes for san_path_err_xy
>   setup_map: wait for pending path checkers to finish
>   libmultipath: add ARRAY_SIZE helper
>   libmultipath: make close_fd() a common helper
>   libmultipath: restore PG prio in update_multipath_strings
>   multipathd: don't check foreign paths every tick
>   libmultipath: add files from nvme-cli for NVMe support
>   libmultipath: add wrapper library for nvme ioctls
>   libmultipath: ANA prioritzer: use nvme wrapper library
>   libmultipath: detect_prio: try ANA for NVMe
>   libmultipath/foreign/nvme: use failover topology
>   libmultipath/foreign/nvme: show ANA state
>   libmultipath/foreign/nvme: indicate ANA support
> 
>  libmultipath/Makefile              |   18 +-
>  libmultipath/config.c              |    3 +
>  libmultipath/config.h              |    9 +
>  libmultipath/configure.c           |   86 +-
>  libmultipath/dict.c                |   39 +
>  libmultipath/foreign/Makefile      |    2 +-
>  libmultipath/foreign/nvme.c        |  180 +++-
>  libmultipath/nvme-lib.c            |   49 +
>  libmultipath/nvme-lib.h            |   39 +
>  libmultipath/nvme/argconfig.h      |   99 ++
>  libmultipath/nvme/json.h           |   87 ++
>  libmultipath/nvme/linux/nvme.h     | 1450 ++++++++++++++++++++++++++++
>  libmultipath/nvme/nvme-ioctl.c     |  869 +++++++++++++++++
>  libmultipath/nvme/nvme-ioctl.h     |  139 +++
>  libmultipath/nvme/nvme.h           |  163 ++++
>  libmultipath/nvme/plugin.h         |   36 +
>  libmultipath/prio.h                |    1 +
>  libmultipath/prioritizers/Makefile |    5 +
>  libmultipath/prioritizers/ana.c    |  201 ++++
>  libmultipath/prioritizers/ana.h    |  221 +++++
>  libmultipath/propsel.c             |  151 ++-
>  libmultipath/propsel.h             |    3 +
>  libmultipath/structs.h             |   30 +-
>  libmultipath/structs_vec.c         |    8 +
>  libmultipath/sysfs.c               |    5 -
>  libmultipath/util.c                |    5 +
>  libmultipath/util.h                |    3 +
>  multipath/main.c                   |    4 -
>  multipath/multipath.conf.5         |  141 ++-
>  multipathd/main.c                  |  105 +-
>  tests/hwtable.c                    |    2 +-
>  31 files changed, 4051 insertions(+), 102 deletions(-)
>  create mode 100644 libmultipath/nvme-lib.c
>  create mode 100644 libmultipath/nvme-lib.h
>  create mode 100644 libmultipath/nvme/argconfig.h
>  create mode 100644 libmultipath/nvme/json.h
>  create mode 100644 libmultipath/nvme/linux/nvme.h
>  create mode 100644 libmultipath/nvme/nvme-ioctl.c
>  create mode 100644 libmultipath/nvme/nvme-ioctl.h
>  create mode 100644 libmultipath/nvme/nvme.h
>  create mode 100644 libmultipath/nvme/plugin.h
>  create mode 100644 libmultipath/prioritizers/ana.c
>  create mode 100644 libmultipath/prioritizers/ana.h
> 
> -- 
> 2.19.2


* Re: [PATCH 14/19] multipath-tools: add ANA support for NVMe device
  2018-12-20 15:17   ` Hannes Reinecke
@ 2018-12-20 23:45     ` Martin Wilck
  0 siblings, 0 replies; 36+ messages in thread
From: Martin Wilck @ 2018-12-20 23:45 UTC (permalink / raw)
  To: Hannes Reinecke, Christophe Varoqui, mwilck+gmail; +Cc: lijie, dm-devel

On Thu, 2018-12-20 at 16:17 +0100, Hannes Reinecke wrote:
> +
> > +enum {
> > +	ANA_PRIO_OPTIMIZED		= 50,
> > +	ANA_PRIO_NONOPTIMIZED		= 10,
> > +	ANA_PRIO_INACCESSIBLE		= 5,
> > +	ANA_PRIO_PERSISTENT_LOSS	= 1,
> > +	ANA_PRIO_CHANGE			= 0,
> > +	ANA_PRIO_RESERVED		= 0,
> > +	ANA_PRIO_GETCTRL_FAILED		= -1,
> > +	ANA_PRIO_NOT_SUPPORTED		= -2,
> > +	ANA_PRIO_GETANAS_FAILED		= -3,
> > +	ANA_PRIO_GETANALOG_FAILED	= -4,
> > +	ANA_PRIO_GETNSID_FAILED		= -5,
> > +	ANA_PRIO_GETNS_FAILED		= -6,
> > +	ANA_PRIO_NO_MEMORY		= -7,
> > +	ANA_PRIO_NO_INFORMATION		= -8,
> > +};
> 
> Please model the priorities according to the ALUA handler; ANA state 
> 'persistent loss' maps onto ALUA 'unavailable' (and hence should have
> a 
> priority of '0'), and ANA state 'inaccessible' is roughly similar to 
> ALUA 'standby', hence should have a priority of '1'.

Will do. But please note that, in contrast to what we discussed off-
list, a priority of "0" has no special meaning. In particular,
pathgroup priority "0" (or negative!) doesn't imply that the PG in
question can't be selected for I/O. The only thing that is "special"
about priority 0 is that multipathd assigns this prio to PGs that have
no working paths. Therefore, a PG to which the prioritizer assigns prio
<= 0 will not be *preferred* over such a zero-path PG.

The only way to keep the kernel from selecting a particular PG is to set
all paths in the PG to failed state, or to remove it altogether.
multipathd could try to set the PG to "disabled" state, but currently
it doesn't, and if it did, it wouldn't have the expected effect,
because "disabled" really just means "bypassed" in device mapper. A
"bypassed" PG will be selected for I/O if no other PG has healthy
paths. (Side note: "bypassed" might actually be a reasonable PG state
to use for a PG consisting only of GHOST paths, but we don't do that
today).

Therefore, I think that it makes sense to add an "ana path checker" to
multipathd, which would detect NVMe paths in states not suitable for
I/O and fail them in device mapper. We don't want device mapper to try
these paths. I'm not quite sure about "inaccessible" state - your
statement above would imply that "inaccessible" shouldn't be failed. 
But the way I read the ANA spec (8.19.4), simply trying I/O through
"inaccessible" ports would be wrong. Rather, the path should be
monitored for a transition to either "optimized" or "non-optimized"
state. That matches the behavior of the kernel native NVMe multipath
driver, which AFAICS never attempts I/O through any paths which aren't
either "optimized" or "non-optimized", and makes no distinction between
"inaccessible" and "persistent loss" states.

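As an illustration only, the mapping discussed above could be sketched in
Python as follows. This is not multipath-tools code: the names ANA_PRIO,
IO_CAPABLE, checker_verdict and path_prio are hypothetical, the priority
values are the ones Hannes suggested (with "change" kept at 0 as in the
quoted enum), and failing every state other than "optimized" and
"non-optimized" is the assumption argued for in the paragraph above.

```python
# Illustrative model of the proposal in this mail, not existing code.
# Priorities follow the ALUA-like scheme suggested by Hannes.
ANA_PRIO = {
    "optimized": 50,
    "non-optimized": 10,
    "inaccessible": 1,      # roughly ALUA "standby"
    "persistent-loss": 0,   # roughly ALUA "unavailable"
    "change": 0,
}

# States through which I/O may be attempted (assumption, see above).
IO_CAPABLE = {"optimized", "non-optimized"}


def checker_verdict(ana_state: str) -> str:
    """Return 'up' if the path may carry I/O, else 'failed'."""
    return "up" if ana_state in IO_CAPABLE else "failed"


def path_prio(ana_state: str) -> int:
    """Priority for a known ANA state; unknown states raise KeyError."""
    return ANA_PRIO[ana_state]
```
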
Cheers,
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)


--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"
  2018-12-20 21:26         ` Martin Wilck
@ 2018-12-21 11:03           ` Muneendra Kumar M
  2018-12-23 10:59             ` Martin Wilck
  0 siblings, 1 reply; 36+ messages in thread
From: Muneendra Kumar M @ 2018-12-21 11:03 UTC (permalink / raw)
  To: Martin Wilck, Christophe Varoqui
  Cc: Guan Junxiong, M Muneendra Kumar, dm-devel

Hi Martin,
The san_path_err_XX feature was added by me and pushed upstream.
This feature was driven by Brocade customer feedback.

The link below gives the history, including a couple of discussions
that took place before we started on this feature.

https://www.redhat.com/archives/dm-devel/2017-January/msg00025.html



Our requirement was simple.
For example, suppose there are two paths on dm-1, say sda and sdb, as below.

 #  multipath -ll
 mpathd (3600110d001ee7f0102050001cc0b6751) dm-1 SANBlaze,VLUN MyLun
 size=8.0M features='0' hwhandler='0' wp=rw
 `-+- policy='round-robin 0' prio=50 status=active
   |- 8:0:1:0  sda 8:48 active ready  running
   `- 9:0:1:0  sdb 8:64 active ready  running

 Suppose I am seeing a lot of errors on sda, due to which the sda path is
fluctuating from failed state to active state and vice versa.

 The requirement was something like this: if sda fails (moves from
active to failed state) more than X times in a duration Y, then I want to
keep sda in failed state for a duration Z.

 And the data should travel only through the sdb path for Z hrs.


 From the configuration point of view

 san_path_err_threshold: The number of times the sda has been moved from
active to failed (from the above example it is X)
 san_path_err_forget_rate: Watch window (within this time frame if the path
failures (sda moving from active to failed ) are more than err threshold
then don't reinstate the path) (from the above example it is Y)
 san_path_err_recovery_time: Place the path in failed state for this
particular time (from the above example it is Z)

 A move from active state to failed state (good to bad) counts as 1.

 This means: if a particular path has failed (moved from active to failed
state) san_path_err_threshold times within a san_path_err_forget_rate
time window, place the path in failed state and do not reinstate it for
san_path_err_recovery_time.


 Coming back to the marginal path implementation: I have rechecked it,
and I completely agree with you that it's difficult to derive a direct
formula relating the two.
The example which I gave doesn't hold good.

Both approaches are mutually exclusive in detecting the marginal/shaky
path.

 In the san_path_err_XX case we are considering the overall number of
failures (san_path_err_threshold), whereas in the marginal case IMO we are
considering the error rate (marginal_path_err_rate_threshold)?
 And you are correct: if we merge the san_path_err_XX and marginal_path_XX
configuration into one set of parameters, this will further confuse the user.

 Since these are different approaches, we need to come up with a way for
the user to choose the algorithm in multipath.conf,

 similar to the multipaths configuration in the .conf file.


 Regards,
 Muneendra

-----Original Message-----
From: Martin Wilck [mailto:mwilck@suse.com]
Sent: Friday, December 21, 2018 2:56 AM
To: Muneendra Kumar M <muneendra.kumar@broadcom.com>; Christophe Varoqui
<christophe.varoqui@opensvc.com>; mwilck@suse.com
Cc: M Muneendra Kumar <mmandala@brocade.com>; Guan Junxiong
<guanjunxiong@huawei.com>; Benjamin Marzinski <bmarzins@redhat.com>;
dm-devel@redhat.com; Hannes Reinecke <hare@suse.de>
Subject: Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX
feature"

Hello Muneendra,

On Thu, 2018-12-20 at 16:11 +0530, Muneendra Kumar M wrote:
> Hi Martin,
> I completely agree with you as we cannot derive a direct formula
> behind these two unless we don't know the IOPS on a particular path.
>
> As the IOPS in both the cases are different during the detection of
> Shaky path.
> In marginal_path_XX case the IOPS are fixed i.e 100 (at a sample rate
> of
> 10HZ) ,Similarly in san_path_xx case the IOPS are not fixed(as it
> depends on the application).
>
> But there are lot of ways to derive the IOPS on a particular path if
> we can get that then we can derive the values  like below IMO.
>
> And to calculate these we need to derive error threshold as the
> percentage of IOPS and the percentage should not be less than 1(as
> most of the Brocade SAN customers are using this configuration).
> i.e  san_path_errr_threshold and
> marginal_path_err_rate_threshold   needs to
> be computed as percentage of  IOPS for a given number of secs(derived
> from san_path_err_forget_rate/ marginal_path_err_sample_time).

You make me curious - are Brocade customers using our upstream multipath
code? Do you have insights about if, and how, they apply marginal path
checking in multipath-tools, and what parameter values they are applying?

If yes, it would be very valuable for the community if you could share some
of these insights. So far I'm gathering that you recommend to consider paths
as shaky if they have an error rate of more than 1%.

>
> For example if  1000 IOPS are happening on a particular path and
> making the percentage factor as 1 and sample time as 60 secs the
> configuration will be as below
>
>       san_path_err_threshold     =600 (1 percentage of 60*1000)
>       san_path_err_forget_rate   =60
>       san_path_err_recovery_time 100

Hm, I understand it differently. In the san_path_err model, if you have an
error rate of 1% and the settings above, IMO you will *never* reach the
threshold. The failure count will increase (on average) once every 100
ticks, but it will decrease once every 60 ticks, resulting in a negative
first derivative (more precisely, a stochastic process where the overall
trend goes towards 0, not upwards towards the threshold).

In the san_path_err model, the maximum tolerable failure rate is basically
the reciprocal of the san_path_err_forget_rate parameter.

The error threshold has a different effect, acting rather as a "delay"
until the algorithm really considers the path shaky. The closer the failure
rate to the forget rate, the longer it takes. For example, if you have an
error rate of 1/30 (3.3%), the failure count will increase by one every 60
ticks (1/30-1/60 = 1/60), and it will take 60*600 =
36000 (!) ticks, or 10h at best, until the path is considered shaky.
OTOH, with an error rate of 10%, the threshold is reached in 7200 ticks, and
at an error rate of 50%, in 1200s.
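
The numbers above can be re-derived with a few lines of Python. This is
only a sketch of the simplified model described here (on average one
increment every ticks_per_failure ticks, one decrement every
san_path_err_forget_rate ticks); ticks_to_threshold is a made-up helper,
not something that exists in multipath-tools.

```python
import math


def ticks_to_threshold(threshold: int, ticks_per_failure: float,
                       forget_rate: int) -> float:
    """Ticks until the failure count reaches `threshold`.

    The count grows by 1/ticks_per_failure per tick and shrinks by
    1/forget_rate per tick; if it does not grow on balance, the
    threshold is never reached.
    """
    net_rate = 1.0 / ticks_per_failure - 1.0 / forget_rate
    if net_rate <= 0:
        return math.inf
    return threshold / net_rate


# san_path_err_threshold 600, san_path_err_forget_rate 60
for ticks_per_failure in (100, 30, 10, 2):
    t = ticks_to_threshold(600, ticks_per_failure, 60)
    print(f"one failure per {ticks_per_failure} ticks: {t:.0f} ticks")
```

The 1% case (one failure per 100 ticks) comes out as infinite, which is
the "never reach the threshold" statement made earlier.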

For your scenario, I'd use something like

   san_path_err_threshold 4
   san_path_err_forget_rate 100
   san_path_err_recovery_time 100

At least that's how I understand the algorithm. Am I wrong?

Btw, are you aware that the san_path_err algorithm, at least in the form
that was merged upstream, only counts good->bad transitions?
Especially with high error rates, this is quite different from an overall
error rate (failures / overall I/Os), because several subsequent failures
are only counted as one.
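
To illustrate the difference, here is a small Python sketch. It assumes
a path checker history given as a string of 'G' (good) and 'B' (bad)
results, one character per check; both helper names are hypothetical.

```python
def count_failed_checks(history: str) -> int:
    """Number of checks that found the path bad."""
    return history.count("B")


def count_good_to_bad(history: str) -> int:
    """Number of good->bad transitions, which is what the san_path_err
    algorithm counts according to the description above."""
    return sum(1 for prev, cur in zip(history, history[1:])
               if prev == "G" and cur == "B")


history = "GGBBBBGBBBG"
print(count_failed_checks(history), count_good_to_bad(history))
```

For this history, seven failed checks collapse into two counted
transitions, so the two measures diverge more the longer each failure
period lasts.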

>
> Now this user is supposed to migrate to marginal_path settings.
> (IOPS in this case is fixed to 100 during the shaky path detection)
>           60 (1 percentage of 60*100)
>       marginal_path_err_sample_time      60
>       marginal_path_err_recheck_gap_time 100
>
>
>
> And in this case  san_path_err_forget_rate  should be same as
> marginal_path_err_sample_time    and
> san_path_err_recovery_time should be same as
> marginal_path_err_recheck_gap_time  .
> only the variable factor is san_path_err_threshold  and
> marginal_path_err_rate_threshold   which keeps changing based on the
> number
> of errors as a percentage of IOPS for a given number of secs.
>
> The only parameter that is extra in marginal case is
> marginal_path_double_failed_time   which we need to configure for
> suspecting
> a marginal path.

I don't think these parameters will have the same behavior as the
san_path_err parameters above; see my argument above.

Note that marginal_path_err_sample_time 60 is invalid (the marginal path
code requires at least 120s), and that the error threshold is always given
as a "permillage" (should be set to 10 for 1%).
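
As a quick sanity check of these two points, a Python sketch. The 120s
minimum and the permillage unit are taken from the paragraph above; the
function names and the constant name are invented for illustration.

```python
MIN_SAMPLE_TIME = 120  # seconds, per the statement above


def percent_to_permille(percent: float) -> int:
    """Convert an error rate in percent to permille."""
    return round(percent * 10)


def sample_time_is_valid(seconds: int) -> bool:
    """True if `seconds` satisfies the minimum stated above."""
    return seconds >= MIN_SAMPLE_TIME
```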

>
> As we still see some merits in the san_path_XX approach as you
> mentioned earlier and we need both san_path_err_xx and
> marginal_path_err_xx  I am thinking of the below approach so that the
> customers can have the common configuration for both.
> From the functionality wise san_path_err_forget_rate  ,
> marginal_path_err_sample_time    and
> san_path_err_recovery_time ,marginal_path_err_recheck_gap_time  and
> san_path_err_threshold  , marginal_path_err_rate_threshold are same.
>
> So we can have the common configuration name as marginal_path_err_XX
> (parameters) for both approaches and the deriving factor should be
> marginal_path_double_failed_time   .
> If marginal_path_double_failed_time   is not  defined go with
> san_path_err
> approach else go with marginal_path_err approach to detect the Shaky
> path.

I'm not sure about that. It's important that users are able to understand
the effect that each parameter has. If we use the same parameter name for
different parameters of different algorithms, even bigger confusion might
arise than we have now.
"san_path_err_recovery_time" and "marginal_path_err_recheck_gap_time"
obviously have very similar effects, but for the other parameters I don't
see 1:1 equivalence.

Best regards,
Martin

--
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107 SUSE Linux
GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton HRB 21284 (AG
Nürnberg)


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/19] san_path_err & multipath ANA support
  2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
                   ` (19 preceding siblings ...)
  2018-12-20 23:24 ` [PATCH 00/19] san_path_err & multipath " Benjamin Marzinski
@ 2018-12-21 16:06 ` Benjamin Marzinski
  2019-01-07 11:21   ` Martin Wilck
  20 siblings, 1 reply; 36+ messages in thread
From: Benjamin Marzinski @ 2018-12-21 16:06 UTC (permalink / raw)
  To: Martin Wilck; +Cc: dm-devel

On Wed, Dec 19, 2018 at 12:19:12AM +0100, Martin Wilck wrote:
> Hi Christophe,
> 
> this series consists of 3 parts. The first part improves the documentation on
> the current approaches to "shaky" or "marginal" path detection, and
> re-introduces the previously removed "san_path_err_xy" approach, which has
> been prematurely removed IMO. At the time, I thought that it was superseded by
> the "marginal path" algorithm, but I have my issues with latter (hopefully
> subject of a follow-up series), and I believe the "medium" complexity of the
> san_path_err code actually has its merits. But to be honest, my strongest
> reason to re-add it is that I have to continue to support it in SLES for some
> time to come.

I've been thinking about how we handle marginal paths, and it seems to
me that instead of telling the kernel that they have failed, it might be
better to create pathgroups of last resort, which contain marginal
paths that should only be used if all the other paths are down.

The downsides to this method are that it is quite possible that it could
double the number of pathgroups whenever you have connection issues,
since a connection issue near the host HBA could cause a marginal path
in each pathgroup. This means more reloading tables, and more confusing
layouts.

The upside to this method is that multipath won't run out of paths while
there are still marginal paths that it could use. When queuing isn't
enabled, there's nothing to stop the kernel from failing IO while
potentially usable marginal paths exist.

On the other hand, this problem could be mitigated by having multipath
work such that, when marginal path detection is configured, it always
makes sure that no_path_retry is at least some minimum value that we
believe is long enough for multipathd to be notified of the path failure
by the kernel and to reinstate the marginal paths.

Any thoughts?

-Ben
 
> The second part accumulates a few bug fixes.
> 
> The third part introduces NVMe ANA support to multipath-tools, based on the
> original patch from Li Jie of Huawei (#14). Instead of copy/pasting some code
> from nvme-cli, as Li Jie did, I decided to copy some nvme-cli code unmodified
> to our repo, and create a small wrapper around it.  I took care not increase
> the generated binaries with code we don't need. I added detect_prio on top of
> it, and also added ANA support for the "foreign" code for native NVMe
> multipath. BTW: Instead of applying patch #12, it would probably be possible
> to simply add https://github.com/linux-nvme/nvme-cli as a submodule to multipath-
> tools. I haven't tried that yet.
> 
> One thing to note: in dm-multipath mode, multipathd can now read the ANA
> properties and derive prio values. But it can't react on updates from the
> storage so far, because the kernel doesn't generate events to user space
> if this happens. I haven't decided how to tackle this problem yet. Hints
> and comments are welcome.
> 
> Cheers,
> Martin
> 
> Kyle Mahlkuch (1):
>   libmultipath: Increase SERIAL_SIZE to 128 bytes
> 
> lijie (1):
>   multipath-tools: add ANA support for NVMe device
> 
> Martin Wilck (17):
>   multipath.conf.5: explain "shaky" path detection
>   libmultipath: propsel: don't print undefined values
>   Revert "multipath-tools: discard san_path_err_XXX feature"
>   multipathd: marginal_path overrides san_path_err
>   multipath.conf.5: man page fixes for san_path_err_xy
>   setup_map: wait for pending path checkers to finish
>   libmultipath: add ARRAY_SIZE helper
>   libmultipath: make close_fd() a common helper
>   libmultipath: restore PG prio in update_multipath_strings
>   multipathd: don't check foreign paths every tick
>   libmultipath: add files from nvme-cli for NVMe support
>   libmultipath: add wrapper library for nvme ioctls
>   libmultipath: ANA prioritzer: use nvme wrapper library
>   libmultipath: detect_prio: try ANA for NVMe
>   libmultipath/foreign/nvme: use failover topology
>   libmultipath/foreign/nvme: show ANA state
>   libmultipath/foreign/nvme: indicate ANA support
> 
>  libmultipath/Makefile              |   18 +-
>  libmultipath/config.c              |    3 +
>  libmultipath/config.h              |    9 +
>  libmultipath/configure.c           |   86 +-
>  libmultipath/dict.c                |   39 +
>  libmultipath/foreign/Makefile      |    2 +-
>  libmultipath/foreign/nvme.c        |  180 +++-
>  libmultipath/nvme-lib.c            |   49 +
>  libmultipath/nvme-lib.h            |   39 +
>  libmultipath/nvme/argconfig.h      |   99 ++
>  libmultipath/nvme/json.h           |   87 ++
>  libmultipath/nvme/linux/nvme.h     | 1450 ++++++++++++++++++++++++++++
>  libmultipath/nvme/nvme-ioctl.c     |  869 +++++++++++++++++
>  libmultipath/nvme/nvme-ioctl.h     |  139 +++
>  libmultipath/nvme/nvme.h           |  163 ++++
>  libmultipath/nvme/plugin.h         |   36 +
>  libmultipath/prio.h                |    1 +
>  libmultipath/prioritizers/Makefile |    5 +
>  libmultipath/prioritizers/ana.c    |  201 ++++
>  libmultipath/prioritizers/ana.h    |  221 +++++
>  libmultipath/propsel.c             |  151 ++-
>  libmultipath/propsel.h             |    3 +
>  libmultipath/structs.h             |   30 +-
>  libmultipath/structs_vec.c         |    8 +
>  libmultipath/sysfs.c               |    5 -
>  libmultipath/util.c                |    5 +
>  libmultipath/util.h                |    3 +
>  multipath/main.c                   |    4 -
>  multipath/multipath.conf.5         |  141 ++-
>  multipathd/main.c                  |  105 +-
>  tests/hwtable.c                    |    2 +-
>  31 files changed, 4051 insertions(+), 102 deletions(-)
>  create mode 100644 libmultipath/nvme-lib.c
>  create mode 100644 libmultipath/nvme-lib.h
>  create mode 100644 libmultipath/nvme/argconfig.h
>  create mode 100644 libmultipath/nvme/json.h
>  create mode 100644 libmultipath/nvme/linux/nvme.h
>  create mode 100644 libmultipath/nvme/nvme-ioctl.c
>  create mode 100644 libmultipath/nvme/nvme-ioctl.h
>  create mode 100644 libmultipath/nvme/nvme.h
>  create mode 100644 libmultipath/nvme/plugin.h
>  create mode 100644 libmultipath/prioritizers/ana.c
>  create mode 100644 libmultipath/prioritizers/ana.h
> 
> -- 
> 2.19.2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"
  2018-12-21 11:03           ` Muneendra Kumar M
@ 2018-12-23 10:59             ` Martin Wilck
  2018-12-28 12:19               ` Muneendra Kumar M
  0 siblings, 1 reply; 36+ messages in thread
From: Martin Wilck @ 2018-12-23 10:59 UTC (permalink / raw)
  To: Muneendra Kumar M, Christophe Varoqui, mwilck+gmail
  Cc: Guan Junxiong, M Muneendra Kumar, dm-devel

Hi Muneendra,

> The san_path_err_XX feature was added by me and pushed to the
> upstream.
> And this feature was driven from Brocade Customer Feedback.
> 
> And the below link will give  the history of this where couple of
> discussions went before we started this feature.
> 
> https://www.redhat.com/archives/dm-devel/2017-January/msg00025.html

I'm aware that you authored the feature. I was not aware of that post
you quoted, thanks for the link. Anyway, you mentioned in that post
that the interested customers were using RHEL, have you made them
upgrade their multipath-tools to recent upstream to use the
san_path_err and/or marginal_path features?

> Our requirement was simple
> For example If there are two paths on a dm-1 say sda and sdb as
> below.
> 
>  #  multipath -ll
>  mpathd (3600110d001ee7f0102050001cc0b6751) dm-1 SANBlaze,VLUN MyLun
>  size=8.0M features='0' hwhandler='0' wp=rw
>  `-+- policy='round-robin 0' prio=50 status=active
>    |- 8:0:1:0  sda 8:48 active ready  running
>    `- 9:0:1:0  sdb 8:64 active ready  running
> 
>  And on sda if iam seeing lot of errors due to which the sda path is
> fluctuating from failed state to active state and vicevera.
> 
>  The  requirement was something like this  if sda is failed(moved
> from
> active to failed state) for more than X
>  times in a Y duration ,then I want to keep the sda in failed state
> for Z
> duration

Thanks for clarifying what you meant with "is failed". I'd been
wondering if it meant "good"->"failed" transitions, as you just
confirmed, or overall "failed" state count.

>  And the data should travel only through sdb path for Z hrs.
> 
> 
>  From the configuration point of view
> 
>  san_path_err_threshold: The number of times the sda has been moved
> from
> active to failed (from the above example it is X)
>  san_path_err_forget_rate: Watch window (within this time frame if
> the path
> failures (sda moving from active to failed ) are more than err
> threshold
> then don't reinstate the path) (from the above example it is Y)

The "watch window" analogy fits if you have a stable path (no or only
very rare failures over extended periods of time) which suddenly starts
fluctuating. More precisely, a "background" failure rate clearly below
"san_path_err_forget_rate", interchanging with problematic periods in
which the failure rate is significantly higher than
"san_path_err_forget_rate". And that is the situation the algorithm
was made for, right?

In general, the "time" (in ticks) to reach the threshold is

  t = T / max(1/R - 1/F, 0)

Where T is san_path_err_threshold, R is the average time (in ticks)
between "good"->"failed" transitions of the path, and F is
san_path_err_forget_rate (aka the time in ticks after which
"path_failures" is decremented by 1).

If R >= F, t is infinite; the "path_failures" count effectively stays
0. If R is much smaller than F, t ~ T * R. If R is only a little bit
smaller than F, t is finite but (possibly much) larger than T * R.
That's why I sloppily called F the "maximum tolerable failure rate" in
my previous post.
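
The formula can be checked against a tick-by-tick simulation. The Python
sketch below uses a simplified, deterministic model (a "good"->"failed"
transition exactly every R ticks, a decrement every F ticks, never below
0). It is not the multipathd code, and both function names are made up.

```python
import math


def formula_ticks(T: int, R: int, F: int) -> float:
    """t = T / max(1/R - 1/F, 0), with t = infinity when R >= F."""
    net = max(1.0 / R - 1.0 / F, 0.0)
    return T / net if net > 0 else math.inf


def simulated_ticks(T: int, R: int, F: int, max_ticks: int = 1_000_000):
    """Tick at which path_failures first reaches T, or None if never."""
    path_failures = 0
    for tick in range(1, max_ticks + 1):
        if tick % R == 0:              # good->failed transition
            path_failures += 1
            if path_failures >= T:
                return tick
        if tick % F == 0 and path_failures > 0:
            path_failures -= 1         # "forget" one failure
    return None
```

With T=600, R=30, F=60 the simulation and the formula agree to within
about 0.2%; the small difference comes from the discrete ordering of
increments and decrements. With R >= F the simulated count never gets
anywhere near T.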

Best regards,
Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature"
  2018-12-23 10:59             ` Martin Wilck
@ 2018-12-28 12:19               ` Muneendra Kumar M
  0 siblings, 0 replies; 36+ messages in thread
From: Muneendra Kumar M @ 2018-12-28 12:19 UTC (permalink / raw)
  To: Martin Wilck, Christophe Varoqui, mwilck+gmail
  Cc: Guan Junxiong, M Muneendra Kumar, dm-devel

Hi Martin,
Please find my replies below.

>Hi Muneedra,

> The san_path_err_XX feature was added by me and pushed to the
> upstream.
> And this feature was driven from Brocade Customer Feedback.
>
> And the below link will give  the history of this where couple of
> discussions went before we started this feature.
>
> https://www.redhat.com/archives/dm-devel/2017-January/msg00025.html

>I'm aware that you authored the feature. I was not aware of that post you
>quoted, thanks for the link. Anyway, you mentioned in that post that the
>interested customers were using RHEL, have you made them upgrade their
>multipath-tools to >recent upstream to use the san_path_err and/or
>marginal_path features?


>>>> I will get back to u with the details.


> Our requirement was simple
> For example If there are two paths on a dm-1 say sda and sdb as below.
>
>  #  multipath -ll
>  mpathd (3600110d001ee7f0102050001cc0b6751) dm-1 SANBlaze,VLUN MyLun
> size=8.0M features='0' hwhandler='0' wp=rw
>  `-+- policy='round-robin 0' prio=50 status=active
>    |- 8:0:1:0  sda 8:48 active ready  running
>    `- 9:0:1:0  sdb 8:64 active ready  running
>
>  And on sda if iam seeing lot of errors due to which the sda path is
> fluctuating from failed state to active state and vicevera.
>
>  The  requirement was something like this  if sda is failed(moved from
> active to failed state) for more than X  times in a Y duration ,then I
> want to keep the sda in failed state for Z duration

>Thanks for clarifying what you meant with "is failed". I'd been wondering
>if it meant "good"->"failed" transitions, as you just confirmed, or overall
>"failed" state count.

>  And the data should travel only through sdb path for Z hrs.
>
>
>  From the configuration point of view
>
>  san_path_err_threshold: The number of times the sda has been moved
> from active to failed (from the above example it is X)
>  san_path_err_forget_rate: Watch window (within this time frame if the
> path failures (sda moving from active to failed ) are more than err
> threshold then don't reinstate the path) (from the above example it is
> Y)

>The "watch window" analogy fits if you have a stable path (no or only very
>rare failures over extended periods of time) which suddenly starts
>fluctuating. More precisely, a "background" failure rate clearly below
>"san_path_err_forget_rate", >interchanging with problematic periods in
>which the failure rate is significantly higher than
>"san_path_err_forget_rate". And that's is the situation the algorithm was
>made for, right?

>In general, the "time" (in ticks) to reach the treshold is

  >t = T / max(1/R - 1/F, 0)

>Where T is san_path_err_threshold, R is the average time (in ticks) between
>"good"->"failed" transitions of the path, and F is san_path_err_forget_rate
>(aka the time in ticks after which "path_failures" is decremented by 1).

>If R >= F, t is infinite; the "path_failures" count effectively stays 0. If
>R is much smaller than F, t ~ T * R. If R is only a little bit smaller than
>F, t is finite but (possibly much) larger than T * R.
>That's why I sloppily called F the "maximum tolerable failure rate" in my
>previous post.

>>>> Yes.

......


Regards,
Muneendra.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/19] san_path_err & multipath ANA support
  2018-12-21 16:06 ` Benjamin Marzinski
@ 2019-01-07 11:21   ` Martin Wilck
  2019-01-07 19:15     ` Benjamin Marzinski
  0 siblings, 1 reply; 36+ messages in thread
From: Martin Wilck @ 2019-01-07 11:21 UTC (permalink / raw)
  To: Benjamin Marzinski, mwilck+gmail; +Cc: dm-devel

On Fri, 2018-12-21 at 10:06 -0600, Benjamin Marzinski wrote:
> 
> I've been thinking about how we handle marginal paths, and it seems
> to
> me that instead of telling the kernel that they have failed, it might
> be
> better to create pathgroups of last resort, which contains marginal
> paths that should only be used if all the other paths are down.

Maybe we should simply assign marginal paths a very low priority? 

At least with "group_by_prio" and immediate failback, that would cause
multipathd to switch to these paths if nothing else is available, and
switch back ASAP - so it would give you the desired behavior almost at
no cost. An open question for me is whether this priority should be
higher or lower than what we assign to "ghost" paths in standby state
(1, currently).
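
A rough Python sketch of what this would mean for grouping. It is a toy
model of "group_by_prio", not the libmultipath implementation;
MARGINAL_PRIO, the path names and the priorities are made-up values for
illustration.

```python
from collections import defaultdict

# Hypothetical value. Whether it should be above or below the "ghost"
# priority (1) is the open question raised above.
MARGINAL_PRIO = 0


def group_by_prio(paths):
    """paths: iterable of (name, prio, is_marginal).

    Returns a list of (prio, [names]) sorted by descending priority;
    marginal paths are demoted to MARGINAL_PRIO before grouping.
    """
    groups = defaultdict(list)
    for name, prio, is_marginal in paths:
        groups[MARGINAL_PRIO if is_marginal else prio].append(name)
    return sorted(groups.items(), key=lambda item: item[0], reverse=True)


paths = [("sda", 50, True), ("sdb", 50, False),
         ("sdc", 10, False), ("sdd", 10, True)]
print(group_by_prio(paths))
```

In this model all marginal paths end up in one lowest-priority group,
regardless of the group they would otherwise have belonged to.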

Side note: the global "failback" policy setting may not fit the needs
of all modern setups. I think that immediate failback is always correct
for "marginal" vs. flawless paths, but we know that it's not always
wanted for non-optimal vs. optimal paths, or other failback scenarios.

> 
> The downsides to this method are that it is quite possible that it
> could
> double the number of pathgroups whenever you have connection issues,
> since a connection issue near the host HBA could cause a marginal
> path
> in each pathgroup. This means more reloading tables, and more
> confusing
> layouts.
> 
> The upside to this method is that multipath won't run out of paths
> while
> their are still marginal paths that it could use. When queuing isn't
> enabled, there's nothing to stop the kernel from failing IO while
> potentially usable marginal paths exist.
> 
> On the other hand, this problem could be mitigated by having
> multipath
> work such that, when marginal path detection is configured, it always
> makes sure that no_path_retry is at least some minimum value that we
> believe is long enough for multipathd to be notified of the path
> failure
> by the kernel and to reinstate the marginal paths.

I'd rather simply document that we discourage "no_path_retry = fail"
while marginal path detection is enabled. "long enough" sounds like a
can of worms to me.

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)



^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/19] san_path_err & multipath ANA support
  2019-01-07 11:21   ` Martin Wilck
@ 2019-01-07 19:15     ` Benjamin Marzinski
  2019-01-08  8:50       ` Martin Wilck
  0 siblings, 1 reply; 36+ messages in thread
From: Benjamin Marzinski @ 2019-01-07 19:15 UTC (permalink / raw)
  To: Martin Wilck; +Cc: mwilck+gmail, dm-devel

On Mon, Jan 07, 2019 at 12:21:55PM +0100, Martin Wilck wrote:
> On Fri, 2018-12-21 at 10:06 -0600, Benjamin Marzinski wrote:
> > 
> > I've been thinking about how we handle marginal paths, and it seems
> > to
> > me that instead of telling the kernel that they have failed, it might
> > be
> > better to create pathgroups of last resort, which contains marginal
> > paths that should only be used if all the other paths are down.
> 
> Maybe we should simply assign marginal paths a very low priority? 

Yeah, that's the idea. The question is whether all the table reloading
and messy configurations that could come with this outweighs the benefit
of having the kernel automatically use these paths when nothing else is
available.
 
> At least with "group_by_prio" and immediate failback, that would cause
> multipathd to switch to these paths if nothing else is available, and
> switch back ASAP - so it would give you the desired behavior almost at
> no cost. An open question for me is whether this priority should be
> higher or lower than what we assign to "ghost" paths ins standby state
> (1, currently).
> 
> Side note: the global "failback" policy setting may not fit the needs
> of all modern setups. I think that immediate failback is always correct
> for "marginal" vs. flawless paths, but we know that it's not always
> wanted for non-optimal vs. optimal paths, or other failback scenarios.

Agreed, but I don't think that there is another failback policy that
makes more sense as the global default.

> > 
> > The downsides to this method are that it is quite possible that it
> > could
> > double the number of pathgroups whenever you have connection issues,
> > since a connection issue near the host HBA could cause a marginal
> > path
> > in each pathgroup. This means more reloading tables, and more
> > confusing
> > layouts.
> > 
> > The upside to this method is that multipath won't run out of paths
> > while
> > their are still marginal paths that it could use. When queuing isn't
> > enabled, there's nothing to stop the kernel from failing IO while
> > potentially usable marginal paths exist.
> > 
> > On the other hand, this problem could be mitigated by having
> > multipath
> > work such that, when marginal path detection is configured, it always
> > makes sure that no_path_retry is at least some minimum value that we
> > believe is long enough for multipathd to be notified of the path
> > failure
> > by the kernel and to reinstate the marginal paths.
> 
> I'd rather simply document that we discourage "no_path_retry = fail"
> while marginall path detection is enabled. "long enough" sounds like a
> can of worms to me.

Sure.

-Ben

> Martin
> 
> -- 
> Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 00/19] san_path_err & multipath ANA support
  2019-01-07 19:15     ` Benjamin Marzinski
@ 2019-01-08  8:50       ` Martin Wilck
  2019-01-08 16:23         ` Benjamin Marzinski
  0 siblings, 1 reply; 36+ messages in thread
From: Martin Wilck @ 2019-01-08  8:50 UTC (permalink / raw)
  To: Benjamin Marzinski; +Cc: dm-devel, mwilck

On Mon, 2019-01-07 at 13:15 -0600, Benjamin Marzinski wrote:
> On Mon, Jan 07, 2019 at 12:21:55PM +0100, Martin Wilck wrote:
> > On Fri, 2018-12-21 at 10:06 -0600, Benjamin Marzinski wrote:
> > > I've been thinking about how we handle marginal paths, and it
> > > seems
> > > to
> > > me that instead of telling the kernel that they have failed, it
> > > might
> > > be
> > > better to create pathgroups of last resort, which contain
> > > marginal
> > > paths that should only be used if all the other paths are down.
> > 
> > Maybe we should simply assign marginal paths a very low priority? 
> 
> Yeah, that's the idea. The question is whether all the table
> reloading
> and messy configurations that could come with this outweighs the
> benefit
> of having the kernel automatically use these paths when nothing else
> is
> available.

I had a similar discussion with Hannes lately about "ghost" states
(ALUA: STANDBY, ANA: INACCESSIBLE), which we currently represent as
"OK" paths with priority = 1. Our current model with "OK" vs. "FAILED"
paths, plus a numeric priority, isn't perfect for representing either
the cost of trespassing, or the temporary, "fuzzy" state of a path
being "marginal".

That aside, we should probably just try the priority-based approach.
Patches welcome :-)

Another question is whether "marginal" state should be a matter of path
_group_ switching at all. We could also model it in the path selector
using rr_weight.

> > At least with "group_by_prio" and immediate failback, that would
> > cause
> > multipathd to switch to these paths if nothing else is available,
> > and
> > switch back ASAP - so it would give you the desired behavior almost
> > at
> > no cost. An open question for me is whether this priority should be
> > higher or lower than what we assign to "ghost" paths in standby
> > state
> > (1, currently).
> > 
> > Side note: the global "failback" policy setting may not fit the
> > needs
> > of all modern setups. I think that immediate failback is always
> > correct
> > for "marginal" vs. flawless paths, but we know that it's not always
> > wanted for non-optimal vs. optimal paths, or other failback
> > scenarios.
> 
> Agreed, but I don't think that there is another failback policy that
> makes more sense as the global default.

I wasn't talking about defaults. We are currently not able to provide a
policy that makes different decisions based on which priority the
current and the best PG have. Our failback model simply doesn't have
this feature. 

Btw it could be added quite simply, like this:

 - we agree on a priority value P_0 in all prioritizers (P_0 = 5, say)
 - whenever the prio of the current PG is below P_0, and another PG is
above P_0, we fail back immediately, no matter what the current
failback setting is.
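
A minimal sketch of that rule, just to make it concrete. All names here
(PRIO_THRESHOLD for P_0, enum failback_mode, should_failback_now()) are
hypothetical and not taken from the multipath-tools source; deferred
failback is assumed to be handled by a timer elsewhere.

```c
#include <stdbool.h>

/* Hypothetical threshold agreed on by all prioritizers (P_0 above). */
#define PRIO_THRESHOLD 5

/* Hypothetical stand-ins for the configured failback setting. */
enum failback_mode { FB_MANUAL, FB_IMMEDIATE, FB_DEFERRED };

/*
 * Return true if we should switch to the best path group right away.
 * cur_pg_prio is the priority of the path group currently in use,
 * best_pg_prio the priority of the best available path group.
 */
bool should_failback_now(int cur_pg_prio, int best_pg_prio,
			 enum failback_mode mode)
{
	/* Override: current PG is below P_0 and a PG above P_0 exists. */
	if (cur_pg_prio < PRIO_THRESHOLD && best_pg_prio > PRIO_THRESHOLD)
		return true;

	/* Otherwise only "immediate" switches without further delay. */
	return mode == FB_IMMEDIATE && best_pg_prio > cur_pg_prio;
}
```

With this, a map sitting on a "ghost" or marginal PG (prio below P_0)
would fail back at once even with manual failback, while a switch from a
non-optimal PG at prio 10 to an optimal one at prio 50 would still obey
the configured setting.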

Martin

-- 
Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)




* Re: [PATCH 00/19] san_path_err & multipath ANA support
  2019-01-08  8:50       ` Martin Wilck
@ 2019-01-08 16:23         ` Benjamin Marzinski
  0 siblings, 0 replies; 36+ messages in thread
From: Benjamin Marzinski @ 2019-01-08 16:23 UTC (permalink / raw)
  To: Martin Wilck; +Cc: dm-devel

On Tue, Jan 08, 2019 at 09:50:33AM +0100, Martin Wilck wrote:
> On Mon, 2019-01-07 at 13:15 -0600, Benjamin Marzinski wrote:
> > On Mon, Jan 07, 2019 at 12:21:55PM +0100, Martin Wilck wrote:
> > > On Fri, 2018-12-21 at 10:06 -0600, Benjamin Marzinski wrote:
> > > > I've been thinking about how we handle marginal paths, and it
> > > > seems
> > > > to
> > > > me that instead of telling the kernel that they have failed, it
> > > > might
> > > > be
> > > > better to create pathgroups of last resort, which contain
> > > > marginal
> > > > paths that should only be used if all the other paths are down.
> > > 
> > > Maybe we should simply assign marginal paths a very low priority? 
> > 
> > Yeah, that's the idea. The question is whether all the table
> > reloading
> > and messy configurations that could come with this outweighs the
> > benefit
> > of having the kernel automatically use these paths when nothing else
> > is
> > available.
> 
> I had a similar discussion with Hannes lately about "ghost" states
> (ALUA: STANDBY, ANA: INACCESSIBLE), which we currently represent as
> "OK" paths with priority = 1. Our current model with "OK" vs. "FAILED"
> paths, plus a numeric priority, isn't perfect for representing either
> the cost of trespassing, or the temporary, "fuzzy" state of a path
> being "marginal".
> 
> That aside, we should probably just try the priority-based approach.
> Patches welcome :-)
> 
> Another question is whether "marginal" state should be a matter of path
> _group_ switching at all. We could also model it in the path selector
> using rr_weight.

I don't think that would work.  Imagine the flakey component being the
storage controller or the connection between the switch and the
controller. Most likely all of the affected paths would be in the same
pathgroup. If we didn't change the priority of those paths and they
happened to be the highest priority paths, then all the paths in the
highest priority pathgroup would be flakey.
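
To illustrate with a toy model (not multipath-tools code): struct
toy_path, MARGINAL_PRIO and the helpers below are made up for this
example, and the value 0 for marginal paths is an arbitrary assumption,
since whether it should sit above or below the "ghost" priority is still
open. If marginal paths are demoted, the best priority across the map
drops to that of the healthy paths, so group_by_prio would stop
preferring the flaky group.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical priority assigned to any path flagged as marginal. */
#define MARGINAL_PRIO 0

/* Toy path: prioritizer result plus a marginal flag. */
struct toy_path {
	int prio;
	bool marginal;
};

/* Priority used for grouping: marginal paths are demoted. */
int effective_prio(const struct toy_path *p)
{
	return p->marginal ? MARGINAL_PRIO : p->prio;
}

/* Highest effective priority among n paths, or -1 if n is 0. */
int best_effective_prio(const struct toy_path *paths, size_t n)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		int e = effective_prio(&paths[i]);

		if (e > best)
			best = e;
	}
	return best;
}
```

With two marginal paths at prio 50 and two healthy paths at prio 10, the
best effective priority is 10, so the healthy group wins; with rr_weight
alone, the prio 50 group would stay on top even though every path in it
is flaky.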

> > > At least with "group_by_prio" and immediate failback, that would
> > > cause
> > > multipathd to switch to these paths if nothing else is available,
> > > and
> > > switch back ASAP - so it would give you the desired behavior almost
> > > at
> > > no cost. An open question for me is whether this priority should be
> > > higher or lower than what we assign to "ghost" paths in standby
> > > state
> > > (1, currently).
> > > 
> > > Side note: the global "failback" policy setting may not fit the
> > > needs
> > > of all modern setups. I think that immediate failback is always
> > > correct
> > > for "marginal" vs. flawless paths, but we know that it's not always
> > > wanted for non-optimal vs. optimal paths, or other failback
> > > scenarios.
> > 
> > Agreed, but I don't think that there is another failback policy that
> > makes more sense as the global default.
> 
> I wasn't talking about defaults. We are currently not able to provide a
> policy that makes different decisions based on which priority the
> current and the best PG have. Our failback model simply doesn't have
> this feature. 
> 
> Btw it could be added quite simply, like this:
> 
>  - we agree on a priority value P_0 in all prioritizers (P_0 = 5, say)
>  - whenever the prio of the current PG is below P_0, and another PG is
> above P_0, we fail back immediately, no matter what the current
> failback setting is.

Ah. Good point.

-Ben

> 
> Martin
> 
> -- 
> Dr. Martin Wilck <mwilck@suse.com>, Tel. +49 (0)911 74053 2107
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton
> HRB 21284 (AG Nürnberg)
> 


end of thread, other threads:[~2019-01-08 16:23 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
2018-12-18 23:19 [PATCH 00/19] san_path_err & multipath ANA support Martin Wilck
2018-12-18 23:19 ` [PATCH 01/19] libmultipath: Increase SERIAL_SIZE to 128 bytes Martin Wilck
2018-12-18 23:19 ` [PATCH 02/19] multipath.conf.5: explain "shaky" path detection Martin Wilck
2018-12-18 23:19 ` [PATCH 03/19] libmultipath: propsel: don't print undefined values Martin Wilck
2018-12-18 23:19 ` [PATCH 04/19] Revert "multipath-tools: discard san_path_err_XXX feature" Martin Wilck
2018-12-19 11:32   ` Muneendra Kumar M
2018-12-19 12:02     ` Martin Wilck
2018-12-20 10:41       ` Muneendra Kumar M
2018-12-20 21:26         ` Martin Wilck
2018-12-21 11:03           ` Muneendra Kumar M
2018-12-23 10:59             ` Martin Wilck
2018-12-28 12:19               ` Muneendra Kumar M
2018-12-18 23:19 ` [PATCH 05/19] multipathd: marginal_path overrides san_path_err Martin Wilck
2018-12-18 23:19 ` [PATCH 06/19] multipath.conf.5: man page fixes for san_path_err_xy Martin Wilck
2018-12-18 23:19 ` [PATCH 07/19] setup_map: wait for pending path checkers to finish Martin Wilck
2018-12-18 23:19 ` [PATCH 08/19] libmultipath: add ARRAY_SIZE helper Martin Wilck
2018-12-18 23:19 ` [PATCH 09/19] libmultipath: make close_fd() a common helper Martin Wilck
2018-12-18 23:19 ` [PATCH 10/19] libmultipath: restore PG prio in update_multipath_strings Martin Wilck
2018-12-18 23:19 ` [PATCH 11/19] multipathd: don't check foreign paths every tick Martin Wilck
2018-12-18 23:19 ` [PATCH 12/19] libmultipath: add files from nvme-cli for NVMe support Martin Wilck
2018-12-18 23:19 ` [PATCH 13/19] libmultipath: add wrapper library for nvme ioctls Martin Wilck
2018-12-18 23:19 ` [PATCH 14/19] multipath-tools: add ANA support for NVMe device Martin Wilck
2018-12-20 15:17   ` Hannes Reinecke
2018-12-20 23:45     ` Martin Wilck
2018-12-18 23:19 ` [PATCH 15/19] libmultipath: ANA prioritzer: use nvme wrapper library Martin Wilck
2018-12-20 22:58   ` Benjamin Marzinski
2018-12-18 23:19 ` [PATCH 16/19] libmultipath: detect_prio: try ANA for NVMe Martin Wilck
2018-12-18 23:19 ` [PATCH 17/19] libmultipath/foreign/nvme: use failover topology Martin Wilck
2018-12-18 23:19 ` [PATCH 18/19] libmultipath/foreign/nvme: show ANA state Martin Wilck
2018-12-18 23:19 ` [PATCH 19/19] libmultipath/foreign/nvme: indicate ANA support Martin Wilck
2018-12-20 23:24 ` [PATCH 00/19] san_path_err & multipath " Benjamin Marzinski
2018-12-21 16:06 ` Benjamin Marzinski
2019-01-07 11:21   ` Martin Wilck
2019-01-07 19:15     ` Benjamin Marzinski
2019-01-08  8:50       ` Martin Wilck
2019-01-08 16:23         ` Benjamin Marzinski
