All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] nvme: freeze IO accesses around format
@ 2017-10-27 16:35 Jens Axboe
  2017-10-27 16:44 ` Keith Busch
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Jens Axboe @ 2017-10-27 16:35 UTC (permalink / raw)


If someone attempts to do IO to a drive while it is under format,
we risk timing out that IO. That potentially leads to the driver
triggering a controller reset, and subsequently the format is ruined and
the device goes away.

Prevents this by freezing IO access to the device around a format.
Without this, the following set of commands can easily make your device
disappear:

parted -s /dev/nvme3n1 mklabel gpt
parted -s /dev/nvme3n1 mkpart primary 0G 100G
parted -s /dev/nvme3n1 rm 1
nvme format /dev/nvme3

since the last partition removal will trigger a udev partition reload,
which happens while the format is running. If the format takes longer
than the normal IO timeout, we start timing it out:

[  456.799438]  nvme3n1:
[  456.833656]  nvme3n1: p1
[  456.842025]  nvme3n1: p1
[  456.887368]  nvme3n1:
[  487.699023] nvme nvme3: I/O 879 QID 12 timeout, aborting
[  518.098840] nvme nvme3: I/O 879 QID 12 timeout, reset controller
[  571.700471] nvme nvme3: Abort status: 0x7
[  571.798306] nvme nvme3: Removing after probe failure status: -22
[  571.811330] nvme3n1: detected capacity change from 4000787030016 to 0
[  571.819189] print_req_error: I/O error, dev nvme3n1, sector 7814036992

and the device is gone, needing a driver reload or reboot to bring it
back. Same thing happens if you just do a dd from the device and then
start a format. Behavior is vendor agnostic, basically just timing
dependent.

Signed-off-by: Jens Axboe <axboe at kernel.dk>

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 5a14cc7f28ee..13d7fda73fbc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1013,10 +1013,23 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	if (cmd.timeout_ms)
 		timeout = msecs_to_jiffies(cmd.timeout_ms);
 
+	/*
+	 * Freeze current access to the device, and prevent new ones, around
+	 * a format operation.
+	 */
+	if (cmd.opcode == nvme_admin_format_nvm) {
+		nvme_start_freeze(ctrl);
+		nvme_wait_freeze(ctrl);
+	}
+
 	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
 			(void __user *)(uintptr_t)cmd.addr, cmd.data_len,
 			(void __user *)(uintptr_t)cmd.metadata, cmd.metadata,
 			0, &cmd.result, timeout);
+
+	if (cmd.opcode == nvme_admin_format_nvm)
+		nvme_unfreeze(ctrl);
+
 	if (status >= 0) {
 		if (put_user(cmd.result, &ucmd->result))
 			return -EFAULT;

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-10-27 16:35 [PATCH] nvme: freeze IO accesses around format Jens Axboe
@ 2017-10-27 16:44 ` Keith Busch
  2017-10-27 23:07 ` Keith Busch
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Keith Busch @ 2017-10-27 16:44 UTC (permalink / raw)


On Fri, Oct 27, 2017@10:35:58AM -0600, Jens Axboe wrote:
> If someone attempts to do IO to a drive while it is under format,
> we risk timing out that IO. That potentially leads to the driver
> triggering a controller reset, and subsequently the format is ruined and
> the device goes away.
> 
> Prevents this by freezing IO access to the device around a format.
> Without this, the following set of commands can easily make your device
> disappear:
> 
> parted -s /dev/nvme3n1 mklabel gpt
> parted -s /dev/nvme3n1 mkpart primary 0G 100G
> parted -s /dev/nvme3n1 rm 1
> nvme format /dev/nvme3
> 
> since the last partition removal will trigger a udev partition reload,
> which happens while the format is running. If the format takes longer
> than the normal IO timeout, we start timing it out:
> 
> [  456.799438]  nvme3n1:
> [  456.833656]  nvme3n1: p1
> [  456.842025]  nvme3n1: p1
> [  456.887368]  nvme3n1:
> [  487.699023] nvme nvme3: I/O 879 QID 12 timeout, aborting
> [  518.098840] nvme nvme3: I/O 879 QID 12 timeout, reset controller
> [  571.700471] nvme nvme3: Abort status: 0x7
> [  571.798306] nvme nvme3: Removing after probe failure status: -22
> [  571.811330] nvme3n1: detected capacity change from 4000787030016 to 0
> [  571.819189] print_req_error: I/O error, dev nvme3n1, sector 7814036992
> 
> and the device is gone, needing a driver reload or reboot to bring it
> back. Same thing happens if you just do a dd from the device and then
> start a format. Behavior is vendor agnostic, basically just timing
> dependent.
> 
> Signed-off-by: Jens Axboe <axboe at kernel.dk>

Looks good.

Reviewed-by: Keith Busch <keith.busch at intel.com>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-10-27 16:35 [PATCH] nvme: freeze IO accesses around format Jens Axboe
  2017-10-27 16:44 ` Keith Busch
@ 2017-10-27 23:07 ` Keith Busch
  2017-10-30 14:29   ` Jens Axboe
  2017-11-01 15:56   ` Christoph Hellwig
  2017-10-28  6:45 ` Christoph Hellwig
  2017-11-01 15:43 ` Christoph Hellwig
  3 siblings, 2 replies; 13+ messages in thread
From: Keith Busch @ 2017-10-27 23:07 UTC (permalink / raw)


On Fri, Oct 27, 2017@10:35:58AM -0600, Jens Axboe wrote:
> @@ -1013,10 +1013,23 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>  	if (cmd.timeout_ms)
>  		timeout = msecs_to_jiffies(cmd.timeout_ms);
>  
> +	/*
> +	 * Freeze current access to the device, and prevent new ones, around
> +	 * a format operation.
> +	 */
> +	if (cmd.opcode == nvme_admin_format_nvm) {
> +		nvme_start_freeze(ctrl);
> +		nvme_wait_freeze(ctrl);
> +	}
> +
>  	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
>  			(void __user *)(uintptr_t)cmd.addr, cmd.data_len,
>  			(void __user *)(uintptr_t)cmd.metadata, cmd.metadata,
>  			0, &cmd.result, timeout);
> +
> +	if (cmd.opcode == nvme_admin_format_nvm)
> +		nvme_unfreeze(ctrl);
> +

Testing some obscure vendor specific IO commands, I noticed this
will deadlock if an IO command opcode happens to be the same as an
nvme_admin_format_nvm command since the IO queues are all frozen.

You can check 'ns == NULL' before checking the opcode.

I also have this patch that's more generic and will perform actions
based on the controller's command effects:

---
commit 760052c4f2d697c19e0f3a78f2f9bc36b2b6b92f
Author: Keith Busch <keith.busch at intel.com>
Date:   Fri Oct 27 10:35:58 2017 -0600

    nvme: Check admin passthru command effects
    
    This patch saves the effects log page if the controller supports it,
    and performs appropriate actions before and after an admin passthrough
    command is completed.
    
    The NVMe standard provides a command effects log page so the host is
    aware of actions it may need to do in response to a particular passthrough
    command. For example, the command may need to run with IO quiesced to
    prevent timeouts or undefined behavior, or it may change the logical
    block formats that alter how the host needs to construct commands.
    
    If the controller does not support the command effects log page, the
    driver will define the effects for known opcodes. The nvme format is
    the only such command in this patch with known effects.
    
    Signed-off-by: Keith Busch <keith.busch at intel.com>

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 37f9039bb9ca..9b352541672d 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -76,6 +76,10 @@ static DEFINE_SPINLOCK(dev_list_lock);
 
 static struct class *nvme_class;
 
+static void nvme_scan(struct nvme_ctrl *ctrl);
+static void nvme_ns_remove(struct nvme_ns *ns);
+static int nvme_revalidate_disk(struct gendisk *disk);
+
 static __le32 nvme_get_log_dw10(u8 lid, size_t size)
 {
 	return cpu_to_le32((((size / 4) - 1) << 16) | lid);
@@ -982,12 +986,68 @@ static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 			metadata, meta_len, io.slba, NULL, 0);
 }
 
+static void nvme_get_admin_effects(struct nvme_ctrl *ctrl, u8 opcode, u32 *effects)
+{
+	if (ctrl->effects)
+		*effects = le32_to_cpu(ctrl->effects->acs[opcode]);
+
+	if (*effects != 0)
+		return;
+
+	/*
+	 * Set known effects for opcodes if the controller doesn't support
+	 * reporting the command's effects.
+	 */
+	switch (opcode) {
+	case nvme_admin_format_nvm:
+		*effects = NVME_CMD_FX_CSUPP | NVME_CMD_FX_LBCC |
+					NVME_CMD_FX_CSE_MASK;
+		break;
+	default:
+		break;
+	}
+}
+
+static void nvme_passthru_start(struct nvme_ctrl *ctrl, u32 effects)
+{
+	if (effects & (NVME_CMD_FX_LBCC | NVME_CMD_FX_CSE_MASK)) {
+		nvme_start_freeze(ctrl);
+		nvme_wait_freeze(ctrl);
+	}
+}
+
+static void nvme_passthru_end(struct nvme_ctrl *ctrl, u32 effects)
+{
+	/*
+	 * Revaildate LBA changes prior to unfreezing. This is necessary to
+	 * prevent memory corruption if a logical block size was changed by
+	 * this command.
+	 */
+	if (effects & NVME_CMD_FX_LBCC) {
+		struct nvme_ns *ns;
+
+		mutex_lock(&ctrl->namespaces_mutex);
+		list_for_each_entry(ns, &ctrl->namespaces, list) {
+			if (ns->disk && nvme_revalidate_disk(ns->disk))
+				nvme_ns_remove(ns);
+		}
+		mutex_unlock(&ctrl->namespaces_mutex);
+	}
+	if (effects & (NVME_CMD_FX_LBCC | NVME_CMD_FX_CSE_MASK))
+		nvme_unfreeze(ctrl);
+	if (effects & NVME_CMD_FX_CCC)
+		nvme_init_identify(ctrl);
+	if (effects & (NVME_CMD_FX_NIC | NVME_CMD_FX_NCC))
+		nvme_scan(ctrl);
+}
+
 static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 			struct nvme_passthru_cmd __user *ucmd)
 {
 	struct nvme_passthru_cmd cmd;
 	struct nvme_command c;
 	unsigned timeout = 0;
+	u32 effects = 0;
 	int status;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -997,6 +1057,9 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	if (cmd.flags)
 		return -EINVAL;
 
+	if (ns != NULL)
+		nvme_get_admin_effects(ctrl, cmd.opcode, &effects);
+
 	memset(&c, 0, sizeof(c));
 	c.common.opcode = cmd.opcode;
 	c.common.flags = cmd.flags;
@@ -1013,10 +1076,13 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
 	if (cmd.timeout_ms)
 		timeout = msecs_to_jiffies(cmd.timeout_ms);
 
+	nvme_passthru_start(ctrl, effects);
 	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
 			(void __user *)(uintptr_t)cmd.addr, cmd.data_len,
 			(void __user *)(uintptr_t)cmd.metadata, cmd.metadata,
 			0, &cmd.result, timeout);
+	nvme_passthru_end(ctrl, effects);
+
 	if (status >= 0) {
 		if (put_user(cmd.result, &ucmd->result))
 			return -EFAULT;
@@ -1762,6 +1828,37 @@ static void nvme_init_subnqn(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
 	memset(ctrl->subnqn + off, 0, sizeof(ctrl->subnqn) - off);
 }
 
+static int nvme_get_log(struct nvme_ctrl *ctrl, u8 log_page, void *log, size_t size)
+{
+	struct nvme_command c = { };
+
+
+	c.common.opcode = nvme_admin_get_log_page;
+	c.common.nsid = cpu_to_le32(NVME_NSID_ALL);
+	c.common.cdw10[0] = nvme_get_log_dw10(log_page, size);
+
+	return nvme_submit_sync_cmd(ctrl->admin_q, &c, log, size);
+}
+
+static int nvme_get_effects_log(struct nvme_ctrl *ctrl)
+{
+	int ret;
+
+	if (!ctrl->effects)
+		ctrl->effects = kzalloc(sizeof(*ctrl->effects), GFP_KERNEL);
+
+	if (!ctrl->effects)
+		return 0;
+
+	ret = nvme_get_log(ctrl, NVME_LOG_CMD_FX, ctrl->effects,
+					sizeof(*ctrl->effects));
+	if (ret) {
+		kfree(ctrl->effects);
+		ctrl->effects = NULL;
+	}
+	return ret;
+}
+
 /*
  * Initialize the cached copies of the Identify data and various controller
  * register in our nvme_ctrl structure.  This should be called as soon as
@@ -1797,6 +1894,12 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 		return -EIO;
 	}
 
+	if (id->lpa & NVME_CTRL_LPA_CMD_FX_LOG) {
+		ret = nvme_get_effects_log(ctrl);
+		if (ret < 0)
+			return ret;
+	}
+
 	nvme_init_subnqn(ctrl, id);
 
 	if (!ctrl->identified) {
@@ -2522,10 +2625,8 @@ static void nvme_scan_ns_sequential(struct nvme_ctrl *ctrl, unsigned nn)
 	nvme_remove_invalid_namespaces(ctrl, nn);
 }
 
-static void nvme_scan_work(struct work_struct *work)
+static void nvme_scan(struct nvme_ctrl *ctrl)
 {
-	struct nvme_ctrl *ctrl =
-		container_of(work, struct nvme_ctrl, scan_work);
 	struct nvme_id_ctrl *id;
 	unsigned nn;
 
@@ -2549,6 +2650,14 @@ static void nvme_scan_work(struct work_struct *work)
 	kfree(id);
 }
 
+static void nvme_scan_work(struct work_struct *work)
+{
+	struct nvme_ctrl *ctrl =
+		container_of(work, struct nvme_ctrl, scan_work);
+
+	nvme_scan(ctrl);
+}
+
 void nvme_queue_scan(struct nvme_ctrl *ctrl)
 {
 	/*
@@ -2615,18 +2724,13 @@ static bool nvme_ctrl_pp_status(struct nvme_ctrl *ctrl)
 
 static void nvme_get_fw_slot_info(struct nvme_ctrl *ctrl)
 {
-	struct nvme_command c = { };
 	struct nvme_fw_slot_info_log *log;
 
 	log = kmalloc(sizeof(*log), GFP_KERNEL);
 	if (!log)
 		return;
 
-	c.common.opcode = nvme_admin_get_log_page;
-	c.common.nsid = cpu_to_le32(NVME_NSID_ALL);
-	c.common.cdw10[0] = nvme_get_log_dw10(NVME_LOG_FW_SLOT, sizeof(*log));
-
-	if (!nvme_submit_sync_cmd(ctrl->admin_q, &c, log, sizeof(*log)))
+	if (nvme_get_log(ctrl, NVME_LOG_FW_SLOT, log, sizeof(*log)))
 		dev_warn(ctrl->device,
 				"Get FW SLOT INFO log error\n");
 	kfree(log);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index d3f3c4447515..69dd8d2cf05c 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -172,6 +172,7 @@ struct nvme_ctrl {
 	bool subsystem;
 	unsigned long quirks;
 	struct nvme_id_power_state psd[32];
+	struct nvme_effects_log *effects;
 	struct work_struct scan_work;
 	struct work_struct async_event_work;
 	struct delayed_work ka_work;
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 9310ce77d8e1..cad7bde44f8e 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -267,6 +267,7 @@ enum {
 	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
 	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
 	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 8,
+	NVME_CTRL_LPA_CMD_FX_LOG		= 1 << 1,
 };
 
 struct nvme_lbaf {
@@ -396,6 +397,22 @@ struct nvme_fw_slot_info_log {
 };
 
 enum {
+	NVME_CMD_FX_CSUPP	= 1 << 0,
+	NVME_CMD_FX_LBCC	= 1 << 1,
+	NVME_CMD_FX_NCC		= 1 << 2,
+	NVME_CMD_FX_NIC		= 1 << 3,
+	NVME_CMD_FX_CCC		= 1 << 4,
+	NVME_CMD_FX_CSE_MASK	= 3 << 16,
+};
+
+struct nvme_effects_log {
+	__le32 acs[256];
+	__le32 iocs[256];
+	__u8   resv[2048];
+};
+
+
+enum {
 	NVME_SMART_CRIT_SPARE		= 1 << 0,
 	NVME_SMART_CRIT_TEMPERATURE	= 1 << 1,
 	NVME_SMART_CRIT_RELIABILITY	= 1 << 2,
@@ -712,6 +729,7 @@ enum {
 	NVME_LOG_ERROR		= 0x01,
 	NVME_LOG_SMART		= 0x02,
 	NVME_LOG_FW_SLOT	= 0x03,
+	NVME_LOG_CMD_FX		= 0x05,
 	NVME_LOG_DISC		= 0x70,
 	NVME_LOG_RESERVATION	= 0x80,
 	NVME_FWACT_REPL		= (0 << 3),
--

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-10-27 16:35 [PATCH] nvme: freeze IO accesses around format Jens Axboe
  2017-10-27 16:44 ` Keith Busch
  2017-10-27 23:07 ` Keith Busch
@ 2017-10-28  6:45 ` Christoph Hellwig
  2017-10-30 14:30   ` Jens Axboe
  2017-11-01 15:43 ` Christoph Hellwig
  3 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-10-28  6:45 UTC (permalink / raw)


And what about sanitize?  I guess instead of looking at opcodes directly
we'd better look at the indications in the Commands Supported and Effects
log.  That's what the Windows driver does, btw.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-10-27 23:07 ` Keith Busch
@ 2017-10-30 14:29   ` Jens Axboe
  2017-11-01 15:56   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2017-10-30 14:29 UTC (permalink / raw)


On 10/27/2017 05:07 PM, Keith Busch wrote:
> On Fri, Oct 27, 2017@10:35:58AM -0600, Jens Axboe wrote:
>> @@ -1013,10 +1013,23 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
>>  	if (cmd.timeout_ms)
>>  		timeout = msecs_to_jiffies(cmd.timeout_ms);
>>  
>> +	/*
>> +	 * Freeze current access to the device, and prevent new ones, around
>> +	 * a format operation.
>> +	 */
>> +	if (cmd.opcode == nvme_admin_format_nvm) {
>> +		nvme_start_freeze(ctrl);
>> +		nvme_wait_freeze(ctrl);
>> +	}
>> +
>>  	status = nvme_submit_user_cmd(ns ? ns->queue : ctrl->admin_q, &c,
>>  			(void __user *)(uintptr_t)cmd.addr, cmd.data_len,
>>  			(void __user *)(uintptr_t)cmd.metadata, cmd.metadata,
>>  			0, &cmd.result, timeout);
>> +
>> +	if (cmd.opcode == nvme_admin_format_nvm)
>> +		nvme_unfreeze(ctrl);
>> +
> 
> Testing some obscure vendor specific IO commands, I noticed this
> will deadlock if an IO command opcode happens to be the same as an
> nvme_admin_format_nvm command since the IO queues are all frozen.
> 
> You can check 'ns == NULL' before checking the opcode.

Yeah good point, it should have checked for this being the admin
queue target.

> I also have this patch that's more generic and will perform actions
> based on the controller's command effects:

Looks fine to me, one minor comment below.


> +static void nvme_get_admin_effects(struct nvme_ctrl *ctrl, u8 opcode, u32 *effects)
> +{
> +	if (ctrl->effects)
> +		*effects = le32_to_cpu(ctrl->effects->acs[opcode]);
> +
> +	if (*effects != 0)
> +		return;
> +
> +	/*
> +	 * Set known effects for opcodes if the controller doesn't support
> +	 * reporting the command's effects.
> +	 */
> +	switch (opcode) {
> +	case nvme_admin_format_nvm:
> +		*effects = NVME_CMD_FX_CSUPP | NVME_CMD_FX_LBCC |
> +					NVME_CMD_FX_CSE_MASK;
> +		break;
> +	default:
> +		break;
> +	}
> +}

Returning the value in a passed in pointer is pretty ugly, I only
generally think it's acceptable if you have to pass back some error as
well. And it's still nasty in that case. How about just returning an u32
from this function?


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-10-28  6:45 ` Christoph Hellwig
@ 2017-10-30 14:30   ` Jens Axboe
  2017-11-01 15:42     ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2017-10-30 14:30 UTC (permalink / raw)


On 10/28/2017 12:45 AM, Christoph Hellwig wrote:
> And what about sanitize?  I guess instead of looking at opcodes directly
> we'd better look at the indications in the Commands Supported and Effects
> log.  That's what the Windows driver does, btw.

Isn't that exactly what Keith's patch did?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-10-30 14:30   ` Jens Axboe
@ 2017-11-01 15:42     ` Christoph Hellwig
  2017-11-01 15:49       ` Keith Busch
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-11-01 15:42 UTC (permalink / raw)


On Mon, Oct 30, 2017@08:30:04AM -0600, Jens Axboe wrote:
> On 10/28/2017 12:45 AM, Christoph Hellwig wrote:
> > And what about sanitize?  I guess instead of looking at opcodes directly
> > we'd better look at the indications in the Commands Supported and Effects
> > log.  That's what the Windows driver does, btw.
> 
> Isn't that exactly what Keith's patch did?

Ooops I had missed his patch was a completely different approach.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-10-27 16:35 [PATCH] nvme: freeze IO accesses around format Jens Axboe
                   ` (2 preceding siblings ...)
  2017-10-28  6:45 ` Christoph Hellwig
@ 2017-11-01 15:43 ` Christoph Hellwig
  2017-11-01 15:45   ` Jens Axboe
  3 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-11-01 15:43 UTC (permalink / raw)


On Fri, Oct 27, 2017@10:35:58AM -0600, Jens Axboe wrote:
> Prevents this by freezing IO access to the device around a format.
> Without this, the following set of commands can easily make your device
> disappear:
> 
> parted -s /dev/nvme3n1 mklabel gpt
> parted -s /dev/nvme3n1 mkpart primary 0G 100G
> parted -s /dev/nvme3n1 rm 1
> nvme format /dev/nvme3

Can you wire this test case up for blktests?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-11-01 15:43 ` Christoph Hellwig
@ 2017-11-01 15:45   ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2017-11-01 15:45 UTC (permalink / raw)


On 11/01/2017 09:43 AM, Christoph Hellwig wrote:
> On Fri, Oct 27, 2017@10:35:58AM -0600, Jens Axboe wrote:
>> Prevents this by freezing IO access to the device around a format.
>> Without this, the following set of commands can easily make your device
>> disappear:
>>
>> parted -s /dev/nvme3n1 mklabel gpt
>> parted -s /dev/nvme3n1 mkpart primary 0G 100G
>> parted -s /dev/nvme3n1 rm 1
>> nvme format /dev/nvme3
> 
> Can you wire this test case up for blktests?

Yeah that's a good idea, I'll do that.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-11-01 15:42     ` Christoph Hellwig
@ 2017-11-01 15:49       ` Keith Busch
  0 siblings, 0 replies; 13+ messages in thread
From: Keith Busch @ 2017-11-01 15:49 UTC (permalink / raw)


On Wed, Nov 01, 2017@08:42:32AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 30, 2017@08:30:04AM -0600, Jens Axboe wrote:
> > On 10/28/2017 12:45 AM, Christoph Hellwig wrote:
> > > And what about sanitize?  I guess instead of looking at opcodes directly
> > > we'd better look at the indications in the Commands Supported and Effects
> > > log.  That's what the Windows driver does, btw.
> > 
> > Isn't that exactly what Keith's patch did?
> 
> Ooops I had missed his patch was a completely different approach.

I'll send it out in a new thread. I have fix the command effects log
page memory leak, anyway.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-10-27 23:07 ` Keith Busch
  2017-10-30 14:29   ` Jens Axboe
@ 2017-11-01 15:56   ` Christoph Hellwig
  2017-11-01 16:21     ` Keith Busch
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2017-11-01 15:56 UTC (permalink / raw)


>     If the controller does not support the command effects log page, the
>     driver will define the effects for known opcodes. The nvme format is
>     the only such command in this patch with known effects.

Sanitize is another one.  Also Format might either affect a single
namespace or the whole controller depending on what it advertises in
FNA.

> +static void nvme_get_admin_effects(struct nvme_ctrl *ctrl, u8 opcode, u32 *effects)

Overly long line.

> +{
> +	if (ctrl->effects)
> +		*effects = le32_to_cpu(ctrl->effects->acs[opcode]);
> +
> +	if (*effects != 0)
> +		return;
> +
> +	/*
> +	 * Set known effects for opcodes if the controller doesn't support
> +	 * reporting the command's effects.
> +	 */
> +	switch (opcode) {
> +	case nvme_admin_format_nvm:
> +		*effects = NVME_CMD_FX_CSUPP | NVME_CMD_FX_LBCC |
> +					NVME_CMD_FX_CSE_MASK;
> +		break;
> +	default:
> +		break;
> +	}
> +}

Hmm.  I'd expect it to be something like:

static u32 nvme_get_admin_effects((struct nvme_ctrl *ctrl, u8 opcode)
{
	if (ctrl->effects)
		return le32_to_cpu(ctrl->effects->acs[opcode]);

	switch (opcode) {
		...
	}
}

That is: a) return the value instead of pass by reference, and
b) if the controller supports the log page rely on it.

Also don't we need to also handle this for I/O commands?  While
non of the currently defined I/O commands would need anything, the
spec defines the mechanism? and it might be useful for vendor
specific commands.

> +static void nvme_passthru_start(struct nvme_ctrl *ctrl, u32 effects)
> +{
> +	if (effects & (NVME_CMD_FX_LBCC | NVME_CMD_FX_CSE_MASK)) {
> +		nvme_start_freeze(ctrl);
> +		nvme_wait_freeze(ctrl);
> +	}
> +}

I'd move the nvme_get_admin_effects call into this, and return the
value from this function to keep the caller a little more uncluttered.

> +	if (effects & NVME_CMD_FX_LBCC) {
> +		struct nvme_ns *ns;
> +
> +		mutex_lock(&ctrl->namespaces_mutex);
> +		list_for_each_entry(ns, &ctrl->namespaces, list) {
> +			if (ns->disk && nvme_revalidate_disk(ns->disk))
> +				nvme_ns_remove(ns);
> +		}
> +		mutex_unlock(&ctrl->namespaces_mutex);
> +	}

Split the code above into a helper?

> +static int nvme_get_log(struct nvme_ctrl *ctrl, u8 log_page, void *log, size_t size)

To loong line again.  Also I think adding this helper should probably
be a preparation patch.

> +{
> +	struct nvme_command c = { };
> +
> +
> +	c.common.opcode = nvme_admin_get_log_page;

Double empty line.

> +static void nvme_scan(struct nvme_ctrl *ctrl)
>  {
> -	struct nvme_ctrl *ctrl =
> -		container_of(work, struct nvme_ctrl, scan_work);
>  	struct nvme_id_ctrl *id;
>  	unsigned nn;
>  
> @@ -2549,6 +2650,14 @@ static void nvme_scan_work(struct work_struct *work)
>  	kfree(id);
>  }
>  
> +static void nvme_scan_work(struct work_struct *work)
> +{
> +	struct nvme_ctrl *ctrl =
> +		container_of(work, struct nvme_ctrl, scan_work);
> +
> +	nvme_scan(ctrl);
> +}

Why do we do the scan inline here, but not in any other place?

> @@ -267,6 +267,7 @@ enum {
>  	NVME_CTRL_OACS_SEC_SUPP                 = 1 << 0,
>  	NVME_CTRL_OACS_DIRECTIVES		= 1 << 5,
>  	NVME_CTRL_OACS_DBBUF_SUPP		= 1 << 8,
> +	NVME_CTRL_LPA_CMD_FX_LOG		= 1 << 1,

I'd just use the bit directly, given that it doesn't have a name
in the spec either.

>  enum {
> +	NVME_CMD_FX_CSUPP	= 1 << 0,
> +	NVME_CMD_FX_LBCC	= 1 << 1,
> +	NVME_CMD_FX_NCC		= 1 << 2,
> +	NVME_CMD_FX_NIC		= 1 << 3,
> +	NVME_CMD_FX_CCC		= 1 << 4,
> +	NVME_CMD_FX_CSE_MASK	= 3 << 16,

s/NVME_CMD_FX/NVME_CMD_EFFECTS/g ?

> +	NVME_LOG_CMD_FX		= 0x05,

same here.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-11-01 16:21     ` Keith Busch
@ 2017-11-01 16:21       ` Christoph Hellwig
  0 siblings, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2017-11-01 16:21 UTC (permalink / raw)


On Wed, Nov 01, 2017@10:21:47AM -0600, Keith Busch wrote:
> 
> Okay, we can freeze single nvme namespace's request queues if desired.
> It's only a little more code to do that.

Not sure we need to bother doing it now, but a comment that we are
lazy might be useful.

> > Also don't we need to also handle this for I/O commands?  While
> > non of the currently defined I/O commands would need anything, the
> > spec defines the mechanism? and it might be useful for vendor
> > specific commands
> 
> That gets tricky. What if the IO command effects says it needs to run
> exclusively? We don't have a way to quiesce IO queues and then issue our
> exclusive IO through the frozen queue. We'd deadlock trying to allocate
> a request from it.
> 
> If it's okay, I'd like to handle the IO command effects separately for
> future work.

True.  Maybe we should warn if we have any bit set for I/O commands?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH] nvme: freeze IO accesses around format
  2017-11-01 15:56   ` Christoph Hellwig
@ 2017-11-01 16:21     ` Keith Busch
  2017-11-01 16:21       ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Keith Busch @ 2017-11-01 16:21 UTC (permalink / raw)


On Wed, Nov 01, 2017@08:56:03AM -0700, Christoph Hellwig wrote:
> >     If the controller does not support the command effects log page, the
> >     driver will define the effects for known opcodes. The nvme format is
> >     the only such command in this patch with known effects.
> 
> Sanitize is another one.  Also Format might either affect a single
> namespace or the whole controller depending on what it advertises in
> FNA.

Okay, we can freeze single nvme namespace's request queues if desired.
It's only a little more code to do that.
 
> static u32 nvme_get_admin_effects((struct nvme_ctrl *ctrl, u8 opcode)
> {
> 	if (ctrl->effects)
> 		return le32_to_cpu(ctrl->effects->acs[opcode]);
> 
> 	switch (opcode) {
> 		...
> 	}
> }
> 
> That is: a) return the value instead of pass by reference, and
> b) if the controller supports the log page rely on it.

Agreed, that's better.
 
> Also don't we need to also handle this for I/O commands?  While
> non of the currently defined I/O commands would need anything, the
> spec defines the mechanism? and it might be useful for vendor
> specific commands

That gets tricky. What if the IO command effects says it needs to run
exclusively? We don't have a way to quiesce IO queues and then issue our
exclusive IO through the frozen queue. We'd deadlock trying to allocate
a request from it.

If it's okay, I'd like to handle the IO command effects separately for
future work.

> > +static void nvme_passthru_start(struct nvme_ctrl *ctrl, u32 effects)
> > +{
> > +	if (effects & (NVME_CMD_FX_LBCC | NVME_CMD_FX_CSE_MASK)) {
> > +		nvme_start_freeze(ctrl);
> > +		nvme_wait_freeze(ctrl);
> > +	}
> > +}
> 
> I'd move the nvme_get_admin_effects call into this, and return the
> value from this function to keep the caller a little more uncluttered.

Sounds good.
 
> > +static int nvme_get_log(struct nvme_ctrl *ctrl, u8 log_page, void *log, size_t size)
> 
> To loong line again.  Also I think adding this helper should probably
> be a preparation patch.

Will do.
 
> > +static void nvme_scan_work(struct work_struct *work)
> > +{
> > +	struct nvme_ctrl *ctrl =
> > +		container_of(work, struct nvme_ctrl, scan_work);
> > +
> > +	nvme_scan(ctrl);
> > +}
> 
> Why do we do the scan inline here, but not in any other place?

Huh, not sure. I wanted the driver to react to the command's effects
before returning from the ioctl. I don't know why I thought that was
important, though. Will queue it as normal.

> >  enum {
> > +	NVME_CMD_FX_CSUPP	= 1 << 0,
> > +	NVME_CMD_FX_LBCC	= 1 << 1,
> > +	NVME_CMD_FX_NCC		= 1 << 2,
> > +	NVME_CMD_FX_NIC		= 1 << 3,
> > +	NVME_CMD_FX_CCC		= 1 << 4,
> > +	NVME_CMD_FX_CSE_MASK	= 3 << 16,
> 
> s/NVME_CMD_FX/NVME_CMD_EFFECTS/g ?

Just trying to shorten the name with an abbreviation. Will spell it out
in the next patch.

Thanks for the feedback.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-11-01 16:21 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-27 16:35 [PATCH] nvme: freeze IO accesses around format Jens Axboe
2017-10-27 16:44 ` Keith Busch
2017-10-27 23:07 ` Keith Busch
2017-10-30 14:29   ` Jens Axboe
2017-11-01 15:56   ` Christoph Hellwig
2017-11-01 16:21     ` Keith Busch
2017-11-01 16:21       ` Christoph Hellwig
2017-10-28  6:45 ` Christoph Hellwig
2017-10-30 14:30   ` Jens Axboe
2017-11-01 15:42     ` Christoph Hellwig
2017-11-01 15:49       ` Keith Busch
2017-11-01 15:43 ` Christoph Hellwig
2017-11-01 15:45   ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.