All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH mdadm v2 0/2] Discard Option for Creating Arrays
@ 2022-09-08 23:08 Logan Gunthorpe
  2022-09-08 23:08 ` [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Logan Gunthorpe @ 2022-09-08 23:08 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe

Hi,

This patchset adds the --discard option for creating new arrays in mdadm.

When specified, mdadm will send block discard (aka. trim or deallocate)
requests to all of the specified block devices. It will then read back
parts of the device to double check that the disks are now all zeros. If
they are all zero, the array is in a known state and does not need to
generate the parity seeing everything is zero and correct. If the devices
do not support discard, or do not result in zero data on each disk, an
error will be returned and the array will not be created.

If all disks get successfully discarded and appear zeroed, then the new
array will not need to be synchronized. The create operation will then
proceed as if --assume-clean was specified.

This provides a safe way and fast way to create an array that does not
need to be synchronized with devices that support discard requests.

Another option for this work is to use a write zero request. This can
be done in linux currently with fallocate and the FALLOC_FL_PUNCH_HOLE
| FALLOC_FL_KEEP_SIZE flags. This will send optimized write-zero requests
to the devices, without falling back to regular writes to zero the disk.
The benefit of this is that the disk will explicitly read back as zeros,
so a zero check is not necessary. The down side is that not all devices
implement this in as optimal a way as the discard request does and on
some of these devices zeroing can take multiple seconds per GB.

Because write-zero requests may be slow and most (but not all) discard
requests read back as zeros, this work uses only discard requests.

Logan

--

Changes since v1:

   * Discard the data in the devices later in the create process
     while they are already open. This requires treating the
     s.discard option the same as the s.assume_clean option.
     Per Mariusz.
   * A couple other minor cleanup changes from Mariusz.

--

Logan Gunthorpe (2):
  mdadm: Add --discard option for Create
  manpage: Add --discard option to manpage

 Create.c   | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++----
 ReadMe.c   |  1 +
 mdadm.8.in | 15 +++++++++++++++
 mdadm.c    |  4 ++++
 mdadm.h    |  2 ++
 5 files changed, 73 insertions(+), 4 deletions(-)


base-commit: 171e9743881edf2dfb163ddff483566fbf913ccd
--
2.30.2

^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-08 23:08 [PATCH mdadm v2 0/2] Discard Option for Creating Arrays Logan Gunthorpe
@ 2022-09-08 23:08 ` Logan Gunthorpe
  2022-09-09  9:57   ` Mariusz Tkaczyk
  2022-09-19  8:41   ` Xiao Ni
  2022-09-08 23:08 ` [PATCH mdadm v2 2/2] manpage: Add --discard option to manpage Logan Gunthorpe
  2022-09-12 17:40 ` [PATCH mdadm v2 0/2] Discard Option for Creating Arrays Martin K. Petersen
  2 siblings, 2 replies; 18+ messages in thread
From: Logan Gunthorpe @ 2022-09-08 23:08 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe

Add the --discard option for Create which will send BLKDISCARD requests to
all disks before assembling the array. This is a fast way to know the
current state of all the disks. If the discard request zeros the data
on the disks (as is common but not universal) then the array is in a
state with correct parity. Thus the initial sync may be skipped.

After issuing each discard request, check if the first page of the
block device is zero. If it is, it is safe to assume the entire
disk should be zero. If it's not report an error.

If all the discard requests are successful and there are no missing
disks thin it is safe to set assume_clean as we know the array is clean.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 Create.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
 ReadMe.c |  1 +
 mdadm.c  |  4 ++++
 mdadm.h  |  2 ++
 4 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/Create.c b/Create.c
index e06ec2ae96a1..52bb88bccd53 100644
--- a/Create.c
+++ b/Create.c
@@ -26,6 +26,12 @@
 #include	"md_u.h"
 #include	"md_p.h"
 #include	<ctype.h>
+#include	<sys/ioctl.h>
+
+#ifndef BLKDISCARD
+#define BLKDISCARD _IO(0x12,119)
+#endif
+#include	<fcntl.h>
 
 static int round_size_and_verify(unsigned long long *size, int chunk)
 {
@@ -91,6 +97,38 @@ int default_layout(struct supertype *st, int level, int verbose)
 	return layout;
 }
 
+static int discard_device(struct context *c, int fd, const char *devname,
+			  unsigned long long offset, unsigned long long size)
+{
+	uint64_t range[2] = {offset, size};
+	unsigned long buf[4096 / sizeof(unsigned long)];
+	unsigned long i;
+
+	if (c->verbose)
+		printf("discarding data from %lld to %lld on: %s\n",
+		       offset, size, devname);
+
+	if (ioctl(fd, BLKDISCARD, &range)) {
+		pr_err("discard failed on '%s': %m\n", devname);
+		return 1;
+	}
+
+	if (pread(fd, buf, sizeof(buf), offset) != sizeof(buf)) {
+		pr_err("failed to readback '%s' after discard: %m\n", devname);
+		return 1;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(buf); i++) {
+		if (buf[i]) {
+			pr_err("device did not read back zeros after discard on '%s': %lx\n",
+			       devname, buf[i]);
+			return 1;
+		}
+	}
+
+	return 0;
+}
+
 int Create(struct supertype *st, char *mddev,
 	   char *name, int *uuid,
 	   int subdevs, struct mddev_dev *devlist,
@@ -607,7 +645,7 @@ int Create(struct supertype *st, char *mddev,
 	 * as missing, so that a reconstruct happens (faster than re-parity)
 	 * FIX: Can we do this for raid6 as well?
 	 */
-	if (st->ss->external == 0 && s->assume_clean == 0 &&
+	if (st->ss->external == 0 && s->assume_clean == 0 && s->discard == 0 &&
 	    c->force == 0 && first_missing >= s->raiddisks) {
 		switch (s->level) {
 		case 4:
@@ -624,8 +662,8 @@ int Create(struct supertype *st, char *mddev,
 	/* For raid6, if creating with 1 missing drive, make a good drive
 	 * into a spare, else the create will fail
 	 */
-	if (s->assume_clean == 0 && c->force == 0 && first_missing < s->raiddisks &&
-	    st->ss->external == 0 &&
+	if (s->assume_clean == 0 && s->discard == 0 && c->force == 0 &&
+	    first_missing < s->raiddisks && st->ss->external == 0 &&
 	    second_missing >= s->raiddisks && s->level == 6) {
 		insert_point = s->raiddisks - 1;
 		if (insert_point == first_missing)
@@ -686,7 +724,7 @@ int Create(struct supertype *st, char *mddev,
 	     (insert_point < s->raiddisks || first_missing < s->raiddisks)) ||
 	    (s->level == 6 && (insert_point < s->raiddisks ||
 			       second_missing < s->raiddisks)) ||
-	    (s->level <= 0) || s->assume_clean) {
+	    (s->level <= 0) || s->assume_clean || s->discard) {
 		info.array.state = 1; /* clean, but one+ drive will be missing*/
 		info.resync_start = MaxSector;
 	} else {
@@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
 				}
 				if (fd >= 0)
 					remove_partitions(fd);
+
+				if (s->discard &&
+				    discard_device(c, fd, dv->devname,
+						   dv->data_offset << 9,
+						   s->size << 10)) {
+					ioctl(mdfd, STOP_ARRAY, NULL);
+					goto abort_locked;
+				}
+
 				if (st->ss->add_to_super(st, &inf->disk,
 							 fd, dv->devname,
 							 dv->data_offset)) {
diff --git a/ReadMe.c b/ReadMe.c
index 7f94847e9769..544a057f83a0 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -138,6 +138,7 @@ struct option long_options[] = {
     {"size",	  1, 0, 'z'},
     {"auto",	  1, 0, Auto}, /* also for --assemble */
     {"assume-clean",0,0, AssumeClean },
+    {"discard",	  0, 0, Discard },
     {"metadata",  1, 0, 'e'}, /* superblock format */
     {"bitmap",	  1, 0, Bitmap},
     {"bitmap-chunk", 1, 0, BitmapChunk},
diff --git a/mdadm.c b/mdadm.c
index 972adb524dfb..049cdce1cdd2 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -602,6 +602,10 @@ int main(int argc, char *argv[])
 			s.assume_clean = 1;
 			continue;
 
+		case O(CREATE, Discard):
+			s.discard = true;
+			continue;
+
 		case O(GROW,'n'):
 		case O(CREATE,'n'):
 		case O(BUILD,'n'): /* number of raid disks */
diff --git a/mdadm.h b/mdadm.h
index 941a5f3823a0..a1e0bc9f01ad 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -433,6 +433,7 @@ extern char Version[], Usage[], Help[], OptionHelp[],
  */
 enum special_options {
 	AssumeClean = 300,
+	Discard,
 	BitmapChunk,
 	WriteBehind,
 	ReAdd,
@@ -593,6 +594,7 @@ struct shape {
 	int	bitmap_chunk;
 	char	*bitmap_file;
 	int	assume_clean;
+	bool	discard;
 	int	write_behind;
 	unsigned long long size;
 	unsigned long long data_offset;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH mdadm v2 2/2] manpage: Add --discard option to manpage
  2022-09-08 23:08 [PATCH mdadm v2 0/2] Discard Option for Creating Arrays Logan Gunthorpe
  2022-09-08 23:08 ` [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
@ 2022-09-08 23:08 ` Logan Gunthorpe
  2022-09-12 17:40 ` [PATCH mdadm v2 0/2] Discard Option for Creating Arrays Martin K. Petersen
  2 siblings, 0 replies; 18+ messages in thread
From: Logan Gunthorpe @ 2022-09-08 23:08 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe

Document the new --discard option in the manpage.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 mdadm.8.in | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mdadm.8.in b/mdadm.8.in
index f273622641da..93a0a32dd314 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -836,6 +836,21 @@ array is resynced at creation.  From Linux version 3.0,
 .B \-\-assume\-clean
 can be used with that command to avoid the automatic resync.
 
+.TP
+.BR \-\-discard
+When creating an array, send block discard (aka trim or deallocate)
+requests to all the block devices. In most cases this should zero all
+the disks. If any discard fails, or if zeros are not read back from the
+disk, then the operation will be aborted.
+.IP
+If all discards succeed, and there are no missing disks specified,
+then the array should not need to be synchronized after it is created
+(as all disks are zero). The operation will thus proceed as if
+.B \-\-assume\-clean
+was specified.
+.IP
+This is only meaningful with --create.
+
 .TP
 .BR \-\-backup\-file=
 This is needed when
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-08 23:08 ` [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
@ 2022-09-09  9:57   ` Mariusz Tkaczyk
  2022-09-09 11:54     ` Roman Mamedov
  2022-09-09 15:47     ` Logan Gunthorpe
  2022-09-19  8:41   ` Xiao Ni
  1 sibling, 2 replies; 18+ messages in thread
From: Mariusz Tkaczyk @ 2022-09-09  9:57 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan

Hi Logan,
See comments below.

Thanks,
Mariusz

On Thu,  8 Sep 2022 17:08:46 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> Add the --discard option for Create which will send BLKDISCARD requests to
> all disks before assembling the array. This is a fast way to know the
> current state of all the disks. If the discard request zeros the data
> on the disks (as is common but not universal) then the array is in a
> state with correct parity. Thus the initial sync may be skipped.
> 
> After issuing each discard request, check if the first page of the
> block device is zero. If it is, it is safe to assume the entire
> disk should be zero. If it's not report an error.
> 
> If all the discard requests are successful and there are no missing
> disks thin it is safe to set assume_clean as we know the array is clean.

Please update message. We agreed in v1 that missing disks and discard features
are not related, right?

> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  Create.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
>  ReadMe.c |  1 +
>  mdadm.c  |  4 ++++
>  mdadm.h  |  2 ++
>  4 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/Create.c b/Create.c
> index e06ec2ae96a1..52bb88bccd53 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -26,6 +26,12 @@
>  #include	"md_u.h"
>  #include	"md_p.h"
>  #include	<ctype.h>
> +#include	<sys/ioctl.h>
> +
> +#ifndef BLKDISCARD
> +#define BLKDISCARD _IO(0x12,119)
> +#endif
> +#include	<fcntl.h>
>  
>  static int round_size_and_verify(unsigned long long *size, int chunk)
>  {
> @@ -91,6 +97,38 @@ int default_layout(struct supertype *st, int level, int
> verbose) return layout;
>  }
>  
> +static int discard_device(struct context *c, int fd, const char *devname,
> +			  unsigned long long offset, unsigned long long size)

Will be great if you can description.

> +{
> +	uint64_t range[2] = {offset, size};
Probably you don't need to specify [2] but it is not an issue I think.

> +	unsigned long buf[4096 / sizeof(unsigned long)];

Can you use any define for 4096? 

> +	unsigned long i;
> +
> +	if (c->verbose)
> +		printf("discarding data from %lld to %lld on: %s\n",
> +		       offset, size, devname);
> +
> +	if (ioctl(fd, BLKDISCARD, &range)) {
> +		pr_err("discard failed on '%s': %m\n", devname);
> +		return 1;
> +	}
> +
> +	if (pread(fd, buf, sizeof(buf), offset) != sizeof(buf)) {
> +		pr_err("failed to readback '%s' after discard: %m\n",
> devname);
> +		return 1;
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(buf); i++) {
> +		if (buf[i]) {
> +			pr_err("device did not read back zeros after discard
> on '%s': %lx\n",
> +			       devname, buf[i]);
In previous version I wanted to leave the message on stderr, but just move a
data (buf[i]) to debug, or if (verbose > 0).
I think that printing binary data in error message is not necessary.

BTW. I'm not sure if discard ensures that data will be all zero. It causes that
drive drops all references but I doesn't mean that data is zeroed. Could you
please check it in documentation? Should we expect zeroes?

> +			return 1;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
>  int Create(struct supertype *st, char *mddev,
>  	   char *name, int *uuid,
>  	   int subdevs, struct mddev_dev *devlist,
> @@ -607,7 +645,7 @@ int Create(struct supertype *st, char *mddev,
>  	 * as missing, so that a reconstruct happens (faster than re-parity)
>  	 * FIX: Can we do this for raid6 as well?
>  	 */
> -	if (st->ss->external == 0 && s->assume_clean == 0 &&
> +	if (st->ss->external == 0 && s->assume_clean == 0 && s->discard == 0
> && c->force == 0 && first_missing >= s->raiddisks) {
>  		switch (s->level) {
>  		case 4:
> @@ -624,8 +662,8 @@ int Create(struct supertype *st, char *mddev,
>  	/* For raid6, if creating with 1 missing drive, make a good drive
>  	 * into a spare, else the create will fail
>  	 */
> -	if (s->assume_clean == 0 && c->force == 0 && first_missing <
> s->raiddisks &&
> -	    st->ss->external == 0 &&
> +	if (s->assume_clean == 0 && s->discard == 0 && c->force == 0 &&
> +	    first_missing < s->raiddisks && st->ss->external == 0 &&
>  	    second_missing >= s->raiddisks && s->level == 6) {
>  		insert_point = s->raiddisks - 1;
>  		if (insert_point == first_missing)
> @@ -686,7 +724,7 @@ int Create(struct supertype *st, char *mddev,
>  	     (insert_point < s->raiddisks || first_missing < s->raiddisks))
> || (s->level == 6 && (insert_point < s->raiddisks ||
>  			       second_missing < s->raiddisks)) ||
> -	    (s->level <= 0) || s->assume_clean) {
> +	    (s->level <= 0) || s->assume_clean || s->discard) {
>  		info.array.state = 1; /* clean, but one+ drive will be
> missing*/ info.resync_start = MaxSector;
>  	} else {
> @@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
>  				}
>  				if (fd >= 0)
>  					remove_partitions(fd);
> +
> +				if (s->discard &&
> +				    discard_device(c, fd, dv->devname,
> +						   dv->data_offset << 9,
> +						   s->size << 10)) {
> +					ioctl(mdfd, STOP_ARRAY, NULL);
> +					goto abort_locked;
> +				}
> +
Feel free to use up to 100 char in one line it is allowed now.
Why we need dv->data_offset << 9 and  s->size << 10 here?
How this applies to zoned raid0?

>  				if (st->ss->add_to_super(st, &inf->disk,
>  							 fd, dv->devname,
>  							 dv->data_offset)) {
> diff --git a/ReadMe.c b/ReadMe.c
> index 7f94847e9769..544a057f83a0 100644
> --- a/ReadMe.c
> +++ b/ReadMe.c
> @@ -138,6 +138,7 @@ struct option long_options[] = {
>      {"size",	  1, 0, 'z'},
>      {"auto",	  1, 0, Auto}, /* also for --assemble */
>      {"assume-clean",0,0, AssumeClean },
> +    {"discard",	  0, 0, Discard },
>      {"metadata",  1, 0, 'e'}, /* superblock format */
>      {"bitmap",	  1, 0, Bitmap},
>      {"bitmap-chunk", 1, 0, BitmapChunk},
> diff --git a/mdadm.c b/mdadm.c
> index 972adb524dfb..049cdce1cdd2 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -602,6 +602,10 @@ int main(int argc, char *argv[])
>  			s.assume_clean = 1;
>  			continue;
>  
> +		case O(CREATE, Discard):
> +			s.discard = true;
> +			continue;
> +

I would like to set s.assume_clean=true along with discard. Then will be no need
to modify other conditions. If we are assuming that after discard all is zeros
then we can skip resync, right? According to message, it should be.
Please add message for user and set assume_clean too.

>  		case O(GROW,'n'):
>  		case O(CREATE,'n'):
>  		case O(BUILD,'n'): /* number of raid disks */
> diff --git a/mdadm.h b/mdadm.h
> index 941a5f3823a0..a1e0bc9f01ad 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -433,6 +433,7 @@ extern char Version[], Usage[], Help[], OptionHelp[],
>   */
>  enum special_options {
>  	AssumeClean = 300,
> +	Discard,
>  	BitmapChunk,
>  	WriteBehind,
>  	ReAdd,
> @@ -593,6 +594,7 @@ struct shape {
>  	int	bitmap_chunk;
>  	char	*bitmap_file;
>  	int	assume_clean;
> +	bool	discard;
>  	int	write_behind;
>  	unsigned long long size;
>  	unsigned long long data_offset;


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-09  9:57   ` Mariusz Tkaczyk
@ 2022-09-09 11:54     ` Roman Mamedov
       [not found]       ` <CABdXBANrJNWjq4237k9DPRoxLVmiAUoKMZxaaLUrcMHsODwvmA@mail.gmail.com>
  2022-09-12 17:43       ` Martin K. Petersen
  2022-09-09 15:47     ` Logan Gunthorpe
  1 sibling, 2 replies; 18+ messages in thread
From: Roman Mamedov @ 2022-09-09 11:54 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: Logan Gunthorpe, linux-raid, Jes Sorensen, Guoqing Jiang,
	Xiao Ni, Coly Li, Chaitanya Kulkarni, Jonmichael Hands,
	Stephen Bates, Martin Oliveira, David Sloan

On Fri, 9 Sep 2022 11:57:49 +0200
Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> wrote:

> BTW. I'm not sure if discard ensures that data will be all zero. It causes that
> drive drops all references but I doesn't mean that data is zeroed. Could you
> please check it in documentation? Should we expect zeroes?

Indeed, the references are dropped, but the actual data is still on disk.
Physically wiping it would be time-consuming. Instead the "zeroing" is achieved
via the fact that trying to read back areas for which the drive has no
reference, makes the drive return all zeroes instead.

The behavior varies per SSD model. Some do not return zeroes after a discard,
but some binary garbage instead.

In the ATA layer there is the sysfs file "discard_zeroes_data", which tells
whether or not to expect zeroes after a discard. That flag does not appear to
be always correct, e.g. I have some SSDs which do observably read back zeroes
after a discard, but the flag still says "0". Rather than relying on that flag
Logan opted for a manual check, reading back the actual disk data to check for
zeroes, which I think could employ the optimization safely enough much more
often, than simply going per the flag value would allow.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
       [not found]       ` <CABdXBANrJNWjq4237k9DPRoxLVmiAUoKMZxaaLUrcMHsODwvmA@mail.gmail.com>
@ 2022-09-09 15:31         ` Roman Mamedov
  0 siblings, 0 replies; 18+ messages in thread
From: Roman Mamedov @ 2022-09-09 15:31 UTC (permalink / raw)
  To: Jonmichael Hands
  Cc: Mariusz Tkaczyk, Logan Gunthorpe, linux-raid, Jes Sorensen,
	Guoqing Jiang, Xiao Ni, Coly Li, Chaitanya Kulkarni,
	Stephen Bates, Martin Oliveira, David Sloan

On Fri, 9 Sep 2022 07:40:00 -0700
Jonmichael Hands <jm@chia.net> wrote:

> Deterministic read zero after TRIM is standardized and can be reported by
> device. If the device supports these bits, they will always return zero
> after discard.

Point is that some/many devices do not report this capability properly, but
actually do read back zeroes after TRIM. Relying on the reporting alone would
exclude a lot of cases where the optimization could have been used. So it is
useful to verify the observed behavior and still allow the optimization in
those cases.

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-09  9:57   ` Mariusz Tkaczyk
  2022-09-09 11:54     ` Roman Mamedov
@ 2022-09-09 15:47     ` Logan Gunthorpe
  2022-09-13  7:35       ` Mariusz Tkaczyk
  1 sibling, 1 reply; 18+ messages in thread
From: Logan Gunthorpe @ 2022-09-09 15:47 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan



On 2022-09-09 03:57, Mariusz Tkaczyk wrote:
>> If all the discard requests are successful and there are no missing
>> disks thin it is safe to set assume_clean as we know the array is clean.
> 
> Please update message. We agreed in v1 that missing disks and discard features
> are not related, right?

Oops, yes, I'll update the commit message for v3.


>> +static int discard_device(struct context *c, int fd, const char *devname,
>> +			  unsigned long long offset, unsigned long long size)
> 
> Will be great if you can description.

Ok, will do for v3.

>> +{
>> +	uint64_t range[2] = {offset, size};
> Probably you don't need to specify [2] but it is not an issue I think.
> 
>> +	unsigned long buf[4096 / sizeof(unsigned long)];
> 
> Can you use any define for 4096? 

I don't see any appropriate defines in the code base. It really just
needs to be bigger than any O_DIRECT restrictions. 4096 bytes is usually
the worst case.

>> +	unsigned long i;
>> +
>> +	if (c->verbose)
>> +		printf("discarding data from %lld to %lld on: %s\n",
>> +		       offset, size, devname);
>> +
>> +	if (ioctl(fd, BLKDISCARD, &range)) {
>> +		pr_err("discard failed on '%s': %m\n", devname);
>> +		return 1;
>> +	}
>> +
>> +	if (pread(fd, buf, sizeof(buf), offset) != sizeof(buf)) {
>> +		pr_err("failed to readback '%s' after discard: %m\n",
>> devname);
>> +		return 1;
>> +	}
>> +
>> +	for (i = 0; i < ARRAY_SIZE(buf); i++) {
>> +		if (buf[i]) {
>> +			pr_err("device did not read back zeros after discard
>> on '%s': %lx\n",
>> +			       devname, buf[i]);
> In previous version I wanted to leave the message on stderr, but just move a
> data (buf[i]) to debug, or if (verbose > 0).
> I think that printing binary data in error message is not necessary.

I added the hex because it might be informative to know what a discard
did to the device (all FFs or random data).

> BTW. I'm not sure if discard ensures that data will be all zero. It causes that
> drive drops all references but I doesn't mean that data is zeroed. Could you
> please check it in documentation? Should we expect zeroes?

That's correct. I discussed this in the cover letter. That's why this
check is here. Per some of the discussion from others I still think the
best course of action is to just check what the discard did and fail if
it is non-zero. Even though many NVMe and ATA devices have the ability
to control or query the behaviour, the kernel doesn't support this and
I don't think it can be relied upon.


>> @@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
>>  				}
>>  				if (fd >= 0)
>>  					remove_partitions(fd);
>> +
>> +				if (s->discard &&
>> +				    discard_device(c, fd, dv->devname,
>> +						   dv->data_offset << 9,
>> +						   s->size << 10)) {
>> +					ioctl(mdfd, STOP_ARRAY, NULL);
>> +					goto abort_locked;
>> +				}
>> +
> Feel free to use up to 100 char in one line it is allowed now.
> Why we need dv->data_offset << 9 and  s->size << 10 here?
> How this applies to zoned raid0?

As I understand it the offset and size will give the bounds of the
data region on the disk. Do you not think it works for zoned raid0?

>> diff --git a/mdadm.c b/mdadm.c
>> index 972adb524dfb..049cdce1cdd2 100644
>> --- a/mdadm.c
>> +++ b/mdadm.c
>> @@ -602,6 +602,10 @@ int main(int argc, char *argv[])
>>  			s.assume_clean = 1;
>>  			continue;
>>  
>> +		case O(CREATE, Discard):
>> +			s.discard = true;
>> +			continue;
>> +
> 
> I would like to set s.assume_clean=true along with discard. Then will be no need
> to modify other conditions. If we are assuming that after discard all is zeros
> then we can skip resync, right? According to message, it should be.
> Please add message for user and set assume_clean too.


Well it was my opinion that it was clearer in the code to just
explicitly include discard in the conditionals instead of making discard
also set assume-clean, but if you think otherwise I can change it for v3.

What kind of user message are you thinking is necessary here?

Logan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 0/2] Discard Option for Creating Arrays
  2022-09-08 23:08 [PATCH mdadm v2 0/2] Discard Option for Creating Arrays Logan Gunthorpe
  2022-09-08 23:08 ` [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
  2022-09-08 23:08 ` [PATCH mdadm v2 2/2] manpage: Add --discard option to manpage Logan Gunthorpe
@ 2022-09-12 17:40 ` Martin K. Petersen
       [not found]   ` <CABdXBAP0LeQMmhSLUMZ_TmnSp5xmZ4xJBkNa7HUm7094m_x9xA@mail.gmail.com>
  2022-09-13 15:38   ` Logan Gunthorpe
  2 siblings, 2 replies; 18+ messages in thread
From: Martin K. Petersen @ 2022-09-12 17:40 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni,
	Mariusz Tkaczyk, Coly Li, Chaitanya Kulkarni, Jonmichael Hands,
	Stephen Bates, Martin Oliveira, David Sloan


Hi Logan!

> When specified, mdadm will send block discard (aka. trim or
> deallocate) requests to all of the specified block devices. It will
> then read back parts of the device to double check that the disks are
> now all zeros. If they are all zero, the array is in a known state and
> does not need to generate the parity seeing everything is zero and
> correct.

Unfortunately that's a dangerous assertion. The drive is free to ignore
any or all parts of a discard request. And typically the results vary
depending on what else the drive has going on at the moment the request
was executed.  I.e. you could experience completely different results on
the same drive depending on whether it was busy garbage collecting or
doing other I/O when the various portions of a discard request were
processed.

> Another option for this work is to use a write zero request. This can
> be done in linux currently with fallocate and the FALLOC_FL_PUNCH_HOLE
> | FALLOC_FL_KEEP_SIZE flags. This will send optimized write-zero requests
> to the devices, without falling back to regular writes to zero the disk.
> The benefit of this is that the disk will explicitly read back as zeros,
> so a zero check is not necessary. The down side is that not all devices
> implement this in as optimal a way as the discard request does and on
> some of these devices zeroing can take multiple seconds per GB.

REQ_OP_WRITE_ZEROES was explicitly designed for this use case. It will
use discards if it is safe to do so. That is if the device supports
deterministic zeroing; either explicitly through the storage protocol or
through ATA quirks (thanks to the drive being vendor-qualified for RAID
usage).

> Because write-zero requests may be slow and most (but not all) discard
> requests read back as zeros, this work uses only discard requests.

REQ_OP_WRITE_ZEROES will pick the most optimal way to guarantee that all
blocks in the requested range will return zeroes for subsequent reads.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-09 11:54     ` Roman Mamedov
       [not found]       ` <CABdXBANrJNWjq4237k9DPRoxLVmiAUoKMZxaaLUrcMHsODwvmA@mail.gmail.com>
@ 2022-09-12 17:43       ` Martin K. Petersen
  1 sibling, 0 replies; 18+ messages in thread
From: Martin K. Petersen @ 2022-09-12 17:43 UTC (permalink / raw)
  To: Roman Mamedov
  Cc: Mariusz Tkaczyk, Logan Gunthorpe, linux-raid, Jes Sorensen,
	Guoqing Jiang, Xiao Ni, Coly Li, Chaitanya Kulkarni,
	Jonmichael Hands, Stephen Bates, Martin Oliveira, David Sloan


Roman,

> In the ATA layer there is the sysfs file "discard_zeroes_data", which
> tells whether or not to expect zeroes after a discard. That flag does
> not appear to be always correct, e.g. I have some SSDs which do
> observably read back zeroes after a discard, but the flag still says
> "0".

It is hardwired to "0" for backwards compatibility. We don't support
discards for clearing block ranges anymore. There is an explicit zeroing
operation for that use case.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 0/2] Discard Option for Creating Arrays
       [not found]   ` <CABdXBAP0LeQMmhSLUMZ_TmnSp5xmZ4xJBkNa7HUm7094m_x9xA@mail.gmail.com>
@ 2022-09-13  3:47     ` Martin K. Petersen
  0 siblings, 0 replies; 18+ messages in thread
From: Martin K. Petersen @ 2022-09-13  3:47 UTC (permalink / raw)
  To: Jonmichael Hands
  Cc: Martin K. Petersen, Logan Gunthorpe, linux-raid, Jes Sorensen,
	Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Stephen Bates, Martin Oliveira, David Sloan


Jonmichael,

> are there capabilities of REQ_OP_WRITE_ZEROES for detection of NVMe
> DLFEAT in the identify namespace information? The purpose of this
> capability is for operating systems to detect it, precisely for use
> cases like we have identified where deterministic read zero is
> required to save a tremendous amount of time and NAND endurance.

I don't believe DEAC/DLFEAT are currently wired up in the NVMe driver
but it would be trivial to match what SCSI does in that department.

The intent of the REQ_OP_WRITE_ZEROES interface is to provide the choice
between deallocate semantics (think discard) and allocate semantics
(think write same) for zeroing. See the BLKDEV_ZERO_NOUNMAP flag for
more info.

The important distinction between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES
is that the latter is a data integrity operation that produces
deterministic results. I.e. guarantees that all blocks will return
zeroes on subsequent reads. Whereas REQ_OP_DISCARD is a hint that can
and often will skip portions of the request sent.

It was a mistake to conflate deallocation and zeroing in our initial
implementation of discards in Linux. We have painstakingly removed that
and now provide two distinct interfaces: REQ_OP_DISCARD tells a device
that a block range is no longer in use, we don't care about block
contents for future reads. Whereas REQ_OP_WRITE_ZEROES aims to provide
an optimal interface for clearing block ranges given the reported
characteristics of a given device.

Note that I am careful about using REQ_OP_DISCARD and
REQ_OP_WRITE_ZEROES terminology to describe the block layer primitives
for deallocating and zeroing block ranges here. At the bottom of the
stack, a REQ_OP_WRITE_ZEROES operation could very well end up issuing
what people would think of as a "discard" operation (DSM TRIM, WRITE
SAME w/UNMAP) assuming the device has been identified as doing the right
thing.

Anything operating at the block device level should be using the
REQ_OP_DISCARD/REQ_OP_WRITE_ZEROES primitives (or their corresponding
ioctls or fallocate flags). And if there is a need to address how those
primitives are translated into commands for a given device, then we
should handle that in the relevant device driver.

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-09 15:47     ` Logan Gunthorpe
@ 2022-09-13  7:35       ` Mariusz Tkaczyk
  2022-09-13 15:43         ` Logan Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Mariusz Tkaczyk @ 2022-09-13  7:35 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan

On Fri, 9 Sep 2022 09:47:21 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:
> >> +{
> >> +	uint64_t range[2] = {offset, size};  
> > Probably you don't need to specify [2] but it is not an issue I think.
> >   
> >> +	unsigned long buf[4096 / sizeof(unsigned long)];  
> > 
> > Can you use any define for 4096?   
> 
> I don't see any appropriate defines in the code base. It really just
> needs to be bigger than any O_DIRECT restrictions. 4096 bytes is usually
> the worst case.

See comment bellow.
> 
> >> +	unsigned long i;
> >> +
> >> +	if (c->verbose)
> >> +		printf("discarding data from %lld to %lld on: %s\n",
> >> +		       offset, size, devname);
> >> +
> >> +	if (ioctl(fd, BLKDISCARD, &range)) {
> >> +		pr_err("discard failed on '%s': %m\n", devname);
> >> +		return 1;
> >> +	}
> >> +
> >> +	if (pread(fd, buf, sizeof(buf), offset) != sizeof(buf)) {
> >> +		pr_err("failed to readback '%s' after discard: %m\n",
> >> devname);
> >> +		return 1;
> >> +	}
> >> +
> >> +	for (i = 0; i < ARRAY_SIZE(buf); i++) {
> >> +		if (buf[i]) {
> >> +			pr_err("device did not read back zeros after
> >> discard on '%s': %lx\n",
> >> +			       devname, buf[i]);  
> > In previous version I wanted to leave the message on stderr, but just move a
> > data (buf[i]) to debug, or if (verbose > 0).
> > I think that printing binary data in error message is not necessary.  
> 
> I added the hex because it might be informative to know what a discard
> did to the device (all FFs or random data).
> 
> > BTW. I'm not sure if discard ensures that data will be all zero. It causes
> > that drive drops all references but I doesn't mean that data is zeroed.
> > Could you please check it in documentation? Should we expect zeroes?  
> 
> That's correct. I discussed this in the cover letter. That's why this
> check is here. Per some of the discussion from others I still think the
> best course of action is to just check what the discard did and fail if
> it is non-zero. Even though many NVMe and ATA devices have the ability
> to control or query the behaviour, the kernel doesn't support this and
> I don't think it can be relied upon.

It is also controversial approach[1].

I think that the best we can do here is to warn user, for example:
"Please ensure that drive you used return zeros after discard."
We should ask for confirmation (it should be possible to skip with --run).
I would like to leave discard feature validation on user side, it is not mdadm
task. Simply, if you want to use it, then it is done on your own risk and
hopefully you know what you want to achieve.
That will simplify implementation on mdadm side. What do you think?

> >> @@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
> >>  				}
> >>  				if (fd >= 0)
> >>  					remove_partitions(fd);
> >> +
> >> +				if (s->discard &&
> >> +				    discard_device(c, fd, dv->devname,
> >> +						   dv->data_offset << 9,
> >> +						   s->size << 10)) {
> >> +					ioctl(mdfd, STOP_ARRAY, NULL);
> >> +					goto abort_locked;
> >> +				}
> >> +  
> > Feel free to use up to 100 char in one line it is allowed now.
> > Why we need dv->data_offset << 9 and  s->size << 10 here?
> > How this applies to zoned raid0?  
> 
> As I understand it the offset and size will give the bounds of the
> data region on the disk. Do you not think it works for zoned raid0?

mdadm operates on 512B, so using 4K data regions could be destructive.
Also left shift causes that size value is increasing. We can't clear more that
user requested. We need to use 512b sectors as mdadm does.

I don't know how native raid0 size is passed and how zones are created but I
suspect that s->size may not be correct for all drives. It it a global property
but for zoned raid member size could be different. You need to check how it
applies.

Also, I'm not sure if this feature is needed for raid0, because there is no
resync. Maybe we can exclude raid0 from discard?
> 
> >> diff --git a/mdadm.c b/mdadm.c
> >> index 972adb524dfb..049cdce1cdd2 100644
> >> --- a/mdadm.c
> >> +++ b/mdadm.c
> >> @@ -602,6 +602,10 @@ int main(int argc, char *argv[])
> >>  			s.assume_clean = 1;
> >>  			continue;
> >>  
> >> +		case O(CREATE, Discard):
> >> +			s.discard = true;
> >> +			continue;
> >> +  
> > 
> > I would like to set s.assume_clean=true along with discard. Then will be no
> > need to modify other conditions. If we are assuming that after discard all
> > is zeros then we can skip resync, right? According to message, it should be.
> > Please add message for user and set assume_clean too.  
> 
> 
> Well it was my opinion that it was clearer in the code to just
> explicitly include discard in the conditionals instead of making discard
> also set assume-clean, but if you think otherwise I can change it for v3.
> 
> What kind of user message are you thinking is necessary here?

In my convention, all shape attributes should be set in mdadm.c, later they
should be considered as const, we should not overwrite them. This structure
represents user settings.
This is why I consider updating assume_clean in Create.c as wrong. When discard
is set then we are assuming that user want to skip resync too, otherwise it
doesn't have sense. Also, if discard of any drive fails we are returning error
and create operation will fail anyway.

I would expected something like: "Discard requested, setting --assume-clean to
skip resync".
Also, will be great if you can add some tests, at least for command-line.

[1] https://lore.kernel.org/linux-raid/yq1sfkw7yqi.fsf@ca-mkp.ca.oracle.com/T/#t

Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 0/2] Discard Option for Creating Arrays
  2022-09-12 17:40 ` [PATCH mdadm v2 0/2] Discard Option for Creating Arrays Martin K. Petersen
       [not found]   ` <CABdXBAP0LeQMmhSLUMZ_TmnSp5xmZ4xJBkNa7HUm7094m_x9xA@mail.gmail.com>
@ 2022-09-13 15:38   ` Logan Gunthorpe
  1 sibling, 0 replies; 18+ messages in thread
From: Logan Gunthorpe @ 2022-09-13 15:38 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni,
	Mariusz Tkaczyk, Coly Li, Chaitanya Kulkarni, Jonmichael Hands,
	Stephen Bates, Martin Oliveira, David Sloan



On 2022-09-12 11:40, Martin K. Petersen wrote:
>> Because write-zero requests may be slow and most (but not all) discard
>> requests read back as zeros, this work uses only discard requests.
> 
> REQ_OP_WRITE_ZEROES will pick the most optimal way to guarantee that all
> blocks in the requested range will return zeroes for subsequent reads.

Makes sense to me. I wanted to use WRITE_ZEROS but was disappointed in
the performance even on the nvme drive we tested. I'll respin this to
use WRITE_ZEROS and maybe we can do some investigation on more NVMe
drives to see if more of them might do it efficiently.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-13  7:35       ` Mariusz Tkaczyk
@ 2022-09-13 15:43         ` Logan Gunthorpe
  2022-09-14 12:01           ` Mariusz Tkaczyk
  0 siblings, 1 reply; 18+ messages in thread
From: Logan Gunthorpe @ 2022-09-13 15:43 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan



On 2022-09-13 01:35, Mariusz Tkaczyk wrote:
> On Fri, 9 Sep 2022 09:47:21 -0600
> Logan Gunthorpe <logang@deltatee.com> wrote:
>>>> +{
>>>> +	uint64_t range[2] = {offset, size};  
>>> Probably you don't need to specify [2] but it is not an issue I think.
>>>   
>>>> +	unsigned long buf[4096 / sizeof(unsigned long)];  
>>>
>>> Can you use any define for 4096?   
>>
>> I don't see any appropriate defines in the code base. It really just
>> needs to be bigger than any O_DIRECT restrictions. 4096 bytes is usually
>> the worst case.
> 
> See comment bellow.

I don't see how the comment below relates to this at all. This 4k has
nothing to do with anything related to the array, it wos only a
convenient size to read and check the result. But per Martin's points,
this code will go away in v3 seeing it's more appropriate to use
WRITE_ZEROS and that interface guarantees that there will be zeros, thus
no need to check.


>> That's correct. I discussed this in the cover letter. That's why this
>> check is here. Per some of the discussion from others I still think the
>> best course of action is to just check what the discard did and fail if
>> it is non-zero. Even though many NVMe and ATA devices have the ability
>> to control or query the behaviour, the kernel doesn't support this and
>> I don't think it can be relied upon.
> 
> It is also controversial approach[1].
> 
> I think that the best we can do here is to warn user, for example:
> "Please ensure that drive you used return zeros after discard."
> We should ask for confirmation (it should be possible to skip with --run).
> I would like to leave discard feature validation on user side, it is not mdadm
> task. Simply, if you want to use it, then it is done on your own risk and
> hopefully you know what you want to achieve.
> That will simplify implementation on mdadm side. What do you think?

I think the better approach is to just use WRITE_ZEROS.

>>>> @@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
>>>>  				}
>>>>  				if (fd >= 0)
>>>>  					remove_partitions(fd);
>>>> +
>>>> +				if (s->discard &&
>>>> +				    discard_device(c, fd, dv->devname,
>>>> +						   dv->data_offset << 9,
>>>> +						   s->size << 10)) {
>>>> +					ioctl(mdfd, STOP_ARRAY, NULL);
>>>> +					goto abort_locked;
>>>> +				}
>>>> +  
>>> Feel free to use up to 100 char in one line it is allowed now.
>>> Why we need dv->data_offset << 9 and  s->size << 10 here?
>>> How this applies to zoned raid0?  
>>
>> As I understand it the offset and size will give the bounds of the
>> data region on the disk. Do you not think it works for zoned raid0?
> 
> mdadm operates on 512B, so using 4K data regions could be destructive.
> Also left shift causes that size value is increasing. We can't clear more that
> user requested. We need to use 512b sectors as mdadm does.

I don't really follow this.

> I don't know how native raid0 size is passed and how zones are created but I
> suspect that s->size may not be correct for all drives. It it a global property
> but for zoned raid member size could be different. You need to check how it
> applies.

> Also, I'm not sure if this feature is needed for raid0, because there is no
> resync. Maybe we can exclude raid0 from discard?

I'll check raid0 size. If possible I'd rather not have the restriction
to avoid raid0 as it becomes complicated and users may have reason to
zero besides avoiding resync.
>>
>> Well it was my opinion that it was clearer in the code to just
>> explicitly include discard in the conditionals instead of making discard
>> also set assume-clean, but if you think otherwise I can change it for v3.
>>
>> What kind of user message are you thinking is necessary here?
> 
> In my convention, all shape attributes should be set in mdadm.c, later they
> should be considered as const, we should not overwrite them. This structure
> represents user settings.
> This is why I consider updating assume_clean in Create.c as wrong. When discard
> is set then we are assuming that user want to skip resync too, otherwise it
> doesn't have sense. Also, if discard of any drive fails we are returning error
> and create operation will fail anyway.

The v2 version of this patch did not modify the shape attributes in
Create.c; that was only in v1.

> I would expected something like: "Discard requested, setting --assume-clean to
> skip resync".
> Also, will be great if you can add some tests, at least for command-line.

Ok.

Logan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-13 15:43         ` Logan Gunthorpe
@ 2022-09-14 12:01           ` Mariusz Tkaczyk
  2022-09-14 16:29             ` Logan Gunthorpe
  0 siblings, 1 reply; 18+ messages in thread
From: Mariusz Tkaczyk @ 2022-09-14 12:01 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan

On Tue, 13 Sep 2022 09:43:52 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 2022-09-13 01:35, Mariusz Tkaczyk wrote:
> > On Fri, 9 Sep 2022 09:47:21 -0600
> > Logan Gunthorpe <logang@deltatee.com> wrote:  
> >>>> +{
> >>>> +	uint64_t range[2] = {offset, size};    
> >>> Probably you don't need to specify [2] but it is not an issue I think.
> >>>     
> >>>> +	unsigned long buf[4096 / sizeof(unsigned long)];    
> >>>
> >>> Can you use any define for 4096?     
> >>
> >> I don't see any appropriate defines in the code base. It really just
> >> needs to be bigger than any O_DIRECT restrictions. 4096 bytes is usually
> >> the worst case.  
> > 
> > See comment bellow.  
> 
> I don't see how the comment below relates to this at all. This 4k has
> nothing to do with anything related to the array, it wos only a
> convenient size to read and check the result. But per Martin's points,
> this code will go away in v3 seeing it's more appropriate to use
> WRITE_ZEROS and that interface guarantees that there will be zeros, thus
> no need to check.

I suggested to skip verification at all in next comment but as you said with
WRITE_ZEROS it is not needed anyway. Sorry for being inaccurate.

> 
> 
> >> That's correct. I discussed this in the cover letter. That's why this
> >> check is here. Per some of the discussion from others I still think the
> >> best course of action is to just check what the discard did and fail if
> >> it is non-zero. Even though many NVMe and ATA devices have the ability
> >> to control or query the behaviour, the kernel doesn't support this and
> >> I don't think it can be relied upon.  
> > 
> > It is also controversial approach[1].
> > 
> > I think that the best we can do here is to warn user, for example:
> > "Please ensure that drive you used return zeros after discard."
> > We should ask for confirmation (it should be possible to skip with --run).
> > I would like to leave discard feature validation on user side, it is not
> > mdadm task. Simply, if you want to use it, then it is done on your own risk
> > and hopefully you know what you want to achieve.
> > That will simplify implementation on mdadm side. What do you think?  
> 
> I think the better approach is to just use WRITE_ZEROS.

Agree.

> 
> >>>> @@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
> >>>>  				}
> >>>>  				if (fd >= 0)
> >>>>  					remove_partitions(fd);
> >>>> +
> >>>> +				if (s->discard &&
> >>>> +				    discard_device(c, fd, dv->devname,
> >>>> +						   dv->data_offset << 9,
> >>>> +						   s->size << 10)) {
> >>>> +					ioctl(mdfd, STOP_ARRAY, NULL);
> >>>> +					goto abort_locked;
> >>>> +				}
> >>>> +    
> >>> Feel free to use up to 100 char in one line it is allowed now.
> >>> Why we need dv->data_offset << 9 and  s->size << 10 here?
> >>> How this applies to zoned raid0?    
> >>
> >> As I understand it the offset and size will give the bounds of the
> >> data region on the disk. Do you not think it works for zoned raid0?  
> > 
> > mdadm operates on 512B, so using 4K data regions could be destructive.
> > Also left shift causes that size value is increasing. We can't clear more
> > that user requested. We need to use 512b sectors as mdadm does.  
> 
> I don't really follow this.

I understand that you want left shit is used to round size to data region
and I assumed that data_region is 4K and that is probably wrong.
You are right I has no sense, my apologizes.

Let's imagine that our size is for example, 2687 sectors. Left shit will
cause that we will get 2751488 and that will be passed as a size to function.
Similar for data_offset. That is much more than we want to clear.
Do I miss something? I guess that ioctl operates on sectors too but please
correct me if that is wrong.

> 
> > I don't know how native raid0 size is passed and how zones are created but I
> > suspect that s->size may not be correct for all drives. It it a global
> > property but for zoned raid member size could be different. You need to
> > check how it applies.  
> 
> > Also, I'm not sure if this feature is needed for raid0, because there is no
> > resync. Maybe we can exclude raid0 from discard?  
> 
> I'll check raid0 size. If possible I'd rather not have the restriction
> to avoid raid0 as it becomes complicated and users may have reason to
> zero besides avoiding resync.

Agree, thanks.

> >>
> >> Well it was my opinion that it was clearer in the code to just
> >> explicitly include discard in the conditionals instead of making discard
> >> also set assume-clean, but if you think otherwise I can change it for v3.
> >>
> >> What kind of user message are you thinking is necessary here?  
> > 
> > In my convention, all shape attributes should be set in mdadm.c, later they
> > should be considered as const, we should not overwrite them. This structure
> > represents user settings.
> > This is why I consider updating assume_clean in Create.c as wrong. When
> > discard is set then we are assuming that user want to skip resync too,
> > otherwise it doesn't have sense. Also, if discard of any drive fails we are
> > returning error and create operation will fail anyway.  
> 
> The v2 version of this patch did not modify the shape attributes in
> Create.c; that was only in v1.

Ok, I missed that, thanks.
> 
> > I would expected something like: "Discard requested, setting --assume-clean
> > to skip resync".
> > Also, will be great if you can add some tests, at least for command-line.  
> 
> Ok.
> 

Mariusz

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-14 12:01           ` Mariusz Tkaczyk
@ 2022-09-14 16:29             ` Logan Gunthorpe
  2022-09-14 17:39               ` Mariusz Tkaczyk
  0 siblings, 1 reply; 18+ messages in thread
From: Logan Gunthorpe @ 2022-09-14 16:29 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan



On 2022-09-14 06:01, Mariusz Tkaczyk wrote:
>>>> As I understand it the offset and size will give the bounds of the
>>>> data region on the disk. Do you not think it works for zoned raid0?  
>>>
>>> mdadm operates on 512B, so using 4K data regions could be destructive.
>>> Also left shift causes that size value is increasing. We can't clear more
>>> that user requested. We need to use 512b sectors as mdadm does.  
>>
>> I don't really follow this.
> 
> I understand that you want left shit is used to round size to data region
> and I assumed that data_region is 4K and that is probably wrong.
> You are right I has no sense, my apologizes.
> 
> Let's imagine that our size is for example, 2687 sectors. Left shit will
> cause that we will get 2751488 and that will be passed as a size to function.
> Similar for data_offset. That is much more than we want to clear.
> Do I miss something? I guess that ioctl operates on sectors too but please
> correct me if that is wrong.

The BLKDISCARD ioctl assumes bytes for the range, not sectors. Though it
does have to be sector aligned.

dv->data_offset is then in sectors, so we shift by 9.

s->size is in KB, so we shift by 10.

Logan

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-14 16:29             ` Logan Gunthorpe
@ 2022-09-14 17:39               ` Mariusz Tkaczyk
  0 siblings, 0 replies; 18+ messages in thread
From: Mariusz Tkaczyk @ 2022-09-14 17:39 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan

On Wed, 14 Sep 2022 10:29:53 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> On 2022-09-14 06:01, Mariusz Tkaczyk wrote:
> >>>> As I understand it the offset and size will give the bounds of the
> >>>> data region on the disk. Do you not think it works for zoned raid0?    
> >>>
> >>> mdadm operates on 512B, so using 4K data regions could be destructive.
> >>> Also left shift causes that size value is increasing. We can't clear more
> >>> that user requested. We need to use 512b sectors as mdadm does.    
> >>
> >> I don't really follow this.  
> > 
> > I understand that you want left shit is used to round size to data region
> > and I assumed that data_region is 4K and that is probably wrong.
> > You are right I has no sense, my apologizes.
> > 
> > Let's imagine that our size is for example, 2687 sectors. Left shit will
> > cause that we will get 2751488 and that will be passed as a size to
> > function. Similar for data_offset. That is much more than we want to clear.
> > Do I miss something? I guess that ioctl operates on sectors too but please
> > correct me if that is wrong.  
> 
> The BLKDISCARD ioctl assumes bytes for the range, not sectors. Though it
> does have to be sector aligned.
> 
> dv->data_offset is then in sectors, so we shift by 9.
> 
> s->size is in KB, so we shift by 10.
> 
Got it!
I will be thankful if you can create #defines for that. For Example:
KIB_TO_BYTES(s->size)
SEC_TO_BYTES(dv->offset)

We could reuse them later in other places.
Thanks,
Mariusz

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-08 23:08 ` [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
  2022-09-09  9:57   ` Mariusz Tkaczyk
@ 2022-09-19  8:41   ` Xiao Ni
  2022-09-21 18:45     ` Logan Gunthorpe
  1 sibling, 1 reply; 18+ messages in thread
From: Xiao Ni @ 2022-09-19  8:41 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Mariusz Tkaczyk,
	Coly Li, Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan

On Fri, Sep 9, 2022 at 7:09 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> Add the --discard option for Create which will send BLKDISCARD requests to
> all disks before assembling the array. This is a fast way to know the
> current state of all the disks. If the discard request zeros the data
> on the disks (as is common but not universal) then the array is in a
> state with correct parity. Thus the initial sync may be skipped.
>
> After issuing each discard request, check if the first page of the
> block device is zero. If it is, it is safe to assume the entire
> disk should be zero. If it's not report an error.
>
> If all the discard requests are successful and there are no missing
> disks thin it is safe to set assume_clean as we know the array is clean.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  Create.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++++++----
>  ReadMe.c |  1 +
>  mdadm.c  |  4 ++++
>  mdadm.h  |  2 ++
>  4 files changed, 58 insertions(+), 4 deletions(-)
>
> diff --git a/Create.c b/Create.c
> index e06ec2ae96a1..52bb88bccd53 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -26,6 +26,12 @@
>  #include       "md_u.h"
>  #include       "md_p.h"
>  #include       <ctype.h>
> +#include       <sys/ioctl.h>
> +
> +#ifndef BLKDISCARD
> +#define BLKDISCARD _IO(0x12,119)
> +#endif
> +#include       <fcntl.h>
>
>  static int round_size_and_verify(unsigned long long *size, int chunk)
>  {
> @@ -91,6 +97,38 @@ int default_layout(struct supertype *st, int level, int verbose)
>         return layout;
>  }
>
> +static int discard_device(struct context *c, int fd, const char *devname,
> +                         unsigned long long offset, unsigned long long size)
> +{
> +       uint64_t range[2] = {offset, size};
> +       unsigned long buf[4096 / sizeof(unsigned long)];
> +       unsigned long i;
> +
> +       if (c->verbose)
> +               printf("discarding data from %lld to %lld on: %s\n",
> +                      offset, size, devname);
> +
> +       if (ioctl(fd, BLKDISCARD, &range)) {
> +               pr_err("discard failed on '%s': %m\n", devname);
> +               return 1;
> +       }
> +
> +       if (pread(fd, buf, sizeof(buf), offset) != sizeof(buf)) {
> +               pr_err("failed to readback '%s' after discard: %m\n", devname);
> +               return 1;
> +       }
> +
> +       for (i = 0; i < ARRAY_SIZE(buf); i++) {
> +               if (buf[i]) {
> +                       pr_err("device did not read back zeros after discard on '%s': %lx\n",
> +                              devname, buf[i]);
> +                       return 1;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
>  int Create(struct supertype *st, char *mddev,
>            char *name, int *uuid,
>            int subdevs, struct mddev_dev *devlist,
> @@ -607,7 +645,7 @@ int Create(struct supertype *st, char *mddev,
>          * as missing, so that a reconstruct happens (faster than re-parity)
>          * FIX: Can we do this for raid6 as well?
>          */
> -       if (st->ss->external == 0 && s->assume_clean == 0 &&
> +       if (st->ss->external == 0 && s->assume_clean == 0 && s->discard == 0 &&
>             c->force == 0 && first_missing >= s->raiddisks) {
>                 switch (s->level) {
>                 case 4:
> @@ -624,8 +662,8 @@ int Create(struct supertype *st, char *mddev,
>         /* For raid6, if creating with 1 missing drive, make a good drive
>          * into a spare, else the create will fail
>          */
> -       if (s->assume_clean == 0 && c->force == 0 && first_missing < s->raiddisks &&
> -           st->ss->external == 0 &&
> +       if (s->assume_clean == 0 && s->discard == 0 && c->force == 0 &&
> +           first_missing < s->raiddisks && st->ss->external == 0 &&
>             second_missing >= s->raiddisks && s->level == 6) {
>                 insert_point = s->raiddisks - 1;
>                 if (insert_point == first_missing)
> @@ -686,7 +724,7 @@ int Create(struct supertype *st, char *mddev,
>              (insert_point < s->raiddisks || first_missing < s->raiddisks)) ||
>             (s->level == 6 && (insert_point < s->raiddisks ||
>                                second_missing < s->raiddisks)) ||
> -           (s->level <= 0) || s->assume_clean) {
> +           (s->level <= 0) || s->assume_clean || s->discard) {
>                 info.array.state = 1; /* clean, but one+ drive will be missing*/
>                 info.resync_start = MaxSector;
>         } else {
> @@ -945,6 +983,15 @@ int Create(struct supertype *st, char *mddev,
>                                 }
>                                 if (fd >= 0)
>                                         remove_partitions(fd);
> +
> +                               if (s->discard &&
> +                                   discard_device(c, fd, dv->devname,
> +                                                  dv->data_offset << 9,
> +                                                  s->size << 10)) {
> +                                       ioctl(mdfd, STOP_ARRAY, NULL);
> +                                       goto abort_locked;
> +                               }
> +

Hi Logan

When creating a raid device without specifying data offset,
dv->data_offset is always 1 (INVALID_SECTORS).
If users specify data offset, we need to use dv->data_offset. If users
don't specify it, we need to use
st->data_offset.

For super1, in the function write_init_super1, before writing
superblock to disk, it checks data_offset. If it's
INVALID_SECTORS, it will use st->data_offset (the default value of
specific superblock)

And for s->size, if the raid device is a raid0, s->size is 0 here. I
see you and Mariusz talked about the raid0
zone. If the raid0 is created with disks of different size, it can
have more than one zone.
e.g.  disk1(10G)   disk2(20G)  disk3(30G)
disk1   disk2    disk3
10G     10G      10G  (zone0)
            10G      10G  (zone1)  (The zone1 is behind zone0)
                         10G  (zone2)

If the user doesn't specify size, it will use all space of the disk
and create zones and we can discard
the whole disk in this situation.

Regards
Xiao

>                                 if (st->ss->add_to_super(st, &inf->disk,
>                                                          fd, dv->devname,
>                                                          dv->data_offset)) {
> diff --git a/ReadMe.c b/ReadMe.c
> index 7f94847e9769..544a057f83a0 100644
> --- a/ReadMe.c
> +++ b/ReadMe.c
> @@ -138,6 +138,7 @@ struct option long_options[] = {
>      {"size",     1, 0, 'z'},
>      {"auto",     1, 0, Auto}, /* also for --assemble */
>      {"assume-clean",0,0, AssumeClean },
> +    {"discard",          0, 0, Discard },
>      {"metadata",  1, 0, 'e'}, /* superblock format */
>      {"bitmap",   1, 0, Bitmap},
>      {"bitmap-chunk", 1, 0, BitmapChunk},
> diff --git a/mdadm.c b/mdadm.c
> index 972adb524dfb..049cdce1cdd2 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -602,6 +602,10 @@ int main(int argc, char *argv[])
>                         s.assume_clean = 1;
>                         continue;
>
> +               case O(CREATE, Discard):
> +                       s.discard = true;
> +                       continue;
> +
>                 case O(GROW,'n'):
>                 case O(CREATE,'n'):
>                 case O(BUILD,'n'): /* number of raid disks */
> diff --git a/mdadm.h b/mdadm.h
> index 941a5f3823a0..a1e0bc9f01ad 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -433,6 +433,7 @@ extern char Version[], Usage[], Help[], OptionHelp[],
>   */
>  enum special_options {
>         AssumeClean = 300,
> +       Discard,
>         BitmapChunk,
>         WriteBehind,
>         ReAdd,
> @@ -593,6 +594,7 @@ struct shape {
>         int     bitmap_chunk;
>         char    *bitmap_file;
>         int     assume_clean;
> +       bool    discard;
>         int     write_behind;
>         unsigned long long size;
>         unsigned long long data_offset;
> --
> 2.30.2
>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create
  2022-09-19  8:41   ` Xiao Ni
@ 2022-09-21 18:45     ` Logan Gunthorpe
  0 siblings, 0 replies; 18+ messages in thread
From: Logan Gunthorpe @ 2022-09-21 18:45 UTC (permalink / raw)
  To: Xiao Ni
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Mariusz Tkaczyk,
	Coly Li, Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan



On 2022-09-19 02:41, Xiao Ni wrote:
> When creating a raid device without specifying data offset,
> dv->data_offset is always 1 (INVALID_SECTORS).
> If users specify data offset, we need to use dv->data_offset. If users
> don't specify it, we need to use
> st->data_offset.
> 
> For super1, in the function write_init_super1, before writing
> superblock to disk, it checks data_offset. If it's
> INVALID_SECTORS, it will use st->data_offset (the default value of
> specific superblock)
> 
> And for s->size, if the raid device is a raid0, s->size is 0 here. I
> see you and Mariusz talked about the raid0
> zone. If the raid0 is created with disks of different size, it can
> have more than one zone.
> e.g.  disk1(10G)   disk2(20G)  disk3(30G)
> disk1   disk2    disk3
> 10G     10G      10G  (zone0)
>             10G      10G  (zone1)  (The zone1 is behind zone0)
>                          10G  (zone2)
> 
> If the user doesn't specify size, it will use all space of the disk
> and create zones and we can discard
> the whole disk in this situation.

Thanks, this was really useful information. I've incorporated it into v3
which I'll be sending out in a bit.

Logan

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2022-09-21 18:45 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-08 23:08 [PATCH mdadm v2 0/2] Discard Option for Creating Arrays Logan Gunthorpe
2022-09-08 23:08 ` [PATCH mdadm v2 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
2022-09-09  9:57   ` Mariusz Tkaczyk
2022-09-09 11:54     ` Roman Mamedov
     [not found]       ` <CABdXBANrJNWjq4237k9DPRoxLVmiAUoKMZxaaLUrcMHsODwvmA@mail.gmail.com>
2022-09-09 15:31         ` Roman Mamedov
2022-09-12 17:43       ` Martin K. Petersen
2022-09-09 15:47     ` Logan Gunthorpe
2022-09-13  7:35       ` Mariusz Tkaczyk
2022-09-13 15:43         ` Logan Gunthorpe
2022-09-14 12:01           ` Mariusz Tkaczyk
2022-09-14 16:29             ` Logan Gunthorpe
2022-09-14 17:39               ` Mariusz Tkaczyk
2022-09-19  8:41   ` Xiao Ni
2022-09-21 18:45     ` Logan Gunthorpe
2022-09-08 23:08 ` [PATCH mdadm v2 2/2] manpage: Add --discard option to manpage Logan Gunthorpe
2022-09-12 17:40 ` [PATCH mdadm v2 0/2] Discard Option for Creating Arrays Martin K. Petersen
     [not found]   ` <CABdXBAP0LeQMmhSLUMZ_TmnSp5xmZ4xJBkNa7HUm7094m_x9xA@mail.gmail.com>
2022-09-13  3:47     ` Martin K. Petersen
2022-09-13 15:38   ` Logan Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.