All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH mdadm 0/2] Discard Option for Creating Arrays
@ 2022-09-07 20:03 Logan Gunthorpe
  2022-09-07 20:03 ` [PATCH mdadm 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
  2022-09-07 20:03 ` [PATCH mdadm 2/2] manpage: Add --discard option to manpage Logan Gunthorpe
  0 siblings, 2 replies; 5+ messages in thread
From: Logan Gunthorpe @ 2022-09-07 20:03 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe

Hi,

This patchset adds the --discard option for creating new arrays in mdadm.

When specified, mdadm will send block discard (aka. trim or deallocate)
requests to all of the specified block devices. It will then read back
parts of the device to double check that the disks are now all zeros. If
they are all zero, the array is in a known state and does not need to
generate the parity seeing everything is zero and correct. If the devices
do not support discard, or do not result in zero data on each disk, an
error will be returned and the array will not be created.

If all disks get successfully discarded and appear zeroed, then the new
array will not need to be synchronized. The create operation will then
proceed as if --assume-clean was specified.

This provides a safe way and fast way to create an array that does not
need to be synchronized with devices that support discard requests.

Another option for this work is to use a write zero request. This can
be done in linux currently with fallocate and the FALLOC_FL_PUNCH_HOLE
| FALLOC_FL_KEEP_SIZE flags. This will send optimized write-zero requests
to the devices, without falling back to regular writes to zero the disk.
The benefit of this is that the disk will explicitly read back as zeros,
so a zero check is not necessary. The down side is that not all devices
implement this in as optimal a way as the discard request does and on
some of these devices zeroing can take multiple seconds per GB.

Because write-zero requests may be slow and most (but not all) discard
requests read back as zeros, this work uses only discard requests.

Logan

--

Logan Gunthorpe (2):
  mdadm: Add --discard option for Create
  manpage: Add --discard option to manpage

 Create.c   | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ReadMe.c   |  1 +
 mdadm.8.in | 15 +++++++++++
 mdadm.c    |  4 +++
 mdadm.h    |  2 ++
 5 files changed, 97 insertions(+)


base-commit: 171e9743881edf2dfb163ddff483566fbf913ccd
--
2.30.2

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH mdadm 1/2] mdadm: Add --discard option for Create
  2022-09-07 20:03 [PATCH mdadm 0/2] Discard Option for Creating Arrays Logan Gunthorpe
@ 2022-09-07 20:03 ` Logan Gunthorpe
  2022-09-08  7:56   ` Mariusz Tkaczyk
  2022-09-07 20:03 ` [PATCH mdadm 2/2] manpage: Add --discard option to manpage Logan Gunthorpe
  1 sibling, 1 reply; 5+ messages in thread
From: Logan Gunthorpe @ 2022-09-07 20:03 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe

Add the --discard option for Create which will send BLKDISCARD requests to
all disks before assembling the array. This is a fast way to know the
current state of all the disks. If the discard request zeros the data
on the disks (as is common but not universal) then the array is in a
state with correct parity. Thus the initial sync may be skipped.

After issuing each discard request, check if the first page of the
block device is zero. If it is, it is safe to assume the entire
disk should be zero. If it's not report an error.

If all the discard requests are successful and there are no missing
disks thin it is safe to set assume_clean as we know the array is clean.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 Create.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 ReadMe.c |  1 +
 mdadm.c  |  4 +++
 mdadm.h  |  2 ++
 4 files changed, 82 insertions(+)

diff --git a/Create.c b/Create.c
index e06ec2ae96a1..db99da1e8571 100644
--- a/Create.c
+++ b/Create.c
@@ -26,6 +26,12 @@
 #include	"md_u.h"
 #include	"md_p.h"
 #include	<ctype.h>
+#include	<sys/ioctl.h>
+
+#ifndef BLKDISCARD
+#define BLKDISCARD _IO(0x12,119)
+#endif
+#include	<fcntl.h>
 
 static int round_size_and_verify(unsigned long long *size, int chunk)
 {
@@ -91,6 +97,63 @@ int default_layout(struct supertype *st, int level, int verbose)
 	return layout;
 }
 
+static int discard_device(const char *devname, unsigned long long size)
+{
+	uint64_t range[2] = {0, size};
+	unsigned long buf[4096 / sizeof(unsigned long)];
+	unsigned long i;
+	int fd;
+
+	fd = open(devname, O_RDWR | O_EXCL);
+	if (fd < 0) {
+		pr_err("could not open device for discard: %s\n", devname);
+		return 1;
+	}
+
+	if (ioctl(fd, BLKDISCARD, &range)) {
+		pr_err("discard failed on '%s': %m\n", devname);
+		goto out_err;
+	}
+
+	if (read(fd, buf, sizeof(buf)) != sizeof(buf)) {
+		pr_err("failed to readback '%s' after discard: %m\n", devname);
+		goto out_err;
+	}
+
+	for (i = 0; i < ARRAY_SIZE(buf); i++) {
+		if (buf[i]) {
+			pr_err("device did not read back zeros after discard on '%s': %lx\n",
+			       devname, buf[i]);
+			goto out_err;
+		}
+	}
+
+	close(fd);
+	return 0;
+
+out_err:
+	close(fd);
+	return 1;
+}
+
+static int discard_devices(struct context *c, struct mddev_dev *devlist,
+			   unsigned long long size)
+{
+	struct mddev_dev *dv;
+
+	for (dv = devlist; dv; dv = dv->next) {
+		if (!strcmp(dv->devname, "missing"))
+			continue;
+
+		if (c->verbose)
+			pr_err("discarding all data on: %s\n", dv->devname);
+		if (discard_device(dv->devname, size))
+			return 1;
+	}
+
+	return 0;
+}
+
 int Create(struct supertype *st, char *mddev,
 	   char *name, int *uuid,
 	   int subdevs, struct mddev_dev *devlist,
@@ -603,6 +666,18 @@ int Create(struct supertype *st, char *mddev,
 		}
 	}
 
+	if (s->discard) {
+		if (discard_devices(c, devlist, (s->size << 10) +
+				    (st->data_offset << 9)))
+			return 1;
+
+		/* All disks are zero so if there are none missing assume
+		 * the array is clean
+		 */
+		if (first_missing >= s->raiddisks)
+			s->assume_clean = 1;
+	}
+
 	/* If this is raid4/5, we want to configure the last active slot
 	 * as missing, so that a reconstruct happens (faster than re-parity)
 	 * FIX: Can we do this for raid6 as well?
diff --git a/ReadMe.c b/ReadMe.c
index 7f94847e9769..544a057f83a0 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -138,6 +138,7 @@ struct option long_options[] = {
     {"size",	  1, 0, 'z'},
     {"auto",	  1, 0, Auto}, /* also for --assemble */
     {"assume-clean",0,0, AssumeClean },
+    {"discard",	  0, 0, Discard },
     {"metadata",  1, 0, 'e'}, /* superblock format */
     {"bitmap",	  1, 0, Bitmap},
     {"bitmap-chunk", 1, 0, BitmapChunk},
diff --git a/mdadm.c b/mdadm.c
index 972adb524dfb..8dee85a54a6a 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -602,6 +602,10 @@ int main(int argc, char *argv[])
 			s.assume_clean = 1;
 			continue;
 
+		case O(CREATE, Discard):
+			s.discard = 1;
+			continue;
+
 		case O(GROW,'n'):
 		case O(CREATE,'n'):
 		case O(BUILD,'n'): /* number of raid disks */
diff --git a/mdadm.h b/mdadm.h
index 941a5f3823a0..99769be57ac5 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -433,6 +433,7 @@ extern char Version[], Usage[], Help[], OptionHelp[],
  */
 enum special_options {
 	AssumeClean = 300,
+	Discard,
 	BitmapChunk,
 	WriteBehind,
 	ReAdd,
@@ -593,6 +594,7 @@ struct shape {
 	int	bitmap_chunk;
 	char	*bitmap_file;
 	int	assume_clean;
+	int	discard;
 	int	write_behind;
 	unsigned long long size;
 	unsigned long long data_offset;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH mdadm 2/2] manpage: Add --discard option to manpage
  2022-09-07 20:03 [PATCH mdadm 0/2] Discard Option for Creating Arrays Logan Gunthorpe
  2022-09-07 20:03 ` [PATCH mdadm 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
@ 2022-09-07 20:03 ` Logan Gunthorpe
  1 sibling, 0 replies; 5+ messages in thread
From: Logan Gunthorpe @ 2022-09-07 20:03 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe

Document the new --discard option in the manpage.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 mdadm.8.in | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mdadm.8.in b/mdadm.8.in
index f273622641da..93a0a32dd314 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -836,6 +836,21 @@ array is resynced at creation.  From Linux version 3.0,
 .B \-\-assume\-clean
 can be used with that command to avoid the automatic resync.
 
+.TP
+.BR \-\-discard
+When creating an array, send block discard (aka trim or deallocate)
+requests to all the block devices. In most cases this should zero all
+the disks. If any discard fails, or if zeros are not read back from the
+disk, then the operation will be aborted.
+.IP
+If all discards succeed, and there are no missing disks specified,
+then the array should not need to be synchronized after it is created
+(as all disks are zero). The operation will thus proceed as if
+.B \-\-assume\-clean
+was specified.
+.IP
+This is only meaningful with --create.
+
 .TP
 .BR \-\-backup\-file=
 This is needed when
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH mdadm 1/2] mdadm: Add --discard option for Create
  2022-09-07 20:03 ` [PATCH mdadm 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
@ 2022-09-08  7:56   ` Mariusz Tkaczyk
  2022-09-08 17:01     ` Logan Gunthorpe
  0 siblings, 1 reply; 5+ messages in thread
From: Mariusz Tkaczyk @ 2022-09-08  7:56 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan

Hi Logan,
I like this idea but I have some question.

Thanks,
Mariusz

On Wed,  7 Sep 2022 14:03:54 -0600
Logan Gunthorpe <logang@deltatee.com> wrote:

> Add the --discard option for Create which will send BLKDISCARD requests to
> all disks before assembling the array. This is a fast way to know the
> current state of all the disks. If the discard request zeros the data
> on the disks (as is common but not universal) then the array is in a
> state with correct parity. Thus the initial sync may be skipped.

Are we discarding whole device or only space for array? I can see that we are
specifying range in ioctl.
> 
> After issuing each discard request, check if the first page of the
> block device is zero. If it is, it is safe to assume the entire
> disk should be zero. If it's not report an error.
> 
> If all the discard requests are successful and there are no missing
> disks thin it is safe to set assume_clean as we know the array is clean.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>  Create.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  ReadMe.c |  1 +
>  mdadm.c  |  4 +++
>  mdadm.h  |  2 ++
>  4 files changed, 82 insertions(+)
> 
> diff --git a/Create.c b/Create.c
> index e06ec2ae96a1..db99da1e8571 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -26,6 +26,12 @@
>  #include	"md_u.h"
>  #include	"md_p.h"
>  #include	<ctype.h>
> +#include	<sys/ioctl.h>
> +
> +#ifndef BLKDISCARD
> +#define BLKDISCARD _IO(0x12,119)
> +#endif
> +#include	<fcntl.h>
>  
>  static int round_size_and_verify(unsigned long long *size, int chunk)
>  {
> @@ -91,6 +97,63 @@ int default_layout(struct supertype *st, int level, int
> verbose) return layout;
>  }
>  
> +static int discard_device(const char *devname, unsigned long long size)
> +{
> +	uint64_t range[2] = {0, size};
Are we starting always from 0? If yest then it introduces bug. In IMSM there is
a matrix array conception. Two arrays on same drives.
I suspect that we are able to erase data from first array when we set --discard
during second volume creation.

> +	unsigned long buf[4096 / sizeof(unsigned long)];
> +	unsigned long i;
> +	int fd;
> +
> +	fd = open(devname, O_RDWR | O_EXCL);

Do we need to to open another RDWR descriptor, I think that mdadm will open
something again soon. We are opening descriptors many times, it generates
unnecessary uvents. We can incorporate it with existing logic and add
discarding just before st->ss->add_to_super() and discard every drive one by
one.
We will fail on error anyway.

> +	if (fd < 0) {
> +		pr_err("could not open device for discard: %s\n", devname);
> +		return 1;
> +	}

Please use is_fd_valid() here.

> +
> +	if (ioctl(fd, BLKDISCARD, &range)) {
> +		pr_err("discard failed on '%s': %m\n", devname);
> +		goto out_err;
> +	}
> +
> +	if (read(fd, buf, sizeof(buf)) != sizeof(buf)) {
> +		pr_err("failed to readback '%s' after discard: %m\n",
> devname);
> +		goto out_err;
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(buf); i++) {
> +		if (buf[i]) {
> +			pr_err("device did not read back zeros after discard
> on '%s': %lx\n",
> +			       devname, buf[i]);
> +			goto out_err;
Do we really need to print data here? Could you move it to debug?

> +		}
> +	}
> +
> +	close(fd);
there is also close_fd() but I don't see it useful here.

> +	return 0;
> +
> +out_err:
> +	close(fd);
> +	return 1;
> +}
> +
> +static int discard_devices(struct context *c, struct mddev_dev *devlist,
> +			   unsigned long long size)
> +{
> +	struct mddev_dev *dv;
> +
> +	for (dv = devlist; dv; dv = dv->next) {
> +		if (!strcmp(dv->devname, "missing"))
> +			continue;
> +
> +		if (c->verbose)
> +			pr_err("discarding all data on: %s\n", dv->devname);
I know that mdadm is printing messages to stderr mainly but for this one,
stdout is enough IMO. I give it to you.

> +		if (discard_device(dv->devname, size))
> +			return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  int Create(struct supertype *st, char *mddev,
>  	   char *name, int *uuid,
>  	   int subdevs, struct mddev_dev *devlist,
> @@ -603,6 +666,18 @@ int Create(struct supertype *st, char *mddev,
>  		}
>  	}
>  
> +	if (s->discard) {
> +		if (discard_devices(c, devlist, (s->size << 10) +
> +				    (st->data_offset << 9)))
> +			return 1;
> +
> +		/* All disks are zero so if there are none missing assume
> +		 * the array is clean
> +		 */
> +		if (first_missing >= s->raiddisks)
> +			s->assume_clean = 1;

IMO missing drive has nothing to do with clean conception but please correct me
I'm wrong. All assume_clean does is skipping resync after the creation and that
is true even if some drives are missing. Array will be still clean but will be
also degraded.

If I'm correct then it simplifies implementation because we can set
s.assume_clean if s.discard is set too in mdadm.c.

If I'm wrong what is the point of discard when resync is started anyway with
missing drive? Maybe those features should be mutually exclusive?

> +	}

> +
>  	/* If this is raid4/5, we want to configure the last active slot
>  	 * as missing, so that a reconstruct happens (faster than re-parity)
>  	 * FIX: Can we do this for raid6 as well?
> diff --git a/ReadMe.c b/ReadMe.c
> index 7f94847e9769..544a057f83a0 100644
> --- a/ReadMe.c
> +++ b/ReadMe.c
> @@ -138,6 +138,7 @@ struct option long_options[] = {
>      {"size",	  1, 0, 'z'},
>      {"auto",	  1, 0, Auto}, /* also for --assemble */
>      {"assume-clean",0,0, AssumeClean },
> +    {"discard",	  0, 0, Discard },
>      {"metadata",  1, 0, 'e'}, /* superblock format */
>      {"bitmap",	  1, 0, Bitmap},
>      {"bitmap-chunk", 1, 0, BitmapChunk},
> diff --git a/mdadm.c b/mdadm.c
> index 972adb524dfb..8dee85a54a6a 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -602,6 +602,10 @@ int main(int argc, char *argv[])
>  			s.assume_clean = 1;
>  			continue;
>  
> +		case O(CREATE, Discard):
> +			s.discard = 1;
please use true.

> +			continue;
> +
>  		case O(GROW,'n'):
>  		case O(CREATE,'n'):
>  		case O(BUILD,'n'): /* number of raid disks */
> diff --git a/mdadm.h b/mdadm.h
> index 941a5f3823a0..99769be57ac5 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -433,6 +433,7 @@ extern char Version[], Usage[], Help[], OptionHelp[],
>   */
>  enum special_options {
>  	AssumeClean = 300,
> +	Discard,
>  	BitmapChunk,
>  	WriteBehind,
>  	ReAdd,
> @@ -593,6 +594,7 @@ struct shape {
>  	int	bitmap_chunk;
>  	char	*bitmap_file;
>  	int	assume_clean;
> +	int	discard;

please use bool type.

>  	int	write_behind;
>  	unsigned long long size;
>  	unsigned long long data_offset;

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH mdadm 1/2] mdadm: Add --discard option for Create
  2022-09-08  7:56   ` Mariusz Tkaczyk
@ 2022-09-08 17:01     ` Logan Gunthorpe
  0 siblings, 0 replies; 5+ messages in thread
From: Logan Gunthorpe @ 2022-09-08 17:01 UTC (permalink / raw)
  To: Mariusz Tkaczyk
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan


Hi Mariusz,

Thanks for the feedback. I've incorporated it into a v2 patch set which
I will send later today. Specific responses below.

On 2022-09-08 01:56, Mariusz Tkaczyk wrote:
> On Wed,  7 Sep 2022 14:03:54 -0600
> Logan Gunthorpe <logang@deltatee.com> wrote:
> 
>> Add the --discard option for Create which will send BLKDISCARD requests to
>> all disks before assembling the array. This is a fast way to know the
>> current state of all the disks. If the discard request zeros the data
>> on the disks (as is common but not universal) then the array is in a
>> state with correct parity. Thus the initial sync may be skipped.
> 
> Are we discarding whole device or only space for array? I can see that we are
> specifying range in ioctl.

Just the space for the array.

>> +static int discard_device(const char *devname, unsigned long long size)
>> +{
>> +	uint64_t range[2] = {0, size};
> Are we starting always from 0? If yest then it introduces bug. In IMSM there is
> a matrix array conception. Two arrays on same drives.
> I suspect that we are able to erase data from first array when we set --discard
> during second volume creation.

Yes, ok. I'll have this fixed in v2.

>> +	unsigned long buf[4096 / sizeof(unsigned long)];
>> +	unsigned long i;
>> +	int fd;
>> +
>> +	fd = open(devname, O_RDWR | O_EXCL);
> 
> Do we need to to open another RDWR descriptor, I think that mdadm will open
> something again soon. We are opening descriptors many times, it generates
> unnecessary uvents. We can incorporate it with existing logic and add
> discarding just before st->ss->add_to_super() and discard every drive one by
> one.
> We will fail on error anyway.

Ok, that makes sense and cleans things up a bit. This will be included
in v2.
>> +	for (i = 0; i < ARRAY_SIZE(buf); i++) {
>> +		if (buf[i]) {
>> +			pr_err("device did not read back zeros after discard
>> on '%s': %lx\n",
>> +			       devname, buf[i]);
>> +			goto out_err;
> Do we really need to print data here? Could you move it to debug?

This is an error message and describes why we will not be creating the
array (we error out in this case). So I think it is necessary to be a
pr_err().

>> +		if (c->verbose)
>> +			pr_err("discarding all data on: %s\n", dv->devname);
> I know that mdadm is printing messages to stderr mainly but for this one,
> stdout is enough IMO. I give it to you.

Switched to printf() for the informational message in v2.


>> +		if (first_missing >= s->raiddisks)
>> +			s->assume_clean = 1;
> 
> IMO missing drive has nothing to do with clean conception but please correct me
> I'm wrong. All assume_clean does is skipping resync after the creation and that
> is true even if some drives are missing. Array will be still clean but will be
> also degraded.
> 
> If I'm correct then it simplifies implementation because we can set
> s.assume_clean if s.discard is set too in mdadm.c.
> 
> If I'm wrong what is the point of discard when resync is started anyway with
> missing drive? Maybe those features should be mutually exclusive?

I think you are right. I've implemented v2 to just add a s.discard check
to all the cases where s.assume_clean is checked. Makes it simpler and
more clear, rather than changing assume_clean in mdadm.
>> +		case O(CREATE, Discard):
>> +			s.discard = 1;
> please use true.

Done for v2.

>>  	char	*bitmap_file;
>>  	int	assume_clean;
>> +	int	discard;
> 
> please use bool type.

Done for v2.



Logan

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-09-08 17:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-07 20:03 [PATCH mdadm 0/2] Discard Option for Creating Arrays Logan Gunthorpe
2022-09-07 20:03 ` [PATCH mdadm 1/2] mdadm: Add --discard option for Create Logan Gunthorpe
2022-09-08  7:56   ` Mariusz Tkaczyk
2022-09-08 17:01     ` Logan Gunthorpe
2022-09-07 20:03 ` [PATCH mdadm 2/2] manpage: Add --discard option to manpage Logan Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.