All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays
@ 2023-03-01 20:41 Logan Gunthorpe
  2023-03-01 20:41 ` [PATCH mdadm v7 1/7] Create: goto abort_locked instead of return 1 in error path Logan Gunthorpe
                   ` (10 more replies)
  0 siblings, 11 replies; 14+ messages in thread
From: Logan Gunthorpe @ 2023-03-01 20:41 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe

Hi,

This is the next iteration of the patchset to add a zeroing option
which bypasses the inital sync for arrays. This version of the patch
has some minor cleanup and collected a number of review and ack tags.

This patch set adds the --write-zeroes option which will imply
--assume-clean and write zeros to the data region in each disk before
starting the array. This can take some time so each disk is done in
parallel in its own fork. To make the forking code easier to
understand this patch set also starts with some cleanup of the
existing Create code.

We tested write-zeroes requests on a number of modern nvme drives of
various manufacturers and found most are not as optimized as the
discard path. A couple drives that were tested did not support
write-zeroes at all but still performed similarly with the kernel
falling back to writing zero pages. Typically we see it take on the
order of one minute per 100GB of data zeroed.

One reason write-zeroes is slower than discard is that today's NVMe
devices only allow about 2MB to be zeroed in one command where as
the entire drive can typically be discarded in one command. Partly,
this is a limitation of the spec as there are only 16 bits avalaible
in the write-zeros command size but drives still don't max this out.
Hopefully, in the future this will all be optimized a bit more
and this work will be able to take advantage of that.

Logan

--

Changes since v6:
   * Collected review and ack tags from Xiao, Chaitanya and Coly
   * Adjust the error reporting to us strerror() instead of the
     glibc %m extension. (per Coly)
   * Fix a typo in the man page ("despit" should have been "despite")
     (as noticed by Coly)

Changes since v5:
   * Ensure 'interrupted' is initialized in wait_for_zero_forks().
     (as noticed by Xiao)
   * Print a message indicating that the zeroing was interrupted.

Changes since v4:
   * Handle SIGINT better. Previous versions would leave the zeroing
     processes behind after the main thread exitted which would
     continue zeroing in the background (possibly for some time).
     This version splits the zero fallocate commands up so they can be
     interrupted quicker, and intercepts SIGINT in the main thread
     to print an appropriate message and wait for the threads
     to finish up. (as noticed by Xiao)

Changes since v3:
   * Store the pid in a local variable instead of the mdinfo struct
    (per Mariusz and Xiao)

Changes since v2:

   * Use write-zeroes instead of discard to zero the disks (per
     Martin)
   * Due to the time required to zero the disks, each disk is
     now done in parallel with separate forks of the process.
   * In order to add the forking some refactoring was done on the
     Create() function to make it easier to understand
   * Added a pr_info() call so that some prints can be done
     to stdout instead of stdour (per Mariusz)
   * Added KIB_TO_BYTES and SEC_TO_BYTES helpers (per Mariusz)
   * Added a test to the mdadm test suite to test the option
     works.
   * Fixed up how the size and offset are calculated with some
     great information from Xiao.

Changes since v1:

   * Discard the data in the devices later in the create process
     while they are already open. This requires treating the
     s.discard option the same as the s.assume_clean option.
     Per Mariusz.
   * A couple other minor cleanup changes from Mariusz.

--

Logan Gunthorpe (7):
  Create: goto abort_locked instead of return 1 in error path
  Create: remove safe_mode_delay local variable
  Create: Factor out add_disks() helpers
  mdadm: Introduce pr_info()
  mdadm: Add --write-zeros option for Create
  tests/00raid5-zero: Introduce test to exercise --write-zeros.
  manpage: Add --write-zeroes option to manpage

 Create.c           | 565 +++++++++++++++++++++++++++++++--------------
 ReadMe.c           |   2 +
 mdadm.8.in         |  18 +-
 mdadm.c            |   9 +
 mdadm.h            |   7 +
 tests/00raid5-zero |  12 +
 6 files changed, 437 insertions(+), 176 deletions(-)
 create mode 100644 tests/00raid5-zero


base-commit: f1f3ef7d2de5e3a726c27b9f9bb20e270a100dab
--
2.30.2

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH mdadm v7 1/7] Create: goto abort_locked instead of return 1 in error path
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
@ 2023-03-01 20:41 ` Logan Gunthorpe
  2023-03-01 20:41 ` [PATCH mdadm v7 2/7] Create: remove safe_mode_delay local variable Logan Gunthorpe
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Logan Gunthorpe @ 2023-03-01 20:41 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe, Kinga Tanska,
	Chaitanya Kulkarni

The return 1 after the fstat_is_blkdev() check should be replaced
with an error return that goes through the error path to unlock
resources locked by this function.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Coly Li <colyli@suse.de>
---
 Create.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Create.c b/Create.c
index 953e73722518..2e8203ecdccd 100644
--- a/Create.c
+++ b/Create.c
@@ -939,7 +939,7 @@ int Create(struct supertype *st, char *mddev,
 						goto abort_locked;
 					}
 					if (!fstat_is_blkdev(fd, dv->devname, &rdev))
-						return 1;
+						goto abort_locked;
 					inf->disk.major = major(rdev);
 					inf->disk.minor = minor(rdev);
 				}
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mdadm v7 2/7] Create: remove safe_mode_delay local variable
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
  2023-03-01 20:41 ` [PATCH mdadm v7 1/7] Create: goto abort_locked instead of return 1 in error path Logan Gunthorpe
@ 2023-03-01 20:41 ` Logan Gunthorpe
  2023-03-01 20:41 ` [PATCH mdadm v7 3/7] Create: Factor out add_disks() helpers Logan Gunthorpe
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Logan Gunthorpe @ 2023-03-01 20:41 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe, Kinga Tanska,
	Chaitanya Kulkarni

All .getinfo_super() call sets the info.safe_mode_delay variables
to a constant value, so no matter what the current state is
that function will always set it to the same value.

Create() calls .getinfo_super() multiple times while creating the array.
The value is stored in a local variable for every disk in the loop
to add disks (so the last disc call takes precedence). The local
variable is then used in the call to sysfs_set_safemode().

This can be simplified by using info.safe_mode_delay directly. The info
variable had .getinfo_super() called on it early in the function so, by the
reasoning above, it will have the same value as the local variable which
can thus be removed.

Doing this allows for factoring out code from Create() in a subsequent
patch.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Coly Li <colyli@suse.de>
---
 Create.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/Create.c b/Create.c
index 2e8203ecdccd..8ded81dc265d 100644
--- a/Create.c
+++ b/Create.c
@@ -137,7 +137,6 @@ int Create(struct supertype *st, char *mddev,
 	int did_default = 0;
 	int do_default_layout = 0;
 	int do_default_chunk = 0;
-	unsigned long safe_mode_delay = 0;
 	char chosen_name[1024];
 	struct map_ent *map = NULL;
 	unsigned long long newsize;
@@ -952,7 +951,6 @@ int Create(struct supertype *st, char *mddev,
 					goto abort_locked;
 				}
 				st->ss->getinfo_super(st, inf, NULL);
-				safe_mode_delay = inf->safe_mode_delay;
 
 				if (have_container && c->verbose > 0)
 					pr_err("Using %s for device %d\n",
@@ -1065,7 +1063,7 @@ int Create(struct supertype *st, char *mddev,
 						    "readonly");
 				break;
 			}
-			sysfs_set_safemode(&info, safe_mode_delay);
+			sysfs_set_safemode(&info, info.safe_mode_delay);
 			if (err) {
 				pr_err("failed to activate array.\n");
 				ioctl(mdfd, STOP_ARRAY, NULL);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mdadm v7 3/7] Create: Factor out add_disks() helpers
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
  2023-03-01 20:41 ` [PATCH mdadm v7 1/7] Create: goto abort_locked instead of return 1 in error path Logan Gunthorpe
  2023-03-01 20:41 ` [PATCH mdadm v7 2/7] Create: remove safe_mode_delay local variable Logan Gunthorpe
@ 2023-03-01 20:41 ` Logan Gunthorpe
  2023-03-01 20:41 ` [PATCH mdadm v7 4/7] mdadm: Introduce pr_info() Logan Gunthorpe
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Logan Gunthorpe @ 2023-03-01 20:41 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe, Kinga Tanska,
	Chaitanya Kulkarni

The Create function is massive with a very large number of variables.
Reading and understanding the function is almost impossible. To help
with this, factor out the two pass loop that adds the disks to the array.

This moves about 160 lines into three new helper functions and removes
a bunch of local variables from the main Create function. The main new
helper function add_disks() does the two pass loop and calls into
add_disk_to_super() and update_metadata(). Factoring out the
latter two helpers also helps to reduce a ton of indentation.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Coly Li <colyli@suse.de>
---
 Create.c | 382 +++++++++++++++++++++++++++++++------------------------
 1 file changed, 213 insertions(+), 169 deletions(-)

diff --git a/Create.c b/Create.c
index 8ded81dc265d..6a0446644e04 100644
--- a/Create.c
+++ b/Create.c
@@ -91,6 +91,214 @@ int default_layout(struct supertype *st, int level, int verbose)
 	return layout;
 }
 
+static int add_disk_to_super(int mdfd, struct shape *s, struct context *c,
+		struct supertype *st, struct mddev_dev *dv,
+		struct mdinfo *info, int have_container, int major_num)
+{
+	dev_t rdev;
+	int fd;
+
+	if (dv->disposition == 'j') {
+		info->disk.raid_disk = MD_DISK_ROLE_JOURNAL;
+		info->disk.state = (1<<MD_DISK_JOURNAL);
+	} else if (info->disk.raid_disk < s->raiddisks) {
+		info->disk.state = (1<<MD_DISK_ACTIVE) |
+			(1<<MD_DISK_SYNC);
+	} else {
+		info->disk.state = 0;
+	}
+
+	if (dv->writemostly == FlagSet) {
+		if (major_num == BITMAP_MAJOR_CLUSTERED) {
+			pr_err("Can not set %s --write-mostly with a clustered bitmap\n",dv->devname);
+			return 1;
+		} else {
+			info->disk.state |= (1<<MD_DISK_WRITEMOSTLY);
+		}
+
+	}
+
+	if (dv->failfast == FlagSet)
+		info->disk.state |= (1<<MD_DISK_FAILFAST);
+
+	if (have_container) {
+		fd = -1;
+	} else {
+		if (st->ss->external && st->container_devnm[0])
+			fd = open(dv->devname, O_RDWR);
+		else
+			fd = open(dv->devname, O_RDWR|O_EXCL);
+
+		if (fd < 0) {
+			pr_err("failed to open %s after earlier success - aborting\n",
+			       dv->devname);
+			return 1;
+		}
+		if (!fstat_is_blkdev(fd, dv->devname, &rdev))
+			return 1;
+		info->disk.major = major(rdev);
+		info->disk.minor = minor(rdev);
+	}
+	if (fd >= 0)
+		remove_partitions(fd);
+	if (st->ss->add_to_super(st, &info->disk, fd, dv->devname,
+				 dv->data_offset)) {
+		ioctl(mdfd, STOP_ARRAY, NULL);
+		return 1;
+	}
+	st->ss->getinfo_super(st, info, NULL);
+
+	if (have_container && c->verbose > 0)
+		pr_err("Using %s for device %d\n",
+		       map_dev(info->disk.major, info->disk.minor, 0),
+		       info->disk.number);
+
+	if (!have_container) {
+		/* getinfo_super might have lost these ... */
+		info->disk.major = major(rdev);
+		info->disk.minor = minor(rdev);
+	}
+
+	return 0;
+}
+
+static int update_metadata(int mdfd, struct shape *s, struct supertype *st,
+			   struct map_ent **map, struct mdinfo *info,
+			   char *chosen_name)
+{
+	struct mdinfo info_new;
+	struct map_ent *me = NULL;
+
+	/* check to see if the uuid has changed due to these
+	 * metadata changes, and if so update the member array
+	 * and container uuid.  Note ->write_init_super clears
+	 * the subarray cursor such that ->getinfo_super once
+	 * again returns container info.
+	 */
+	st->ss->getinfo_super(st, &info_new, NULL);
+	if (st->ss->external && is_container(s->level) &&
+	    !same_uuid(info_new.uuid, info->uuid, 0)) {
+		map_update(map, fd2devnm(mdfd),
+			   info_new.text_version,
+			   info_new.uuid, chosen_name);
+		me = map_by_devnm(map, st->container_devnm);
+	}
+
+	if (st->ss->write_init_super(st)) {
+		st->ss->free_super(st);
+		return 1;
+	}
+
+	/*
+	 * Before activating the array, perform extra steps
+	 * required to configure the internal write-intent
+	 * bitmap.
+	 */
+	if (info_new.consistency_policy == CONSISTENCY_POLICY_BITMAP &&
+	    st->ss->set_bitmap && st->ss->set_bitmap(st, info)) {
+		st->ss->free_super(st);
+		return 1;
+	}
+
+	/* update parent container uuid */
+	if (me) {
+		char *path = xstrdup(me->path);
+
+		st->ss->getinfo_super(st, &info_new, NULL);
+		map_update(map, st->container_devnm, info_new.text_version,
+			   info_new.uuid, path);
+		free(path);
+	}
+
+	flush_metadata_updates(st);
+	st->ss->free_super(st);
+
+	return 0;
+}
+
+static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
+		     struct context *c, struct supertype *st,
+		     struct map_ent **map, struct mddev_dev *devlist,
+		     int total_slots, int have_container, int insert_point,
+		     int major_num, char *chosen_name)
+{
+	struct mddev_dev *moved_disk = NULL;
+	int pass, raid_disk_num, dnum;
+	struct mddev_dev *dv;
+	struct mdinfo *infos;
+	int ret = 0;
+
+	infos = xmalloc(sizeof(*infos) * total_slots);
+	enable_fds(total_slots);
+	for (pass = 1; pass <= 2; pass++) {
+		for (dnum = 0, raid_disk_num = 0, dv = devlist; dv;
+		     dv = (dv->next) ? (dv->next) : moved_disk, dnum++) {
+			if (dnum >= total_slots)
+				abort();
+			if (dnum == insert_point) {
+				raid_disk_num += 1;
+				moved_disk = dv;
+				continue;
+			}
+			if (strcasecmp(dv->devname, "missing") == 0) {
+				raid_disk_num += 1;
+				continue;
+			}
+			if (have_container)
+				moved_disk = NULL;
+			if (have_container && dnum < total_slots - 1)
+				/* repeatedly use the container */
+				moved_disk = dv;
+
+			switch(pass) {
+			case 1:
+				infos[dnum] = *info;
+				infos[dnum].disk.number = dnum;
+				infos[dnum].disk.raid_disk = raid_disk_num++;
+
+				if (dv->disposition == 'j')
+					raid_disk_num--;
+
+				ret = add_disk_to_super(mdfd, s, c, st, dv,
+						&infos[dnum], have_container,
+						major_num);
+				if (ret)
+					goto out;
+
+				break;
+			case 2:
+				infos[dnum].errors = 0;
+
+				ret = add_disk(mdfd, st, info, &infos[dnum]);
+				if (ret) {
+					pr_err("ADD_NEW_DISK for %s failed: %s\n",
+					       dv->devname, strerror(errno));
+					if (errno == EINVAL &&
+					    info->array.level == 0) {
+						pr_err("Possibly your kernel doesn't support RAID0 layouts.\n");
+						pr_err("Either upgrade, or use --layout=dangerous\n");
+					}
+					goto out;
+				}
+				break;
+			}
+			if (!have_container &&
+			    dv == moved_disk && dnum != insert_point) break;
+		}
+
+		if (pass == 1) {
+			ret = update_metadata(mdfd, s, st, map, info,
+					      chosen_name);
+			if (ret)
+				goto out;
+		}
+	}
+
+out:
+	free(infos);
+	return ret;
+}
+
 int Create(struct supertype *st, char *mddev,
 	   char *name, int *uuid,
 	   int subdevs, struct mddev_dev *devlist,
@@ -117,7 +325,7 @@ int Create(struct supertype *st, char *mddev,
 	unsigned long long minsize = 0, maxsize = 0;
 	char *mindisc = NULL;
 	char *maxdisc = NULL;
-	int dnum, raid_disk_num;
+	int dnum;
 	struct mddev_dev *dv;
 	dev_t rdev;
 	int fail = 0, warn = 0;
@@ -126,14 +334,13 @@ int Create(struct supertype *st, char *mddev,
 	int missing_disks = 0;
 	int insert_point = subdevs * 2; /* where to insert a missing drive */
 	int total_slots;
-	int pass;
 	int rv;
 	int bitmap_fd;
 	int have_container = 0;
 	int container_fd = -1;
 	int need_mdmon = 0;
 	unsigned long long bitmapsize;
-	struct mdinfo info, *infos;
+	struct mdinfo info;
 	int did_default = 0;
 	int do_default_layout = 0;
 	int do_default_chunk = 0;
@@ -869,174 +1076,11 @@ int Create(struct supertype *st, char *mddev,
 		}
 	}
 
-	infos = xmalloc(sizeof(*infos) * total_slots);
-	enable_fds(total_slots);
-	for (pass = 1; pass <= 2; pass++) {
-		struct mddev_dev *moved_disk = NULL; /* the disk that was moved out of the insert point */
-
-		for (dnum = 0, raid_disk_num = 0, dv = devlist; dv;
-		     dv = (dv->next) ? (dv->next) : moved_disk, dnum++) {
-			int fd;
-			struct mdinfo *inf = &infos[dnum];
-
-			if (dnum >= total_slots)
-				abort();
-			if (dnum == insert_point) {
-				raid_disk_num += 1;
-				moved_disk = dv;
-				continue;
-			}
-			if (strcasecmp(dv->devname, "missing") == 0) {
-				raid_disk_num += 1;
-				continue;
-			}
-			if (have_container)
-				moved_disk = NULL;
-			if (have_container && dnum < info.array.raid_disks - 1)
-				/* repeatedly use the container */
-				moved_disk = dv;
-
-			switch(pass) {
-			case 1:
-				*inf = info;
-
-				inf->disk.number = dnum;
-				inf->disk.raid_disk = raid_disk_num++;
-
-				if (dv->disposition == 'j') {
-					inf->disk.raid_disk = MD_DISK_ROLE_JOURNAL;
-					inf->disk.state = (1<<MD_DISK_JOURNAL);
-					raid_disk_num--;
-				} else if (inf->disk.raid_disk < s->raiddisks)
-					inf->disk.state = (1<<MD_DISK_ACTIVE) |
-						(1<<MD_DISK_SYNC);
-				else
-					inf->disk.state = 0;
-
-				if (dv->writemostly == FlagSet) {
-					if (major_num == BITMAP_MAJOR_CLUSTERED) {
-						pr_err("Can not set %s --write-mostly with a clustered bitmap\n",dv->devname);
-						goto abort_locked;
-					} else
-						inf->disk.state |= (1<<MD_DISK_WRITEMOSTLY);
-				}
-				if (dv->failfast == FlagSet)
-					inf->disk.state |= (1<<MD_DISK_FAILFAST);
-
-				if (have_container)
-					fd = -1;
-				else {
-					if (st->ss->external &&
-					    st->container_devnm[0])
-						fd = open(dv->devname, O_RDWR);
-					else
-						fd = open(dv->devname, O_RDWR|O_EXCL);
-
-					if (fd < 0) {
-						pr_err("failed to open %s after earlier success - aborting\n",
-							dv->devname);
-						goto abort_locked;
-					}
-					if (!fstat_is_blkdev(fd, dv->devname, &rdev))
-						goto abort_locked;
-					inf->disk.major = major(rdev);
-					inf->disk.minor = minor(rdev);
-				}
-				if (fd >= 0)
-					remove_partitions(fd);
-				if (st->ss->add_to_super(st, &inf->disk,
-							 fd, dv->devname,
-							 dv->data_offset)) {
-					ioctl(mdfd, STOP_ARRAY, NULL);
-					goto abort_locked;
-				}
-				st->ss->getinfo_super(st, inf, NULL);
-
-				if (have_container && c->verbose > 0)
-					pr_err("Using %s for device %d\n",
-						map_dev(inf->disk.major,
-							inf->disk.minor,
-							0), dnum);
-
-				if (!have_container) {
-					/* getinfo_super might have lost these ... */
-					inf->disk.major = major(rdev);
-					inf->disk.minor = minor(rdev);
-				}
-				break;
-			case 2:
-				inf->errors = 0;
-
-				rv = add_disk(mdfd, st, &info, inf);
-
-				if (rv) {
-					pr_err("ADD_NEW_DISK for %s failed: %s\n",
-					       dv->devname, strerror(errno));
-					if (errno == EINVAL &&
-					    info.array.level == 0) {
-						pr_err("Possibly your kernel doesn't support RAID0 layouts.\n");
-						pr_err("Either upgrade, or use --layout=dangerous\n");
-					}
-					goto abort_locked;
-				}
-				break;
-			}
-			if (!have_container &&
-			    dv == moved_disk && dnum != insert_point) break;
-		}
-		if (pass == 1) {
-			struct mdinfo info_new;
-			struct map_ent *me = NULL;
-
-			/* check to see if the uuid has changed due to these
-			 * metadata changes, and if so update the member array
-			 * and container uuid.  Note ->write_init_super clears
-			 * the subarray cursor such that ->getinfo_super once
-			 * again returns container info.
-			 */
-			st->ss->getinfo_super(st, &info_new, NULL);
-			if (st->ss->external && !is_container(s->level) &&
-			    !same_uuid(info_new.uuid, info.uuid, 0)) {
-				map_update(&map, fd2devnm(mdfd),
-					   info_new.text_version,
-					   info_new.uuid, chosen_name);
-				me = map_by_devnm(&map, st->container_devnm);
-			}
-
-			if (st->ss->write_init_super(st)) {
-				st->ss->free_super(st);
-				goto abort_locked;
-			}
-			/*
-			 * Before activating the array, perform extra steps
-			 * required to configure the internal write-intent
-			 * bitmap.
-			 */
-			if (info_new.consistency_policy ==
-				    CONSISTENCY_POLICY_BITMAP &&
-			    st->ss->set_bitmap &&
-			    st->ss->set_bitmap(st, &info)) {
-				st->ss->free_super(st);
-				goto abort_locked;
-			}
-
-			/* update parent container uuid */
-			if (me) {
-				char *path = xstrdup(me->path);
-
-				st->ss->getinfo_super(st, &info_new, NULL);
-				map_update(&map, st->container_devnm,
-					   info_new.text_version,
-					   info_new.uuid, path);
-				free(path);
-			}
+	if (add_disks(mdfd, &info, s, c, st, &map, devlist, total_slots,
+		      have_container, insert_point, major_num, chosen_name))
+		goto abort_locked;
 
-			flush_metadata_updates(st);
-			st->ss->free_super(st);
-		}
-	}
 	map_unlock(&map);
-	free(infos);
 
 	if (is_container(s->level)) {
 		/* No need to start.  But we should signal udev to
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mdadm v7 4/7] mdadm: Introduce pr_info()
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
                   ` (2 preceding siblings ...)
  2023-03-01 20:41 ` [PATCH mdadm v7 3/7] Create: Factor out add_disks() helpers Logan Gunthorpe
@ 2023-03-01 20:41 ` Logan Gunthorpe
  2023-03-03 10:35   ` Paul Menzel
  2023-03-01 20:41 ` [PATCH mdadm v7 5/7] mdadm: Add --write-zeros option for Create Logan Gunthorpe
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 14+ messages in thread
From: Logan Gunthorpe @ 2023-03-01 20:41 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe, Kinga Tanska,
	Chaitanya Kulkarni

Feedback was given to avoid informational pr_err() calls that print
to stderr, even though that's done all through out the code.

Using printf() directly doesn't maintain the same format (an "mdadm"
prefix on every line.

So introduce pr_info() which prints to stdout with the same format
and use it for a couple informational pr_err() calls in Create().

Future work can make this call used in more cases.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Coly Li <colyli@suse.de>
---
 Create.c | 7 ++++---
 mdadm.h  | 2 ++
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/Create.c b/Create.c
index 6a0446644e04..4acda30c5256 100644
--- a/Create.c
+++ b/Create.c
@@ -984,11 +984,12 @@ int Create(struct supertype *st, char *mddev,
 
 			mdi = sysfs_read(-1, devnm, GET_VERSION);
 
-			pr_err("Creating array inside %s container %s\n",
+			pr_info("Creating array inside %s container %s\n",
 				mdi?mdi->text_version:"managed", devnm);
 			sysfs_free(mdi);
 		} else
-			pr_err("Defaulting to version %s metadata\n", info.text_version);
+			pr_info("Defaulting to version %s metadata\n",
+				info.text_version);
 	}
 
 	map_update(&map, fd2devnm(mdfd), info.text_version,
@@ -1145,7 +1146,7 @@ int Create(struct supertype *st, char *mddev,
 			ioctl(mdfd, RESTART_ARRAY_RW, NULL);
 		}
 		if (c->verbose >= 0)
-			pr_err("array %s started.\n", mddev);
+			pr_info("array %s started.\n", mddev);
 		if (st->ss->external && st->container_devnm[0]) {
 			if (need_mdmon)
 				start_mdmon(st->container_devnm);
diff --git a/mdadm.h b/mdadm.h
index 13f8b4cb5a6b..8bd65fba1887 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -1852,6 +1852,8 @@ static inline int xasprintf(char **strp, const char *fmt, ...) {
 #endif
 #define cont_err(fmt ...) fprintf(stderr, "       " fmt)
 
+#define pr_info(fmt, args...) printf("%s: "fmt, Name, ##args)
+
 void *xmalloc(size_t len);
 void *xrealloc(void *ptr, size_t len);
 void *xcalloc(size_t num, size_t size);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mdadm v7 5/7] mdadm: Add --write-zeros option for Create
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
                   ` (3 preceding siblings ...)
  2023-03-01 20:41 ` [PATCH mdadm v7 4/7] mdadm: Introduce pr_info() Logan Gunthorpe
@ 2023-03-01 20:41 ` Logan Gunthorpe
  2023-03-03  8:35   ` Coly Li
  2023-03-01 20:41 ` [PATCH mdadm v7 6/7] tests/00raid5-zero: Introduce test to exercise --write-zeros Logan Gunthorpe
                   ` (5 subsequent siblings)
  10 siblings, 1 reply; 14+ messages in thread
From: Logan Gunthorpe @ 2023-03-01 20:41 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe, Kinga Tanska,
	Chaitanya Kulkarni

Add the --write-zeros option for Create which will send a write zeros
request to all the disks before assembling the array. After zeroing
the array, the disks will be in a known clean state and the initial
sync may be skipped.

Writing zeroes is best used when there is a hardware offload method
to zero the data. But even still, zeroing can take several minutes on
a large device. Because of this, all disks are zeroed in parallel using
their own forked process and a message is printed to the user. The main
process will proceed only after all the zeroing processes have completed
successfully.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
---
 Create.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 ReadMe.c |   2 +
 mdadm.c  |   9 +++
 mdadm.h  |   5 ++
 4 files changed, 190 insertions(+), 2 deletions(-)

diff --git a/Create.c b/Create.c
index 4acda30c5256..bbe9e13dc76d 100644
--- a/Create.c
+++ b/Create.c
@@ -26,6 +26,10 @@
 #include	"md_u.h"
 #include	"md_p.h"
 #include	<ctype.h>
+#include	<fcntl.h>
+#include	<signal.h>
+#include	<sys/signalfd.h>
+#include	<sys/wait.h>
 
 static int round_size_and_verify(unsigned long long *size, int chunk)
 {
@@ -91,9 +95,149 @@ int default_layout(struct supertype *st, int level, int verbose)
 	return layout;
 }
 
+static pid_t write_zeroes_fork(int fd, struct shape *s, struct supertype *st,
+			       struct mddev_dev *dv)
+
+{
+	const unsigned long long req_size = 1 << 30;
+	unsigned long long offset_bytes, size_bytes, sz;
+	sigset_t sigset;
+	int ret = 0;
+	pid_t pid;
+
+	size_bytes = KIB_TO_BYTES(s->size);
+
+	/*
+	 * If size_bytes is zero, this is a zoned raid array where
+	 * each disk is of a different size and uses its full
+	 * disk. Thus zero the entire disk.
+	 */
+	if (!size_bytes && !get_dev_size(fd, dv->devname, &size_bytes))
+		return -1;
+
+	if (dv->data_offset != INVALID_SECTORS)
+		offset_bytes = SEC_TO_BYTES(dv->data_offset);
+	else
+		offset_bytes = SEC_TO_BYTES(st->data_offset);
+
+	pr_info("zeroing data from %lld to %lld on: %s\n",
+		offset_bytes, size_bytes, dv->devname);
+
+	pid = fork();
+	if (pid < 0) {
+		pr_err("Could not fork to zero disks: %s\n", strerror(errno));
+		return pid;
+	} else if (pid != 0) {
+		return pid;
+	}
+
+	sigemptyset(&sigset);
+	sigaddset(&sigset, SIGINT);
+	sigprocmask(SIG_UNBLOCK, &sigset, NULL);
+
+	while (size_bytes) {
+		/*
+		 * Split requests to the kernel into 1GB chunks seeing the
+		 * fallocate() call is not interruptible and blocking a
+		 * ctrl-c for several minutes is not desirable.
+		 *
+		 * 1GB is chosen as a compromise: the user may still have
+		 * to wait several seconds if they ctrl-c on devices that
+		 * zero slowly, but will reduce the number of requests
+		 * required and thus the overhead on devices that perform
+		 * better.
+		 */
+		sz = size_bytes;
+		if (sz >= req_size)
+			sz = req_size;
+
+		if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
+			      offset_bytes, sz)) {
+			pr_err("zeroing %s failed: %s\n", dv->devname,
+			       strerror(errno));
+			ret = 1;
+			break;
+		}
+
+		offset_bytes += sz;
+		size_bytes -= sz;
+	}
+
+	exit(ret);
+}
+
+static int wait_for_zero_forks(int *zero_pids, int count)
+{
+	int wstatus, ret = 0, i, sfd, wait_count = 0;
+	struct signalfd_siginfo fdsi;
+	bool interrupted = false;
+	sigset_t sigset;
+	ssize_t s;
+
+	for (i = 0; i < count; i++)
+		if (zero_pids[i])
+			wait_count++;
+	if (!wait_count)
+		return 0;
+
+	sigemptyset(&sigset);
+	sigaddset(&sigset, SIGINT);
+	sigaddset(&sigset, SIGCHLD);
+	sigprocmask(SIG_BLOCK, &sigset, NULL);
+
+	sfd = signalfd(-1, &sigset, 0);
+	if (sfd < 0) {
+		pr_err("Unable to create signalfd: %s\n", strerror(errno));
+		return 1;
+	}
+
+	while (1) {
+		s = read(sfd, &fdsi, sizeof(fdsi));
+		if (s != sizeof(fdsi)) {
+			pr_err("Invalid signalfd read: %s\n", strerror(errno));
+			close(sfd);
+			return 1;
+		}
+
+		if (fdsi.ssi_signo == SIGINT) {
+			printf("\n");
+			pr_info("Interrupting zeroing processes, please wait...\n");
+			interrupted = true;
+		} else if (fdsi.ssi_signo == SIGCHLD) {
+			if (!--wait_count)
+				break;
+		}
+	}
+
+	close(sfd);
+
+	for (i = 0; i < count; i++) {
+		if (!zero_pids[i])
+			continue;
+
+		waitpid(zero_pids[i], &wstatus, 0);
+		zero_pids[i] = 0;
+		if (!WIFEXITED(wstatus) || WEXITSTATUS(wstatus))
+			ret = 1;
+	}
+
+	if (interrupted) {
+		pr_err("zeroing interrupted!\n");
+		return 1;
+	}
+
+	if (ret)
+		pr_err("zeroing failed!\n");
+	else
+		pr_info("zeroing finished\n");
+
+	return ret;
+}
+
 static int add_disk_to_super(int mdfd, struct shape *s, struct context *c,
 		struct supertype *st, struct mddev_dev *dv,
-		struct mdinfo *info, int have_container, int major_num)
+		struct mdinfo *info, int have_container, int major_num,
+		int *zero_pid)
 {
 	dev_t rdev;
 	int fd;
@@ -148,6 +292,14 @@ static int add_disk_to_super(int mdfd, struct shape *s, struct context *c,
 	}
 	st->ss->getinfo_super(st, info, NULL);
 
+	if (fd >= 0 && s->write_zeroes) {
+		*zero_pid = write_zeroes_fork(fd, s, st, dv);
+		if (*zero_pid <= 0) {
+			ioctl(mdfd, STOP_ARRAY, NULL);
+			return 1;
+		}
+	}
+
 	if (have_container && c->verbose > 0)
 		pr_err("Using %s for device %d\n",
 		       map_dev(info->disk.major, info->disk.minor, 0),
@@ -224,10 +376,23 @@ static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
 {
 	struct mddev_dev *moved_disk = NULL;
 	int pass, raid_disk_num, dnum;
+	int zero_pids[total_slots];
 	struct mddev_dev *dv;
 	struct mdinfo *infos;
+	sigset_t sigset, orig_sigset;
 	int ret = 0;
 
+	/*
+	 * Block SIGINT so the main thread will always wait for the
+	 * zeroing processes when being interrupted. Otherwise the
+	 * zeroing processes will finish their work in the background
+	 * keeping the disk busy.
+	 */
+	sigemptyset(&sigset);
+	sigaddset(&sigset, SIGINT);
+	sigprocmask(SIG_BLOCK, &sigset, &orig_sigset);
+	memset(zero_pids, 0, sizeof(zero_pids));
+
 	infos = xmalloc(sizeof(*infos) * total_slots);
 	enable_fds(total_slots);
 	for (pass = 1; pass <= 2; pass++) {
@@ -261,7 +426,7 @@ static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
 
 				ret = add_disk_to_super(mdfd, s, c, st, dv,
 						&infos[dnum], have_container,
-						major_num);
+						major_num, &zero_pids[dnum]);
 				if (ret)
 					goto out;
 
@@ -287,6 +452,10 @@ static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
 		}
 
 		if (pass == 1) {
+			ret = wait_for_zero_forks(zero_pids, total_slots);
+			if (ret)
+				goto out;
+
 			ret = update_metadata(mdfd, s, st, map, info,
 					      chosen_name);
 			if (ret)
@@ -295,7 +464,10 @@ static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
 	}
 
 out:
+	if (ret)
+		wait_for_zero_forks(zero_pids, total_slots);
 	free(infos);
+	sigprocmask(SIG_SETMASK, &orig_sigset, NULL);
 	return ret;
 }
 
diff --git a/ReadMe.c b/ReadMe.c
index bd8d50d28661..db251ed2f3d4 100644
--- a/ReadMe.c
+++ b/ReadMe.c
@@ -138,6 +138,7 @@ struct option long_options[] = {
     {"size",	  1, 0, 'z'},
     {"auto",	  1, 0, Auto}, /* also for --assemble */
     {"assume-clean",0,0, AssumeClean },
+    {"write-zeroes",0,0, WriteZeroes },
     {"metadata",  1, 0, 'e'}, /* superblock format */
     {"bitmap",	  1, 0, Bitmap},
     {"bitmap-chunk", 1, 0, BitmapChunk},
@@ -390,6 +391,7 @@ char Help_create[] =
 "  --write-journal=      : Specify journal device for RAID-4/5/6 array\n"
 "  --consistency-policy= : Specify the policy that determines how the array\n"
 "                     -k : maintains consistency in case of unexpected shutdown.\n"
+"  --write-zeroes        : Write zeroes to the disks before creating. This will bypass initial sync.\n"
 "\n"
 ;
 
diff --git a/mdadm.c b/mdadm.c
index 57e8e6fa64b9..4685ad6b06c2 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -590,6 +590,10 @@ int main(int argc, char *argv[])
 			s.assume_clean = 1;
 			continue;
 
+		case O(CREATE, WriteZeroes):
+			s.write_zeroes = 1;
+			continue;
+
 		case O(GROW,'n'):
 		case O(CREATE,'n'):
 		case O(BUILD,'n'): /* number of raid disks */
@@ -1251,6 +1255,11 @@ int main(int argc, char *argv[])
 		}
 	}
 
+	if (s.write_zeroes && !s.assume_clean) {
+		pr_info("Disk zeroing requested, setting --assume-clean to skip resync\n");
+		s.assume_clean = 1;
+	}
+
 	if (!mode && devs_found) {
 		mode = MISC;
 		devmode = 'Q';
diff --git a/mdadm.h b/mdadm.h
index 8bd65fba1887..211cfcd54f92 100644
--- a/mdadm.h
+++ b/mdadm.h
@@ -275,6 +275,9 @@ static inline void __put_unaligned32(__u32 val, void *p)
 
 #define ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0]))
 
+#define KIB_TO_BYTES(x)	((x) << 10)
+#define SEC_TO_BYTES(x)	((x) << 9)
+
 extern const char Name[];
 
 struct md_bb_entry {
@@ -435,6 +438,7 @@ extern char Version[], Usage[], Help[], OptionHelp[],
  */
 enum special_options {
 	AssumeClean = 300,
+	WriteZeroes,
 	BitmapChunk,
 	WriteBehind,
 	ReAdd,
@@ -640,6 +644,7 @@ struct shape {
 	int	bitmap_chunk;
 	char	*bitmap_file;
 	int	assume_clean;
+	bool	write_zeroes;
 	int	write_behind;
 	unsigned long long size;
 	unsigned long long data_offset;
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mdadm v7 6/7] tests/00raid5-zero: Introduce test to exercise --write-zeros.
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
                   ` (4 preceding siblings ...)
  2023-03-01 20:41 ` [PATCH mdadm v7 5/7] mdadm: Add --write-zeros option for Create Logan Gunthorpe
@ 2023-03-01 20:41 ` Logan Gunthorpe
  2023-03-01 20:41 ` [PATCH mdadm v7 7/7] manpage: Add --write-zeroes option to manpage Logan Gunthorpe
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Logan Gunthorpe @ 2023-03-01 20:41 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe, Kinga Tanska,
	Chaitanya Kulkarni

Attempt to create a raid5 array with --write-zeros. If it is successful
check the array to ensure it is in sync.

If it is unsuccessful and an unsupported error is printed, skip the
test.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Coly Li <colyli@suse.de>
---
 tests/00raid5-zero | 12 ++++++++++++
 1 file changed, 12 insertions(+)
 create mode 100644 tests/00raid5-zero

diff --git a/tests/00raid5-zero b/tests/00raid5-zero
new file mode 100644
index 000000000000..7d0f05a12539
--- /dev/null
+++ b/tests/00raid5-zero
@@ -0,0 +1,12 @@
+
+if mdadm -CfR $md0 -l 5 -n3 $dev0 $dev1 $dev2 --write-zeroes ; then
+  check nosync
+  echo check > /sys/block/md0/md/sync_action;
+  check wait
+elif grep "zeroing [^ ]* failed: Operation not supported" \
+     $targetdir/stderr; then
+  echo "write-zeros not supported, skipping"
+else
+  echo >&2 "ERROR: mdadm return failure without not supported message"
+  exit 1
+fi
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH mdadm v7 7/7] manpage: Add --write-zeroes option to manpage
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
                   ` (5 preceding siblings ...)
  2023-03-01 20:41 ` [PATCH mdadm v7 6/7] tests/00raid5-zero: Introduce test to exercise --write-zeros Logan Gunthorpe
@ 2023-03-01 20:41 ` Logan Gunthorpe
       [not found] ` <CABdXBAOAJAPmKC66WNnJSL5N72JZiM=AWBxhu_Yi1ojz3jn_Jg@mail.gmail.com>
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 14+ messages in thread
From: Logan Gunthorpe @ 2023-03-01 20:41 UTC (permalink / raw)
  To: linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan, Logan Gunthorpe, Kinga Tanska,
	Chaitanya Kulkarni

Document the new --write-zeroes option in the manpage.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Acked-by: Coly Li <colyli@suse.de>
---
 mdadm.8.in | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mdadm.8.in b/mdadm.8.in
index 64f71ed1df43..6f0f6c13baf1 100644
--- a/mdadm.8.in
+++ b/mdadm.8.in
@@ -837,6 +837,22 @@ array is resynced at creation.  From Linux version 3.0,
 .B \-\-assume\-clean
 can be used with that command to avoid the automatic resync.
 
+.TP
+.BR \-\-write-zeroes
+When creating an array, send write zeroes requests to all the block
+devices.  This should zero the data area on all disks such that the
+initial sync is not necessary and, if successfull, will behave
+as if
+.B \-\-assume\-clean
+was specified.
+.IP
+This is intended for use with devices that have hardware offload for
+zeroing, but despite this zeroing can still take several minutes for
+large disks.  Thus a message is printed before and after zeroing and
+each disk is zeroed in parallel with the others.
+.IP
+This is only meaningful with --create.
+
 .TP
 .BR \-\-backup\-file=
 This is needed when
@@ -1370,7 +1386,7 @@ and
 .B layout\-alternate
 options are for RAID0 arrays with non-uniform devices size that were in
 use before Linux 5.4.  If the array was being used with Linux 3.13 or
-earlier, then to assemble the array on a new kernel, 
+earlier, then to assemble the array on a new kernel,
 .B \-\-update=layout\-original
 must be given.  If the array was created and used with a kernel from Linux 3.14 to
 Linux 5.3, then
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays
       [not found] ` <CABdXBAOAJAPmKC66WNnJSL5N72JZiM=AWBxhu_Yi1ojz3jn_Jg@mail.gmail.com>
@ 2023-03-02  2:53   ` Chaitanya Kulkarni
  0 siblings, 0 replies; 14+ messages in thread
From: Chaitanya Kulkarni @ 2023-03-02  2:53 UTC (permalink / raw)
  To: Jonmichael Hands, Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni,
	Mariusz Tkaczyk, Coly Li, Stephen Bates, Martin Oliveira,
	David Sloan

On 3/1/2023 4:38 PM, Jonmichael Hands wrote:
> Hi Logan, Martin,
> just to be clear, the Linux implementation of write zeros is correctly 
> tagging DEAC to 1 during the NVMe write zeros command? This should be 
> very fast on drives that support write zeros with TRIM (most controllers 
> will have the same internal path as a TRIM updating the FTL to 
> deallocate the LBAs). Size of the command really shouldn't affect it 
> much, it is much more correlated with the firmware implementation of above.
> Thanks!
> JM
> Screenshot 2023-03-01 at 4.35.43 PM.png


Yes, it does please see [1].

-ck

1. NVMe DEC bit tagging in NVMe write-zeroes command setup.

static inline blk_status_t nvme_setup_write_zeroes(struct nvme_ns *ns,
		struct request *req, struct nvme_command *cmnd)
{
	memset(cmnd, 0, sizeof(*cmnd));

	if (ns->ctrl->quirks & NVME_QUIRK_DEALLOCATE_ZEROES)
		return nvme_setup_discard(ns, req, cmnd);

	cmnd->write_zeroes.opcode = nvme_cmd_write_zeroes;
	cmnd->write_zeroes.nsid = cpu_to_le32(ns->head->ns_id);
	cmnd->write_zeroes.slba =
		cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
	cmnd->write_zeroes.length =
		cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);

	if (!(req->cmd_flags & REQ_NOUNMAP) && (ns->features & NVME_NS_DEAC))
		*cmnd->write_zeroes.control |= cpu_to_le16(NVME_WZ_DEAC);*

	if (nvme_ns_has_pi(ns)) {
		cmnd->write_zeroes.control |= cpu_to_le16(NVME_RW_PRINFO_PRACT);

		switch (ns->pi_type) {
		case NVME_NS_DPS_PI_TYPE1:
		case NVME_NS_DPS_PI_TYPE2:
			nvme_set_ref_tag(ns, cmnd, req);
			break;
		}
	}

	return BLK_STS_OK;
}

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
                   ` (7 preceding siblings ...)
       [not found] ` <CABdXBAOAJAPmKC66WNnJSL5N72JZiM=AWBxhu_Yi1ojz3jn_Jg@mail.gmail.com>
@ 2023-03-02  2:55 ` Chaitanya Kulkarni
  2023-03-02 16:56 ` Martin K. Petersen
  2023-03-13 14:08 ` Jes Sorensen
  10 siblings, 0 replies; 14+ messages in thread
From: Chaitanya Kulkarni @ 2023-03-02  2:55 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-raid, Jes Sorensen
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Jonmichael Hands, Stephen Bates, Martin Oliveira, David Sloan

On 3/1/2023 12:41 PM, Logan Gunthorpe wrote:
> Hi,
> 
> This is the next iteration of the patchset to add a zeroing option
> which bypasses the inital sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.
> 
> This patch set adds the --write-zeroes option which will imply
> --assume-clean and write zeros to the data region in each disk before
> starting the array. This can take some time so each disk is done in
> parallel in its own fork. To make the forking code easier to
> understand this patch set also starts with some cleanup of the
> existing Create code.
> 
> We tested write-zeroes requests on a number of modern nvme drives of
> various manufacturers and found most are not as optimized as the
> discard path. A couple drives that were tested did not support
> write-zeroes at all but still performed similarly with the kernel
> falling back to writing zero pages. Typically we see it take on the
> order of one minute per 100GB of data zeroed.
> 
> One reason write-zeroes is slower than discard is that today's NVMe
> devices only allow about 2MB to be zeroed in one command where as
> the entire drive can typically be discarded in one command. Partly,
> this is a limitation of the spec as there are only 16 bits avalaible

's/avalaible/available/g'

> in the write-zeros command size but drives still don't max this out.
> Hopefully, in the future this will all be optimized a bit more
> and this work will be able to take advantage of that.
> 
> Logan
> 
> --
> 
>

Based on earlier discussion and feedback from Martin this
whole series looks good to me.

Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

-ck



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
                   ` (8 preceding siblings ...)
  2023-03-02  2:55 ` Chaitanya Kulkarni
@ 2023-03-02 16:56 ` Martin K. Petersen
  2023-03-13 14:08 ` Jes Sorensen
  10 siblings, 0 replies; 14+ messages in thread
From: Martin K. Petersen @ 2023-03-02 16:56 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni,
	Mariusz Tkaczyk, Coly Li, Chaitanya Kulkarni, Jonmichael Hands,
	Stephen Bates, Martin Oliveira, David Sloan


Logan,

> This is the next iteration of the patchset to add a zeroing option
> which bypasses the inital sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.

This looks good to me!

Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>

-- 
Martin K. Petersen	Oracle Linux Engineering

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mdadm v7 5/7] mdadm: Add --write-zeros option for Create
  2023-03-01 20:41 ` [PATCH mdadm v7 5/7] mdadm: Add --write-zeros option for Create Logan Gunthorpe
@ 2023-03-03  8:35   ` Coly Li
  0 siblings, 0 replies; 14+ messages in thread
From: Coly Li @ 2023-03-03  8:35 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni,
	Mariusz Tkaczyk, Chaitanya Kulkarni, Jonmichael Hands,
	Stephen Bates, Martin Oliveira, David Sloan, Kinga Tanska,
	Chaitanya Kulkarni

On Wed, Mar 01, 2023 at 01:41:33PM -0700, Logan Gunthorpe wrote:
> Add the --write-zeros option for Create which will send a write zeros
> request to all the disks before assembling the array. After zeroing
> the array, the disks will be in a known clean state and the initial
> sync may be skipped.
> 
> Writing zeroes is best used when there is a hardware offload method
> to zero the data. But even still, zeroing can take several minutes on
> a large device. Because of this, all disks are zeroed in parallel using
> their own forked process and a message is printed to the user. The main
> process will proceed only after all the zeroing processes have completed
> successfully.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
> Reviewed-by: Xiao Ni <xni@redhat.com>
> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>

Acked-by: Coly Li <colyli@suse.de>

Thanks.

> ---
>  Create.c | 176 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  ReadMe.c |   2 +
>  mdadm.c  |   9 +++
>  mdadm.h  |   5 ++
>  4 files changed, 190 insertions(+), 2 deletions(-)
> 
> diff --git a/Create.c b/Create.c
> index 4acda30c5256..bbe9e13dc76d 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -26,6 +26,10 @@
>  #include	"md_u.h"
>  #include	"md_p.h"
>  #include	<ctype.h>
> +#include	<fcntl.h>
> +#include	<signal.h>
> +#include	<sys/signalfd.h>
> +#include	<sys/wait.h>
>  
>  static int round_size_and_verify(unsigned long long *size, int chunk)
>  {
> @@ -91,9 +95,149 @@ int default_layout(struct supertype *st, int level, int verbose)
>  	return layout;
>  }
>  
> +static pid_t write_zeroes_fork(int fd, struct shape *s, struct supertype *st,
> +			       struct mddev_dev *dv)
> +
> +{
> +	const unsigned long long req_size = 1 << 30;
> +	unsigned long long offset_bytes, size_bytes, sz;
> +	sigset_t sigset;
> +	int ret = 0;
> +	pid_t pid;
> +
> +	size_bytes = KIB_TO_BYTES(s->size);
> +
> +	/*
> +	 * If size_bytes is zero, this is a zoned raid array where
> +	 * each disk is of a different size and uses its full
> +	 * disk. Thus zero the entire disk.
> +	 */
> +	if (!size_bytes && !get_dev_size(fd, dv->devname, &size_bytes))
> +		return -1;
> +
> +	if (dv->data_offset != INVALID_SECTORS)
> +		offset_bytes = SEC_TO_BYTES(dv->data_offset);
> +	else
> +		offset_bytes = SEC_TO_BYTES(st->data_offset);
> +
> +	pr_info("zeroing data from %lld to %lld on: %s\n",
> +		offset_bytes, size_bytes, dv->devname);
> +
> +	pid = fork();
> +	if (pid < 0) {
> +		pr_err("Could not fork to zero disks: %s\n", strerror(errno));
> +		return pid;
> +	} else if (pid != 0) {
> +		return pid;
> +	}
> +
> +	sigemptyset(&sigset);
> +	sigaddset(&sigset, SIGINT);
> +	sigprocmask(SIG_UNBLOCK, &sigset, NULL);
> +
> +	while (size_bytes) {
> +		/*
> +		 * Split requests to the kernel into 1GB chunks seeing the
> +		 * fallocate() call is not interruptible and blocking a
> +		 * ctrl-c for several minutes is not desirable.
> +		 *
> +		 * 1GB is chosen as a compromise: the user may still have
> +		 * to wait several seconds if they ctrl-c on devices that
> +		 * zero slowly, but will reduce the number of requests
> +		 * required and thus the overhead on devices that perform
> +		 * better.
> +		 */
> +		sz = size_bytes;
> +		if (sz >= req_size)
> +			sz = req_size;
> +
> +		if (fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE,
> +			      offset_bytes, sz)) {
> +			pr_err("zeroing %s failed: %s\n", dv->devname,
> +			       strerror(errno));
> +			ret = 1;
> +			break;
> +		}
> +
> +		offset_bytes += sz;
> +		size_bytes -= sz;
> +	}
> +
> +	exit(ret);
> +}
> +
> +static int wait_for_zero_forks(int *zero_pids, int count)
> +{
> +	int wstatus, ret = 0, i, sfd, wait_count = 0;
> +	struct signalfd_siginfo fdsi;
> +	bool interrupted = false;
> +	sigset_t sigset;
> +	ssize_t s;
> +
> +	for (i = 0; i < count; i++)
> +		if (zero_pids[i])
> +			wait_count++;
> +	if (!wait_count)
> +		return 0;
> +
> +	sigemptyset(&sigset);
> +	sigaddset(&sigset, SIGINT);
> +	sigaddset(&sigset, SIGCHLD);
> +	sigprocmask(SIG_BLOCK, &sigset, NULL);
> +
> +	sfd = signalfd(-1, &sigset, 0);
> +	if (sfd < 0) {
> +		pr_err("Unable to create signalfd: %s\n", strerror(errno));
> +		return 1;
> +	}
> +
> +	while (1) {
> +		s = read(sfd, &fdsi, sizeof(fdsi));
> +		if (s != sizeof(fdsi)) {
> +			pr_err("Invalid signalfd read: %s\n", strerror(errno));
> +			close(sfd);
> +			return 1;
> +		}
> +
> +		if (fdsi.ssi_signo == SIGINT) {
> +			printf("\n");
> +			pr_info("Interrupting zeroing processes, please wait...\n");
> +			interrupted = true;
> +		} else if (fdsi.ssi_signo == SIGCHLD) {
> +			if (!--wait_count)
> +				break;
> +		}
> +	}
> +
> +	close(sfd);
> +
> +	for (i = 0; i < count; i++) {
> +		if (!zero_pids[i])
> +			continue;
> +
> +		waitpid(zero_pids[i], &wstatus, 0);
> +		zero_pids[i] = 0;
> +		if (!WIFEXITED(wstatus) || WEXITSTATUS(wstatus))
> +			ret = 1;
> +	}
> +
> +	if (interrupted) {
> +		pr_err("zeroing interrupted!\n");
> +		return 1;
> +	}
> +
> +	if (ret)
> +		pr_err("zeroing failed!\n");
> +	else
> +		pr_info("zeroing finished\n");
> +
> +	return ret;
> +}
> +
>  static int add_disk_to_super(int mdfd, struct shape *s, struct context *c,
>  		struct supertype *st, struct mddev_dev *dv,
> -		struct mdinfo *info, int have_container, int major_num)
> +		struct mdinfo *info, int have_container, int major_num,
> +		int *zero_pid)
>  {
>  	dev_t rdev;
>  	int fd;
> @@ -148,6 +292,14 @@ static int add_disk_to_super(int mdfd, struct shape *s, struct context *c,
>  	}
>  	st->ss->getinfo_super(st, info, NULL);
>  
> +	if (fd >= 0 && s->write_zeroes) {
> +		*zero_pid = write_zeroes_fork(fd, s, st, dv);
> +		if (*zero_pid <= 0) {
> +			ioctl(mdfd, STOP_ARRAY, NULL);
> +			return 1;
> +		}
> +	}
> +
>  	if (have_container && c->verbose > 0)
>  		pr_err("Using %s for device %d\n",
>  		       map_dev(info->disk.major, info->disk.minor, 0),
> @@ -224,10 +376,23 @@ static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
>  {
>  	struct mddev_dev *moved_disk = NULL;
>  	int pass, raid_disk_num, dnum;
> +	int zero_pids[total_slots];
>  	struct mddev_dev *dv;
>  	struct mdinfo *infos;
> +	sigset_t sigset, orig_sigset;
>  	int ret = 0;
>  
> +	/*
> +	 * Block SIGINT so the main thread will always wait for the
> +	 * zeroing processes when being interrupted. Otherwise the
> +	 * zeroing processes will finish their work in the background
> +	 * keeping the disk busy.
> +	 */
> +	sigemptyset(&sigset);
> +	sigaddset(&sigset, SIGINT);
> +	sigprocmask(SIG_BLOCK, &sigset, &orig_sigset);
> +	memset(zero_pids, 0, sizeof(zero_pids));
> +
>  	infos = xmalloc(sizeof(*infos) * total_slots);
>  	enable_fds(total_slots);
>  	for (pass = 1; pass <= 2; pass++) {
> @@ -261,7 +426,7 @@ static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
>  
>  				ret = add_disk_to_super(mdfd, s, c, st, dv,
>  						&infos[dnum], have_container,
> -						major_num);
> +						major_num, &zero_pids[dnum]);
>  				if (ret)
>  					goto out;
>  
> @@ -287,6 +452,10 @@ static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
>  		}
>  
>  		if (pass == 1) {
> +			ret = wait_for_zero_forks(zero_pids, total_slots);
> +			if (ret)
> +				goto out;
> +
>  			ret = update_metadata(mdfd, s, st, map, info,
>  					      chosen_name);
>  			if (ret)
> @@ -295,7 +464,10 @@ static int add_disks(int mdfd, struct mdinfo *info, struct shape *s,
>  	}
>  
>  out:
> +	if (ret)
> +		wait_for_zero_forks(zero_pids, total_slots);
>  	free(infos);
> +	sigprocmask(SIG_SETMASK, &orig_sigset, NULL);
>  	return ret;
>  }
>  
> diff --git a/ReadMe.c b/ReadMe.c
> index bd8d50d28661..db251ed2f3d4 100644
> --- a/ReadMe.c
> +++ b/ReadMe.c
> @@ -138,6 +138,7 @@ struct option long_options[] = {
>      {"size",	  1, 0, 'z'},
>      {"auto",	  1, 0, Auto}, /* also for --assemble */
>      {"assume-clean",0,0, AssumeClean },
> +    {"write-zeroes",0,0, WriteZeroes },
>      {"metadata",  1, 0, 'e'}, /* superblock format */
>      {"bitmap",	  1, 0, Bitmap},
>      {"bitmap-chunk", 1, 0, BitmapChunk},
> @@ -390,6 +391,7 @@ char Help_create[] =
>  "  --write-journal=      : Specify journal device for RAID-4/5/6 array\n"
>  "  --consistency-policy= : Specify the policy that determines how the array\n"
>  "                     -k : maintains consistency in case of unexpected shutdown.\n"
> +"  --write-zeroes        : Write zeroes to the disks before creating. This will bypass initial sync.\n"
>  "\n"
>  ;
>  
> diff --git a/mdadm.c b/mdadm.c
> index 57e8e6fa64b9..4685ad6b06c2 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -590,6 +590,10 @@ int main(int argc, char *argv[])
>  			s.assume_clean = 1;
>  			continue;
>  
> +		case O(CREATE, WriteZeroes):
> +			s.write_zeroes = 1;
> +			continue;
> +
>  		case O(GROW,'n'):
>  		case O(CREATE,'n'):
>  		case O(BUILD,'n'): /* number of raid disks */
> @@ -1251,6 +1255,11 @@ int main(int argc, char *argv[])
>  		}
>  	}
>  
> +	if (s.write_zeroes && !s.assume_clean) {
> +		pr_info("Disk zeroing requested, setting --assume-clean to skip resync\n");
> +		s.assume_clean = 1;
> +	}
> +
>  	if (!mode && devs_found) {
>  		mode = MISC;
>  		devmode = 'Q';
> diff --git a/mdadm.h b/mdadm.h
> index 8bd65fba1887..211cfcd54f92 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -275,6 +275,9 @@ static inline void __put_unaligned32(__u32 val, void *p)
>  
>  #define ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0]))
>  
> +#define KIB_TO_BYTES(x)	((x) << 10)
> +#define SEC_TO_BYTES(x)	((x) << 9)
> +
>  extern const char Name[];
>  
>  struct md_bb_entry {
> @@ -435,6 +438,7 @@ extern char Version[], Usage[], Help[], OptionHelp[],
>   */
>  enum special_options {
>  	AssumeClean = 300,
> +	WriteZeroes,
>  	BitmapChunk,
>  	WriteBehind,
>  	ReAdd,
> @@ -640,6 +644,7 @@ struct shape {
>  	int	bitmap_chunk;
>  	char	*bitmap_file;
>  	int	assume_clean;
> +	bool	write_zeroes;
>  	int	write_behind;
>  	unsigned long long size;
>  	unsigned long long data_offset;
> -- 
> 2.30.2
> 

-- 
Coly Li

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mdadm v7 4/7] mdadm: Introduce pr_info()
  2023-03-01 20:41 ` [PATCH mdadm v7 4/7] mdadm: Introduce pr_info() Logan Gunthorpe
@ 2023-03-03 10:35   ` Paul Menzel
  0 siblings, 0 replies; 14+ messages in thread
From: Paul Menzel @ 2023-03-03 10:35 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-raid, Jes Sorensen, Guoqing Jiang, Xiao Ni,
	Mariusz Tkaczyk, Coly Li, Chaitanya Kulkarni, Jonmichael Hands,
	Stephen Bates, Martin Oliveira, David Sloan, Kinga Tanska,
	Chaitanya Kulkarni

Dear Logan,


Am 01.03.23 um 21:41 schrieb Logan Gunthorpe:
> Feedback was given to avoid informational pr_err() calls that print
> to stderr, even though that's done all through out the code.
> 
> Using printf() directly doesn't maintain the same format (an "mdadm"
> prefix on every line.

Just a nit, that the closing ) is missing.

In the summary you could use: Introduce and use pr_info()

> So introduce pr_info() which prints to stdout with the same format
> and use it for a couple informational pr_err() calls in Create().
> 
> Future work can make this call used in more cases.
> 
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Acked-by: Kinga Tanska <kinga.tanska@linux.intel.com>
> Reviewed-by: Xiao Ni <xni@redhat.com>
> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
> Acked-by: Coly Li <colyli@suse.de>
> ---
>   Create.c | 7 ++++---
>   mdadm.h  | 2 ++
>   2 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/Create.c b/Create.c
> index 6a0446644e04..4acda30c5256 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -984,11 +984,12 @@ int Create(struct supertype *st, char *mddev,
>   
>   			mdi = sysfs_read(-1, devnm, GET_VERSION);
>   
> -			pr_err("Creating array inside %s container %s\n",
> +			pr_info("Creating array inside %s container %s\n",
>   				mdi?mdi->text_version:"managed", devnm);
>   			sysfs_free(mdi);
>   		} else
> -			pr_err("Defaulting to version %s metadata\n", info.text_version);
> +			pr_info("Defaulting to version %s metadata\n",
> +				info.text_version);
>   	}
>   
>   	map_update(&map, fd2devnm(mdfd), info.text_version,
> @@ -1145,7 +1146,7 @@ int Create(struct supertype *st, char *mddev,
>   			ioctl(mdfd, RESTART_ARRAY_RW, NULL);
>   		}
>   		if (c->verbose >= 0)
> -			pr_err("array %s started.\n", mddev);
> +			pr_info("array %s started.\n", mddev);
>   		if (st->ss->external && st->container_devnm[0]) {
>   			if (need_mdmon)
>   				start_mdmon(st->container_devnm);
> diff --git a/mdadm.h b/mdadm.h
> index 13f8b4cb5a6b..8bd65fba1887 100644
> --- a/mdadm.h
> +++ b/mdadm.h
> @@ -1852,6 +1852,8 @@ static inline int xasprintf(char **strp, const char *fmt, ...) {
>   #endif
>   #define cont_err(fmt ...) fprintf(stderr, "       " fmt)
>   
> +#define pr_info(fmt, args...) printf("%s: "fmt, Name, ##args)
> +
>   void *xmalloc(size_t len);
>   void *xrealloc(void *ptr, size_t len);
>   void *xcalloc(size_t num, size_t size);

Acked-by: Paul Menzel <pmenzel@molgen.mpg.de>


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays
  2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
                   ` (9 preceding siblings ...)
  2023-03-02 16:56 ` Martin K. Petersen
@ 2023-03-13 14:08 ` Jes Sorensen
  10 siblings, 0 replies; 14+ messages in thread
From: Jes Sorensen @ 2023-03-13 14:08 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-raid
  Cc: Guoqing Jiang, Xiao Ni, Mariusz Tkaczyk, Coly Li,
	Chaitanya Kulkarni, Jonmichael Hands, Stephen Bates,
	Martin Oliveira, David Sloan

On 3/1/23 15:41, Logan Gunthorpe wrote:
> Hi,
> 
> This is the next iteration of the patchset to add a zeroing option
> which bypasses the inital sync for arrays. This version of the patch
> has some minor cleanup and collected a number of review and ack tags.
> 
> This patch set adds the --write-zeroes option which will imply
> --assume-clean and write zeros to the data region in each disk before
> starting the array. This can take some time so each disk is done in
> parallel in its own fork. To make the forking code easier to
> understand this patch set also starts with some cleanup of the
> existing Create code.
> 
> We tested write-zeroes requests on a number of modern nvme drives of
> various manufacturers and found most are not as optimized as the
> discard path. A couple drives that were tested did not support
> write-zeroes at all but still performed similarly with the kernel
> falling back to writing zero pages. Typically we see it take on the
> order of one minute per 100GB of data zeroed.
> 
> One reason write-zeroes is slower than discard is that today's NVMe
> devices only allow about 2MB to be zeroed in one command where as
> the entire drive can typically be discarded in one command. Partly,
> this is a limitation of the spec as there are only 16 bits avalaible
> in the write-zeros command size but drives still don't max this out.
> Hopefully, in the future this will all be optimized a bit more
> and this work will be able to take advantage of that.
> 
> Logan
> 
> --
> 
> Changes since v6:
>    * Collected review and ack tags from Xiao, Chaitanya and Coly
>    * Adjust the error reporting to us strerror() instead of the
>      glibc %m extension. (per Coly)
>    * Fix a typo in the man page ("despit" should have been "despite")
>      (as noticed by Coly)
> 
> Changes since v5:
>    * Ensure 'interrupted' is initialized in wait_for_zero_forks().
>      (as noticed by Xiao)
>    * Print a message indicating that the zeroing was interrupted.
> 
> Changes since v4:
>    * Handle SIGINT better. Previous versions would leave the zeroing
>      processes behind after the main thread exitted which would
>      continue zeroing in the background (possibly for some time).
>      This version splits the zero fallocate commands up so they can be
>      interrupted quicker, and intercepts SIGINT in the main thread
>      to print an appropriate message and wait for the threads
>      to finish up. (as noticed by Xiao)
> 
> Changes since v3:
>    * Store the pid in a local variable instead of the mdinfo struct
>     (per Mariusz and Xiao)
> 
> Changes since v2:
> 
>    * Use write-zeroes instead of discard to zero the disks (per
>      Martin)
>    * Due to the time required to zero the disks, each disk is
>      now done in parallel with separate forks of the process.
>    * In order to add the forking some refactoring was done on the
>      Create() function to make it easier to understand
>    * Added a pr_info() call so that some prints can be done
>      to stdout instead of stdour (per Mariusz)
>    * Added KIB_TO_BYTES and SEC_TO_BYTES helpers (per Mariusz)
>    * Added a test to the mdadm test suite to test the option
>      works.
>    * Fixed up how the size and offset are calculated with some
>      great information from Xiao.
> 
> Changes since v1:
> 
>    * Discard the data in the devices later in the create process
>      while they are already open. This requires treating the
>      s.discard option the same as the s.assume_clean option.
>      Per Mariusz.
>    * A couple other minor cleanup changes from Mariusz.
> 
> --
> 
> Logan Gunthorpe (7):
>   Create: goto abort_locked instead of return 1 in error path
>   Create: remove safe_mode_delay local variable
>   Create: Factor out add_disks() helpers
>   mdadm: Introduce pr_info()
>   mdadm: Add --write-zeros option for Create
>   tests/00raid5-zero: Introduce test to exercise --write-zeros.
>   manpage: Add --write-zeroes option to manpage
> 
>  Create.c           | 565 +++++++++++++++++++++++++++++++--------------
>  ReadMe.c           |   2 +
>  mdadm.8.in         |  18 +-
>  mdadm.c            |   9 +
>  mdadm.h            |   7 +
>  tests/00raid5-zero |  12 +
>  6 files changed, 437 insertions(+), 176 deletions(-)
>  create mode 100644 tests/00raid5-zero
> 
> 
> base-commit: f1f3ef7d2de5e3a726c27b9f9bb20e270a100dab

All applied!

Thanks,
Jes



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2023-03-13 14:09 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-01 20:41 [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Logan Gunthorpe
2023-03-01 20:41 ` [PATCH mdadm v7 1/7] Create: goto abort_locked instead of return 1 in error path Logan Gunthorpe
2023-03-01 20:41 ` [PATCH mdadm v7 2/7] Create: remove safe_mode_delay local variable Logan Gunthorpe
2023-03-01 20:41 ` [PATCH mdadm v7 3/7] Create: Factor out add_disks() helpers Logan Gunthorpe
2023-03-01 20:41 ` [PATCH mdadm v7 4/7] mdadm: Introduce pr_info() Logan Gunthorpe
2023-03-03 10:35   ` Paul Menzel
2023-03-01 20:41 ` [PATCH mdadm v7 5/7] mdadm: Add --write-zeros option for Create Logan Gunthorpe
2023-03-03  8:35   ` Coly Li
2023-03-01 20:41 ` [PATCH mdadm v7 6/7] tests/00raid5-zero: Introduce test to exercise --write-zeros Logan Gunthorpe
2023-03-01 20:41 ` [PATCH mdadm v7 7/7] manpage: Add --write-zeroes option to manpage Logan Gunthorpe
     [not found] ` <CABdXBAOAJAPmKC66WNnJSL5N72JZiM=AWBxhu_Yi1ojz3jn_Jg@mail.gmail.com>
2023-03-02  2:53   ` [PATCH mdadm v7 0/7] Write Zeroes option for Creating Arrays Chaitanya Kulkarni
2023-03-02  2:55 ` Chaitanya Kulkarni
2023-03-02 16:56 ` Martin K. Petersen
2023-03-13 14:08 ` Jes Sorensen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.