* [PATCH v5 0/7] Partial Parity Log for MD RAID 5
@ 2017-03-09  8:59 Artur Paszkiewicz
  2017-03-09  8:59 ` [PATCH v5 1/7] md: superblock changes for PPL Artur Paszkiewicz
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-09  8:59 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

This series of patches implements the Partial Parity Log for RAID5 arrays. The
purpose of this feature is to close the RAID5 Write Hole. It is an alternative
to the existing raid5-cache, but the logging workflow and much of the
implementation are based on it.

The main difference compared to raid5-cache is that PPL is a distributed log -
it is stored on the array member drives in the metadata area and does not
require a dedicated journaling drive. Write performance is reduced by up to
30%-40%, but the log scales with the number of drives in the array, and there
is no dedicated journaling drive to become a bottleneck or a single point of
failure. PPL does not protect against losing in-flight data, only against
silent data corruption. More details about how the log works can be found in
patches 3 and 5.

This feature originated from Intel RSTe, which uses IMSM metadata. PPL for IMSM
is going to be included in RSTe implementations starting with upcoming Xeon
platforms and Intel will continue supporting and maintaining it. This patchset
implements PPL for external metadata (specifically IMSM) as well as native MD
v1.x metadata.

Changes in mdadm are also required to make this fully usable. Patches for mdadm
will be sent later.

v5:
- Added a common raid5-cache and ppl interface in raid5-log.h.
- Moved ops_run_partial_parity() to raid5-ppl.c.
- Use an inline bio in struct ppl_io_unit, simplify ppl_submit_iounit() and fix
  a potential bio allocation issue.
- Simplified condition for appending a stripe_head to ppl entry in
  ppl_log_stripe().
- Flush disk cache after ppl recovery, write with FUA in
  ppl_write_empty_header().
- Removed order > 0 page allocation in ppl_recover_entry().
- Put r5l_io_unit and ppl_io_unit in a union in struct stripe_head.
- struct ppl_conf *ppl in struct r5conf replaced with void *log_private.
- Improved comments and descriptions.

v4:
- Separated raid5-cache and ppl structures.
- Removed the policy logic from raid5-cache, ppl calls moved to raid5 core.
- Added a check for invalid configuration when validating the superblock.
- Moved documentation to separate file.
- More checks for ppl sector/size.
- Some small fixes and improvements.

v3:
- Fixed alignment issues in the metadata structures.
- Removed reading IMSM signature from superblock.
- Removed 'rwh_policy' and per-device JournalPpl flags, added
  'consistency_policy', 'ppl_sector' and 'ppl_size' sysfs attributes.
- Reworked and simplified disk removal logic.
- Debug messages in raid5-ppl.c converted to pr_debug().
- Fixed some bugs in logging and recovery code.
- Improved descriptions and documentation.

v2:
- Fixed wrong PPL size calculation for IMSM.
- Simplified full stripe write case.
- Removed direct access to bi_io_vec.
- Handle failed bio_add_page().

Artur Paszkiewicz (7):
  md: superblock changes for PPL
  raid5: separate header for log functions
  raid5-ppl: Partial Parity Log write logging implementation
  md: add sysfs entries for PPL
  raid5-ppl: load and recover the log
  raid5-ppl: support disk hot add/remove with PPL
  raid5-ppl: runtime PPL enabling or disabling

 Documentation/admin-guide/md.rst |   32 +-
 Documentation/md/raid5-ppl.txt   |   44 ++
 drivers/md/Makefile              |    2 +-
 drivers/md/md.c                  |  140 +++++
 drivers/md/md.h                  |   10 +
 drivers/md/raid0.c               |    3 +-
 drivers/md/raid1.c               |    3 +-
 drivers/md/raid5-cache.c         |   22 +-
 drivers/md/raid5-log.h           |  114 ++++
 drivers/md/raid5-ppl.c           | 1247 ++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c               |  182 ++++--
 drivers/md/raid5.h               |   40 +-
 include/uapi/linux/raid/md_p.h   |   45 +-
 13 files changed, 1799 insertions(+), 85 deletions(-)
 create mode 100644 Documentation/md/raid5-ppl.txt
 create mode 100644 drivers/md/raid5-log.h
 create mode 100644 drivers/md/raid5-ppl.c

-- 
2.11.0



* [PATCH v5 1/7] md: superblock changes for PPL
  2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
@ 2017-03-09  8:59 ` Artur Paszkiewicz
  2017-03-09  8:59 ` [PATCH v5 2/7] raid5: separate header for log functions Artur Paszkiewicz
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-09  8:59 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

Include information about the PPL location and size in mdp_superblock_1 and
copy it to/from the rdev. Because PPL is mutually exclusive with the bitmap,
put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
'feature_map', analogous to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to
mddev->flags to indicate that PPL is enabled on an array.
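
For illustration only (not part of this patch): a userspace reader such as
mdadm would locate the PPL area the same way the super_1_load() hunk below
does, treating ppl.offset as a signed sector offset from the superblock. A
minimal sketch with the superblock trimmed to just the fields used here,
endianness handling omitted and made-up example values:

#include <stdint.h>

#define MD_FEATURE_PPL	1024	/* feature_map bit, as defined in md_p.h */

/* Illustrative subset of the v1.x superblock fields relevant to PPL. */
struct mini_sb_v1 {
	uint32_t feature_map;
	int16_t  ppl_offset;	/* signed sectors from the superblock start */
	uint16_t ppl_size;	/* size of the PPL area in sectors */
};

/* First sector of the PPL area, or -1 if PPL is not enabled. */
static int64_t ppl_start_sector(const struct mini_sb_v1 *sb, uint64_t sb_start)
{
	if (!(sb->feature_map & MD_FEATURE_PPL))
		return -1;
	return (int64_t)sb_start + sb->ppl_offset;
}

int main(void)
{
	struct mini_sb_v1 sb = { .feature_map = MD_FEATURE_PPL,
				 .ppl_offset = -2048, .ppl_size = 2040 };

	return ppl_start_sector(&sb, 4096) == 2048 ? 0 : 1;
}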

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c                | 19 +++++++++++++++++++
 drivers/md/md.h                |  8 ++++++++
 drivers/md/raid0.c             |  3 ++-
 drivers/md/raid1.c             |  3 ++-
 include/uapi/linux/raid/md_p.h | 18 ++++++++++++++----
 5 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 79a99a1c9ce7..173550455c42 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1506,6 +1506,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 	} else if (sb->bblog_offset != 0)
 		rdev->badblocks.shift = 0;
 
+	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
+		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
+		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
+		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
+	}
+
 	if (!refdev) {
 		ret = 1;
 	} else {
@@ -1618,6 +1624,13 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 
 		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
 			set_bit(MD_HAS_JOURNAL, &mddev->flags);
+
+		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
+			if (le32_to_cpu(sb->feature_map) &
+			    (MD_FEATURE_BITMAP_OFFSET | MD_FEATURE_JOURNAL))
+				return -EINVAL;
+			set_bit(MD_HAS_PPL, &mddev->flags);
+		}
 	} else if (mddev->pers == NULL) {
 		/* Insist of good event counter while assembling, except for
 		 * spares (which don't need an event count) */
@@ -1831,6 +1844,12 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
 		sb->feature_map |= cpu_to_le32(MD_FEATURE_JOURNAL);
 
+	if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		sb->feature_map |= cpu_to_le32(MD_FEATURE_PPL);
+		sb->ppl.offset = cpu_to_le16(rdev->ppl.offset);
+		sb->ppl.size = cpu_to_le16(rdev->ppl.size);
+	}
+
 	rdev_for_each(rdev2, mddev) {
 		i = rdev2->desc_nr;
 		if (test_bit(Faulty, &rdev2->flags))
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 1c00160b09f9..a7b2f16452c4 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -122,6 +122,13 @@ struct md_rdev {
 					   * sysfs entry */
 
 	struct badblocks badblocks;
+
+	struct {
+		short offset;	/* Offset from superblock to start of PPL.
+				 * Not used by external metadata. */
+		unsigned int size;	/* Size in sectors of the PPL space */
+		sector_t sector;	/* First sector of the PPL space */
+	} ppl;
 };
 enum flag_bits {
 	Faulty,			/* device is known to have a fault */
@@ -226,6 +233,7 @@ enum mddev_flags {
 				 * supported as calls to md_error() will
 				 * never cause the array to become failed.
 				 */
+	MD_HAS_PPL,		/* The raid array has PPL feature set */
 };
 
 enum mddev_sb_flags {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 93347ca7c7a6..56f70c3ad37c 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -29,7 +29,8 @@
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
 	 (1L << MD_JOURNAL_CLEAN) |	\
-	 (1L << MD_FAILFAST_SUPPORTED))
+	 (1L << MD_FAILFAST_SUPPORTED) |\
+	 (1L << MD_HAS_PPL))
 
 static int raid0_congested(struct mddev *mddev, int bits)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 3c5933b1d8fb..31c7df2859a6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -44,7 +44,8 @@
 
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
-	 (1L << MD_JOURNAL_CLEAN))
+	 (1L << MD_JOURNAL_CLEAN) |	\
+	 (1L << MD_HAS_PPL))
 
 /*
  * Number of guaranteed r1bios in case of extreme VM load:
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index 9930f3e9040f..fe2112810c43 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -242,10 +242,18 @@ struct mdp_superblock_1 {
 
 	__le32	chunksize;	/* in 512byte sectors */
 	__le32	raid_disks;
-	__le32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
-				 * NOTE: signed, so bitmap can be before superblock
-				 * only meaningful of feature_map[0] is set.
-				 */
+	union {
+		__le32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
+					 * NOTE: signed, so bitmap can be before superblock
+					 * only meaningful of feature_map[0] is set.
+					 */
+
+		/* only meaningful when feature_map[MD_FEATURE_PPL] is set */
+		struct {
+			__le16 offset; /* sectors from start of superblock that ppl starts (signed) */
+			__le16 size; /* ppl size in sectors */
+		} ppl;
+	};
 
 	/* These are only valid with feature bit '4' */
 	__le32	new_level;	/* new level we are reshaping to		*/
@@ -318,6 +326,7 @@ struct mdp_superblock_1 {
 					     */
 #define MD_FEATURE_CLUSTERED		256 /* clustered MD */
 #define	MD_FEATURE_JOURNAL		512 /* support write cache */
+#define	MD_FEATURE_PPL			1024 /* support PPL */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -328,6 +337,7 @@ struct mdp_superblock_1 {
 					|MD_FEATURE_RECOVERY_BITMAP	\
 					|MD_FEATURE_CLUSTERED		\
 					|MD_FEATURE_JOURNAL		\
+					|MD_FEATURE_PPL			\
 					)
 
 struct r5l_payload_header {
-- 
2.11.0



* [PATCH v5 2/7] raid5: separate header for log functions
  2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
  2017-03-09  8:59 ` [PATCH v5 1/7] md: superblock changes for PPL Artur Paszkiewicz
@ 2017-03-09  8:59 ` Artur Paszkiewicz
  2017-03-09  8:59 ` [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation Artur Paszkiewicz
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-09  8:59 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

Move raid5-cache declarations from raid5.h to raid5-log.h, add inline
wrappers for functions which will be shared with ppl and use them in
raid5 core instead of direct calls to raid5-cache.

Remove unused parameter from r5c_cache_data(), move two duplicated
pr_debug() calls to r5l_init_log().

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-cache.c | 22 ++++++++++---
 drivers/md/raid5-log.h   | 81 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       | 48 ++++++++--------------------
 drivers/md/raid5.h       | 30 ------------------
 4 files changed, 112 insertions(+), 69 deletions(-)
 create mode 100644 drivers/md/raid5-log.h

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3f307be01b10..ac4e74ef1f7f 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -343,6 +343,8 @@ void r5c_handle_cached_data_endio(struct r5conf *conf,
 	}
 }
 
+void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
+
 /* Check whether we should flush some stripes to free up stripe cache */
 void r5c_check_stripe_cache_usage(struct r5conf *conf)
 {
@@ -2620,9 +2622,7 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 	}
 }
 
-int
-r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
-	       struct stripe_head_state *s)
+int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 	int pages = 0;
@@ -2785,6 +2785,10 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 {
 	struct request_queue *q = bdev_get_queue(rdev->bdev);
 	struct r5l_log *log;
+	char b[BDEVNAME_SIZE];
+
+	pr_debug("md/raid:%s: using device %s as journal\n",
+		 mdname(conf->mddev), bdevname(rdev->bdev, b));
 
 	if (PAGE_SIZE != 4096)
 		return -EINVAL;
@@ -2887,7 +2891,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	return -EINVAL;
 }
 
-void r5l_exit_log(struct r5l_log *log)
+static void __r5l_exit_log(struct r5l_log *log)
 {
 	flush_work(&log->disable_writeback_work);
 	md_unregister_thread(&log->reclaim_thread);
@@ -2897,3 +2901,13 @@ void r5l_exit_log(struct r5l_log *log)
 	kmem_cache_destroy(log->io_kc);
 	kfree(log);
 }
+
+void r5l_exit_log(struct r5conf *conf)
+{
+	struct r5l_log *log = conf->log;
+
+	conf->log = NULL;
+	synchronize_rcu();
+
+	__r5l_exit_log(log);
+}
diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
new file mode 100644
index 000000000000..2da4bd3bbd79
--- /dev/null
+++ b/drivers/md/raid5-log.h
@@ -0,0 +1,81 @@
+#ifndef _RAID5_LOG_H
+#define _RAID5_LOG_H
+
+extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
+extern void r5l_exit_log(struct r5conf *conf);
+extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
+extern void r5l_write_stripe_run(struct r5l_log *log);
+extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
+extern void r5l_stripe_write_finished(struct stripe_head *sh);
+extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
+extern void r5l_quiesce(struct r5l_log *log, int state);
+extern bool r5l_log_disk_error(struct r5conf *conf);
+extern bool r5c_is_writeback(struct r5l_log *log);
+extern int
+r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
+		      struct stripe_head_state *s, int disks);
+extern void
+r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
+			    struct stripe_head_state *s);
+extern void r5c_release_extra_page(struct stripe_head *sh);
+extern void r5c_use_extra_page(struct stripe_head *sh);
+extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
+extern void r5c_handle_cached_data_endio(struct r5conf *conf,
+	struct stripe_head *sh, int disks, struct bio_list *return_bi);
+extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh);
+extern void r5c_make_stripe_write_out(struct stripe_head *sh);
+extern void r5c_flush_cache(struct r5conf *conf, int num);
+extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
+extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+extern struct md_sysfs_entry r5c_journal_mode;
+extern void r5c_update_on_rdev_error(struct mddev *mddev);
+extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+
+static inline int log_stripe(struct stripe_head *sh, struct stripe_head_state *s)
+{
+	struct r5conf *conf = sh->raid_conf;
+
+	if (conf->log) {
+		if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
+			/* writing out phase */
+			if (s->waiting_extra_page)
+				return 0;
+			return r5l_write_stripe(conf->log, sh);
+		} else if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
+			/* caching phase */
+			return r5c_cache_data(conf->log, sh);
+		}
+	}
+
+	return -EAGAIN;
+}
+
+static inline void log_stripe_write_finished(struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+
+	if (conf->log)
+		r5l_stripe_write_finished(sh);
+}
+
+static inline void log_write_stripe_run(struct r5conf *conf)
+{
+	if (conf->log)
+		r5l_write_stripe_run(conf->log);
+}
+
+static inline void log_exit(struct r5conf *conf)
+{
+	if (conf->log)
+		r5l_exit_log(conf);
+}
+
+static inline int log_init(struct r5conf *conf, struct md_rdev *journal_dev)
+{
+	if (journal_dev)
+		return r5l_init_log(conf, journal_dev);
+
+	return 0;
+}
+
+#endif
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5d9148125ec5..043a509560c2 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -61,6 +61,7 @@
 #include "raid5.h"
 #include "raid0.h"
 #include "bitmap.h"
+#include "raid5-log.h"
 
 #define UNSUPPORTED_MDDEV_FLAGS	(1L << MD_FAILFAST_SUPPORTED)
 
@@ -911,18 +912,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	might_sleep();
 
-	if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
-		/* writing out phase */
-		if (s->waiting_extra_page)
-			return;
-		if (r5l_write_stripe(conf->log, sh) == 0)
-			return;
-	} else {  /* caching phase */
-		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state)) {
-			r5c_cache_data(conf->log, sh, s);
-			return;
-		}
-	}
+	if (log_stripe(sh, s) == 0)
+		return;
 
 	for (i = disks; i--; ) {
 		int op, op_flags = 0;
@@ -3247,7 +3238,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		if (bi)
 			bitmap_end = 1;
 
-		r5l_stripe_write_finished(sh);
+		log_stripe_write_finished(sh);
 
 		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
 			wake_up(&conf->wait_for_overlap);
@@ -3666,7 +3657,7 @@ static void handle_stripe_clean_event(struct r5conf *conf,
 				discard_pending = 1;
 		}
 
-	r5l_stripe_write_finished(sh);
+	log_stripe_write_finished(sh);
 
 	if (!discard_pending &&
 	    test_bit(R5_Discard, &sh->dev[sh->pd_idx].flags)) {
@@ -4656,7 +4647,7 @@ static void handle_stripe(struct stripe_head *sh)
 
 	if (s.just_cached)
 		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
-	r5l_stripe_write_finished(sh);
+	log_stripe_write_finished(sh);
 
 	/* Now we might consider reading some blocks, either to check/generate
 	 * parity, or to satisfy requests
@@ -6057,7 +6048,7 @@ static int handle_active_stripes(struct r5conf *conf, int group,
 
 	for (i = 0; i < batch_size; i++)
 		handle_stripe(batch[i]);
-	r5l_write_stripe_run(conf->log);
+	log_write_stripe_run(conf);
 
 	cond_resched();
 
@@ -6633,8 +6624,8 @@ static void free_conf(struct r5conf *conf)
 {
 	int i;
 
-	if (conf->log)
-		r5l_exit_log(conf->log);
+	log_exit(conf);
+
 	if (conf->shrinker.nr_deferred)
 		unregister_shrinker(&conf->shrinker);
 
@@ -7315,14 +7306,8 @@ static int raid5_run(struct mddev *mddev)
 		blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);
 	}
 
-	if (journal_dev) {
-		char b[BDEVNAME_SIZE];
-
-		pr_debug("md/raid:%s: using device %s as journal\n",
-			 mdname(mddev), bdevname(journal_dev->bdev, b));
-		if (r5l_init_log(conf, journal_dev))
-			goto abort;
-	}
+	if (log_init(conf, journal_dev))
+		goto abort;
 
 	return 0;
 abort:
@@ -7436,17 +7421,13 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 
 	print_raid5_conf(conf);
 	if (test_bit(Journal, &rdev->flags) && conf->log) {
-		struct r5l_log *log;
 		/*
 		 * we can't wait pending write here, as this is called in
 		 * raid5d, wait will deadlock.
 		 */
 		if (atomic_read(&mddev->writes_pending))
 			return -EBUSY;
-		log = conf->log;
-		conf->log = NULL;
-		synchronize_rcu();
-		r5l_exit_log(log);
+		log_exit(conf);
 		return 0;
 	}
 	if (rdev == p->rdev)
@@ -7515,7 +7496,6 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 	int last = conf->raid_disks - 1;
 
 	if (test_bit(Journal, &rdev->flags)) {
-		char b[BDEVNAME_SIZE];
 		if (conf->log)
 			return -EBUSY;
 
@@ -7524,9 +7504,7 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * The array is in readonly mode if journal is missing, so no
 		 * write requests running. We should be safe
 		 */
-		r5l_init_log(conf, rdev);
-		pr_debug("md/raid:%s: using device %s as journal\n",
-			 mdname(mddev), bdevname(rdev->bdev, b));
+		log_init(conf, rdev);
 		return 0;
 	}
 	if (mddev->recovery_disabled == conf->recovery_disabled)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 4bb27b97bf6b..749c6c496e7d 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -765,34 +765,4 @@ extern struct stripe_head *
 raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 			int previous, int noblock, int noquiesce);
 extern int raid5_calc_degraded(struct r5conf *conf);
-extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
-extern void r5l_exit_log(struct r5l_log *log);
-extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
-extern void r5l_write_stripe_run(struct r5l_log *log);
-extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
-extern void r5l_stripe_write_finished(struct stripe_head *sh);
-extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
-extern void r5l_quiesce(struct r5l_log *log, int state);
-extern bool r5l_log_disk_error(struct r5conf *conf);
-extern bool r5c_is_writeback(struct r5l_log *log);
-extern int
-r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
-		      struct stripe_head_state *s, int disks);
-extern void
-r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
-			    struct stripe_head_state *s);
-extern void r5c_release_extra_page(struct stripe_head *sh);
-extern void r5c_use_extra_page(struct stripe_head *sh);
-extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
-extern void r5c_handle_cached_data_endio(struct r5conf *conf,
-	struct stripe_head *sh, int disks, struct bio_list *return_bi);
-extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
-			  struct stripe_head_state *s);
-extern void r5c_make_stripe_write_out(struct stripe_head *sh);
-extern void r5c_flush_cache(struct r5conf *conf, int num);
-extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
-extern void r5c_check_cached_full_stripe(struct r5conf *conf);
-extern struct md_sysfs_entry r5c_journal_mode;
-extern void r5c_update_on_rdev_error(struct mddev *mddev);
-extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
 #endif
-- 
2.11.0



* [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
  2017-03-09  8:59 ` [PATCH v5 1/7] md: superblock changes for PPL Artur Paszkiewicz
  2017-03-09  8:59 ` [PATCH v5 2/7] raid5: separate header for log functions Artur Paszkiewicz
@ 2017-03-09  8:59 ` Artur Paszkiewicz
  2017-03-09 23:24   ` Shaohua Li
                     ` (2 more replies)
  2017-03-09  9:00 ` [PATCH v5 4/7] md: add sysfs entries for PPL Artur Paszkiewicz
                   ` (4 subsequent siblings)
  7 siblings, 3 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-09  8:59 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

Implement the calculation of partial parity for a stripe and the PPL write
logging functionality. A description of PPL is added to the documentation.
More details can be found in the comments in raid5-ppl.c.

Attach a page for holding the partial parity data to stripe_head.
Allocate it only if mddev has the MD_HAS_PPL flag set.

Partial parity is the XOR of the data chunks of a stripe that are not modified
by the write request. It is calculated as follows (see the sketch after the
list):

- reconstruct-write case:
  xor data from all not updated disks in a stripe

- read-modify-write case:
  xor old data and parity from all updated disks in a stripe
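
Both cases produce the same value because the old parity is itself the XOR of
all old data chunks. A standalone sketch with single-byte "chunks" (an
illustration only, not part of the patch):

#include <assert.h>
#include <stdint.h>

int main(void)
{
	uint8_t d0 = 0x5a, d1 = 0xc3, d2 = 0x0f, d3 = 0x81;	/* old data */
	uint8_t parity = d0 ^ d1 ^ d2 ^ d3;			/* old parity */

	/* The write modifies d0 and d1. */
	uint8_t pp_rcw = d2 ^ d3;		/* data from not updated disks */
	uint8_t pp_rmw = d0 ^ d1 ^ parity;	/* old data and parity from
						 * the updated disks */

	assert(pp_rcw == pp_rmw);
	return 0;
}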

Implement it using the async_tx API and integrate it into raid_run_ops().
It must be called while we still have access to the old data, so do it when
STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
stored in sh->ppl_page.

Partial parity is not meaningful for a full stripe write and is not stored in
the log or used for recovery, so don't attempt to calculate it when the stripe
has STRIPE_FULL_WRITE set.

Put the PPL metadata structures in md_p.h because userspace tools (mdadm) will
also need to read/write PPL.
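
As a sanity check of the on-disk layout added to md_p.h below, a small
userspace sketch (illustration only; the structures mirror the patch using
stdint types, and the asserted sizes follow directly from the definitions):

#include <stdint.h>

struct ppl_header_entry {
	uint64_t data_sector;	/* raid sector of the new data */
	uint32_t pp_size;	/* length of partial parity */
	uint32_t data_size;	/* length of data */
	uint32_t parity_disk;	/* member disk containing parity */
	uint32_t checksum;	/* ~crc32c of the partial parity */
} __attribute__((__packed__));

#define PPL_HEADER_SIZE 4096
#define PPL_HDR_RESERVED 512
#define PPL_HDR_ENTRY_SPACE \
	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(uint32_t) - sizeof(uint64_t))
#define PPL_HDR_MAX_ENTRIES \
	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))

struct ppl_header {
	uint8_t  reserved[PPL_HDR_RESERVED];	/* filled with 0xff */
	uint32_t signature;
	uint32_t padding;
	uint64_t generation;
	uint32_t entries_count;
	uint32_t checksum;
	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
} __attribute__((__packed__));

/* One entry is 24 bytes, 148 entries fit, and the header (including the
 * 512-byte reserved area) stays within the 4 KiB on-disk header size. */
_Static_assert(sizeof(struct ppl_header_entry) == 24, "entry size");
_Static_assert(PPL_HDR_MAX_ENTRIES == 148, "max entries");
_Static_assert(sizeof(struct ppl_header) <= PPL_HEADER_SIZE, "header fits");

int main(void) { return 0; }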

For now, warn when PPL is used while the disks' volatile write-back cache is
enabled. The warning can be removed once flushing the disk cache before
writing the PPL is implemented.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Documentation/md/raid5-ppl.txt |  44 +++
 drivers/md/Makefile            |   2 +-
 drivers/md/raid5-log.h         |  24 ++
 drivers/md/raid5-ppl.c         | 703 +++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c             |  64 +++-
 drivers/md/raid5.h             |  10 +-
 include/uapi/linux/raid/md_p.h |  27 ++
 7 files changed, 869 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/md/raid5-ppl.txt
 create mode 100644 drivers/md/raid5-ppl.c

diff --git a/Documentation/md/raid5-ppl.txt b/Documentation/md/raid5-ppl.txt
new file mode 100644
index 000000000000..127072b09363
--- /dev/null
+++ b/Documentation/md/raid5-ppl.txt
@@ -0,0 +1,44 @@
+Partial Parity Log
+
+Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
+addressed by PPL is that after a dirty shutdown, parity of a particular stripe
+may become inconsistent with data on other member disks. If the array is also
+in degraded state, there is no way to recalculate parity, because one of the
+disks is missing. This can lead to silent data corruption when rebuilding the
+array or using it as degraded - data calculated from parity for array blocks
+that have not been touched by a write request during the unclean shutdown can
+be incorrect. Such condition is known as the RAID5 Write Hole. Because of
+this, md by default does not allow starting a dirty degraded array.
+
+Partial parity for a write operation is the XOR of stripe data chunks not
+modified by this write. It is just enough data needed for recovering from the
+write hole. XORing partial parity with the modified chunks produces parity for
+the stripe, consistent with its state before the write operation, regardless of
+which chunk writes have completed. If one of the not modified data disks of
+this stripe is missing, this updated parity can be used to recover its
+contents. PPL recovery is also performed when starting an array after an
+unclean shutdown and all disks are available, eliminating the need to resync
+the array. Because of this, using write-intent bitmap and PPL together is not
+supported.
+
+When handling a write request PPL writes partial parity before new data and
+parity are dispatched to disks. PPL is a distributed log - it is stored on
+array member drives in the metadata area, on the parity drive of a particular
+stripe.  It does not require a dedicated journaling drive. Write performance is
+reduced by up to 30%-40% but it scales with the number of drives in the array
+and the journaling drive does not become a bottleneck or a single point of
+failure.
+
+Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
+not a true journal. It does not protect from losing in-flight data, only from
+silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
+performed for this stripe (parity is not updated). So it is possible to have
+arbitrary data in the written part of a stripe if that disk is lost. In such
+case the behavior is the same as in plain raid5.
+
+PPL is available for md version-1 metadata and external (specifically IMSM)
+metadata arrays. It can be enabled using mdadm option --consistency-policy=ppl.
+
+Currently, volatile write-back cache should be disabled on all member drives
+when using PPL. Otherwise it cannot guarantee consistency in case of power
+failure.
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 3cbda1af87a0..4d48714ccc6b 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 dm-era-y	+= dm-era-target.o
 dm-verity-y	+= dm-verity-target.o
 md-mod-y	+= md.o bitmap.o
-raid456-y	+= raid5.o raid5-cache.o
+raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 
 # Note: link order is important.  All raid personalities
 # and must come before md.o, as they each initialise 
diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index 2da4bd3bbd79..a67fb58513b9 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -31,6 +31,20 @@ extern struct md_sysfs_entry r5c_journal_mode;
 extern void r5c_update_on_rdev_error(struct mddev *mddev);
 extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
 
+extern struct dma_async_tx_descriptor *
+ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
+		       struct dma_async_tx_descriptor *tx);
+extern int ppl_init_log(struct r5conf *conf);
+extern void ppl_exit_log(struct r5conf *conf);
+extern int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
+extern void ppl_write_stripe_run(struct r5conf *conf);
+extern void ppl_stripe_write_finished(struct stripe_head *sh);
+
+static inline bool raid5_has_ppl(struct r5conf *conf)
+{
+	return test_bit(MD_HAS_PPL, &conf->mddev->flags);
+}
+
 static inline int log_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 {
 	struct r5conf *conf = sh->raid_conf;
@@ -45,6 +59,8 @@ static inline int log_stripe(struct stripe_head *sh, struct stripe_head_state *s
 			/* caching phase */
 			return r5c_cache_data(conf->log, sh);
 		}
+	} else if (raid5_has_ppl(conf)) {
+		return ppl_write_stripe(conf, sh);
 	}
 
 	return -EAGAIN;
@@ -56,24 +72,32 @@ static inline void log_stripe_write_finished(struct stripe_head *sh)
 
 	if (conf->log)
 		r5l_stripe_write_finished(sh);
+	else if (raid5_has_ppl(conf))
+		ppl_stripe_write_finished(sh);
 }
 
 static inline void log_write_stripe_run(struct r5conf *conf)
 {
 	if (conf->log)
 		r5l_write_stripe_run(conf->log);
+	else if (raid5_has_ppl(conf))
+		ppl_write_stripe_run(conf);
 }
 
 static inline void log_exit(struct r5conf *conf)
 {
 	if (conf->log)
 		r5l_exit_log(conf);
+	else if (raid5_has_ppl(conf))
+		ppl_exit_log(conf);
 }
 
 static inline int log_init(struct r5conf *conf, struct md_rdev *journal_dev)
 {
 	if (journal_dev)
 		return r5l_init_log(conf, journal_dev);
+	else if (raid5_has_ppl(conf))
+		return ppl_init_log(conf);
 
 	return 0;
 }
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
new file mode 100644
index 000000000000..92783586743d
--- /dev/null
+++ b/drivers/md/raid5-ppl.c
@@ -0,0 +1,703 @@
+/*
+ * Partial Parity Log for closing the RAID5 write hole
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include <linux/crc32c.h>
+#include <linux/flex_array.h>
+#include <linux/async_tx.h>
+#include <linux/raid/md_p.h>
+#include "md.h"
+#include "raid5.h"
+
+/*
+ * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
+ * partial parity data. The header contains an array of entries
+ * (struct ppl_header_entry) which describe the logged write requests.
+ * Partial parity for the entries comes after the header, written in the same
+ * sequence as the entries:
+ *
+ * Header
+ *   entry0
+ *   ...
+ *   entryN
+ * PP data
+ *   PP for entry0
+ *   ...
+ *   PP for entryN
+ *
+ * An entry describes one or more consecutive stripe_heads, up to a full
+ * stripe. The modified raid data chunks form an m-by-n matrix, where m is the
+ * number of stripe_heads in the entry and n is the number of modified data
+ * disks. Every stripe_head in the entry must write to the same data disks.
+ * An example of a valid case described by a single entry (writes to the first
+ * stripe of a 4 disk array, 16k chunk size):
+ *
+ * sh->sector   dd0   dd1   dd2    ppl
+ *            +-----+-----+-----+
+ * 0          | --- | --- | --- | +----+
+ * 8          | -W- | -W- | --- | | pp |   data_sector = 8
+ * 16         | -W- | -W- | --- | | pp |   data_size = 3 * 2 * 4k
+ * 24         | -W- | -W- | --- | | pp |   pp_size = 3 * 4k
+ *            +-----+-----+-----+ +----+
+ *
+ * data_sector is the first raid sector of the modified data, data_size is the
+ * total size of modified data and pp_size is the size of partial parity for
+ * this entry. Entries for full stripe writes contain no partial parity
+ * (pp_size = 0), they only mark the stripes for which parity should be
+ * recalculated after an unclean shutdown. Every entry holds a checksum of its
+ * partial parity, the header also has a checksum of the header itself.
+ *
+ * A write request is always logged to the PPL instance stored on the parity
+ * disk of the corresponding stripe. For each member disk there is one ppl_log
+ * used to handle logging for this disk, independently from others. They are
+ * grouped in child_logs array in struct ppl_conf, which is assigned to
+ * r5conf->log_private.
+ *
+ * ppl_io_unit represents a full PPL write, header_page contains the ppl_header.
+ * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe_head
+ * can be appended to the last entry if it meets the conditions for a valid
+ * entry described above, otherwise a new entry is added. Checksums of entries
+ * are calculated incrementally as stripes containing partial parity are being
+ * added. ppl_submit_iounit() calculates the checksum of the header and submits
+ * a bio containing the header page and partial parity pages (sh->ppl_page) for
+ * all stripes of the io_unit. When the PPL write completes, the stripes
+ * associated with the io_unit are released and raid5d starts writing their data
+ * and parity. When all stripes are written, the io_unit is freed and the next
+ * can be submitted.
+ *
+ * An io_unit is used to gather stripes until it is submitted or becomes full
+ * (if the maximum number of entries or size of PPL is reached). Another io_unit
+ * can't be submitted until the previous has completed (PPL and stripe
+ * data+parity is written). The log->io_list tracks all io_units of a log
+ * (for a single member disk). New io_units are added to the end of the list
+ * and the first io_unit is submitted, if it is not submitted already.
+ * The current io_unit accepting new stripes is always at the end of the list.
+ */
+
+struct ppl_conf {
+	struct mddev *mddev;
+
+	/* array of child logs, one for each raid disk */
+	struct ppl_log *child_logs;
+	int count;
+
+	int block_size;		/* the logical block size used for data_sector
+				 * in ppl_header_entry */
+	u32 signature;		/* raid array identifier */
+	atomic64_t seq;		/* current log write sequence number */
+
+	struct kmem_cache *io_kc;
+	mempool_t *io_pool;
+	struct bio_set *bs;
+	mempool_t *meta_pool;
+};
+
+struct ppl_log {
+	struct ppl_conf *ppl_conf;	/* shared between all log instances */
+
+	struct md_rdev *rdev;		/* array member disk associated with
+					 * this log instance */
+	struct mutex io_mutex;
+	struct ppl_io_unit *current_io;	/* current io_unit accepting new data
+					 * always at the end of io_list */
+	spinlock_t io_list_lock;
+	struct list_head io_list;	/* all io_units of this log */
+	struct list_head no_mem_stripes;/* stripes to retry if failed to
+					 * allocate io_unit */
+};
+
+#define PPL_IO_INLINE_BVECS 32
+
+struct ppl_io_unit {
+	struct ppl_log *log;
+
+	struct page *header_page;	/* for ppl_header */
+
+	unsigned int entries_count;	/* number of entries in ppl_header */
+	unsigned int pp_size;		/* total size of current partial parity */
+
+	u64 seq;			/* sequence number of this log write */
+	struct list_head log_sibling;	/* log->io_list */
+
+	struct list_head stripe_list;	/* stripes added to the io_unit */
+	atomic_t pending_stripes;	/* how many stripes not written to raid */
+
+	bool submitted;			/* true if write to log started */
+
+	/* inline bio and its biovec for submitting the iounit */
+	struct bio bio;
+	struct bio_vec biovec[PPL_IO_INLINE_BVECS];
+};
+
+struct dma_async_tx_descriptor *
+ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
+		       struct dma_async_tx_descriptor *tx)
+{
+	int disks = sh->disks;
+	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
+	int count = 0, pd_idx = sh->pd_idx, i;
+	struct async_submit_ctl submit;
+
+	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
+
+	/*
+	 * Partial parity is the XOR of stripe data chunks that are not changed
+	 * during the write request. Depending on available data
+	 * (read-modify-write vs. reconstruct-write case) we calculate it
+	 * differently.
+	 */
+	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
+		/* rmw: xor old data and parity from updated disks */
+		for (i = disks; i--;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
+				xor_srcs[count++] = dev->page;
+		}
+	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
+		/* rcw: xor data from all not updated disks */
+		for (i = disks; i--;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_UPTODATE, &dev->flags))
+				xor_srcs[count++] = dev->page;
+		}
+	} else {
+		return tx;
+	}
+
+	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
+			  flex_array_get(percpu->scribble, 0)
+			  + sizeof(struct page *) * (sh->disks + 2));
+
+	if (count == 1)
+		tx = async_memcpy(sh->ppl_page, xor_srcs[0], 0, 0, PAGE_SIZE,
+				  &submit);
+	else
+		tx = async_xor(sh->ppl_page, xor_srcs, 0, count, PAGE_SIZE,
+			       &submit);
+
+	return tx;
+}
+
+static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
+					  struct stripe_head *sh)
+{
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct ppl_io_unit *io;
+	struct ppl_header *pplhdr;
+
+	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
+	if (!io)
+		return NULL;
+
+	memset(io, 0, sizeof(*io));
+	io->log = log;
+	INIT_LIST_HEAD(&io->log_sibling);
+	INIT_LIST_HEAD(&io->stripe_list);
+	atomic_set(&io->pending_stripes, 0);
+	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
+
+	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
+	pplhdr = page_address(io->header_page);
+	clear_page(pplhdr);
+	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
+	pplhdr->signature = cpu_to_le32(ppl_conf->signature);
+
+	io->seq = atomic64_add_return(1, &ppl_conf->seq);
+	pplhdr->generation = cpu_to_le64(io->seq);
+
+	return io;
+}
+
+static int ppl_log_stripe(struct ppl_log *log, struct stripe_head *sh)
+{
+	struct ppl_io_unit *io = log->current_io;
+	struct ppl_header_entry *e = NULL;
+	struct ppl_header *pplhdr;
+	int i;
+	sector_t data_sector = 0;
+	int data_disks = 0;
+	unsigned int entry_space = (log->rdev->ppl.size << 9) - PPL_HEADER_SIZE;
+	struct r5conf *conf = sh->raid_conf;
+
+	pr_debug("%s: stripe: %llu\n", __func__, (unsigned long long)sh->sector);
+
+	/* check if current io_unit is full */
+	if (io && (io->pp_size == entry_space ||
+		   io->entries_count == PPL_HDR_MAX_ENTRIES)) {
+		pr_debug("%s: add io_unit blocked by seq: %llu\n",
+			 __func__, io->seq);
+		io = NULL;
+	}
+
+	/* add a new unit if there is none or the current is full */
+	if (!io) {
+		io = ppl_new_iounit(log, sh);
+		if (!io)
+			return -ENOMEM;
+		spin_lock_irq(&log->io_list_lock);
+		list_add_tail(&io->log_sibling, &log->io_list);
+		spin_unlock_irq(&log->io_list_lock);
+
+		log->current_io = io;
+	}
+
+	for (i = 0; i < sh->disks; i++) {
+		struct r5dev *dev = &sh->dev[i];
+
+		if (i != sh->pd_idx && test_bit(R5_Wantwrite, &dev->flags)) {
+			if (!data_disks || dev->sector < data_sector)
+				data_sector = dev->sector;
+			data_disks++;
+		}
+	}
+	BUG_ON(!data_disks);
+
+	pr_debug("%s: seq: %llu data_sector: %llu data_disks: %d\n", __func__,
+		 io->seq, (unsigned long long)data_sector, data_disks);
+
+	pplhdr = page_address(io->header_page);
+
+	if (io->entries_count > 0) {
+		struct ppl_header_entry *last =
+				&pplhdr->entries[io->entries_count - 1];
+		struct stripe_head *sh_last = list_last_entry(
+				&io->stripe_list, struct stripe_head, log_list);
+		u64 data_sector_last = le64_to_cpu(last->data_sector);
+		u32 data_size_last = le32_to_cpu(last->data_size);
+
+		/*
+		 * Check if we can append the stripe to the last entry. It must
+		 * be just after the last logged stripe and write to the same
+		 * disks. Use bit shift and logarithm to avoid 64-bit division.
+		 */
+		if ((sh->sector == sh_last->sector + STRIPE_SECTORS) &&
+		    (data_sector >> ilog2(conf->chunk_sectors) ==
+		     data_sector_last >> ilog2(conf->chunk_sectors)) &&
+		    ((data_sector - data_sector_last) * data_disks ==
+		     data_size_last >> 9))
+			e = last;
+	}
+
+	if (!e) {
+		e = &pplhdr->entries[io->entries_count++];
+		e->data_sector = cpu_to_le64(data_sector);
+		e->parity_disk = cpu_to_le32(sh->pd_idx);
+		e->checksum = cpu_to_le32(~0);
+	}
+
+	le32_add_cpu(&e->data_size, data_disks << PAGE_SHIFT);
+
+	/* don't write any PP if full stripe write */
+	if (!test_bit(STRIPE_FULL_WRITE, &sh->state)) {
+		le32_add_cpu(&e->pp_size, PAGE_SIZE);
+		io->pp_size += PAGE_SIZE;
+		e->checksum = cpu_to_le32(crc32c_le(le32_to_cpu(e->checksum),
+						    page_address(sh->ppl_page),
+						    PAGE_SIZE));
+	}
+
+	list_add_tail(&sh->log_list, &io->stripe_list);
+	atomic_inc(&io->pending_stripes);
+	sh->ppl_io = io;
+
+	return 0;
+}
+
+int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh)
+{
+	struct ppl_conf *ppl_conf = conf->log_private;
+	struct ppl_io_unit *io = sh->ppl_io;
+	struct ppl_log *log;
+
+	if (io || test_bit(STRIPE_SYNCING, &sh->state) ||
+	    !test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags) ||
+	    !test_bit(R5_Insync, &sh->dev[sh->pd_idx].flags)) {
+		clear_bit(STRIPE_LOG_TRAPPED, &sh->state);
+		return -EAGAIN;
+	}
+
+	log = &ppl_conf->child_logs[sh->pd_idx];
+
+	mutex_lock(&log->io_mutex);
+
+	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
+		mutex_unlock(&log->io_mutex);
+		return -EAGAIN;
+	}
+
+	set_bit(STRIPE_LOG_TRAPPED, &sh->state);
+	clear_bit(STRIPE_DELAYED, &sh->state);
+	atomic_inc(&sh->count);
+
+	if (ppl_log_stripe(log, sh)) {
+		spin_lock_irq(&log->io_list_lock);
+		list_add_tail(&sh->log_list, &log->no_mem_stripes);
+		spin_unlock_irq(&log->io_list_lock);
+	}
+
+	mutex_unlock(&log->io_mutex);
+
+	return 0;
+}
+
+static void ppl_log_endio(struct bio *bio)
+{
+	struct ppl_io_unit *io = bio->bi_private;
+	struct ppl_log *log = io->log;
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct stripe_head *sh, *next;
+
+	pr_debug("%s: seq: %llu\n", __func__, io->seq);
+
+	if (bio->bi_error)
+		md_error(ppl_conf->mddev, log->rdev);
+
+	mempool_free(io->header_page, ppl_conf->meta_pool);
+
+	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
+		list_del_init(&sh->log_list);
+
+		set_bit(STRIPE_HANDLE, &sh->state);
+		raid5_release_stripe(sh);
+	}
+}
+
+static void ppl_submit_iounit_bio(struct ppl_io_unit *io, struct bio *bio)
+{
+	char b[BDEVNAME_SIZE];
+
+	pr_debug("%s: seq: %llu size: %u sector: %llu dev: %s\n",
+		 __func__, io->seq, bio->bi_iter.bi_size,
+		 (unsigned long long)bio->bi_iter.bi_sector,
+		 bdevname(bio->bi_bdev, b));
+
+	submit_bio(bio);
+}
+
+static void ppl_submit_iounit(struct ppl_io_unit *io)
+{
+	struct ppl_log *log = io->log;
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct ppl_header *pplhdr = page_address(io->header_page);
+	struct bio *bio = &io->bio;
+	struct stripe_head *sh;
+	int i;
+
+	for (i = 0; i < io->entries_count; i++) {
+		struct ppl_header_entry *e = &pplhdr->entries[i];
+
+		pr_debug("%s: seq: %llu entry: %d data_sector: %llu pp_size: %u data_size: %u\n",
+			 __func__, io->seq, i, le64_to_cpu(e->data_sector),
+			 le32_to_cpu(e->pp_size), le32_to_cpu(e->data_size));
+
+		e->data_sector = cpu_to_le64(le64_to_cpu(e->data_sector) >>
+					     ilog2(ppl_conf->block_size >> 9));
+		e->checksum = cpu_to_le32(~le32_to_cpu(e->checksum));
+	}
+
+	pplhdr->entries_count = cpu_to_le32(io->entries_count);
+	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE));
+
+	bio->bi_private = io;
+	bio->bi_end_io = ppl_log_endio;
+	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
+	bio->bi_bdev = log->rdev->bdev;
+	bio->bi_iter.bi_sector = log->rdev->ppl.sector;
+	bio_add_page(bio, io->header_page, PAGE_SIZE, 0);
+
+	list_for_each_entry(sh, &io->stripe_list, log_list) {
+		/* entries for full stripe writes have no partial parity */
+		if (test_bit(STRIPE_FULL_WRITE, &sh->state))
+			continue;
+
+		if (!bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0)) {
+			struct bio *prev = bio;
+
+			bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
+					       ppl_conf->bs);
+			bio->bi_opf = prev->bi_opf;
+			bio->bi_bdev = prev->bi_bdev;
+			bio->bi_iter.bi_sector = bio_end_sector(prev);
+			bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
+
+			bio_chain(bio, prev);
+			ppl_submit_iounit_bio(io, prev);
+		}
+	}
+
+	ppl_submit_iounit_bio(io, bio);
+}
+
+static void ppl_submit_current_io(struct ppl_log *log)
+{
+	struct ppl_io_unit *io;
+
+	spin_lock_irq(&log->io_list_lock);
+
+	io = list_first_entry_or_null(&log->io_list, struct ppl_io_unit,
+				      log_sibling);
+	if (io && io->submitted)
+		io = NULL;
+
+	spin_unlock_irq(&log->io_list_lock);
+
+	if (io) {
+		io->submitted = true;
+
+		if (io == log->current_io)
+			log->current_io = NULL;
+
+		ppl_submit_iounit(io);
+	}
+}
+
+void ppl_write_stripe_run(struct r5conf *conf)
+{
+	struct ppl_conf *ppl_conf = conf->log_private;
+	struct ppl_log *log;
+	int i;
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		log = &ppl_conf->child_logs[i];
+
+		mutex_lock(&log->io_mutex);
+		ppl_submit_current_io(log);
+		mutex_unlock(&log->io_mutex);
+	}
+}
+
+static void ppl_io_unit_finished(struct ppl_io_unit *io)
+{
+	struct ppl_log *log = io->log;
+	unsigned long flags;
+
+	pr_debug("%s: seq: %llu\n", __func__, io->seq);
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+
+	list_del(&io->log_sibling);
+	mempool_free(io, log->ppl_conf->io_pool);
+
+	if (!list_empty(&log->no_mem_stripes)) {
+		struct stripe_head *sh = list_first_entry(&log->no_mem_stripes,
+							  struct stripe_head,
+							  log_list);
+		list_del_init(&sh->log_list);
+		set_bit(STRIPE_HANDLE, &sh->state);
+		raid5_release_stripe(sh);
+	}
+
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+}
+
+void ppl_stripe_write_finished(struct stripe_head *sh)
+{
+	struct ppl_io_unit *io;
+
+	io = sh->ppl_io;
+	sh->ppl_io = NULL;
+
+	if (io && atomic_dec_and_test(&io->pending_stripes))
+		ppl_io_unit_finished(io);
+}
+
+static void __ppl_exit_log(struct ppl_conf *ppl_conf)
+{
+	clear_bit(MD_HAS_PPL, &ppl_conf->mddev->flags);
+
+	kfree(ppl_conf->child_logs);
+
+	mempool_destroy(ppl_conf->meta_pool);
+	if (ppl_conf->bs)
+		bioset_free(ppl_conf->bs);
+	mempool_destroy(ppl_conf->io_pool);
+	kmem_cache_destroy(ppl_conf->io_kc);
+
+	kfree(ppl_conf);
+}
+
+void ppl_exit_log(struct r5conf *conf)
+{
+	struct ppl_conf *ppl_conf = conf->log_private;
+
+	if (ppl_conf) {
+		__ppl_exit_log(ppl_conf);
+		conf->log_private = NULL;
+	}
+}
+
+static int ppl_validate_rdev(struct md_rdev *rdev)
+{
+	char b[BDEVNAME_SIZE];
+	int ppl_data_sectors;
+	int ppl_size_new;
+
+	/*
+	 * The configured PPL size must be enough to store
+	 * the header and (at the very least) partial parity
+	 * for one stripe. Round it down to ensure the data
+	 * space is cleanly divisible by stripe size.
+	 */
+	ppl_data_sectors = rdev->ppl.size - (PPL_HEADER_SIZE >> 9);
+
+	if (ppl_data_sectors > 0)
+		ppl_data_sectors = rounddown(ppl_data_sectors, STRIPE_SECTORS);
+
+	if (ppl_data_sectors <= 0) {
+		pr_warn("md/raid:%s: PPL space too small on %s\n",
+			mdname(rdev->mddev), bdevname(rdev->bdev, b));
+		return -ENOSPC;
+	}
+
+	ppl_size_new = ppl_data_sectors + (PPL_HEADER_SIZE >> 9);
+
+	if ((rdev->ppl.sector < rdev->data_offset &&
+	     rdev->ppl.sector + ppl_size_new > rdev->data_offset) ||
+	    (rdev->ppl.sector >= rdev->data_offset &&
+	     rdev->data_offset + rdev->sectors > rdev->ppl.sector)) {
+		pr_warn("md/raid:%s: PPL space overlaps with data on %s\n",
+			mdname(rdev->mddev), bdevname(rdev->bdev, b));
+		return -EINVAL;
+	}
+
+	if (!rdev->mddev->external &&
+	    ((rdev->ppl.offset > 0 && rdev->ppl.offset < (rdev->sb_size >> 9)) ||
+	     (rdev->ppl.offset <= 0 && rdev->ppl.offset + ppl_size_new > 0))) {
+		pr_warn("md/raid:%s: PPL space overlaps with superblock on %s\n",
+			mdname(rdev->mddev), bdevname(rdev->bdev, b));
+		return -EINVAL;
+	}
+
+	rdev->ppl.size = ppl_size_new;
+
+	return 0;
+}
+
+int ppl_init_log(struct r5conf *conf)
+{
+	struct ppl_conf *ppl_conf;
+	struct mddev *mddev = conf->mddev;
+	int ret = 0;
+	int i;
+	bool need_cache_flush;
+
+	pr_debug("md/raid:%s: enabling distributed Partial Parity Log\n",
+		 mdname(conf->mddev));
+
+	if (PAGE_SIZE != 4096)
+		return -EINVAL;
+
+	if (mddev->level != 5) {
+		pr_warn("md/raid:%s PPL is not compatible with raid level %d\n",
+			mdname(mddev), mddev->level);
+		return -EINVAL;
+	}
+
+	if (mddev->bitmap_info.file || mddev->bitmap_info.offset) {
+		pr_warn("md/raid:%s PPL is not compatible with bitmap\n",
+			mdname(mddev));
+		return -EINVAL;
+	}
+
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		pr_warn("md/raid:%s PPL is not compatible with journal\n",
+			mdname(mddev));
+		return -EINVAL;
+	}
+
+	ppl_conf = kzalloc(sizeof(struct ppl_conf), GFP_KERNEL);
+	if (!ppl_conf)
+		return -ENOMEM;
+
+	ppl_conf->mddev = mddev;
+
+	ppl_conf->io_kc = KMEM_CACHE(ppl_io_unit, 0);
+	if (!ppl_conf->io_kc) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->io_pool = mempool_create_slab_pool(conf->raid_disks, ppl_conf->io_kc);
+	if (!ppl_conf->io_pool) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->bs = bioset_create(conf->raid_disks, 0);
+	if (!ppl_conf->bs) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
+	if (!ppl_conf->meta_pool) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->count = conf->raid_disks;
+	ppl_conf->child_logs = kcalloc(ppl_conf->count, sizeof(struct ppl_log),
+				       GFP_KERNEL);
+	if (!ppl_conf->child_logs) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	atomic64_set(&ppl_conf->seq, 0);
+
+	if (!mddev->external) {
+		ppl_conf->signature = ~crc32c_le(~0, mddev->uuid, sizeof(mddev->uuid));
+		ppl_conf->block_size = 512;
+	} else {
+		ppl_conf->block_size = queue_logical_block_size(mddev->queue);
+	}
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		struct ppl_log *log = &ppl_conf->child_logs[i];
+		struct md_rdev *rdev = conf->disks[i].rdev;
+
+		mutex_init(&log->io_mutex);
+		spin_lock_init(&log->io_list_lock);
+		INIT_LIST_HEAD(&log->io_list);
+		INIT_LIST_HEAD(&log->no_mem_stripes);
+
+		log->ppl_conf = ppl_conf;
+		log->rdev = rdev;
+
+		if (rdev) {
+			struct request_queue *q;
+
+			ret = ppl_validate_rdev(rdev);
+			if (ret)
+				goto err;
+
+			q = bdev_get_queue(rdev->bdev);
+			if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+				need_cache_flush = true;
+		}
+	}
+
+	if (need_cache_flush)
+		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
+			mdname(mddev));
+
+	conf->log_private = ppl_conf;
+
+	return 0;
+err:
+	__ppl_exit_log(ppl_conf);
+	return ret;
+}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 043a509560c2..4ca8e555ae5c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -464,6 +464,11 @@ static void shrink_buffers(struct stripe_head *sh)
 		sh->dev[i].page = NULL;
 		put_page(p);
 	}
+
+	if (sh->ppl_page) {
+		put_page(sh->ppl_page);
+		sh->ppl_page = NULL;
+	}
 }
 
 static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
@@ -480,6 +485,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
 		sh->dev[i].page = page;
 		sh->dev[i].orig_page = page;
 	}
+
+	if (raid5_has_ppl(sh->raid_conf)) {
+		sh->ppl_page = alloc_page(gfp);
+		if (!sh->ppl_page)
+			return 1;
+	}
+
 	return 0;
 }
 
@@ -728,7 +740,7 @@ static bool stripe_can_batch(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 
-	if (conf->log)
+	if (conf->log || raid5_has_ppl(conf))
 		return false;
 	return test_bit(STRIPE_BATCH_READY, &sh->state) &&
 		!test_bit(STRIPE_BITMAP_PENDING, &sh->state) &&
@@ -1995,6 +2007,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 			async_tx_ack(tx);
 	}
 
+	if (test_bit(STRIPE_OP_PARTIAL_PARITY, &ops_request))
+		tx = ops_run_partial_parity(sh, percpu, tx);
+
 	if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
 		if (level < 6)
 			tx = ops_run_prexor5(sh, percpu, tx);
@@ -3070,6 +3085,12 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		s->locked++;
 	}
 
+	if (raid5_has_ppl(sh->raid_conf) &&
+	    test_bit(STRIPE_OP_BIODRAIN, &s->ops_request) &&
+	    !test_bit(STRIPE_FULL_WRITE, &sh->state) &&
+	    test_bit(R5_Insync, &sh->dev[pd_idx].flags))
+		set_bit(STRIPE_OP_PARTIAL_PARITY, &s->ops_request);
+
 	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
 		__func__, (unsigned long long)sh->sector,
 		s->locked, s->ops_request);
@@ -3117,6 +3138,36 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
 		goto overlap;
 
+	if (forwrite && raid5_has_ppl(conf)) {
+		/*
+		 * With PPL only writes to consecutive data chunks within a
+		 * stripe are allowed because for a single stripe_head we can
+		 * only have one PPL entry at a time, which describes one data
+		 * range. Not really an overlap, but wait_for_overlap can be
+		 * used to handle this.
+		 */
+		sector_t sector;
+		sector_t first = 0;
+		sector_t last = 0;
+		int count = 0;
+		int i;
+
+		for (i = 0; i < sh->disks; i++) {
+			if (i != sh->pd_idx &&
+			    (i == dd_idx || sh->dev[i].towrite)) {
+				sector = sh->dev[i].sector;
+				if (count == 0 || sector < first)
+					first = sector;
+				if (sector > last)
+					last = sector;
+				count++;
+			}
+		}
+
+		if (first + conf->chunk_sectors * (count - 1) != last)
+			goto overlap;
+	}
+
 	if (!forwrite || previous)
 		clear_bit(STRIPE_BATCH_READY, &sh->state);
 
@@ -7087,6 +7138,13 @@ static int raid5_run(struct mddev *mddev)
 		BUG_ON(mddev->delta_disks != 0);
 	}
 
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
+	    test_bit(MD_HAS_PPL, &mddev->flags)) {
+		pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
+			mdname(mddev));
+		clear_bit(MD_HAS_PPL, &mddev->flags);
+	}
+
 	if (mddev->private == NULL)
 		conf = setup_conf(mddev);
 	else
@@ -7568,7 +7626,7 @@ static int raid5_resize(struct mddev *mddev, sector_t sectors)
 	sector_t newsize;
 	struct r5conf *conf = mddev->private;
 
-	if (conf->log)
+	if (conf->log || raid5_has_ppl(conf))
 		return -EINVAL;
 	sectors &= ~((sector_t)conf->chunk_sectors - 1);
 	newsize = raid5_size(mddev, sectors, mddev->raid_disks);
@@ -7619,7 +7677,7 @@ static int check_reshape(struct mddev *mddev)
 {
 	struct r5conf *conf = mddev->private;
 
-	if (conf->log)
+	if (conf->log || raid5_has_ppl(conf))
 		return -EINVAL;
 	if (mddev->delta_disks == 0 &&
 	    mddev->new_layout == mddev->layout &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 749c6c496e7d..5a371fa9e782 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -224,10 +224,16 @@ struct stripe_head {
 	spinlock_t		batch_lock; /* only header's lock is useful */
 	struct list_head	batch_list; /* protected by head's batch lock*/
 
-	struct r5l_io_unit	*log_io;
+	union {
+		struct r5l_io_unit	*log_io;
+		struct ppl_io_unit	*ppl_io;
+	};
+
 	struct list_head	log_list;
 	sector_t		log_start; /* first meta block on the journal */
 	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
+
+	struct page		*ppl_page; /* partial parity of this stripe */
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -400,6 +406,7 @@ enum {
 	STRIPE_OP_BIODRAIN,
 	STRIPE_OP_RECONSTRUCT,
 	STRIPE_OP_CHECK,
+	STRIPE_OP_PARTIAL_PARITY,
 };
 
 /*
@@ -686,6 +693,7 @@ struct r5conf {
 	int			group_cnt;
 	int			worker_cnt_per_group;
 	struct r5l_log		*log;
+	void			*log_private;
 
 	struct bio_list		pending_bios;
 	spinlock_t		pending_bios_lock;
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index fe2112810c43..d9a1ead867b9 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -398,4 +398,31 @@ struct r5l_meta_block {
 
 #define R5LOG_VERSION 0x1
 #define R5LOG_MAGIC 0x6433c509
+
+struct ppl_header_entry {
+	__le64 data_sector;	/* raid sector of the new data */
+	__le32 pp_size;		/* length of partial parity */
+	__le32 data_size;	/* length of data */
+	__le32 parity_disk;	/* member disk containing parity */
+	__le32 checksum;	/* checksum of partial parity data for this
+				 * entry (~crc32c) */
+} __attribute__ ((__packed__));
+
+#define PPL_HEADER_SIZE 4096
+#define PPL_HDR_RESERVED 512
+#define PPL_HDR_ENTRY_SPACE \
+	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))
+#define PPL_HDR_MAX_ENTRIES \
+	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
+
+struct ppl_header {
+	__u8 reserved[PPL_HDR_RESERVED];/* reserved space, fill with 0xff */
+	__le32 signature;		/* signature (family number of volume) */
+	__le32 padding;			/* zero pad */
+	__le64 generation;		/* generation number of the header */
+	__le32 entries_count;		/* number of entries in entry array */
+	__le32 checksum;		/* checksum of the header (~crc32c) */
+	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
+} __attribute__ ((__packed__));
+
 #endif
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 4/7] md: add sysfs entries for PPL
  2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
                   ` (2 preceding siblings ...)
  2017-03-09  8:59 ` [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation Artur Paszkiewicz
@ 2017-03-09  9:00 ` Artur Paszkiewicz
  2017-03-09  9:00 ` [PATCH v5 5/7] raid5-ppl: load and recover the log Artur Paszkiewicz
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-09  9:00 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

Add a 'consistency_policy' attribute for the array. It indicates how the array
maintains consistency in case of unexpected shutdown.

Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
and size of the PPL space on the device. They can't be changed for
active members if the array is started and PPL is enabled, so in the
setter functions only basic checks are performed. More checks are done
in ppl_validate_rdev() when starting the log.

These attributes are writable to allow enabling PPL for external
metadata arrays and (later) to enable/disable PPL for a running array.
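
For illustration only -- the sketch below is not part of this patch and
is not how mdadm does it, but it shows roughly how the new attributes
can be driven from userspace for an inactive external metadata array.
The md device name, member directory and numeric values are invented:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

static int sysfs_write(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, val, strlen(val));
	close(fd);
	return n < 0 ? -1 : 0;
}

int main(void)
{
	/* location and size (in sectors) of the PPL space on one member */
	sysfs_write("/sys/block/md127/md/dev-sda/ppl_sector", "264192");
	sysfs_write("/sys/block/md127/md/dev-sda/ppl_size", "2048");
	/* for an external metadata array, request PPL up front */
	sysfs_write("/sys/block/md127/md/consistency_policy", "ppl");
	return 0;
}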

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Documentation/admin-guide/md.rst |  32 ++++++++++-
 drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 144 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 1e61bf50595c..84de718f24a4 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -276,14 +276,14 @@ All md devices contain:
      array creation it will default to 0, though starting the array as
      ``clean`` will set it much larger.
 
-   new_dev
+  new_dev
      This file can be written but not read.  The value written should
      be a block device number as major:minor.  e.g. 8:0
      This will cause that device to be attached to the array, if it is
      available.  It will then appear at md/dev-XXX (depending on the
      name of the device) and further configuration is then possible.
 
-   safe_mode_delay
+  safe_mode_delay
      When an md array has seen no write requests for a certain period
      of time, it will be marked as ``clean``.  When another write
      request arrives, the array is marked as ``dirty`` before the write
@@ -292,7 +292,7 @@ All md devices contain:
      period as a number of seconds.  The default is 200msec (0.200).
      Writing a value of 0 disables safemode.
 
-   array_state
+  array_state
      This file contains a single word which describes the current
      state of the array.  In many cases, the state can be set by
      writing the word for the desired state, however some states
@@ -401,7 +401,30 @@ All md devices contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
 
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+
+     bitmap
+       Resync assisted by a write-intent bitmap.
+
+     journal
+       For raid4/5/6, journal device is used to log transactions and replay
+       after unclean shutdown.
+
+     ppl
+       For raid5 only, Partial Parity Log is used to close the write hole and
+       eliminate resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL.
 
 
 As component devices are added to an md array, they appear in the ``md``
@@ -563,6 +586,9 @@ Each directory contains:
 	adds bad blocks without acknowledging them. This is largely
 	for testing.
 
+      ppl_sector, ppl_size
+        Location and size (in sectors) of the space used for Partial Parity Log
+        on this device.
 
 
 An active md device will also contain an entry for each active device
diff --git a/drivers/md/md.c b/drivers/md/md.c
index 173550455c42..1df48f365b3c 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3149,6 +3149,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
 static struct rdev_sysfs_entry rdev_unack_bad_blocks =
 __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
 
+static ssize_t
+ppl_sector_show(struct md_rdev *rdev, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
+}
+
+static ssize_t
+ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
+{
+	unsigned long long sector;
+
+	if (kstrtoull(buf, 10, &sector) < 0)
+		return -EINVAL;
+	if (sector != (sector_t)sector)
+		return -EINVAL;
+
+	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
+	    rdev->raid_disk >= 0)
+		return -EBUSY;
+
+	if (rdev->mddev->persistent) {
+		if (rdev->mddev->major_version == 0)
+			return -EINVAL;
+		if ((sector > rdev->sb_start &&
+		     sector - rdev->sb_start > S16_MAX) ||
+		    (sector < rdev->sb_start &&
+		     rdev->sb_start - sector > -S16_MIN))
+			return -EINVAL;
+		rdev->ppl.offset = sector - rdev->sb_start;
+	} else if (!rdev->mddev->external) {
+		return -EBUSY;
+	}
+	rdev->ppl.sector = sector;
+	return len;
+}
+
+static struct rdev_sysfs_entry rdev_ppl_sector =
+__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
+
+static ssize_t
+ppl_size_show(struct md_rdev *rdev, char *page)
+{
+	return sprintf(page, "%u\n", rdev->ppl.size);
+}
+
+static ssize_t
+ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
+{
+	unsigned int size;
+
+	if (kstrtouint(buf, 10, &size) < 0)
+		return -EINVAL;
+
+	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
+	    rdev->raid_disk >= 0)
+		return -EBUSY;
+
+	if (rdev->mddev->persistent) {
+		if (rdev->mddev->major_version == 0)
+			return -EINVAL;
+		if (size > U16_MAX)
+			return -EINVAL;
+	} else if (!rdev->mddev->external) {
+		return -EBUSY;
+	}
+	rdev->ppl.size = size;
+	return len;
+}
+
+static struct rdev_sysfs_entry rdev_ppl_size =
+__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
+
 static struct attribute *rdev_default_attrs[] = {
 	&rdev_state.attr,
 	&rdev_errors.attr,
@@ -3159,6 +3231,8 @@ static struct attribute *rdev_default_attrs[] = {
 	&rdev_recovery_start.attr,
 	&rdev_bad_blocks.attr,
 	&rdev_unack_bad_blocks.attr,
+	&rdev_ppl_sector.attr,
+	&rdev_ppl_size.attr,
 	NULL,
 };
 static ssize_t
@@ -4895,6 +4969,46 @@ static struct md_sysfs_entry md_array_size =
 __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
        array_size_store);
 
+static ssize_t
+consistency_policy_show(struct mddev *mddev, char *page)
+{
+	int ret;
+
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		ret = sprintf(page, "journal\n");
+	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		ret = sprintf(page, "ppl\n");
+	} else if (mddev->bitmap) {
+		ret = sprintf(page, "bitmap\n");
+	} else if (mddev->pers) {
+		if (mddev->pers->sync_request)
+			ret = sprintf(page, "resync\n");
+		else
+			ret = sprintf(page, "none\n");
+	} else {
+		ret = sprintf(page, "unknown\n");
+	}
+
+	return ret;
+}
+
+static ssize_t
+consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	if (mddev->pers) {
+		return -EBUSY;
+	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
+		set_bit(MD_HAS_PPL, &mddev->flags);
+		return len;
+	} else {
+		return -EINVAL;
+	}
+}
+
+static struct md_sysfs_entry md_consistency_policy =
+__ATTR(consistency_policy, S_IRUGO | S_IWUSR, consistency_policy_show,
+       consistency_policy_store);
+
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
 	&md_layout.attr,
@@ -4910,6 +5024,7 @@ static struct attribute *md_default_attrs[] = {
 	&md_reshape_direction.attr,
 	&md_array_size.attr,
 	&max_corr_read_errors.attr,
+	&md_consistency_policy.attr,
 	NULL,
 };
 
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 5/7] raid5-ppl: load and recover the log
  2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
                   ` (3 preceding siblings ...)
  2017-03-09  9:00 ` [PATCH v5 4/7] md: add sysfs entries for PPL Artur Paszkiewicz
@ 2017-03-09  9:00 ` Artur Paszkiewicz
  2017-03-09 23:30   ` Shaohua Li
  2017-03-09  9:00 ` [PATCH v5 6/7] raid5-ppl: support disk hot add/remove with PPL Artur Paszkiewicz
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-09  9:00 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

Load the log from each disk when starting the array and recover if the
array is dirty.

The initial empty PPL is written by mdadm. When loading the log we
verify the header checksum and signature. For external metadata arrays
the signature is verified in userspace, so here we read it from the
header, verifying only that it matches on all disks, and use it later when
writing PPL.
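
A minimal userspace sketch of the header check (illustrative only, not
taken from mdadm): crc32c() is an assumed helper with the same semantics
as the kernel's crc32c_le(), a little-endian host is assumed, and the
field offsets follow struct ppl_header from md_p.h. Note that with that
layout a 4 KiB header holds at most (4096 - 512 - 4*4 - 8) / 24 = 148
entries (PPL_HDR_MAX_ENTRIES).

#include <stdint.h>
#include <string.h>

#define PPL_HEADER_SIZE		4096
#define PPL_HDR_RESERVED	512

/* assumed helper, same semantics as the kernel's crc32c_le() */
uint32_t crc32c(uint32_t crc, const void *buf, size_t len);

/* hdr points to the first PPL_HEADER_SIZE bytes of the PPL area */
static int ppl_header_valid(unsigned char *hdr, uint32_t expected_signature)
{
	uint32_t signature, stored_crc;

	/* signature follows the reserved area; the checksum comes after
	 * padding, generation and entries_count: 512 + 4 + 4 + 8 + 4 = 532 */
	memcpy(&signature, hdr + PPL_HDR_RESERVED, 4);
	memcpy(&stored_crc, hdr + 532, 4);
	/* the checksum covers the whole 4k header with its field zeroed */
	memset(hdr + 532, 0, 4);
	if (~crc32c(~0, hdr, PPL_HEADER_SIZE) != stored_crc)
		return 0;
	return signature == expected_signature;
}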

In addition to the header checksum, each header entry also contains a
checksum of its partial parity data. If the header is valid, recovery is
performed for each entry until an invalid entry is found. If the array
is not degraded and recovery using PPL fully succeeds, there is no need
to resync the array because data and parity will be consistent, so in
this case resync will be disabled.

Due to compatibility with IMSM implementations on other systems, we
can't assume that the recovery data block size is always 4K. Writes
generated by MD raid5 don't have this issue, but when recovering PPL
written in other environments it is possible to have entries with
512-byte sector granularity. The recovery code takes this into account,
as well as the logical sector size of the underlying drives.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 497 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c     |   5 +-
 2 files changed, 501 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 92783586743d..548d1028a3ce 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -103,6 +103,10 @@ struct ppl_conf {
 	mempool_t *io_pool;
 	struct bio_set *bs;
 	mempool_t *meta_pool;
+
+	/* used only for recovery */
+	int recovered_entries;
+	int mismatch_count;
 };
 
 struct ppl_log {
@@ -514,6 +518,482 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
 		ppl_io_unit_finished(io);
 }
 
+static void ppl_xor(int size, struct page *page1, struct page *page2,
+		    struct page *page_result)
+{
+	struct async_submit_ctl submit;
+	struct dma_async_tx_descriptor *tx;
+	struct page *xor_srcs[] = { page1, page2 };
+
+	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
+			  NULL, NULL, NULL, NULL);
+	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
+
+	async_tx_quiesce(&tx);
+}
+
+/*
+ * PPL recovery strategy: xor partial parity and data from all modified data
+ * disks within a stripe and write the result as the new stripe parity. If all
+ * stripe data disks are modified (full stripe write), no partial parity is
+ * available, so just xor the data disks.
+ *
+ * Recovery of a PPL entry shall occur only if all modified data disks are
+ * available and read from all of them succeeds.
+ *
+ * A PPL entry applies to a stripe, partial parity size for an entry is at most
+ * the size of the chunk. Examples of possible cases for a single entry:
+ *
+ * case 0: single data disk write:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
+ * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = data_size
+ *
+ * case 1: more than one data disk write:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
+ * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = data_size / modified_data_disks
+ *
+ * case 2: write to all data disks (also full stripe write):
+ *   data0    data1    data2                parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ |           | (no change)        |
+ * | -data- | -data- | -data- | --------> | xor all data       |
+ * | ------ | ------ | ------ | --------> | (no change)        |
+ * | ------ | ------ | ------ |           | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = 0
+ *
+ * The following cases are possible only in other implementations. The recovery
+ * code can handle them, but they are not generated at runtime because they can
+ * be reduced to cases 0, 1 and 2:
+ *
+ * case 3:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+ +----+    +--------------------+
+ * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
+ * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
+ * | -data- | -data- | -data- | | -- | -> | xor all data       |
+ * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
+ * +--------+--------+--------+ +----+    +--------------------+
+ * pp_size = chunk_size
+ *
+ * case 4:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+ +----+    +--------------------+
+ * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
+ * | ------ | ------ | ------ | | -- | -> | (no change)        |
+ * | ------ | ------ | ------ | | -- | -> | (no change)        |
+ * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
+ * +--------+--------+--------+ +----+    +--------------------+
+ * pp_size = chunk_size
+ */
+static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
+			     sector_t ppl_sector)
+{
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct mddev *mddev = ppl_conf->mddev;
+	struct r5conf *conf = mddev->private;
+	int block_size = ppl_conf->block_size;
+	struct page *page1;
+	struct page *page2;
+	sector_t r_sector_first;
+	sector_t r_sector_last;
+	int strip_sectors;
+	int data_disks;
+	int i;
+	int ret = 0;
+	char b[BDEVNAME_SIZE];
+	unsigned int pp_size = le32_to_cpu(e->pp_size);
+	unsigned int data_size = le32_to_cpu(e->data_size);
+
+	page1 = alloc_page(GFP_KERNEL);
+	page2 = alloc_page(GFP_KERNEL);
+
+	if (!page1 || !page2) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
+
+	if ((pp_size >> 9) < conf->chunk_sectors) {
+		if (pp_size > 0) {
+			data_disks = data_size / pp_size;
+			strip_sectors = pp_size >> 9;
+		} else {
+			data_disks = conf->raid_disks - conf->max_degraded;
+			strip_sectors = (data_size >> 9) / data_disks;
+		}
+		r_sector_last = r_sector_first +
+				(data_disks - 1) * conf->chunk_sectors +
+				strip_sectors;
+	} else {
+		data_disks = conf->raid_disks - conf->max_degraded;
+		strip_sectors = conf->chunk_sectors;
+		r_sector_last = r_sector_first + (data_size >> 9);
+	}
+
+	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
+		 (unsigned long long)r_sector_first,
+		 (unsigned long long)r_sector_last);
+
+	/* if start and end are 4k aligned, use a 4k block */
+	if (block_size == 512 &&
+	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
+	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
+		block_size = STRIPE_SIZE;
+
+	/* iterate through blocks in strip */
+	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
+		bool update_parity = false;
+		sector_t parity_sector;
+		struct md_rdev *parity_rdev;
+		struct stripe_head sh;
+		int disk;
+		int indent = 0;
+
+		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
+		indent += 2;
+
+		memset(page_address(page1), 0, PAGE_SIZE);
+
+		/* iterate through data member disks */
+		for (disk = 0; disk < data_disks; disk++) {
+			int dd_idx;
+			struct md_rdev *rdev;
+			sector_t sector;
+			sector_t r_sector = r_sector_first + i +
+					    (disk * conf->chunk_sectors);
+
+			pr_debug("%s:%*s data member disk %d start\n",
+				 __func__, indent, "", disk);
+			indent += 2;
+
+			if (r_sector >= r_sector_last) {
+				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
+					 __func__, indent, "",
+					 (unsigned long long)r_sector);
+				indent -= 2;
+				continue;
+			}
+
+			update_parity = true;
+
+			/* map raid sector to member disk */
+			sector = raid5_compute_sector(conf, r_sector, 0,
+						      &dd_idx, NULL);
+			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
+				 __func__, indent, "",
+				 (unsigned long long)r_sector, dd_idx,
+				 (unsigned long long)sector);
+
+			rdev = conf->disks[dd_idx].rdev;
+			if (!rdev) {
+				pr_debug("%s:%*s data member disk %d missing\n",
+					 __func__, indent, "", dd_idx);
+				update_parity = false;
+				break;
+			}
+
+			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
+				 __func__, indent, "", bdevname(rdev->bdev, b),
+				 (unsigned long long)sector);
+			if (!sync_page_io(rdev, sector, block_size, page2,
+					REQ_OP_READ, 0, false)) {
+				md_error(mddev, rdev);
+				pr_debug("%s:%*s read failed!\n", __func__,
+					 indent, "");
+				ret = -EIO;
+				goto out;
+			}
+
+			ppl_xor(block_size, page1, page2, page1);
+
+			indent -= 2;
+		}
+
+		if (!update_parity)
+			continue;
+
+		if (pp_size > 0) {
+			pr_debug("%s:%*s reading pp disk sector %llu\n",
+				 __func__, indent, "",
+				 (unsigned long long)(ppl_sector + i));
+			if (!sync_page_io(log->rdev,
+					ppl_sector - log->rdev->data_offset + i,
+					block_size, page2, REQ_OP_READ, 0,
+					false)) {
+				pr_debug("%s:%*s read failed!\n", __func__,
+					 indent, "");
+				md_error(mddev, log->rdev);
+				ret = -EIO;
+				goto out;
+			}
+
+			ppl_xor(block_size, page1, page2, page1);
+		}
+
+		/* map raid sector to parity disk */
+		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
+				0, &disk, &sh);
+		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
+		parity_rdev = conf->disks[sh.pd_idx].rdev;
+
+		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
+		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
+			 __func__, indent, "",
+			 (unsigned long long)parity_sector,
+			 bdevname(parity_rdev->bdev, b));
+		if (!sync_page_io(parity_rdev, parity_sector, block_size,
+				page1, REQ_OP_WRITE, 0, false)) {
+			pr_debug("%s:%*s parity write error!\n", __func__,
+				 indent, "");
+			md_error(mddev, parity_rdev);
+			ret = -EIO;
+			goto out;
+		}
+	}
+out:
+	if (page1)
+		__free_page(page1);
+	if (page2)
+		__free_page(page2);
+	return ret;
+}
+
+static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
+{
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct md_rdev *rdev = log->rdev;
+	struct mddev *mddev = rdev->mddev;
+	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
+	struct page *page;
+	int i;
+	int ret = 0;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	/* iterate through all PPL entries saved */
+	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
+		struct ppl_header_entry *e = &pplhdr->entries[i];
+		u32 pp_size = le32_to_cpu(e->pp_size);
+		sector_t sector = ppl_sector;
+		int ppl_entry_sectors = pp_size >> 9;
+		u32 crc, crc_stored;
+
+		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
+			 __func__, rdev->raid_disk, i,
+			 (unsigned long long)ppl_sector, pp_size);
+
+		crc = ~0;
+		crc_stored = le32_to_cpu(e->checksum);
+
+		/* read partial parity for this entry and calculate its checksum */
+		while (pp_size) {
+			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
+
+			if (!sync_page_io(rdev, sector - rdev->data_offset,
+					s, page, REQ_OP_READ, 0, false)) {
+				md_error(mddev, rdev);
+				ret = -EIO;
+				goto out;
+			}
+
+			crc = crc32c_le(crc, page_address(page), s);
+
+			pp_size -= s;
+			sector += s >> 9;
+		}
+
+		crc = ~crc;
+
+		if (crc != crc_stored) {
+			/*
+			 * Don't recover this entry if the checksum does not
+			 * match, but keep going and try to recover other
+			 * entries.
+			 */
+			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
+				 __func__, crc_stored, crc);
+			ppl_conf->mismatch_count++;
+		} else {
+			ret = ppl_recover_entry(log, e, ppl_sector);
+			if (ret)
+				goto out;
+			ppl_conf->recovered_entries++;
+		}
+
+		ppl_sector += ppl_entry_sectors;
+	}
+
+	/* flush the disk cache after recovery if necessary */
+	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {
+		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
+
+		bio->bi_bdev = rdev->bdev;
+		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
+		ret = submit_bio_wait(bio);
+		bio_put(bio);
+	}
+out:
+	__free_page(page);
+	return ret;
+}
+
+static int ppl_write_empty_header(struct ppl_log *log)
+{
+	struct page *page;
+	struct ppl_header *pplhdr;
+	struct md_rdev *rdev = log->rdev;
+	int ret = 0;
+
+	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
+		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
+
+	page = alloc_page(GFP_NOIO | __GFP_ZERO);
+	if (!page)
+		return -ENOMEM;
+
+	pplhdr = page_address(page);
+	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
+	pplhdr->signature = cpu_to_le32(log->ppl_conf->signature);
+	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
+
+	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
+			  PPL_HEADER_SIZE, page, REQ_OP_WRITE | REQ_FUA, 0,
+			  false)) {
+		md_error(rdev->mddev, rdev);
+		ret = -EIO;
+	}
+
+	__free_page(page);
+	return ret;
+}
+
+static int ppl_load_distributed(struct ppl_log *log)
+{
+	struct ppl_conf *ppl_conf = log->ppl_conf;
+	struct md_rdev *rdev = log->rdev;
+	struct mddev *mddev = rdev->mddev;
+	struct page *page;
+	struct ppl_header *pplhdr;
+	u32 crc, crc_stored;
+	u32 signature;
+	int ret = 0;
+
+	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
+
+	/* read PPL header */
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
+			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
+		md_error(mddev, rdev);
+		ret = -EIO;
+		goto out;
+	}
+	pplhdr = page_address(page);
+
+	/* check header validity */
+	crc_stored = le32_to_cpu(pplhdr->checksum);
+	pplhdr->checksum = 0;
+	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
+
+	if (crc_stored != crc) {
+		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
+			 __func__, crc_stored, crc);
+		ppl_conf->mismatch_count++;
+		goto out;
+	}
+
+	signature = le32_to_cpu(pplhdr->signature);
+
+	if (mddev->external) {
+		/*
+		 * For external metadata the header signature is set and
+		 * validated in userspace.
+		 */
+		ppl_conf->signature = signature;
+	} else if (ppl_conf->signature != signature) {
+		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
+			 __func__, signature, ppl_conf->signature);
+		ppl_conf->mismatch_count++;
+		goto out;
+	}
+
+	/* attempt to recover from log if we are starting a dirty array */
+	if (!mddev->pers && mddev->recovery_cp != MaxSector)
+		ret = ppl_recover(log, pplhdr);
+out:
+	/* write empty header if we are starting the array */
+	if (!ret && !mddev->pers)
+		ret = ppl_write_empty_header(log);
+
+	__free_page(page);
+
+	pr_debug("%s: return: %d mismatch_count: %d recovered_entries: %d\n",
+		 __func__, ret, ppl_conf->mismatch_count,
+		 ppl_conf->recovered_entries);
+	return ret;
+}
+
+static int ppl_load(struct ppl_conf *ppl_conf)
+{
+	int ret = 0;
+	u32 signature = 0;
+	bool signature_set = false;
+	int i;
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		struct ppl_log *log = &ppl_conf->child_logs[i];
+
+		/* skip missing drive */
+		if (!log->rdev)
+			continue;
+
+		ret = ppl_load_distributed(log);
+		if (ret)
+			break;
+
+		/*
+		 * For external metadata we can't check if the signature is
+		 * correct on a single drive, but we can check if it is the same
+		 * on all drives.
+		 */
+		if (ppl_conf->mddev->external) {
+			if (!signature_set) {
+				signature = ppl_conf->signature;
+				signature_set = true;
+			} else if (signature != ppl_conf->signature) {
+				pr_warn("md/raid:%s: PPL header signature does not match on all member drives\n",
+					mdname(ppl_conf->mddev));
+				ret = -EINVAL;
+				break;
+			}
+		}
+	}
+
+	pr_debug("%s: return: %d mismatch_count: %d recovered_entries: %d\n",
+		 __func__, ret, ppl_conf->mismatch_count,
+		 ppl_conf->recovered_entries);
+	return ret;
+}
+
 static void __ppl_exit_log(struct ppl_conf *ppl_conf)
 {
 	clear_bit(MD_HAS_PPL, &ppl_conf->mddev->flags);
@@ -694,6 +1174,23 @@ int ppl_init_log(struct r5conf *conf)
 		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
 			mdname(mddev));
 
+	/* load and possibly recover the logs from the member disks */
+	ret = ppl_load(ppl_conf);
+
+	if (ret) {
+		goto err;
+	} else if (!mddev->pers &&
+		   mddev->recovery_cp == 0 && !mddev->degraded &&
+		   ppl_conf->recovered_entries > 0 &&
+		   ppl_conf->mismatch_count == 0) {
+		/*
+		 * If we are starting a dirty array and the recovery succeeds
+		 * without any issues, set the array as clean.
+		 */
+		mddev->recovery_cp = MaxSector;
+		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	}
+
 	conf->log_private = ppl_conf;
 
 	return 0;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 4ca8e555ae5c..f032af4d0421 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7236,7 +7236,10 @@ static int raid5_run(struct mddev *mddev)
 
 	if (mddev->degraded > dirty_parity_disks &&
 	    mddev->recovery_cp != MaxSector) {
-		if (mddev->ok_start_degraded)
+		if (test_bit(MD_HAS_PPL, &mddev->flags))
+			pr_crit("md/raid:%s: starting dirty degraded array with PPL.\n",
+				mdname(mddev));
+		else if (mddev->ok_start_degraded)
 			pr_crit("md/raid:%s: starting dirty degraded array - data corruption possible.\n",
 				mdname(mddev));
 		else {
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 6/7] raid5-ppl: support disk hot add/remove with PPL
  2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
                   ` (4 preceding siblings ...)
  2017-03-09  9:00 ` [PATCH v5 5/7] raid5-ppl: load and recover the log Artur Paszkiewicz
@ 2017-03-09  9:00 ` Artur Paszkiewicz
  2017-03-09  9:00 ` [PATCH v5 7/7] raid5-ppl: runtime PPL enabling or disabling Artur Paszkiewicz
  2017-03-09 23:32 ` [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Shaohua Li
  7 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-09  9:00 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

Add a function to modify the log by removing an rdev when a drive fails,
or adding one when a spare/replacement is activated as a raid member.

Removing a disk just clears the child log rdev pointer. No new stripes
will be accepted for this child log in ppl_write_stripe(), and running io
units will be processed without writing PPL to the device.

Adding a disk sets the child log rdev pointer and writes an empty PPL
header.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-log.h |  9 +++++++++
 drivers/md/raid5-ppl.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid5.c     | 12 +++++++++++-
 3 files changed, 64 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index a67fb58513b9..4f5a0f4e0b1f 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -39,6 +39,7 @@ extern void ppl_exit_log(struct r5conf *conf);
 extern int ppl_write_stripe(struct r5conf *conf, struct stripe_head *sh);
 extern void ppl_write_stripe_run(struct r5conf *conf);
 extern void ppl_stripe_write_finished(struct stripe_head *sh);
+extern int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add);
 
 static inline bool raid5_has_ppl(struct r5conf *conf)
 {
@@ -102,4 +103,12 @@ static inline int log_init(struct r5conf *conf, struct md_rdev *journal_dev)
 	return 0;
 }
 
+static inline int log_modify(struct r5conf *conf, struct md_rdev *rdev, bool add)
+{
+	if (raid5_has_ppl(conf))
+		return ppl_modify_log(conf, rdev, add);
+
+	return 0;
+}
+
 #endif
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 548d1028a3ce..b5fcf55edee9 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -400,6 +400,13 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
 	struct stripe_head *sh;
 	int i;
 
+	bio->bi_private = io;
+
+	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
+		ppl_log_endio(bio);
+		return;
+	}
+
 	for (i = 0; i < io->entries_count; i++) {
 		struct ppl_header_entry *e = &pplhdr->entries[i];
 
@@ -415,7 +422,6 @@ static void ppl_submit_iounit(struct ppl_io_unit *io)
 	pplhdr->entries_count = cpu_to_le32(io->entries_count);
 	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE));
 
-	bio->bi_private = io;
 	bio->bi_end_io = ppl_log_endio;
 	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
 	bio->bi_bdev = log->rdev->bdev;
@@ -1198,3 +1204,40 @@ int ppl_init_log(struct r5conf *conf)
 	__ppl_exit_log(ppl_conf);
 	return ret;
 }
+
+int ppl_modify_log(struct r5conf *conf, struct md_rdev *rdev, bool add)
+{
+	struct ppl_conf *ppl_conf = conf->log_private;
+	struct ppl_log *log;
+	int ret = 0;
+	char b[BDEVNAME_SIZE];
+
+	if (!rdev)
+		return -EINVAL;
+
+	pr_debug("%s: disk: %d operation: %s dev: %s\n",
+		 __func__, rdev->raid_disk, add ? "add" : "remove",
+		 bdevname(rdev->bdev, b));
+
+	if (rdev->raid_disk < 0)
+		return 0;
+
+	if (rdev->raid_disk >= ppl_conf->count)
+		return -ENODEV;
+
+	log = &ppl_conf->child_logs[rdev->raid_disk];
+
+	mutex_lock(&log->io_mutex);
+	if (add) {
+		ret = ppl_validate_rdev(rdev);
+		if (!ret) {
+			log->rdev = rdev;
+			ret = ppl_write_empty_header(log);
+		}
+	} else {
+		log->rdev = NULL;
+	}
+	mutex_unlock(&log->io_mutex);
+
+	return ret;
+}
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f032af4d0421..8b8f172d6807 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7527,6 +7527,11 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			*rdevp = rdev;
 		}
 	}
+	if (!err) {
+		err = log_modify(conf, rdev, false);
+		if (err)
+			goto abort;
+	}
 	if (p->replacement) {
 		/* We must have just cleared 'rdev' */
 		p->rdev = p->replacement;
@@ -7536,6 +7541,9 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			   */
 		p->replacement = NULL;
 		clear_bit(WantReplacement, &rdev->flags);
+
+		if (!err)
+			err = log_modify(conf, p->rdev, true);
 	} else
 		/* We might have just removed the Replacement as faulty-
 		 * clear the bit just in case
@@ -7592,10 +7600,12 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 		if (p->rdev == NULL) {
 			clear_bit(In_sync, &rdev->flags);
 			rdev->raid_disk = disk;
-			err = 0;
 			if (rdev->saved_raid_disk != disk)
 				conf->fullsync = 1;
 			rcu_assign_pointer(p->rdev, rdev);
+
+			err = log_modify(conf, rdev, true);
+
 			goto out;
 		}
 	}
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v5 7/7] raid5-ppl: runtime PPL enabling or disabling
  2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
                   ` (5 preceding siblings ...)
  2017-03-09  9:00 ` [PATCH v5 6/7] raid5-ppl: support disk hot add/remove with PPL Artur Paszkiewicz
@ 2017-03-09  9:00 ` Artur Paszkiewicz
  2017-03-09 23:32 ` [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Shaohua Li
  7 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-09  9:00 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

Allow writing to the 'consistency_policy' attribute when the array is
active. Add a new function, 'change_consistency_policy', to the
md_personality operations structure to handle the change in the
personality code. The values "ppl" and "resync" are accepted and
turn PPL on and off, respectively.

When enabling PPL, its location and size should first be set using the
'ppl_sector' and 'ppl_size' attributes, and a valid PPL header should be
written at this location on each member device.
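
For example (illustrative sketch only, device name invented), the
runtime toggle then reduces to a single sysfs write once the header is
in place:

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* write "ppl" to enable PPL on the running array, "resync" to disable */
static int set_consistency_policy(const char *val)
{
	int fd = open("/sys/block/md127/md/consistency_policy", O_WRONLY);
	int ret = 0;

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) < 0)
		ret = -1;
	close(fd);
	return ret;
}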

Enabling or disabling PPL is performed with the array suspended. The
raid5_reset_stripe_cache() function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c        | 12 +++++++++---
 drivers/md/md.h        |  2 ++
 drivers/md/raid5-ppl.c |  4 ++++
 drivers/md/raid5.c     | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1df48f365b3c..8ef2100e2788 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -4995,14 +4995,20 @@ consistency_policy_show(struct mddev *mddev, char *page)
 static ssize_t
 consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
 {
+	int err = 0;
+
 	if (mddev->pers) {
-		return -EBUSY;
+		if (mddev->pers->change_consistency_policy)
+			err = mddev->pers->change_consistency_policy(mddev, buf);
+		else
+			err = -EBUSY;
 	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
 		set_bit(MD_HAS_PPL, &mddev->flags);
-		return len;
 	} else {
-		return -EINVAL;
+		err = -EINVAL;
 	}
+
+	return err ? err : len;
 }
 
 static struct md_sysfs_entry md_consistency_policy =
diff --git a/drivers/md/md.h b/drivers/md/md.h
index a7b2f16452c4..e0940064c3ec 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -545,6 +545,8 @@ struct md_personality
 	/* congested implements bdi.congested_fn().
 	 * Will not be called while array is 'suspended' */
 	int (*congested)(struct mddev *mddev, int bits);
+	/* Changes the consistency policy of an active array. */
+	int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
 };
 
 struct md_sysfs_entry {
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index b5fcf55edee9..ca1ef644351a 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -1195,6 +1195,10 @@ int ppl_init_log(struct r5conf *conf)
 		 */
 		mddev->recovery_cp = MaxSector;
 		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	} else if (mddev->pers && ppl_conf->mismatch_count > 0) {
+		/* no mismatch allowed when enabling PPL for a running array */
+		ret = -EINVAL;
+		goto err;
 	}
 
 	conf->log_private = ppl_conf;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8b8f172d6807..7a6a6184e2a7 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8213,6 +8213,58 @@ static void *raid6_takeover(struct mddev *mddev)
 	return setup_conf(mddev);
 }
 
+static void raid5_reset_stripe_cache(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+
+	mutex_lock(&conf->cache_size_mutex);
+	while (conf->max_nr_stripes &&
+	       drop_one_stripe(conf))
+		;
+	while (conf->min_nr_stripes > conf->max_nr_stripes &&
+	       grow_one_stripe(conf, GFP_KERNEL))
+		;
+	mutex_unlock(&conf->cache_size_mutex);
+}
+
+static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
+{
+	struct r5conf *conf;
+	int err;
+
+	err = mddev_lock(mddev);
+	if (err)
+		return err;
+	conf = mddev->private;
+	if (!conf) {
+		mddev_unlock(mddev);
+		return -ENODEV;
+	}
+
+	if (strncmp(buf, "ppl", 3) == 0 && !raid5_has_ppl(conf)) {
+		mddev_suspend(mddev);
+		set_bit(MD_HAS_PPL, &mddev->flags);
+		err = log_init(conf, NULL);
+		if (!err)
+			raid5_reset_stripe_cache(mddev);
+		mddev_resume(mddev);
+	} else if (strncmp(buf, "resync", 6) == 0 && raid5_has_ppl(conf)) {
+		mddev_suspend(mddev);
+		log_exit(conf);
+		raid5_reset_stripe_cache(mddev);
+		mddev_resume(mddev);
+	} else {
+		err = -EINVAL;
+	}
+
+	if (!err)
+		md_update_sb(mddev, 1);
+
+	mddev_unlock(mddev);
+
+	return err;
+}
+
 static struct md_personality raid6_personality =
 {
 	.name		= "raid6",
@@ -8258,6 +8310,7 @@ static struct md_personality raid5_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
 	.congested	= raid5_congested,
+	.change_consistency_policy = raid5_change_consistency_policy,
 };
 
 static struct md_personality raid4_personality =
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-09  8:59 ` [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation Artur Paszkiewicz
@ 2017-03-09 23:24   ` Shaohua Li
  2017-03-10 15:16     ` Artur Paszkiewicz
  2017-03-21 22:00   ` NeilBrown
  2017-04-16 22:58   ` Greg Thelen
  2 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-03-09 23:24 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid

On Thu, Mar 09, 2017 at 09:59:59AM +0100, Artur Paszkiewicz wrote:
> Implement the calculation of partial parity for a stripe and PPL write
> logging functionality. The description of PPL is added to the
> documentation. More details can be found in the comments in raid5-ppl.c.
> 
> Attach a page for holding the partial parity data to stripe_head.
> Allocate it only if mddev has the MD_HAS_PPL flag set.
> 
> Partial parity is the xor of not modified data chunks of a stripe and is
> calculated as follows:
> 
> - reconstruct-write case:
>   xor data from all not updated disks in a stripe
> 
> - read-modify-write case:
>   xor old data and parity from all updated disks in a stripe
> 
> Implement it using the async_tx API and integrate into raid_run_ops().
> It must be called when we still have access to old data, so do it when
> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> stored into sh->ppl_page.
> 
> Partial parity is not meaningful for full stripe write and is not stored
> in the log or used for recovery, so don't attempt to calculate it when
> stripe has STRIPE_FULL_WRITE.
> 
> Put the PPL metadata structures to md_p.h because userspace tools
> (mdadm) will also need to read/write PPL.
> 
> Warn about using PPL with enabled disk volatile write-back cache for
> now. It can be removed once disk cache flushing before writing PPL is
> implemented.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>

... snip ...

> +struct dma_async_tx_descriptor *
> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
> +		       struct dma_async_tx_descriptor *tx)
> +{
> +	int disks = sh->disks;
> +	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
> +	int count = 0, pd_idx = sh->pd_idx, i;
> +	struct async_submit_ctl submit;
> +
> +	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
> +
> +	/*
> +	 * Partial parity is the XOR of stripe data chunks that are not changed
> +	 * during the write request. Depending on available data
> +	 * (read-modify-write vs. reconstruct-write case) we calculate it
> +	 * differently.
> +	 */
> +	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
> +		/* rmw: xor old data and parity from updated disks */
> +		for (i = disks; i--;) {
> +			struct r5dev *dev = &sh->dev[i];
> +			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
> +				xor_srcs[count++] = dev->page;
> +		}
> +	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
> +		/* rcw: xor data from all not updated disks */
> +		for (i = disks; i--;) {
> +			struct r5dev *dev = &sh->dev[i];
> +			if (test_bit(R5_UPTODATE, &dev->flags))
> +				xor_srcs[count++] = dev->page;
> +		}
> +	} else {
> +		return tx;
> +	}
> +
> +	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
> +			  flex_array_get(percpu->scribble, 0)
> +			  + sizeof(struct page *) * (sh->disks + 2));

Since this should be done before biodrain, should this add ASYNC_TX_FENCE flag?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 5/7] raid5-ppl: load and recover the log
  2017-03-09  9:00 ` [PATCH v5 5/7] raid5-ppl: load and recover the log Artur Paszkiewicz
@ 2017-03-09 23:30   ` Shaohua Li
  2017-03-10 15:23     ` Artur Paszkiewicz
  0 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-03-09 23:30 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid

On Thu, Mar 09, 2017 at 10:00:01AM +0100, Artur Paszkiewicz wrote:
> Load the log from each disk when starting the array and recover if the
> array is dirty.
> 
> The initial empty PPL is written by mdadm. When loading the log we
> verify the header checksum and signature. For external metadata arrays
> the signature is verified in userspace, so here we read it from the
> header, verifying only that it matches on all disks, and use it later when
> writing PPL.
> 
> In addition to the header checksum, each header entry also contains a
> checksum of its partial parity data. If the header is valid, recovery is
> performed for each entry until an invalid entry is found. If the array
> is not degraded and recovery using PPL fully succeeds, there is no need
> to resync the array because data and parity will be consistent, so in
> this case resync will be disabled.
> 
> Due to compatibility with IMSM implementations on other systems, we
> can't assume that the recovery data block size is always 4K. Writes
> generated by MD raid5 don't have this issue, but when recovering PPL
> written in other environments it is possible to have entries with
> 512-byte sector granularity. The recovery code takes this into account,
> as well as the logical sector size of the underlying drives.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5-ppl.c | 497 +++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c     |   5 +-
>  2 files changed, 501 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index 92783586743d..548d1028a3ce 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -103,6 +103,10 @@ struct ppl_conf {
>  	mempool_t *io_pool;
>  	struct bio_set *bs;
>  	mempool_t *meta_pool;
> +
> +	/* used only for recovery */
> +	int recovered_entries;
> +	int mismatch_count;
>  };
>  
>  struct ppl_log {
> @@ -514,6 +518,482 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>  		ppl_io_unit_finished(io);
>  }
>  
> +static void ppl_xor(int size, struct page *page1, struct page *page2,
> +		    struct page *page_result)
> +{

I'd remove the page_result parameter; it should always be page1. And this will
make it clear why we need ASYNC_TX_XOR_DROP_DST.

> +	struct async_submit_ctl submit;
> +	struct dma_async_tx_descriptor *tx;
> +	struct page *xor_srcs[] = { page1, page2 };
> +
> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
> +			  NULL, NULL, NULL, NULL);
> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
> +
> +	async_tx_quiesce(&tx);
> +}

...
> +			ret = ppl_recover_entry(log, e, ppl_sector);
> +			if (ret)
> +				goto out;
> +			ppl_conf->recovered_entries++;
> +		}
> +
> +		ppl_sector += ppl_entry_sectors;
> +	}
> +
> +	/* flush the disk cache after recovery if necessary */
> +	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {

The block layer will handle this, so you don't need to check

> +		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
> +
> +		bio->bi_bdev = rdev->bdev;
> +		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> +		ret = submit_bio_wait(bio);
> +		bio_put(bio);

please use blkdev_issue_flush() instead.

> +	}
> +out:
> +	__free_page(page);
> +	return ret;
> +}
> +
...
> +static int ppl_load(struct ppl_conf *ppl_conf)
> +{
> +	int ret = 0;
> +	u32 signature = 0;
> +	bool signature_set = false;
> +	int i;
> +
> +	for (i = 0; i < ppl_conf->count; i++) {
> +		struct ppl_log *log = &ppl_conf->child_logs[i];
> +
> +		/* skip missing drive */
> +		if (!log->rdev)
> +			continue;
> +
> +		ret = ppl_load_distributed(log);
> +		if (ret)
> +			break;

Not sure about the strategy here. But if one disk fails, why don't we continue
doing the recovery from the other disks? This way we can at least recover more data.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 0/7] Partial Parity Log for MD RAID 5
  2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
                   ` (6 preceding siblings ...)
  2017-03-09  9:00 ` [PATCH v5 7/7] raid5-ppl: runtime PPL enabling or disabling Artur Paszkiewicz
@ 2017-03-09 23:32 ` Shaohua Li
  2017-03-10 15:40   ` [PATCH] raid5-ppl: two minor improvements Artur Paszkiewicz
  7 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-03-09 23:32 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid

On Thu, Mar 09, 2017 at 09:59:56AM +0100, Artur Paszkiewicz wrote:
> This series of patches implements the Partial Parity Log for RAID5 arrays. The
> purpose of this feature is closing the RAID 5 Write Hole. It is a solution
> alternative to the existing raid5-cache, but the logging workflow and much of
> the implementation is based on it.
> 
> The main differences compared to raid5-cache is that PPL is a distributed log -
> it is stored on array member drives in the metadata area and does not require a
> dedicated journaling drive. Write performance is reduced by up to 30%-40% but
> it scales with the number of drives in the array and the journaling drive does
> not become a bottleneck or a single point of failure. PPL does not protect from
> losing in-flight data, only from silent data corruption. More details about how
> the log works can be found in patches 3 and 5.
> 
> This feature originated from Intel RSTe, which uses IMSM metadata. PPL for IMSM
> is going to be included in RSTe implementations starting with upcoming Xeon
> platforms and Intel will continue supporting and maintaining it. This patchset
> implements PPL for external metadata (specifically IMSM) as well as native MD
> v1.x metadata.
> 
> Changes in mdadm are also required to make this fully usable. Patches for mdadm
> will be sent later.

Looks good, I'll apply this series to next. There are some small issues I
replied to separately. Please send a separate patch to fix them if they make
sense, and I'll integrate them locally.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-09 23:24   ` Shaohua Li
@ 2017-03-10 15:16     ` Artur Paszkiewicz
  2017-03-10 18:15       ` Shaohua Li
  0 siblings, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-10 15:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid

On 03/10/2017 12:24 AM, Shaohua Li wrote:
> On Thu, Mar 09, 2017 at 09:59:59AM +0100, Artur Paszkiewicz wrote:
>> Implement the calculation of partial parity for a stripe and PPL write
>> logging functionality. The description of PPL is added to the
>> documentation. More details can be found in the comments in raid5-ppl.c.
>>
>> Attach a page for holding the partial parity data to stripe_head.
>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>
>> Partial parity is the xor of not modified data chunks of a stripe and is
>> calculated as follows:
>>
>> - reconstruct-write case:
>>   xor data from all not updated disks in a stripe
>>
>> - read-modify-write case:
>>   xor old data and parity from all updated disks in a stripe
>>
>> Implement it using the async_tx API and integrate into raid_run_ops().
>> It must be called when we still have access to old data, so do it when
>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>> stored into sh->ppl_page.
>>
>> Partial parity is not meaningful for full stripe write and is not stored
>> in the log or used for recovery, so don't attempt to calculate it when
>> stripe has STRIPE_FULL_WRITE.
>>
>> Put the PPL metadata structures to md_p.h because userspace tools
>> (mdadm) will also need to read/write PPL.
>>
>> Warn about using PPL with enabled disk volatile write-back cache for
>> now. It can be removed once disk cache flushing before writing PPL is
>> implemented.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> 
> ... snip ...
> 
>> +struct dma_async_tx_descriptor *
>> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
>> +		       struct dma_async_tx_descriptor *tx)
>> +{
>> +	int disks = sh->disks;
>> +	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
>> +	int count = 0, pd_idx = sh->pd_idx, i;
>> +	struct async_submit_ctl submit;
>> +
>> +	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
>> +
>> +	/*
>> +	 * Partial parity is the XOR of stripe data chunks that are not changed
>> +	 * during the write request. Depending on available data
>> +	 * (read-modify-write vs. reconstruct-write case) we calculate it
>> +	 * differently.
>> +	 */
>> +	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
>> +		/* rmw: xor old data and parity from updated disks */
>> +		for (i = disks; i--;) {
>> +			struct r5dev *dev = &sh->dev[i];
>> +			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
>> +				xor_srcs[count++] = dev->page;
>> +		}
>> +	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
>> +		/* rcw: xor data from all not updated disks */
>> +		for (i = disks; i--;) {
>> +			struct r5dev *dev = &sh->dev[i];
>> +			if (test_bit(R5_UPTODATE, &dev->flags))
>> +				xor_srcs[count++] = dev->page;
>> +		}
>> +	} else {
>> +		return tx;
>> +	}
>> +
>> +	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
>> +			  flex_array_get(percpu->scribble, 0)
>> +			  + sizeof(struct page *) * (sh->disks + 2));
> 
> Since this should be done before biodrain, should this add ASYNC_TX_FENCE flag?

The result of this calculation isn't used later by other async_tx
operations, so it's not needed here, if I understand this correctly. But
maybe later we could optimize and use partial parity to calculate full
parity, and then it will be necessary.

Thanks,
Artur

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 5/7] raid5-ppl: load and recover the log
  2017-03-09 23:30   ` Shaohua Li
@ 2017-03-10 15:23     ` Artur Paszkiewicz
  0 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-10 15:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: shli, linux-raid

On 03/10/2017 12:30 AM, Shaohua Li wrote:
> On Thu, Mar 09, 2017 at 10:00:01AM +0100, Artur Paszkiewicz wrote:
>> Load the log from each disk when starting the array and recover if the
>> array is dirty.
>>
>> The initial empty PPL is written by mdadm. When loading the log we
>> verify the header checksum and signature. For external metadata arrays
>> the signature is verified in userspace, so here we read it from the
>> header, verifying only if it matches on all disks, and use it later when
>> writing PPL.
>>
>> In addition to the header checksum, each header entry also contains a
>> checksum of its partial parity data. If the header is valid, recovery is
>> performed for each entry until an invalid entry is found. If the array
>> is not degraded and recovery using PPL fully succeeds, there is no need
>> to resync the array because data and parity will be consistent, so in
>> this case resync will be disabled.
>>
>> Due to compatibility with IMSM implementations on other systems, we
>> can't assume that the recovery data block size is always 4K. Writes
>> generated by MD raid5 don't have this issue, but when recovering PPL
>> written in other environments it is possible to have entries with
>> 512-byte sector granularity. The recovery code takes this into account
>> and also the logical sector size of the underlying drives.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5-ppl.c | 497 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c     |   5 +-
>>  2 files changed, 501 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> index 92783586743d..548d1028a3ce 100644
>> --- a/drivers/md/raid5-ppl.c
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -103,6 +103,10 @@ struct ppl_conf {
>>  	mempool_t *io_pool;
>>  	struct bio_set *bs;
>>  	mempool_t *meta_pool;
>> +
>> +	/* used only for recovery */
>> +	int recovered_entries;
>> +	int mismatch_count;
>>  };
>>  
>>  struct ppl_log {
>> @@ -514,6 +518,482 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>>  		ppl_io_unit_finished(io);
>>  }
>>  
>> +static void ppl_xor(int size, struct page *page1, struct page *page2,
>> +		    struct page *page_result)
>> +{
> 
> I'd remove the page_result parameter, it should always be page1. And this will
> make it clear why we need ASYNC_TX_XOR_DROP_DST.
> 
>> +	struct async_submit_ctl submit;
>> +	struct dma_async_tx_descriptor *tx;
>> +	struct page *xor_srcs[] = { page1, page2 };
>> +
>> +	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>> +			  NULL, NULL, NULL, NULL);
>> +	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
>> +
>> +	async_tx_quiesce(&tx);
>> +}
> 
> ...
>> +			ret = ppl_recover_entry(log, e, ppl_sector);
>> +			if (ret)
>> +				goto out;
>> +			ppl_conf->recovered_entries++;
>> +		}
>> +
>> +		ppl_sector += ppl_entry_sectors;
>> +	}
>> +
>> +	/* flush the disk cache after recovery if necessary */
>> +	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {
> 
> The block layer will handle this, so you don't need to check
> 
>> +		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
>> +
>> +		bio->bi_bdev = rdev->bdev;
>> +		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
>> +		ret = submit_bio_wait(bio);
>> +		bio_put(bio);
> 
> please use blkdev_issue_flush() instead.
> 
>> +	}
>> +out:
>> +	__free_page(page);
>> +	return ret;
>> +}
>> +
> ...
>> +static int ppl_load(struct ppl_conf *ppl_conf)
>> +{
>> +	int ret = 0;
>> +	u32 signature = 0;
>> +	bool signature_set = false;
>> +	int i;
>> +
>> +	for (i = 0; i < ppl_conf->count; i++) {
>> +		struct ppl_log *log = &ppl_conf->child_logs[i];
>> +
>> +		/* skip missing drive */
>> +		if (!log->rdev)
>> +			continue;
>> +
>> +		ret = ppl_load_distributed(log);
>> +		if (ret)
>> +			break;
> 
> Not sure about the strategy here. But if one disk fails, why don't we continue
> do the recovery from other disks? This way we can at least recovery more data.

I thought it would be safer to abort early. Then we can, for example,
remove the failed drive and try again, or disable ppl. And if the array is
already degraded and another disk fails, the recovery won't be
meaningful anyway.

Thanks,
Artur

^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] raid5-ppl: two minor improvements
  2017-03-09 23:32 ` [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Shaohua Li
@ 2017-03-10 15:40   ` Artur Paszkiewicz
  2017-03-10 18:16     ` Shaohua Li
  0 siblings, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-10 15:40 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz

- Remove 'page_result' parameter from ppl_xor() and always write result
  to 'page1'.
- Use blkdev_issue_flush() instead of open coding the flush bio
  submission.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index ca1ef644351a..2419fbc6f684 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -524,8 +524,7 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
 		ppl_io_unit_finished(io);
 }
 
-static void ppl_xor(int size, struct page *page1, struct page *page2,
-		    struct page *page_result)
+static void ppl_xor(int size, struct page *page1, struct page *page2)
 {
 	struct async_submit_ctl submit;
 	struct dma_async_tx_descriptor *tx;
@@ -533,7 +532,7 @@ static void ppl_xor(int size, struct page *page1, struct page *page2,
 
 	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
 			  NULL, NULL, NULL, NULL);
-	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
+	tx = async_xor(page1, xor_srcs, 0, 2, size, &submit);
 
 	async_tx_quiesce(&tx);
 }
@@ -724,7 +723,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
 				goto out;
 			}
 
-			ppl_xor(block_size, page1, page2, page1);
+			ppl_xor(block_size, page1, page2);
 
 			indent -= 2;
 		}
@@ -747,7 +746,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
 				goto out;
 			}
 
-			ppl_xor(block_size, page1, page2, page1);
+			ppl_xor(block_size, page1, page2);
 		}
 
 		/* map raid sector to parity disk */
@@ -846,14 +845,7 @@ static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
 	}
 
 	/* flush the disk cache after recovery if necessary */
-	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {
-		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
-
-		bio->bi_bdev = rdev->bdev;
-		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
-		ret = submit_bio_wait(bio);
-		bio_put(bio);
-	}
+	ret = blkdev_issue_flush(rdev->bdev, GFP_KERNEL, NULL);
 out:
 	__free_page(page);
 	return ret;
-- 
2.11.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-10 15:16     ` Artur Paszkiewicz
@ 2017-03-10 18:15       ` Shaohua Li
  2017-03-10 18:42         ` Dan Williams
  0 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-03-10 18:15 UTC (permalink / raw)
  To: Artur Paszkiewicz, dan.j.williams; +Cc: shli, linux-raid

On Fri, Mar 10, 2017 at 04:16:58PM +0100, Artur Paszkiewicz wrote:
> On 03/10/2017 12:24 AM, Shaohua Li wrote:
> > On Thu, Mar 09, 2017 at 09:59:59AM +0100, Artur Paszkiewicz wrote:
> >> Implement the calculation of partial parity for a stripe and PPL write
> >> logging functionality. The description of PPL is added to the
> >> documentation. More details can be found in the comments in raid5-ppl.c.
> >>
> >> Attach a page for holding the partial parity data to stripe_head.
> >> Allocate it only if mddev has the MD_HAS_PPL flag set.
> >>
> >> Partial parity is the xor of not modified data chunks of a stripe and is
> >> calculated as follows:
> >>
> >> - reconstruct-write case:
> >>   xor data from all not updated disks in a stripe
> >>
> >> - read-modify-write case:
> >>   xor old data and parity from all updated disks in a stripe
> >>
> >> Implement it using the async_tx API and integrate into raid_run_ops().
> >> It must be called when we still have access to old data, so do it when
> >> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> >> stored into sh->ppl_page.
> >>
> >> Partial parity is not meaningful for full stripe write and is not stored
> >> in the log or used for recovery, so don't attempt to calculate it when
> >> stripe has STRIPE_FULL_WRITE.
> >>
> >> Put the PPL metadata structures to md_p.h because userspace tools
> >> (mdadm) will also need to read/write PPL.
> >>
> >> Warn about using PPL with enabled disk volatile write-back cache for
> >> now. It can be removed once disk cache flushing before writing PPL is
> >> implemented.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> > 
> > ... snip ...
> > 
> >> +struct dma_async_tx_descriptor *
> >> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
> >> +		       struct dma_async_tx_descriptor *tx)
> >> +{
> >> +	int disks = sh->disks;
> >> +	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
> >> +	int count = 0, pd_idx = sh->pd_idx, i;
> >> +	struct async_submit_ctl submit;
> >> +
> >> +	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
> >> +
> >> +	/*
> >> +	 * Partial parity is the XOR of stripe data chunks that are not changed
> >> +	 * during the write request. Depending on available data
> >> +	 * (read-modify-write vs. reconstruct-write case) we calculate it
> >> +	 * differently.
> >> +	 */
> >> +	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
> >> +		/* rmw: xor old data and parity from updated disks */
> >> +		for (i = disks; i--;) {
> >> +			struct r5dev *dev = &sh->dev[i];
> >> +			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
> >> +				xor_srcs[count++] = dev->page;
> >> +		}
> >> +	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
> >> +		/* rcw: xor data from all not updated disks */
> >> +		for (i = disks; i--;) {
> >> +			struct r5dev *dev = &sh->dev[i];
> >> +			if (test_bit(R5_UPTODATE, &dev->flags))
> >> +				xor_srcs[count++] = dev->page;
> >> +		}
> >> +	} else {
> >> +		return tx;
> >> +	}
> >> +
> >> +	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
> >> +			  flex_array_get(percpu->scribble, 0)
> >> +			  + sizeof(struct page *) * (sh->disks + 2));
> > 
> > Since this should be done before biodrain, should this add ASYNC_TX_FENCE flag?
> 
> The result of this calculation isn't used later by other async_tx
> operations, so it's not needed here, if I understand this correctly. But
> maybe later we could optimize and use partial parity to calculate full
> parity, then it will be necessary.

But the source pages will be overwritten soon; without a fence, I think this is
a problem. Anyway, I'll let Dan clarify.

Dan,
We are using the async API for the operations below:
1. xor several source pages to a dest page
2. memcpy other data to the source pages

The two operations will be in an async chain. Should the first operation
include the ASYNC_TX_FENCE flag? The API comment says the flag is required if
the destination page is used by subsequent operations, but I suspect it should
also be used if the subsequent operations could change the source pages.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] raid5-ppl: two minor improvements
  2017-03-10 15:40   ` [PATCH] raid5-ppl: two minor improvements Artur Paszkiewicz
@ 2017-03-10 18:16     ` Shaohua Li
  0 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2017-03-10 18:16 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid

On Fri, Mar 10, 2017 at 04:40:44PM +0100, Artur Paszkiewicz wrote:
> - Remove 'page_result' parameter from ppl_xor() and always write result
>   to 'page1'.
> - Use blkdev_issue_flush() instead of open coding the flush bio
>   submission.

Merged into the original patch.
 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5-ppl.c | 18 +++++-------------
>  1 file changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> index ca1ef644351a..2419fbc6f684 100644
> --- a/drivers/md/raid5-ppl.c
> +++ b/drivers/md/raid5-ppl.c
> @@ -524,8 +524,7 @@ void ppl_stripe_write_finished(struct stripe_head *sh)
>  		ppl_io_unit_finished(io);
>  }
>  
> -static void ppl_xor(int size, struct page *page1, struct page *page2,
> -		    struct page *page_result)
> +static void ppl_xor(int size, struct page *page1, struct page *page2)
>  {
>  	struct async_submit_ctl submit;
>  	struct dma_async_tx_descriptor *tx;
> @@ -533,7 +532,7 @@ static void ppl_xor(int size, struct page *page1, struct page *page2,
>  
>  	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
>  			  NULL, NULL, NULL, NULL);
> -	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
> +	tx = async_xor(page1, xor_srcs, 0, 2, size, &submit);
>  
>  	async_tx_quiesce(&tx);
>  }
> @@ -724,7 +723,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>  				goto out;
>  			}
>  
> -			ppl_xor(block_size, page1, page2, page1);
> +			ppl_xor(block_size, page1, page2);
>  
>  			indent -= 2;
>  		}
> @@ -747,7 +746,7 @@ static int ppl_recover_entry(struct ppl_log *log, struct ppl_header_entry *e,
>  				goto out;
>  			}
>  
> -			ppl_xor(block_size, page1, page2, page1);
> +			ppl_xor(block_size, page1, page2);
>  		}
>  
>  		/* map raid sector to parity disk */
> @@ -846,14 +845,7 @@ static int ppl_recover(struct ppl_log *log, struct ppl_header *pplhdr)
>  	}
>  
>  	/* flush the disk cache after recovery if necessary */
> -	if (test_bit(QUEUE_FLAG_WC, &bdev_get_queue(rdev->bdev)->queue_flags)) {
> -		struct bio *bio = bio_alloc_bioset(GFP_KERNEL, 0, ppl_conf->bs);
> -
> -		bio->bi_bdev = rdev->bdev;
> -		bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
> -		ret = submit_bio_wait(bio);
> -		bio_put(bio);
> -	}
> +	ret = blkdev_issue_flush(rdev->bdev, GFP_KERNEL, NULL);
>  out:
>  	__free_page(page);
>  	return ret;
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-10 18:15       ` Shaohua Li
@ 2017-03-10 18:42         ` Dan Williams
  0 siblings, 0 replies; 25+ messages in thread
From: Dan Williams @ 2017-03-10 18:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: Artur Paszkiewicz, Shaohua Li, linux-raid

On Fri, Mar 10, 2017 at 10:15 AM, Shaohua Li <shli@kernel.org> wrote:
> On Fri, Mar 10, 2017 at 04:16:58PM +0100, Artur Paszkiewicz wrote:
>> On 03/10/2017 12:24 AM, Shaohua Li wrote:
>> > On Thu, Mar 09, 2017 at 09:59:59AM +0100, Artur Paszkiewicz wrote:
>> >> Implement the calculation of partial parity for a stripe and PPL write
>> >> logging functionality. The description of PPL is added to the
>> >> documentation. More details can be found in the comments in raid5-ppl.c.
>> >>
>> >> Attach a page for holding the partial parity data to stripe_head.
>> >> Allocate it only if mddev has the MD_HAS_PPL flag set.
>> >>
>> >> Partial parity is the xor of not modified data chunks of a stripe and is
>> >> calculated as follows:
>> >>
>> >> - reconstruct-write case:
>> >>   xor data from all not updated disks in a stripe
>> >>
>> >> - read-modify-write case:
>> >>   xor old data and parity from all updated disks in a stripe
>> >>
>> >> Implement it using the async_tx API and integrate into raid_run_ops().
>> >> It must be called when we still have access to old data, so do it when
>> >> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>> >> stored into sh->ppl_page.
>> >>
>> >> Partial parity is not meaningful for full stripe write and is not stored
>> >> in the log or used for recovery, so don't attempt to calculate it when
>> >> stripe has STRIPE_FULL_WRITE.
>> >>
>> >> Put the PPL metadata structures to md_p.h because userspace tools
>> >> (mdadm) will also need to read/write PPL.
>> >>
>> >> Warn about using PPL with enabled disk volatile write-back cache for
>> >> now. It can be removed once disk cache flushing before writing PPL is
>> >> implemented.
>> >>
>> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> >
>> > ... snip ...
>> >
>> >> +struct dma_async_tx_descriptor *
>> >> +ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
>> >> +                 struct dma_async_tx_descriptor *tx)
>> >> +{
>> >> +  int disks = sh->disks;
>> >> +  struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
>> >> +  int count = 0, pd_idx = sh->pd_idx, i;
>> >> +  struct async_submit_ctl submit;
>> >> +
>> >> +  pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
>> >> +
>> >> +  /*
>> >> +   * Partial parity is the XOR of stripe data chunks that are not changed
>> >> +   * during the write request. Depending on available data
>> >> +   * (read-modify-write vs. reconstruct-write case) we calculate it
>> >> +   * differently.
>> >> +   */
>> >> +  if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
>> >> +          /* rmw: xor old data and parity from updated disks */
>> >> +          for (i = disks; i--;) {
>> >> +                  struct r5dev *dev = &sh->dev[i];
>> >> +                  if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
>> >> +                          xor_srcs[count++] = dev->page;
>> >> +          }
>> >> +  } else if (sh->reconstruct_state == reconstruct_state_drain_run) {
>> >> +          /* rcw: xor data from all not updated disks */
>> >> +          for (i = disks; i--;) {
>> >> +                  struct r5dev *dev = &sh->dev[i];
>> >> +                  if (test_bit(R5_UPTODATE, &dev->flags))
>> >> +                          xor_srcs[count++] = dev->page;
>> >> +          }
>> >> +  } else {
>> >> +          return tx;
>> >> +  }
>> >> +
>> >> +  init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
>> >> +                    flex_array_get(percpu->scribble, 0)
>> >> +                    + sizeof(struct page *) * (sh->disks + 2));
>> >
>> > Since this should be done before biodrain, should this add ASYNC_TX_FENCE flag?
>>
>> The result of this calculation isn't used later by other async_tx
>> operations, so it's not needed here, if I understand this correctly. But
>> maybe later we could optimize and use partial parity to calculate full
>> parity, then it will be necessary.
>
> But the source pages will be overwritten soon; without a fence, I think this is
> a problem. Anyway, I'll let Dan clarify.
>
> Dan,
> We are using the async API for the operations below:
> 1. xor several source pages to a dest page
> 2. memcpy other data to the source pages
>
> The two operations will be in an async chain. Should the first operation
> include the ASYNC_TX_FENCE flag? The API comment says the flag is required if
> the destination page is used by subsequent operations, but I suspect it should
> also be used if the subsequent operations could change the source pages.

I think you're right. If operation2 could change operation1's source
data, we should include the fence.
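
Under that reading, the partial-parity XOR in ops_run_partial_parity() would carry
the fence flag so that the later biodrain copy cannot be reordered ahead of it.
A minimal sketch of the change, assuming the surrounding code from the quoted patch
(the sh->ppl_page destination follows the commit message; the STRIPE_SIZE length is
an assumption, this is not part of the posted series):

	/* fence: the following biodrain ops may overwrite the xor_srcs pages,
	 * so order this XOR before them in the async chain */
	init_async_submit(&submit, ASYNC_TX_FENCE | ASYNC_TX_XOR_ZERO_DST, tx,
			  NULL, sh, flex_array_get(percpu->scribble, 0)
			  + sizeof(struct page *) * (sh->disks + 2));
	tx = async_xor(sh->ppl_page, xor_srcs, 0, count, STRIPE_SIZE, &submit);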

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-09  8:59 ` [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation Artur Paszkiewicz
  2017-03-09 23:24   ` Shaohua Li
@ 2017-03-21 22:00   ` NeilBrown
  2017-03-24 16:46     ` Shaohua Li
  2017-04-16 22:58   ` Greg Thelen
  2 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2017-03-21 22:00 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, Artur Paszkiewicz


On Thu, Mar 09 2017, Artur Paszkiewicz wrote:

> Implement the calculation of partial parity for a stripe and PPL write
> logging functionality. The description of PPL is added to the
> documentation. More details can be found in the comments in raid5-ppl.c.
>
> Attach a page for holding the partial parity data to stripe_head.
> Allocate it only if mddev has the MD_HAS_PPL flag set.
>
> Partial parity is the xor of not modified data chunks of a stripe and is
> calculated as follows:
>
> - reconstruct-write case:
>   xor data from all not updated disks in a stripe
>
> - read-modify-write case:
>   xor old data and parity from all updated disks in a stripe
>
> Implement it using the async_tx API and integrate into raid_run_ops().
> It must be called when we still have access to old data, so do it when
> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> stored into sh->ppl_page.
>
> Partial parity is not meaningful for full stripe write and is not stored
> in the log or used for recovery, so don't attempt to calculate it when
> stripe has STRIPE_FULL_WRITE.
>
> Put the PPL metadata structures to md_p.h because userspace tools
> (mdadm) will also need to read/write PPL.
>
> Warn about using PPL with enabled disk volatile write-back cache for
> now. It can be removed once disk cache flushing before writing PPL is
> implemented.
>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>

Sorry for the delay in getting to this for review...

> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> +					  struct stripe_head *sh)
> +{
> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> +	struct ppl_io_unit *io;
> +	struct ppl_header *pplhdr;
> +
> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> +	if (!io)
> +		return NULL;
> +
> +	memset(io, 0, sizeof(*io));
> +	io->log = log;
> +	INIT_LIST_HEAD(&io->log_sibling);
> +	INIT_LIST_HEAD(&io->stripe_list);
> +	atomic_set(&io->pending_stripes, 0);
> +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
> +
> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);

I'm trying to understand how these two mempool_alloc()s relate, and
particularly why the first one needs to be GFP_ATOMIC, while the second
one can safely be GFP_NOIO.
I see that the allocated memory is freed in different places:
header_page is freed from the bi_endio function as soon as the write
completes, while 'io' is freed later.  But I'm not sure that is enough
to make it safe.

When working with mempools, you need to assume that the pool only
contains one element, and that every time you call mempool_alloc(), it
waits for that one element to be available.  While that doesn't usually
happen, it is possible and if that case isn't handled correctly, the
system can deadlock.

If no memory is available when this mempool_alloc() is called, it will
block.  As it is called from the raid5d thread, the whole array will
block.  So this can only complete safely if the write request has
already been submitted - or if there is some other workqueue which
submit requests after a timeout or similar.
I don't see that in the code.  These ppl_io_unit structures can queue up
and are only submitted later by raid5d (I think).  So if raid5d waits
for one to become free, it will wait forever.

One easy way around this problem (assuming my understanding is correct)
is to just have a single mempool which allocates both a struct
ppl_io_unit and a page.  You would need to define your own alloc/free
routines for the pool but that is easy enough.

Then you only need a single mempool_alloc(), which can sensibly be
GFP_ATOMIC.
If that fails, you queue for later handling as you do now.  If it
succeeds, then you continue to use the memory without any risk of
deadlocking.

Thanks,
NeilBrown


> +	pplhdr = page_address(io->header_page);
> +	clear_page(pplhdr);
> +	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
> +	pplhdr->signature = cpu_to_le32(ppl_conf->signature);
> +
> +	io->seq = atomic64_add_return(1, &ppl_conf->seq);
> +	pplhdr->generation = cpu_to_le64(io->seq);
> +
> +	return io;
> +}
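
For illustration, a minimal sketch of the single-mempool approach described above:
custom alloc/free callbacks hand out a struct ppl_io_unit together with its header
page, so ppl_new_iounit() needs only one allocation. The callback names, the
NR_PPL_IO_UNITS constant and the io_kc kmem_cache are hypothetical and not part of
the posted series:

static void *ppl_io_pool_alloc(gfp_t gfp_mask, void *pool_data)
{
	struct kmem_cache *kc = pool_data;
	struct ppl_io_unit *io;

	io = kmem_cache_alloc(kc, gfp_mask);
	if (!io)
		return NULL;

	io->header_page = alloc_page(gfp_mask);
	if (!io->header_page) {
		kmem_cache_free(kc, io);
		return NULL;
	}

	return io;
}

static void ppl_io_pool_free(void *element, void *pool_data)
{
	struct kmem_cache *kc = pool_data;
	struct ppl_io_unit *io = element;

	__free_page(io->header_page);
	kmem_cache_free(kc, io);
}

/* set up once, e.g. during log initialization:
 *	ppl_conf->io_pool = mempool_create(NR_PPL_IO_UNITS, ppl_io_pool_alloc,
 *					   ppl_io_pool_free, ppl_conf->io_kc);
 */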


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-21 22:00   ` NeilBrown
@ 2017-03-24 16:46     ` Shaohua Li
  2017-03-28 14:12       ` Artur Paszkiewicz
  0 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-03-24 16:46 UTC (permalink / raw)
  To: NeilBrown; +Cc: Artur Paszkiewicz, shli, linux-raid

On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote:
> On Thu, Mar 09 2017, Artur Paszkiewicz wrote:
> 
> > Implement the calculation of partial parity for a stripe and PPL write
> > logging functionality. The description of PPL is added to the
> > documentation. More details can be found in the comments in raid5-ppl.c.
> >
> > Attach a page for holding the partial parity data to stripe_head.
> > Allocate it only if mddev has the MD_HAS_PPL flag set.
> >
> > Partial parity is the xor of not modified data chunks of a stripe and is
> > calculated as follows:
> >
> > - reconstruct-write case:
> >   xor data from all not updated disks in a stripe
> >
> > - read-modify-write case:
> >   xor old data and parity from all updated disks in a stripe
> >
> > Implement it using the async_tx API and integrate into raid_run_ops().
> > It must be called when we still have access to old data, so do it when
> > STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> > stored into sh->ppl_page.
> >
> > Partial parity is not meaningful for full stripe write and is not stored
> > in the log or used for recovery, so don't attempt to calculate it when
> > stripe has STRIPE_FULL_WRITE.
> >
> > Put the PPL metadata structures to md_p.h because userspace tools
> > (mdadm) will also need to read/write PPL.
> >
> > Warn about using PPL with enabled disk volatile write-back cache for
> > now. It can be removed once disk cache flushing before writing PPL is
> > implemented.
> >
> > Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> 
> Sorry for the delay in getting to this for review...
> 
> > +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> > +					  struct stripe_head *sh)
> > +{
> > +	struct ppl_conf *ppl_conf = log->ppl_conf;
> > +	struct ppl_io_unit *io;
> > +	struct ppl_header *pplhdr;
> > +
> > +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> > +	if (!io)
> > +		return NULL;
> > +
> > +	memset(io, 0, sizeof(*io));
> > +	io->log = log;
> > +	INIT_LIST_HEAD(&io->log_sibling);
> > +	INIT_LIST_HEAD(&io->stripe_list);
> > +	atomic_set(&io->pending_stripes, 0);
> > +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
> > +
> > +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
> 
> I'm trying to understand how these two mempool_alloc()s relate, and
> particularly why the first one needs to be GFP_ATOMIC, while the second
> one can safely be GFP_NOIO.
> I see that the allocated memory is freed in different places:
> header_page is called from the bi_endio function as soon as the write
> completes, while 'io' is freed later.  But I'm not sure that is enough
> to make it safe.
> 
> When working with mempools, you need to assume that the pool only
> contains one element, and that every time you call mempool_alloc(), it
> waits for that one element to be available.  While that doesn't usually
> happen, it is possible and if that case isn't handled correctly, the
> system can deadlock.
> 
> If no memory is available when this mempool_alloc() is called, it will
> block.  As it is called from the raid5d thread, the whole array will
> block.  So this can only complete safely is the write request has
> already been submitted - or if there is some other workqueue which
> submit requests after a timeout or similar.
> I don't see that in the code.  These ppl_io_unit structures can queue up
> and are only submitted later by raid5d (I think).  So if raid5d waits
> for one to become free, it will wait forever.
> 
> One easy way around this problem (assuming my understanding is correct)
> is to just have a single mempool which allocates both a struct
> ppl_io_unit and a page.  You would need to define you own alloc/free
> routines for the pool but that is easy enough.
> 
> Then you only need a single mempool_alloc(), which can sensibly be
> GFP_ATOMIC.
> If that fails, you queue for later handling as you do now.  If it
> succeeds, then you continue to use the memory without any risk of
> deadlocking.

Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC with
commit: 5036c39(raid5: allow r5l_io_unit allocations to fail). A single pool
does make sense.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-24 16:46     ` Shaohua Li
@ 2017-03-28 14:12       ` Artur Paszkiewicz
  2017-03-28 16:16         ` Shaohua Li
  0 siblings, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-03-28 14:12 UTC (permalink / raw)
  To: Shaohua Li, NeilBrown; +Cc: shli, linux-raid

On 03/24/2017 05:46 PM, Shaohua Li wrote:
> On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote:
>> On Thu, Mar 09 2017, Artur Paszkiewicz wrote:
>>
>>> Implement the calculation of partial parity for a stripe and PPL write
>>> logging functionality. The description of PPL is added to the
>>> documentation. More details can be found in the comments in raid5-ppl.c.
>>>
>>> Attach a page for holding the partial parity data to stripe_head.
>>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>>
>>> Partial parity is the xor of not modified data chunks of a stripe and is
>>> calculated as follows:
>>>
>>> - reconstruct-write case:
>>>   xor data from all not updated disks in a stripe
>>>
>>> - read-modify-write case:
>>>   xor old data and parity from all updated disks in a stripe
>>>
>>> Implement it using the async_tx API and integrate into raid_run_ops().
>>> It must be called when we still have access to old data, so do it when
>>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>>> stored into sh->ppl_page.
>>>
>>> Partial parity is not meaningful for full stripe write and is not stored
>>> in the log or used for recovery, so don't attempt to calculate it when
>>> stripe has STRIPE_FULL_WRITE.
>>>
>>> Put the PPL metadata structures to md_p.h because userspace tools
>>> (mdadm) will also need to read/write PPL.
>>>
>>> Warn about using PPL with enabled disk volatile write-back cache for
>>> now. It can be removed once disk cache flushing before writing PPL is
>>> implemented.
>>>
>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>>
>> Sorry for the delay in getting to this for review...
>>
>>> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
>>> +					  struct stripe_head *sh)
>>> +{
>>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
>>> +	struct ppl_io_unit *io;
>>> +	struct ppl_header *pplhdr;
>>> +
>>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
>>> +	if (!io)
>>> +		return NULL;
>>> +
>>> +	memset(io, 0, sizeof(*io));
>>> +	io->log = log;
>>> +	INIT_LIST_HEAD(&io->log_sibling);
>>> +	INIT_LIST_HEAD(&io->stripe_list);
>>> +	atomic_set(&io->pending_stripes, 0);
>>> +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
>>> +
>>> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
>>
>> I'm trying to understand how these two mempool_alloc()s relate, and
>> particularly why the first one needs to be GFP_ATOMIC, while the second
>> one can safely be GFP_NOIO.
>> I see that the allocated memory is freed in different places:
>> header_page is freed from the bi_endio function as soon as the write
>> completes, while 'io' is freed later.  But I'm not sure that is enough
>> to make it safe.
>>
>> When working with mempools, you need to assume that the pool only
>> contains one element, and that every time you call mempool_alloc(), it
>> waits for that one element to be available.  While that doesn't usually
>> happen, it is possible and if that case isn't handled correctly, the
>> system can deadlock.
>>
>> If no memory is available when this mempool_alloc() is called, it will
>> block.  As it is called from the raid5d thread, the whole array will
>> block.  So this can only complete safely if the write request has
>> already been submitted - or if there is some other workqueue which
>> submit requests after a timeout or similar.
>> I don't see that in the code.  These ppl_io_unit structures can queue up
>> and are only submitted later by raid5d (I think).  So if raid5d waits
>> for one to become free, it will wait forever.
>>
>> One easy way around this problem (assuming my understanding is correct)
>> is to just have a single mempool which allocates both a struct
>> ppl_io_unit and a page.  You would need to define your own alloc/free
>> routines for the pool but that is easy enough.
>>
>> Then you only need a single mempool_alloc(), which can sensibly be
>> GFP_ATOMIC.
>> If that fails, you queue for later handling as you do now.  If it
>> succeeds, then you continue to use the memory without any risk of
>> deadlocking.
> 
> Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC with
> commit: 5036c39(raid5: allow r5l_io_unit allocations to fail). A single pool
> does make sense.

That's right, I based this on raid5-cache code. I like the approach that
Neil proposed; I'll test it some more and probably send a patch soon. Do
you think GFP_ATOMIC is necessary here, or would GFP_NOWAIT be enough?

Thanks,
Artur

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-28 14:12       ` Artur Paszkiewicz
@ 2017-03-28 16:16         ` Shaohua Li
  0 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2017-03-28 16:16 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: NeilBrown, shli, linux-raid

On Tue, Mar 28, 2017 at 04:12:37PM +0200, Artur Paszkiewicz wrote:
> On 03/24/2017 05:46 PM, Shaohua Li wrote:
> > On Wed, Mar 22, 2017 at 09:00:47AM +1100, Neil Brown wrote:
> >> On Thu, Mar 09 2017, Artur Paszkiewicz wrote:
> >>
> >>> Implement the calculation of partial parity for a stripe and PPL write
> >>> logging functionality. The description of PPL is added to the
> >>> documentation. More details can be found in the comments in raid5-ppl.c.
> >>>
> >>> Attach a page for holding the partial parity data to stripe_head.
> >>> Allocate it only if mddev has the MD_HAS_PPL flag set.
> >>>
> >>> Partial parity is the xor of not modified data chunks of a stripe and is
> >>> calculated as follows:
> >>>
> >>> - reconstruct-write case:
> >>>   xor data from all not updated disks in a stripe
> >>>
> >>> - read-modify-write case:
> >>>   xor old data and parity from all updated disks in a stripe
> >>>
> >>> Implement it using the async_tx API and integrate into raid_run_ops().
> >>> It must be called when we still have access to old data, so do it when
> >>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> >>> stored into sh->ppl_page.
> >>>
> >>> Partial parity is not meaningful for full stripe write and is not stored
> >>> in the log or used for recovery, so don't attempt to calculate it when
> >>> stripe has STRIPE_FULL_WRITE.
> >>>
> >>> Put the PPL metadata structures to md_p.h because userspace tools
> >>> (mdadm) will also need to read/write PPL.
> >>>
> >>> Warn about using PPL with enabled disk volatile write-back cache for
> >>> now. It can be removed once disk cache flushing before writing PPL is
> >>> implemented.
> >>>
> >>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >>
> >> Sorry for the delay in getting to this for review...
> >>
> >>> +static struct ppl_io_unit *ppl_new_iounit(struct ppl_log *log,
> >>> +					  struct stripe_head *sh)
> >>> +{
> >>> +	struct ppl_conf *ppl_conf = log->ppl_conf;
> >>> +	struct ppl_io_unit *io;
> >>> +	struct ppl_header *pplhdr;
> >>> +
> >>> +	io = mempool_alloc(ppl_conf->io_pool, GFP_ATOMIC);
> >>> +	if (!io)
> >>> +		return NULL;
> >>> +
> >>> +	memset(io, 0, sizeof(*io));
> >>> +	io->log = log;
> >>> +	INIT_LIST_HEAD(&io->log_sibling);
> >>> +	INIT_LIST_HEAD(&io->stripe_list);
> >>> +	atomic_set(&io->pending_stripes, 0);
> >>> +	bio_init(&io->bio, io->biovec, PPL_IO_INLINE_BVECS);
> >>> +
> >>> +	io->header_page = mempool_alloc(ppl_conf->meta_pool, GFP_NOIO);
> >>
> >> I'm trying to understand how these two mempool_alloc()s relate, and
> >> particularly why the first one needs to be GFP_ATOMIC, while the second
> >> one can safely be GFP_NOIO.
> >> I see that the allocated memory is freed in different places:
> >> header_page is freed from the bi_endio function as soon as the write
> >> completes, while 'io' is freed later.  But I'm not sure that is enough
> >> to make it safe.
> >>
> >> When working with mempools, you need to assume that the pool only
> >> contains one element, and that every time you call mempool_alloc(), it
> >> waits for that one element to be available.  While that doesn't usually
> >> happen, it is possible and if that case isn't handled correctly, the
> >> system can deadlock.
> >>
> >> If no memory is available when this mempool_alloc() is called, it will
> >> block.  As it is called from the raid5d thread, the whole array will
> >> block.  So this can only complete safely if the write request has
> >> already been submitted - or if there is some other workqueue which
> >> submit requests after a timeout or similar.
> >> I don't see that in the code.  These ppl_io_unit structures can queue up
> >> and are only submitted later by raid5d (I think).  So if raid5d waits
> >> for one to become free, it will wait forever.
> >>
> >> One easy way around this problem (assuming my understanding is correct)
> >> is to just have a single mempool which allocates both a struct
> >> ppl_io_unit and a page.  You would need to define your own alloc/free
> >> routines for the pool but that is easy enough.
> >>
> >> Then you only need a single mempool_alloc(), which can sensibly be
> >> GFP_ATOMIC.
> >> If that fails, you queue for later handling as you do now.  If it
> >> succeeds, then you continue to use the memory without any risk of
> >> deadlocking.
> > 
> > Maybe Artur is following the raid5-cache code, which uses GFP_ATOMIC with
> > commit: 5036c39(raid5: allow r5l_io_unit allocations to fail). A single pool
> > does make sense.
> 
> That's right, I based this on raid5-cache code. I like the approach that
> Neil proposed; I'll test it some more and probably send a patch soon. Do
> you think GFP_ATOMIC is necessary here, or would GFP_NOWAIT be enough?

Since we can handle the error, I don't see a reason why we should use GFP_ATOMIC;
GFP_NOWAIT is fine with me.

Thanks,
Shaohua
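
With the combined pool, the allocation path in ppl_new_iounit() then collapses to a
single non-blocking call. A sketch of how it might start (hypothetical, following
the discussion above; note the header page preallocated by the pool has to survive
the memset):

	struct ppl_conf *ppl_conf = log->ppl_conf;
	struct ppl_io_unit *io;
	struct page *header_page;

	io = mempool_alloc(ppl_conf->io_pool, GFP_NOWAIT);
	if (!io)
		return NULL;	/* caller defers the stripe, as before */

	header_page = io->header_page;	/* allocated together with io */
	memset(io, 0, sizeof(*io));
	io->header_page = header_page;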

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation
  2017-03-09  8:59 ` [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation Artur Paszkiewicz
  2017-03-09 23:24   ` Shaohua Li
  2017-03-21 22:00   ` NeilBrown
@ 2017-04-16 22:58   ` Greg Thelen
  2017-04-19  8:48     ` [PATCH] uapi: fix linux/raid/md_p.h userspace compilation error Artur Paszkiewicz
  2 siblings, 1 reply; 25+ messages in thread
From: Greg Thelen @ 2017-04-16 22:58 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: Shaohua Li, linux-raid

On Thu, Mar 09 2017, Artur Paszkiewicz wrote:

> Implement the calculation of partial parity for a stripe and PPL write
> logging functionality. The description of PPL is added to the
> documentation. More details can be found in the comments in raid5-ppl.c.
> 
> Attach a page for holding the partial parity data to stripe_head.
> Allocate it only if mddev has the MD_HAS_PPL flag set.
> 
> Partial parity is the xor of not modified data chunks of a stripe and is
> calculated as follows:
> 
> - reconstruct-write case:
>   xor data from all not updated disks in a stripe
> 
> - read-modify-write case:
>   xor old data and parity from all updated disks in a stripe
> 
> Implement it using the async_tx API and integrate into raid_run_ops().
> It must be called when we still have access to old data, so do it when
> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> stored into sh->ppl_page.
> 
> Partial parity is not meaningful for full stripe write and is not stored
> in the log or used for recovery, so don't attempt to calculate it when
> stripe has STRIPE_FULL_WRITE.
> 
> Put the PPL metadata structures to md_p.h because userspace tools
> (mdadm) will also need to read/write PPL.

...

> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
> index fe2112810c43..d9a1ead867b9 100644
> --- a/include/uapi/linux/raid/md_p.h
> +++ b/include/uapi/linux/raid/md_p.h
> @@ -398,4 +398,31 @@ struct r5l_meta_block {
>  
>  #define R5LOG_VERSION 0x1
>  #define R5LOG_MAGIC 0x6433c509
> +
> +struct ppl_header_entry {
> +	__le64 data_sector;	/* raid sector of the new data */
> +	__le32 pp_size;		/* length of partial parity */
> +	__le32 data_size;	/* length of data */
> +	__le32 parity_disk;	/* member disk containing parity */
> +	__le32 checksum;	/* checksum of partial parity data for this
> +				 * entry (~crc32c) */
> +} __attribute__ ((__packed__));
> +
> +#define PPL_HEADER_SIZE 4096
> +#define PPL_HDR_RESERVED 512
> +#define PPL_HDR_ENTRY_SPACE \
> +	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))

PPL_HDR_ENTRY_SPACE is ostensibly a function of the ppl_header structure fields.
Can we use its field types, e.g. __le32 rather than u32?

This fixes klibc build error:
  In file included from /klibc/usr/klibc/../include/sys/md.h:30:0,
                   from /klibc/usr/kinit/do_mounts_md.c:19:
  /linux-next/usr/include/linux/raid/md_p.h:414:51: error: 'u32' undeclared here (not in a function)
    (PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))

 #define PPL_HDR_ENTRY_SPACE \
-	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))
+	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(__le32) - sizeof(__le64))

> +#define PPL_HDR_MAX_ENTRIES \
> +	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
> +
> +struct ppl_header {
> +	__u8 reserved[PPL_HDR_RESERVED];/* reserved space, fill with 0xff */
> +	__le32 signature;		/* signature (family number of volume) */
> +	__le32 padding;			/* zero pad */
> +	__le64 generation;		/* generation number of the header */
> +	__le32 entries_count;		/* number of entries in entry array */
> +	__le32 checksum;		/* checksum of the header (~crc32c) */
> +	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
> +} __attribute__ ((__packed__));
> +
>  #endif


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH] uapi: fix linux/raid/md_p.h userspace compilation error
  2017-04-16 22:58   ` Greg Thelen
@ 2017-04-19  8:48     ` Artur Paszkiewicz
  2017-04-19 16:59       ` Greg Thelen
  2017-04-20 16:41       ` Shaohua Li
  0 siblings, 2 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-04-19  8:48 UTC (permalink / raw)
  To: shli; +Cc: linux-raid, gthelen, ncroxon, Artur Paszkiewicz

Use __le32 and __le64 instead of u32 and u64.

This fixes klibc build error:
  In file included from /klibc/usr/klibc/../include/sys/md.h:30:0,
                   from /klibc/usr/kinit/do_mounts_md.c:19:
  /linux-next/usr/include/linux/raid/md_p.h:414:51: error: 'u32' undeclared here (not in a function)
    (PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))

Reported-by: Greg Thelen <gthelen@google.com>
Reported-by: Nigel Croxon <ncroxon@redhat.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 include/uapi/linux/raid/md_p.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index d9a1ead867b9..d500bd224979 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -411,7 +411,7 @@ struct ppl_header_entry {
 #define PPL_HEADER_SIZE 4096
 #define PPL_HDR_RESERVED 512
 #define PPL_HDR_ENTRY_SPACE \
-	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))
+	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(__le32) - sizeof(__le64))
 #define PPL_HDR_MAX_ENTRIES \
 	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
 
-- 
2.11.0
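
With only __le32/__le64 in the uapi header, userspace can include it directly. For
illustration, a small standalone check of the on-disk layout, assuming the patch
above is applied:

#include <stdio.h>
#include <linux/types.h>
#include <linux/raid/md_p.h>

int main(void)
{
	/* packed entry: one __le64 plus four __le32 fields */
	_Static_assert(sizeof(struct ppl_header_entry) == 24,
		       "ppl_header_entry is packed to 24 bytes");
	/* reserved area + fixed fields + entry array must fit in one block */
	_Static_assert(sizeof(struct ppl_header) <= PPL_HEADER_SIZE,
		       "ppl_header must fit in one 4 KiB PPL block");

	printf("max entries per PPL header: %zu\n",
	       (size_t)PPL_HDR_MAX_ENTRIES);
	return 0;
}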


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH] uapi: fix linux/raid/md_p.h userspace compilation error
  2017-04-19  8:48     ` [PATCH] uapi: fix linux/raid/md_p.h userspace compilation error Artur Paszkiewicz
@ 2017-04-19 16:59       ` Greg Thelen
  2017-04-20 16:41       ` Shaohua Li
  1 sibling, 0 replies; 25+ messages in thread
From: Greg Thelen @ 2017-04-19 16:59 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: Shaohua Li, linux-raid, ncroxon

On Wed, Apr 19, 2017 at 1:48 AM, Artur Paszkiewicz
<artur.paszkiewicz@intel.com> wrote:
> Use __le32 and __le64 instead of u32 and u64.
>
> This fixes klibc build error:
>   In file included from /klibc/usr/klibc/../include/sys/md.h:30:0,
>                    from /klibc/usr/kinit/do_mounts_md.c:19:
>   /linux-next/usr/include/linux/raid/md_p.h:414:51: error: 'u32' undeclared here (not in a function)
>     (PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))
>
> Reported-by: Greg Thelen <gthelen@google.com>
> Reported-by: Nigel Croxon <ncroxon@redhat.com>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>

Tested-by: Greg Thelen <gthelen@google.com>

My klibc builds are working again.  Thanks.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH] uapi: fix linux/raid/md_p.h userspace compilation error
  2017-04-19  8:48     ` [PATCH] uapi: fix linux/raid/md_p.h userspace compilation error Artur Paszkiewicz
  2017-04-19 16:59       ` Greg Thelen
@ 2017-04-20 16:41       ` Shaohua Li
  1 sibling, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2017-04-20 16:41 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: shli, linux-raid, gthelen, ncroxon

On Wed, Apr 19, 2017 at 10:48:06AM +0200, Artur Paszkiewicz wrote:
> Use __le32 and __le64 instead of u32 and u64.
> 
> This fixes klibc build error:
>   In file included from /klibc/usr/klibc/../include/sys/md.h:30:0,
>                    from /klibc/usr/kinit/do_mounts_md.c:19:
>   /linux-next/usr/include/linux/raid/md_p.h:414:51: error: 'u32' undeclared here (not in a function)
>     (PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))

applied, thanks!
 
> Reported-by: Greg Thelen <gthelen@google.com>
> Reported-by: Nigel Croxon <ncroxon@redhat.com>
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  include/uapi/linux/raid/md_p.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
> index d9a1ead867b9..d500bd224979 100644
> --- a/include/uapi/linux/raid/md_p.h
> +++ b/include/uapi/linux/raid/md_p.h
> @@ -411,7 +411,7 @@ struct ppl_header_entry {
>  #define PPL_HEADER_SIZE 4096
>  #define PPL_HDR_RESERVED 512
>  #define PPL_HDR_ENTRY_SPACE \
> -	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))
> +	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(__le32) - sizeof(__le64))
>  #define PPL_HDR_MAX_ENTRIES \
>  	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
>  
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread

Thread overview: 25+ messages
2017-03-09  8:59 [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
2017-03-09  8:59 ` [PATCH v5 1/7] md: superblock changes for PPL Artur Paszkiewicz
2017-03-09  8:59 ` [PATCH v5 2/7] raid5: separate header for log functions Artur Paszkiewicz
2017-03-09  8:59 ` [PATCH v5 3/7] raid5-ppl: Partial Parity Log write logging implementation Artur Paszkiewicz
2017-03-09 23:24   ` Shaohua Li
2017-03-10 15:16     ` Artur Paszkiewicz
2017-03-10 18:15       ` Shaohua Li
2017-03-10 18:42         ` Dan Williams
2017-03-21 22:00   ` NeilBrown
2017-03-24 16:46     ` Shaohua Li
2017-03-28 14:12       ` Artur Paszkiewicz
2017-03-28 16:16         ` Shaohua Li
2017-04-16 22:58   ` Greg Thelen
2017-04-19  8:48     ` [PATCH] uapi: fix linux/raid/md_p.h userspace compilation error Artur Paszkiewicz
2017-04-19 16:59       ` Greg Thelen
2017-04-20 16:41       ` Shaohua Li
2017-03-09  9:00 ` [PATCH v5 4/7] md: add sysfs entries for PPL Artur Paszkiewicz
2017-03-09  9:00 ` [PATCH v5 5/7] raid5-ppl: load and recover the log Artur Paszkiewicz
2017-03-09 23:30   ` Shaohua Li
2017-03-10 15:23     ` Artur Paszkiewicz
2017-03-09  9:00 ` [PATCH v5 6/7] raid5-ppl: support disk hot add/remove with PPL Artur Paszkiewicz
2017-03-09  9:00 ` [PATCH v5 7/7] raid5-ppl: runtime PPL enabling or disabling Artur Paszkiewicz
2017-03-09 23:32 ` [PATCH v5 0/7] Partial Parity Log for MD RAID 5 Shaohua Li
2017-03-10 15:40   ` [PATCH] raid5-ppl: two minor improvements Artur Paszkiewicz
2017-03-10 18:16     ` Shaohua Li
