* [PATCH v3 0/9] Partial Parity Log for MD RAID 5
@ 2017-01-30 18:59 Artur Paszkiewicz
  2017-01-30 18:59 ` [PATCH v3 1/9] raid5-cache: move declarations to separate header Artur Paszkiewicz
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

This series of patches implements the Partial Parity Log for RAID5 arrays. The
purpose of this feature is to close the RAID 5 Write Hole. It is an alternative
to the existing raid5-cache solution, but the implementation is based on it and
reuses some of its code by introducing support for interchangeable policies.
This decouples policy from mechanism and avoids adding more boilerplate code
in raid5.c.

The main difference compared to raid5-cache is that PPL is a distributed log -
it is stored in the metadata area on the array's member drives and does not
require a dedicated journaling drive. Write performance is reduced by up to
30-40%, but it scales with the number of drives in the array, and the
journaling drive does not become a bottleneck or a single point of failure.
PPL does not protect against losing in-flight data, only against silent data
corruption. More details about how the log works can be found in patches 5
and 7.

This feature originated from Intel RSTe, which uses IMSM metadata. This
patchset implements PPL for external metadata (specifically IMSM) as well as
native MD v1.x metadata.

Changes in mdadm are also required to make this fully usable. Patches for mdadm
will be sent later.

v3:
- Rebased to latest md for-next.
- Fixed alignment issues in the metadata structures.
- Removed reading IMSM signature from superblock.
- Removed 'rwh_policy' and per-device JournalPpl flags, added
  'consistency_policy', 'ppl_sector' and 'ppl_size' sysfs attributes.
- Reworked and simplified disk removal logic.
- Debug messages in raid5-ppl.c converted to pr_debug().
- Fixed some bugs in logging and recovery code.
- Improved descriptions and documentation.
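For illustration, the new sysfs attributes named above might be used roughly
like this on a running array (paths and values are assumptions based on this
changelog, not verified against the final patches; "dev-sda" stands for any
member device directory):

```
# array-wide policy: e.g. "resync", "ppl" or "journal"
cat /sys/block/md0/md/consistency_policy

# runtime PPL enabling (patch 9)
echo ppl > /sys/block/md0/md/consistency_policy

# per-device PPL location and size, in sectors
cat /sys/block/md0/md/dev-sda/ppl_sector
cat /sys/block/md0/md/dev-sda/ppl_size
```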

v2:
- Rebased to latest md for-next.
- Fixed wrong PPL size calculation for IMSM.
- Simplified full stripe write case.
- Removed direct access to bi_io_vec.
- Handle failed bio_add_page().

Artur Paszkiewicz (9):
  raid5-cache: move declarations to separate header
  raid5-cache: add policy logic
  md: superblock changes for PPL
  raid5: calculate partial parity for a stripe
  raid5-ppl: Partial Parity Log write logging implementation
  md: add sysfs entries for PPL
  raid5-ppl: load and recover the log
  raid5-ppl: support disk hot add/remove with PPL
  raid5-ppl: runtime PPL enabling or disabling

 Documentation/admin-guide/md.rst |   85 ++-
 drivers/md/Makefile              |    2 +-
 drivers/md/md.c                  |  136 +++++
 drivers/md/md.h                  |   10 +
 drivers/md/raid0.c               |    3 +-
 drivers/md/raid1.c               |    3 +-
 drivers/md/raid5-cache.c         |  270 ++++------
 drivers/md/raid5-cache.h         |  197 +++++++
 drivers/md/raid5-ppl.c           | 1101 ++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c               |  195 ++++++-
 drivers/md/raid5.h               |   32 +-
 include/uapi/linux/raid/md_p.h   |   44 +-
 12 files changed, 1875 insertions(+), 203 deletions(-)
 create mode 100644 drivers/md/raid5-cache.h
 create mode 100644 drivers/md/raid5-ppl.c

-- 
2.11.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v3 1/9] raid5-cache: move declarations to separate header
  2017-01-30 18:59 [PATCH v3 0/9] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
@ 2017-01-30 18:59 ` Artur Paszkiewicz
  2017-01-30 18:59 ` [PATCH v3 2/9] raid5-cache: add policy logic Artur Paszkiewicz
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

Subsequent patches will reuse raid5-cache structures and functions, so
move them into their own header.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-cache.c | 133 +------------------------------------
 drivers/md/raid5-cache.h | 166 +++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       |   1 +
 drivers/md/raid5.h       |  30 ---------
 4 files changed, 168 insertions(+), 162 deletions(-)
 create mode 100644 drivers/md/raid5-cache.h

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 76c0e5063f1b..394d87b62efa 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -24,6 +24,7 @@
 #include "md.h"
 #include "raid5.h"
 #include "bitmap.h"
+#include "raid5-cache.h"
 
 /*
  * metadata/data stored in disk with 4k size unit (a block) regardless
@@ -53,16 +54,6 @@
  */
 #define R5L_POOL_SIZE	4
 
-/*
- * r5c journal modes of the array: write-back or write-through.
- * write-through mode has identical behavior as existing log only
- * implementation.
- */
-enum r5c_journal_mode {
-	R5C_JOURNAL_MODE_WRITE_THROUGH = 0,
-	R5C_JOURNAL_MODE_WRITE_BACK = 1,
-};
-
 static char *r5c_journal_mode_str[] = {"write-through",
 				       "write-back"};
 /*
@@ -96,81 +87,6 @@ static char *r5c_journal_mode_str[] = {"write-through",
  *	- return IO for pending writes
  */
 
-struct r5l_log {
-	struct md_rdev *rdev;
-
-	u32 uuid_checksum;
-
-	sector_t device_size;		/* log device size, round to
-					 * BLOCK_SECTORS */
-	sector_t max_free_space;	/* reclaim run if free space is at
-					 * this size */
-
-	sector_t last_checkpoint;	/* log tail. where recovery scan
-					 * starts from */
-	u64 last_cp_seq;		/* log tail sequence */
-
-	sector_t log_start;		/* log head. where new data appends */
-	u64 seq;			/* log head sequence */
-
-	sector_t next_checkpoint;
-
-	struct mutex io_mutex;
-	struct r5l_io_unit *current_io;	/* current io_unit accepting new data */
-
-	spinlock_t io_list_lock;
-	struct list_head running_ios;	/* io_units which are still running,
-					 * and have not yet been completely
-					 * written to the log */
-	struct list_head io_end_ios;	/* io_units which have been completely
-					 * written to the log but not yet written
-					 * to the RAID */
-	struct list_head flushing_ios;	/* io_units which are waiting for log
-					 * cache flush */
-	struct list_head finished_ios;	/* io_units which settle down in log disk */
-	struct bio flush_bio;
-
-	struct list_head no_mem_stripes;   /* pending stripes, -ENOMEM */
-
-	struct kmem_cache *io_kc;
-	mempool_t *io_pool;
-	struct bio_set *bs;
-	mempool_t *meta_pool;
-
-	struct md_thread *reclaim_thread;
-	unsigned long reclaim_target;	/* number of space that need to be
-					 * reclaimed.  if it's 0, reclaim spaces
-					 * used by io_units which are in
-					 * IO_UNIT_STRIPE_END state (eg, reclaim
-					 * dones't wait for specific io_unit
-					 * switching to IO_UNIT_STRIPE_END
-					 * state) */
-	wait_queue_head_t iounit_wait;
-
-	struct list_head no_space_stripes; /* pending stripes, log has no space */
-	spinlock_t no_space_stripes_lock;
-
-	bool need_cache_flush;
-
-	/* for r5c_cache */
-	enum r5c_journal_mode r5c_journal_mode;
-
-	/* all stripes in r5cache, in the order of seq at sh->log_start */
-	struct list_head stripe_in_journal_list;
-
-	spinlock_t stripe_in_journal_lock;
-	atomic_t stripe_in_journal_count;
-
-	/* to submit async io_units, to fulfill ordering of flush */
-	struct work_struct deferred_io_work;
-	/* to disable write back during in degraded mode */
-	struct work_struct disable_writeback_work;
-
-	/* to for chunk_aligned_read in writeback mode, details below */
-	spinlock_t tree_lock;
-	struct radix_tree_root big_stripe_tree;
-};
-
 /*
  * Enable chunk_aligned_read() with write back cache.
  *
@@ -218,53 +134,6 @@ static inline sector_t r5c_tree_index(struct r5conf *conf,
 	return sect;
 }
 
-/*
- * an IO range starts from a meta data block and end at the next meta data
- * block. The io unit's the meta data block tracks data/parity followed it. io
- * unit is written to log disk with normal write, as we always flush log disk
- * first and then start move data to raid disks, there is no requirement to
- * write io unit with FLUSH/FUA
- */
-struct r5l_io_unit {
-	struct r5l_log *log;
-
-	struct page *meta_page;	/* store meta block */
-	int meta_offset;	/* current offset in meta_page */
-
-	struct bio *current_bio;/* current_bio accepting new data */
-
-	atomic_t pending_stripe;/* how many stripes not flushed to raid */
-	u64 seq;		/* seq number of the metablock */
-	sector_t log_start;	/* where the io_unit starts */
-	sector_t log_end;	/* where the io_unit ends */
-	struct list_head log_sibling; /* log->running_ios */
-	struct list_head stripe_list; /* stripes added to the io_unit */
-
-	int state;
-	bool need_split_bio;
-	struct bio *split_bio;
-
-	unsigned int has_flush:1;      /* include flush request */
-	unsigned int has_fua:1;        /* include fua request */
-	unsigned int has_null_flush:1; /* include empty flush request */
-	/*
-	 * io isn't sent yet, flush/fua request can only be submitted till it's
-	 * the first IO in running_ios list
-	 */
-	unsigned int io_deferred:1;
-
-	struct bio_list flush_barriers;   /* size == 0 flush bios */
-};
-
-/* r5l_io_unit state */
-enum r5l_io_unit_state {
-	IO_UNIT_RUNNING = 0,	/* accepting new IO */
-	IO_UNIT_IO_START = 1,	/* io_unit bio start writing to log,
-				 * don't accepting new bio */
-	IO_UNIT_IO_END = 2,	/* io_unit bio finish writing to log */
-	IO_UNIT_STRIPE_END = 3,	/* stripes data finished writing to raid */
-};
-
 bool r5c_is_writeback(struct r5l_log *log)
 {
 	return (log != NULL &&
diff --git a/drivers/md/raid5-cache.h b/drivers/md/raid5-cache.h
new file mode 100644
index 000000000000..96dc95d4a36c
--- /dev/null
+++ b/drivers/md/raid5-cache.h
@@ -0,0 +1,166 @@
+#ifndef _RAID5_CACHE_H
+#define _RAID5_CACHE_H
+
+/*
+ * r5c journal modes of the array: write-back or write-through.
+ * write-through mode has identical behavior as existing log only
+ * implementation.
+ */
+enum r5c_journal_mode {
+	R5C_JOURNAL_MODE_WRITE_THROUGH = 0,
+	R5C_JOURNAL_MODE_WRITE_BACK = 1,
+};
+
+struct r5l_log {
+	struct md_rdev *rdev;
+
+	u32 uuid_checksum;
+
+	sector_t device_size;		/* log device size, round to
+					 * BLOCK_SECTORS */
+	sector_t max_free_space;	/* reclaim run if free space is at
+					 * this size */
+
+	sector_t last_checkpoint;	/* log tail. where recovery scan
+					 * starts from */
+	u64 last_cp_seq;		/* log tail sequence */
+
+	sector_t log_start;		/* log head. where new data appends */
+	u64 seq;			/* log head sequence */
+
+	sector_t next_checkpoint;
+
+	struct mutex io_mutex;
+	struct r5l_io_unit *current_io;	/* current io_unit accepting new data */
+
+	spinlock_t io_list_lock;
+	struct list_head running_ios;	/* io_units which are still running,
+					 * and have not yet been completely
+					 * written to the log */
+	struct list_head io_end_ios;	/* io_units which have been completely
+					 * written to the log but not yet written
+					 * to the RAID */
+	struct list_head flushing_ios;	/* io_units which are waiting for log
+					 * cache flush */
+	struct list_head finished_ios;	/* io_units which settle down in log disk */
+	struct bio flush_bio;
+
+	struct list_head no_mem_stripes;   /* pending stripes, -ENOMEM */
+
+	struct kmem_cache *io_kc;
+	mempool_t *io_pool;
+	struct bio_set *bs;
+	mempool_t *meta_pool;
+
+	struct md_thread *reclaim_thread;
+	unsigned long reclaim_target;	/* number of space that need to be
+					 * reclaimed.  if it's 0, reclaim spaces
+					 * used by io_units which are in
+					 * IO_UNIT_STRIPE_END state (eg, reclaim
+					 * dones't wait for specific io_unit
+					 * switching to IO_UNIT_STRIPE_END
+					 * state) */
+	wait_queue_head_t iounit_wait;
+
+	struct list_head no_space_stripes; /* pending stripes, log has no space */
+	spinlock_t no_space_stripes_lock;
+
+	bool need_cache_flush;
+
+	/* for r5c_cache */
+	enum r5c_journal_mode r5c_journal_mode;
+
+	/* all stripes in r5cache, in the order of seq at sh->log_start */
+	struct list_head stripe_in_journal_list;
+
+	spinlock_t stripe_in_journal_lock;
+	atomic_t stripe_in_journal_count;
+
+	/* to submit async io_units, to fulfill ordering of flush */
+	struct work_struct deferred_io_work;
+	/* to disable write back during in degraded mode */
+	struct work_struct disable_writeback_work;
+
+	/* to for chunk_aligned_read in writeback mode, details below */
+	spinlock_t tree_lock;
+	struct radix_tree_root big_stripe_tree;
+};
+
+/*
+ * an IO range starts from a meta data block and end at the next meta data
+ * block. The io unit's the meta data block tracks data/parity followed it. io
+ * unit is written to log disk with normal write, as we always flush log disk
+ * first and then start move data to raid disks, there is no requirement to
+ * write io unit with FLUSH/FUA
+ */
+struct r5l_io_unit {
+	struct r5l_log *log;
+
+	struct page *meta_page;	/* store meta block */
+	int meta_offset;	/* current offset in meta_page */
+
+	struct bio *current_bio;/* current_bio accepting new data */
+
+	atomic_t pending_stripe;/* how many stripes not flushed to raid */
+	u64 seq;		/* seq number of the metablock */
+	sector_t log_start;	/* where the io_unit starts */
+	sector_t log_end;	/* where the io_unit ends */
+	struct list_head log_sibling; /* log->running_ios */
+	struct list_head stripe_list; /* stripes added to the io_unit */
+
+	int state;
+	bool need_split_bio;
+	struct bio *split_bio;
+
+	unsigned int has_flush:1;      /* include flush request */
+	unsigned int has_fua:1;        /* include fua request */
+	unsigned int has_null_flush:1; /* include empty flush request */
+	/*
+	 * io isn't sent yet, flush/fua request can only be submitted till it's
+	 * the first IO in running_ios list
+	 */
+	unsigned int io_deferred:1;
+
+	struct bio_list flush_barriers;   /* size == 0 flush bios */
+};
+
+/* r5l_io_unit state */
+enum r5l_io_unit_state {
+	IO_UNIT_RUNNING = 0,	/* accepting new IO */
+	IO_UNIT_IO_START = 1,	/* io_unit bio start writing to log,
+				 * don't accepting new bio */
+	IO_UNIT_IO_END = 2,	/* io_unit bio finish writing to log */
+	IO_UNIT_STRIPE_END = 3,	/* stripes data finished writing to raid */
+};
+
+extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
+extern void r5l_exit_log(struct r5l_log *log);
+extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh);
+extern void r5l_write_stripe_run(struct r5l_log *log);
+extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
+extern void r5l_stripe_write_finished(struct stripe_head *sh);
+extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
+extern void r5l_quiesce(struct r5l_log *log, int state);
+extern bool r5l_log_disk_error(struct r5conf *conf);
+extern bool r5c_is_writeback(struct r5l_log *log);
+extern int
+r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
+		      struct stripe_head_state *s, int disks);
+extern void
+r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
+			    struct stripe_head_state *s);
+extern void r5c_release_extra_page(struct stripe_head *sh);
+extern void r5c_use_extra_page(struct stripe_head *sh);
+extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
+extern void r5c_handle_cached_data_endio(struct r5conf *conf,
+	struct stripe_head *sh, int disks, struct bio_list *return_bi);
+extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
+			  struct stripe_head_state *s);
+extern void r5c_make_stripe_write_out(struct stripe_head *sh);
+extern void r5c_flush_cache(struct r5conf *conf, int num);
+extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
+extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+extern struct md_sysfs_entry r5c_journal_mode;
+extern void r5c_update_on_rdev_error(struct mddev *mddev);
+extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+#endif /* _RAID5_CACHE_H */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b62f671a93ab..d1cba941951e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -61,6 +61,7 @@
 #include "raid5.h"
 #include "raid0.h"
 #include "bitmap.h"
+#include "raid5-cache.h"
 
 #define UNSUPPORTED_MDDEV_FLAGS	(1L << MD_FAILFAST_SUPPORTED)
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index c0687df5ba06..0f64a58873de 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -763,34 +763,4 @@ extern struct stripe_head *
 raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 			int previous, int noblock, int noquiesce);
 extern int raid5_calc_degraded(struct r5conf *conf);
-extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
-extern void r5l_exit_log(struct r5l_log *log);
-extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *head_sh);
-extern void r5l_write_stripe_run(struct r5l_log *log);
-extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
-extern void r5l_stripe_write_finished(struct stripe_head *sh);
-extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
-extern void r5l_quiesce(struct r5l_log *log, int state);
-extern bool r5l_log_disk_error(struct r5conf *conf);
-extern bool r5c_is_writeback(struct r5l_log *log);
-extern int
-r5c_try_caching_write(struct r5conf *conf, struct stripe_head *sh,
-		      struct stripe_head_state *s, int disks);
-extern void
-r5c_finish_stripe_write_out(struct r5conf *conf, struct stripe_head *sh,
-			    struct stripe_head_state *s);
-extern void r5c_release_extra_page(struct stripe_head *sh);
-extern void r5c_use_extra_page(struct stripe_head *sh);
-extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
-extern void r5c_handle_cached_data_endio(struct r5conf *conf,
-	struct stripe_head *sh, int disks, struct bio_list *return_bi);
-extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
-			  struct stripe_head_state *s);
-extern void r5c_make_stripe_write_out(struct stripe_head *sh);
-extern void r5c_flush_cache(struct r5conf *conf, int num);
-extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
-extern void r5c_check_cached_full_stripe(struct r5conf *conf);
-extern struct md_sysfs_entry r5c_journal_mode;
-extern void r5c_update_on_rdev_error(struct mddev *mddev);
-extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
 #endif
-- 
2.11.0



* [PATCH v3 2/9] raid5-cache: add policy logic
  2017-01-30 18:59 [PATCH v3 0/9] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
  2017-01-30 18:59 ` [PATCH v3 1/9] raid5-cache: move declarations to separate header Artur Paszkiewicz
@ 2017-01-30 18:59 ` Artur Paszkiewicz
  2017-01-30 18:59 ` [PATCH v3 3/9] md: superblock changes for PPL Artur Paszkiewicz
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

Add wrappers for public log functions from raid5-cache and a struct
r5l_policy containing handlers for the log operations. This allows
adding different policies for raid5 logging without changing the
mechanism - calls from the raid5 personality stay the same.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-cache.c | 116 +++++++++++++++++++++++++++++++++++++----------
 drivers/md/raid5-cache.h |  14 ++++++
 2 files changed, 106 insertions(+), 24 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 394d87b62efa..6fac581804a9 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -787,7 +787,7 @@ static inline void r5l_add_no_space_stripe(struct r5l_log *log,
  * running in raid5d, where reclaim could wait for raid5d too (when it flushes
  * data from log to raid disks), so we shouldn't wait for reclaim here
  */
-int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
+static int __r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 	int write_disks = 0;
@@ -797,8 +797,6 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 	int ret = 0;
 	bool wake_reclaim = false;
 
-	if (!log)
-		return -EAGAIN;
 	/* Don't support stripe batch */
 	if (sh->log_io || !test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags) ||
 	    test_bit(STRIPE_SYNCING, &sh->state)) {
@@ -885,19 +883,28 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 	return 0;
 }
 
-void r5l_write_stripe_run(struct r5l_log *log)
+int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
+{
+	if (log && log->policy->write_stripe)
+		return log->policy->write_stripe(log, sh);
+	return -EAGAIN;
+}
+
+static void __r5l_write_stripe_run(struct r5l_log *log)
 {
-	if (!log)
-		return;
 	mutex_lock(&log->io_mutex);
 	r5l_submit_current_io(log);
 	mutex_unlock(&log->io_mutex);
 }
 
-int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
+void r5l_write_stripe_run(struct r5l_log *log)
+{
+	if (log && log->policy->write_stripe_run)
+		log->policy->write_stripe_run(log);
+}
+
+static int __r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
 {
-	if (!log)
-		return -ENODEV;
 
 	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH) {
 		/*
@@ -929,6 +936,13 @@ int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
 	return -EAGAIN;
 }
 
+int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
+{
+	if (log && log->policy->handle_flush_request)
+		return log->policy->handle_flush_request(log, bio);
+	return -ENODEV;
+}
+
 /* This will run after log space is reclaimed */
 static void r5l_run_no_space_stripes(struct r5l_log *log)
 {
@@ -1049,8 +1063,9 @@ void r5l_stripe_write_finished(struct stripe_head *sh)
 	io = sh->log_io;
 	sh->log_io = NULL;
 
-	if (io && atomic_dec_and_test(&io->pending_stripe))
-		__r5l_stripe_write_finished(io);
+	if (io && atomic_dec_and_test(&io->pending_stripe) &&
+	    io->log->policy->stripe_write_finished)
+		io->log->policy->stripe_write_finished(io);
 }
 
 static void r5l_log_flush_endio(struct bio *bio)
@@ -1084,11 +1099,11 @@ static void r5l_log_flush_endio(struct bio *bio)
  * only write stripes of an io_unit to raid disks till the io_unit is the first
  * one whose data/parity is in log.
  */
-void r5l_flush_stripe_to_raid(struct r5l_log *log)
+static void __r5l_flush_stripe_to_raid(struct r5l_log *log)
 {
 	bool do_flush;
 
-	if (!log || !log->need_cache_flush)
+	if (!log->need_cache_flush)
 		return;
 
 	spin_lock_irq(&log->io_list_lock);
@@ -1110,6 +1125,12 @@ void r5l_flush_stripe_to_raid(struct r5l_log *log)
 	submit_bio(&log->flush_bio);
 }
 
+void r5l_flush_stripe_to_raid(struct r5l_log *log)
+{
+	if (log && log->policy->flush_stripe_to_raid)
+		log->policy->flush_stripe_to_raid(log);
+}
+
 static void r5l_write_super(struct r5l_log *log, sector_t cp);
 static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	sector_t end)
@@ -1366,10 +1387,10 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 	md_wakeup_thread(log->reclaim_thread);
 }
 
-void r5l_quiesce(struct r5l_log *log, int state)
+static void __r5l_quiesce(struct r5l_log *log, int state)
 {
 	struct mddev *mddev;
-	if (!log || state == 2)
+	if (state == 2)
 		return;
 	if (state == 0)
 		kthread_unpark(log->reclaim_thread->tsk);
@@ -1383,6 +1404,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	}
 }
 
+void r5l_quiesce(struct r5l_log *log, int state)
+{
+	if (log && log->policy->quiesce)
+		log->policy->quiesce(log, state);
+}
+
 bool r5l_log_disk_error(struct r5conf *conf)
 {
 	struct r5l_log *log;
@@ -2626,11 +2653,10 @@ void r5c_update_on_rdev_error(struct mddev *mddev)
 		schedule_work(&log->disable_writeback_work);
 }
 
-int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
+static int __r5l_init_log(struct r5l_log *log, struct r5conf *conf)
 {
+	struct md_rdev *rdev = log->rdev;
 	struct request_queue *q = bdev_get_queue(rdev->bdev);
-	struct r5l_log *log;
-
 	if (PAGE_SIZE != 4096)
 		return -EINVAL;
 
@@ -2649,10 +2675,6 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 		return -EINVAL;
 	}
 
-	log = kzalloc(sizeof(*log), GFP_KERNEL);
-	if (!log)
-		return -ENOMEM;
-	log->rdev = rdev;
 
 	log->need_cache_flush = test_bit(QUEUE_FLAG_WC, &q->queue_flags) != 0;
 
@@ -2728,11 +2750,10 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 io_pool:
 	kmem_cache_destroy(log->io_kc);
 io_kc:
-	kfree(log);
 	return -EINVAL;
 }
 
-void r5l_exit_log(struct r5l_log *log)
+static void __r5l_exit_log(struct r5l_log *log)
 {
 	flush_work(&log->disable_writeback_work);
 	md_unregister_thread(&log->reclaim_thread);
@@ -2740,5 +2761,52 @@ void r5l_exit_log(struct r5l_log *log)
 	bioset_free(log->bs);
 	mempool_destroy(log->io_pool);
 	kmem_cache_destroy(log->io_kc);
+}
+
+void r5l_exit_log(struct r5l_log *log)
+{
+	if (!log)
+		return;
+
+	if (log->policy->exit_log)
+		log->policy->exit_log(log);
+
 	kfree(log);
 }
+
+struct r5l_policy r5l_journal = {
+	.init_log = __r5l_init_log,
+	.exit_log = __r5l_exit_log,
+	.write_stripe = __r5l_write_stripe,
+	.write_stripe_run = __r5l_write_stripe_run,
+	.flush_stripe_to_raid = __r5l_flush_stripe_to_raid,
+	.stripe_write_finished = __r5l_stripe_write_finished,
+	.handle_flush_request = __r5l_handle_flush_request,
+	.quiesce = __r5l_quiesce,
+};
+
+int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
+{
+	int ret;
+	struct r5l_log *log;
+	struct mddev *mddev = conf->mddev;
+
+	log = kzalloc(sizeof(*log), GFP_KERNEL);
+	if (!log)
+		return -ENOMEM;
+
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		log->policy = &r5l_journal;
+	} else {
+		kfree(log);
+		return -EINVAL;
+	}
+
+	log->rdev = rdev;
+
+	ret = log->policy->init_log(log, conf);
+	if (ret)
+		kfree(log);
+
+	return ret;
+}
diff --git a/drivers/md/raid5-cache.h b/drivers/md/raid5-cache.h
index 96dc95d4a36c..97803f3ae0fe 100644
--- a/drivers/md/raid5-cache.h
+++ b/drivers/md/raid5-cache.h
@@ -84,6 +84,9 @@ struct r5l_log {
 	/* to for chunk_aligned_read in writeback mode, details below */
 	spinlock_t tree_lock;
 	struct radix_tree_root big_stripe_tree;
+
+	/* handlers for log operations */
+	struct r5l_policy *policy;
 };
 
 /*
@@ -133,6 +136,17 @@ enum r5l_io_unit_state {
 	IO_UNIT_STRIPE_END = 3,	/* stripes data finished writing to raid */
 };
 
+struct r5l_policy {
+	int (*init_log)(struct r5l_log *log, struct r5conf *conf);
+	void (*exit_log)(struct r5l_log *log);
+	int (*write_stripe)(struct r5l_log *log, struct stripe_head *sh);
+	void (*write_stripe_run)(struct r5l_log *log);
+	void (*flush_stripe_to_raid)(struct r5l_log *log);
+	void (*stripe_write_finished)(struct r5l_io_unit *io);
+	int (*handle_flush_request)(struct r5l_log *log, struct bio *bio);
+	void (*quiesce)(struct r5l_log *log, int state);
+};
+
 extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
 extern void r5l_exit_log(struct r5l_log *log);
 extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh);
-- 
2.11.0



* [PATCH v3 3/9] md: superblock changes for PPL
  2017-01-30 18:59 [PATCH v3 0/9] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
  2017-01-30 18:59 ` [PATCH v3 1/9] raid5-cache: move declarations to separate header Artur Paszkiewicz
  2017-01-30 18:59 ` [PATCH v3 2/9] raid5-cache: add policy logic Artur Paszkiewicz
@ 2017-01-30 18:59 ` Artur Paszkiewicz
  2017-02-07 21:20   ` Shaohua Li
  2017-01-30 18:59 ` [PATCH v3 4/9] raid5: calculate partial parity for a stripe Artur Paszkiewicz
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

Include information about the PPL location and size in mdp_superblock_1 and
copy it to/from the rdev. Because PPL is mutually exclusive with the bitmap,
store it in place of 'bitmap_offset'. Add a new 'feature_map' flag,
MD_FEATURE_PPL, analogous to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL to
mddev->flags to indicate that PPL is enabled on an array.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c                | 15 +++++++++++++++
 drivers/md/md.h                |  8 ++++++++
 drivers/md/raid0.c             |  3 ++-
 drivers/md/raid1.c             |  3 ++-
 include/uapi/linux/raid/md_p.h | 18 ++++++++++++++----
 5 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 85ac98417a08..e96f73572e23 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1566,6 +1566,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
 	} else if (sb->bblog_offset != 0)
 		rdev->badblocks.shift = 0;
 
+	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
+		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
+		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
+		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
+	}
+
 	if (!refdev) {
 		ret = 1;
 	} else {
@@ -1678,6 +1684,9 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 
 		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
 			set_bit(MD_HAS_JOURNAL, &mddev->flags);
+
+		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL)
+			set_bit(MD_HAS_PPL, &mddev->flags);
 	} else if (mddev->pers == NULL) {
 		/* Insist of good event counter while assembling, except for
 		 * spares (which don't need an event count) */
@@ -1891,6 +1900,12 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 	if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
 		sb->feature_map |= cpu_to_le32(MD_FEATURE_JOURNAL);
 
+	if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		sb->feature_map |= cpu_to_le32(MD_FEATURE_PPL);
+		sb->ppl.offset = cpu_to_le16(rdev->ppl.offset);
+		sb->ppl.size = cpu_to_le16(rdev->ppl.size);
+	}
+
 	rdev_for_each(rdev2, mddev) {
 		i = rdev2->desc_nr;
 		if (test_bit(Faulty, &rdev2->flags))
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 968bbe72b237..abdb5f2ed2d3 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -122,6 +122,13 @@ struct md_rdev {
 					   * sysfs entry */
 
 	struct badblocks badblocks;
+
+	struct {
+		short offset;	/* Offset from superblock to start of PPL.
+				 * Not used by external metadata. */
+		unsigned int size;	/* Size in sectors of the PPL space */
+		sector_t sector;	/* First sector of the PPL space */
+	} ppl;
 };
 enum flag_bits {
 	Faulty,			/* device is known to have a fault */
@@ -229,6 +236,7 @@ enum mddev_flags {
 				 * supported as calls to md_error() will
 				 * never cause the array to become failed.
 				 */
+	MD_HAS_PPL,		/* The raid array has PPL feature set */
 };
 
 enum mddev_sb_flags {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 5b3db367814a..37fc1f5185a9 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -29,7 +29,8 @@
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
 	 (1L << MD_JOURNAL_CLEAN) |	\
-	 (1L << MD_FAILFAST_SUPPORTED))
+	 (1L << MD_FAILFAST_SUPPORTED) |\
+	 (1L << MD_HAS_PPL))
 
 static int raid0_congested(struct mddev *mddev, int bits)
 {
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647bcccb..53623a31b074 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -44,7 +44,8 @@
 
 #define UNSUPPORTED_MDDEV_FLAGS		\
 	((1L << MD_HAS_JOURNAL) |	\
-	 (1L << MD_JOURNAL_CLEAN))
+	 (1L << MD_JOURNAL_CLEAN) |	\
+	 (1L << MD_HAS_PPL))
 
 /*
  * Number of guaranteed r1bios in case of extreme VM load:
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index 9930f3e9040f..fe2112810c43 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -242,10 +242,18 @@ struct mdp_superblock_1 {
 
 	__le32	chunksize;	/* in 512byte sectors */
 	__le32	raid_disks;
-	__le32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
-				 * NOTE: signed, so bitmap can be before superblock
-				 * only meaningful of feature_map[0] is set.
-				 */
+	union {
+		__le32	bitmap_offset;	/* sectors after start of superblock that bitmap starts
+					 * NOTE: signed, so bitmap can be before superblock
+					 * only meaningful if feature_map[0] is set.
+					 */
+
+		/* only meaningful when feature_map[MD_FEATURE_PPL] is set */
+		struct {
+			__le16 offset; /* sectors from start of superblock that ppl starts (signed) */
+			__le16 size; /* ppl size in sectors */
+		} ppl;
+	};
 
 	/* These are only valid with feature bit '4' */
 	__le32	new_level;	/* new level we are reshaping to		*/
@@ -318,6 +326,7 @@ struct mdp_superblock_1 {
 					     */
 #define MD_FEATURE_CLUSTERED		256 /* clustered MD */
 #define	MD_FEATURE_JOURNAL		512 /* support write cache */
+#define	MD_FEATURE_PPL			1024 /* support PPL */
 #define	MD_FEATURE_ALL			(MD_FEATURE_BITMAP_OFFSET	\
 					|MD_FEATURE_RECOVERY_OFFSET	\
 					|MD_FEATURE_RESHAPE_ACTIVE	\
@@ -328,6 +337,7 @@ struct mdp_superblock_1 {
 					|MD_FEATURE_RECOVERY_BITMAP	\
 					|MD_FEATURE_CLUSTERED		\
 					|MD_FEATURE_JOURNAL		\
+					|MD_FEATURE_PPL			\
 					)
 
 struct r5l_payload_header {
-- 
2.11.0



* [PATCH v3 4/9] raid5: calculate partial parity for a stripe
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

Attach a page for holding the partial parity data to stripe_head.
Allocate it only if mddev has the MD_HAS_PPL flag set.

Partial parity is the XOR of the stripe's data chunks that are not
modified by the write request and is calculated as follows:

- reconstruct-write case:
  XOR the data from all disks in the stripe that are not being updated

- read-modify-write case:
  XOR the old data from the disks being updated and the old parity

Implement it using the async_tx API and integrate it into raid_run_ops().
It must run while the old data is still accessible, so do it when
STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
stored in sh->ppl_page.

Partial parity is not meaningful for a full stripe write and is not stored
in the log or used for recovery, so don't attempt to calculate it when the
stripe has STRIPE_FULL_WRITE set.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.h |  2 ++
 2 files changed, 100 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index d1cba941951e..e1e238da32ba 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -466,6 +466,11 @@ static void shrink_buffers(struct stripe_head *sh)
 		sh->dev[i].page = NULL;
 		put_page(p);
 	}
+
+	if (sh->ppl_page) {
+		put_page(sh->ppl_page);
+		sh->ppl_page = NULL;
+	}
 }
 
 static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
@@ -482,6 +487,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
 		sh->dev[i].page = page;
 		sh->dev[i].orig_page = page;
 	}
+
+	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {
+		sh->ppl_page = alloc_page(gfp);
+		if (!sh->ppl_page)
+			return 1;
+	}
+
 	return 0;
 }
 
@@ -1977,6 +1989,55 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
 			   &sh->ops.zero_sum_result, percpu->spare_page, &submit);
 }
 
+static struct dma_async_tx_descriptor *
+ops_run_partial_parity(struct stripe_head *sh, struct raid5_percpu *percpu,
+		       struct dma_async_tx_descriptor *tx)
+{
+	int disks = sh->disks;
+	struct page **xor_srcs = flex_array_get(percpu->scribble, 0);
+	int count = 0, pd_idx = sh->pd_idx, i;
+	struct async_submit_ctl submit;
+
+	pr_debug("%s: stripe %llu\n", __func__, (unsigned long long)sh->sector);
+
+	/*
+	 * Partial parity is the XOR of stripe data chunks that are not changed
+	 * during the write request. Depending on available data
+	 * (read-modify-write vs. reconstruct-write case) we calculate it
+	 * differently.
+	 */
+	if (sh->reconstruct_state == reconstruct_state_prexor_drain_run) {
+		/* rmw: xor old data and parity from updated disks */
+		for (i = disks; i--;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_Wantdrain, &dev->flags) || i == pd_idx)
+				xor_srcs[count++] = dev->page;
+		}
+	} else if (sh->reconstruct_state == reconstruct_state_drain_run) {
+		/* rcw: xor data from all not updated disks */
+		for (i = disks; i--;) {
+			struct r5dev *dev = &sh->dev[i];
+			if (test_bit(R5_UPTODATE, &dev->flags))
+				xor_srcs[count++] = dev->page;
+		}
+	} else {
+		return tx;
+	}
+
+	init_async_submit(&submit, ASYNC_TX_XOR_ZERO_DST, tx, NULL, sh,
+			  flex_array_get(percpu->scribble, 0)
+			  + sizeof(struct page *) * (sh->disks + 2));
+
+	if (count == 1)
+		tx = async_memcpy(sh->ppl_page, xor_srcs[0], 0, 0, PAGE_SIZE,
+				  &submit);
+	else
+		tx = async_xor(sh->ppl_page, xor_srcs, 0, count, PAGE_SIZE,
+			       &submit);
+
+	return tx;
+}
+
 static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 {
 	int overlap_clear = 0, i, disks = sh->disks;
@@ -2007,6 +2068,9 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 			async_tx_ack(tx);
 	}
 
+	if (test_bit(STRIPE_OP_PARTIAL_PARITY, &ops_request))
+		tx = ops_run_partial_parity(sh, percpu, tx);
+
 	if (test_bit(STRIPE_OP_PREXOR, &ops_request)) {
 		if (level < 6)
 			tx = ops_run_prexor5(sh, percpu, tx);
@@ -3058,6 +3122,12 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		s->locked++;
 	}
 
+	if (level == 5 && test_bit(MD_HAS_PPL, &conf->mddev->flags) &&
+	    test_bit(STRIPE_OP_BIODRAIN, &s->ops_request) &&
+	    !test_bit(STRIPE_FULL_WRITE, &sh->state) &&
+	    test_bit(R5_Insync, &sh->dev[pd_idx].flags))
+		set_bit(STRIPE_OP_PARTIAL_PARITY, &s->ops_request);
+
 	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
 		__func__, (unsigned long long)sh->sector,
 		s->locked, s->ops_request);
@@ -3105,6 +3175,34 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
 		goto overlap;
 
+	if (forwrite && test_bit(MD_HAS_PPL, &conf->mddev->flags)) {
+		/*
+		 * With PPL only writes to consecutive data chunks within a
+		 * stripe are allowed. Not really an overlap, but
+		 * wait_for_overlap can be used to handle this.
+		 */
+		sector_t sector;
+		sector_t first = 0;
+		sector_t last = 0;
+		int count = 0;
+		int i;
+
+		for (i = 0; i < sh->disks; i++) {
+			if (i != sh->pd_idx &&
+			    (i == dd_idx || sh->dev[i].towrite)) {
+				sector = sh->dev[i].sector;
+				if (count == 0 || sector < first)
+					first = sector;
+				if (sector > last)
+					last = sector;
+				count++;
+			}
+		}
+
+		if (first + conf->chunk_sectors * (count - 1) != last)
+			goto overlap;
+	}
+
 	if (!forwrite || previous)
 		clear_bit(STRIPE_BATCH_READY, &sh->state);
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 0f64a58873de..88f1e52d9daf 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -228,6 +228,7 @@ struct stripe_head {
 	struct list_head	log_list;
 	sector_t		log_start; /* first meta block on the journal */
 	struct list_head	r5c; /* for r5c_cache->stripe_in_journal */
+	struct page		*ppl_page; /* partial parity of this stripe */
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -400,6 +401,7 @@ enum {
 	STRIPE_OP_BIODRAIN,
 	STRIPE_OP_RECONSTRUCT,
 	STRIPE_OP_CHECK,
+	STRIPE_OP_PARTIAL_PARITY,
 };
 
 /*
-- 
2.11.0



* [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

This implements the PPL write logging functionality, using the
raid5-cache policy logic introduced in previous patches. The description
of PPL is added to the documentation. More details can be found in the
comments in raid5-ppl.c.

Put the PPL metadata structures in md_p.h because userspace tools
(mdadm) will also need to read and write PPL.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Documentation/admin-guide/md.rst |  53 ++++
 drivers/md/Makefile              |   2 +-
 drivers/md/raid5-cache.c         |  13 +-
 drivers/md/raid5-cache.h         |   8 +
 drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c               |  15 +-
 include/uapi/linux/raid/md_p.h   |  26 ++
 7 files changed, 661 insertions(+), 7 deletions(-)
 create mode 100644 drivers/md/raid5-ppl.c

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index e449fb5f277c..7104ef757e73 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
 The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
 that all auto-detected arrays are assembled as partitionable.
 
+
+.. _dirty_degraded_boot:
+
 Boot time assembly of degraded/dirty arrays
 -------------------------------------------
 
@@ -176,6 +179,56 @@ and its role in the array.
 Once started with RUN_ARRAY, uninitialized spares can be added with
 HOT_ADD_DISK.
 
+.. _ppl:
+
+Partial Parity Log
+------------------
+
+Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
+addressed by PPL is that after a dirty shutdown, parity of a particular stripe
+may become inconsistent with data on other member disks. If the array is also
+in degraded state, there is no way to recalculate parity, because one of the
+disks is missing. This can lead to silent data corruption when rebuilding the
+array or using it is as degraded - data calculated from parity for array blocks
+that have not been touched by a write request during the unclean shutdown can
+be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
+this, md by default does not allow starting a dirty degraded array, see
+:ref:`dirty_degraded_boot`.
+
+Partial parity for a write operation is the XOR of stripe data chunks not
+modified by this write. It is just enough data needed for recovering from the
+write hole. XORing partial parity with the modified chunks produces parity for
+the stripe, consistent with its state before the write operation, regardless of
+which chunk writes have completed. If one of the unmodified data disks of
+this stripe is missing, this updated parity can be used to recover its
+contents. PPL recovery is also performed when starting an array after an
+unclean shutdown if all disks are available, eliminating the need to resync
+the array. Because of this, using a write-intent bitmap and PPL together is
+not supported.
+
+When handling a write request, PPL writes partial parity before new data and
+parity are dispatched to disks. PPL is a distributed log - it is stored on
+array member drives in the metadata area, on the parity drive of a particular
+stripe.  It does not require a dedicated journaling drive. Write performance is
+reduced by up to 30%-40% but it scales with the number of drives in the array
+and the journaling drive does not become a bottleneck or a single point of
+failure.
+
+Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
+not a true journal. It does not protect from losing in-flight data, only from
+silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
+performed for this stripe (parity is not updated). So it is possible to have
+arbitrary data in the written part of a stripe if that disk is lost. In such
+case the behavior is the same as in plain raid5.
+
+PPL is available for md version-1 metadata and external (specifically IMSM)
+metadata arrays. It can be enabled using mdadm option
+``--consistency-policy=ppl``.
+
+Currently, the volatile write-back cache should be disabled on all member
+drives when using PPL. Otherwise PPL cannot guarantee consistency in case of
+a power failure.
+
 
 MD devices in sysfs
 -------------------
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 3cbda1af87a0..4d48714ccc6b 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -18,7 +18,7 @@ dm-cache-cleaner-y += dm-cache-policy-cleaner.o
 dm-era-y	+= dm-era-target.o
 dm-verity-y	+= dm-verity-target.o
 md-mod-y	+= md.o bitmap.o
-raid456-y	+= raid5.o raid5-cache.o
+raid456-y	+= raid5.o raid5-cache.o raid5-ppl.o
 
 # Note: link order is important.  All raid personalities
 # and must come before md.o, as they each initialise 
diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 6fac581804a9..7757c5137300 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -167,8 +167,8 @@ static bool r5l_has_free_space(struct r5l_log *log, sector_t size)
 	return log->device_size > used_size + size;
 }
 
-static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
-				    enum r5l_io_unit_state state)
+void __r5l_set_io_unit_state(struct r5l_io_unit *io,
+			     enum r5l_io_unit_state state)
 {
 	if (WARN_ON(io->state >= state))
 		return;
@@ -385,7 +385,7 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
 	}
 }
 
-static void r5l_io_run_stripes(struct r5l_io_unit *io)
+void r5l_io_run_stripes(struct r5l_io_unit *io)
 {
 	struct stripe_head *sh, *next;
 
@@ -995,7 +995,7 @@ static sector_t r5l_reclaimable_space(struct r5l_log *log)
 				 r5c_calculate_new_cp(conf));
 }
 
-static void r5l_run_no_mem_stripe(struct r5l_log *log)
+void r5l_run_no_mem_stripe(struct r5l_log *log)
 {
 	struct stripe_head *sh;
 
@@ -1421,7 +1421,7 @@ bool r5l_log_disk_error(struct r5conf *conf)
 	if (!log)
 		ret = test_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
 	else
-		ret = test_bit(Faulty, &log->rdev->flags);
+		ret = log->rdev && test_bit(Faulty, &log->rdev->flags);
 	rcu_read_unlock();
 	return ret;
 }
@@ -2784,6 +2784,7 @@ struct r5l_policy r5l_journal = {
 	.handle_flush_request = __r5l_handle_flush_request,
 	.quiesce = __r5l_quiesce,
 };
+extern struct r5l_policy r5l_ppl;
 
 int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 {
@@ -2797,6 +2798,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 
 	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
 		log->policy = &r5l_journal;
+	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		log->policy = &r5l_ppl;
 	} else {
 		kfree(log);
 		return -EINVAL;
diff --git a/drivers/md/raid5-cache.h b/drivers/md/raid5-cache.h
index 97803f3ae0fe..6b622c2742de 100644
--- a/drivers/md/raid5-cache.h
+++ b/drivers/md/raid5-cache.h
@@ -87,6 +87,8 @@ struct r5l_log {
 
 	/* handlers for log operations */
 	struct r5l_policy *policy;
+
+	void *private;
 };
 
 /*
@@ -177,4 +179,10 @@ extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 extern struct md_sysfs_entry r5c_journal_mode;
 extern void r5c_update_on_rdev_error(struct mddev *mddev);
 extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
+
+extern void __r5l_set_io_unit_state(struct r5l_io_unit *io,
+				    enum r5l_io_unit_state state);
+extern void r5l_io_run_stripes(struct r5l_io_unit *io);
+extern void r5l_run_no_mem_stripe(struct r5l_log *log);
+
 #endif /* _RAID5_CACHE_H */
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
new file mode 100644
index 000000000000..6bc246c80f6b
--- /dev/null
+++ b/drivers/md/raid5-ppl.c
@@ -0,0 +1,551 @@
+/*
+ * Partial Parity Log for closing the RAID5 write hole
+ * Copyright (c) 2017, Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms and conditions of the GNU General Public License,
+ * version 2, as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/slab.h>
+#include <linux/crc32c.h>
+#include <linux/raid/md_p.h>
+#include "md.h"
+#include "raid5.h"
+#include "raid5-cache.h"
+
+/*
+ * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
+ * partial parity data. The header contains an array of entries
+ * (struct ppl_header_entry) which describe the logged write requests.
+ * Partial parity for the entries comes after the header, written in the same
+ * sequence as the entries:
+ *
+ * Header
+ *   entry0
+ *   ...
+ *   entryN
+ * PP data
+ *   PP for entry0
+ *   ...
+ *   PP for entryN
+ *
+ * Every entry holds a checksum of its partial parity, the header also has a
+ * checksum of the header itself. Entries for full stripe writes contain no
+ * partial parity; they only mark the stripes for which parity should be
+ * recalculated after an unclean shutdown.
+ *
+ * A write request is always logged to the PPL instance stored on the parity
+ * disk of the corresponding stripe. For each member disk there is one r5l_log
+ * used to handle logging for this disk, independently from others. They are
+ * grouped in child_logs array in struct ppl_conf, which is assigned to a
+ * common parent r5l_log. This parent log serves as a proxy and is used in
+ * raid5 personality code - it is assigned as _the_ log in r5conf->log.
+ *
+ * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
+ * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
+ * be appended to the last entry if the chunks to write are the same, otherwise
+ * a new entry is added. Checksums of entries are calculated incrementally as
+ * stripes containing partial parity are being added to entries.
+ * ppl_submit_iounit() calculates the checksum of the header and submits a bio
+ * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
+ * for all stripes of the io_unit. When the PPL write completes, the stripes
+ * associated with the io_unit are released and raid5d starts writing their data
+ * and parity. When all stripes are written, the io_unit is freed and the next
+ * can be submitted.
+ *
+ * An io_unit is used to gather stripes until it is submitted or becomes full
+ * (if the maximum number of entries or size of PPL is reached). Another io_unit
+ * can't be submitted until the previous has completed (PPL and stripe
+ * data+parity is written). The log->running_ios list tracks all io_units of
+ * a log (for a single member disk). New io_units are added to the end of the
+ * list and the first io_unit is submitted, if it is not submitted already.
+ * The current io_unit accepting new stripes is always the last on the list.
+ */
+
+struct ppl_conf {
+	struct mddev *mddev;
+
+	/* the log assigned to r5conf->log */
+	struct r5l_log *parent_log;
+
+	/* array of child logs, one for each raid disk */
+	struct r5l_log *child_logs;
+	int count;
+
+	/* the logical block size used for data_sector in ppl_header_entry */
+	int block_size;
+};
+
+static struct r5l_io_unit *ppl_new_iounit(struct r5l_log *log,
+					  struct stripe_head *sh)
+{
+	struct r5l_io_unit *io;
+	struct ppl_header *pplhdr;
+	struct ppl_conf *ppl_conf = log->private;
+	struct r5l_log *parent_log = ppl_conf->parent_log;
+
+	io = mempool_alloc(log->io_pool, GFP_ATOMIC);
+	if (!io)
+		return NULL;
+
+	memset(io, 0, sizeof(*io));
+	io->log = log;
+	INIT_LIST_HEAD(&io->log_sibling);
+	INIT_LIST_HEAD(&io->stripe_list);
+	io->state = IO_UNIT_RUNNING;
+
+	io->meta_page = mempool_alloc(log->meta_pool, GFP_NOIO);
+	pplhdr = page_address(io->meta_page);
+	clear_page(pplhdr);
+	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
+	pplhdr->signature = cpu_to_le32(log->uuid_checksum);
+
+	spin_lock(&parent_log->io_list_lock);
+	io->seq = ++parent_log->seq;
+	spin_unlock(&parent_log->io_list_lock);
+	pplhdr->generation = cpu_to_le64(io->seq);
+
+	return io;
+}
+
+static int ppl_log_stripe(struct r5l_log *log, struct stripe_head *sh)
+{
+	struct r5l_io_unit *io = NULL;
+	struct ppl_header *pplhdr;
+	struct ppl_header_entry *e = NULL;
+	int i;
+	sector_t data_sector = 0;
+	int data_disks = 0;
+	unsigned int entries_count;
+	unsigned int entry_space = (log->rdev->ppl.size << 9) - PPL_HEADER_SIZE;
+	struct r5conf *conf = sh->raid_conf;
+
+	pr_debug("%s: stripe: %llu\n", __func__, (unsigned long long)sh->sector);
+
+	if (log->current_io) {
+		io = log->current_io;
+		pplhdr = page_address(io->meta_page);
+		entries_count = le32_to_cpu(pplhdr->entries_count);
+
+		/* check if current io_unit is full */
+		if (io->meta_offset >= entry_space ||
+		    entries_count == PPL_HDR_MAX_ENTRIES) {
+			pr_debug("%s: add io_unit blocked by seq: %llu\n",
+				 __func__, io->seq);
+			io = NULL;
+		}
+	}
+
+	/* add a new unit if there is none or the current is full */
+	if (!io) {
+		io = ppl_new_iounit(log, sh);
+		if (!io)
+			return -ENOMEM;
+		spin_lock_irq(&log->io_list_lock);
+		list_add_tail(&io->log_sibling, &log->running_ios);
+		spin_unlock_irq(&log->io_list_lock);
+
+		log->current_io = io;
+		pplhdr = page_address(io->meta_page);
+		entries_count = 0;
+	}
+
+	for (i = 0; i < sh->disks; i++) {
+		struct r5dev *dev = &sh->dev[i];
+		if (i != sh->pd_idx && test_bit(R5_Wantwrite, &dev->flags)) {
+			if (!data_disks || dev->sector < data_sector)
+				data_sector = dev->sector;
+			data_disks++;
+		}
+	}
+	BUG_ON(!data_disks);
+
+	pr_debug("%s: seq: %llu data_sector: %llu data_disks: %d\n", __func__,
+		 io->seq, (unsigned long long)data_sector, data_disks);
+
+	if (entries_count > 0) {
+		struct ppl_header_entry *prev =
+				&pplhdr->entries[entries_count - 1];
+		u64 data_sector_prev = le64_to_cpu(prev->data_sector);
+		u32 data_size_prev = le32_to_cpu(prev->data_size);
+		u32 pp_size_prev = le32_to_cpu(prev->pp_size);
+
+		/*
+		 * Check if we can merge with the previous entry. Must be on
+		 * the same stripe and disks. Use bit shift and logarithm
+		 * to avoid 64-bit division.
+		 */
+		if ((data_sector >> ilog2(conf->chunk_sectors) ==
+		     data_sector_prev >> ilog2(conf->chunk_sectors)) &&
+		    ((pp_size_prev == 0 &&
+		      test_bit(STRIPE_FULL_WRITE, &sh->state)) ||
+		     ((data_sector_prev + (pp_size_prev >> 9) == data_sector) &&
+		      (data_size_prev == pp_size_prev * data_disks))))
+			e = prev;
+	}
+
+	if (!e) {
+		e = &pplhdr->entries[entries_count++];
+		pplhdr->entries_count = cpu_to_le32(entries_count);
+		e->data_sector = cpu_to_le64(data_sector);
+		e->parity_disk = cpu_to_le32(sh->pd_idx);
+		e->checksum = cpu_to_le32(~0);
+	}
+
+	le32_add_cpu(&e->data_size, data_disks << PAGE_SHIFT);
+
+	/* don't write any PP if full stripe write */
+	if (!test_bit(STRIPE_FULL_WRITE, &sh->state)) {
+		le32_add_cpu(&e->pp_size, PAGE_SIZE);
+		io->meta_offset += PAGE_SIZE;
+		e->checksum = cpu_to_le32(crc32c_le(le32_to_cpu(e->checksum),
+						    page_address(sh->ppl_page),
+						    PAGE_SIZE));
+	}
+
+	list_add_tail(&sh->log_list, &io->stripe_list);
+	atomic_inc(&io->pending_stripe);
+	sh->log_io = io;
+
+	return 0;
+}
+
+static int ppl_write_stripe(struct r5l_log *log, struct stripe_head *sh)
+{
+	struct r5l_io_unit *io = sh->log_io;
+
+	if (io || test_bit(STRIPE_SYNCING, &sh->state) ||
+	    !test_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags) ||
+	    !test_bit(R5_Insync, &sh->dev[sh->pd_idx].flags)) {
+		clear_bit(STRIPE_LOG_TRAPPED, &sh->state);
+		return -EAGAIN;
+	}
+
+	mutex_lock(&log->io_mutex);
+
+	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
+		mutex_unlock(&log->io_mutex);
+		return -EAGAIN;
+	}
+
+	set_bit(STRIPE_LOG_TRAPPED, &sh->state);
+	clear_bit(STRIPE_DELAYED, &sh->state);
+	atomic_inc(&sh->count);
+
+	if (ppl_log_stripe(log, sh)) {
+		spin_lock_irq(&log->io_list_lock);
+		list_add_tail(&sh->log_list, &log->no_mem_stripes);
+		spin_unlock_irq(&log->io_list_lock);
+	}
+
+	mutex_unlock(&log->io_mutex);
+
+	return 0;
+}
+
+static void ppl_log_endio(struct bio *bio)
+{
+	struct r5l_io_unit *io = bio->bi_private;
+	struct r5l_log *log = io->log;
+	struct ppl_conf *ppl_conf = log->private;
+	unsigned long flags;
+
+	pr_debug("%s: seq: %llu\n", __func__, io->seq);
+
+	if (bio->bi_error)
+		md_error(ppl_conf->mddev, log->rdev);
+
+	bio_put(bio);
+	mempool_free(io->meta_page, log->meta_pool);
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+	__r5l_set_io_unit_state(io, IO_UNIT_IO_END);
+	r5l_io_run_stripes(io);
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+}
+
+static void ppl_submit_iounit(struct r5l_io_unit *io)
+{
+	struct r5l_log *log = io->log;
+	struct ppl_conf *ppl_conf = log->private;
+	struct r5conf *conf = ppl_conf->mddev->private;
+	struct ppl_header *pplhdr = page_address(io->meta_page);
+	struct bio *bio;
+	struct stripe_head *sh;
+	int i;
+	struct bio_list bios = BIO_EMPTY_LIST;
+	char b[BDEVNAME_SIZE];
+
+	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, log->bs);
+	bio->bi_private = io;
+	bio->bi_end_io = ppl_log_endio;
+	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
+	bio->bi_bdev = log->rdev->bdev;
+	bio->bi_iter.bi_sector = log->rdev->ppl.sector;
+	bio_add_page(bio, io->meta_page, PAGE_SIZE, 0);
+	bio_list_add(&bios, bio);
+
+	sh = list_first_entry(&io->stripe_list, struct stripe_head, log_list);
+
+	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
+		struct ppl_header_entry *e = &pplhdr->entries[i];
+		u32 pp_size = le32_to_cpu(e->pp_size);
+		u32 data_size = le32_to_cpu(e->data_size);
+		u64 data_sector = le64_to_cpu(e->data_sector);
+		int stripes_count;
+
+		if (pp_size > 0)
+			stripes_count = pp_size >> PAGE_SHIFT;
+		else
+			stripes_count = (data_size /
+					 (conf->raid_disks -
+					  conf->max_degraded)) >> PAGE_SHIFT;
+
+		while (stripes_count--) {
+			/*
+			 * if the entry has no partial parity, skip its
+			 * stripes without adding pages to the bio
+			 */
+			if (pp_size > 0 &&
+			    !bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0)) {
+				struct bio *prev = bio;
+
+				bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES,
+						       log->bs);
+				bio->bi_opf = prev->bi_opf;
+				bio->bi_bdev = prev->bi_bdev;
+				bio->bi_iter.bi_sector = bio_end_sector(prev);
+				bio_add_page(bio, sh->ppl_page, PAGE_SIZE, 0);
+				bio_chain(bio, prev);
+				bio_list_add(&bios, bio);
+			}
+			sh = list_next_entry(sh, log_list);
+		}
+
+		pr_debug("%s: seq: %llu entry: %d data_sector: %llu pp_size: %u data_size: %u\n",
+			 __func__, io->seq, i, data_sector, pp_size, data_size);
+
+		e->data_sector = cpu_to_le64(data_sector >>
+					     ilog2(ppl_conf->block_size >> 9));
+		e->checksum = cpu_to_le32(~le32_to_cpu(e->checksum));
+	}
+
+	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PPL_HEADER_SIZE));
+
+	while ((bio = bio_list_pop(&bios))) {
+		pr_debug("%s: seq: %llu submit_bio() size: %u sector: %llu dev: %s\n",
+			 __func__, io->seq, bio->bi_iter.bi_size,
+			 (unsigned long long)bio->bi_iter.bi_sector,
+			 bdevname(bio->bi_bdev, b));
+		submit_bio(bio);
+	}
+}
+
+static void ppl_submit_current_io(struct r5l_log *log)
+{
+	struct r5l_io_unit *io;
+
+	spin_lock_irq(&log->io_list_lock);
+
+	io = list_first_entry_or_null(&log->running_ios,
+				      struct r5l_io_unit, log_sibling);
+
+	if (io && io->state == IO_UNIT_RUNNING) {
+		__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+		spin_unlock_irq(&log->io_list_lock);
+
+		if (io == log->current_io)
+			log->current_io = NULL;
+
+		ppl_submit_iounit(io);
+		return;
+	}
+
+	spin_unlock_irq(&log->io_list_lock);
+}
+
+static void ppl_write_stripe_run(struct r5l_log *log)
+{
+	mutex_lock(&log->io_mutex);
+	ppl_submit_current_io(log);
+	mutex_unlock(&log->io_mutex);
+}
+
+static void __ppl_stripe_write_finished(struct r5l_io_unit *io)
+{
+	struct r5l_log *log = io->log;
+	unsigned long flags;
+
+	pr_debug("%s: seq: %llu\n", __func__, io->seq);
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+
+	list_del(&io->log_sibling);
+	mempool_free(io, log->io_pool);
+	r5l_run_no_mem_stripe(log);
+
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+}
+
+static void __ppl_exit_log(struct r5l_log *log)
+{
+	struct ppl_conf *ppl_conf = log->private;
+
+	kfree(ppl_conf->child_logs);
+	kfree(ppl_conf);
+
+	mempool_destroy(log->meta_pool);
+	if (log->bs)
+		bioset_free(log->bs);
+	mempool_destroy(log->io_pool);
+	kmem_cache_destroy(log->io_kc);
+}
+
+static int __ppl_init_log(struct r5l_log *log, struct r5conf *conf)
+{
+	struct ppl_conf *ppl_conf;
+	struct mddev *mddev = conf->mddev;
+	int ret = 0;
+	int i;
+
+	if (PAGE_SIZE != 4096)
+		return -EINVAL;
+
+	if (mddev->bitmap) {
+		pr_warn("md/raid:%s PPL is not compatible with bitmap.\n",
+			mdname(mddev));
+		return -EINVAL;
+	}
+
+	ppl_conf = kzalloc(sizeof(struct ppl_conf), GFP_KERNEL);
+	if (!ppl_conf)
+		return -ENOMEM;
+	log->private = ppl_conf;
+
+	spin_lock_init(&log->io_list_lock);
+
+	log->io_kc = KMEM_CACHE(r5l_io_unit, 0);
+	if (!log->io_kc) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	log->io_pool = mempool_create_slab_pool(conf->raid_disks, log->io_kc);
+	if (!log->io_pool) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	log->bs = bioset_create(conf->raid_disks, 0);
+	if (!log->bs) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	log->meta_pool = mempool_create_page_pool(conf->raid_disks, 0);
+	if (!log->meta_pool) {
+		ret = -EINVAL;
+		goto err;
+	}
+
+	ppl_conf->parent_log = log;
+	ppl_conf->mddev = mddev;
+	ppl_conf->count = conf->raid_disks;
+	ppl_conf->child_logs = kcalloc(ppl_conf->count, sizeof(struct r5l_log),
+				       GFP_KERNEL);
+	if (!ppl_conf->child_logs) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+	if (!mddev->external) {
+		log->uuid_checksum = ~crc32c_le(~0, mddev->uuid,
+						sizeof(mddev->uuid));
+		ppl_conf->block_size = 512;
+	} else {
+		ppl_conf->block_size = queue_logical_block_size(mddev->queue);
+	}
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		struct r5l_log *log_child = &ppl_conf->child_logs[i];
+		struct md_rdev *rdev = conf->disks[i].rdev;
+
+		mutex_init(&log_child->io_mutex);
+		spin_lock_init(&log_child->io_list_lock);
+		INIT_LIST_HEAD(&log_child->running_ios);
+		INIT_LIST_HEAD(&log_child->no_mem_stripes);
+
+		log_child->rdev = rdev;
+		log_child->private = log->private;
+		log_child->io_kc = log->io_kc;
+		log_child->io_pool = log->io_pool;
+		log_child->bs = log->bs;
+		log_child->meta_pool = log->meta_pool;
+		log_child->uuid_checksum = log->uuid_checksum;
+		log_child->policy = log->policy;
+
+		if (rdev) {
+			struct request_queue *q = bdev_get_queue(rdev->bdev);
+
+			if (test_bit(QUEUE_FLAG_WC, &q->queue_flags))
+				log->need_cache_flush = true;
+
+			if (rdev->ppl.size < (PPL_HEADER_SIZE +
+					      STRIPE_SIZE) >> 9) {
+				char b[BDEVNAME_SIZE];
+				pr_warn("md/raid:%s: PPL space too small on %s.\n",
+					mdname(mddev), bdevname(rdev->bdev, b));
+				ret = -ENOSPC;
+			}
+		}
+	}
+
+	if (ret)
+		goto err;
+
+	if (log->need_cache_flush)
+		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
+			mdname(mddev));
+
+	conf->log = log;
+
+	return 0;
+err:
+	__ppl_exit_log(log);
+	return ret;
+}
+
+static int __ppl_write_stripe(struct r5l_log *log, struct stripe_head *sh)
+{
+	struct ppl_conf *ppl_conf = log->private;
+
+	return ppl_write_stripe(&ppl_conf->child_logs[sh->pd_idx], sh);
+}
+
+static void __ppl_write_stripe_run(struct r5l_log *log)
+{
+	struct ppl_conf *ppl_conf = log->private;
+	int i;
+
+	for (i = 0; i < ppl_conf->count; i++)
+		ppl_write_stripe_run(&ppl_conf->child_logs[i]);
+}
+
+struct r5l_policy r5l_ppl = {
+	.init_log = __ppl_init_log,
+	.exit_log = __ppl_exit_log,
+	.write_stripe = __ppl_write_stripe,
+	.write_stripe_run = __ppl_write_stripe_run,
+	.flush_stripe_to_raid = NULL,
+	.stripe_write_finished = __ppl_stripe_write_finished,
+	.handle_flush_request = NULL,
+	.quiesce = NULL,
+};
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e1e238da32ba..ffa19bd4ba6e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -926,7 +926,8 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	might_sleep();
 
-	if (!test_bit(STRIPE_R5C_CACHING, &sh->state)) {
+	if (!test_bit(STRIPE_R5C_CACHING, &sh->state) ||
+	    test_bit(MD_HAS_PPL, &conf->mddev->flags)) {
 		/* writing out phase */
 		if (s->waiting_extra_page)
 			return;
@@ -7171,6 +7172,13 @@ static int raid5_run(struct mddev *mddev)
 		BUG_ON(mddev->delta_disks != 0);
 	}
 
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
+	    test_bit(MD_HAS_PPL, &mddev->flags)) {
+		pr_warn("md/raid:%s: using journal device and PPL not allowed - disabling PPL\n",
+			mdname(mddev));
+		clear_bit(MD_HAS_PPL, &mddev->flags);
+	}
+
 	if (mddev->private == NULL)
 		conf = setup_conf(mddev);
 	else
@@ -7397,6 +7405,11 @@ static int raid5_run(struct mddev *mddev)
 			 mdname(mddev), bdevname(journal_dev->bdev, b));
 		if (r5l_init_log(conf, journal_dev))
 			goto abort;
+	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		pr_debug("md/raid:%s: enabling distributed Partial Parity Log\n",
+			 mdname(mddev));
+		if (r5l_init_log(conf, NULL))
+			goto abort;
 	}
 
 	return 0;
diff --git a/include/uapi/linux/raid/md_p.h b/include/uapi/linux/raid/md_p.h
index fe2112810c43..2c28711cc5f1 100644
--- a/include/uapi/linux/raid/md_p.h
+++ b/include/uapi/linux/raid/md_p.h
@@ -398,4 +398,30 @@ struct r5l_meta_block {
 
 #define R5LOG_VERSION 0x1
 #define R5LOG_MAGIC 0x6433c509
+
+struct ppl_header_entry {
+	__le64 data_sector;	/* Raid sector of the new data */
+	__le32 pp_size;		/* Length of partial parity */
+	__le32 data_size;	/* Length of data */
+	__le32 parity_disk;	/* Member disk containing parity */
+	__le32 checksum;	/* Checksum of this entry */
+} __attribute__ ((__packed__));
+
+#define PPL_HEADER_SIZE 4096
+#define PPL_HDR_RESERVED 512
+#define PPL_HDR_ENTRY_SPACE \
+	(PPL_HEADER_SIZE - PPL_HDR_RESERVED - 4 * sizeof(u32) - sizeof(u64))
+#define PPL_HDR_MAX_ENTRIES \
+	(PPL_HDR_ENTRY_SPACE / sizeof(struct ppl_header_entry))
+
+struct ppl_header {
+	__u8 reserved[PPL_HDR_RESERVED];/* Reserved space */
+	__le32 signature;		/* Signature (family number of volume) */
+	__le32 padding;
+	__le64 generation;		/* Generation number of PP Header */
+	__le32 entries_count;		/* Number of entries in entry array */
+	__le32 checksum;		/* Checksum of PP Header */
+	struct ppl_header_entry entries[PPL_HDR_MAX_ENTRIES];
+} __attribute__ ((__packed__));
+
 #endif
-- 
2.11.0



* [PATCH v3 6/9] md: add sysfs entries for PPL
@ 2017-01-30 18:59 ` Artur Paszkiewicz
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

Add a 'consistency_policy' attribute for the array. It indicates how the
array maintains consistency in case of unexpected shutdown.

Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
and size of the PPL space on the device.

These attributes are writable to allow enabling PPL for external
metadata arrays and (later) enabling/disabling PPL for a running array.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 Documentation/admin-guide/md.rst |  32 ++++++++++-
 drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 144 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
index 7104ef757e73..2c153b3d798f 100644
--- a/Documentation/admin-guide/md.rst
+++ b/Documentation/admin-guide/md.rst
@@ -329,14 +329,14 @@ All md devices contain:
      array creation it will default to 0, though starting the array as
      ``clean`` will set it much larger.
 
-   new_dev
+  new_dev
      This file can be written but not read.  The value written should
      be a block device number as major:minor.  e.g. 8:0
      This will cause that device to be attached to the array, if it is
      available.  It will then appear at md/dev-XXX (depending on the
      name of the device) and further configuration is then possible.
 
-   safe_mode_delay
+  safe_mode_delay
      When an md array has seen no write requests for a certain period
      of time, it will be marked as ``clean``.  When another write
      request arrives, the array is marked as ``dirty`` before the write
@@ -345,7 +345,7 @@ All md devices contain:
      period as a number of seconds.  The default is 200msec (0.200).
      Writing a value of 0 disables safemode.
 
-   array_state
+  array_state
      This file contains a single word which describes the current
      state of the array.  In many cases, the state can be set by
      writing the word for the desired state, however some states
@@ -454,7 +454,30 @@ All md devices contain:
      once the array becomes non-degraded, and this fact has been
      recorded in the metadata.
 
+  consistency_policy
+     This indicates how the array maintains consistency in case of unexpected
+     shutdown. It can be:
 
+     none
+       Array has no redundancy information, e.g. raid0, linear.
+
+     resync
+       Full resync is performed and all redundancy is regenerated when the
+       array is started after unclean shutdown.
+
+     bitmap
+       Resync assisted by a write-intent bitmap.
+
+     journal
+       For raid4/5/6, a journal device is used to log transactions and
+       replay them after an unclean shutdown.
+
+     ppl
+       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
+       resync.
+
+     The accepted values when writing to this file are ``ppl`` and ``resync``,
+     used to enable and disable PPL, respectively.
 
 
 As component devices are added to an md array, they appear in the ``md``
@@ -616,6 +639,9 @@ Each directory contains:
 	adds bad blocks without acknowledging them. This is largely
 	for testing.
 
+      ppl_sector, ppl_size
+        Location and size (in sectors) of the space used for the Partial
+        Parity Log on this device.
 
 
 An active md device will also contain an entry for each active device
diff --git a/drivers/md/md.c b/drivers/md/md.c
index e96f73572e23..47ab3afe87cb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
 static struct rdev_sysfs_entry rdev_unack_bad_blocks =
 __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
 
+static ssize_t
+ppl_sector_show(struct md_rdev *rdev, char *page)
+{
+	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
+}
+
+static ssize_t
+ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
+{
+	unsigned long long sector;
+
+	if (kstrtoull(buf, 10, &sector) < 0)
+		return -EINVAL;
+	if (sector != (sector_t)sector)
+		return -EINVAL;
+
+	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
+	    rdev->raid_disk >= 0)
+		return -EBUSY;
+
+	if (rdev->mddev->persistent) {
+		if (rdev->mddev->major_version == 0)
+			return -EINVAL;
+		if ((sector > rdev->sb_start &&
+		     sector - rdev->sb_start > S16_MAX) ||
+		    (sector < rdev->sb_start &&
+		     rdev->sb_start - sector > -S16_MIN))
+			return -EINVAL;
+		rdev->ppl.offset = sector - rdev->sb_start;
+	} else if (!rdev->mddev->external) {
+		return -EBUSY;
+	}
+	rdev->ppl.sector = sector;
+	return len;
+}
+
+static struct rdev_sysfs_entry rdev_ppl_sector =
+__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
+
+static ssize_t
+ppl_size_show(struct md_rdev *rdev, char *page)
+{
+	return sprintf(page, "%u\n", rdev->ppl.size);
+}
+
+static ssize_t
+ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
+{
+	unsigned int size;
+
+	if (kstrtouint(buf, 10, &size) < 0)
+		return -EINVAL;
+
+	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
+	    rdev->raid_disk >= 0)
+		return -EBUSY;
+
+	if (rdev->mddev->persistent) {
+		if (rdev->mddev->major_version == 0)
+			return -EINVAL;
+		if (size > U16_MAX)
+			return -EINVAL;
+	} else if (!rdev->mddev->external) {
+		return -EBUSY;
+	}
+	rdev->ppl.size = size;
+	return len;
+}
+
+static struct rdev_sysfs_entry rdev_ppl_size =
+__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
+
 static struct attribute *rdev_default_attrs[] = {
 	&rdev_state.attr,
 	&rdev_errors.attr,
@@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
 	&rdev_recovery_start.attr,
 	&rdev_bad_blocks.attr,
 	&rdev_unack_bad_blocks.attr,
+	&rdev_ppl_sector.attr,
+	&rdev_ppl_size.attr,
 	NULL,
 };
 static ssize_t
@@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
 __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
        array_size_store);
 
+static ssize_t
+consistency_policy_show(struct mddev *mddev, char *page)
+{
+	int ret;
+
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		ret = sprintf(page, "journal\n");
+	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		ret = sprintf(page, "ppl\n");
+	} else if (mddev->bitmap) {
+		ret = sprintf(page, "bitmap\n");
+	} else if (mddev->pers) {
+		if (mddev->pers->sync_request)
+			ret = sprintf(page, "resync\n");
+		else
+			ret = sprintf(page, "none\n");
+	} else {
+		ret = sprintf(page, "unknown\n");
+	}
+
+	return ret;
+}
+
+static ssize_t
+consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
+{
+	if (mddev->pers) {
+		return -EBUSY;
+	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
+		set_bit(MD_HAS_PPL, &mddev->flags);
+		return len;
+	} else {
+		return -EINVAL;
+	}
+}
+
+static struct md_sysfs_entry md_consistency_policy =
+__ATTR(consistency_policy, S_IRUGO | S_IWUSR, consistency_policy_show,
+       consistency_policy_store);
+
 static struct attribute *md_default_attrs[] = {
 	&md_level.attr,
 	&md_layout.attr,
@@ -4966,6 +5080,7 @@ static struct attribute *md_default_attrs[] = {
 	&md_reshape_direction.attr,
 	&md_array_size.attr,
 	&max_corr_read_errors.attr,
+	&md_consistency_policy.attr,
 	NULL,
 };
 
-- 
2.11.0



* [PATCH v3 7/9] raid5-ppl: load and recover the log
@ 2017-01-30 18:59 ` Artur Paszkiewicz
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

Load the log from each disk when starting the array and recover if the
array is dirty.

The initial empty PPL is written by mdadm. When loading the log we
verify the header checksum and signature. For external metadata arrays
the signature is verified in userspace, so here we read it from the
header, verifying only that it matches on all disks, and use it later when
writing PPL.

In addition to the header checksum, each header entry also contains a
checksum of its partial parity data. If the header is valid, recovery is
performed for each entry until an invalid entry is found. If the array
is not degraded and recovery using PPL fully succeeds, there is no need
to resync the array because data and parity will be consistent, so in
this case resync will be disabled.

For compatibility with IMSM implementations on other systems, we
can't assume that the recovery data block size is always 4K. Writes
generated by MD raid5 don't have this issue, but when recovering PPL
written in other environments it is possible to have entries with
512-byte sector granularity. The recovery code takes this into account,
along with the logical sector size of the underlying drives.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-ppl.c | 485 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c     |   5 +-
 2 files changed, 489 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 6bc246c80f6b..b21b3bfa8f36 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -16,6 +16,7 @@
 #include <linux/blkdev.h>
 #include <linux/slab.h>
 #include <linux/crc32c.h>
+#include <linux/async_tx.h>
 #include <linux/raid/md_p.h>
 #include "md.h"
 #include "raid5.h"
@@ -82,6 +83,10 @@ struct ppl_conf {
 
 	/* the logical block size used for data_sector in ppl_header_entry */
 	int block_size;
+
+	/* used only for recovery */
+	int recovered_entries;
+	int mismatch_count;
 };
 
 static struct r5l_io_unit *ppl_new_iounit(struct r5l_log *log,
@@ -395,6 +400,469 @@ static void __ppl_stripe_write_finished(struct r5l_io_unit *io)
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
 }
 
+static void ppl_xor(int size, struct page *page1, struct page *page2,
+		    struct page *page_result)
+{
+	struct async_submit_ctl submit;
+	struct dma_async_tx_descriptor *tx;
+	struct page *xor_srcs[] = { page1, page2 };
+
+	init_async_submit(&submit, ASYNC_TX_ACK|ASYNC_TX_XOR_DROP_DST,
+			  NULL, NULL, NULL, NULL);
+	tx = async_xor(page_result, xor_srcs, 0, 2, size, &submit);
+
+	async_tx_quiesce(&tx);
+}
+
+/*
+ * PPL recovery strategy: xor partial parity and data from all modified data
+ * disks within a stripe and write the result as the new stripe parity. If all
+ * stripe data disks are modified (full stripe write), no partial parity is
+ * available, so just xor the data disks.
+ *
+ * Recovery of a PPL entry shall occur only if all modified data disks are
+ * available and read from all of them succeeds.
+ *
+ * A PPL entry applies to a stripe, partial parity size for an entry is at most
+ * the size of the chunk. Examples of possible cases for a single entry:
+ *
+ * case 0: single data disk write:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
+ * | ------ | -data- | ------ | | pp | -> | data1 ^ pp         |
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = data_size
+ *
+ * case 1: more than one data disk write:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
+ * | -data- | -data- | ------ | | pp | -> | data0 ^ data1 ^ pp |
+ * | ------ | ------ | ------ | +----+    | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = data_size / modified_data_disks
+ *
+ * case 2: write to all data disks (also full stripe write):
+ *   data0    data1    data2                parity
+ * +--------+--------+--------+           +--------------------+
+ * | ------ | ------ | ------ |           | (no change)        |
+ * | -data- | -data- | -data- | --------> | xor all data       |
+ * | ------ | ------ | ------ | --------> | (no change)        |
+ * | ------ | ------ | ------ |           | (no change)        |
+ * +--------+--------+--------+           +--------------------+
+ * pp_size = 0
+ *
+ * The following cases are possible only in other implementations. The recovery
+ * code can handle them, but they are not generated at runtime because they can
+ * be reduced to cases 0, 1 and 2:
+ *
+ * case 3:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+ +----+    +--------------------+
+ * | ------ | -data- | -data- | | pp |    | data1 ^ data2 ^ pp |
+ * | ------ | -data- | -data- | | pp | -> | data1 ^ data2 ^ pp |
+ * | -data- | -data- | -data- | | -- | -> | xor all data       |
+ * | -data- | -data- | ------ | | pp |    | data0 ^ data1 ^ pp |
+ * +--------+--------+--------+ +----+    +--------------------+
+ * pp_size = chunk_size
+ *
+ * case 4:
+ *   data0    data1    data2     ppl        parity
+ * +--------+--------+--------+ +----+    +--------------------+
+ * | ------ | -data- | ------ | | pp |    | data1 ^ pp         |
+ * | ------ | ------ | ------ | | -- | -> | (no change)        |
+ * | ------ | ------ | ------ | | -- | -> | (no change)        |
+ * | -data- | ------ | ------ | | pp |    | data0 ^ pp         |
+ * +--------+--------+--------+ +----+    +--------------------+
+ * pp_size = chunk_size
+ */
+static int ppl_recover_entry(struct r5l_log *log, struct ppl_header_entry *e,
+			     sector_t ppl_sector)
+{
+	struct ppl_conf *ppl_conf = log->private;
+	struct mddev *mddev = ppl_conf->mddev;
+	struct r5conf *conf = mddev->private;
+	int block_size = ppl_conf->block_size;
+	struct page *pages;
+	struct page *page1;
+	struct page *page2;
+	sector_t r_sector_first;
+	sector_t r_sector_last;
+	int strip_sectors;
+	int data_disks;
+	int i;
+	int ret = 0;
+	char b[BDEVNAME_SIZE];
+	unsigned int pp_size = le32_to_cpu(e->pp_size);
+	unsigned int data_size = le32_to_cpu(e->data_size);
+
+	r_sector_first = le64_to_cpu(e->data_sector) * (block_size >> 9);
+
+	if ((pp_size >> 9) < conf->chunk_sectors) {
+		if (pp_size > 0) {
+			data_disks = data_size / pp_size;
+			strip_sectors = pp_size >> 9;
+		} else {
+			data_disks = conf->raid_disks - conf->max_degraded;
+			strip_sectors = (data_size >> 9) / data_disks;
+		}
+		r_sector_last = r_sector_first +
+				(data_disks - 1) * conf->chunk_sectors +
+				strip_sectors;
+	} else {
+		data_disks = conf->raid_disks - conf->max_degraded;
+		strip_sectors = conf->chunk_sectors;
+		r_sector_last = r_sector_first + (data_size >> 9);
+	}
+
+	pages = alloc_pages(GFP_KERNEL, 1);
+	if (!pages)
+		return -ENOMEM;
+	page1 = pages;
+	page2 = pages + 1;
+
+	pr_debug("%s: array sector first: %llu last: %llu\n", __func__,
+		 (unsigned long long)r_sector_first,
+		 (unsigned long long)r_sector_last);
+
+	/* if start and end is 4k aligned, use a 4k block */
+	if (block_size == 512 &&
+	    (r_sector_first & (STRIPE_SECTORS - 1)) == 0 &&
+	    (r_sector_last & (STRIPE_SECTORS - 1)) == 0)
+		block_size = STRIPE_SIZE;
+
+	/* iterate through blocks in strip */
+	for (i = 0; i < strip_sectors; i += (block_size >> 9)) {
+		bool update_parity = false;
+		sector_t parity_sector;
+		struct md_rdev *parity_rdev;
+		struct stripe_head sh;
+		int disk;
+		int indent = 0;
+
+		pr_debug("%s:%*s iter %d start\n", __func__, indent, "", i);
+		indent += 2;
+
+		memset(page_address(page1), 0, PAGE_SIZE);
+
+		/* iterate through data member disks */
+		for (disk = 0; disk < data_disks; disk++) {
+			int dd_idx;
+			struct md_rdev *rdev;
+			sector_t sector;
+			sector_t r_sector = r_sector_first + i +
+					    (disk * conf->chunk_sectors);
+
+			pr_debug("%s:%*s data member disk %d start\n",
+				 __func__, indent, "", disk);
+			indent += 2;
+
+			if (r_sector >= r_sector_last) {
+				pr_debug("%s:%*s array sector %llu doesn't need parity update\n",
+					 __func__, indent, "",
+					 (unsigned long long)r_sector);
+				indent -= 2;
+				continue;
+			}
+
+			update_parity = true;
+
+			/* map raid sector to member disk */
+			sector = raid5_compute_sector(conf, r_sector, 0,
+						      &dd_idx, NULL);
+			pr_debug("%s:%*s processing array sector %llu => data member disk %d, sector %llu\n",
+				 __func__, indent, "",
+				 (unsigned long long)r_sector, dd_idx,
+				 (unsigned long long)sector);
+
+			rdev = conf->disks[dd_idx].rdev;
+			if (!rdev) {
+				pr_debug("%s:%*s data member disk %d missing\n",
+					 __func__, indent, "", dd_idx);
+				update_parity = false;
+				break;
+			}
+
+			pr_debug("%s:%*s reading data member disk %s sector %llu\n",
+				 __func__, indent, "", bdevname(rdev->bdev, b),
+				 (unsigned long long)sector);
+			if (!sync_page_io(rdev, sector, block_size, page2,
+					REQ_OP_READ, 0, false)) {
+				md_error(mddev, rdev);
+				pr_debug("%s:%*s read failed!\n", __func__,
+					 indent, "");
+				ret = -EIO;
+				goto out;
+			}
+
+			ppl_xor(block_size, page1, page2, page1);
+
+			indent -= 2;
+		}
+
+		if (!update_parity)
+			continue;
+
+		if (pp_size > 0) {
+			pr_debug("%s:%*s reading pp disk sector %llu\n",
+				 __func__, indent, "",
+				 (unsigned long long)(ppl_sector + i));
+			if (!sync_page_io(log->rdev,
+					ppl_sector - log->rdev->data_offset + i,
+					block_size, page2, REQ_OP_READ, 0,
+					false)) {
+				pr_debug("%s:%*s read failed!\n", __func__,
+					 indent, "");
+				md_error(mddev, log->rdev);
+				ret = -EIO;
+				goto out;
+			}
+
+			ppl_xor(block_size, page1, page2, page1);
+		}
+
+		/* map raid sector to parity disk */
+		parity_sector = raid5_compute_sector(conf, r_sector_first + i,
+				0, &disk, &sh);
+		BUG_ON(sh.pd_idx != le32_to_cpu(e->parity_disk));
+		parity_rdev = conf->disks[sh.pd_idx].rdev;
+
+		BUG_ON(parity_rdev->bdev->bd_dev != log->rdev->bdev->bd_dev);
+		pr_debug("%s:%*s write parity at sector %llu, disk %s\n",
+			 __func__, indent, "",
+			 (unsigned long long)parity_sector,
+			 bdevname(parity_rdev->bdev, b));
+		if (!sync_page_io(parity_rdev, parity_sector, block_size,
+				page1, REQ_OP_WRITE, 0, false)) {
+			pr_debug("%s:%*s parity write error!\n", __func__,
+				 indent, "");
+			md_error(mddev, parity_rdev);
+			ret = -EIO;
+			goto out;
+		}
+	}
+out:
+	__free_pages(pages, 1);
+	return ret;
+}
+
+static int ppl_recover(struct r5l_log *log, struct ppl_header *pplhdr)
+{
+	struct ppl_conf *ppl_conf = log->private;
+	struct md_rdev *rdev = log->rdev;
+	struct mddev *mddev = rdev->mddev;
+	sector_t ppl_sector = rdev->ppl.sector + (PPL_HEADER_SIZE >> 9);
+	struct page *page;
+	int i;
+	int ret = 0;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	/* iterate through all PPL entries saved */
+	for (i = 0; i < le32_to_cpu(pplhdr->entries_count); i++) {
+		struct ppl_header_entry *e = &pplhdr->entries[i];
+		u32 pp_size = le32_to_cpu(e->pp_size);
+		sector_t sector = ppl_sector;
+		int ppl_entry_sectors = pp_size >> 9;
+		u32 crc, crc_stored;
+
+		pr_debug("%s: disk: %d entry: %d ppl_sector: %llu pp_size: %u\n",
+			 __func__, rdev->raid_disk, i,
+			 (unsigned long long)ppl_sector, pp_size);
+
+		crc = ~0;
+		crc_stored = le32_to_cpu(e->checksum);
+
+		/* read partial parity for this entry and calculate its checksum */
+		while (pp_size) {
+			int s = pp_size > PAGE_SIZE ? PAGE_SIZE : pp_size;
+
+			if (!sync_page_io(rdev, sector - rdev->data_offset,
+					s, page, REQ_OP_READ, 0, false)) {
+				md_error(mddev, rdev);
+				ret = -EIO;
+				goto out;
+			}
+
+			crc = crc32c_le(crc, page_address(page), s);
+
+			pp_size -= s;
+			sector += s >> 9;
+		}
+
+		crc = ~crc;
+
+		if (crc != crc_stored) {
+			/*
+			 * Don't recover this entry if the checksum does not
+			 * match, but keep going and try to recover other
+			 * entries.
+			 */
+			pr_debug("%s: ppl entry crc does not match: stored: 0x%x calculated: 0x%x\n",
+				 __func__, crc_stored, crc);
+			ppl_conf->mismatch_count++;
+		} else {
+			ret = ppl_recover_entry(log, e, ppl_sector);
+			if (ret)
+				goto out;
+			ppl_conf->recovered_entries++;
+		}
+
+		ppl_sector += ppl_entry_sectors;
+	}
+out:
+	__free_page(page);
+	return ret;
+}
+
+static int ppl_write_empty_header(struct r5l_log *log)
+{
+	struct page *page;
+	struct ppl_header *pplhdr;
+	struct md_rdev *rdev = log->rdev;
+	int ret = 0;
+
+	pr_debug("%s: disk: %d ppl_sector: %llu\n", __func__,
+		 rdev->raid_disk, (unsigned long long)rdev->ppl.sector);
+
+	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
+	if (!page)
+		return -ENOMEM;
+
+	pplhdr = page_address(page);
+	memset(pplhdr->reserved, 0xff, PPL_HDR_RESERVED);
+	pplhdr->signature = cpu_to_le32(log->uuid_checksum);
+	pplhdr->checksum = cpu_to_le32(~crc32c_le(~0, pplhdr, PAGE_SIZE));
+
+	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
+			  PPL_HEADER_SIZE, page, REQ_OP_WRITE, 0, false)) {
+		md_error(rdev->mddev, rdev);
+		ret = -EIO;
+	}
+
+	__free_page(page);
+	return ret;
+}
+
+static int ppl_load_distributed(struct r5l_log *log)
+{
+	struct ppl_conf *ppl_conf = log->private;
+	struct md_rdev *rdev = log->rdev;
+	struct mddev *mddev = rdev->mddev;
+	struct page *page;
+	struct ppl_header *pplhdr;
+	u32 crc, crc_stored;
+	u32 signature;
+	int ret = 0;
+
+	pr_debug("%s: disk: %d\n", __func__, rdev->raid_disk);
+
+	/* read PPL header */
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	if (!sync_page_io(rdev, rdev->ppl.sector - rdev->data_offset,
+			  PAGE_SIZE, page, REQ_OP_READ, 0, false)) {
+		md_error(mddev, rdev);
+		ret = -EIO;
+		goto out;
+	}
+	pplhdr = page_address(page);
+
+	/* check header validity */
+	crc_stored = le32_to_cpu(pplhdr->checksum);
+	pplhdr->checksum = 0;
+	crc = ~crc32c_le(~0, pplhdr, PAGE_SIZE);
+
+	if (crc_stored != crc) {
+		pr_debug("%s: ppl header crc does not match: stored: 0x%x calculated: 0x%x\n",
+			 __func__, crc_stored, crc);
+		ppl_conf->mismatch_count++;
+		goto out;
+	}
+
+	signature = le32_to_cpu(pplhdr->signature);
+
+	if (mddev->external) {
+		/*
+		 * For external metadata the header signature is set and
+		 * validated in userspace.
+		 */
+		log->uuid_checksum = signature;
+	} else if (log->uuid_checksum != signature) {
+		pr_debug("%s: ppl header signature does not match: stored: 0x%x configured: 0x%x\n",
+			 __func__, signature, log->uuid_checksum);
+		ppl_conf->mismatch_count++;
+		goto out;
+	}
+
+	/* attempt to recover from log if we are starting a dirty array */
+	if (!mddev->pers && mddev->recovery_cp != MaxSector)
+		ret = ppl_recover(log, pplhdr);
+out:
+	if (!ret && (le32_to_cpu(pplhdr->entries_count) > 0 ||
+		     ppl_conf->mismatch_count > 0))
+		ret = ppl_write_empty_header(log);
+
+	__free_page(page);
+
+	pr_debug("%s: return: %d mismatch_count: %d recovered_entries: %d\n",
+		 __func__, ret, ppl_conf->mismatch_count,
+		 ppl_conf->recovered_entries);
+	return ret;
+}
+
+static int ppl_load(struct r5l_log *log)
+{
+	struct ppl_conf *ppl_conf = log->private;
+	int ret = 0;
+	int i;
+	u32 *signature = NULL;
+
+	for (i = 0; i < ppl_conf->count; i++) {
+		struct r5l_log *log_child = &ppl_conf->child_logs[i];
+
+		/* skip missing drive */
+		if (!log_child->rdev)
+			continue;
+
+		ret = ppl_load_distributed(log_child);
+		if (ret)
+			break;
+
+		/*
+		 * For external metadata we can't check if the signature is
+		 * correct on a single drive, but we can check if it is the same
+		 * on all drives.
+		 */
+		if (ppl_conf->mddev->external) {
+			if (!signature) {
+				signature = &log_child->uuid_checksum;
+			} else if (*signature != log_child->uuid_checksum) {
+				pr_warn("md/raid:%s: PPL header signature does not match on all member drives\n",
+					mdname(ppl_conf->mddev));
+				ret = -EINVAL;
+				break;
+			}
+		}
+	}
+
+	if (signature)
+		log->uuid_checksum = *signature;
+
+	pr_debug("%s: return: %d mismatch_count: %d recovered_entries: %d\n",
+		 __func__, ret, ppl_conf->mismatch_count,
+		 ppl_conf->recovered_entries);
+	return ret;
+}
+
 static void __ppl_exit_log(struct r5l_log *log)
 {
 	struct ppl_conf *ppl_conf = log->private;
@@ -515,6 +983,23 @@ static int __ppl_init_log(struct r5l_log *log, struct r5conf *conf)
 		pr_warn("md/raid:%s: Volatile write-back cache should be disabled on all member drives when using PPL!\n",
 			mdname(mddev));
 
+	/* load and possibly recover the logs from the member disks */
+	ret = ppl_load(log);
+
+	if (ret) {
+		goto err;
+	} else if (!mddev->pers &&
+		   mddev->recovery_cp == 0 && !mddev->degraded &&
+		   ppl_conf->recovered_entries > 0 &&
+		   ppl_conf->mismatch_count == 0) {
+		/*
+		 * If we are starting a dirty array and the recovery succeeds
+		 * without any issues, set the array as clean.
+		 */
+		mddev->recovery_cp = MaxSector;
+		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	}
+
 	conf->log = log;
 
 	return 0;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index ffa19bd4ba6e..c015925710f5 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7270,7 +7270,10 @@ static int raid5_run(struct mddev *mddev)
 
 	if (mddev->degraded > dirty_parity_disks &&
 	    mddev->recovery_cp != MaxSector) {
-		if (mddev->ok_start_degraded)
+		if (test_bit(MD_HAS_PPL, &mddev->flags))
+			pr_crit("md/raid:%s: starting dirty degraded array with PPL.\n",
+				mdname(mddev));
+		else if (mddev->ok_start_degraded)
 			pr_crit("md/raid:%s: starting dirty degraded array - data corruption possible.\n",
 				mdname(mddev));
 		else {
-- 
2.11.0



* [PATCH v3 8/9] raid5-ppl: support disk hot add/remove with PPL
@ 2017-01-30 18:59 ` Artur Paszkiewicz
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

Add a function to modify the log by removing an rdev when a drive fails
or adding one when a spare/replacement is activated as a raid member.

Removing a disk just clears the child log rdev pointer. No new stripes
will be accepted for this child log in ppl_write_stripe() and running io
units will be processed without writing PPL to the device.

Adding a disk sets the child log rdev pointer and writes an empty PPL
header.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/raid5-cache.c |  8 ++++++++
 drivers/md/raid5-cache.h |  9 +++++++++
 drivers/md/raid5-ppl.c   | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       | 15 +++++++++++++++
 4 files changed, 80 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 7757c5137300..3031c3f720a9 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2774,6 +2774,14 @@ void r5l_exit_log(struct r5l_log *log)
 	kfree(log);
 }
 
+int r5l_modify_log(struct r5l_log *log, struct md_rdev *rdev,
+		   enum r5l_modify_log_operation operation)
+{
+	if (log && log->policy->modify_log)
+		return log->policy->modify_log(log, rdev, operation);
+	return 0;
+}
+
 struct r5l_policy r5l_journal = {
 	.init_log = __r5l_init_log,
 	.exit_log = __r5l_exit_log,
diff --git a/drivers/md/raid5-cache.h b/drivers/md/raid5-cache.h
index 6b622c2742de..9e351708c247 100644
--- a/drivers/md/raid5-cache.h
+++ b/drivers/md/raid5-cache.h
@@ -138,9 +138,16 @@ enum r5l_io_unit_state {
 	IO_UNIT_STRIPE_END = 3,	/* stripes data finished writing to raid */
 };
 
+enum r5l_modify_log_operation {
+	R5L_MODIFY_LOG_DISK_REMOVE,
+	R5L_MODIFY_LOG_DISK_ADD,
+};
+
 struct r5l_policy {
 	int (*init_log)(struct r5l_log *log, struct r5conf *conf);
 	void (*exit_log)(struct r5l_log *log);
+	int (*modify_log)(struct r5l_log *log, struct md_rdev *rdev,
+			  enum r5l_modify_log_operation operation);
 	int (*write_stripe)(struct r5l_log *log, struct stripe_head *sh);
 	void (*write_stripe_run)(struct r5l_log *log);
 	void (*flush_stripe_to_raid)(struct r5l_log *log);
@@ -151,6 +158,8 @@ struct r5l_policy {
 
 extern int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev);
 extern void r5l_exit_log(struct r5l_log *log);
+extern int r5l_modify_log(struct r5l_log *log, struct md_rdev *rdev,
+			  enum r5l_modify_log_operation operation);
 extern int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh);
 extern void r5l_write_stripe_run(struct r5l_log *log);
 extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index b21b3bfa8f36..87e718f5b29f 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -291,6 +291,12 @@ static void ppl_submit_iounit(struct r5l_io_unit *io)
 
 	bio = bio_alloc_bioset(GFP_NOIO, BIO_MAX_PAGES, log->bs);
 	bio->bi_private = io;
+
+	if (!log->rdev || test_bit(Faulty, &log->rdev->flags)) {
+		ppl_log_endio(bio);
+		return;
+	}
+
 	bio->bi_end_io = ppl_log_endio;
 	bio->bi_opf = REQ_OP_WRITE | REQ_FUA;
 	bio->bi_bdev = log->rdev->bdev;
@@ -1008,6 +1014,47 @@ static int __ppl_init_log(struct r5l_log *log, struct r5conf *conf)
 	return ret;
 }
 
+static int __ppl_modify_log(struct r5l_log *log, struct md_rdev *rdev,
+			    enum r5l_modify_log_operation operation)
+{
+	struct r5l_log *log_child;
+	struct ppl_conf *ppl_conf = log->private;
+	int ret = 0;
+	char b[BDEVNAME_SIZE];
+
+	if (!rdev)
+		return -EINVAL;
+
+	pr_debug("%s: disk: %d operation: %s dev: %s\n",
+		 __func__, rdev->raid_disk,
+		 operation == R5L_MODIFY_LOG_DISK_REMOVE ? "remove" :
+		 (operation == R5L_MODIFY_LOG_DISK_ADD ? "add" : "?"),
+		 bdevname(rdev->bdev, b));
+
+	if (rdev->raid_disk < 0)
+		return 0;
+
+	if (rdev->raid_disk >= ppl_conf->count)
+		return -ENODEV;
+
+	log_child = &ppl_conf->child_logs[rdev->raid_disk];
+
+	mutex_lock(&log_child->io_mutex);
+	if (operation == R5L_MODIFY_LOG_DISK_REMOVE) {
+		log_child->rdev = NULL;
+	} else if (operation == R5L_MODIFY_LOG_DISK_ADD) {
+		log_child->rdev = rdev;
+		if (rdev->mddev->external)
+			log_child->uuid_checksum = log->uuid_checksum;
+		ret = ppl_write_empty_header(log_child);
+	} else {
+		ret = -EINVAL;
+	}
+	mutex_unlock(&log_child->io_mutex);
+
+	return ret;
+}
+
 static int __ppl_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 {
 	struct ppl_conf *ppl_conf = log->private;
@@ -1027,6 +1074,7 @@ static void __ppl_write_stripe_run(struct r5l_log *log)
 struct r5l_policy r5l_ppl = {
 	.init_log = __ppl_init_log,
 	.exit_log = __ppl_exit_log,
+	.modify_log = __ppl_modify_log,
 	.write_stripe = __ppl_write_stripe,
 	.write_stripe_run = __ppl_write_stripe_run,
 	.flush_stripe_to_raid = NULL,
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c015925710f5..a711bf940116 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7576,6 +7576,12 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			*rdevp = rdev;
 		}
 	}
+	if (test_bit(MD_HAS_PPL, &mddev->flags)) {
+		err = r5l_modify_log(conf->log, rdev,
+				     R5L_MODIFY_LOG_DISK_REMOVE);
+		if (err)
+			goto abort;
+	}
 	if (p->replacement) {
 		/* We must have just cleared 'rdev' */
 		p->rdev = p->replacement;
@@ -7585,6 +7591,10 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 			   */
 		p->replacement = NULL;
 		clear_bit(WantReplacement, &rdev->flags);
+
+		if (test_bit(MD_HAS_PPL, &mddev->flags))
+			err = r5l_modify_log(conf->log, p->rdev,
+					     R5L_MODIFY_LOG_DISK_ADD);
 	} else
 		/* We might have just removed the Replacement as faulty-
 		 * clear the bit just in case
@@ -7648,6 +7658,11 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 			if (rdev->saved_raid_disk != disk)
 				conf->fullsync = 1;
 			rcu_assign_pointer(p->rdev, rdev);
+
+			if (test_bit(MD_HAS_PPL, &mddev->flags))
+				err = r5l_modify_log(conf->log, rdev,
+						     R5L_MODIFY_LOG_DISK_ADD);
+
 			goto out;
 		}
 	}
-- 
2.11.0



* [PATCH v3 9/9] raid5-ppl: runtime PPL enabling or disabling
  2017-01-30 18:59 [PATCH v3 0/9] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
                   ` (7 preceding siblings ...)
  2017-01-30 18:59 ` [PATCH v3 8/9] raid5-ppl: support disk hot add/remove with PPL Artur Paszkiewicz
@ 2017-01-30 18:59 ` Artur Paszkiewicz
  2017-03-27  5:08   ` NeilBrown
  8 siblings, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-01-30 18:59 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, neilb, jes.sorensen, Artur Paszkiewicz

Allow writing to 'consistency_policy' attribute when the array is
active. Add a new function 'change_consistency_policy' to the
md_personality operations structure to handle the change in the
personality code. Values "ppl" and "resync" are accepted and
turn PPL on and off respectively.

When enabling PPL its location and size should first be set using
'ppl_sector' and 'ppl_size' attributes and a valid PPL header should be
written at this location on each member device.

Enabling or disabling PPL is performed under a suspended array.  The
raid5_reset_stripe_cache function frees the stripe cache and allocates
it again in order to allocate or free the ppl_pages for the stripes in
the stripe cache.

Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
---
 drivers/md/md.c        | 12 +++++++---
 drivers/md/md.h        |  2 ++
 drivers/md/raid5-ppl.c | 19 +++++++++++++++-
 drivers/md/raid5.c     | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 90 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 47ab3afe87cb..efbaa2f90eb0 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5051,14 +5051,20 @@ consistency_policy_show(struct mddev *mddev, char *page)
 static ssize_t
 consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
 {
+	int err = 0;
+
 	if (mddev->pers) {
-		return -EBUSY;
+		if (mddev->pers->change_consistency_policy)
+			err = mddev->pers->change_consistency_policy(mddev, buf);
+		else
+			err = -EBUSY;
 	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
 		set_bit(MD_HAS_PPL, &mddev->flags);
-		return len;
 	} else {
-		return -EINVAL;
+		err = -EINVAL;
 	}
+
+	return err ? err : len;
 }
 
 static struct md_sysfs_entry md_consistency_policy =
diff --git a/drivers/md/md.h b/drivers/md/md.h
index abdb5f2ed2d3..ead1aeff9302 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -548,6 +548,8 @@ struct md_personality
 	/* congested implements bdi.congested_fn().
 	 * Will not be called while array is 'suspended' */
 	int (*congested)(struct mddev *mddev, int bits);
+	/* Changes the consistency policy of an active array. */
+	int (*change_consistency_policy)(struct mddev *mddev, const char *buf);
 };
 
 struct md_sysfs_entry {
diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
index 87e718f5b29f..5a8e7a56eac6 100644
--- a/drivers/md/raid5-ppl.c
+++ b/drivers/md/raid5-ppl.c
@@ -873,7 +873,20 @@ static void __ppl_exit_log(struct r5l_log *log)
 {
 	struct ppl_conf *ppl_conf = log->private;
 
-	kfree(ppl_conf->child_logs);
+	if (ppl_conf->child_logs) {
+		struct r5l_log *log_child;
+		int i;
+
+		for (i = 0; i < ppl_conf->count; i++) {
+			log_child = &ppl_conf->child_logs[i];
+			if (log_child->rdev) {
+				log_child->rdev->ppl.offset = 0;
+				log_child->rdev->ppl.sector = 0;
+				log_child->rdev->ppl.size = 0;
+			}
+		}
+		kfree(ppl_conf->child_logs);
+	}
 	kfree(ppl_conf);
 
 	mempool_destroy(log->meta_pool);
@@ -1004,6 +1017,10 @@ static int __ppl_init_log(struct r5l_log *log, struct r5conf *conf)
 		 */
 		mddev->recovery_cp = MaxSector;
 		set_bit(MD_SB_CHANGE_CLEAN, &mddev->sb_flags);
+	} else if (mddev->pers && ppl_conf->mismatch_count > 0) {
+		/* no mismatch allowed when enabling PPL for a running array */
+		ret = -EINVAL;
+		goto err;
 	}
 
 	conf->log = log;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a711bf940116..a095ac2f5d64 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8272,6 +8272,66 @@ static void *raid6_takeover(struct mddev *mddev)
 	return setup_conf(mddev);
 }
 
+static void raid5_reset_stripe_cache(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+
+	mutex_lock(&conf->cache_size_mutex);
+	while (conf->max_nr_stripes &&
+	       drop_one_stripe(conf))
+		;
+	while (conf->min_nr_stripes > conf->max_nr_stripes &&
+	       grow_one_stripe(conf, GFP_KERNEL))
+		;
+	mutex_unlock(&conf->cache_size_mutex);
+}
+
+static int raid5_change_consistency_policy(struct mddev *mddev, const char *buf)
+{
+	struct r5conf *conf;
+	int err;
+
+	err = mddev_lock(mddev);
+	if (err)
+		return err;
+	conf = mddev->private;
+	if (!conf) {
+		mddev_unlock(mddev);
+		return -ENODEV;
+	}
+
+	if (strncmp(buf, "ppl", 3) == 0 &&
+	    !test_bit(MD_HAS_PPL, &mddev->flags) &&
+	    !test_bit(MD_HAS_JOURNAL, &mddev->flags) &&
+	    !mddev->bitmap) {
+		mddev_suspend(mddev);
+		set_bit(MD_HAS_PPL, &mddev->flags);
+		err = r5l_init_log(conf, NULL);
+		if (err)
+			clear_bit(MD_HAS_PPL, &mddev->flags);
+		else
+			raid5_reset_stripe_cache(mddev);
+		mddev_resume(mddev);
+	} else if (strncmp(buf, "resync", 6) == 0 &&
+		   test_bit(MD_HAS_PPL, &mddev->flags)) {
+		mddev_suspend(mddev);
+		r5l_exit_log(conf->log);
+		conf->log = NULL;
+		clear_bit(MD_HAS_PPL, &mddev->flags);
+		raid5_reset_stripe_cache(mddev);
+		mddev_resume(mddev);
+	} else {
+		err = -EINVAL;
+	}
+
+	if (!err)
+		md_update_sb(mddev, 1);
+
+	mddev_unlock(mddev);
+
+	return err;
+}
+
 static struct md_personality raid6_personality =
 {
 	.name		= "raid6",
@@ -8317,6 +8377,7 @@ static struct md_personality raid5_personality =
 	.quiesce	= raid5_quiesce,
 	.takeover	= raid5_takeover,
 	.congested	= raid5_congested,
+	.change_consistency_policy = raid5_change_consistency_policy,
 };
 
 static struct md_personality raid4_personality =
-- 
2.11.0



* Re: [PATCH v3 3/9] md: superblock changes for PPL
  2017-01-30 18:59 ` [PATCH v3 3/9] md: superblock changes for PPL Artur Paszkiewicz
@ 2017-02-07 21:20   ` Shaohua Li
  2017-02-08 11:58     ` Artur Paszkiewicz
  0 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-02-07 21:20 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen

On Mon, Jan 30, 2017 at 07:59:47PM +0100, Artur Paszkiewicz wrote:
> Include information about PPL location and size into mdp_superblock_1
> and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
> put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
> 'feature_map', analogously to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
> to mddev->flags to indicate that PPL is enabled on an array.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/md.c                | 15 +++++++++++++++
>  drivers/md/md.h                |  8 ++++++++
>  drivers/md/raid0.c             |  3 ++-
>  drivers/md/raid1.c             |  3 ++-
>  include/uapi/linux/raid/md_p.h | 18 ++++++++++++++----
>  5 files changed, 41 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 85ac98417a08..e96f73572e23 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -1566,6 +1566,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>  	} else if (sb->bblog_offset != 0)
>  		rdev->badblocks.shift = 0;
>  
> +	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
> +		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
> +		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
> +		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
> +	}
> +
>  	if (!refdev) {
>  		ret = 1;
>  	} else {
> @@ -1678,6 +1684,9 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>  
>  		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>  			set_bit(MD_HAS_JOURNAL, &mddev->flags);
> +
> +		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL)
> +			set_bit(MD_HAS_PPL, &mddev->flags);

I think we should check for invalid configurations and bail out. For example,
both MD_FEATURE_PPL and MD_FEATURE_BITMAP_OFFSET set, or both MD_FEATURE_PPL
and MD_FEATURE_JOURNAL set.
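Something along the lines of this standalone sketch, perhaps (the feature bit
values are copied from include/uapi/linux/raid/md_p.h, but the helper name and
shape are made up for illustration):

```c
#include <stdint.h>
#include <errno.h>

/* Feature bits as defined in include/uapi/linux/raid/md_p.h */
#define MD_FEATURE_BITMAP_OFFSET	1
#define MD_FEATURE_JOURNAL		512
#define MD_FEATURE_PPL			1024

/*
 * Hypothetical validation helper: PPL is mutually exclusive with both
 * the write-intent bitmap and the journal, so a superblock advertising
 * PPL together with either of them is invalid.
 */
static int check_ppl_feature_map(uint32_t feature_map)
{
	if ((feature_map & MD_FEATURE_PPL) &&
	    (feature_map & (MD_FEATURE_BITMAP_OFFSET | MD_FEATURE_JOURNAL)))
		return -EINVAL;
	return 0;
}
```

super_1_validate() would then reject such a superblock instead of silently
setting both flags on the mddev.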

Thanks,
Shaohua


* Re: [PATCH v3 4/9] raid5: calculate partial parity for a stripe
  2017-01-30 18:59 ` [PATCH v3 4/9] raid5: calculate partial parity for a stripe Artur Paszkiewicz
@ 2017-02-07 21:25   ` Shaohua Li
  2017-02-08 11:58     ` Artur Paszkiewicz
  0 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-02-07 21:25 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen

On Mon, Jan 30, 2017 at 07:59:48PM +0100, Artur Paszkiewicz wrote:
> Attach a page for holding the partial parity data to stripe_head.
> Allocate it only if mddev has the MD_HAS_PPL flag set.
> 
> Partial parity is the xor of not modified data chunks of a stripe and is
> calculated as follows:
> 
> - reconstruct-write case:
>   xor data from all not updated disks in a stripe
> 
> - read-modify-write case:
>   xor old data and parity from all updated disks in a stripe
> 
> Implement it using the async_tx API and integrate into raid_run_ops().
> It must be called when we still have access to old data, so do it when
> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
> stored into sh->ppl_page.
> 
> Partial parity is not meaningful for full stripe write and is not stored
> in the log or used for recovery, so don't attempt to calculate it when
> stripe has STRIPE_FULL_WRITE.
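As a sanity check on the two cases above: for the same write, both formulas
yield the same partial parity, since old parity is the XOR of all old data.
A standalone userspace sketch of the arithmetic (plain byte XOR over toy
buffers; the kernel code of course uses the async_tx API instead):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NDISKS 4   /* data disks in the stripe (parity excluded) */
#define CHUNK  8   /* toy chunk size in bytes */

/* dst ^= src over one chunk */
static void xor_chunk(uint8_t *dst, const uint8_t *src)
{
	for (size_t i = 0; i < CHUNK; i++)
		dst[i] ^= src[i];
}

/* reconstruct-write: XOR the data of all disks NOT updated by the write */
static void pp_rcw(uint8_t pp[CHUNK], uint8_t data[NDISKS][CHUNK],
		   const int updated[NDISKS])
{
	memset(pp, 0, CHUNK);
	for (int d = 0; d < NDISKS; d++)
		if (!updated[d])
			xor_chunk(pp, data[d]);
}

/* read-modify-write: XOR old data of the updated disks with old parity */
static void pp_rmw(uint8_t pp[CHUNK], uint8_t data[NDISKS][CHUNK],
		   const uint8_t parity[CHUNK], const int updated[NDISKS])
{
	memcpy(pp, parity, CHUNK);
	for (int d = 0; d < NDISKS; d++)
		if (updated[d])
			xor_chunk(pp, data[d]);
}
```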
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  drivers/md/raid5.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.h |  2 ++
>  2 files changed, 100 insertions(+)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index d1cba941951e..e1e238da32ba 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -466,6 +466,11 @@ static void shrink_buffers(struct stripe_head *sh)
>  		sh->dev[i].page = NULL;
>  		put_page(p);
>  	}
> +
> +	if (sh->ppl_page) {
> +		put_page(sh->ppl_page);
> +		sh->ppl_page = NULL;
> +	}
>  }
>  
>  static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
> @@ -482,6 +487,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>  		sh->dev[i].page = page;
>  		sh->dev[i].orig_page = page;
>  	}
> +
> +	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {

I think the test should be something like:
	if (raid5_ppl_enabled())

Having the feature doesn't mean the feature is enabled.
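Agreed that a predicate would read better. One possible shape for it, sketched
with stub types (in the tree it would take the real struct r5conf and live next
to the other log helpers; the log_is_ppl flag here is hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>

/* Stubs standing in for the kernel structures used below. */
struct r5l_log { int dummy; };
struct r5conf {
	struct r5l_log *log;
	bool log_is_ppl;	/* hypothetical flag set when the PPL policy is active */
};

/*
 * Hypothetical helper: the array runs with PPL only if a log is
 * attached *and* that log uses the PPL policy. Merely having
 * MD_HAS_PPL in the feature flags is not enough.
 */
static bool raid5_ppl_enabled(struct r5conf *conf)
{
	return conf->log != NULL && conf->log_is_ppl;
}
```

grow_buffers() would then allocate sh->ppl_page based on this predicate rather
than on the raw feature bit.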

>  	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
>  		__func__, (unsigned long long)sh->sector,
>  		s->locked, s->ops_request);
> @@ -3105,6 +3175,34 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>  	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
>  		goto overlap;
>  
> +	if (forwrite && test_bit(MD_HAS_PPL, &conf->mddev->flags)) {
> +		/*
> +		 * With PPL only writes to consecutive data chunks within a
> +		 * stripe are allowed. Not really an overlap, but
> +		 * wait_for_overlap can be used to handle this.
> +		 */

Please describe why the data must be consecutive.

Thanks,
Shaohua


* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
  2017-01-30 18:59 ` [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation Artur Paszkiewicz
@ 2017-02-07 21:42   ` Shaohua Li
  2017-02-08 11:58     ` Artur Paszkiewicz
  0 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-02-07 21:42 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen

On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
> This implements the PPL write logging functionality, using the
> raid5-cache policy logic introduced in previous patches. The description
> of PPL is added to the documentation. More details can be found in the
> comments in raid5-ppl.c.
> 
> Put the PPL metadata structures to md_p.h because userspace tools
> (mdadm) will also need to read/write PPL.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  Documentation/admin-guide/md.rst |  53 ++++
>  drivers/md/Makefile              |   2 +-
>  drivers/md/raid5-cache.c         |  13 +-
>  drivers/md/raid5-cache.h         |   8 +
>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.c               |  15 +-
>  include/uapi/linux/raid/md_p.h   |  26 ++
>  7 files changed, 661 insertions(+), 7 deletions(-)
>  create mode 100644 drivers/md/raid5-ppl.c
> 
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index e449fb5f277c..7104ef757e73 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
>  that all auto-detected arrays are assembled as partitionable.
>  
> +
> +.. _dirty_degraded_boot:
> +
>  Boot time assembly of degraded/dirty arrays
>  -------------------------------------------
>  
> @@ -176,6 +179,56 @@ and its role in the array.
>  Once started with RUN_ARRAY, uninitialized spares can be added with
>  HOT_ADD_DISK.
>  
> +.. _ppl:
> +
> +Partial Parity Log
> +------------------
Please put this part in a separate file under Documentation/md. The admin-guide
is supposed to serve as a doc for users. The Documentation/md files will serve
as docs for developers.

> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> +may become inconsistent with data on other member disks. If the array is also
> +in degraded state, there is no way to recalculate parity, because one of the
> +disks is missing. This can lead to silent data corruption when rebuilding the
> +array or using it is as degraded - data calculated from parity for array blocks
> +that have not been touched by a write request during the unclean shutdown can
> +be incorrect. Such condition is known as the ``RAID5 Write Hole``. Because of
> +this, md by default does not allow starting a dirty degraded array, see
> +:ref:`dirty_degraded_boot`.
> +
> +Partial parity for a write operation is the XOR of stripe data chunks not
> +modified by this write. It is just enough data needed for recovering from the
> +write hole. XORing partial parity with the modified chunks produces parity for
> +the stripe, consistent with its state before the write operation, regardless of
> +which chunk writes have completed. If one of the not modified data disks of
> +this stripe is missing, this updated parity can be used to recover its
> +contents. PPL recovery is also performed when starting an array after an
> +unclean shutdown and all disks are available, eliminating the need to resync
> +the array. Because of this, using write-intent bitmap and PPL together is not
> +supported.
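The recovery property described here can be checked with a toy example: update
one chunk, keep the partial parity, "lose" one of the untouched disks, and
rebuild it from the PPL-updated parity. A standalone sketch of that arithmetic
(toy buffers only, not the kernel recovery path):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NDISKS 4   /* data disks in the stripe (parity excluded) */
#define CHUNK  8   /* toy chunk size in bytes */

/* out = a ^ b over one chunk */
static void xor2(uint8_t *out, const uint8_t *a, const uint8_t *b)
{
	for (size_t i = 0; i < CHUNK; i++)
		out[i] = a[i] ^ b[i];
}

/*
 * Rebuild the missing (not modified) chunk: XOR the updated parity with
 * every surviving data chunk. updated_parity must already be the partial
 * parity XORed with the freshly written chunks.
 */
static void recover_chunk(uint8_t out[CHUNK],
			  const uint8_t updated_parity[CHUNK],
			  uint8_t data[NDISKS][CHUNK], int missing)
{
	memcpy(out, updated_parity, CHUNK);
	for (int d = 0; d < NDISKS; d++)
		if (d != missing)
			for (size_t i = 0; i < CHUNK; i++)
				out[i] ^= data[d][i];
}
```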
> +
> +When handling a write request PPL writes partial parity before new data and
> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> +array member drives in the metadata area, on the parity drive of a particular
> +stripe.  It does not require a dedicated journaling drive. Write performance is
> +reduced by up to 30%-40% but it scales with the number of drives in the array
> +and the journaling drive does not become a bottleneck or a single point of
> +failure.
> +
> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> +not a true journal. It does not protect from losing in-flight data, only from
> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> +performed for this stripe (parity is not updated). So it is possible to have
> +arbitrary data in the written part of a stripe if that disk is lost. In such
> +case the behavior is the same as in plain raid5.
> +
> +PPL is available for md version-1 metadata and external (specifically IMSM)
> +metadata arrays. It can be enabled using mdadm option
> +``--consistency-policy=ppl``.
> +
> +Currently, volatile write-back cache should be disabled on all member drives
> +when using PPL. Otherwise PPL cannot guarantee consistency in case of power
> +failure.

The limitation is weird and dumb. Is this an implementation limitation which
can be fixed later or a design limitation?

> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> new file mode 100644
> index 000000000000..6bc246c80f6b
> --- /dev/null
> +++ b/drivers/md/raid5-ppl.c
> @@ -0,0 +1,551 @@
> +/*
> + * Partial Parity Log for closing the RAID5 write hole
> + * Copyright (c) 2017, Intel Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms and conditions of the GNU General Public License,
> + * version 2, as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope it will be useful, but WITHOUT
> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> + * more details.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/blkdev.h>
> +#include <linux/slab.h>
> +#include <linux/crc32c.h>
> +#include <linux/raid/md_p.h>
> +#include "md.h"
> +#include "raid5.h"
> +#include "raid5-cache.h"
> +
> +/*
> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> + * partial parity data. The header contains an array of entries
> + * (struct ppl_header_entry) which describe the logged write requests.
> + * Partial parity for the entries comes after the header, written in the same
> + * sequence as the entries:
> + *
> + * Header
> + *   entry0
> + *   ...
> + *   entryN
> + * PP data
> + *   PP for entry0
> + *   ...
> + *   PP for entryN
> + *
> + * Every entry holds a checksum of its partial parity, the header also has a
> + * checksum of the header itself. Entries for full stripes writes contain no
> + * partial parity, they only mark the stripes for which parity should be
> + * recalculated after an unclean shutdown.
> + *
> + * A write request is always logged to the PPL instance stored on the parity
> + * disk of the corresponding stripe. For each member disk there is one r5l_log
> + * used to handle logging for this disk, independently from others. They are
> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
> + * common parent r5l_log. This parent log serves as a proxy and is used in
> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
> + *
> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> + * be appended to the last entry if the chunks to write are the same, otherwise
> + * a new entry is added. Checksums of entries are calculated incrementally as
> + * stripes containing partial parity are being added to entries.
> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
> + * for all stripes of the io_unit. When the PPL write completes, the stripes
> + * associated with the io_unit are released and raid5d starts writing their data
> + * and parity. When all stripes are written, the io_unit is freed and the next
> + * can be submitted.
> + *
> + * An io_unit is used to gather stripes until it is submitted or becomes full
> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> + * can't be submitted until the previous has completed (PPL and stripe
> + * data+parity is written). The log->running_ios list tracks all io_units of
> + * a log (for a single member disk). New io_units are added to the end of the
> + * list and the first io_unit is submitted, if it is not submitted already.
> + * The current io_unit accepting new stripes is always the last on the list.
> + */

I don't like the way that io_unit and r5l_log are reused here. The code below
shares very little with raid5-cache. There are a lot of fields in the data
structures which are not used by ppl. It's not good practice to share data
structures this way, so please give ppl its own data structures. The workflow
between raid5-cache and ppl might be similar, but what we really need to share
is the hooks into the raid5 core.

Thanks,
Shaohua


* Re: [PATCH v3 6/9] md: add sysfs entries for PPL
  2017-01-30 18:59 ` [PATCH v3 6/9] md: add sysfs entries for PPL Artur Paszkiewicz
@ 2017-02-07 21:49   ` Shaohua Li
  2017-02-08 11:58     ` Artur Paszkiewicz
  0 siblings, 1 reply; 25+ messages in thread
From: Shaohua Li @ 2017-02-07 21:49 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen

On Mon, Jan 30, 2017 at 07:59:50PM +0100, Artur Paszkiewicz wrote:
> Add 'consistency_policy' attribute for array. It indicates how the array
> maintains consistency in case of unexpected shutdown.
> 
> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
> and size of the PPL space on the device.
> 
> These attributes are writable to allow enabling PPL for external
> metadata arrays and (later) to enable/disable PPL for a running array.
> 
> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> ---
>  Documentation/admin-guide/md.rst |  32 ++++++++++-
>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 144 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> index 7104ef757e73..2c153b3d798f 100644
> --- a/Documentation/admin-guide/md.rst
> +++ b/Documentation/admin-guide/md.rst
> @@ -329,14 +329,14 @@ All md devices contain:
>       array creation it will default to 0, though starting the array as
>       ``clean`` will set it much larger.
>  
> -   new_dev
> +  new_dev
>       This file can be written but not read.  The value written should
>       be a block device number as major:minor.  e.g. 8:0
>       This will cause that device to be attached to the array, if it is
>       available.  It will then appear at md/dev-XXX (depending on the
>       name of the device) and further configuration is then possible.
>  
> -   safe_mode_delay
> +  safe_mode_delay
>       When an md array has seen no write requests for a certain period
>       of time, it will be marked as ``clean``.  When another write
>       request arrives, the array is marked as ``dirty`` before the write
> @@ -345,7 +345,7 @@ All md devices contain:
>       period as a number of seconds.  The default is 200msec (0.200).
>       Writing a value of 0 disables safemode.
>  
> -   array_state
> +  array_state
>       This file contains a single word which describes the current
>       state of the array.  In many cases, the state can be set by
>       writing the word for the desired state, however some states
> @@ -454,7 +454,30 @@ All md devices contain:
>       once the array becomes non-degraded, and this fact has been
>       recorded in the metadata.
>  
> +  consistency_policy
> +     This indicates how the array maintains consistency in case of unexpected
> +     shutdown. It can be:
>  
> +     none
> +       Array has no redundancy information, e.g. raid0, linear.
> +
> +     resync
> +       Full resync is performed and all redundancy is regenerated when the
> +       array is started after unclean shutdown.
> +
> +     bitmap
> +       Resync assisted by a write-intent bitmap.
> +
> +     journal
> +       For raid4/5/6, journal device is used to log transactions and replay
> +       after unclean shutdown.
> +
> +     ppl
> +       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
> +       resync.
> +
> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
> +     used to enable and disable PPL.
>  
>  
>  As component devices are added to an md array, they appear in the ``md``
> @@ -616,6 +639,9 @@ Each directory contains:
>  	adds bad blocks without acknowledging them. This is largely
>  	for testing.
>  
> +      ppl_sector, ppl_size
> +        Location and size (in sectors) of the space used for Partial Parity Log
> +        on this device.
>  
>  
>  An active md device will also contain an entry for each active device
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index e96f73572e23..47ab3afe87cb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
>  
> +static ssize_t
> +ppl_sector_show(struct md_rdev *rdev, char *page)
> +{
> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
> +}
> +
> +static ssize_t
> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
> +{
> +	unsigned long long sector;
> +
> +	if (kstrtoull(buf, 10, &sector) < 0)
> +		return -EINVAL;
> +	if (sector != (sector_t)sector)
> +	        return -EINVAL;
> +
> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> +	    rdev->raid_disk >= 0)
> +		return -EBUSY;
> +
> +	if (rdev->mddev->persistent) {
> +		if (rdev->mddev->major_version == 0)
> +			return -EINVAL;
> +		if ((sector > rdev->sb_start &&
> +		     sector - rdev->sb_start > S16_MAX) ||
> +		    (sector < rdev->sb_start &&
> +		     rdev->sb_start - sector > -S16_MIN))
> +			return -EINVAL;
> +		rdev->ppl.offset = sector - rdev->sb_start;

Why do we allow a non-external array to change the ppl log position and size?

> +	} else if (!rdev->mddev->external) {
> +		return -EBUSY;
> +	}
> +	rdev->ppl.sector = sector;
> +	return len;
> +}
> +
> +static struct rdev_sysfs_entry rdev_ppl_sector =
> +__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
> +
> +static ssize_t
> +ppl_size_show(struct md_rdev *rdev, char *page)
> +{
> +	return sprintf(page, "%u\n", rdev->ppl.size);
> +}
> +
> +static ssize_t
> +ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
> +{
> +	unsigned int size;
> +
> +	if (kstrtouint(buf, 10, &size) < 0)
> +		return -EINVAL;
> +
> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> +	    rdev->raid_disk >= 0)
> +		return -EBUSY;
> +
> +	if (rdev->mddev->persistent) {
> +		if (rdev->mddev->major_version == 0)
> +			return -EINVAL;
> +		if (size > U16_MAX)
> +			return -EINVAL;
> +	} else if (!rdev->mddev->external) {
> +		return -EBUSY;
> +	}
> +	rdev->ppl.size = size;
> +	return len;
> +}
> +
> +static struct rdev_sysfs_entry rdev_ppl_size =
> +__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
> +
>  static struct attribute *rdev_default_attrs[] = {
>  	&rdev_state.attr,
>  	&rdev_errors.attr,
> @@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
>  	&rdev_recovery_start.attr,
>  	&rdev_bad_blocks.attr,
>  	&rdev_unack_bad_blocks.attr,
> +	&rdev_ppl_sector.attr,
> +	&rdev_ppl_size.attr,
>  	NULL,
>  };
>  static ssize_t
> @@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
>  __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
>         array_size_store);
>  
> +static ssize_t
> +consistency_policy_show(struct mddev *mddev, char *page)
> +{
> +	int ret;
> +
> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> +		ret = sprintf(page, "journal\n");
> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
> +		ret = sprintf(page, "ppl\n");
> +	} else if (mddev->bitmap) {
> +		ret = sprintf(page, "bitmap\n");
> +	} else if (mddev->pers) {
> +		if (mddev->pers->sync_request)
> +			ret = sprintf(page, "resync\n");
> +		else
> +			ret = sprintf(page, "none\n");
> +	} else {
> +		ret = sprintf(page, "unknown\n");
> +	}
> +
> +	return ret;
> +}
> +
> +static ssize_t
> +consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
> +{
> +	if (mddev->pers) {
> +		return -EBUSY;
> +	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
> +		set_bit(MD_HAS_PPL, &mddev->flags);
> +		return len;

Here we should check if ppl position and size are valid before we set the flag.

And shouldn't we move this part to .change_consistency_policy added in patch 9?
This is a consistency policy change too.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
  2017-02-08 11:58     ` Artur Paszkiewicz
@ 2017-02-08  5:34       ` Shaohua Li
  2017-02-09 15:35         ` Artur Paszkiewicz
  2017-02-09 16:06         ` Wols Lists
  0 siblings, 2 replies; 25+ messages in thread
From: Shaohua Li @ 2017-02-08  5:34 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen

On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
> On 02/07/2017 10:42 PM, Shaohua Li wrote:
> > On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
> >> This implements the PPL write logging functionality, using the
> >> raid5-cache policy logic introduced in previous patches. The description
> >> of PPL is added to the documentation. More details can be found in the
> >> comments in raid5-ppl.c.
> >>
> >> Put the PPL metadata structures to md_p.h because userspace tools
> >> (mdadm) will also need to read/write PPL.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >> ---
> >>  Documentation/admin-guide/md.rst |  53 ++++
> >>  drivers/md/Makefile              |   2 +-
> >>  drivers/md/raid5-cache.c         |  13 +-
> >>  drivers/md/raid5-cache.h         |   8 +
> >>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
> >>  drivers/md/raid5.c               |  15 +-
> >>  include/uapi/linux/raid/md_p.h   |  26 ++
> >>  7 files changed, 661 insertions(+), 7 deletions(-)
> >>  create mode 100644 drivers/md/raid5-ppl.c
> >>
> >> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >> index e449fb5f277c..7104ef757e73 100644
> >> --- a/Documentation/admin-guide/md.rst
> >> +++ b/Documentation/admin-guide/md.rst
> >> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
> >>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
> >>  that all auto-detected arrays are assembled as partitionable.
> >>  
> >> +
> >> +.. _dirty_degraded_boot:
> >> +
> >>  Boot time assembly of degraded/dirty arrays
> >>  -------------------------------------------
> >>  
> >> @@ -176,6 +179,56 @@ and its role in the array.
> >>  Once started with RUN_ARRAY, uninitialized spares can be added with
> >>  HOT_ADD_DISK.
> >>  
> >> +.. _ppl:
> >> +
> >> +Partial Parity Log
> >> +------------------
> > Please put this part in a separate file under Documentation/md. The admin-guide
> > is supposed to serve as a doc for users. The Documentation/md files will serve
> > as docs for developers.
> 
> OK, I will move it.
> 
> >> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> >> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> >> +may become inconsistent with data on other member disks. If the array is also
> >> +in degraded state, there is no way to recalculate parity, because one of the
> >> +disks is missing. This can lead to silent data corruption when rebuilding the
> >> +array or using it as degraded - data calculated from parity for array blocks
> >> +that have not been touched by a write request during the unclean shutdown can
> >> +be incorrect. Such a condition is known as the ``RAID5 Write Hole``. Because of
> >> +this, md by default does not allow starting a dirty degraded array, see
> >> +:ref:`dirty_degraded_boot`.
> >> +
> >> +Partial parity for a write operation is the XOR of stripe data chunks not
> >> +modified by this write. It is just enough data needed for recovering from the
> >> +write hole. XORing partial parity with the modified chunks produces parity for
> >> +the stripe, consistent with its state before the write operation, regardless of
> >> +which chunk writes have completed. If one of the not modified data disks of
> >> +this stripe is missing, this updated parity can be used to recover its
> >> +contents. PPL recovery is also performed when starting an array after an
> >> +unclean shutdown and all disks are available, eliminating the need to resync
> >> +the array. Because of this, using write-intent bitmap and PPL together is not
> >> +supported.
> >> +
> >> +When handling a write request PPL writes partial parity before new data and
> >> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> >> +array member drives in the metadata area, on the parity drive of a particular
> >> +stripe.  It does not require a dedicated journaling drive. Write performance is
> >> +reduced by up to 30%-40% but it scales with the number of drives in the array
> >> +and the journaling drive does not become a bottleneck or a single point of
> >> +failure.
> >> +
> >> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> >> +not a true journal. It does not protect from losing in-flight data, only from
> >> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> >> +performed for this stripe (parity is not updated). So it is possible to have
> >> +arbitrary data in the written part of a stripe if that disk is lost. In such a
> >> +case the behavior is the same as in plain raid5.
> >> +
> >> +PPL is available for md version-1 metadata and external (specifically IMSM)
> >> +metadata arrays. It can be enabled using mdadm option
> >> +``--consistency-policy=ppl``.
> >> +
> >> +Currently, volatile write-back cache should be disabled on all member drives
> >> +when using PPL. Otherwise it cannot guarantee consistency in case of power
> >> +failure.
> > 
> > The limitation is weird and dumb. Is this an implementation limitation which
> > can be fixed later or a design limitation?
> 
> This can be fixed later and I'm planning to do this. The limitation
> results from the fact that partial parity is the xor of "old" data.
> During recovery we xor it with "new" data and that produces valid
> parity, assuming the "old" data has not changed after dirty shutdown.
> But it's not necessarily true, since the "old" data might have been only
> in cache and not on persistent storage. So fixing this will require
> making sure the old data is on persistent media before writing PPL.

so it's the implementation that doesn't flush the disk cache before writing new
data to disks, right? That makes sense then. I think we should fix this soon. I
don't think people will disable (or remember to disable) the write-back cache
before using ppl.

> >> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> >> new file mode 100644
> >> index 000000000000..6bc246c80f6b
> >> --- /dev/null
> >> +++ b/drivers/md/raid5-ppl.c
> >> @@ -0,0 +1,551 @@
> >> +/*
> >> + * Partial Parity Log for closing the RAID5 write hole
> >> + * Copyright (c) 2017, Intel Corporation.
> >> + *
> >> + * This program is free software; you can redistribute it and/or modify it
> >> + * under the terms and conditions of the GNU General Public License,
> >> + * version 2, as published by the Free Software Foundation.
> >> + *
> >> + * This program is distributed in the hope it will be useful, but WITHOUT
> >> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> >> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> >> + * more details.
> >> + */
> >> +
> >> +#include <linux/kernel.h>
> >> +#include <linux/blkdev.h>
> >> +#include <linux/slab.h>
> >> +#include <linux/crc32c.h>
> >> +#include <linux/raid/md_p.h>
> >> +#include "md.h"
> >> +#include "raid5.h"
> >> +#include "raid5-cache.h"
> >> +
> >> +/*
> >> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> >> + * partial parity data. The header contains an array of entries
> >> + * (struct ppl_header_entry) which describe the logged write requests.
> >> + * Partial parity for the entries comes after the header, written in the same
> >> + * sequence as the entries:
> >> + *
> >> + * Header
> >> + *   entry0
> >> + *   ...
> >> + *   entryN
> >> + * PP data
> >> + *   PP for entry0
> >> + *   ...
> >> + *   PP for entryN
> >> + *
> >> + * Every entry holds a checksum of its partial parity, the header also has a
> >> + * checksum of the header itself. Entries for full stripe writes contain no
> >> + * partial parity, they only mark the stripes for which parity should be
> >> + * recalculated after an unclean shutdown.
> >> + *
> >> + * A write request is always logged to the PPL instance stored on the parity
> >> + * disk of the corresponding stripe. For each member disk there is one r5l_log
> >> + * used to handle logging for this disk, independently from others. They are
> >> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
> >> + * common parent r5l_log. This parent log serves as a proxy and is used in
> >> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
> >> + *
> >> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
> >> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> >> + * be appended to the last entry if the chunks to write are the same, otherwise
> >> + * a new entry is added. Checksums of entries are calculated incrementally as
> >> + * stripes containing partial parity are being added to entries.
> >> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> >> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
> >> + * for all stripes of the io_unit. When the PPL write completes, the stripes
> >> + * associated with the io_unit are released and raid5d starts writing their data
> >> + * and parity. When all stripes are written, the io_unit is freed and the next
> >> + * can be submitted.
> >> + *
> >> + * An io_unit is used to gather stripes until it is submitted or becomes full
> >> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> >> + * can't be submitted until the previous has completed (PPL and stripe
> >> + * data+parity is written). The log->running_ios list tracks all io_units of
> >> + * a log (for a single member disk). New io_units are added to the end of the
> >> + * list and the first io_unit is submitted, if it is not submitted already.
> >> + * The current io_unit accepting new stripes is always the last on the list.
> >> + */
> > 
> > I don't like the way that io_unit and r5l_log are reused here. The code below
> > shares very little with raid5-cache. There are a lot of fields in the data
> > structures which are not used by ppl. It's not a good practice to share data
> > structure like this way, so please define ppl its own data structure. The
> > workflow between raid5 cache and ppl might be similar, but what we really need
> > to share is the hooks to raid5 core.
> 
> Reusing the structures seemed like a good idea at first, but now I must
> admit you are right, especially since the structures were extended to
> support write-back caching. One instance of r5l_log will still be
> necessary though, to use for the hooks. Its private field will be used
> for holding the ppl structures. Is that ok?

I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
can store whatever data in the log_private. And convert all callbacks to use a
'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
data structure.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/9] md: superblock changes for PPL
  2017-02-07 21:20   ` Shaohua Li
@ 2017-02-08 11:58     ` Artur Paszkiewicz
  0 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen

On 02/07/2017 10:20 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:47PM +0100, Artur Paszkiewicz wrote:
>> Include information about PPL location and size into mdp_superblock_1
>> and copy it to/from rdev. Because PPL is mutually exclusive with bitmap,
>> put it in place of 'bitmap_offset'. Add a new flag MD_FEATURE_PPL for
>> 'feature_map', analogously to MD_FEATURE_BITMAP_OFFSET. Add MD_HAS_PPL
>> to mddev->flags to indicate that PPL is enabled on an array.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/md.c                | 15 +++++++++++++++
>>  drivers/md/md.h                |  8 ++++++++
>>  drivers/md/raid0.c             |  3 ++-
>>  drivers/md/raid1.c             |  3 ++-
>>  include/uapi/linux/raid/md_p.h | 18 ++++++++++++++----
>>  5 files changed, 41 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index 85ac98417a08..e96f73572e23 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -1566,6 +1566,12 @@ static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int minor_
>>  	} else if (sb->bblog_offset != 0)
>>  		rdev->badblocks.shift = 0;
>>  
>> +	if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL) {
>> +		rdev->ppl.offset = (__s16)le16_to_cpu(sb->ppl.offset);
>> +		rdev->ppl.size = le16_to_cpu(sb->ppl.size);
>> +		rdev->ppl.sector = rdev->sb_start + rdev->ppl.offset;
>> +	}
>> +
>>  	if (!refdev) {
>>  		ret = 1;
>>  	} else {
>> @@ -1678,6 +1684,9 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
>>  
>>  		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
>>  			set_bit(MD_HAS_JOURNAL, &mddev->flags);
>> +
>> +		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_PPL)
>> +			set_bit(MD_HAS_PPL, &mddev->flags);
> 
> I think we should check wrong configuration and bail out. For example, both
> MD_FEATURE_PPL and MD_FEATURE_BITMAP_OFFSET set, or both MD_FEATURE_PPL and
> MD_FEATURE_JOURNAL set.

Makes sense. I will add it.

Thanks,
Artur

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 4/9] raid5: calculate partial parity for a stripe
  2017-02-07 21:25   ` Shaohua Li
@ 2017-02-08 11:58     ` Artur Paszkiewicz
  0 siblings, 0 replies; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen

On 02/07/2017 10:25 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:48PM +0100, Artur Paszkiewicz wrote:
>> Attach a page for holding the partial parity data to stripe_head.
>> Allocate it only if mddev has the MD_HAS_PPL flag set.
>>
>> Partial parity is the xor of not modified data chunks of a stripe and is
>> calculated as follows:
>>
>> - reconstruct-write case:
>>   xor data from all not updated disks in a stripe
>>
>> - read-modify-write case:
>>   xor old data and parity from all updated disks in a stripe
>>
>> Implement it using the async_tx API and integrate into raid_run_ops().
>> It must be called when we still have access to old data, so do it when
>> STRIPE_OP_BIODRAIN is set, but before ops_run_prexor5(). The result is
>> stored into sh->ppl_page.
>>
>> Partial parity is not meaningful for full stripe write and is not stored
>> in the log or used for recovery, so don't attempt to calculate it when
>> stripe has STRIPE_FULL_WRITE.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  drivers/md/raid5.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.h |  2 ++
>>  2 files changed, 100 insertions(+)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index d1cba941951e..e1e238da32ba 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -466,6 +466,11 @@ static void shrink_buffers(struct stripe_head *sh)
>>  		sh->dev[i].page = NULL;
>>  		put_page(p);
>>  	}
>> +
>> +	if (sh->ppl_page) {
>> +		put_page(sh->ppl_page);
>> +		sh->ppl_page = NULL;
>> +	}
>>  }
>>  
>>  static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>> @@ -482,6 +487,13 @@ static int grow_buffers(struct stripe_head *sh, gfp_t gfp)
>>  		sh->dev[i].page = page;
>>  		sh->dev[i].orig_page = page;
>>  	}
>> +
>> +	if (test_bit(MD_HAS_PPL, &sh->raid_conf->mddev->flags)) {
> 
> I think the test should be something like:
> 	if (raid5_ppl_enabled())
> 
> Having the feature doesn't mean the feature is enabled.

This flag is set iff PPL is enabled, so this function would only check
the flag anyway. But I can add it to improve readability.

>>  	pr_debug("%s: stripe %llu locked: %d ops_request: %lx\n",
>>  		__func__, (unsigned long long)sh->sector,
>>  		s->locked, s->ops_request);
>> @@ -3105,6 +3175,34 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>>  	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
>>  		goto overlap;
>>  
>> +	if (forwrite && test_bit(MD_HAS_PPL, &conf->mddev->flags)) {
>> +		/*
>> +		 * With PPL only writes to consecutive data chunks within a
>> +		 * stripe are allowed. Not really an overlap, but
>> +		 * wait_for_overlap can be used to handle this.
>> +		 */
> 
> Please describe why the data must be consecutive.

In PPL metadata we don't store information about which drives we write
to, only the modified data range. For a single stripe_head we can only
have one PPL entry at a time, which describes one data range. This can
be improved in the future, but will require extending the PPL metadata.

Thanks,
Artur

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
  2017-02-07 21:42   ` Shaohua Li
@ 2017-02-08 11:58     ` Artur Paszkiewicz
  2017-02-08  5:34       ` Shaohua Li
  0 siblings, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen

On 02/07/2017 10:42 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
>> This implements the PPL write logging functionality, using the
>> raid5-cache policy logic introduced in previous patches. The description
>> of PPL is added to the documentation. More details can be found in the
>> comments in raid5-ppl.c.
>>
>> Put the PPL metadata structures to md_p.h because userspace tools
>> (mdadm) will also need to read/write PPL.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Documentation/admin-guide/md.rst |  53 ++++
>>  drivers/md/Makefile              |   2 +-
>>  drivers/md/raid5-cache.c         |  13 +-
>>  drivers/md/raid5-cache.h         |   8 +
>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
>>  drivers/md/raid5.c               |  15 +-
>>  include/uapi/linux/raid/md_p.h   |  26 ++
>>  7 files changed, 661 insertions(+), 7 deletions(-)
>>  create mode 100644 drivers/md/raid5-ppl.c
>>
>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>> index e449fb5f277c..7104ef757e73 100644
>> --- a/Documentation/admin-guide/md.rst
>> +++ b/Documentation/admin-guide/md.rst
>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
>>  that all auto-detected arrays are assembled as partitionable.
>>  
>> +
>> +.. _dirty_degraded_boot:
>> +
>>  Boot time assembly of degraded/dirty arrays
>>  -------------------------------------------
>>  
>> @@ -176,6 +179,56 @@ and its role in the array.
>>  Once started with RUN_ARRAY, uninitialized spares can be added with
>>  HOT_ADD_DISK.
>>  
>> +.. _ppl:
>> +
>> +Partial Parity Log
>> +------------------
> Please put this part in a separate file under Documentation/md. The admin-guide
> is supposed to serve as a doc for users. The Documentation/md files will serve
> as docs for developers.

OK, I will move it.

>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
>> +may become inconsistent with data on other member disks. If the array is also
>> +in degraded state, there is no way to recalculate parity, because one of the
>> +disks is missing. This can lead to silent data corruption when rebuilding the
>> +array or using it as degraded - data calculated from parity for array blocks
>> +that have not been touched by a write request during the unclean shutdown can
>> +be incorrect. Such a condition is known as the ``RAID5 Write Hole``. Because of
>> +this, md by default does not allow starting a dirty degraded array, see
>> +:ref:`dirty_degraded_boot`.
>> +
>> +Partial parity for a write operation is the XOR of stripe data chunks not
>> +modified by this write. It is just enough data needed for recovering from the
>> +write hole. XORing partial parity with the modified chunks produces parity for
>> +the stripe, consistent with its state before the write operation, regardless of
>> +which chunk writes have completed. If one of the not modified data disks of
>> +this stripe is missing, this updated parity can be used to recover its
>> +contents. PPL recovery is also performed when starting an array after an
>> +unclean shutdown and all disks are available, eliminating the need to resync
>> +the array. Because of this, using write-intent bitmap and PPL together is not
>> +supported.
>> +
>> +When handling a write request PPL writes partial parity before new data and
>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
>> +array member drives in the metadata area, on the parity drive of a particular
>> +stripe.  It does not require a dedicated journaling drive. Write performance is
>> +reduced by up to 30%-40% but it scales with the number of drives in the array
>> +and the journaling drive does not become a bottleneck or a single point of
>> +failure.
>> +
>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
>> +not a true journal. It does not protect from losing in-flight data, only from
>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
>> +performed for this stripe (parity is not updated). So it is possible to have
>> +arbitrary data in the written part of a stripe if that disk is lost. In such a
>> +case the behavior is the same as in plain raid5.
>> +
>> +PPL is available for md version-1 metadata and external (specifically IMSM)
>> +metadata arrays. It can be enabled using mdadm option
>> +``--consistency-policy=ppl``.
>> +
>> +Currently, volatile write-back cache should be disabled on all member drives
>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
>> +failure.
> 
> The limitation is weird and dumb. Is this an implementation limitation which
> can be fixed later or a design limitation?

This can be fixed later and I'm planning to do this. The limitation
results from the fact that partial parity is the xor of "old" data.
During recovery we xor it with "new" data and that produces valid
parity, assuming the "old" data has not changed after dirty shutdown.
But it's not necessarily true, since the "old" data might have been only
in cache and not on persistent storage. So fixing this will require
making sure the old data is on persistent media before writing PPL.
 
>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>> new file mode 100644
>> index 000000000000..6bc246c80f6b
>> --- /dev/null
>> +++ b/drivers/md/raid5-ppl.c
>> @@ -0,0 +1,551 @@
>> +/*
>> + * Partial Parity Log for closing the RAID5 write hole
>> + * Copyright (c) 2017, Intel Corporation.
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms and conditions of the GNU General Public License,
>> + * version 2, as published by the Free Software Foundation.
>> + *
>> + * This program is distributed in the hope it will be useful, but WITHOUT
>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>> + * more details.
>> + */
>> +
>> +#include <linux/kernel.h>
>> +#include <linux/blkdev.h>
>> +#include <linux/slab.h>
>> +#include <linux/crc32c.h>
>> +#include <linux/raid/md_p.h>
>> +#include "md.h"
>> +#include "raid5.h"
>> +#include "raid5-cache.h"
>> +
>> +/*
>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
>> + * partial parity data. The header contains an array of entries
>> + * (struct ppl_header_entry) which describe the logged write requests.
>> + * Partial parity for the entries comes after the header, written in the same
>> + * sequence as the entries:
>> + *
>> + * Header
>> + *   entry0
>> + *   ...
>> + *   entryN
>> + * PP data
>> + *   PP for entry0
>> + *   ...
>> + *   PP for entryN
>> + *
>> + * Every entry holds a checksum of its partial parity, the header also has a
>> + * checksum of the header itself. Entries for full stripe writes contain no
>> + * partial parity, they only mark the stripes for which parity should be
>> + * recalculated after an unclean shutdown.
>> + *
>> + * A write request is always logged to the PPL instance stored on the parity
>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
>> + * used to handle logging for this disk, independently from others. They are
>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
>> + * common parent r5l_log. This parent log serves as a proxy and is used in
>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
>> + *
>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
>> + * be appended to the last entry if the chunks to write are the same, otherwise
>> + * a new entry is added. Checksums of entries are calculated incrementally as
>> + * stripes containing partial parity are being added to entries.
>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
>> + * associated with the io_unit are released and raid5d starts writing their data
>> + * and parity. When all stripes are written, the io_unit is freed and the next
>> + * can be submitted.
>> + *
>> + * An io_unit is used to gather stripes until it is submitted or becomes full
>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
>> + * can't be submitted until the previous has completed (PPL and stripe
>> + * data+parity is written). The log->running_ios list tracks all io_units of
>> + * a log (for a single member disk). New io_units are added to the end of the
>> + * list and the first io_unit is submitted, if it is not submitted already.
>> + * The current io_unit accepting new stripes is always the last on the list.
>> + */
> 
> I don't like the way that io_unit and r5l_log are reused here. The code below
> shares very little with raid5-cache. There are a lot of fields in the data
> structures which are not used by ppl. It's not a good practice to share data
> structure like this way, so please define ppl its own data structure. The
> workflow between raid5 cache and ppl might be similar, but what we really need
> to share is the hooks to raid5 core.

Reusing the structures seemed like a good idea at first, but now I must
admit you are right, especially since the structures were extended to
support write-back caching. One instance of r5l_log will still be
necessary though, to use for the hooks. Its private field will be used
for holding the ppl structures. Is that ok? 

Thanks,
Artur

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 6/9] md: add sysfs entries for PPL
  2017-02-07 21:49   ` Shaohua Li
@ 2017-02-08 11:58     ` Artur Paszkiewicz
  2017-02-08 18:14       ` Shaohua Li
  0 siblings, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-02-08 11:58 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen

On 02/07/2017 10:49 PM, Shaohua Li wrote:
> On Mon, Jan 30, 2017 at 07:59:50PM +0100, Artur Paszkiewicz wrote:
>> Add 'consistency_policy' attribute for array. It indicates how the array
>> maintains consistency in case of unexpected shutdown.
>>
>> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
>> and size of the PPL space on the device.
>>
>> These attributes are writable to allow enabling PPL for external
>> metadata arrays and (later) to enable/disable PPL for a running array.
>>
>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>> ---
>>  Documentation/admin-guide/md.rst |  32 ++++++++++-
>>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 144 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>> index 7104ef757e73..2c153b3d798f 100644
>> --- a/Documentation/admin-guide/md.rst
>> +++ b/Documentation/admin-guide/md.rst
>> @@ -329,14 +329,14 @@ All md devices contain:
>>       array creation it will default to 0, though starting the array as
>>       ``clean`` will set it much larger.
>>  
>> -   new_dev
>> +  new_dev
>>       This file can be written but not read.  The value written should
>>       be a block device number as major:minor.  e.g. 8:0
>>       This will cause that device to be attached to the array, if it is
>>       available.  It will then appear at md/dev-XXX (depending on the
>>       name of the device) and further configuration is then possible.
>>  
>> -   safe_mode_delay
>> +  safe_mode_delay
>>       When an md array has seen no write requests for a certain period
>>       of time, it will be marked as ``clean``.  When another write
>>       request arrives, the array is marked as ``dirty`` before the write
>> @@ -345,7 +345,7 @@ All md devices contain:
>>       period as a number of seconds.  The default is 200msec (0.200).
>>       Writing a value of 0 disables safemode.
>>  
>> -   array_state
>> +  array_state
>>       This file contains a single word which describes the current
>>       state of the array.  In many cases, the state can be set by
>>       writing the word for the desired state, however some states
>> @@ -454,7 +454,30 @@ All md devices contain:
>>       once the array becomes non-degraded, and this fact has been
>>       recorded in the metadata.
>>  
>> +  consistency_policy
>> +     This indicates how the array maintains consistency in case of unexpected
>> +     shutdown. It can be:
>>  
>> +     none
>> +       Array has no redundancy information, e.g. raid0, linear.
>> +
>> +     resync
>> +       Full resync is performed and all redundancy is regenerated when the
>> +       array is started after unclean shutdown.
>> +
>> +     bitmap
>> +       Resync assisted by a write-intent bitmap.
>> +
>> +     journal
>> +       For raid4/5/6, journal device is used to log transactions and replay
>> +       after unclean shutdown.
>> +
>> +     ppl
>> +       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
>> +       resync.
>> +
>> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
>> +     used to enable and disable PPL.
>>  
>>  
>>  As component devices are added to an md array, they appear in the ``md``
>> @@ -616,6 +639,9 @@ Each directory contains:
>>  	adds bad blocks without acknowledging them. This is largely
>>  	for testing.
>>  
>> +      ppl_sector, ppl_size
>> +        Location and size (in sectors) of the space used for Partial Parity Log
>> +        on this device.
>>  
>>  
>>  An active md device will also contain an entry for each active device
>> diff --git a/drivers/md/md.c b/drivers/md/md.c
>> index e96f73572e23..47ab3afe87cb 100644
>> --- a/drivers/md/md.c
>> +++ b/drivers/md/md.c
>> @@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
>>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
>>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
>>  
>> +static ssize_t
>> +ppl_sector_show(struct md_rdev *rdev, char *page)
>> +{
>> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
>> +}
>> +
>> +static ssize_t
>> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
>> +{
>> +	unsigned long long sector;
>> +
>> +	if (kstrtoull(buf, 10, &sector) < 0)
>> +		return -EINVAL;
>> +	if (sector != (sector_t)sector)
>> +	        return -EINVAL;
>> +
>> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
>> +	    rdev->raid_disk >= 0)
>> +		return -EBUSY;
>> +
>> +	if (rdev->mddev->persistent) {
>> +		if (rdev->mddev->major_version == 0)
>> +			return -EINVAL;
>> +		if ((sector > rdev->sb_start &&
>> +		     sector - rdev->sb_start > S16_MAX) ||
>> +		    (sector < rdev->sb_start &&
>> +		     rdev->sb_start - sector > -S16_MIN))
>> +			return -EINVAL;
>> +		rdev->ppl.offset = sector - rdev->sb_start;
> 
> Why do we allow non-external array changes the ppl log position and size?

To support enabling ppl for a running array. Mdadm should not change the
metadata if the array is running, so we can't read the ppl position and
size from the superblock. This is similar to adding an internal bitmap
with mdadm --grow --bitmap=internal - mdadm writes the bitmap offset to
"bitmap/location" in sysfs.

>> +	} else if (!rdev->mddev->external) {
>> +		return -EBUSY;
>> +	}
>> +	rdev->ppl.sector = sector;
>> +	return len;
>> +}
>> +
>> +static struct rdev_sysfs_entry rdev_ppl_sector =
>> +__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
>> +
>> +static ssize_t
>> +ppl_size_show(struct md_rdev *rdev, char *page)
>> +{
>> +	return sprintf(page, "%u\n", rdev->ppl.size);
>> +}
>> +
>> +static ssize_t
>> +ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
>> +{
>> +	unsigned int size;
>> +
>> +	if (kstrtouint(buf, 10, &size) < 0)
>> +		return -EINVAL;
>> +
>> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
>> +	    rdev->raid_disk >= 0)
>> +		return -EBUSY;
>> +
>> +	if (rdev->mddev->persistent) {
>> +		if (rdev->mddev->major_version == 0)
>> +			return -EINVAL;
>> +		if (size > U16_MAX)
>> +			return -EINVAL;
>> +	} else if (!rdev->mddev->external) {
>> +		return -EBUSY;
>> +	}
>> +	rdev->ppl.size = size;
>> +	return len;
>> +}
>> +
>> +static struct rdev_sysfs_entry rdev_ppl_size =
>> +__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
>> +
>>  static struct attribute *rdev_default_attrs[] = {
>>  	&rdev_state.attr,
>>  	&rdev_errors.attr,
>> @@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
>>  	&rdev_recovery_start.attr,
>>  	&rdev_bad_blocks.attr,
>>  	&rdev_unack_bad_blocks.attr,
>> +	&rdev_ppl_sector.attr,
>> +	&rdev_ppl_size.attr,
>>  	NULL,
>>  };
>>  static ssize_t
>> @@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
>>  __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
>>         array_size_store);
>>  
>> +static ssize_t
>> +consistency_policy_show(struct mddev *mddev, char *page)
>> +{
>> +	int ret;
>> +
>> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
>> +		ret = sprintf(page, "journal\n");
>> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
>> +		ret = sprintf(page, "ppl\n");
>> +	} else if (mddev->bitmap) {
>> +		ret = sprintf(page, "bitmap\n");
>> +	} else if (mddev->pers) {
>> +		if (mddev->pers->sync_request)
>> +			ret = sprintf(page, "resync\n");
>> +		else
>> +			ret = sprintf(page, "none\n");
>> +	} else {
>> +		ret = sprintf(page, "unknown\n");
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +static ssize_t
>> +consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
>> +{
>> +	if (mddev->pers) {
>> +		return -EBUSY;
>> +	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
>> +		set_bit(MD_HAS_PPL, &mddev->flags);
>> +		return len;
> 
> Here we should check if ppl position and size are valid before we set the flag.
> 
> And shouldn't we move this part to .change_consistency_policy added in patch 9?
> This is a consistency policy change too.

Yes, but here the array is not started yet and we don't have the
personality for which to call .change_consistency_policy. We are not
changing the consistency policy here, but rather setting it initially
for an external metadata array.

Thanks,
Artur


* Re: [PATCH v3 6/9] md: add sysfs entries for PPL
  2017-02-08 11:58     ` Artur Paszkiewicz
@ 2017-02-08 18:14       ` Shaohua Li
  0 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2017-02-08 18:14 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen

On Wed, Feb 08, 2017 at 12:58:51PM +0100, Artur Paszkiewicz wrote:
> On 02/07/2017 10:49 PM, Shaohua Li wrote:
> > On Mon, Jan 30, 2017 at 07:59:50PM +0100, Artur Paszkiewicz wrote:
> >> Add 'consistency_policy' attribute for array. It indicates how the array
> >> maintains consistency in case of unexpected shutdown.
> >>
> >> Add 'ppl_sector' and 'ppl_size' for rdev, which describe the location
> >> and size of the PPL space on the device.
> >>
> >> These attributes are writable to allow enabling PPL for external
> >> metadata arrays and (later) to enable/disable PPL for a running array.
> >>
> >> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >> ---
> >>  Documentation/admin-guide/md.rst |  32 ++++++++++-
> >>  drivers/md/md.c                  | 115 +++++++++++++++++++++++++++++++++++++++
> >>  2 files changed, 144 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >> index 7104ef757e73..2c153b3d798f 100644
> >> --- a/Documentation/admin-guide/md.rst
> >> +++ b/Documentation/admin-guide/md.rst
> >> @@ -329,14 +329,14 @@ All md devices contain:
> >>       array creation it will default to 0, though starting the array as
> >>       ``clean`` will set it much larger.
> >>  
> >> -   new_dev
> >> +  new_dev
> >>       This file can be written but not read.  The value written should
> >>       be a block device number as major:minor.  e.g. 8:0
> >>       This will cause that device to be attached to the array, if it is
> >>       available.  It will then appear at md/dev-XXX (depending on the
> >>       name of the device) and further configuration is then possible.
> >>  
> >> -   safe_mode_delay
> >> +  safe_mode_delay
> >>       When an md array has seen no write requests for a certain period
> >>       of time, it will be marked as ``clean``.  When another write
> >>       request arrives, the array is marked as ``dirty`` before the write
> >> @@ -345,7 +345,7 @@ All md devices contain:
> >>       period as a number of seconds.  The default is 200msec (0.200).
> >>       Writing a value of 0 disables safemode.
> >>  
> >> -   array_state
> >> +  array_state
> >>       This file contains a single word which describes the current
> >>       state of the array.  In many cases, the state can be set by
> >>       writing the word for the desired state, however some states
> >> @@ -454,7 +454,30 @@ All md devices contain:
> >>       once the array becomes non-degraded, and this fact has been
> >>       recorded in the metadata.
> >>  
> >> +  consistency_policy
> >> +     This indicates how the array maintains consistency in case of unexpected
> >> +     shutdown. It can be:
> >>  
> >> +     none
> >> +       Array has no redundancy information, e.g. raid0, linear.
> >> +
> >> +     resync
> >> +       Full resync is performed and all redundancy is regenerated when the
> >> +       array is started after unclean shutdown.
> >> +
> >> +     bitmap
> >> +       Resync assisted by a write-intent bitmap.
> >> +
> >> +     journal
> >> +       For raid4/5/6, journal device is used to log transactions and replay
> >> +       after unclean shutdown.
> >> +
> >> +     ppl
> >> +       For raid5 only, :ref:`ppl` is used to close the write hole and eliminate
> >> +       resync.
> >> +
> >> +     The accepted values when writing to this file are ``ppl`` and ``resync``,
> >> +     used to enable and disable PPL.
> >>  
> >>  
> >>  As component devices are added to an md array, they appear in the ``md``
> >> @@ -616,6 +639,9 @@ Each directory contains:
> >>  	adds bad blocks without acknowledging them. This is largely
> >>  	for testing.
> >>  
> >> +      ppl_sector, ppl_size
> >> +        Location and size (in sectors) of the space used for Partial Parity Log
> >> +        on this device.
> >>  
> >>  
> >>  An active md device will also contain an entry for each active device
> >> diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> index e96f73572e23..47ab3afe87cb 100644
> >> --- a/drivers/md/md.c
> >> +++ b/drivers/md/md.c
> >> @@ -3205,6 +3205,78 @@ static ssize_t ubb_store(struct md_rdev *rdev, const char *page, size_t len)
> >>  static struct rdev_sysfs_entry rdev_unack_bad_blocks =
> >>  __ATTR(unacknowledged_bad_blocks, S_IRUGO|S_IWUSR, ubb_show, ubb_store);
> >>  
> >> +static ssize_t
> >> +ppl_sector_show(struct md_rdev *rdev, char *page)
> >> +{
> >> +	return sprintf(page, "%llu\n", (unsigned long long)rdev->ppl.sector);
> >> +}
> >> +
> >> +static ssize_t
> >> +ppl_sector_store(struct md_rdev *rdev, const char *buf, size_t len)
> >> +{
> >> +	unsigned long long sector;
> >> +
> >> +	if (kstrtoull(buf, 10, &sector) < 0)
> >> +		return -EINVAL;
> >> +	if (sector != (sector_t)sector)
> >> +	        return -EINVAL;
> >> +
> >> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> >> +	    rdev->raid_disk >= 0)
> >> +		return -EBUSY;
> >> +
> >> +	if (rdev->mddev->persistent) {
> >> +		if (rdev->mddev->major_version == 0)
> >> +			return -EINVAL;
> >> +		if ((sector > rdev->sb_start &&
> >> +		     sector - rdev->sb_start > S16_MAX) ||
> >> +		    (sector < rdev->sb_start &&
> >> +		     rdev->sb_start - sector > -S16_MIN))
> >> +			return -EINVAL;
> >> +		rdev->ppl.offset = sector - rdev->sb_start;
> > 
> > Why do we allow non-external array changes the ppl log position and size?
> 
> To support enabling ppl for a running array. Mdadm should not change the
> metadata if the array is running, so we can't read the ppl position and
> size from the superblock. This is similar to adding an internal bitmap
> with mdadm --grow --bitmap=internal - mdadm writes the bitmap offset to
> "bitmap/location" in sysfs.

So this is to add ppl support at runtime, right? I think more sanity checks
should be done here to make sure the input is valid. For example, that the
position/size don't overlap with data and that a bitmap isn't enabled yet.

> >> +	} else if (!rdev->mddev->external) {
> >> +		return -EBUSY;
> >> +	}
> >> +	rdev->ppl.sector = sector;
> >> +	return len;
> >> +}
> >> +
> >> +static struct rdev_sysfs_entry rdev_ppl_sector =
> >> +__ATTR(ppl_sector, S_IRUGO|S_IWUSR, ppl_sector_show, ppl_sector_store);
> >> +
> >> +static ssize_t
> >> +ppl_size_show(struct md_rdev *rdev, char *page)
> >> +{
> >> +	return sprintf(page, "%u\n", rdev->ppl.size);
> >> +}
> >> +
> >> +static ssize_t
> >> +ppl_size_store(struct md_rdev *rdev, const char *buf, size_t len)
> >> +{
> >> +	unsigned int size;
> >> +
> >> +	if (kstrtouint(buf, 10, &size) < 0)
> >> +		return -EINVAL;
> >> +
> >> +	if (rdev->mddev->pers && test_bit(MD_HAS_PPL, &rdev->mddev->flags) &&
> >> +	    rdev->raid_disk >= 0)
> >> +		return -EBUSY;
> >> +
> >> +	if (rdev->mddev->persistent) {
> >> +		if (rdev->mddev->major_version == 0)
> >> +			return -EINVAL;
> >> +		if (size > U16_MAX)
> >> +			return -EINVAL;
> >> +	} else if (!rdev->mddev->external) {
> >> +		return -EBUSY;
> >> +	}
> >> +	rdev->ppl.size = size;
> >> +	return len;
> >> +}
> >> +
> >> +static struct rdev_sysfs_entry rdev_ppl_size =
> >> +__ATTR(ppl_size, S_IRUGO|S_IWUSR, ppl_size_show, ppl_size_store);
> >> +
> >>  static struct attribute *rdev_default_attrs[] = {
> >>  	&rdev_state.attr,
> >>  	&rdev_errors.attr,
> >> @@ -3215,6 +3287,8 @@ static struct attribute *rdev_default_attrs[] = {
> >>  	&rdev_recovery_start.attr,
> >>  	&rdev_bad_blocks.attr,
> >>  	&rdev_unack_bad_blocks.attr,
> >> +	&rdev_ppl_sector.attr,
> >> +	&rdev_ppl_size.attr,
> >>  	NULL,
> >>  };
> >>  static ssize_t
> >> @@ -4951,6 +5025,46 @@ static struct md_sysfs_entry md_array_size =
> >>  __ATTR(array_size, S_IRUGO|S_IWUSR, array_size_show,
> >>         array_size_store);
> >>  
> >> +static ssize_t
> >> +consistency_policy_show(struct mddev *mddev, char *page)
> >> +{
> >> +	int ret;
> >> +
> >> +	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
> >> +		ret = sprintf(page, "journal\n");
> >> +	} else if (test_bit(MD_HAS_PPL, &mddev->flags)) {
> >> +		ret = sprintf(page, "ppl\n");
> >> +	} else if (mddev->bitmap) {
> >> +		ret = sprintf(page, "bitmap\n");
> >> +	} else if (mddev->pers) {
> >> +		if (mddev->pers->sync_request)
> >> +			ret = sprintf(page, "resync\n");
> >> +		else
> >> +			ret = sprintf(page, "none\n");
> >> +	} else {
> >> +		ret = sprintf(page, "unknown\n");
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static ssize_t
> >> +consistency_policy_store(struct mddev *mddev, const char *buf, size_t len)
> >> +{
> >> +	if (mddev->pers) {
> >> +		return -EBUSY;
> >> +	} else if (mddev->external && strncmp(buf, "ppl", 3) == 0) {
> >> +		set_bit(MD_HAS_PPL, &mddev->flags);
> >> +		return len;
> > 
> > Here we should check if ppl position and size are valid before we set the flag.
> > 
> > And shouldn't we move this part to .change_consistency_policy added in patch 9?
> > This is a consistency policy change too.
> 
> Yes, but here the array is not started yet and we don't have the
> personality for which to call .change_consistency_policy. We are not
> changing the consistency policy here, but rather setting it initially
> for an external metadata array.

Ok, makes sense.

Thanks,
Shaohua


* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
  2017-02-08  5:34       ` Shaohua Li
@ 2017-02-09 15:35         ` Artur Paszkiewicz
  2017-02-09 17:09           ` Shaohua Li
  2017-02-09 16:06         ` Wols Lists
  1 sibling, 1 reply; 25+ messages in thread
From: Artur Paszkiewicz @ 2017-02-09 15:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, shli, neilb, jes.sorensen

On 02/08/2017 06:34 AM, Shaohua Li wrote:
> On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
>> On 02/07/2017 10:42 PM, Shaohua Li wrote:
>>> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
>>>> This implements the PPL write logging functionality, using the
>>>> raid5-cache policy logic introduced in previous patches. The description
>>>> of PPL is added to the documentation. More details can be found in the
>>>> comments in raid5-ppl.c.
>>>>
>>>> Put the PPL metadata structures to md_p.h because userspace tools
>>>> (mdadm) will also need to read/write PPL.
>>>>
>>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
>>>> ---
>>>>  Documentation/admin-guide/md.rst |  53 ++++
>>>>  drivers/md/Makefile              |   2 +-
>>>>  drivers/md/raid5-cache.c         |  13 +-
>>>>  drivers/md/raid5-cache.h         |   8 +
>>>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
>>>>  drivers/md/raid5.c               |  15 +-
>>>>  include/uapi/linux/raid/md_p.h   |  26 ++
>>>>  7 files changed, 661 insertions(+), 7 deletions(-)
>>>>  create mode 100644 drivers/md/raid5-ppl.c
>>>>
>>>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
>>>> index e449fb5f277c..7104ef757e73 100644
>>>> --- a/Documentation/admin-guide/md.rst
>>>> +++ b/Documentation/admin-guide/md.rst
>>>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
>>>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
>>>>  that all auto-detected arrays are assembled as partitionable.
>>>>  
>>>> +
>>>> +.. _dirty_degraded_boot:
>>>> +
>>>>  Boot time assembly of degraded/dirty arrays
>>>>  -------------------------------------------
>>>>  
>>>> @@ -176,6 +179,56 @@ and its role in the array.
>>>>  Once started with RUN_ARRAY, uninitialized spares can be added with
>>>>  HOT_ADD_DISK.
>>>>  
>>>> +.. _ppl:
>>>> +
>>>> +Partial Parity Log
>>>> +------------------
>>> Please put this part to a separate file under Documentation/md. The admin-guide
>>> is supposed to serve as a doc for users. The Documentation/md files will serve
>>> as a doc for developers.
>>
>> OK, I will move it.
>>
>>>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
>>>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
>>>> +may become inconsistent with data on other member disks. If the array is also
>>>> +in a degraded state, there is no way to recalculate parity, because one of the
>>>> +disks is missing. This can lead to silent data corruption when rebuilding the
>>>> +array or using it as degraded - data calculated from parity for array blocks
>>>> +that have not been touched by a write request during the unclean shutdown can
>>>> +be incorrect. Such a condition is known as the ``RAID5 Write Hole``. Because of
>>>> +this, md by default does not allow starting a dirty degraded array, see
>>>> +:ref:`dirty_degraded_boot`.
>>>> +
>>>> +Partial parity for a write operation is the XOR of stripe data chunks not
>>>> +modified by this write. It is just enough data needed for recovering from the
>>>> +write hole. XORing partial parity with the modified chunks produces parity for
>>>> +the stripe, consistent with its state before the write operation, regardless of
>>>> +which chunk writes have completed. If one of the unmodified data disks of
>>>> +this stripe is missing, this updated parity can be used to recover its
>>>> +contents. PPL recovery is also performed when starting an array after an
>>>> +unclean shutdown and all disks are available, eliminating the need to resync
>>>> +the array. Because of this, using write-intent bitmap and PPL together is not
>>>> +supported.
>>>> +
>>>> +When handling a write request PPL writes partial parity before new data and
>>>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
>>>> +array member drives in the metadata area, on the parity drive of a particular
>>>> +stripe.  It does not require a dedicated journaling drive. Write performance is
>>>> +reduced by up to 30%-40% but it scales with the number of drives in the array
>>>> +and the journaling drive does not become a bottleneck or a single point of
>>>> +failure.
>>>> +
>>>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
>>>> +not a true journal. It does not protect from losing in-flight data, only from
>>>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
>>>> +performed for this stripe (parity is not updated). So it is possible to have
>>>> +arbitrary data in the written part of a stripe if that disk is lost. In such
>>>> +case the behavior is the same as in plain raid5.
>>>> +
>>>> +PPL is available for md version-1 metadata and external (specifically IMSM)
>>>> +metadata arrays. It can be enabled using mdadm option
>>>> +``--consistency-policy=ppl``.
>>>> +
>>>> +Currently, volatile write-back cache should be disabled on all member drives
>>>> +when using PPL. Otherwise it cannot guarantee consistency in case of power
>>>> +failure.
>>>
>>> The limitation is weird and dumb. Is this an implementation limitation which
>>> can be fixed later or a design limitation?
>>
>> This can be fixed later and I'm planning to do this. The limitation
>> results from the fact that partial parity is the xor of "old" data.
>> During recovery we xor it with "new" data and that produces valid
>> parity, assuming the "old" data has not changed after dirty shutdown.
>> But it's not necessarily true, since the "old" data might have been only
>> in cache and not on persistent storage. So fixing this will require
>> making sure the old data is on persistent media before writing PPL.
> 
> so it's the implementation which doesn't flush disk cache before writing new
> data to disks, right? That makes sense then. I think we should fix this soon.
> Don't think people will (or remember to) disable write-back cache before using
> ppl.
> 
>>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
>>>> new file mode 100644
>>>> index 000000000000..6bc246c80f6b
>>>> --- /dev/null
>>>> +++ b/drivers/md/raid5-ppl.c
>>>> @@ -0,0 +1,551 @@
>>>> +/*
>>>> + * Partial Parity Log for closing the RAID5 write hole
>>>> + * Copyright (c) 2017, Intel Corporation.
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or modify it
>>>> + * under the terms and conditions of the GNU General Public License,
>>>> + * version 2, as published by the Free Software Foundation.
>>>> + *
>>>> + * This program is distributed in the hope it will be useful, but WITHOUT
>>>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
>>>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
>>>> + * more details.
>>>> + */
>>>> +
>>>> +#include <linux/kernel.h>
>>>> +#include <linux/blkdev.h>
>>>> +#include <linux/slab.h>
>>>> +#include <linux/crc32c.h>
>>>> +#include <linux/raid/md_p.h>
>>>> +#include "md.h"
>>>> +#include "raid5.h"
>>>> +#include "raid5-cache.h"
>>>> +
>>>> +/*
>>>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
>>>> + * partial parity data. The header contains an array of entries
>>>> + * (struct ppl_header_entry) which describe the logged write requests.
>>>> + * Partial parity for the entries comes after the header, written in the same
>>>> + * sequence as the entries:
>>>> + *
>>>> + * Header
>>>> + *   entry0
>>>> + *   ...
>>>> + *   entryN
>>>> + * PP data
>>>> + *   PP for entry0
>>>> + *   ...
>>>> + *   PP for entryN
>>>> + *
>>>> + * Every entry holds a checksum of its partial parity, the header also has a
>>>> + * checksum of the header itself. Entries for full stripes writes contain no
>>>> + * partial parity, they only mark the stripes for which parity should be
>>>> + * recalculated after an unclean shutdown.
>>>> + *
>>>> + * A write request is always logged to the PPL instance stored on the parity
>>>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
>>>> + * used to handle logging for this disk, independently from others. They are
>>>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
>>>> + * common parent r5l_log. This parent log serves as a proxy and is used in
>>>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
>>>> + *
>>>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
>>>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
>>>> + * be appended to the last entry if the chunks to write are the same, otherwise
>>>> + * a new entry is added. Checksums of entries are calculated incrementally as
>>>> + * stripes containing partial parity are being added to entries.
>>>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
>>>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
>>>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
>>>> + * associated with the io_unit are released and raid5d starts writing their data
>>>> + * and parity. When all stripes are written, the io_unit is freed and the next
>>>> + * can be submitted.
>>>> + *
>>>> + * An io_unit is used to gather stripes until it is submitted or becomes full
>>>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
>>>> + * can't be submitted until the previous has completed (PPL and stripe
>>>> + * data+parity is written). The log->running_ios list tracks all io_units of
>>>> + * a log (for a single member disk). New io_units are added to the end of the
>>>> + * list and the first io_unit is submitted, if it is not submitted already.
>>>> + * The current io_unit accepting new stripes is always the last on the list.
>>>> + */
>>>
>>> I don't like the way that io_unit and r5l_log are reused here. The code below
>>> shares very little with raid5-cache. There are a lot of fields in the data
>>> structures which are not used by ppl. It's not a good practice to share data
>>> structure like this way, so please define ppl its own data structure. The
>>> workflow between raid5 cache and ppl might be similar, but what we really need
>>> to share is the hooks to raid5 core.
>>
>> Reusing the structures seemed like a good idea at first, but now I must
>> admit you are right, especially since the structures were extended to
>> support write-back caching. One instance of r5l_log will still be
>> necessary though, to use for the hooks. Its private field will be used
>> for holding the ppl structures. Is that ok?
> 
> I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
> can store whatever data in the log_private. And convert all callbacks to use a
> 'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
> data structure.

This way 'struct r5l_policy *policy' will also have to be moved from
r5l_log to r5conf. Maybe it will be better if instead of adding
log_private and moving this 'policy' field we add a dedicated field for
ppl in r5conf, something like 'struct ppl_conf *ppl', and eliminate
r5l_policy? From raid5 core we could just call corresponding functions
depending on which pointer is set (log or ppl). This way raid5-cache.c
won't have to be changed at all. It seems appropriate to separate
raid5-cache and ppl if they won't be sharing any data structures.

Thanks,
Artur


* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
  2017-02-08  5:34       ` Shaohua Li
  2017-02-09 15:35         ` Artur Paszkiewicz
@ 2017-02-09 16:06         ` Wols Lists
  2017-02-09 17:13           ` Shaohua Li
  1 sibling, 1 reply; 25+ messages in thread
From: Wols Lists @ 2017-02-09 16:06 UTC (permalink / raw)
  To: Shaohua Li, Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen

On 08/02/17 05:34, Shaohua Li wrote:
> so it's the implementation which doesn't flush disk cache before writing new
> data to disks, right? That makes sense then. I think we should fix this soon.
> Don't think people will (or remember to) disable write-back cache before using
> ppl.

Dunno whether I'm making a fool of myself :-) but might a quick-n-dirty
workaround be to make enabling PPL also disable write-back cache? With
maybe a warning not to switch it back on?

Cheers,
Wol


* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
  2017-02-09 15:35         ` Artur Paszkiewicz
@ 2017-02-09 17:09           ` Shaohua Li
  0 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2017-02-09 17:09 UTC (permalink / raw)
  To: Artur Paszkiewicz; +Cc: linux-raid, shli, neilb, jes.sorensen

On Thu, Feb 09, 2017 at 04:35:34PM +0100, Artur Paszkiewicz wrote:
> On 02/08/2017 06:34 AM, Shaohua Li wrote:
> > On Wed, Feb 08, 2017 at 12:58:42PM +0100, Artur Paszkiewicz wrote:
> >> On 02/07/2017 10:42 PM, Shaohua Li wrote:
> >>> On Mon, Jan 30, 2017 at 07:59:49PM +0100, Artur Paszkiewicz wrote:
> >>>> This implements the PPL write logging functionality, using the
> >>>> raid5-cache policy logic introduced in previous patches. The description
> >>>> of PPL is added to the documentation. More details can be found in the
> >>>> comments in raid5-ppl.c.
> >>>>
> >>>> Put the PPL metadata structures to md_p.h because userspace tools
> >>>> (mdadm) will also need to read/write PPL.
> >>>>
> >>>> Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
> >>>> ---
> >>>>  Documentation/admin-guide/md.rst |  53 ++++
> >>>>  drivers/md/Makefile              |   2 +-
> >>>>  drivers/md/raid5-cache.c         |  13 +-
> >>>>  drivers/md/raid5-cache.h         |   8 +
> >>>>  drivers/md/raid5-ppl.c           | 551 +++++++++++++++++++++++++++++++++++++++
> >>>>  drivers/md/raid5.c               |  15 +-
> >>>>  include/uapi/linux/raid/md_p.h   |  26 ++
> >>>>  7 files changed, 661 insertions(+), 7 deletions(-)
> >>>>  create mode 100644 drivers/md/raid5-ppl.c
> >>>>
> >>>> diff --git a/Documentation/admin-guide/md.rst b/Documentation/admin-guide/md.rst
> >>>> index e449fb5f277c..7104ef757e73 100644
> >>>> --- a/Documentation/admin-guide/md.rst
> >>>> +++ b/Documentation/admin-guide/md.rst
> >>>> @@ -86,6 +86,9 @@ superblock can be autodetected and run at boot time.
> >>>>  The kernel parameter ``raid=partitionable`` (or ``raid=part``) means
> >>>>  that all auto-detected arrays are assembled as partitionable.
> >>>>  
> >>>> +
> >>>> +.. _dirty_degraded_boot:
> >>>> +
> >>>>  Boot time assembly of degraded/dirty arrays
> >>>>  -------------------------------------------
> >>>>  
> >>>> @@ -176,6 +179,56 @@ and its role in the array.
> >>>>  Once started with RUN_ARRAY, uninitialized spares can be added with
> >>>>  HOT_ADD_DISK.
> >>>>  
> >>>> +.. _ppl:
> >>>> +
> >>>> +Partial Parity Log
> >>>> +------------------
> >>> Please put this part to a separate file under Documentation/md. The admin-guide
> >>> is supposed to serve as a doc for users. The Documentation/md files will serve
> >>> as a doc for developers.
> >>
> >> OK, I will move it.
> >>
> >>>> +Partial Parity Log (PPL) is a feature available for RAID5 arrays. The issue
> >>>> +addressed by PPL is that after a dirty shutdown, parity of a particular stripe
> >>>> +may become inconsistent with data on other member disks. If the array is also
> >>>> +in a degraded state, there is no way to recalculate parity, because one of the
> >>>> +disks is missing. This can lead to silent data corruption when rebuilding the
> >>>> +array or using it as degraded - data calculated from parity for array blocks
> >>>> +that have not been touched by a write request during the unclean shutdown can
> >>>> +be incorrect. Such a condition is known as the ``RAID5 Write Hole``. Because of
> >>>> +this, md by default does not allow starting a dirty degraded array, see
> >>>> +:ref:`dirty_degraded_boot`.
> >>>> +
> >>>> +Partial parity for a write operation is the XOR of stripe data chunks not
> >>>> +modified by this write. It is just enough data needed for recovering from the
> >>>> +write hole. XORing partial parity with the modified chunks produces parity for
> >>>> +the stripe, consistent with its state before the write operation, regardless of
> >>>> +which chunk writes have completed. If one of the unmodified data disks of
> >>>> +this stripe is missing, this updated parity can be used to recover its
> >>>> +contents. PPL recovery is also performed when starting an array after an
> >>>> +unclean shutdown and all disks are available, eliminating the need to resync
> >>>> +the array. Because of this, using write-intent bitmap and PPL together is not
> >>>> +supported.
> >>>> +
> >>>> +When handling a write request, PPL writes partial parity before new data and
> >>>> +parity are dispatched to disks. PPL is a distributed log - it is stored on
> >>>> +array member drives in the metadata area, on the parity drive of a particular
> >>>> +stripe.  It does not require a dedicated journaling drive. Write performance is
> >>>> +reduced by up to 30%-40% but it scales with the number of drives in the array
> >>>> +and the journaling drive does not become a bottleneck or a single point of
> >>>> +failure.
> >>>> +
> >>>> +Unlike raid5-cache, the other solution in md for closing the write hole, PPL is
> >>>> +not a true journal. It does not protect from losing in-flight data, only from
> >>>> +silent data corruption. If a dirty disk of a stripe is lost, no PPL recovery is
> >>>> +performed for this stripe (parity is not updated). So it is possible to have
> >>>> +arbitrary data in the written part of a stripe if that disk is lost. In such a
> >>>> +case the behavior is the same as in plain raid5.
> >>>> +
> >>>> +PPL is available for md version-1 metadata and external (specifically IMSM)
> >>>> +metadata arrays. It can be enabled using mdadm option
> >>>> +``--consistency-policy=ppl``.
> >>>> +
> >>>> +Currently, the volatile write-back cache should be disabled on all member
> >>>> +drives when using PPL. Otherwise PPL cannot guarantee consistency in case of
> >>>> +power failure.
> >>>
> >>> The limitation is weird and dumb. Is this an implementation limitation which
> >>> can be fixed later or a design limitation?
> >>
> >> This can be fixed later and I'm planning to do this. The limitation
> >> results from the fact that partial parity is the xor of "old" data.
> >> During recovery we xor it with "new" data and that produces valid
> >> parity, assuming the "old" data has not changed after dirty shutdown.
> >> But it's not necessarily true, since the "old" data might have been only
> >> in cache and not on persistent storage. So fixing this will require
> >> making sure the old data is on persistent media before writing PPL.
> > 
> > so it's the implementation which doesn't flush the disk cache before writing new
> > data to disks, right? That makes sense then. I think we should fix this soon.
> > Don't think people will (or remember to) disable write-back cache before using
> > ppl.
> > 
> >>>> diff --git a/drivers/md/raid5-ppl.c b/drivers/md/raid5-ppl.c
> >>>> new file mode 100644
> >>>> index 000000000000..6bc246c80f6b
> >>>> --- /dev/null
> >>>> +++ b/drivers/md/raid5-ppl.c
> >>>> @@ -0,0 +1,551 @@
> >>>> +/*
> >>>> + * Partial Parity Log for closing the RAID5 write hole
> >>>> + * Copyright (c) 2017, Intel Corporation.
> >>>> + *
> >>>> + * This program is free software; you can redistribute it and/or modify it
> >>>> + * under the terms and conditions of the GNU General Public License,
> >>>> + * version 2, as published by the Free Software Foundation.
> >>>> + *
> >>>> + * This program is distributed in the hope it will be useful, but WITHOUT
> >>>> + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
> >>>> + * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
> >>>> + * more details.
> >>>> + */
> >>>> +
> >>>> +#include <linux/kernel.h>
> >>>> +#include <linux/blkdev.h>
> >>>> +#include <linux/slab.h>
> >>>> +#include <linux/crc32c.h>
> >>>> +#include <linux/raid/md_p.h>
> >>>> +#include "md.h"
> >>>> +#include "raid5.h"
> >>>> +#include "raid5-cache.h"
> >>>> +
> >>>> +/*
> >>>> + * PPL consists of a 4KB header (struct ppl_header) and at least 128KB for
> >>>> + * partial parity data. The header contains an array of entries
> >>>> + * (struct ppl_header_entry) which describe the logged write requests.
> >>>> + * Partial parity for the entries comes after the header, written in the same
> >>>> + * sequence as the entries:
> >>>> + *
> >>>> + * Header
> >>>> + *   entry0
> >>>> + *   ...
> >>>> + *   entryN
> >>>> + * PP data
> >>>> + *   PP for entry0
> >>>> + *   ...
> >>>> + *   PP for entryN
> >>>> + *
> >>>> + * Every entry holds a checksum of its partial parity, the header also has a
> >>>> + * checksum of the header itself. Entries for full stripe writes contain no
> >>>> + * partial parity, they only mark the stripes for which parity should be
> >>>> + * recalculated after an unclean shutdown.
> >>>> + *
> >>>> + * A write request is always logged to the PPL instance stored on the parity
> >>>> + * disk of the corresponding stripe. For each member disk there is one r5l_log
> >>>> + * used to handle logging for this disk, independently of the others. They are
> >>>> + * grouped in child_logs array in struct ppl_conf, which is assigned to a
> >>>> + * common parent r5l_log. This parent log serves as a proxy and is used in
> >>>> + * raid5 personality code - it is assigned as _the_ log in r5conf->log.
> >>>> + *
> >>>> + * r5l_io_unit represents a full PPL write, meta_page contains the ppl_header.
> >>>> + * PPL entries for logged stripes are added in ppl_log_stripe(). A stripe can
> >>>> + * be appended to the last entry if the chunks to write are the same, otherwise
> >>>> + * a new entry is added. Checksums of entries are calculated incrementally as
> >>>> + * stripes containing partial parity are being added to entries.
> >>>> + * ppl_submit_iounit() calculates the checksum of the header and submits a bio
> >>>> + * containing the meta_page (ppl_header) and partial parity pages (sh->ppl_page)
> >>>> + * for all stripes of the io_unit. When the PPL write completes, the stripes
> >>>> + * associated with the io_unit are released and raid5d starts writing their data
> >>>> + * and parity. When all stripes are written, the io_unit is freed and the next
> >>>> + * can be submitted.
> >>>> + *
> >>>> + * An io_unit is used to gather stripes until it is submitted or becomes full
> >>>> + * (if the maximum number of entries or size of PPL is reached). Another io_unit
> >>>> + * can't be submitted until the previous has completed (PPL and stripe
> >>>> + * data+parity is written). The log->running_ios list tracks all io_units of
> >>>> + * a log (for a single member disk). New io_units are added to the end of the
> >>>> + * list and the first io_unit is submitted, if it is not submitted already.
> >>>> + * The current io_unit accepting new stripes is always the last on the list.
> >>>> + */
> >>>
> >>> I don't like the way that io_unit and r5l_log are reused here. The code below
> >>> shares very little with raid5-cache. There are a lot of fields in the data
> >>> structures which are not used by ppl. It's not good practice to share data
> >>> structures this way, so please give ppl its own data structure. The
> >>> workflow between raid5 cache and ppl might be similar, but what we really need
> >>> to share is the hooks to raid5 core.
> >>
> >> Reusing the structures seemed like a good idea at first, but now I must
> >> admit you are right, especially since the structures were extended to
> >> support write-back caching. One instance of r5l_log will still be
> >> necessary though, to use for the hooks. Its private field will be used
> >> for holding the ppl structures. Is that ok?
> > 
> > I'd prefer adding a 'void *log_private' in struct r5conf. raid5-cache and ppl
> > can store whatever data in the log_private. And convert all callbacks to use a
> > 'struct r5conf' parameter. This way raid5 cache and ppl don't need to share the
> > data structure.
> 
> This way, 'struct r5l_policy *policy' will also have to be moved from
> r5l_log to r5conf. Maybe it will be better if instead of adding
> log_private and moving this 'policy' field we add a dedicated field for
> ppl in r5conf, something like 'struct ppl_conf *ppl', and eliminate
> r5l_policy? From raid5 core we could just call corresponding functions
> depending on which pointer is set (log or ppl). This way raid5-cache.c
> won't have to be changed at all. It seems appropriate to separate
> raid5-cache and ppl if they won't be sharing any data structures.

It's ok to add ppl-specific data in r5conf. Just make sure the raid5 core only
calls one set of hooks if possible and the hooks can call into ppl or
raid5-cache. It's quite unlikely we will add other policies, so it's fine to
eliminate r5l_policy.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation
  2017-02-09 16:06         ` Wols Lists
@ 2017-02-09 17:13           ` Shaohua Li
  0 siblings, 0 replies; 25+ messages in thread
From: Shaohua Li @ 2017-02-09 17:13 UTC (permalink / raw)
  To: Wols Lists; +Cc: Artur Paszkiewicz, linux-raid, shli, neilb, jes.sorensen

On Thu, Feb 09, 2017 at 04:06:00PM +0000, Wols Lists wrote:
> On 08/02/17 05:34, Shaohua Li wrote:
> > > so it's the implementation which doesn't flush the disk cache before writing new
> > data to disks, right? That makes sense then. I think we should fix this soon.
> > Don't think people will (or remember to) disable write-back cache before using
> > ppl.
> 
> Dunno whether I'm making a fool of myself :-) but might a quick-n-dirty
> workaround be to make enabling PPL also disable write-back cache? With
> maybe a warning not to switch it back on?

We don't have an API for this on the kernel side. That doesn't mean we can't
add one, but since the limitation is in the implementation, we'd better fix
the implementation.

Thanks,
Shaohua


* Re: [PATCH v3 9/9] raid5-ppl: runtime PPL enabling or disabling
  2017-01-30 18:59 ` [PATCH v3 9/9] raid5-ppl: runtime PPL enabling or disabling Artur Paszkiewicz
@ 2017-03-27  5:08   ` NeilBrown
  0 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2017-03-27  5:08 UTC (permalink / raw)
  To: linux-raid; +Cc: shli, jes.sorensen, Artur Paszkiewicz


On Mon, Jan 30 2017, Artur Paszkiewicz wrote:

>  
> +static void raid5_reset_stripe_cache(struct mddev *mddev)
> +{
> +	struct r5conf *conf = mddev->private;
> +
> +	mutex_lock(&conf->cache_size_mutex);
> +	while (conf->max_nr_stripes &&
> +	       drop_one_stripe(conf))
> +		;
> +	while (conf->min_nr_stripes > conf->max_nr_stripes &&
> +	       grow_one_stripe(conf, GFP_KERNEL))
> +		;
> +	mutex_unlock(&conf->cache_size_mutex);
> +}

This is called with the raid array suspended, so any writeout to the
array will block.
A GFP_KERNEL allocation can block waiting for writeout.
So this can deadlock.
At the very least, this should use GFP_NOIO.

It would be better to do something like resize_stripes() does, or maybe
even just use resize_stripes().
It allocates the new stripes first before suspending IO.  Then it
cleans out the old stripes, and inserts the pre-allocated stripes.

NeilBrown



end of thread, other threads:[~2017-03-27  5:08 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-30 18:59 [PATCH v3 0/9] Partial Parity Log for MD RAID 5 Artur Paszkiewicz
2017-01-30 18:59 ` [PATCH v3 1/9] raid5-cache: move declarations to separate header Artur Paszkiewicz
2017-01-30 18:59 ` [PATCH v3 2/9] raid5-cache: add policy logic Artur Paszkiewicz
2017-01-30 18:59 ` [PATCH v3 3/9] md: superblock changes for PPL Artur Paszkiewicz
2017-02-07 21:20   ` Shaohua Li
2017-02-08 11:58     ` Artur Paszkiewicz
2017-01-30 18:59 ` [PATCH v3 4/9] raid5: calculate partial parity for a stripe Artur Paszkiewicz
2017-02-07 21:25   ` Shaohua Li
2017-02-08 11:58     ` Artur Paszkiewicz
2017-01-30 18:59 ` [PATCH v3 5/9] raid5-ppl: Partial Parity Log write logging implementation Artur Paszkiewicz
2017-02-07 21:42   ` Shaohua Li
2017-02-08 11:58     ` Artur Paszkiewicz
2017-02-08  5:34       ` Shaohua Li
2017-02-09 15:35         ` Artur Paszkiewicz
2017-02-09 17:09           ` Shaohua Li
2017-02-09 16:06         ` Wols Lists
2017-02-09 17:13           ` Shaohua Li
2017-01-30 18:59 ` [PATCH v3 6/9] md: add sysfs entries for PPL Artur Paszkiewicz
2017-02-07 21:49   ` Shaohua Li
2017-02-08 11:58     ` Artur Paszkiewicz
2017-02-08 18:14       ` Shaohua Li
2017-01-30 18:59 ` [PATCH v3 7/9] raid5-ppl: load and recover the log Artur Paszkiewicz
2017-01-30 18:59 ` [PATCH v3 8/9] raid5-ppl: support disk hot add/remove with PPL Artur Paszkiewicz
2017-01-30 18:59 ` [PATCH v3 9/9] raid5-ppl: runtime PPL enabling or disabling Artur Paszkiewicz
2017-03-27  5:08   ` NeilBrown
