* [PATCH v5 0/8] raid5-cache: enabling cache features
@ 2016-10-13  5:49 Song Liu
  2016-10-13  5:49 ` [PATCH v5 1/8] md/r5cache: Check array size in r5l_init_log Song Liu
                   ` (7 more replies)
  0 siblings, 8 replies; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

This is the 5th version of the patches that enable the write cache part of
raid5-cache. The journal part was released with kernel 4.4.

The caching part uses the same on-disk format as the raid456 journal and
accelerates writes. Write operations are committed (bio_endio) once the
data is secured in the journal. Reconstruct and RMW are postponed to the
reclaim path, which is (hopefully) not on the critical path.

The changes are organized in 8 patches (details below).

The chunk_aligned_read patch from the earlier RFC
(http://marc.info/?l=linux-raid&m=146432700719277) is not included yet.
We may still need some optimizations there later, especially for SSD raid devices.

Changes between v5 and v4 (http://marc.info/?l=linux-raid&m=147629531615172):
  1. Change the output/input of sysfs entry r5c_state
  2. Move heavy reclaim work from raid5_make_request() to r5c_do_reclaim()
  3. Fix an issue with orig_page handling in the write path

Changes between v4 and v3 (http://marc.info/?l=linux-raid&m=147573807306070):
  1. Make reclaim robust
  2. Fix a bug in recovery

Changes between v3 and v2 (http://marc.info/?l=linux-raid&m=147493266208102):
  1. Incorporate feedback from Shaohua
  2. Reorganize the patches, for hopefully easier review
  3. Make sure no change to write through mode (journal only)
  4. Change reclaim design to avoid deadlock due to log space

Thanks,
Song

Song Liu (8):
  md/r5cache: Check array size in r5l_init_log
  md/r5cache: move some code to raid5.h
  md/r5cache: State machine for raid5-cache write back mode
  md/r5cache: write part of r5cache
  md/r5cache: reclaim support
  md/r5cache: sysfs entry r5c_state
  md/r5cache: r5c recovery
  md/r5cache: handle SYNC and FUA

 drivers/md/raid5-cache.c | 1659 ++++++++++++++++++++++++++++++++++++++++------
 drivers/md/raid5.c       |  261 +++++---
 drivers/md/raid5.h       |  150 ++++-
 3 files changed, 1772 insertions(+), 298 deletions(-)

--
2.9.3


* [PATCH v5 1/8] md/r5cache: Check array size in r5l_init_log
  2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
@ 2016-10-13  5:49 ` Song Liu
  2016-10-13  5:49 ` [PATCH v5 2/8] md/r5cache: move some code to raid5.h Song Liu
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

Currently, r5l_write_stripe checks the meta size for each stripe write,
which is not necessary.

With this patch, r5l_init_log checks the maximal meta size of the array,
which is (r5l_meta_block + raid_disks x r5l_payload_data_parity).
If this is too big to fit in one page, r5l_init_log aborts.

With the current meta data layout, r5l_log supports raid_disks up to 203.

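As a rough sanity check of the 203 figure (assuming the current on-disk
layout: a 32-byte struct r5l_meta_block, and 20 bytes per payload entry,
i.e. a 16-byte struct r5l_payload_data_parity plus one __le32 checksum):

	sizeof(r5l_meta_block) +
		raid_disks * (sizeof(r5l_payload_data_parity) + sizeof(__le32))
		<= PAGE_SIZE
	32 + raid_disks * 20 <= 4096   =>   raid_disks <= 203

The check added to r5l_init_log below is the authoritative version.
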
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 1b1ab4a..7557791b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -441,7 +441,6 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 {
 	int write_disks = 0;
 	int data_pages, parity_pages;
-	int meta_size;
 	int reserve;
 	int i;
 	int ret = 0;
@@ -473,15 +472,6 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 	parity_pages = 1 + !!(sh->qd_idx >= 0);
 	data_pages = write_disks - parity_pages;
 
-	meta_size =
-		((sizeof(struct r5l_payload_data_parity) + sizeof(__le32))
-		 * data_pages) +
-		sizeof(struct r5l_payload_data_parity) +
-		sizeof(__le32) * parity_pages;
-	/* Doesn't work with very big raid array */
-	if (meta_size + sizeof(struct r5l_meta_block) > PAGE_SIZE)
-		return -EINVAL;
-
 	set_bit(STRIPE_LOG_TRAPPED, &sh->state);
 	/*
 	 * The stripe must enter state machine again to finish the write, so
@@ -1184,6 +1174,22 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 
 	if (PAGE_SIZE != 4096)
 		return -EINVAL;
+
+	/*
+	 * The PAGE_SIZE must be big enough to hold 1 r5l_meta_block and
+	 * raid_disks r5l_payload_data_parity.
+	 *
+	 * Write journal and cache does not work for very big array
+	 * (raid_disks > 203)
+	 */
+	if (sizeof(struct r5l_meta_block) +
+	    ((sizeof(struct r5l_payload_data_parity) + sizeof(__le32)) *
+	     conf->raid_disks) > PAGE_SIZE) {
+		pr_err("md/raid:%s: write journal/cache doesn't work for array with %d disks\n",
+		       mdname(conf->mddev), conf->raid_disks);
+		return -EINVAL;
+	}
+
 	log = kzalloc(sizeof(*log), GFP_KERNEL);
 	if (!log)
 		return -ENOMEM;
-- 
2.9.3



* [PATCH v5 2/8] md/r5cache: move some code to raid5.h
  2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
  2016-10-13  5:49 ` [PATCH v5 1/8] md/r5cache: Check array size in r5l_init_log Song Liu
@ 2016-10-13  5:49 ` Song Liu
  2016-10-13  5:49 ` [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode Song Liu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

Move some defines and inline functions to raid5.h, so they can be
used in raid5-cache.c.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5.c | 71 -------------------------------------------------
 drivers/md/raid5.h | 77 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+), 71 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f94472d..67d4f49 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -70,19 +70,6 @@ module_param(devices_handle_discard_safely, bool, 0644);
 MODULE_PARM_DESC(devices_handle_discard_safely,
 		 "Set to Y if all devices in each array reliably return zeroes on reads from discarded regions");
 static struct workqueue_struct *raid5_wq;
-/*
- * Stripe cache
- */
-
-#define NR_STRIPES		256
-#define STRIPE_SIZE		PAGE_SIZE
-#define STRIPE_SHIFT		(PAGE_SHIFT - 9)
-#define STRIPE_SECTORS		(STRIPE_SIZE>>9)
-#define	IO_THRESHOLD		1
-#define BYPASS_THRESHOLD	1
-#define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
-#define HASH_MASK		(NR_HASH - 1)
-#define MAX_STRIPE_BATCH	8
 
 static inline struct hlist_head *stripe_hash(struct r5conf *conf, sector_t sect)
 {
@@ -126,64 +113,6 @@ static inline void unlock_all_device_hash_locks_irq(struct r5conf *conf)
 	local_irq_enable();
 }
 
-/* bio's attached to a stripe+device for I/O are linked together in bi_sector
- * order without overlap.  There may be several bio's per stripe+device, and
- * a bio could span several devices.
- * When walking this list for a particular stripe+device, we must never proceed
- * beyond a bio that extends past this device, as the next bio might no longer
- * be valid.
- * This function is used to determine the 'next' bio in the list, given the sector
- * of the current stripe+device
- */
-static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
-{
-	int sectors = bio_sectors(bio);
-	if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS)
-		return bio->bi_next;
-	else
-		return NULL;
-}
-
-/*
- * We maintain a biased count of active stripes in the bottom 16 bits of
- * bi_phys_segments, and a count of processed stripes in the upper 16 bits
- */
-static inline int raid5_bi_processed_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	return (atomic_read(segments) >> 16) & 0xffff;
-}
-
-static inline int raid5_dec_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	return atomic_sub_return(1, segments) & 0xffff;
-}
-
-static inline void raid5_inc_bi_active_stripes(struct bio *bio)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	atomic_inc(segments);
-}
-
-static inline void raid5_set_bi_processed_stripes(struct bio *bio,
-	unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	int old, new;
-
-	do {
-		old = atomic_read(segments);
-		new = (old & 0xffff) | (cnt << 16);
-	} while (atomic_cmpxchg(segments, old, new) != old);
-}
-
-static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
-{
-	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
-	atomic_set(segments, cnt);
-}
-
 /* Find first data disk in a raid6 stripe */
 static inline int raid6_d0(struct stripe_head *sh)
 {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 517d4b6..46cfe93 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -410,6 +410,83 @@ struct disk_info {
 	struct md_rdev	*rdev, *replacement;
 };
 
+/*
+ * Stripe cache
+ */
+
+#define NR_STRIPES		256
+#define STRIPE_SIZE		PAGE_SIZE
+#define STRIPE_SHIFT		(PAGE_SHIFT - 9)
+#define STRIPE_SECTORS		(STRIPE_SIZE>>9)
+#define	IO_THRESHOLD		1
+#define BYPASS_THRESHOLD	1
+#define NR_HASH			(PAGE_SIZE / sizeof(struct hlist_head))
+#define HASH_MASK		(NR_HASH - 1)
+#define MAX_STRIPE_BATCH	8
+
+/* bio's attached to a stripe+device for I/O are linked together in bi_sector
+ * order without overlap.  There may be several bio's per stripe+device, and
+ * a bio could span several devices.
+ * When walking this list for a particular stripe+device, we must never proceed
+ * beyond a bio that extends past this device, as the next bio might no longer
+ * be valid.
+ * This function is used to determine the 'next' bio in the list, given the
+ * sector of the current stripe+device
+ */
+static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
+{
+	int sectors = bio_sectors(bio);
+
+	if (bio->bi_iter.bi_sector + sectors < sector + STRIPE_SECTORS)
+		return bio->bi_next;
+	else
+		return NULL;
+}
+
+/*
+ * We maintain a biased count of active stripes in the bottom 16 bits of
+ * bi_phys_segments, and a count of processed stripes in the upper 16 bits
+ */
+static inline int raid5_bi_processed_stripes(struct bio *bio)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+	return (atomic_read(segments) >> 16) & 0xffff;
+}
+
+static inline int raid5_dec_bi_active_stripes(struct bio *bio)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+	return atomic_sub_return(1, segments) & 0xffff;
+}
+
+static inline void raid5_inc_bi_active_stripes(struct bio *bio)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+	atomic_inc(segments);
+}
+
+static inline void raid5_set_bi_processed_stripes(struct bio *bio,
+	unsigned int cnt)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	int old, new;
+
+	do {
+		old = atomic_read(segments);
+		new = (old & 0xffff) | (cnt << 16);
+	} while (atomic_cmpxchg(segments, old, new) != old);
+}
+
+static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+
+	atomic_set(segments, cnt);
+}
+
 /* NOTE NR_STRIPE_HASH_LOCKS must remain below 64.
  * This is because we sometimes take all the spinlocks
  * and creating that much locking depth can cause
-- 
2.9.3



* [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode
  2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
  2016-10-13  5:49 ` [PATCH v5 1/8] md/r5cache: Check array size in r5l_init_log Song Liu
  2016-10-13  5:49 ` [PATCH v5 2/8] md/r5cache: move some code to raid5.h Song Liu
@ 2016-10-13  5:49 ` Song Liu
  2016-10-14  6:13   ` NeilBrown
  2016-10-13  5:49 ` [PATCH v5 4/8] md/r5cache: write part of r5cache Song Liu
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

The raid5-cache write back mode has 2 states for each stripe: write state
and reclaim state. This patch adds a bare state machine for these two
states.

2 flags are added to sh->state for raid5-cache states:
 - STRIPE_R5C_FROZEN
 - STRIPE_R5C_WRITTEN

STRIPE_R5C_FROZEN is the key flag to differentiate write state
and reclaim state.

STRIPE_R5C_WRITTEN is a helper flag to bring the stripe from reclaim
state back to write state.

In write through mode, every stripe also goes between write state
and reclaim state (in r5c_handle_stripe_dirtying() and
r5c_handle_stripe_written()).

Please note: this is a "no-op" patch for raid5-cache write through
mode.

The following detailed explanation is copied from raid5-cache.c:

/*
 * raid5 cache state machine
 *
 * The RAID cache works in two major states for each stripe: write state
 * and reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN
 * and STRIPE_R5C_WRITTEN
 *
 * STRIPE_R5C_FROZEN is the key flag to differentiate write state and reclaim
 * state. The write state runs w/ STRIPE_R5C_FROZEN == 0. While the reclaim
 * state runs w/ STRIPE_R5C_FROZEN == 1.
 *
 * STRIPE_R5C_WRITTEN is a helper flag to bring the stripe back from reclaim
 * state to write state. Specifically, STRIPE_R5C_WRITTEN triggers clean up
 * process in r5c_handle_stripe_written. STRIPE_R5C_WRITTEN is set when data
 * and parity of a stripe is all in journal device; and cleared when the data
 * and parity are all in RAID disks.
 *
 * The following is another way to show how STRIPE_R5C_FROZEN and
 * STRIPE_R5C_WRITTEN work:
 *
 * write state: STRIPE_R5C_FROZEN = 0 STRIPE_R5C_WRITTEN = 0
 * reclaim state: STRIPE_R5C_FROZEN = 1
 *
 * write => reclaim: set STRIPE_R5C_FROZEN in r5c_freeze_stripe_for_reclaim
 * reclaim => write:
 * 1. write parity to journal, when finished, set STRIPE_R5C_WRITTEN
 * 2. write data/parity to raid disks, when finished, clear both
 *    STRIPE_R5C_FROZEN and STRIPE_R5C_WRITTEN
 *
 * In write through mode (journal only) the stripe still goes through these
 * state change, except that STRIPE_R5C_FROZEN is set on write in
 * r5c_handle_stripe_dirtying().
 */

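Where the new hooks sit in the existing stripe handling code, condensed
from the diff below (write through mode; details omitted):

	handle_stripe()
	    r5c_handle_stripe_written()      /* reclaim -> write: clear FROZEN and
	                                        WRITTEN once data and parity are on
	                                        the RAID disks */
	    handle_stripe_dirtying()
	        r5c_handle_stripe_dirtying() /* write -> reclaim: set FROZEN */

	r5l_io_run_stripes()
	    r5c_finish_cache_stripe()        /* journal write done: set WRITTEN */
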
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 125 +++++++++++++++++++++++++++++++++++++++++++++--
 drivers/md/raid5.c       |  20 ++++++--
 drivers/md/raid5.h       |  10 +++-
 3 files changed, 148 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 7557791b..9e05850 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -40,6 +40,47 @@
  */
 #define R5L_POOL_SIZE	4
 
+enum r5c_state {
+	R5C_STATE_NO_CACHE = 0,
+	R5C_STATE_WRITE_THROUGH = 1,
+	R5C_STATE_WRITE_BACK = 2,
+	R5C_STATE_CACHE_BROKEN = 3,
+};
+
+/*
+ * raid5 cache state machine
+ *
+ * The RAID cache works in two major states for each stripe: write state and
+ * reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN and
+ * STRIPE_R5C_WRITTEN
+ *
+ * STRIPE_R5C_FROZEN is the key flag to differentiate write state and reclaim
+ * state. The write state runs w/ STRIPE_R5C_FROZEN == 0. While the reclaim
+ * state runs w/ STRIPE_R5C_FROZEN == 1.
+ *
+ * STRIPE_R5C_WRITTEN is a helper flag to bring the stripe back from reclaim
+ * state to write state. Specifically, STRIPE_R5C_WRITTEN triggers clean up
+ * process in r5c_handle_stripe_written. STRIPE_R5C_WRITTEN is set when data
+ * and parity of a stripe is all in journal device; and cleared when the data
+ * and parity are all in RAID disks.
+ *
+ * The following is another way to show how STRIPE_R5C_FROZEN and
+ * STRIPE_R5C_WRITTEN work:
+ *
+ * write state: STRIPE_R5C_FROZEN = 0 STRIPE_R5C_WRITTEN = 0
+ * reclaim state: STRIPE_R5C_FROZEN = 1
+ *
+ * write => reclaim: set STRIPE_R5C_FROZEN in r5c_freeze_stripe_for_reclaim
+ * reclaim => write:
+ * 1. write parity to journal, when finished, set STRIPE_R5C_WRITTEN
+ * 2. write data/parity to raid disks, when finished, clear both
+ *    STRIPE_R5C_FROZEN and STRIPE_R5C_WRITTEN
+ *
+ * In write through mode (journal only) the stripe also goes through these
+ * state change, except that STRIPE_R5C_FROZEN is set on write in
+ * r5c_handle_stripe_dirtying().
+ */
+
 struct r5l_log {
 	struct md_rdev *rdev;
 
@@ -96,6 +137,9 @@ struct r5l_log {
 	spinlock_t no_space_stripes_lock;
 
 	bool need_cache_flush;
+
+	/* for r5c_cache */
+	enum r5c_state r5c_state;
 };
 
 /*
@@ -133,6 +177,11 @@ enum r5l_io_unit_state {
 	IO_UNIT_STRIPE_END = 3,	/* stripes data finished writing to raid */
 };
 
+bool r5c_is_writeback(struct r5l_log *log)
+{
+	return (log != NULL && log->r5c_state == R5C_STATE_WRITE_BACK);
+}
+
 static sector_t r5l_ring_add(struct r5l_log *log, sector_t start, sector_t inc)
 {
 	start += inc;
@@ -168,12 +217,44 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 	io->state = state;
 }
 
+/*
+ * Freeze the stripe, thus send the stripe into reclaim path.
+ *
+ * In current implementation, STRIPE_R5C_FROZEN is also set in write through
+ * mode (in r5c_handle_stripe_dirtying). This does not change the behavior of
+ * for write through mode.
+ */
+void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct r5l_log *log = conf->log;
+
+	if (!log)
+		return;
+	WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
+	set_bit(STRIPE_R5C_FROZEN, &sh->state);
+}
+
+static void r5c_finish_cache_stripe(struct stripe_head *sh)
+{
+	struct r5l_log *log = sh->raid_conf->log;
+
+	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
+		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
+		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
+	} else
+		BUG(); /* write back logic in next patch */
+}
+
 static void r5l_io_run_stripes(struct r5l_io_unit *io)
 {
 	struct stripe_head *sh, *next;
 
 	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
 		list_del_init(&sh->log_list);
+
+		r5c_finish_cache_stripe(sh);
+
 		set_bit(STRIPE_HANDLE, &sh->state);
 		raid5_release_stripe(sh);
 	}
@@ -412,18 +493,19 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 		r5l_append_payload_page(log, sh->dev[i].page);
 	}
 
-	if (sh->qd_idx >= 0) {
+	if (parity_pages == 2) {
 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
 					sh->sector, sh->dev[sh->pd_idx].log_checksum,
 					sh->dev[sh->qd_idx].log_checksum, true);
 		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
 		r5l_append_payload_page(log, sh->dev[sh->qd_idx].page);
-	} else {
+	} else if (parity_pages == 1) {
 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
 					sh->sector, sh->dev[sh->pd_idx].log_checksum,
 					0, false);
 		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
-	}
+	} else
+		BUG_ON(parity_pages != 0);
 
 	list_add_tail(&sh->log_list, &io->stripe_list);
 	atomic_inc(&io->pending_stripe);
@@ -455,6 +537,8 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 		return -EAGAIN;
 	}
 
+	WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
+
 	for (i = 0; i < sh->disks; i++) {
 		void *addr;
 
@@ -1101,6 +1185,39 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 }
 
+int r5c_handle_stripe_dirtying(struct r5conf *conf,
+			       struct stripe_head *sh,
+			       struct stripe_head_state *s,
+			       int disks)
+{
+	struct r5l_log *log = conf->log;
+
+	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
+		return -EAGAIN;
+
+	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
+	    conf->mddev->degraded != 0) {
+		/* write through mode */
+		r5c_freeze_stripe_for_reclaim(sh);
+		return -EAGAIN;
+	}
+	BUG();  /* write back logic in next commit */
+	return 0;
+}
+
+/*
+ * clean up the stripe (clear STRIPE_R5C_FROZEN etc.) after the stripe is
+ * committed to RAID disks
+*/
+void r5c_handle_stripe_written(struct r5conf *conf,
+			       struct stripe_head *sh)
+{
+	if (!test_and_clear_bit(STRIPE_R5C_WRITTEN, &sh->state))
+		return;
+	WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
+	clear_bit(STRIPE_R5C_FROZEN, &sh->state);
+}
+
 static int r5l_load_log(struct r5l_log *log)
 {
 	struct md_rdev *rdev = log->rdev;
@@ -1236,6 +1353,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	INIT_LIST_HEAD(&log->no_space_stripes);
 	spin_lock_init(&log->no_space_stripes_lock);
 
+	log->r5c_state = R5C_STATE_WRITE_THROUGH;
+
 	if (r5l_load_log(log))
 		goto error;
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 67d4f49..2e3e61a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3506,6 +3506,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	int rmw = 0, rcw = 0, i;
 	sector_t recovery_cp = conf->mddev->recovery_cp;
 
+	if (r5c_handle_stripe_dirtying(conf, sh, s, disks) == 0)
+		return;
+
 	/* Check whether resync is now happening or should start.
 	 * If yes, then the array is dirty (after unclean shutdown or
 	 * initial creation), so parity in some stripes might be inconsistent.
@@ -4396,13 +4399,23 @@ static void handle_stripe(struct stripe_head *sh)
 	    || s.expanding)
 		handle_stripe_fill(sh, &s, disks);
 
-	/* Now to consider new write requests and what else, if anything
-	 * should be read.  We do not handle new writes when:
+	/*
+	 * When the stripe finishes full journal write cycle (write to journal
+	 * and raid disk), this is the clean up procedure so it is ready for
+	 * next operation.
+	 */
+	r5c_handle_stripe_written(conf, sh);
+
+	/*
+	 * Now to consider new write requests, cache write back and what else,
+	 * if anything should be read.  We do not handle new writes when:
 	 * 1/ A 'write' operation (copy+xor) is already in flight.
 	 * 2/ A 'check' operation is in flight, as it may clobber the parity
 	 *    block.
+	 * 3/ A r5c cache log write is in flight.
 	 */
-	if (s.to_write && !sh->reconstruct_state && !sh->check_state)
+	if ((s.to_write || test_bit(STRIPE_R5C_FROZEN, &sh->state)) &&
+	    !sh->reconstruct_state && !sh->check_state && !sh->log_io)
 		handle_stripe_dirtying(conf, sh, &s, disks);
 
 	/* maybe we need to check and possibly fix the parity for this stripe
@@ -5122,6 +5135,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	 * data on failed drives.
 	 */
 	if (rw == READ && mddev->degraded == 0 &&
+	    !r5c_is_writeback(conf->log) &&
 	    mddev->reshape_position == MaxSector) {
 		bi = chunk_aligned_read(mddev, bi);
 		if (!bi)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 46cfe93..8bae64b 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -345,7 +345,9 @@ enum {
 	STRIPE_BITMAP_PENDING,	/* Being added to bitmap, don't add
 				 * to batch yet.
 				 */
-	STRIPE_LOG_TRAPPED, /* trapped into log */
+	STRIPE_LOG_TRAPPED,	/* trapped into log */
+	STRIPE_R5C_FROZEN,	/* r5c_cache frozen and being written out */
+	STRIPE_R5C_WRITTEN,	/* ready for r5c_handle_stripe_written() */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
@@ -712,4 +714,10 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
 extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
 extern void r5l_quiesce(struct r5l_log *log, int state);
 extern bool r5l_log_disk_error(struct r5conf *conf);
+extern bool r5c_is_writeback(struct r5l_log *log);
+extern int
+r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
+			   struct stripe_head_state *s, int disks);
+extern void
+r5c_handle_stripe_written(struct r5conf *conf, struct stripe_head *sh);
 #endif
-- 
2.9.3



* [PATCH v5 4/8] md/r5cache: write part of r5cache
  2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
                   ` (2 preceding siblings ...)
  2016-10-13  5:49 ` [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode Song Liu
@ 2016-10-13  5:49 ` Song Liu
  2016-10-14  6:53   ` NeilBrown
  2016-10-13  5:49 ` [PATCH v5 5/8] md/r5cache: reclaim support Song Liu
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

This is the write part of r5cache. The cache is integrated with the
stripe cache of raid456. It leverages the r5l_log code to write
data to the journal device.

r5cache splits the current write path into 2 parts: the write path
and the reclaim path. The write path is as follows:
1. write data to journal
   (r5c_handle_stripe_dirtying, r5c_cache_data)
2. call bio_endio
   (r5c_handle_data_cached, r5c_return_dev_pending_writes).

The reclaim path is as follows:
1. Freeze the stripe (r5c_freeze_stripe_for_reclaim)
2. Calculate parity (reconstruct or RMW)
3. Write parity (and maybe some other data) to the journal device
4. Write data and parity to RAID disks

The reclaim path of the cache is implemented in the next patch.

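A condensed view of the write path above, using the function names from
the diff below (locking, error handling and the journal-full case omitted):

	handle_stripe()
	    r5c_handle_stripe_dirtying()      /* mark towrite devs R5_Wantcache,
	                                         schedule STRIPE_OP_BIODRAIN */
	    ops_run_biodrain()                /* copy bio data into stripe pages */
	    ops_run_io() -> r5c_cache_data()  /* checksum pages, append data blocks
	                                         to the journal */
	... journal write completes ...
	    r5c_handle_data_cached()          /* R5_Wantcache -> R5_InCache */
	    r5c_handle_cached_data_endio()    /* bio_endio via
	                                         r5c_return_dev_pending_writes() */
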
With r5cache, a write operation does not wait for parity calculation
and write out, so the write latency is lower (1 write to the journal
device vs. read and then write to raid disks). Also, r5cache reduces
RAID overhead (multiple IOs due to read-modify-write of parity) and
provides more opportunities for full stripe writes.

This patch adds 2 flags to stripe_head.state:
 - STRIPE_R5C_PARTIAL_STRIPE,
 - STRIPE_R5C_FULL_STRIPE,

Instead of inactive_list, stripes with cached data are tracked in
r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
are not considered "active".

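The placement decision when such a stripe is released, condensed from the
do_release_stripe() hunk below (counter and flag updates omitted):

	if (atomic_read(&sh->dev_in_cache) == 0)
		/* nothing cached: back to the inactive list, as before */
	else if (dev_in_cache == conf->raid_disks - conf->max_degraded)
		/* all data blocks cached: conf->r5c_full_stripe_list */
	else
		/* some data blocks cached: conf->r5c_partial_stripe_list */
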
For RMW, the code allocates an extra page for each data block
being updated.  This is stored in r5dev->page and the old data
is read into it.  Then the prexor calculation subtracts ->page
from the parity block, and the reconstruct calculation adds the
->orig_page data back into the parity block.

r5cache naturally excludes SkipCopy. With the R5_Wantcache bit set,
async_copy_data will not skip the copy.

There are some known limitations of the cache implementation:

1. The write cache only covers full page writes (R5_OVERWRITE). Writes
   of smaller granularity are write through.
2. Only one log io (sh->log_io) per stripe at any time. Later
   writes for the same stripe have to wait. This can be improved by
   moving log_io to r5dev.
3. With the writeback cache, the read path must enter the state machine,
   which is a significant bottleneck for some workloads.
4. There is no per-stripe checkpoint (with r5l_payload_flush) in
   the log, so the recovery code has to replay more data than necessary
   (sometimes all the log from last_checkpoint). This reduces the
   availability of the array.

This patch includes a fix proposed by ZhengYuan Liu
<liuzhengyuan@kylinos.cn>

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 204 +++++++++++++++++++++++++++++++++++++++++++++--
 drivers/md/raid5.c       | 144 ++++++++++++++++++++++++++++-----
 drivers/md/raid5.h       |  22 +++++
 3 files changed, 344 insertions(+), 26 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9e05850..92d3d7b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -20,6 +20,7 @@
 #include <linux/random.h>
 #include "md.h"
 #include "raid5.h"
+#include "bitmap.h"
 
 /*
  * metadata/data stored in disk with 4k size unit (a block) regardless
@@ -217,6 +218,44 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 	io->state = state;
 }
 
+static void
+r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
+			      struct bio_list *return_bi)
+{
+	struct bio *wbi, *wbi2;
+
+	wbi = dev->written;
+	dev->written = NULL;
+	while (wbi && wbi->bi_iter.bi_sector <
+	       dev->sector + STRIPE_SECTORS) {
+		wbi2 = r5_next_bio(wbi, dev->sector);
+		if (!raid5_dec_bi_active_stripes(wbi)) {
+			md_write_end(conf->mddev);
+			bio_list_add(return_bi, wbi);
+		}
+		wbi = wbi2;
+	}
+}
+
+void r5c_handle_cached_data_endio(struct r5conf *conf,
+	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
+{
+	int i;
+
+	for (i = sh->disks; i--; ) {
+		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
+		    sh->dev[i].written) {
+			set_bit(R5_UPTODATE, &sh->dev[i].flags);
+			r5c_return_dev_pending_writes(conf, &sh->dev[i],
+						      return_bi);
+			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
+					STRIPE_SECTORS,
+					!test_bit(STRIPE_DEGRADED, &sh->state),
+					0);
+		}
+	}
+}
+
 /*
  * Freeze the stripe, thus send the stripe into reclaim path.
  *
@@ -233,6 +272,48 @@ void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
 		return;
 	WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
 	set_bit(STRIPE_R5C_FROZEN, &sh->state);
+
+	if (log->r5c_state == R5C_STATE_WRITE_THROUGH)
+		return;
+
+	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+		atomic_inc(&conf->preread_active_stripes);
+
+	if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
+		BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+		atomic_dec(&conf->r5c_cached_partial_stripes);
+	}
+
+	if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+		BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
+		atomic_dec(&conf->r5c_cached_full_stripes);
+	}
+}
+
+static void r5c_handle_data_cached(struct stripe_head *sh)
+{
+	int i;
+
+	for (i = sh->disks; i--; )
+		if (test_and_clear_bit(R5_Wantcache, &sh->dev[i].flags)) {
+			set_bit(R5_InCache, &sh->dev[i].flags);
+			clear_bit(R5_LOCKED, &sh->dev[i].flags);
+			atomic_inc(&sh->dev_in_cache);
+		}
+}
+
+/*
+ * this journal write must contain full parity,
+ * it may also contain some data pages
+ */
+static void r5c_handle_parity_cached(struct stripe_head *sh)
+{
+	int i;
+
+	for (i = sh->disks; i--; )
+		if (test_bit(R5_InCache, &sh->dev[i].flags))
+			set_bit(R5_Wantwrite, &sh->dev[i].flags);
+	set_bit(STRIPE_R5C_WRITTEN, &sh->state);
 }
 
 static void r5c_finish_cache_stripe(struct stripe_head *sh)
@@ -242,8 +323,10 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
 	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
 		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
 		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
-	} else
-		BUG(); /* write back logic in next patch */
+	} else if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
+		r5c_handle_parity_cached(sh);
+	else
+		r5c_handle_data_cached(sh);
 }
 
 static void r5l_io_run_stripes(struct r5l_io_unit *io)
@@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 	io = log->current_io;
 
 	for (i = 0; i < sh->disks; i++) {
-		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
+		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
+		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
 			continue;
 		if (i == sh->pd_idx || i == sh->qd_idx)
 			continue;
@@ -514,7 +598,6 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 	return 0;
 }
 
-static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
 /*
  * running in raid5d, where reclaim could wait for raid5d too (when it flushes
  * data from log to raid disks), so we shouldn't wait for reclaim here
@@ -544,6 +627,10 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 
 		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
 			continue;
+
+		if (test_bit(R5_InCache, &sh->dev[i].flags))
+			continue;
+
 		write_disks++;
 		/* checksum is already calculated in last run */
 		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
@@ -809,7 +896,6 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 	}
 }
 
-
 static void r5l_do_reclaim(struct r5l_log *log)
 {
 	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
@@ -872,7 +958,7 @@ static void r5l_reclaim_thread(struct md_thread *thread)
 	r5l_do_reclaim(log);
 }
 
-static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
+void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 {
 	unsigned long target;
 	unsigned long new = (unsigned long)space; /* overflow in theory */
@@ -1191,6 +1277,8 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
 			       int disks)
 {
 	struct r5l_log *log = conf->log;
+	int i;
+	struct r5dev *dev;
 
 	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
 		return -EAGAIN;
@@ -1201,21 +1289,121 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
 		r5c_freeze_stripe_for_reclaim(sh);
 		return -EAGAIN;
 	}
-	BUG();  /* write back logic in next commit */
+
+	s->to_cache = 0;
+
+	for (i = disks; i--; ) {
+		dev = &sh->dev[i];
+		/* if none-overwrite, use the reclaim path (write through) */
+		if (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
+		    !test_bit(R5_InCache, &dev->flags)) {
+			r5c_freeze_stripe_for_reclaim(sh);
+			return -EAGAIN;
+		}
+	}
+
+	for (i = disks; i--; ) {
+		dev = &sh->dev[i];
+		if (dev->towrite) {
+			set_bit(R5_Wantcache, &dev->flags);
+			set_bit(R5_Wantdrain, &dev->flags);
+			set_bit(R5_LOCKED, &dev->flags);
+			s->to_cache++;
+		}
+	}
+
+	if (s->to_cache)
+		set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
+
 	return 0;
 }
 
 /*
  * clean up the stripe (clear STRIPE_R5C_FROZEN etc.) after the stripe is
  * committed to RAID disks
-*/
+ */
 void r5c_handle_stripe_written(struct r5conf *conf,
 			       struct stripe_head *sh)
 {
+	int i;
+	int do_wakeup = 0;
+
 	if (!test_and_clear_bit(STRIPE_R5C_WRITTEN, &sh->state))
 		return;
 	WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
 	clear_bit(STRIPE_R5C_FROZEN, &sh->state);
+
+	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH)
+		return;
+
+	for (i = sh->disks; i--; ) {
+		if (test_and_clear_bit(R5_InCache, &sh->dev[i].flags))
+			atomic_dec(&sh->dev_in_cache);
+		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
+			do_wakeup = 1;
+	}
+
+	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
+		if (atomic_dec_and_test(&conf->pending_full_writes))
+			md_wakeup_thread(conf->mddev->thread);
+
+	if (do_wakeup)
+		wake_up(&conf->wait_for_overlap);
+}
+
+int
+r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
+	       struct stripe_head_state *s)
+{
+	int pages;
+	int reserve;
+	int i;
+	int ret = 0;
+	int page_count = 0;
+
+	BUG_ON(!log);
+
+	for (i = 0; i < sh->disks; i++) {
+		void *addr;
+
+		if (!test_bit(R5_Wantcache, &sh->dev[i].flags))
+			continue;
+		addr = kmap_atomic(sh->dev[i].page);
+		sh->dev[i].log_checksum = crc32c_le(log->uuid_checksum,
+						    addr, PAGE_SIZE);
+		kunmap_atomic(addr);
+		page_count++;
+	}
+	WARN_ON(page_count != s->to_cache);
+	pages = s->to_cache;
+
+	/*
+	 * The stripe must enter state machine again to call endio, so
+	 * don't delay.
+	 */
+	clear_bit(STRIPE_DELAYED, &sh->state);
+	atomic_inc(&sh->count);
+
+	mutex_lock(&log->io_mutex);
+	/* meta + data */
+	reserve = (1 + pages) << (PAGE_SHIFT - 9);
+	if (!r5l_has_free_space(log, reserve)) {
+		spin_lock(&log->no_space_stripes_lock);
+		list_add_tail(&sh->log_list, &log->no_space_stripes);
+		spin_unlock(&log->no_space_stripes_lock);
+
+		r5l_wake_reclaim(log, reserve);
+	} else {
+		ret = r5l_log_stripe(log, sh, pages, 0);
+		if (ret) {
+			spin_lock_irq(&log->io_list_lock);
+			list_add_tail(&sh->log_list, &log->no_mem_stripes);
+			spin_unlock_irq(&log->io_list_lock);
+		}
+	}
+
+	mutex_unlock(&log->io_mutex);
+	return 0;
 }
 
 static int r5l_load_log(struct r5l_log *log)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 2e3e61a..0539f34 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -245,8 +245,25 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 			    < IO_THRESHOLD)
 				md_wakeup_thread(conf->mddev->thread);
 		atomic_dec(&conf->active_stripes);
-		if (!test_bit(STRIPE_EXPANDING, &sh->state))
-			list_add_tail(&sh->lru, temp_inactive_list);
+		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			if (atomic_read(&sh->dev_in_cache) == 0) {
+				list_add_tail(&sh->lru, temp_inactive_list);
+			} else if (atomic_read(&sh->dev_in_cache) ==
+				   conf->raid_disks - conf->max_degraded) {
+				/* full stripe */
+				if (!test_and_set_bit(STRIPE_R5C_FULL_STRIPE, &sh->state))
+					atomic_inc(&conf->r5c_cached_full_stripes);
+				if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
+					atomic_dec(&conf->r5c_cached_partial_stripes);
+				list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
+			} else {
+				/* partial stripe */
+				if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
+						      &sh->state))
+					atomic_inc(&conf->r5c_cached_partial_stripes);
+				list_add_tail(&sh->lru, &conf->r5c_partial_stripe_list);
+			}
+		}
 	}
 }
 
@@ -830,6 +847,11 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	might_sleep();
 
+	if (s->to_cache) {
+		r5c_cache_data(conf->log, sh, s);
+		return;
+	}
+
 	if (r5l_write_stripe(conf->log, sh) == 0)
 		return;
 	for (i = disks; i--; ) {
@@ -1044,7 +1066,7 @@ again:
 static struct dma_async_tx_descriptor *
 async_copy_data(int frombio, struct bio *bio, struct page **page,
 	sector_t sector, struct dma_async_tx_descriptor *tx,
-	struct stripe_head *sh)
+	struct stripe_head *sh, int no_skipcopy)
 {
 	struct bio_vec bvl;
 	struct bvec_iter iter;
@@ -1084,7 +1106,8 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
 			if (frombio) {
 				if (sh->raid_conf->skip_copy &&
 				    b_offset == 0 && page_offset == 0 &&
-				    clen == STRIPE_SIZE)
+				    clen == STRIPE_SIZE &&
+				    !no_skipcopy)
 					*page = bio_page;
 				else
 					tx = async_memcpy(*page, bio_page, page_offset,
@@ -1166,7 +1189,7 @@ static void ops_run_biofill(struct stripe_head *sh)
 			while (rbi && rbi->bi_iter.bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, &dev->page,
-					dev->sector, tx, sh);
+						     dev->sector, tx, sh, 0);
 				rbi = r5_next_bio(rbi, dev->sector);
 			}
 		}
@@ -1293,7 +1316,8 @@ static int set_syndrome_sources(struct page **srcs,
 		if (i == sh->qd_idx || i == sh->pd_idx ||
 		    (srctype == SYNDROME_SRC_ALL) ||
 		    (srctype == SYNDROME_SRC_WANT_DRAIN &&
-		     test_bit(R5_Wantdrain, &dev->flags)) ||
+		     (test_bit(R5_Wantdrain, &dev->flags) ||
+		      test_bit(R5_InCache, &dev->flags))) ||
 		    (srctype == SYNDROME_SRC_WRITTEN &&
 		     dev->written))
 			srcs[slot] = sh->dev[i].page;
@@ -1472,9 +1496,25 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
 static void ops_complete_prexor(void *stripe_head_ref)
 {
 	struct stripe_head *sh = stripe_head_ref;
+	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
+
+	if (!r5c_is_writeback(sh->raid_conf->log))
+		return;
+
+	/*
+	 * raid5-cache write back uses orig_page during prexor. after prexor,
+	 * it is time to free orig_page
+	 */
+	for (i = sh->disks; i--; )
+		if (sh->dev[i].page != sh->dev[i].orig_page) {
+			struct page *p = sh->dev[i].page;
+
+			sh->dev[i].page = sh->dev[i].orig_page;
+			put_page(p);
+		}
 }
 
 static struct dma_async_tx_descriptor *
@@ -1496,7 +1536,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
 	for (i = disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 		/* Only process blocks that are known to be uptodate */
-		if (test_bit(R5_Wantdrain, &dev->flags))
+		if (test_bit(R5_Wantdrain, &dev->flags) ||
+		    test_bit(R5_InCache, &dev->flags))
 			xor_srcs[count++] = dev->page;
 	}
 
@@ -1547,6 +1588,10 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
 
 again:
 			dev = &sh->dev[i];
+			if (test_and_clear_bit(R5_InCache, &dev->flags)) {
+				BUG_ON(atomic_read(&sh->dev_in_cache) == 0);
+				atomic_dec(&sh->dev_in_cache);
+			}
 			spin_lock_irq(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
@@ -1566,8 +1611,10 @@ again:
 					set_bit(R5_Discard, &dev->flags);
 				else {
 					tx = async_copy_data(1, wbi, &dev->page,
-						dev->sector, tx, sh);
-					if (dev->page != dev->orig_page) {
+							     dev->sector, tx, sh,
+							     test_bit(R5_Wantcache, &dev->flags));
+					if (dev->page != dev->orig_page &&
+					    !test_bit(R5_Wantcache, &dev->flags)) {
 						set_bit(R5_SkipCopy, &dev->flags);
 						clear_bit(R5_UPTODATE, &dev->flags);
 						clear_bit(R5_OVERWRITE, &dev->flags);
@@ -1675,7 +1722,8 @@ again:
 		xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (head_sh->dev[i].written)
+			if (head_sh->dev[i].written ||
+			    test_bit(R5_InCache, &head_sh->dev[i].flags))
 				xor_srcs[count++] = dev->page;
 		}
 	} else {
@@ -1930,6 +1978,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		INIT_LIST_HEAD(&sh->batch_list);
 		INIT_LIST_HEAD(&sh->lru);
 		atomic_set(&sh->count, 1);
+		atomic_set(&sh->dev_in_cache, 0);
 		for (i = 0; i < disks; i++) {
 			struct r5dev *dev = &sh->dev[i];
 
@@ -2810,12 +2859,30 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 
+			/*
+			 * Initially, handle_stripe_dirtying decided to run rmw
+			 * and allocates extra page for prexor. However, rcw is
+			 * cheaper later on. We need to free the extra page
+			 * now, because we won't be able to do that in
+			 * ops_complete_prexor().
+			 */
+			if (sh->dev[i].page != sh->dev[i].orig_page) {
+				struct page *p = sh->dev[i].page;
+
+				p = sh->dev[i].page;
+				sh->dev[i].page = sh->dev[i].orig_page;
+				put_page(p);
+			}
+
 			if (dev->towrite) {
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantdrain, &dev->flags);
 				if (!expand)
 					clear_bit(R5_UPTODATE, &dev->flags);
 				s->locked++;
+			} else if (test_bit(R5_InCache, &dev->flags)) {
+				set_bit(R5_LOCKED, &dev->flags);
+				s->locked++;
 			}
 		}
 		/* if we are not expanding this is a proper write request, and
@@ -2855,6 +2922,9 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 				set_bit(R5_LOCKED, &dev->flags);
 				clear_bit(R5_UPTODATE, &dev->flags);
 				s->locked++;
+			} else if (test_bit(R5_InCache, &dev->flags)) {
+				set_bit(R5_LOCKED, &dev->flags);
+				s->locked++;
 			}
 		}
 		if (!s->locked)
@@ -3529,9 +3599,12 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
+		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
+		     test_bit(R5_InCache, &dev->flags)) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
-		    !(test_bit(R5_UPTODATE, &dev->flags) ||
+		    !((test_bit(R5_UPTODATE, &dev->flags) &&
+		       (!test_bit(R5_InCache, &dev->flags) ||
+			dev->page != dev->orig_page)) ||
 		      test_bit(R5_Wantcompute, &dev->flags))) {
 			if (test_bit(R5_Insync, &dev->flags))
 				rmw++;
@@ -3543,13 +3616,15 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 		    i != sh->pd_idx && i != sh->qd_idx &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
 		    !(test_bit(R5_UPTODATE, &dev->flags) ||
-		    test_bit(R5_Wantcompute, &dev->flags))) {
+		      test_bit(R5_InCache, &dev->flags) ||
+		      test_bit(R5_Wantcompute, &dev->flags))) {
 			if (test_bit(R5_Insync, &dev->flags))
 				rcw++;
 			else
 				rcw += 2*disks;
 		}
 	}
+
 	pr_debug("for sector %llu, rmw=%d rcw=%d\n",
 		(unsigned long long)sh->sector, rmw, rcw);
 	set_bit(STRIPE_HANDLE, &sh->state);
@@ -3561,10 +3636,18 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 					  (unsigned long long)sh->sector, rmw);
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
+			if (test_bit(R5_InCache, &dev->flags) &&
+			    dev->page == dev->orig_page)
+				dev->page = alloc_page(GFP_NOIO);  /* prexor */
+
+			if ((dev->towrite ||
+			     i == sh->pd_idx || i == sh->qd_idx ||
+			     test_bit(R5_InCache, &dev->flags)) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
-			    !(test_bit(R5_UPTODATE, &dev->flags) ||
-			    test_bit(R5_Wantcompute, &dev->flags)) &&
+			    !((test_bit(R5_UPTODATE, &dev->flags) &&
+			       (!test_bit(R5_InCache, &dev->flags) ||
+				dev->page != dev->orig_page)) ||
+			      test_bit(R5_Wantcompute, &dev->flags)) &&
 			    test_bit(R5_Insync, &dev->flags)) {
 				if (test_bit(STRIPE_PREREAD_ACTIVE,
 					     &sh->state)) {
@@ -3590,6 +3673,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 			    i != sh->pd_idx && i != sh->qd_idx &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
 			    !(test_bit(R5_UPTODATE, &dev->flags) ||
+			      test_bit(R5_InCache, &dev->flags) ||
 			      test_bit(R5_Wantcompute, &dev->flags))) {
 				rcw++;
 				if (test_bit(R5_Insync, &dev->flags) &&
@@ -3629,7 +3713,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
 	 */
 	if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
 	    (s->locked == 0 && (rcw == 0 || rmw == 0) &&
-	    !test_bit(STRIPE_BIT_DELAY, &sh->state)))
+	     !test_bit(STRIPE_BIT_DELAY, &sh->state)))
 		schedule_reconstruction(sh, s, rcw == 0, 0);
 }
 
@@ -4120,6 +4204,10 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 			if (rdev && !test_bit(Faulty, &rdev->flags))
 				do_recovery = 1;
 		}
+		if (test_bit(R5_InCache, &dev->flags) && dev->written)
+			s->just_cached++;
+		if (test_bit(R5_Wantcache, &dev->flags) && dev->written)
+			s->want_cache++;
 	}
 	if (test_bit(STRIPE_SYNCING, &sh->state)) {
 		/* If there is a failed device being replaced,
@@ -4285,6 +4373,17 @@ static void handle_stripe(struct stripe_head *sh)
 
 	analyse_stripe(sh, &s);
 
+	if (s.want_cache) {
+		/* In last run of handle_stripe, we have finished
+		 * r5c_handle_stripe_dirtying and ops_run_biodrain, but
+		 * r5c_cache_data didn't finish because the journal device
+		 * didn't have enough space. This time we should continue
+		 * r5c_cache_data
+		 */
+		s.to_cache = s.want_cache;
+		goto finish;
+	}
+
 	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
 		goto finish;
 
@@ -4348,7 +4447,7 @@ static void handle_stripe(struct stripe_head *sh)
 			struct r5dev *dev = &sh->dev[i];
 			if (test_bit(R5_LOCKED, &dev->flags) &&
 				(i == sh->pd_idx || i == sh->qd_idx ||
-				 dev->written)) {
+				 dev->written || test_bit(R5_InCache, &dev->flags))) {
 				pr_debug("Writing block %d\n", i);
 				set_bit(R5_Wantwrite, &dev->flags);
 				if (prexor)
@@ -4388,6 +4487,10 @@ static void handle_stripe(struct stripe_head *sh)
 				 test_bit(R5_Discard, &qdev->flags))))))
 		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
 
+	if (s.just_cached)
+		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
+	r5l_stripe_write_finished(sh);
+
 	/* Now we might consider reading some blocks, either to check/generate
 	 * parity, or to satisfy requests
 	 * or to load a block that is being partially written.
@@ -6526,6 +6629,11 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
 		INIT_LIST_HEAD(conf->temp_inactive_list + i);
 
+	atomic_set(&conf->r5c_cached_full_stripes, 0);
+	INIT_LIST_HEAD(&conf->r5c_full_stripe_list);
+	atomic_set(&conf->r5c_cached_partial_stripes, 0);
+	INIT_LIST_HEAD(&conf->r5c_partial_stripe_list);
+
 	conf->level = mddev->new_level;
 	conf->chunk_sectors = mddev->new_chunk_sectors;
 	if (raid5_alloc_percpu(conf) != 0)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 8bae64b..ac6d7c7 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -226,6 +226,7 @@ struct stripe_head {
 
 	struct r5l_io_unit	*log_io;
 	struct list_head	log_list;
+	atomic_t		dev_in_cache;
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -263,6 +264,7 @@ struct stripe_head_state {
 	 */
 	int syncing, expanding, expanded, replacing;
 	int locked, uptodate, to_read, to_write, failed, written;
+	int to_cache, want_cache, just_cached;
 	int to_fill, compute, req_compute, non_overwrite;
 	int failed_num[2];
 	int p_failed, q_failed;
@@ -313,6 +315,8 @@ enum r5dev_flags {
 			 */
 	R5_Discard,	/* Discard the stripe */
 	R5_SkipCopy,	/* Don't copy data from bio to stripe cache */
+	R5_Wantcache,	/* Want write data to write cache */
+	R5_InCache,	/* Data in cache */
 };
 
 /*
@@ -348,6 +352,12 @@ enum {
 	STRIPE_LOG_TRAPPED,	/* trapped into log */
 	STRIPE_R5C_FROZEN,	/* r5c_cache frozen and being written out */
 	STRIPE_R5C_WRITTEN,	/* ready for r5c_handle_stripe_written() */
+	STRIPE_R5C_PARTIAL_STRIPE,	/* in r5c cache (to-be/being handled or
+					 * in conf->r5c_partial_stripe_list)
+					 */
+	STRIPE_R5C_FULL_STRIPE,	/* in r5c cache (to-be/being handled or
+				 * in conf->r5c_full_stripe_list)
+				 */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
@@ -600,6 +610,12 @@ struct r5conf {
 	 */
 	atomic_t		active_stripes;
 	struct list_head	inactive_list[NR_STRIPE_HASH_LOCKS];
+
+	atomic_t		r5c_cached_full_stripes;
+	struct list_head	r5c_full_stripe_list;
+	atomic_t		r5c_cached_partial_stripes;
+	struct list_head	r5c_partial_stripe_list;
+
 	atomic_t		empty_inactive_list_nr;
 	struct llist_head	released_stripes;
 	wait_queue_head_t	wait_for_quiescent;
@@ -720,4 +736,10 @@ r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
 			   struct stripe_head_state *s, int disks);
 extern void
 r5c_handle_stripe_written(struct r5conf *conf, struct stripe_head *sh);
+extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
+extern void r5c_handle_cached_data_endio(struct r5conf *conf,
+	struct stripe_head *sh, int disks, struct bio_list *return_bi);
+extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
+			  struct stripe_head_state *s);
+extern void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh);
 #endif
-- 
2.9.3



* [PATCH v5 5/8] md/r5cache: reclaim support
  2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
                   ` (3 preceding siblings ...)
  2016-10-13  5:49 ` [PATCH v5 4/8] md/r5cache: write part of r5cache Song Liu
@ 2016-10-13  5:49 ` Song Liu
  2016-10-19  2:03   ` NeilBrown
  2016-10-13  5:49 ` [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state Song Liu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

There are two limited resources: stripe cache and journal disk space.
For better performance, we prioritize reclaim of full stripe writes.
To free up more journal space, we free the earliest data on the journal.

In the current implementation, reclaim happens:
1. periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 5 seconds), if there
   was no reclaim in the past 5 seconds
2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (32) cached full stripes
   (r5c_check_cached_full_stripe)
3. when there is pressure on the stripe cache (r5c_check_stripe_cache_usage)
4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)

r5c_do_reclaim() contains the new reclaim logic.

For stripe cache:

When stripe cache pressure is high (more than 3/4 of the stripes are
cached, or some inactive list is empty), flush all full stripes. If fewer
than R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2) full stripes
are flushed, also flush some partial stripes. When stripe cache pressure
is moderate (1/2 to 3/4 of the stripes are cached), flush all full stripes.

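Restating the stripe cache policy above as pseudo-code (the actual logic
lives in r5c_do_reclaim() and r5c_check_stripe_cache_usage() in this patch):

	if (total_cached > 3/4 * min_nr_stripes || some inactive list is empty) {
		flush all full stripes;
		if (fewer than R5C_RECLAIM_STRIPE_GROUP full stripes were flushed)
			also flush some partial stripes;
	} else if (total_cached > 1/2 * min_nr_stripes) {
		flush all full stripes;
	}
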
For log space:

To avoid deadlock due to log space, we need to reserve enough space
to flush the cached data. The size of the required log space depends on
the total number of cached stripes (stripe_in_cache_count). In the
current implementation, the reclaim path automatically includes pending
data writes with the parity writes (similar to the write through case).
Therefore, we need up to (conf->raid_disks + 1) pages for each cached
stripe (1 page for meta data, raid_disks pages for all data and
parity). r5c_log_required_to_flush_cache() calculates the log space
required to flush the cache. In the following, we refer to the space
calculated by r5c_log_required_to_flush_cache() as
reclaim_required_space.

Two flags are added to r5conf->cache_state: R5C_LOG_TIGHT and
R5C_LOG_CRITICAL. R5C_LOG_TIGHT is set when free space on the log
device is less than 3x of reclaim_required_space. R5C_LOG_CRITICAL
is set when free space on the log device is less than 2x of
reclaim_required_space.

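A small worked example with made-up numbers: for raid_disks = 6 and 100
stripes on stripe_in_cache_list, each cached stripe reserves (6 + 1) 4KB
blocks = 7 * 8 = 56 sectors, so reclaim_required_space = 100 * 56 = 5600
sectors (~2.7 MiB). R5C_LOG_CRITICAL is set once free log space drops
below 2 * 5600 = 11200 sectors, R5C_LOG_TIGHT below 3 * 5600 = 16800
sectors; because of the hysteresis in r5c_update_log_state(), R5C_LOG_TIGHT
is only cleared again once free space exceeds 4 * 5600 = 22400 sectors.
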
r5c_cache keeps all data in the cache (not yet fully committed to the RAID
disks) in a list (stripe_in_cache_list). These stripes are in the order of
their first appearance on the journal, so the log tail (last_checkpoint)
should point to the log_start of the first item in the list.

When R5C_LOG_TIGHT is set, r5l_reclaim_thread starts flushing out
stripes at the head of stripe_in_cache_list. When R5C_LOG_CRITICAL is
set, the state machine only writes data that is already in the
log device (i.e. stripes on stripe_in_cache_list).

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 390 ++++++++++++++++++++++++++++++++++++++++++++---
 drivers/md/raid5.c       |  17 +++
 drivers/md/raid5.h       |  39 +++--
 3 files changed, 410 insertions(+), 36 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 92d3d7b..9f29bfd2 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -29,12 +29,21 @@
 #define BLOCK_SECTORS (8)
 
 /*
- * reclaim runs every 1/4 disk size or 10G reclaimable space. This can prevent
- * recovery scans a very long log
+ * log->max_free_space is min(1/4 disk size, 10G reclaimable space).
+ *
+ * In write through mode, the reclaim runs every log->max_free_space.
+ * This can prevent the recovery scans for too long
  */
 #define RECLAIM_MAX_FREE_SPACE (10 * 1024 * 1024 * 2) /* sector */
 #define RECLAIM_MAX_FREE_SPACE_SHIFT (2)
 
+/* wake up reclaim thread periodically */
+#define R5C_RECLAIM_WAKEUP_INTERVAL (5 * HZ)
+/* start flush with these full stripes */
+#define R5C_FULL_STRIPE_FLUSH_BATCH 32
+/* reclaim stripes in groups */
+#define R5C_RECLAIM_STRIPE_GROUP (NR_STRIPE_HASH_LOCKS * 2)
+
 /*
  * We only need 2 bios per I/O unit to make progress, but ensure we
  * have a few more available to not get too tight.
@@ -141,6 +150,12 @@ struct r5l_log {
 
 	/* for r5c_cache */
 	enum r5c_state r5c_state;
+
+	/* all stripes in r5cache, in the order of seq at sh->log_start */
+	struct list_head stripe_in_cache_list;
+
+	spinlock_t stripe_in_cache_lock;
+	atomic_t stripe_in_cache_count;
 };
 
 /*
@@ -257,6 +272,85 @@ void r5c_handle_cached_data_endio(struct r5conf *conf,
 }
 
 /*
+ * check whether we should flush some stripes to free up stripe cache
+ */
+void r5c_check_stripe_cache_usage(struct r5conf *conf)
+{
+	int total_cached;
+
+	if (!r5c_is_writeback(conf->log))
+		return;
+
+	total_cached = atomic_read(&conf->r5c_cached_partial_stripes) +
+		atomic_read(&conf->r5c_cached_full_stripes);
+	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
+	    atomic_read(&conf->empty_inactive_list_nr) > 0)
+		r5l_wake_reclaim(conf->log, 0);
+}
+
+/*
+ * flush cache when there are R5C_FULL_STRIPE_FLUSH_BATCH or more full
+ * stripes in the cache
+ */
+void r5c_check_cached_full_stripe(struct r5conf *conf)
+{
+	if (!r5c_is_writeback(conf->log))
+		return;
+	if (atomic_read(&conf->r5c_cached_full_stripes) >=
+	    R5C_FULL_STRIPE_FLUSH_BATCH)
+		r5l_wake_reclaim(conf->log, 0);
+}
+
+/*
+ * Total log space (in sectors) needed to flush all data in cache
+ *
+ * Currently, reclaim path automatically includes all pending writes
+ * to the same sector. So the reclaim of each stripe takes up to
+ * (conf->raid_disks + 1) pages of log space.
+ *
+ * To totally avoid deadlock due to log space, the code reserves
+ * (conf->raid_disks + 1) pages for each stripe in cache, which is not
+ * necessary in most cases.
+ *
+ * To improve this, we will need reclaim path to be able to NOT include
+ * pending writes, which will reduce the requirement to
+ * (conf->max_degraded + 1) pages per stripe in cache.
+ */
+static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
+{
+	struct r5l_log *log = conf->log;
+
+	if (!r5c_is_writeback(log))
+		return 0;
+
+	return BLOCK_SECTORS * (conf->raid_disks + 1) *
+		atomic_read(&log->stripe_in_cache_count);
+}
+
+/* evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL */
+static inline void r5c_update_log_state(struct r5l_log *log)
+{
+	struct r5conf *conf = log->rdev->mddev->private;
+	sector_t free_space = r5l_ring_distance(log, log->log_start,
+						log->last_checkpoint);
+	sector_t reclaim_space = r5c_log_required_to_flush_cache(conf);
+
+	if (!r5c_is_writeback(log))
+		return;
+	else if (free_space < 2 * reclaim_space) {
+		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
+		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
+	} else if (free_space < 3 * reclaim_space) {
+		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
+		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
+	} else if (free_space > 4 * reclaim_space) {
+		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
+		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
+	} else
+		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
+}
+
+/*
  * Freeze the stripe, thus send the stripe into reclaim path.
  *
  * In current implementation, STRIPE_R5C_FROZEN is also set in write through
@@ -290,6 +384,19 @@ void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
 	}
 }
 
+/*
+ * do not release a stripe to cached lists in quiesce
+ */
+void r5c_prepare_stripe_for_release_in_quiesce(struct stripe_head *sh)
+{
+	if (!test_bit(STRIPE_HANDLE, &sh->state) &&
+	    atomic_read(&sh->dev_in_cache) != 0) {
+		if (!test_bit(STRIPE_R5C_FROZEN, &sh->state))
+			r5c_freeze_stripe_for_reclaim(sh);
+		set_bit(STRIPE_HANDLE, &sh->state);
+	}
+}
+
 static void r5c_handle_data_cached(struct stripe_head *sh)
 {
 	int i;
@@ -435,6 +542,7 @@ static void r5_reserve_log_entry(struct r5l_log *log, struct r5l_io_unit *io)
 {
 	log->log_start = r5l_ring_add(log, log->log_start, BLOCK_SECTORS);
 
+	r5c_update_log_state(log);
 	/*
 	 * If we filled up the log device start from the beginning again,
 	 * which will require a new bio.
@@ -552,6 +660,7 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 	int meta_size;
 	int ret;
 	struct r5l_io_unit *io;
+	unsigned long flags;
 
 	meta_size =
 		((sizeof(struct r5l_payload_data_parity) + sizeof(__le32))
@@ -595,20 +704,42 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 	atomic_inc(&io->pending_stripe);
 	sh->log_io = io;
 
+	if (log->r5c_state == R5C_STATE_WRITE_THROUGH)
+		return 0;
+
+	if (sh->log_start == MaxSector) {
+		BUG_ON(!list_empty(&sh->r5c));
+		sh->log_start = io->log_start;
+		spin_lock_irqsave(&log->stripe_in_cache_lock, flags);
+		list_add_tail(&sh->r5c,
+			      &log->stripe_in_cache_list);
+		spin_unlock_irqrestore(&log->stripe_in_cache_lock, flags);
+		atomic_inc(&log->stripe_in_cache_count);
+	}
 	return 0;
 }
 
+/* add stripe to no_space_stripes, and then wake up reclaim */
+static inline void r5l_add_no_space_stripe(struct r5l_log *log, struct stripe_head *sh)
+{
+	spin_lock(&log->no_space_stripes_lock);
+	list_add_tail(&sh->log_list, &log->no_space_stripes);
+	spin_unlock(&log->no_space_stripes_lock);
+}
+
 /*
  * running in raid5d, where reclaim could wait for raid5d too (when it flushes
  * data from log to raid disks), so we shouldn't wait for reclaim here
  */
 int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 {
+	struct r5conf *conf = sh->raid_conf;
 	int write_disks = 0;
 	int data_pages, parity_pages;
 	int reserve;
 	int i;
 	int ret = 0;
+	bool wake_reclaim = false;
 
 	if (!log)
 		return -EAGAIN;
@@ -654,22 +785,49 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
 	mutex_lock(&log->io_mutex);
 	/* meta + data */
 	reserve = (1 + write_disks) << (PAGE_SHIFT - 9);
-	if (!r5l_has_free_space(log, reserve)) {
-		spin_lock(&log->no_space_stripes_lock);
-		list_add_tail(&sh->log_list, &log->no_space_stripes);
-		spin_unlock(&log->no_space_stripes_lock);
 
-		r5l_wake_reclaim(log, reserve);
-	} else {
-		ret = r5l_log_stripe(log, sh, data_pages, parity_pages);
-		if (ret) {
-			spin_lock_irq(&log->io_list_lock);
-			list_add_tail(&sh->log_list, &log->no_mem_stripes);
-			spin_unlock_irq(&log->io_list_lock);
+	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
+		if (!r5l_has_free_space(log, reserve)) {
+			r5l_add_no_space_stripe(log, sh);
+			wake_reclaim = true;
+		} else {
+			ret = r5l_log_stripe(log, sh, data_pages, parity_pages);
+			if (ret) {
+				spin_lock_irq(&log->io_list_lock);
+				list_add_tail(&sh->log_list,
+					      &log->no_mem_stripes);
+				spin_unlock_irq(&log->io_list_lock);
+			}
+		}
+	} else {  /* R5C_STATE_WRITE_BACK */
+		/*
+		 * log space critical, do not process stripes that are
+		 * not in cache yet (sh->log_start == MaxSector).
+		 */
+		if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
+		    sh->log_start == MaxSector) {
+			r5l_add_no_space_stripe(log, sh);
+			wake_reclaim = true;
+			reserve = 0;
+		} else if (!r5l_has_free_space(log, reserve)) {
+			if (sh->log_start == log->last_checkpoint)
+				BUG();
+			else {
+				r5l_add_no_space_stripe(log, sh);
+			}
+		} else {
+			ret = r5l_log_stripe(log, sh, data_pages, parity_pages);
+			if (ret) {
+				spin_lock_irq(&log->io_list_lock);
+				list_add_tail(&sh->log_list, &log->no_mem_stripes);
+				spin_unlock_irq(&log->io_list_lock);
+			}
 		}
 	}
 
 	mutex_unlock(&log->io_mutex);
+	if (wake_reclaim)
+		r5l_wake_reclaim(log, reserve);
 	return 0;
 }
 
@@ -709,6 +867,7 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
 	while (!list_empty(&log->no_space_stripes)) {
 		sh = list_first_entry(&log->no_space_stripes,
 				      struct stripe_head, log_list);
+		set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);
 		list_del_init(&sh->log_list);
 		set_bit(STRIPE_HANDLE, &sh->state);
 		raid5_release_stripe(sh);
@@ -716,10 +875,40 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
 	spin_unlock(&log->no_space_stripes_lock);
 }
 
+/*
+ * calculate new last_checkpoint
+ * for write through mode, returns log->next_checkpoint
+ * for write back, returns log_start of first sh in stripe_in_cache_list
+ */
+static sector_t r5c_calculate_new_cp(struct r5conf *conf)
+{
+	struct stripe_head *sh;
+	struct r5l_log *log = conf->log;
+	sector_t end = MaxSector;
+	unsigned long flags;
+
+	if (log->r5c_state == R5C_STATE_WRITE_THROUGH)
+		return log->next_checkpoint;
+
+	spin_lock_irqsave(&log->stripe_in_cache_lock, flags);
+	if (list_empty(&conf->log->stripe_in_cache_list)) {
+		/* all stripes flushed */
+		spin_unlock_irqrestore(&log->stripe_in_cache_lock, flags);
+		return log->next_checkpoint;
+	}
+	sh = list_first_entry(&conf->log->stripe_in_cache_list,
+			      struct stripe_head, r5c);
+	end = sh->log_start;
+	spin_unlock_irqrestore(&log->stripe_in_cache_lock, flags);
+	return end;
+}
+
 static sector_t r5l_reclaimable_space(struct r5l_log *log)
 {
+	struct r5conf *conf = log->rdev->mddev->private;
+
 	return r5l_ring_distance(log, log->last_checkpoint,
-				 log->next_checkpoint);
+				 r5c_calculate_new_cp(conf));
 }
 
 static void r5l_run_no_mem_stripe(struct r5l_log *log)
@@ -765,6 +954,7 @@ static bool r5l_complete_finished_ios(struct r5l_log *log)
 static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
 {
 	struct r5l_log *log = io->log;
+	struct r5conf *conf = log->rdev->mddev->private;
 	unsigned long flags;
 
 	spin_lock_irqsave(&log->io_list_lock, flags);
@@ -775,7 +965,8 @@ static void __r5l_stripe_write_finished(struct r5l_io_unit *io)
 		return;
 	}
 
-	if (r5l_reclaimable_space(log) > log->max_free_space)
+	if (r5l_reclaimable_space(log) > log->max_free_space ||
+	    test_bit(R5C_LOG_TIGHT, &conf->cache_state))
 		r5l_wake_reclaim(log, 0);
 
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
@@ -898,10 +1089,10 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 
 static void r5l_do_reclaim(struct r5l_log *log)
 {
+	struct r5conf *conf = log->rdev->mddev->private;
 	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
 	sector_t reclaimable;
 	sector_t next_checkpoint;
-	u64 next_cp_seq;
 
 	spin_lock_irq(&log->io_list_lock);
 	/*
@@ -924,8 +1115,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
 				    log->io_list_lock);
 	}
 
-	next_checkpoint = log->next_checkpoint;
-	next_cp_seq = log->next_cp_seq;
+	next_checkpoint = r5c_calculate_new_cp(conf);
 	spin_unlock_irq(&log->io_list_lock);
 
 	BUG_ON(reclaimable < 0);
@@ -941,7 +1131,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
 
 	mutex_lock(&log->io_mutex);
 	log->last_checkpoint = next_checkpoint;
-	log->last_cp_seq = next_cp_seq;
+	r5c_update_log_state(log);
 	mutex_unlock(&log->io_mutex);
 
 	r5l_run_no_space_stripes(log);
@@ -955,6 +1145,7 @@ static void r5l_reclaim_thread(struct md_thread *thread)
 
 	if (!log)
 		return;
+	r5c_do_reclaim(conf);
 	r5l_do_reclaim(log);
 }
 
@@ -963,6 +1154,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 	unsigned long target;
 	unsigned long new = (unsigned long)space; /* overflow in theory */
 
+	if (!log)
+		return;
 	do {
 		target = log->reclaim_target;
 		if (new < target)
@@ -986,11 +1179,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
 			return;
 		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 					log->rdev->mddev, "reclaim");
+		log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
 	} else if (state == 1) {
 		/* make sure r5l_write_super_and_discard_space exits */
 		mddev = log->rdev->mddev;
 		wake_up(&mddev->sb_wait);
-		r5l_wake_reclaim(log, -1L);
+		r5l_wake_reclaim(log, MaxSector);
 		md_unregister_thread(&log->reclaim_thread);
 		r5l_do_reclaim(log);
 	}
@@ -1271,6 +1465,63 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
 	set_bit(MD_CHANGE_DEVS, &mddev->flags);
 }
 
+/*
+ * r5c_flush_stripe moves stripe from cached list to handle_list. When called,
+ * the stripe must be on r5c_cached_full_stripes or r5c_cached_partial_stripes.
+ *
+ * must hold conf->device_lock
+ */
+static void r5c_flush_stripe(struct r5conf *conf, struct stripe_head *sh)
+{
+	BUG_ON(list_empty(&sh->lru));
+	BUG_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
+	BUG_ON(test_bit(STRIPE_HANDLE, &sh->state));
+	assert_spin_locked(&conf->device_lock);
+
+	list_del_init(&sh->lru);
+	atomic_inc(&sh->count);
+
+	set_bit(STRIPE_HANDLE, &sh->state);
+	atomic_inc(&conf->active_stripes);
+	r5c_freeze_stripe_for_reclaim(sh);
+
+	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+		atomic_inc(&conf->preread_active_stripes);
+	raid5_release_stripe(sh);
+}
+
+/*
+ * if num < 0, flush all stripes
+ * if num == 0, flush all full stripes
+ * if num > 0, flush all full stripes. If fewer than num full stripes are
+ *             flushed, flush some partial stripes until a total of num
+ *             stripes are flushed, or there are no more cached stripes.
+ */
+void r5c_flush_cache(struct r5conf *conf, int num)
+{
+	int count;
+	struct stripe_head *sh, *next;
+
+	assert_spin_locked(&conf->device_lock);
+	if (!conf->log)
+		return;
+
+	count = 0;
+	list_for_each_entry_safe(sh, next, &conf->r5c_full_stripe_list, lru) {
+		r5c_flush_stripe(conf, sh);
+		count++;
+	}
+
+	if (num == 0 || (num > 0 && count >= num))
+		return;
+	list_for_each_entry_safe(sh, next,
+				 &conf->r5c_partial_stripe_list, lru) {
+		r5c_flush_stripe(conf, sh);
+		if (num > 0 && count++ >= num)
+			break;
+	}
+}
+
 int r5c_handle_stripe_dirtying(struct r5conf *conf,
 			       struct stripe_head *sh,
 			       struct stripe_head_state *s,
@@ -1327,6 +1578,7 @@ void r5c_handle_stripe_written(struct r5conf *conf,
 {
 	int i;
 	int do_wakeup = 0;
+	unsigned long flags;
 
 	if (!test_and_clear_bit(STRIPE_R5C_WRITTEN, &sh->state))
 		return;
@@ -1349,12 +1601,22 @@ void r5c_handle_stripe_written(struct r5conf *conf,
 
 	if (do_wakeup)
 		wake_up(&conf->wait_for_overlap);
+
+	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH)
+		return;
+
+	spin_lock_irqsave(&conf->log->stripe_in_cache_lock, flags);
+	list_del_init(&sh->r5c);
+	spin_unlock_irqrestore(&conf->log->stripe_in_cache_lock, flags);
+	sh->log_start = MaxSector;
+	atomic_dec(&conf->log->stripe_in_cache_count);
 }
 
 int
 r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 	       struct stripe_head_state *s)
 {
+	struct r5conf *conf = sh->raid_conf;
 	int pages;
 	int reserve;
 	int i;
@@ -1387,12 +1649,15 @@ r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 	mutex_lock(&log->io_mutex);
 	/* meta + data */
 	reserve = (1 + pages) << (PAGE_SHIFT - 9);
-	if (!r5l_has_free_space(log, reserve)) {
-		spin_lock(&log->no_space_stripes_lock);
-		list_add_tail(&sh->log_list, &log->no_space_stripes);
-		spin_unlock(&log->no_space_stripes_lock);
 
-		r5l_wake_reclaim(log, reserve);
+	if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
+	    sh->log_start == MaxSector)
+		r5l_add_no_space_stripe(log, sh);
+	else if (!r5l_has_free_space(log, reserve)) {
+		if (sh->log_start == log->last_checkpoint)
+			BUG();
+		else
+			r5l_add_no_space_stripe(log, sh);
 	} else {
 		ret = r5l_log_stripe(log, sh, pages, 0);
 		if (ret) {
@@ -1406,6 +1671,73 @@ r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 	return 0;
 }
 
+void r5c_do_reclaim(struct r5conf *conf)
+{
+	struct r5l_log *log = conf->log;
+	struct stripe_head *sh;
+	int count = 0;
+	unsigned long flags;
+	int total_cached;
+	int stripes_to_flush;
+
+	if (!r5c_is_writeback(log))
+		return;
+
+	total_cached = atomic_read(&conf->r5c_cached_partial_stripes) +
+		atomic_read(&conf->r5c_cached_full_stripes);
+
+	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
+	    atomic_read(&conf->empty_inactive_list_nr) > 0)
+		/*
+		 * if stripe cache pressure is high, flush all full stripes and
+		 * some partial stripes
+		 */
+		stripes_to_flush = R5C_RECLAIM_STRIPE_GROUP;
+	else if (total_cached > conf->min_nr_stripes * 1 / 2 ||
+		 atomic_read(&conf->r5c_cached_full_stripes) >
+		 R5C_FULL_STRIPE_FLUSH_BATCH)
+		/*
+		 * if stripe cache pressure is moderate, or there are many full stripes,
+		 * flush all full stripes
+		 */
+		stripes_to_flush = 0;
+	else
+		/* no need to flush */
+		stripes_to_flush = -1;
+
+	if (stripes_to_flush >= 0) {
+		spin_lock_irqsave(&conf->device_lock, flags);
+		r5c_flush_cache(conf, stripes_to_flush);
+		spin_unlock_irqrestore(&conf->device_lock, flags);
+	}
+
+	/* if log space is tight, flush stripes on stripe_in_cache_list */
+	if (test_bit(R5C_LOG_TIGHT, &conf->cache_state)) {
+		spin_lock_irqsave(&log->stripe_in_cache_lock, flags);
+		spin_lock(&conf->device_lock);
+		list_for_each_entry(sh, &log->stripe_in_cache_list, r5c) {
+			/*
+			 * stripes on stripe_in_cache_list could be in any
+			 * state of the stripe_cache state machine. In this
+			 * case, we only want to flush stripes on
+			 * r5c_cached_full/partial_stripes. The following
+			 * condition makes sure the stripe is on one of the
+			 * two lists.
+			 */
+			if (!list_empty(&sh->lru) &&
+			    !test_bit(STRIPE_HANDLE, &sh->state) &&
+			    atomic_read(&sh->count) == 0) {
+				r5c_flush_stripe(conf, sh);
+			}
+			if (count++ >= R5C_RECLAIM_STRIPE_GROUP)
+				break;
+		}
+		spin_unlock(&conf->device_lock);
+		spin_unlock_irqrestore(&log->stripe_in_cache_lock, flags);
+	}
+	md_wakeup_thread(conf->mddev->thread);
+}
+
 static int r5l_load_log(struct r5l_log *log)
 {
 	struct md_rdev *rdev = log->rdev;
@@ -1463,6 +1795,9 @@ create:
 	if (log->max_free_space > RECLAIM_MAX_FREE_SPACE)
 		log->max_free_space = RECLAIM_MAX_FREE_SPACE;
 	log->last_checkpoint = cp;
+	mutex_lock(&log->io_mutex);
+	r5c_update_log_state(log);
+	mutex_unlock(&log->io_mutex);
 
 	__free_page(page);
 
@@ -1534,6 +1869,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 						 log->rdev->mddev, "reclaim");
 	if (!log->reclaim_thread)
 		goto reclaim_thread;
+	log->reclaim_thread->timeout = R5C_RECLAIM_WAKEUP_INTERVAL;
+
 	init_waitqueue_head(&log->iounit_wait);
 
 	INIT_LIST_HEAD(&log->no_mem_stripes);
@@ -1542,6 +1879,9 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	spin_lock_init(&log->no_space_stripes_lock);
 
 	log->r5c_state = R5C_STATE_WRITE_THROUGH;
+	INIT_LIST_HEAD(&log->stripe_in_cache_list);
+	spin_lock_init(&log->stripe_in_cache_lock);
+	atomic_set(&log->stripe_in_cache_count, 0);
 
 	if (r5l_load_log(log))
 		goto error;
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0539f34..beebec1 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -220,6 +220,11 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 {
 	BUG_ON(!list_empty(&sh->lru));
 	BUG_ON(atomic_read(&conf->active_stripes)==0);
+
+	/* When quiescing in r5c write-back mode, make sure the stripe is handled */
+	if (conf->quiesce && r5c_is_writeback(conf->log))
+		r5c_prepare_stripe_for_release_in_quiesce(sh);
+
 	if (test_bit(STRIPE_HANDLE, &sh->state)) {
 		if (test_bit(STRIPE_DELAYED, &sh->state) &&
 		    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
@@ -256,6 +261,7 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 				if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
 					atomic_dec(&conf->r5c_cached_partial_stripes);
 				list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
+				r5c_check_cached_full_stripe(conf);
 			} else {
 				/* partial stripe */
 				if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
@@ -626,9 +632,12 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 			}
 			if (noblock && sh == NULL)
 				break;
+
+			r5c_check_stripe_cache_usage(conf);
 			if (!sh) {
 				set_bit(R5_INACTIVE_BLOCKED,
 					&conf->cache_state);
+				r5l_wake_reclaim(conf->log, 0);
 				wait_event_lock_irq(
 					conf->wait_for_stripe,
 					!list_empty(conf->inactive_list + hash) &&
@@ -854,6 +863,7 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
 
 	if (r5l_write_stripe(conf->log, sh) == 0)
 		return;
+
 	for (i = disks; i--; ) {
 		int op, op_flags = 0;
 		int replace_only = 0;
@@ -1977,8 +1987,10 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
 		spin_lock_init(&sh->batch_lock);
 		INIT_LIST_HEAD(&sh->batch_list);
 		INIT_LIST_HEAD(&sh->lru);
+		INIT_LIST_HEAD(&sh->r5c);
 		atomic_set(&sh->count, 1);
 		atomic_set(&sh->dev_in_cache, 0);
+		sh->log_start = MaxSector;
 		for (i = 0; i < disks; i++) {
 			struct r5dev *dev = &sh->dev[i];
 
@@ -4748,6 +4760,10 @@ static int raid5_congested(struct mddev *mddev, int bits)
 
 	if (test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state))
 		return 1;
+
+	/* Also checks whether there is pressure on r5cache log space */
+	if (test_bit(R5C_LOG_TIGHT, &conf->cache_state))
+		return 1;
 	if (conf->quiesce)
 		return 1;
 	if (atomic_read(&conf->empty_inactive_list_nr))
@@ -7713,6 +7729,7 @@ static void raid5_quiesce(struct mddev *mddev, int state)
 		/* '2' tells resync/reshape to pause so that all
 		 * active stripes can drain
 		 */
+		r5c_flush_cache(conf, -1);
 		conf->quiesce = 2;
 		wait_event_cmd(conf->wait_for_quiescent,
 				    atomic_read(&conf->active_stripes) == 0 &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index ac6d7c7..e63a7b0 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -227,6 +227,8 @@ struct stripe_head {
 	struct r5l_io_unit	*log_io;
 	struct list_head	log_list;
 	atomic_t		dev_in_cache;
+	sector_t		log_start; /* first meta block on the journal */
+	struct list_head	r5c; /* for r5c_cache->stripe_in_cache */
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
@@ -521,6 +523,26 @@ struct r5worker_group {
 	int stripes_cnt;
 };
 
+enum r5_cache_state {
+	R5_INACTIVE_BLOCKED,	/* release of inactive stripes blocked,
+				 * waiting for 25% to be free
+				 */
+	R5_ALLOC_MORE,		/* It might help to allocate another
+				 * stripe.
+				 */
+	R5_DID_ALLOC,		/* A stripe was allocated, don't allocate
+				 * more until at least one has been
+				 * released.  This avoids flooding
+				 * the cache.
+				 */
+	R5C_LOG_TIGHT,		/* journal device space tight, need to
+				 * prioritize stripes at last_checkpoint
+				 */
+	R5C_LOG_CRITICAL,	/* journal device is running out of space,
+				 * only process stripes at last_checkpoint
+				 */
+};
+
 struct r5conf {
 	struct hlist_head	*stripe_hashtbl;
 	/* only protect corresponding hash list and inactive_list */
@@ -622,17 +644,6 @@ struct r5conf {
 	wait_queue_head_t	wait_for_stripe;
 	wait_queue_head_t	wait_for_overlap;
 	unsigned long		cache_state;
-#define R5_INACTIVE_BLOCKED	1	/* release of inactive stripes blocked,
-					 * waiting for 25% to be free
-					 */
-#define R5_ALLOC_MORE		2	/* It might help to allocate another
-					 * stripe.
-					 */
-#define R5_DID_ALLOC		4	/* A stripe was allocated, don't allocate
-					 * more until at least one has been
-					 * released.  This avoids flooding
-					 * the cache.
-					 */
 	struct shrinker		shrinker;
 	int			pool_size; /* number of disks in stripeheads in pool */
 	spinlock_t		device_lock;
@@ -742,4 +753,10 @@ extern void r5c_handle_cached_data_endio(struct r5conf *conf,
 extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 			  struct stripe_head_state *s);
 extern void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh);
+extern void r5c_prepare_stripe_for_release_in_quiesce(struct stripe_head *sh);
+extern void r5c_do_reclaim(struct r5conf *conf);
+extern void r5c_flush_cache(struct r5conf *conf, int num);
+extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
+extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+
 #endif
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state
  2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
                   ` (4 preceding siblings ...)
  2016-10-13  5:49 ` [PATCH v5 5/8] md/r5cache: reclaim support Song Liu
@ 2016-10-13  5:49 ` Song Liu
  2016-10-19  2:19   ` NeilBrown
  2016-10-13  5:49 ` [PATCH v5 7/8] md/r5cache: r5c recovery Song Liu
  2016-10-13  5:49 ` [PATCH v5 8/8] md/r5cache: handle SYNC and FUA Song Liu
  7 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

r5c_state has 4 states:
* no-cache;
* write-through (write journal only);
* write-back (w/ write cache);
* cache-broken (journal missing or Faulty)

When there is a functional write cache, r5c_state is a knob to
switch between write-back and write-through.

When the journal device is broken, the raid array is forced
into read-only mode. In this case, r5c_state can be used to
remove the "journal feature", and thus make the array read-write
without a journal. By writing into r5c_state, the array
can transition from cache-broken to no-cache, which removes
the journal feature from the array.

To remove the journal feature:
- When the journal fails, the raid array is forced into read-only
  mode (enforced by the kernel)
- The user uses the new interface to remove the journal (writing
  "no-cache" to r5c_state, as in the sketch below; I will add an
  mdadm option for that later)
- The user forces the array read-write;
- The kernel updates the superblock and the array can run read/write.
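
As an illustration only, the sysfs interaction could look like the
following userspace sketch. The attribute path is an assumption (it
depends on the md device name); the r5c_state attribute itself is the
one added by this patch:

#include <stdio.h>

int main(void)
{
	/* assumed path for /dev/md0 */
	FILE *f = fopen("/sys/block/md0/md/r5c_state", "w");

	if (!f)
		return 1;
	/* accepted values: "no-cache", "write-through", "write-back" */
	fputs("write-back\n", f);
	return fclose(f) ? 1 : 0;
}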

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.c       |  1 +
 drivers/md/raid5.h       |  1 +
 3 files changed, 89 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 9f29bfd2..e2fc9f5 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -57,6 +57,8 @@ enum r5c_state {
 	R5C_STATE_CACHE_BROKEN = 3,
 };
 
+static char *r5c_state_str[] = {"no-cache", "write-through",
+				"write-back", "cache-broken"};
 /*
  * raid5 cache state machine
  *
@@ -1522,6 +1524,91 @@ void r5c_flush_cache(struct r5conf *conf, int num)
 	}
 }
 
+ssize_t r5c_state_show(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	int val = 0;
+	int ret;
+
+	if (conf->log)
+		val = conf->log->r5c_state;
+	else if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
+		val = R5C_STATE_CACHE_BROKEN;
+
+	switch (val) {
+	case R5C_STATE_NO_CACHE:
+		ret = snprintf(page, PAGE_SIZE, "%s\n", r5c_state_str[val]);
+		break;
+	case R5C_STATE_CACHE_BROKEN:
+		ret = snprintf(page, PAGE_SIZE, "%s [%s]\n",
+			       r5c_state_str[R5C_STATE_NO_CACHE],
+			       r5c_state_str[R5C_STATE_CACHE_BROKEN]);
+		break;
+	case R5C_STATE_WRITE_THROUGH:
+		ret = snprintf(page, PAGE_SIZE, "[%s] %s\n",
+			       r5c_state_str[R5C_STATE_WRITE_THROUGH],
+			       r5c_state_str[R5C_STATE_WRITE_BACK]);
+		break;
+	case R5C_STATE_WRITE_BACK:
+		ret = snprintf(page, PAGE_SIZE, "%s [%s]\n",
+			       r5c_state_str[R5C_STATE_WRITE_THROUGH],
+			       r5c_state_str[R5C_STATE_WRITE_BACK]);
+		break;
+	default:
+		pr_err("md/raid:%s: Detected invalid r5c_state %d\n",
+		       mdname(mddev), val);
+	}
+	return ret;
+}
+
+ssize_t r5c_state_store(struct mddev *mddev, const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	struct r5l_log *log = conf->log;
+	int val = -1, i;
+
+	for (i = 0; i < sizeof(r5c_state_str) / sizeof(r5c_state_str[0]); i++)
+		if (strlen(r5c_state_str[i]) == len - 1 &&
+		    strncmp(page, r5c_state_str[i], len - 1) == 0) {
+			val = i;
+			break;
+		}
+	if (val == -1)
+		return -EINVAL;
+
+	if (!log && val != R5C_STATE_NO_CACHE)
+		return -EINVAL;
+
+	if (val < R5C_STATE_NO_CACHE || val > R5C_STATE_WRITE_BACK)
+		return -EINVAL;
+	if (val == R5C_STATE_NO_CACHE) {
+		if (conf->log &&
+		    !test_bit(Faulty, &log->rdev->flags)) {
+			pr_err("md/raid:%s: journal device is in use, cannot remove it\n",
+			       mdname(mddev));
+			return -EINVAL;
+		}
+	}
+
+	if (log) {
+		mddev_suspend(mddev);
+		conf->log->r5c_state = val;
+		mddev_resume(mddev);
+	}
+
+	if (val == R5C_STATE_NO_CACHE) {
+		clear_bit(MD_HAS_JOURNAL, &mddev->flags);
+		set_bit(MD_UPDATE_SB_FLAGS, &mddev->flags);
+	}
+	pr_info("md/raid:%s: setting r5c cache mode to %d: %s\n",
+		mdname(mddev), val, r5c_state_str[val]);
+	return len;
+}
+
+struct md_sysfs_entry
+r5c_state = __ATTR(r5c_state, S_IRUGO | S_IWUSR,
+		   r5c_state_show, r5c_state_store);
+
 int r5c_handle_stripe_dirtying(struct r5conf *conf,
 			       struct stripe_head *sh,
 			       struct stripe_head_state *s,
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index beebec1..187843c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6304,6 +6304,7 @@ static struct attribute *raid5_attrs[] =  {
 	&raid5_group_thread_cnt.attr,
 	&raid5_skip_copy.attr,
 	&raid5_rmw_level.attr,
+	&r5c_state.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index e63a7b0..6311892 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -758,5 +758,6 @@ extern void r5c_do_reclaim(struct r5conf *conf);
 extern void r5c_flush_cache(struct r5conf *conf, int num);
 extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
+extern struct md_sysfs_entry r5c_state;
 
 #endif
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 7/8] md/r5cache: r5c recovery
  2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
                   ` (5 preceding siblings ...)
  2016-10-13  5:49 ` [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state Song Liu
@ 2016-10-13  5:49 ` Song Liu
  2016-10-19  2:32   ` NeilBrown
  2016-10-20  3:03   ` NeilBrown
  2016-10-13  5:49 ` [PATCH v5 8/8] md/r5cache: handle SYNC and FUA Song Liu
  7 siblings, 2 replies; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

This is the recovery part of raid5-cache.

With the cache feature, there are 2 different scenarios of recovery:
1. Data-Parity stripe: a stripe with complete parity in journal.
2. Data-Only stripe: a stripe with only data in journal (or partial
   parity).

The code differentiates a Data-Parity stripe from a Data-Only stripe
with the STRIPE_R5C_WRITTEN flag.

For Data-Parity stripes, we use the same procedure as raid5 journal,
where all the data and parity are replayed to the RAID devices.

For Data-Only stripes, we need to compute the parity and finish the
full reconstruct write or RMW write. For simplicity, during the
recovery we load the stripe into the stripe cache. Once the array is
started, the stripe cache state machine will handle these stripes
through the normal write path.
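
A toy restatement (userspace only, made-up names) of the two
treatments described above, once a cached stripe has been classified:

enum stripe_class { DATA_ONLY, DATA_PARITY };

/* what recovery does with a cached stripe after the scan has finished */
static const char *recovery_action(enum stripe_class c)
{
	if (c == DATA_PARITY)
		/* parity is already in the journal: replay data and parity
		 * straight to the RAID disks */
		return "replay to RAID disks";
	/* no complete parity in the journal: keep the data in the stripe
	 * cache and let the normal write path compute parity later */
	return "load into stripe cache";
}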

r5c_recovery_flush_log contains the main procedure of the recovery.
The recovery code first scans through the journal and loads data into
the stripe cache. The code keeps track of all these stripes in a list
(using sh->lru and ctx->cached_list); stripes in the list are ordered
by their first appearance on the journal. During the scan, the
recovery code classifies each stripe as Data-Parity or Data-Only.

During the scan, the array may run out of stripe cache entries. In
that case, the recovery code calls raid5_set_cache_size to increase
the stripe cache size.

At the end of the scan, the recovery code replays all Data-Parity
stripes and sets the proper state for Data-Only stripes. The recovery
code also increases the seq number by 10 and rewrites all Data-Only
stripes to the journal. This avoids confusion after repeated crashes.
More details are explained in the comment in raid5-cache.c before
r5c_recovery_rewrite_data_only_stripes().
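
The rewrite relies on the scanner only accepting a meta block whose
sequence number and position match what it expects next. A toy,
userspace-only restatement of that rule (made-up names; the real check
is r5l_recovery_read_meta_block, which also verifies the magic value
and a crc32c checksum over the block):

#include <stdbool.h>
#include <stdint.h>

struct toy_meta {
	uint64_t position;	/* where the block claims to live */
	uint64_t seq;		/* its sequence number */
};

static bool toy_meta_valid(const struct toy_meta *mb,
			   uint64_t expect_pos, uint64_t expect_seq)
{
	return mb->position == expect_pos && mb->seq == expect_seq;
}

Because the rewritten Data-Only stripes are logged with seq bumped by
10, stale meta blocks left over from before the crash fail this check
and can never be mistaken for part of the new log.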

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 673 +++++++++++++++++++++++++++++++++++++----------
 1 file changed, 539 insertions(+), 134 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index e2fc9f5..d3a68da 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -1,5 +1,6 @@
 /*
  * Copyright (C) 2015 Shaohua Li <shli@fb.com>
+ * Copyright (C) 2016 Song Liu <songliubraving@fb.com>
  *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
@@ -1213,10 +1214,13 @@ struct r5l_recovery_ctx {
 	sector_t meta_total_blocks;	/* total size of current meta and data */
 	sector_t pos;			/* recovery position */
 	u64 seq;			/* recovery position seq */
+	int data_parity_stripes;	/* number of data_parity stripes */
+	int data_only_stripes;		/* number of data_only stripes */
+	struct list_head cached_list;
 };
 
-static int r5l_read_meta_block(struct r5l_log *log,
-			       struct r5l_recovery_ctx *ctx)
+static int r5l_recovery_read_meta_block(struct r5l_log *log,
+					struct r5l_recovery_ctx *ctx)
 {
 	struct page *page = ctx->meta_page;
 	struct r5l_meta_block *mb;
@@ -1248,170 +1252,569 @@ static int r5l_read_meta_block(struct r5l_log *log,
 	return 0;
 }
 
-static int r5l_recovery_flush_one_stripe(struct r5l_log *log,
-					 struct r5l_recovery_ctx *ctx,
-					 sector_t stripe_sect,
-					 int *offset, sector_t *log_offset)
+/*
+ * r5l_recovery_load_data and r5l_recovery_load_parity use flag R5_Wantwrite
+ * to mark valid (potentially not flushed) data in the journal.
+ *
+ * We already verified checksum in r5l_recovery_verify_data_checksum_for_mb,
+ * so there should not be any mismatch here.
+ */
+static void r5l_recovery_load_data(struct r5l_log *log,
+				   struct stripe_head *sh,
+				   struct r5l_recovery_ctx *ctx,
+				   struct r5l_payload_data_parity *payload,
+				   sector_t log_offset)
 {
-	struct r5conf *conf = log->rdev->mddev->private;
-	struct stripe_head *sh;
-	struct r5l_payload_data_parity *payload;
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
 	int disk_index;
 
-	sh = raid5_get_active_stripe(conf, stripe_sect, 0, 0, 0);
-	while (1) {
-		payload = page_address(ctx->meta_page) + *offset;
+	raid5_compute_sector(conf,
+			     le64_to_cpu(payload->location), 0,
+			     &disk_index, sh);
+	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
+		     sh->dev[disk_index].page, REQ_OP_READ, 0, false);
+	sh->dev[disk_index].log_checksum =
+		le32_to_cpu(payload->checksum[0]);
+	ctx->meta_total_blocks += BLOCK_SECTORS;
 
-		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_DATA) {
-			raid5_compute_sector(conf,
-					     le64_to_cpu(payload->location), 0,
-					     &disk_index, sh);
+	set_bit(R5_Wantwrite, &sh->dev[disk_index].flags);
+}
 
-			sync_page_io(log->rdev, *log_offset, PAGE_SIZE,
-				     sh->dev[disk_index].page, REQ_OP_READ, 0,
-				     false);
-			sh->dev[disk_index].log_checksum =
-				le32_to_cpu(payload->checksum[0]);
-			set_bit(R5_Wantwrite, &sh->dev[disk_index].flags);
-			ctx->meta_total_blocks += BLOCK_SECTORS;
-		} else {
-			disk_index = sh->pd_idx;
-			sync_page_io(log->rdev, *log_offset, PAGE_SIZE,
-				     sh->dev[disk_index].page, REQ_OP_READ, 0,
-				     false);
-			sh->dev[disk_index].log_checksum =
-				le32_to_cpu(payload->checksum[0]);
-			set_bit(R5_Wantwrite, &sh->dev[disk_index].flags);
-
-			if (sh->qd_idx >= 0) {
-				disk_index = sh->qd_idx;
-				sync_page_io(log->rdev,
-					     r5l_ring_add(log, *log_offset, BLOCK_SECTORS),
-					     PAGE_SIZE, sh->dev[disk_index].page,
-					     REQ_OP_READ, 0, false);
-				sh->dev[disk_index].log_checksum =
-					le32_to_cpu(payload->checksum[1]);
-				set_bit(R5_Wantwrite,
-					&sh->dev[disk_index].flags);
-			}
-			ctx->meta_total_blocks += BLOCK_SECTORS * conf->max_degraded;
-		}
+static void r5l_recovery_load_parity(struct r5l_log *log,
+				     struct stripe_head *sh,
+				     struct r5l_recovery_ctx *ctx,
+				     struct r5l_payload_data_parity *payload,
+				     sector_t log_offset)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
 
-		*log_offset = r5l_ring_add(log, *log_offset,
-					   le32_to_cpu(payload->size));
-		*offset += sizeof(struct r5l_payload_data_parity) +
-			sizeof(__le32) *
-			(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
-		if (le16_to_cpu(payload->header.type) == R5LOG_PAYLOAD_PARITY)
-			break;
+	ctx->meta_total_blocks += BLOCK_SECTORS * conf->max_degraded;
+	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
+		     sh->dev[sh->pd_idx].page, REQ_OP_READ, 0, false);
+	sh->dev[sh->pd_idx].log_checksum =
+		le32_to_cpu(payload->checksum[0]);
+	set_bit(R5_Wantwrite, &sh->dev[sh->pd_idx].flags);
+
+	if (sh->qd_idx >= 0) {
+		sync_page_io(log->rdev,
+			     r5l_ring_add(log, log_offset, BLOCK_SECTORS),
+			     PAGE_SIZE, sh->dev[sh->qd_idx].page,
+			     REQ_OP_READ, 0, false);
+		sh->dev[sh->qd_idx].log_checksum =
+			le32_to_cpu(payload->checksum[1]);
+		set_bit(R5_Wantwrite, &sh->dev[sh->qd_idx].flags);
 	}
+	set_bit(STRIPE_R5C_WRITTEN, &sh->state);
+}
 
-	for (disk_index = 0; disk_index < sh->disks; disk_index++) {
-		void *addr;
-		u32 checksum;
+static void r5l_recovery_reset_stripe(struct stripe_head *sh)
+{
+	int i;
 
+	sh->state = 0;
+	sh->log_start = MaxSector;
+	for (i = sh->disks; i--; )
+		sh->dev[i].flags = 0;
+}
+
+static void
+r5l_recovery_replay_one_stripe(struct r5conf *conf,
+			       struct stripe_head *sh,
+			       struct r5l_recovery_ctx *ctx)
+{
+	struct md_rdev *rdev, *rrdev;
+	int disk_index;
+	int data_count = 0;
+
+	for (disk_index = 0; disk_index < sh->disks; disk_index++) {
 		if (!test_bit(R5_Wantwrite, &sh->dev[disk_index].flags))
 			continue;
-		addr = kmap_atomic(sh->dev[disk_index].page);
-		checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE);
-		kunmap_atomic(addr);
-		if (checksum != sh->dev[disk_index].log_checksum)
-			goto error;
+		if (disk_index == sh->qd_idx || disk_index == sh->pd_idx)
+			continue;
+		data_count++;
 	}
+	/* stripes that only have parity are already flushed to RAID */
+	if (data_count == 0)
+		goto out;
 
 	for (disk_index = 0; disk_index < sh->disks; disk_index++) {
-		struct md_rdev *rdev, *rrdev;
-
-		if (!test_and_clear_bit(R5_Wantwrite,
-					&sh->dev[disk_index].flags))
+		if (!test_bit(R5_Wantwrite, &sh->dev[disk_index].flags))
 			continue;
 
 		/* in case device is broken */
 		rdev = rcu_dereference(conf->disks[disk_index].rdev);
 		if (rdev)
-			sync_page_io(rdev, stripe_sect, PAGE_SIZE,
+			sync_page_io(rdev, sh->sector, PAGE_SIZE,
 				     sh->dev[disk_index].page, REQ_OP_WRITE, 0,
 				     false);
 		rrdev = rcu_dereference(conf->disks[disk_index].replacement);
 		if (rrdev)
-			sync_page_io(rrdev, stripe_sect, PAGE_SIZE,
+			sync_page_io(rrdev, sh->sector, PAGE_SIZE,
 				     sh->dev[disk_index].page, REQ_OP_WRITE, 0,
 				     false);
 	}
-	raid5_release_stripe(sh);
+	ctx->data_parity_stripes++;
+out:
+	r5l_recovery_reset_stripe(sh);
+}
+
+static void
+r5l_recovery_create_emtpy_meta_block(struct r5l_log *log,
+				     struct page *page,
+				     sector_t pos, u64 seq)
+{
+	struct r5l_meta_block *mb;
+	u32 crc;
+
+	mb = page_address(page);
+	clear_page(mb);
+	mb->magic = cpu_to_le32(R5LOG_MAGIC);
+	mb->version = R5LOG_VERSION;
+	mb->meta_size = cpu_to_le32(sizeof(struct r5l_meta_block));
+	mb->seq = cpu_to_le64(seq);
+	mb->position = cpu_to_le64(pos);
+	crc = crc32c_le(log->uuid_checksum, mb, PAGE_SIZE);
+	mb->checksum = cpu_to_le32(crc);
+}
+
+static int r5l_log_write_empty_meta_block(struct r5l_log *log, sector_t pos,
+					  u64 seq)
+{
+	struct page *page;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+	r5l_recovery_create_emtpy_meta_block(log, page, pos, seq);
+	if (!sync_page_io(log->rdev, pos, PAGE_SIZE, page, REQ_OP_WRITE,
+			  WRITE_FUA, false)) {
+		__free_page(page);
+		return -EIO;
+	}
+	__free_page(page);
 	return 0;
+}
 
-error:
-	for (disk_index = 0; disk_index < sh->disks; disk_index++)
-		sh->dev[disk_index].flags = 0;
-	raid5_release_stripe(sh);
-	return -EINVAL;
+static struct stripe_head *
+r5c_recovery_alloc_stripe(struct r5conf *conf,
+			  struct list_head *recovery_list,
+			  sector_t stripe_sect,
+			  sector_t log_start)
+{
+	struct stripe_head *sh;
+
+	sh = raid5_get_active_stripe(conf, stripe_sect, 0, 1, 0);
+	if (!sh)
+		return NULL;  /* no more stripe available */
+
+	r5l_recovery_reset_stripe(sh);
+	sh->log_start = log_start;
+
+	return sh;
 }
 
-static int r5l_recovery_flush_one_meta(struct r5l_log *log,
-				       struct r5l_recovery_ctx *ctx)
+static struct stripe_head *
+r5c_recovery_lookup_stripe(struct list_head *list, sector_t sect)
 {
-	struct r5conf *conf = log->rdev->mddev->private;
+	struct stripe_head *sh;
+
+	list_for_each_entry(sh, list, lru)
+		if (sh->sector == sect)
+			return sh;
+	return NULL;
+}
+
+static void
+r5c_recovery_replay_stripes(struct list_head *cached_stripe_list,
+			    struct r5l_recovery_ctx *ctx)
+{
+	struct stripe_head *sh, *next;
+
+	list_for_each_entry_safe(sh, next, cached_stripe_list, lru)
+		if (test_bit(STRIPE_R5C_WRITTEN, &sh->state)) {
+			r5l_recovery_replay_one_stripe(sh->raid_conf, sh, ctx);
+			list_del_init(&sh->lru);
+			raid5_release_stripe(sh);
+		}
+}
+
+/* returns 0 for match; 1 for mismatch */
+static int
+r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
+				  sector_t log_offset, __le32 log_checksum)
+{
+	void *addr;
+	u32 checksum;
+
+	sync_page_io(log->rdev, log_offset, PAGE_SIZE,
+		     page, REQ_OP_READ, 0, false);
+	addr = kmap_atomic(page);
+	checksum = crc32c_le(log->uuid_checksum, addr, PAGE_SIZE);
+	kunmap_atomic(addr);
+	return le32_to_cpu(log_checksum) != checksum;
+}
+
+/*
+ * Before loading data into the stripe cache, we need to verify the checksums
+ * for all data; if any data page mismatches, we drop all data in the meta block
+ */
+static int
+r5l_recovery_verify_data_checksum_for_mb(struct r5l_log *log,
+					 struct r5l_recovery_ctx *ctx)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
+	struct r5l_meta_block *mb = page_address(ctx->meta_page);
+	sector_t mb_offset = sizeof(struct r5l_meta_block);
+	sector_t log_offset = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
+	struct page *page;
 	struct r5l_payload_data_parity *payload;
+
+	page = alloc_page(GFP_KERNEL);
+	if (!page)
+		return -ENOMEM;
+
+	while (mb_offset < le32_to_cpu(mb->meta_size)) {
+		payload = (void *)mb + mb_offset;
+
+		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
+			if (r5l_recovery_verify_data_checksum(
+				    log, page, log_offset,
+				    payload->checksum[0]))
+				goto mismatch;
+		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY) {
+			if (r5l_recovery_verify_data_checksum(
+				    log, page, log_offset,
+				    payload->checksum[0]))
+				goto mismatch;
+			if (conf->max_degraded == 2 && /* q for RAID 6 */
+			    r5l_recovery_verify_data_checksum(
+				    log, page,
+				    r5l_ring_add(log, log_offset,
+						 BLOCK_SECTORS),
+				    payload->checksum[1]))
+				goto mismatch;
+		} else
+			goto mismatch;
+
+		log_offset = r5l_ring_add(log, log_offset,
+					  le32_to_cpu(payload->size));
+
+		mb_offset += sizeof(struct r5l_payload_data_parity) +
+			sizeof(__le32) *
+			(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
+	}
+
+	put_page(page);
+	return 0;
+
+mismatch:
+	put_page(page);
+	return -EINVAL;
+}
+
+static int
+r5c_recovery_analyze_meta_block(struct r5l_log *log,
+				struct r5l_recovery_ctx *ctx,
+				struct list_head *cached_stripe_list)
+{
+	struct mddev *mddev = log->rdev->mddev;
+	struct r5conf *conf = mddev->private;
 	struct r5l_meta_block *mb;
-	int offset;
+	struct r5l_payload_data_parity *payload;
+	int mb_offset;
 	sector_t log_offset;
-	sector_t stripe_sector;
+	sector_t stripe_sect;
+	struct stripe_head *sh;
+	int ret;
+
+	/*
+	 * for mismatch in data blocks, we will drop all data in this mb, but
+	 * we will still read next mb for other data with FLUSH flag, as
+	 * io_unit could finish out of order.
+	 */
+	ret = r5l_recovery_verify_data_checksum_for_mb(log, ctx);
+	if (ret == -EINVAL)
+		return -EAGAIN;
+	else if (ret)
+		return ret;
 
 	mb = page_address(ctx->meta_page);
-	offset = sizeof(struct r5l_meta_block);
+	mb_offset = sizeof(struct r5l_meta_block);
 	log_offset = r5l_ring_add(log, ctx->pos, BLOCK_SECTORS);
 
-	while (offset < le32_to_cpu(mb->meta_size)) {
+	while (mb_offset < le32_to_cpu(mb->meta_size)) {
 		int dd;
 
-		payload = (void *)mb + offset;
-		stripe_sector = raid5_compute_sector(conf,
-						     le64_to_cpu(payload->location), 0, &dd, NULL);
-		if (r5l_recovery_flush_one_stripe(log, ctx, stripe_sector,
-						  &offset, &log_offset))
+		payload = (void *)mb + mb_offset;
+		stripe_sect = (payload->header.type == R5LOG_PAYLOAD_DATA) ?
+			raid5_compute_sector(
+				conf, le64_to_cpu(payload->location), 0, &dd,
+				NULL)
+			: le64_to_cpu(payload->location);
+
+		sh = r5c_recovery_lookup_stripe(cached_stripe_list,
+						stripe_sect);
+
+		if (!sh) {
+			sh = r5c_recovery_alloc_stripe(conf, cached_stripe_list,
+						       stripe_sect, ctx->pos);
+			if (!sh) {
+				pr_info("md/raid:%s: Increasing stripe cache size to %d to recover data from the journal.\n",
+					mdname(mddev),
+					conf->min_nr_stripes * 2);
+				raid5_set_cache_size(mddev,
+						     conf->min_nr_stripes * 2);
+				sh = r5c_recovery_alloc_stripe(
+					conf, cached_stripe_list, stripe_sect,
+					ctx->pos);
+			}
+			if (!sh) {
+				pr_err("md/raid:%s: Cannot get enough stripe_cache. Recovery interrupted.\n",
+				       mdname(mddev));
+				return -ENOMEM;
+			}
+			list_add_tail(&sh->lru, cached_stripe_list);
+		}
+		if (!sh)
+			return -ENOMEM;
+
+		if (payload->header.type == R5LOG_PAYLOAD_DATA) {
+			if (test_bit(STRIPE_R5C_WRITTEN, &sh->state)) {
+				r5l_recovery_reset_stripe(sh);
+				sh->log_start = ctx->pos;
+				list_move_tail(&sh->lru, cached_stripe_list);
+			}
+			r5l_recovery_load_data(log, sh, ctx, payload,
+					       log_offset);
+		} else if (payload->header.type == R5LOG_PAYLOAD_PARITY)
+			r5l_recovery_load_parity(log, sh, ctx, payload,
+						 log_offset);
+		else
 			return -EINVAL;
+
+		log_offset = r5l_ring_add(log, log_offset,
+					  le32_to_cpu(payload->size));
+
+		mb_offset += sizeof(struct r5l_payload_data_parity) +
+			sizeof(__le32) *
+			(le32_to_cpu(payload->size) >> (PAGE_SHIFT - 9));
 	}
+
 	return 0;
 }
 
-/* copy data/parity from log to raid disks */
-static void r5l_recovery_flush_log(struct r5l_log *log,
+/*
+ * Load the stripe into cache. The stripe will be written out later by
+ * the stripe cache state machine.
+ */
+static void r5c_recovery_load_one_stripe(struct r5l_log *log,
+					 struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct r5dev *dev;
+	int i;
+
+	atomic_set(&sh->dev_in_cache, 0);
+	for (i = sh->disks; i--; ) {
+		dev = sh->dev + i;
+		if (test_and_clear_bit(R5_Wantwrite, &dev->flags)) {
+			set_bit(R5_InCache, &dev->flags);
+			atomic_inc(&sh->dev_in_cache);
+		}
+	}
+	set_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state);
+	atomic_inc(&conf->r5c_cached_partial_stripes);
+	list_add_tail(&sh->r5c, &log->stripe_in_cache_list);
+}
+
+/*
+ * Scan through the log for all to-be-flushed data
+ *
+ * For stripes with data and parity, namely Data-Parity stripes
+ * (STRIPE_R5C_WRITTEN set), we simply replay all the writes.
+ *
+ * For stripes with only data, namely Data-Only stripes
+ * (STRIPE_R5C_WRITTEN clear), we load them into the stripe cache state machine.
+ *
+ * For a stripe, if we see data after parity, we should discard all previous
+ * data and parity for this stripe, as these data are already flushed to
+ * the array.
+ *
+ * At the end of the scan, we return the new journal_tail, which points to
+ * first data-only stripe on the journal device, or next invalid meta block.
+ */
+static void r5c_recovery_flush_log(struct r5l_log *log,
 				   struct r5l_recovery_ctx *ctx)
 {
+	struct stripe_head *sh, *next;
+	int ret;
+
+	/* scan through the log */
 	while (1) {
-		if (r5l_read_meta_block(log, ctx))
-			return;
-		if (r5l_recovery_flush_one_meta(log, ctx))
-			return;
+		if (r5l_recovery_read_meta_block(log, ctx))
+			break;
+
+		ret = r5c_recovery_analyze_meta_block(log, ctx,
+						      &ctx->cached_list);
+		/*
+		 * -EAGAIN means a mismatch in a data block; in this case, we still
+		 * try to scan the next meta block
+		 */
+		if (ret && ret != -EAGAIN)
+			break;
 		ctx->seq++;
 		ctx->pos = r5l_ring_add(log, ctx->pos, ctx->meta_total_blocks);
 	}
+
+	/* replay data-parity stripes */
+	r5c_recovery_replay_stripes(&ctx->cached_list, ctx);
+
+	/* load data-only stripes to stripe cache */
+	list_for_each_entry_safe(sh, next, &ctx->cached_list, lru) {
+		WARN_ON(test_bit(STRIPE_R5C_WRITTEN, &sh->state));
+		r5c_recovery_load_one_stripe(log, sh);
+		list_del_init(&sh->lru);
+		raid5_release_stripe(sh);
+		ctx->data_only_stripes++;
+	}
+
+	return;
 }
 
-static int r5l_log_write_empty_meta_block(struct r5l_log *log, sector_t pos,
-					  u64 seq)
+/*
+ * we did a recovery. Now ctx.pos points to an invalid meta block. New
+ * log will start here, but we can't let the superblock point to the last
+ * valid meta block. The log might look like:
+ * | meta 1| meta 2| meta 3|
+ * meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If
+ * superblock points to meta 1, we write a new valid meta 2n.  if crash
+ * happens again, new recovery will start from meta 1. Since meta 2n is
+ * valid now, recovery will think meta 3 is valid, which is wrong.
+ * The solution is we create a new meta in meta2 with its seq == meta
+ * 1's seq + 10 and let the superblock point to meta2. The same recovery will
+ * not think meta 3 is a valid meta, because its seq doesn't match
+ */
+
+/*
+ * Before recovery, the log looks like the following
+ *
+ *   ---------------------------------------------
+ *   |           valid log        | invalid log  |
+ *   ---------------------------------------------
+ *   ^
+ *   |- log->last_checkpoint
+ *   |- log->last_cp_seq
+ *
+ * Now we scan through the log until we see invalid entry
+ *
+ *   ---------------------------------------------
+ *   |           valid log        | invalid log  |
+ *   ---------------------------------------------
+ *   ^                            ^
+ *   |- log->last_checkpoint      |- ctx->pos
+ *   |- log->last_cp_seq          |- ctx->seq
+ *
+ * From this point, we need to increase seq number by 10 to avoid
+ * confusing next recovery.
+ *
+ *   ---------------------------------------------
+ *   |           valid log        | invalid log  |
+ *   ---------------------------------------------
+ *   ^                              ^
+ *   |- log->last_checkpoint        |- ctx->pos+1
+ *   |- log->last_cp_seq            |- ctx->seq+11
+ *
+ * However, it is not safe to start the state machine yet, because the data
+ * of data-only stripes is not yet secured in the RAID. To save these
+ * data-only stripes, we rewrite them to the journal starting from seq+11.
+ *
+ *   -----------------------------------------------------------------
+ *   |           valid log        | data only stripes | invalid log  |
+ *   -----------------------------------------------------------------
+ *   ^                                                ^
+ *   |- log->last_checkpoint                          |- ctx->pos+n
+ *   |- log->last_cp_seq                              |- ctx->seq+10+n
+ *
+ * If failure happens again during this process, the recovery can safely start
+ * again from log->last_checkpoint.
+ *
+ * Once data only stripes are rewritten to journal, we move log_tail
+ *
+ *   -----------------------------------------------------------------
+ *   |     old log        |    data only stripes    | invalid log  |
+ *   -----------------------------------------------------------------
+ *                        ^                         ^
+ *                        |- log->last_checkpoint   |- ctx->pos+n
+ *                        |- log->last_cp_seq       |- ctx->seq+10+n
+ *
+ * Then we can safely start the state machine. If failure happens from this
+ * point on, the recovery will start from new log->last_checkpoint.
+ */
+static int
+r5c_recovery_rewrite_data_only_stripes(struct r5l_log *log,
+				       struct r5l_recovery_ctx *ctx)
 {
+	struct stripe_head *sh;
+	struct mddev *mddev = log->rdev->mddev;
 	struct page *page;
-	struct r5l_meta_block *mb;
-	u32 crc;
 
-	page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-	if (!page)
+	page = alloc_page(GFP_KERNEL);
+	if (!page) {
+		pr_err("md/raid:%s: cannot allocate memory to rewrite data only stripes\n",
+		       mdname(mddev));
 		return -ENOMEM;
-	mb = page_address(page);
-	mb->magic = cpu_to_le32(R5LOG_MAGIC);
-	mb->version = R5LOG_VERSION;
-	mb->meta_size = cpu_to_le32(sizeof(struct r5l_meta_block));
-	mb->seq = cpu_to_le64(seq);
-	mb->position = cpu_to_le64(pos);
-	crc = crc32c_le(log->uuid_checksum, mb, PAGE_SIZE);
-	mb->checksum = cpu_to_le32(crc);
+	}
 
-	if (!sync_page_io(log->rdev, pos, PAGE_SIZE, page, REQ_OP_WRITE,
-			  WRITE_FUA, false)) {
-		__free_page(page);
-		return -EIO;
+	ctx->seq += 10;
+	list_for_each_entry(sh, &ctx->cached_list, lru) {
+		struct r5l_meta_block *mb;
+		int i;
+		int offset;
+		sector_t write_pos;
+
+		WARN_ON(test_bit(STRIPE_R5C_WRITTEN, &sh->state));
+		r5l_recovery_create_emtpy_meta_block(log, page,
+						     ctx->pos, ctx->seq);
+		mb = page_address(page);
+		offset = le32_to_cpu(mb->meta_size);
+		write_pos = ctx->pos + BLOCK_SECTORS;
+
+		for (i = sh->disks; i--; ) {
+			struct r5dev *dev = &sh->dev[i];
+			struct r5l_payload_data_parity *payload;
+			void *addr;
+
+			if (test_bit(R5_InCache, &dev->flags)) {
+				payload = (void *)mb + offset;
+				payload->header.type = cpu_to_le16(
+					R5LOG_PAYLOAD_DATA);
+				payload->size = BLOCK_SECTORS;
+				payload->location = cpu_to_le64(
+					raid5_compute_blocknr(sh, i, 0));
+				addr = kmap_atomic(dev->page);
+				payload->checksum[0] = cpu_to_le32(
+					crc32c_le(log->uuid_checksum, addr,
+						  PAGE_SIZE));
+				kunmap_atomic(addr);
+				sync_page_io(log->rdev, write_pos, PAGE_SIZE,
+					     dev->page, REQ_OP_WRITE, 0, false);
+				write_pos = r5l_ring_add(log, write_pos,
+							 BLOCK_SECTORS);
+				offset += sizeof(__le32) +
+					sizeof(struct r5l_payload_data_parity);
+
+			}
+		}
+		mb->meta_size = cpu_to_le32(offset);
+		mb->checksum = crc32c_le(log->uuid_checksum, mb, PAGE_SIZE);
+		sync_page_io(log->rdev, ctx->pos, PAGE_SIZE, page,
+			     REQ_OP_WRITE, WRITE_FUA, false);
+		sh->log_start = ctx->pos;
+		ctx->pos = write_pos;
+		ctx->seq += 1;
 	}
 	__free_page(page);
 	return 0;
@@ -1419,43 +1822,45 @@ static int r5l_log_write_empty_meta_block(struct r5l_log *log, sector_t pos,
 
 static int r5l_recovery_log(struct r5l_log *log)
 {
+	struct mddev *mddev = log->rdev->mddev;
 	struct r5l_recovery_ctx ctx;
 
 	ctx.pos = log->last_checkpoint;
 	ctx.seq = log->last_cp_seq;
 	ctx.meta_page = alloc_page(GFP_KERNEL);
+	ctx.data_only_stripes = 0;
+	ctx.data_parity_stripes = 0;
+	INIT_LIST_HEAD(&ctx.cached_list);
+
 	if (!ctx.meta_page)
 		return -ENOMEM;
 
-	r5l_recovery_flush_log(log, &ctx);
+	r5c_recovery_flush_log(log, &ctx);
+
 	__free_page(ctx.meta_page);
 
-	/*
-	 * we did a recovery. Now ctx.pos points to an invalid meta block. New
-	 * log will start here. but we can't let superblock point to last valid
-	 * meta block. The log might looks like:
-	 * | meta 1| meta 2| meta 3|
-	 * meta 1 is valid, meta 2 is invalid. meta 3 could be valid. If
-	 * superblock points to meta 1, we write a new valid meta 2n.  if crash
-	 * happens again, new recovery will start from meta 1. Since meta 2n is
-	 * valid now, recovery will think meta 3 is valid, which is wrong.
-	 * The solution is we create a new meta in meta2 with its seq == meta
-	 * 1's seq + 10 and let superblock points to meta2. The same recovery will
-	 * not think meta 3 is a valid meta, because its seq doesn't match
-	 */
-	if (ctx.seq > log->last_cp_seq + 1) {
-		int ret;
-
-		ret = r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq + 10);
-		if (ret)
-			return ret;
-		log->seq = ctx.seq + 11;
-		log->log_start = r5l_ring_add(log, ctx.pos, BLOCK_SECTORS);
-		r5l_write_super(log, ctx.pos);
-	} else {
-		log->log_start = ctx.pos;
-		log->seq = ctx.seq;
+	if ((ctx.data_only_stripes == 0) && (ctx.data_parity_stripes == 0))
+		pr_info("md/raid:%s: starting from clean shutdown\n",
+			mdname(mddev));
+	else {
+		pr_info("md/raid:%s: recovering %d data-only stripes and %d data-parity stripes\n",
+			mdname(mddev), ctx.data_only_stripes,
+			ctx.data_parity_stripes);
+
+		if (ctx.data_only_stripes > 0)
+			if (r5c_recovery_rewrite_data_only_stripes(log, &ctx)) {
+				pr_err("md/raid:%s: failed to rewrite stripes to journal\n",
+				       mdname(mddev));
+				return -EIO;
+			}
 	}
+
+	log->log_start = ctx.pos;
+	log->next_checkpoint = ctx.pos;
+	log->seq = ctx.seq;
+	r5l_log_write_empty_meta_block(log, ctx.pos, ctx.seq);
+	r5l_write_super(log, ctx.pos);
+
 	return 0;
 }
 
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v5 8/8] md/r5cache: handle SYNC and FUA
  2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
                   ` (6 preceding siblings ...)
  2016-10-13  5:49 ` [PATCH v5 7/8] md/r5cache: r5c recovery Song Liu
@ 2016-10-13  5:49 ` Song Liu
  7 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2016-10-13  5:49 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

With the raid5 cache, we commit data from the journal device. When
there is a flush request, we need to flush the journal device's
cache. This was not needed with the raid5 journal alone, because
there we flush the journal before committing data to the raid disks.

FUA is handled similarly, except that for FUA we also need to flush
the journal. Otherwise, corruption in earlier metadata could stop
recovery from reaching the FUA data.
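
As a rough, userspace-only summary (made-up names) of what the text
above requires at the journal device before an incoming FLUSH or FUA
write may be completed:

#include <stdbool.h>

struct journal_action {
	bool flush_journal_cache;	/* flush the journal device's cache */
	bool fua_journal_write;		/* issue the journal write with FUA */
};

static struct journal_action action_for(bool req_flush, bool req_fua)
{
	struct journal_action a = { false, false };

	if (req_flush)
		/* earlier journal contents must be stable on media */
		a.flush_journal_cache = true;
	if (req_fua) {
		/* without the flush, corrupt earlier meta blocks could stop
		 * recovery before it reaches the FUA data */
		a.flush_journal_cache = true;
		a.fua_journal_write = true;
	}
	return a;
}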

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 162 +++++++++++++++++++++++++++++++++++++++++------
 drivers/md/raid5.c       |   8 +++
 drivers/md/raid5.h       |   1 +
 3 files changed, 153 insertions(+), 18 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index d3a68da..49e637e 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -19,6 +19,7 @@
 #include <linux/raid/md_p.h>
 #include <linux/crc32c.h>
 #include <linux/random.h>
+#include <trace/events/block.h>
 #include "md.h"
 #include "raid5.h"
 #include "bitmap.h"
@@ -159,6 +160,9 @@ struct r5l_log {
 
 	spinlock_t stripe_in_cache_lock;
 	atomic_t stripe_in_cache_count;
+
+	/* to submit async io_units, to fulfill ordering of flush */
+	struct work_struct deferred_io_work;
 };
 
 /*
@@ -185,6 +189,18 @@ struct r5l_io_unit {
 
 	int state;
 	bool need_split_bio;
+	struct bio *split_bio;
+
+	unsigned int has_flush:1;      /* include flush request */
+	unsigned int has_fua:1;        /* include fua request */
+	unsigned int has_null_flush:1; /* include empty flush request */
+	/*
+	 * io isn't sent yet, flush/fua request can only be submitted till it's
+	 * the first IO in running_ios list
+	 */
+	unsigned int io_deferred:1;
+
+	struct bio_list flush_barriers;   /* size == 0 flush bios */
 };
 
 /* r5l_io_unit state */
@@ -483,9 +499,11 @@ static void r5l_move_to_end_ios(struct r5l_log *log)
 	}
 }
 
+static void __r5l_stripe_write_finished(struct r5l_io_unit *io);
 static void r5l_log_endio(struct bio *bio)
 {
 	struct r5l_io_unit *io = bio->bi_private;
+	struct r5l_io_unit *io_deferred;
 	struct r5l_log *log = io->log;
 	unsigned long flags;
 
@@ -501,18 +519,89 @@ static void r5l_log_endio(struct bio *bio)
 		r5l_move_to_end_ios(log);
 	else
 		r5l_log_run_stripes(log);
+	if (!list_empty(&log->running_ios)) {
+		/*
+		 * FLUSH/FUA io_unit is deferred because of ordering, now we
+		 * can dispatch it
+		 */
+		io_deferred = list_first_entry(&log->running_ios,
+					       struct r5l_io_unit, log_sibling);
+		if (io_deferred->io_deferred)
+			schedule_work(&log->deferred_io_work);
+	}
+
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
 
 	if (log->need_cache_flush)
 		md_wakeup_thread(log->rdev->mddev->thread);
+
+	if (io->has_null_flush) {
+		struct bio *bi;
+
+		WARN_ON(bio_list_empty(&io->flush_barriers));
+		while ((bi = bio_list_pop(&io->flush_barriers)) != NULL) {
+			bio_endio(bi);
+			atomic_dec(&io->pending_stripe);
+		}
+		if (atomic_read(&io->pending_stripe) == 0)
+			__r5l_stripe_write_finished(io);
+	}
+}
+
+static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+
+	if (io->has_flush)
+		bio_set_op_attrs(io->current_bio, REQ_OP_WRITE, WRITE_FLUSH);
+	if (io->has_fua)
+		bio_set_op_attrs(io->current_bio, REQ_OP_WRITE, WRITE_FUA);
+	submit_bio(io->current_bio);
+
+	if (!io->split_bio)
+		return;
+
+	if (io->has_flush)
+		bio_set_op_attrs(io->split_bio, REQ_OP_WRITE, WRITE_FLUSH);
+	if (io->has_fua)
+		bio_set_op_attrs(io->split_bio, REQ_OP_WRITE, WRITE_FUA);
+	submit_bio(io->split_bio);
+}
+
+/* deferred io_unit will be dispatched here */
+static void r5l_submit_io_async(struct work_struct *work)
+{
+	struct r5l_log *log = container_of(work, struct r5l_log,
+					   deferred_io_work);
+	struct r5l_io_unit *io = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&log->io_list_lock, flags);
+	if (!list_empty(&log->running_ios)) {
+		io = list_first_entry(&log->running_ios, struct r5l_io_unit,
+				      log_sibling);
+		if (!io->io_deferred)
+			io = NULL;
+		else
+			io->io_deferred = 0;
+	}
+	spin_unlock_irqrestore(&log->io_list_lock, flags);
+	if (io)
+		r5l_do_submit_io(log, io);
 }
 
 static void r5l_submit_current_io(struct r5l_log *log)
 {
 	struct r5l_io_unit *io = log->current_io;
+	struct bio *bio;
 	struct r5l_meta_block *block;
 	unsigned long flags;
 	u32 crc;
+	bool do_submit = true;
 
 	if (!io)
 		return;
@@ -521,13 +610,20 @@ static void r5l_submit_current_io(struct r5l_log *log)
 	block->meta_size = cpu_to_le32(io->meta_offset);
 	crc = crc32c_le(log->uuid_checksum, block, PAGE_SIZE);
 	block->checksum = cpu_to_le32(crc);
+	bio = io->current_bio;
 
 	log->current_io = NULL;
 	spin_lock_irqsave(&log->io_list_lock, flags);
-	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
+	if (io->has_flush || io->has_fua) {
+		if (io != list_first_entry(&log->running_ios,
+					   struct r5l_io_unit, log_sibling)) {
+			io->io_deferred = 1;
+			do_submit = false;
+		}
+	}
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
-
-	submit_bio(io->current_bio);
+	if (do_submit)
+		r5l_do_submit_io(log, io);
 }
 
 static struct bio *r5l_bio_alloc(struct r5l_log *log)
@@ -572,6 +668,7 @@ static struct r5l_io_unit *r5l_new_meta(struct r5l_log *log)
 	io->log = log;
 	INIT_LIST_HEAD(&io->log_sibling);
 	INIT_LIST_HEAD(&io->stripe_list);
+	bio_list_init(&io->flush_barriers);
 	io->state = IO_UNIT_RUNNING;
 
 	io->meta_page = mempool_alloc(log->meta_pool, GFP_NOIO);
@@ -642,12 +739,11 @@ static void r5l_append_payload_page(struct r5l_log *log, struct page *page)
 	struct r5l_io_unit *io = log->current_io;
 
 	if (io->need_split_bio) {
-		struct bio *prev = io->current_bio;
-
+		BUG_ON(io->split_bio);
+		io->split_bio = io->current_bio;
 		io->current_bio = r5l_bio_alloc(log);
-		bio_chain(io->current_bio, prev);
-
-		submit_bio(prev);
+		bio_chain(io->current_bio, io->split_bio);
+		io->need_split_bio = false;
 	}
 
 	if (!bio_add_page(io->current_bio, page, PAGE_SIZE, 0))
@@ -677,12 +773,23 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
 
 	io = log->current_io;
 
+	if (test_and_clear_bit(STRIPE_R5C_PREFLUSH, &sh->state))
+		io->has_flush = 1;
+
 	for (i = 0; i < sh->disks; i++) {
 		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
 		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
 			continue;
 		if (i == sh->pd_idx || i == sh->qd_idx)
 			continue;
+		if (test_bit(R5_WantFUA, &sh->dev[i].flags)) {
+			io->has_fua = 1;
+			/*
+			 * we need to flush journal to make sure recovery can
+			 * reach the data with fua flag
+			 */
+			io->has_flush = 1;
+		}
 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_DATA,
 					raid5_compute_blocknr(sh, i, 0),
 					sh->dev[i].log_checksum, 0, false);
@@ -847,17 +954,34 @@ int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio)
 {
 	if (!log)
 		return -ENODEV;
-	/*
-	 * we flush log disk cache first, then write stripe data to raid disks.
-	 * So if bio is finished, the log disk cache is flushed already. The
-	 * recovery guarantees we can recovery the bio from log disk, so we
-	 * don't need to flush again
-	 */
-	if (bio->bi_iter.bi_size == 0) {
-		bio_endio(bio);
-		return 0;
+
+	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
+		/*
+		 * in write through (journal only)
+		 * we flush log disk cache first, then write stripe data to
+		 * raid disks. So if bio is finished, the log disk cache is
+		 * flushed already. The recovery guarantees we can recovery
+		 * the bio from log disk, so we don't need to flush again
+		 */
+		if (bio->bi_iter.bi_size == 0) {
+			bio_endio(bio);
+			return 0;
+		}
+		bio->bi_opf &= ~REQ_PREFLUSH;
+	} else {
+		/* write back (with cache) */
+		if (bio->bi_iter.bi_size == 0) {
+			mutex_lock(&log->io_mutex);
+			r5l_get_meta(log, 0);
+			bio_list_add(&log->current_io->flush_barriers, bio);
+			log->current_io->has_flush = 1;
+			log->current_io->has_null_flush = 1;
+			atomic_inc(&log->current_io->pending_stripe);
+			r5l_submit_current_io(log);
+			mutex_unlock(&log->io_mutex);
+			return 0;
+		}
 	}
-	bio->bi_opf &= ~REQ_PREFLUSH;
 	return -EAGAIN;
 }
 
@@ -2370,6 +2494,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	INIT_LIST_HEAD(&log->no_space_stripes);
 	spin_lock_init(&log->no_space_stripes_lock);
 
+	INIT_WORK(&log->deferred_io_work, r5l_submit_io_async);
+
 	log->r5c_state = R5C_STATE_WRITE_THROUGH;
 	INIT_LIST_HEAD(&log->stripe_in_cache_list);
 	spin_lock_init(&log->stripe_in_cache_lock);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 187843c..e21fccc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5233,6 +5233,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	int remaining;
 	DEFINE_WAIT(w);
 	bool do_prepare;
+	bool do_flush = false;
 
 	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
 		int ret = r5l_handle_flush_request(conf->log, bi);
@@ -5244,6 +5245,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 			return;
 		}
 		/* ret == -EAGAIN, fallback */
+		do_flush = true;
 	}
 
 	md_write_start(mddev, bi);
@@ -5383,6 +5385,12 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 				do_prepare = true;
 				goto retry;
 			}
+			if (do_flush) {
+				set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+				/* we only need flush for one stripe */
+				do_flush = false;
+			}
+
 			set_bit(STRIPE_HANDLE, &sh->state);
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			if ((!sh->batch_head || sh == sh->batch_head) &&
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 6311892..b396c5d 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -360,6 +360,7 @@ enum {
 	STRIPE_R5C_FULL_STRIPE,	/* in r5c cache (to-be/being handled or
 				 * in conf->r5c_full_stripe_list)
 				 */
+	STRIPE_R5C_PREFLUSH,	/* need to flush journal device */
 };
 
 #define STRIPE_EXPAND_SYNC_FLAGS \
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode
  2016-10-13  5:49 ` [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode Song Liu
@ 2016-10-14  6:13   ` NeilBrown
  2016-10-14  7:03     ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2016-10-14  6:13 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

[-- Attachment #1: Type: text/plain, Size: 14667 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> The raid5-cache write back mode has 2 states for each stripe: write state
> and reclaim state. This patch adds bare state machine for these two
> states.
>
> 2 flags are added to sh->state for raid5-cache states:
>  - STRIPE_R5C_FROZEN
>  - STRIPE_R5C_WRITTEN
>
> STRIPE_R5C_FROZEN is the key flag to differentiate write state
> and reclaim state.
>
> STRIPE_R5C_WRITTEN is a helper flag to bring the stripe back from
> reclaim state back to write state.
>
> In write through mode, every stripe also goes between write state
> and reclaim state (in r5c_handle_stripe_dirtying() and
> r5c_handle_stripe_written()).
>
> Please note: this is a "no-op" patch for raid5-cache write through
> mode.
>
> The following detailed explanation is copied from the raid5-cache.c:
>
> /*
>  * raid5 cache state machine
>  *
>  * The RAID cache works in two major states for each stripe: write state
>  * and reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN
>  * and STRIPE_R5C_WRITTEN

Hi,
 It would really help at this point to understand exactly what it is
 that is "FROZEN".  I guess the contents of the stripe_head are frozen?
 so that nothing changes between writing to the journal and writing to
 the RAID?  Making that explicit would help.

 Only... the stripe_head gets "frozen" before that.  It gets frozen
 before the parity is calculated, and not released until the data and
 parity is written to the RAID.  This new "FROZEN" state is a subset of
 that - yes?

 Given that STRIPE_R5C_FROZEN is set when the stripe is in "reclaim
 state", I wonder why you didn't call it "STRIPE_R5C_RECLAIM"
 .... though I'm not sure "reclaim" is a good word.  Reclaim is
 something that happens to the journal after the data has been written
 to the RAID.

 Maybe STRIPE_R5C_CACHED  which means that the data is cached in the
 journal, but not safe in the RAID.... only that describes the
 STRIPE_R5C_WRITTEN flag.

 Choosing names is hard, I understand that.  I think improving these
 names would help a lot though.

>  *
>  * STRIPE_R5C_FROZEN is the key flag to differentiate write state and reclaim
>  * state. The write state runs w/ STRIPE_R5C_FROZEN == 0. While the reclaim
>  * state runs w/ STRIPE_R5C_FROZEN == 1.
>  *
>  * STRIPE_R5C_WRITTEN is a helper flag to bring the stripe back from reclaim
>  * state to write state. Specifically, STRIPE_R5C_WRITTEN triggers clean up
>  * process in r5c_handle_stripe_written. STRIPE_R5C_WRITTEN is set when data
>  * and parity of a stripe is all in journal device; and cleared when the data
>  * and parity are all in RAID disks.
>  *
>  * The following is another way to show how STRIPE_R5C_FROZEN and
>  * STRIPE_R5C_WRITTEN work:
>  *
>  * write state: STRIPE_R5C_FROZEN = 0 STRIPE_R5C_WRITTEN = 0
>  * reclaim state: STRIPE_R5C_FROZEN = 1
>  *
>  * write => reclaim: set STRIPE_R5C_FROZEN in r5c_freeze_stripe_for_reclaim
>  * reclaim => write:
>  * 1. write parity to journal, when finished, set STRIPE_R5C_WRITTEN
>  * 2. write data/parity to raid disks, when finished, clear both
>  *    STRIPE_R5C_FROZEN and STRIPE_R5C_WRITTEN
>  *
>  * In write through mode (journal only) the stripe still goes through these
>  * state change, except that STRIPE_R5C_FROZEN is set on write in
>  * r5c_handle_stripe_dirtying().
>  */
>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 125 +++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/md/raid5.c       |  20 ++++++--
>  drivers/md/raid5.h       |  10 +++-
>  3 files changed, 148 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 7557791b..9e05850 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -40,6 +40,47 @@
>   */
>  #define R5L_POOL_SIZE	4
>  
> +enum r5c_state {
> +	R5C_STATE_NO_CACHE = 0,
> +	R5C_STATE_WRITE_THROUGH = 1,
> +	R5C_STATE_WRITE_BACK = 2,
> +	R5C_STATE_CACHE_BROKEN = 3,

How is CACHE_BROKEN different from NO_CACHE??
Maybe a later patch will tell me, but I'd rather know now :-)

As this state is stored in the r5l_log, and as we don't allocate that
when there is no cache, why have a NO_CACHE state?

> +};
> +
> +/*
> + * raid5 cache state machine
> + *
> + * The RAID cache works in two major states for each stripe: write state and
> + * reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN and
> + * STRIPE_R5C_WRITTEN
> + *
> + * STRIPE_R5C_FROZEN is the key flag to differentiate write state and reclaim
> + * state. The write state runs w/ STRIPE_R5C_FROZEN == 0. While the reclaim
> + * state runs w/ STRIPE_R5C_FROZEN == 1.
> + *
> + * STRIPE_R5C_WRITTEN is a helper flag to bring the stripe back from reclaim
> + * state to write state. Specifically, STRIPE_R5C_WRITTEN triggers clean up
> + * process in r5c_handle_stripe_written. STRIPE_R5C_WRITTEN is set when data
> + * and parity of a stripe is all in journal device; and cleared when the data
> + * and parity are all in RAID disks.
> + *
> + * The following is another way to show how STRIPE_R5C_FROZEN and
> + * STRIPE_R5C_WRITTEN work:
> + *
> + * write state: STRIPE_R5C_FROZEN = 0 STRIPE_R5C_WRITTEN = 0
> + * reclaim state: STRIPE_R5C_FROZEN = 1
> + *
> + * write => reclaim: set STRIPE_R5C_FROZEN in r5c_freeze_stripe_for_reclaim
> + * reclaim => write:
> + * 1. write parity to journal, when finished, set STRIPE_R5C_WRITTEN
> + * 2. write data/parity to raid disks, when finished, clear both
> + *    STRIPE_R5C_FROZEN and STRIPE_R5C_WRITTEN
> + *
> + * In write through mode (journal only) the stripe also goes through these
> + * state change, except that STRIPE_R5C_FROZEN is set on write in
> + * r5c_handle_stripe_dirtying().
> + */
> +
>  struct r5l_log {
>  	struct md_rdev *rdev;
>  
> @@ -96,6 +137,9 @@ struct r5l_log {
>  	spinlock_t no_space_stripes_lock;
>  
>  	bool need_cache_flush;
> +
> +	/* for r5c_cache */
> +	enum r5c_state r5c_state;
>  };
>  
>  /*
> @@ -133,6 +177,11 @@ enum r5l_io_unit_state {
>  	IO_UNIT_STRIPE_END = 3,	/* stripes data finished writing to raid */
>  };
>  
> +bool r5c_is_writeback(struct r5l_log *log)
> +{
> +	return (log != NULL && log->r5c_state == R5C_STATE_WRITE_BACK);
> +}
> +
>  static sector_t r5l_ring_add(struct r5l_log *log, sector_t start, sector_t inc)
>  {
>  	start += inc;
> @@ -168,12 +217,44 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
>  	io->state = state;
>  }
>  
> +/*
> + * Freeze the stripe, thus send the stripe into reclaim path.
> + *
> + * In current implementation, STRIPE_R5C_FROZEN is also set in write through
> + * mode (in r5c_handle_stripe_dirtying). This does not change the behavior of
> + * for write through mode.
> + */
> +void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)

"freeze_stripe_for_reclaim" seems an odd name.  How does "reclaim" fit?
You are freezing the stripe so it can be written to the journal, and
then eventually to the RAID are you not?
Or are you freezing it so the space used in the journal cannot be reclaimed?

Maybe I misunderstand the "FROZEN" concept still.


> +{
> +	struct r5conf *conf = sh->raid_conf;
> +	struct r5l_log *log = conf->log;
> +
> +	if (!log)
> +		return;
> +	WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
> +	set_bit(STRIPE_R5C_FROZEN, &sh->state);
> +}
> +
> +static void r5c_finish_cache_stripe(struct stripe_head *sh)

A comment just before this function saying when it is called would help.
"Call when a stripe_head is safe in the journal, but has not yet been
written to the RAID" ??

> +{
> +	struct r5l_log *log = sh->raid_conf->log;
> +
> +	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
> +		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
> +		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
> +	} else
> +		BUG(); /* write back logic in next patch */
> +}
> +
>  static void r5l_io_run_stripes(struct r5l_io_unit *io)
>  {
>  	struct stripe_head *sh, *next;
>  
>  	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>  		list_del_init(&sh->log_list);
> +
> +		r5c_finish_cache_stripe(sh);
> +
>  		set_bit(STRIPE_HANDLE, &sh->state);
>  		raid5_release_stripe(sh);
>  	}
> @@ -412,18 +493,19 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>  		r5l_append_payload_page(log, sh->dev[i].page);
>  	}
>  
> -	if (sh->qd_idx >= 0) {
> +	if (parity_pages == 2) {
>  		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>  					sh->sector, sh->dev[sh->pd_idx].log_checksum,
>  					sh->dev[sh->qd_idx].log_checksum, true);
>  		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
>  		r5l_append_payload_page(log, sh->dev[sh->qd_idx].page);
> -	} else {
> +	} else if (parity_pages == 1) {
>  		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>  					sh->sector, sh->dev[sh->pd_idx].log_checksum,
>  					0, false);
>  		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
> -	}
> +	} else
> +		BUG_ON(parity_pages != 0);


Why this change?  I don't think it is wrong, but I don't see why it
belongs here.  How does it help?


>  
>  	list_add_tail(&sh->log_list, &io->stripe_list);
>  	atomic_inc(&io->pending_stripe);
> @@ -455,6 +537,8 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
>  		return -EAGAIN;
>  	}
>  
> +	WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
> +
>  	for (i = 0; i < sh->disks; i++) {
>  		void *addr;
>  
> @@ -1101,6 +1185,39 @@ static void r5l_write_super(struct r5l_log *log, sector_t cp)
>  	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>  }
>  
> +int r5c_handle_stripe_dirtying(struct r5conf *conf,
> +			       struct stripe_head *sh,
> +			       struct stripe_head_state *s,
> +			       int disks)
> +{
> +	struct r5l_log *log = conf->log;
> +
> +	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +		return -EAGAIN;
> +
> +	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
> +	    conf->mddev->degraded != 0) {

Why do you disable write-back when the array is degraded?

> +		/* write through mode */
> +		r5c_freeze_stripe_for_reclaim(sh);
> +		return -EAGAIN;
> +	}
> +	BUG();  /* write back logic in next commit */
> +	return 0;
> +}
> +
> +/*
> + * clean up the stripe (clear STRIPE_R5C_FROZEN etc.) after the stripe is
> + * committed to RAID disks
> +*/
> +void r5c_handle_stripe_written(struct r5conf *conf,
> +			       struct stripe_head *sh)
> +{
> +	if (!test_and_clear_bit(STRIPE_R5C_WRITTEN, &sh->state))
> +		return;
> +	WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
> +	clear_bit(STRIPE_R5C_FROZEN, &sh->state);
> +}
> +
>  static int r5l_load_log(struct r5l_log *log)
>  {
>  	struct md_rdev *rdev = log->rdev;
> @@ -1236,6 +1353,8 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  	INIT_LIST_HEAD(&log->no_space_stripes);
>  	spin_lock_init(&log->no_space_stripes_lock);
>  
> +	log->r5c_state = R5C_STATE_WRITE_THROUGH;
> +
>  	if (r5l_load_log(log))
>  		goto error;
>  
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 67d4f49..2e3e61a 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3506,6 +3506,9 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  	int rmw = 0, rcw = 0, i;
>  	sector_t recovery_cp = conf->mddev->recovery_cp;
>  
> +	if (r5c_handle_stripe_dirtying(conf, sh, s, disks) == 0)
> +		return;
> +
>  	/* Check whether resync is now happening or should start.
>  	 * If yes, then the array is dirty (after unclean shutdown or
>  	 * initial creation), so parity in some stripes might be inconsistent.
> @@ -4396,13 +4399,23 @@ static void handle_stripe(struct stripe_head *sh)
>  	    || s.expanding)
>  		handle_stripe_fill(sh, &s, disks);
>  
> -	/* Now to consider new write requests and what else, if anything
> -	 * should be read.  We do not handle new writes when:
> +	/*
> +	 * When the stripe finishes full journal write cycle (write to journal
> +	 * and raid disk), this is the clean up procedure so it is ready for
> +	 * next operation.
> +	 */
> +	r5c_handle_stripe_written(conf, sh);
> +
> +	/*
> +	 * Now to consider new write requests, cache write back and what else,
> +	 * if anything should be read.  We do not handle new writes when:
>  	 * 1/ A 'write' operation (copy+xor) is already in flight.
>  	 * 2/ A 'check' operation is in flight, as it may clobber the parity
>  	 *    block.
> +	 * 3/ A r5c cache log write is in flight.
>  	 */
> -	if (s.to_write && !sh->reconstruct_state && !sh->check_state)
> +	if ((s.to_write || test_bit(STRIPE_R5C_FROZEN, &sh->state)) &&
> +	    !sh->reconstruct_state && !sh->check_state && !sh->log_io)
>  		handle_stripe_dirtying(conf, sh, &s, disks);

OK, I'm definitely confused now.
If the stripe is FROZEN, surely you don't want to call
handle_stripe_dirtying().


>  
>  	/* maybe we need to check and possibly fix the parity for this stripe
> @@ -5122,6 +5135,7 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
>  	 * data on failed drives.
>  	 */
>  	if (rw == READ && mddev->degraded == 0 &&
> +	    !r5c_is_writeback(conf->log) &&
>  	    mddev->reshape_position == MaxSector) {
>  		bi = chunk_aligned_read(mddev, bi);
>  		if (!bi)
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 46cfe93..8bae64b 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -345,7 +345,9 @@ enum {
>  	STRIPE_BITMAP_PENDING,	/* Being added to bitmap, don't add
>  				 * to batch yet.
>  				 */
> -	STRIPE_LOG_TRAPPED, /* trapped into log */
> +	STRIPE_LOG_TRAPPED,	/* trapped into log */
> +	STRIPE_R5C_FROZEN,	/* r5c_cache frozen and being written out */
> +	STRIPE_R5C_WRITTEN,	/* ready for r5c_handle_stripe_written() */
>  };
>  
>  #define STRIPE_EXPAND_SYNC_FLAGS \
> @@ -712,4 +714,10 @@ extern void r5l_stripe_write_finished(struct stripe_head *sh);
>  extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
>  extern void r5l_quiesce(struct r5l_log *log, int state);
>  extern bool r5l_log_disk_error(struct r5conf *conf);
> +extern bool r5c_is_writeback(struct r5l_log *log);
> +extern int
> +r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
> +			   struct stripe_head_state *s, int disks);
> +extern void
> +r5c_handle_stripe_written(struct r5conf *conf, struct stripe_head *sh);
>  #endif
> -- 
> 2.9.3


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/8] md/r5cache: write part of r5cache
  2016-10-13  5:49 ` [PATCH v5 4/8] md/r5cache: write part of r5cache Song Liu
@ 2016-10-14  6:53   ` NeilBrown
  2016-10-14  7:33     ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2016-10-14  6:53 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

[-- Attachment #1: Type: text/plain, Size: 29713 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> This is the write part of r5cache. The cache is integrated with
> stripe cache of raid456. It leverages code of r5l_log to write
> data to journal device.
>
> r5cache split current write path into 2 parts: the write path
> and the reclaim path. The write path is as following:
> 1. write data to journal
>    (r5c_handle_stripe_dirtying, r5c_cache_data)
> 2. call bio_endio
>    (r5c_handle_data_cached, r5c_return_dev_pending_writes).
>
> Then the reclaim path is as:
> 1. Freeze the stripe (r5c_freeze_stripe_for_reclaim)
> 2. Calculate parity (reconstruct or RMW)
> 3. Write parity (and maybe some other data) to journal device
> 4. Write data and parity to RAID disks
>
> Reclaim path of the cache is implemented in the next patch.
>
> With r5cache, write operation does not wait for parity calculation
> and write out, so the write latency is lower (1 write to journal
> device vs. read and then write to raid disks). Also, r5cache will
> reduce RAID overhead (multiple IO due to read-modify-write of
> parity) and provide more opportunities of full stripe writes.
>
> This patch adds 2 flags to stripe_head.state:
>  - STRIPE_R5C_PARTIAL_STRIPE,
>  - STRIPE_R5C_FULL_STRIPE,
>
> Instead of inactive_list, stripes with cached data are tracked in
> r5conf->r5c_full_stripe_list and r5conf->r5c_partial_stripe_list.
> STRIPE_R5C_FULL_STRIPE and STRIPE_R5C_PARTIAL_STRIPE are flags for
> stripes in these lists. Note: stripes in r5c_full/partial_stripe_list
> are not considered as "active".
>
> For RMW, the code allocates an extra page for each data block
> being updated.  This is stored in r5dev->page and the old data
> is read into it.  Then the prexor calculation subtracts ->page
> from the parity block, and the reconstruct calculation adds the
> ->orig_page data back into the parity block.

What happens if the alloc_page() fails?

>
> r5cache naturally excludes SkipCopy. With R5_Wantcache bit set,
> async_copy_data will not skip copy.
>
> There are some known limitations of the cache implementation:
>
> 1. Write cache only covers full page writes (R5_OVERWRITE). Writes
>    of smaller granularity are write through.
> 2. Only one log io (sh->log_io) for each stripe at anytime. Later
>    writes for the same stripe have to wait. This can be improved by
>    moving log_io to r5dev.
> 3. With writeback cache, read path must enter state machine, which
>    is a significant bottleneck for some workloads.
> 4. There is no per stripe checkpoint (with r5l_payload_flush) in
>    the log, so recovery code has to replay more than necessary data
>    (sometimes all the log from last_checkpoint). This reduces
>    availability of the array.
>
> This patch includes a fix proposed by ZhengYuan Liu
> <liuzhengyuan@kylinos.cn>
>
> Signed-off-by: Song Liu <songliubraving@fb.com>

You introduce a new flag:
> +	R5_InCache,	/* Data in cache */

but don't document it at all (apart from those three words).
I must confess that I find it confusing when you talk about the
"journal" as the "cache".  I understand why, but in my mind, the "cache"
is the in-memory cache of stripe_heads.
We now have something that we sometimes call a "journal", sometimes call
a "log" and sometimes call a "cache".  That is a recipe for confusion.

A sentence or two explaining how this flag will be used would help in
reading the rest of the code.


> ---
>  drivers/md/raid5-cache.c | 204 +++++++++++++++++++++++++++++++++++++++++++++--
>  drivers/md/raid5.c       | 144 ++++++++++++++++++++++++++++-----
>  drivers/md/raid5.h       |  22 +++++
>  3 files changed, 344 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 9e05850..92d3d7b 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -20,6 +20,7 @@
>  #include <linux/random.h>
>  #include "md.h"
>  #include "raid5.h"
> +#include "bitmap.h"
>  
>  /*
>   * metadata/data stored in disk with 4k size unit (a block) regardless
> @@ -217,6 +218,44 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
>  	io->state = state;
>  }
>  
> +static void
> +r5c_return_dev_pending_writes(struct r5conf *conf, struct r5dev *dev,
> +			      struct bio_list *return_bi)
> +{
> +	struct bio *wbi, *wbi2;
> +
> +	wbi = dev->written;
> +	dev->written = NULL;
> +	while (wbi && wbi->bi_iter.bi_sector <
> +	       dev->sector + STRIPE_SECTORS) {
> +		wbi2 = r5_next_bio(wbi, dev->sector);
> +		if (!raid5_dec_bi_active_stripes(wbi)) {
> +			md_write_end(conf->mddev);
> +			bio_list_add(return_bi, wbi);
> +		}
> +		wbi = wbi2;
> +	}
> +}
> +
> +void r5c_handle_cached_data_endio(struct r5conf *conf,
> +	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
> +{
> +	int i;
> +
> +	for (i = sh->disks; i--; ) {
> +		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
> +		    sh->dev[i].written) {

Is it possible for R5_InCache to be set, but 'written' to be NULL ???

> +			set_bit(R5_UPTODATE, &sh->dev[i].flags);
> +			r5c_return_dev_pending_writes(conf, &sh->dev[i],
> +						      return_bi);
> +			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> +					STRIPE_SECTORS,
> +					!test_bit(STRIPE_DEGRADED, &sh->state),
> +					0);
> +		}
> +	}
> +}
> +
>  /*
>   * Freeze the stripe, thus send the stripe into reclaim path.
>   *
> @@ -233,6 +272,48 @@ void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
>  		return;
>  	WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
>  	set_bit(STRIPE_R5C_FROZEN, &sh->state);
> +
> +	if (log->r5c_state == R5C_STATE_WRITE_THROUGH)
> +		return;
> +
> +	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> +		atomic_inc(&conf->preread_active_stripes);
> +
> +	if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
> +		BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
> +		atomic_dec(&conf->r5c_cached_partial_stripes);
> +	}
> +
> +	if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
> +		BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
> +		atomic_dec(&conf->r5c_cached_full_stripes);
> +	}
> +}
> +
> +static void r5c_handle_data_cached(struct stripe_head *sh)
> +{
> +	int i;
> +
> +	for (i = sh->disks; i--; )
> +		if (test_and_clear_bit(R5_Wantcache, &sh->dev[i].flags)) {
> +			set_bit(R5_InCache, &sh->dev[i].flags);
> +			clear_bit(R5_LOCKED, &sh->dev[i].flags);
> +			atomic_inc(&sh->dev_in_cache);
> +		}
> +}
> +
> +/*
> + * this journal write must contain full parity,
> + * it may also contain some data pages
> + */
> +static void r5c_handle_parity_cached(struct stripe_head *sh)
> +{
> +	int i;
> +
> +	for (i = sh->disks; i--; )
> +		if (test_bit(R5_InCache, &sh->dev[i].flags))
> +			set_bit(R5_Wantwrite, &sh->dev[i].flags);
> +	set_bit(STRIPE_R5C_WRITTEN, &sh->state);
>  }
>  
>  static void r5c_finish_cache_stripe(struct stripe_head *sh)
> @@ -242,8 +323,10 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
>  	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
>  		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>  		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
> -	} else
> -		BUG(); /* write back logic in next patch */
> +	} else if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +		r5c_handle_parity_cached(sh);
> +	else
> +		r5c_handle_data_cached(sh);
>  }
>  
>  static void r5l_io_run_stripes(struct r5l_io_unit *io)
> @@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>  	io = log->current_io;
>  
>  	for (i = 0; i < sh->disks; i++) {
> -		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
> +		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
> +		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>  			continue;

If you changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
that were destined for the journal, then this would just be

 		if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))

which would make lots of sense...  Just a thought.

>  		if (i == sh->pd_idx || i == sh->qd_idx)
>  			continue;
> @@ -514,7 +598,6 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>  	return 0;
>  }
>  
> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
>  /*
>   * running in raid5d, where reclaim could wait for raid5d too (when it flushes
>   * data from log to raid disks), so we shouldn't wait for reclaim here
> @@ -544,6 +627,10 @@ int r5l_write_stripe(struct r5l_log *log, struct stripe_head *sh)
>  
>  		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>  			continue;
> +
> +		if (test_bit(R5_InCache, &sh->dev[i].flags))
> +			continue;
> +
>  		write_disks++;
>  		/* checksum is already calculated in last run */
>  		if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
> @@ -809,7 +896,6 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
>  	}
>  }
>  
> -
>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
>  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
> @@ -872,7 +958,7 @@ static void r5l_reclaim_thread(struct md_thread *thread)
>  	r5l_do_reclaim(log);
>  }
>  
> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
> +void r5l_wake_reclaim(struct r5l_log *log, sector_t space)

Why are you removing the 'static' here?  You don't call it from any
other file.

>  {
>  	unsigned long target;
>  	unsigned long new = (unsigned long)space; /* overflow in theory */
> @@ -1191,6 +1277,8 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
>  			       int disks)
>  {
>  	struct r5l_log *log = conf->log;
> +	int i;
> +	struct r5dev *dev;
>  
>  	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
>  		return -EAGAIN;
> @@ -1201,21 +1289,121 @@ int r5c_handle_stripe_dirtying(struct r5conf *conf,
>  		r5c_freeze_stripe_for_reclaim(sh);
>  		return -EAGAIN;
>  	}
> -	BUG();  /* write back logic in next commit */
> +
> +	s->to_cache = 0;
> +
> +	for (i = disks; i--; ) {
> +		dev = &sh->dev[i];
> +		/* if none-overwrite, use the reclaim path (write through) */
                      non-overwrite

> +		if (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
> +		    !test_bit(R5_InCache, &dev->flags)) {
> +			r5c_freeze_stripe_for_reclaim(sh);
> +			return -EAGAIN;
> +		}
> +	}
> +
> +	for (i = disks; i--; ) {
> +		dev = &sh->dev[i];
> +		if (dev->towrite) {
> +			set_bit(R5_Wantcache, &dev->flags);
> +			set_bit(R5_Wantdrain, &dev->flags);
> +			set_bit(R5_LOCKED, &dev->flags);
> +			s->to_cache++;
> +		}
> +	}
> +
> +	if (s->to_cache)
> +		set_bit(STRIPE_OP_BIODRAIN, &s->ops_request);
> +
>  	return 0;
>  }
>  
>  /*
>   * clean up the stripe (clear STRIPE_R5C_FROZEN etc.) after the stripe is
>   * committed to RAID disks
> -*/
> + */
>  void r5c_handle_stripe_written(struct r5conf *conf,
>  			       struct stripe_head *sh)
>  {
> +	int i;
> +	int do_wakeup = 0;
> +
>  	if (!test_and_clear_bit(STRIPE_R5C_WRITTEN, &sh->state))
>  		return;
>  	WARN_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>  	clear_bit(STRIPE_R5C_FROZEN, &sh->state);
> +
> +	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH)
> +		return;
> +
> +	for (i = sh->disks; i--; ) {
> +		if (test_and_clear_bit(R5_InCache, &sh->dev[i].flags))
> +			atomic_dec(&sh->dev_in_cache);
> +		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
> +			do_wakeup = 1;
> +	}
> +
> +	if (test_and_clear_bit(STRIPE_FULL_WRITE, &sh->state))
> +		if (atomic_dec_and_test(&conf->pending_full_writes))
> +			md_wakeup_thread(conf->mddev->thread);
> +
> +	if (do_wakeup)
> +		wake_up(&conf->wait_for_overlap);
> +}
> +
> +int
> +r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
> +	       struct stripe_head_state *s)
> +{
> +	int pages;
> +	int reserve;
> +	int i;
> +	int ret = 0;
> +	int page_count = 0;
> +
> +	BUG_ON(!log);
> +
> +	for (i = 0; i < sh->disks; i++) {
> +		void *addr;
> +
> +		if (!test_bit(R5_Wantcache, &sh->dev[i].flags))
> +			continue;
> +		addr = kmap_atomic(sh->dev[i].page);
> +		sh->dev[i].log_checksum = crc32c_le(log->uuid_checksum,
> +						    addr, PAGE_SIZE);
> +		kunmap_atomic(addr);
> +		page_count++;
> +	}
> +	WARN_ON(page_count != s->to_cache);
> +	pages = s->to_cache;
> +
> +	/*
> +	 * The stripe must enter state machine again to call endio, so
> +	 * don't delay.
> +	 */
> +	clear_bit(STRIPE_DELAYED, &sh->state);
> +	atomic_inc(&sh->count);
> +
> +	mutex_lock(&log->io_mutex);
> +	/* meta + data */
> +	reserve = (1 + pages) << (PAGE_SHIFT - 9);
> +	if (!r5l_has_free_space(log, reserve)) {
> +		spin_lock(&log->no_space_stripes_lock);
> +		list_add_tail(&sh->log_list, &log->no_space_stripes);
> +		spin_unlock(&log->no_space_stripes_lock);
> +
> +		r5l_wake_reclaim(log, reserve);
> +	} else {
> +		ret = r5l_log_stripe(log, sh, pages, 0);
> +		if (ret) {
> +			spin_lock_irq(&log->io_list_lock);
> +			list_add_tail(&sh->log_list, &log->no_mem_stripes);
> +			spin_unlock_irq(&log->io_list_lock);
> +		}
> +	}
> +
> +	mutex_unlock(&log->io_mutex);
> +	return 0;
>  }
>  
>  static int r5l_load_log(struct r5l_log *log)
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 2e3e61a..0539f34 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -245,8 +245,25 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
>  			    < IO_THRESHOLD)
>  				md_wakeup_thread(conf->mddev->thread);
>  		atomic_dec(&conf->active_stripes);
> -		if (!test_bit(STRIPE_EXPANDING, &sh->state))
> -			list_add_tail(&sh->lru, temp_inactive_list);
> +		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
> +			if (atomic_read(&sh->dev_in_cache) == 0) {
> +				list_add_tail(&sh->lru, temp_inactive_list);
> +			} else if (atomic_read(&sh->dev_in_cache) ==
> +				   conf->raid_disks - conf->max_degraded) {

Is this really the test that you want?
Presumably "full stripes" are those that can be written out without
reading anything, while "partial stripes" will need something read in
first.
It is possible that the stripe head contains data for some devices which
is not in the cache, possibly because a READ request filled it in.
So we should really be counting the number of devices with R5_UPTODATE.
??

> +				/* full stripe */
> +				if (!test_and_set_bit(STRIPE_R5C_FULL_STRIPE, &sh->state))
> +					atomic_inc(&conf->r5c_cached_full_stripes);
> +				if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state))
> +					atomic_dec(&conf->r5c_cached_partial_stripes);
> +				list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
> +			} else {
> +				/* partial stripe */
> +				if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
> +						      &sh->state))
> +					atomic_inc(&conf->r5c_cached_partial_stripes);
> +				list_add_tail(&sh->lru, &conf->r5c_partial_stripe_list);
> +			}
> +		}
>  	}
>  }
>  
> @@ -830,6 +847,11 @@ static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
>  
>  	might_sleep();
>  
> +	if (s->to_cache) {
> +		r5c_cache_data(conf->log, sh, s);
> +		return;
> +	}
> +
>  	if (r5l_write_stripe(conf->log, sh) == 0)
>  		return;
>  	for (i = disks; i--; ) {
> @@ -1044,7 +1066,7 @@ again:
>  static struct dma_async_tx_descriptor *
>  async_copy_data(int frombio, struct bio *bio, struct page **page,
>  	sector_t sector, struct dma_async_tx_descriptor *tx,
> -	struct stripe_head *sh)
> +	struct stripe_head *sh, int no_skipcopy)
>  {
>  	struct bio_vec bvl;
>  	struct bvec_iter iter;
> @@ -1084,7 +1106,8 @@ async_copy_data(int frombio, struct bio *bio, struct page **page,
>  			if (frombio) {
>  				if (sh->raid_conf->skip_copy &&
>  				    b_offset == 0 && page_offset == 0 &&
> -				    clen == STRIPE_SIZE)
> +				    clen == STRIPE_SIZE &&
> +				    !no_skipcopy)
>  					*page = bio_page;
>  				else
>  					tx = async_memcpy(*page, bio_page, page_offset,
> @@ -1166,7 +1189,7 @@ static void ops_run_biofill(struct stripe_head *sh)
>  			while (rbi && rbi->bi_iter.bi_sector <
>  				dev->sector + STRIPE_SECTORS) {
>  				tx = async_copy_data(0, rbi, &dev->page,
> -					dev->sector, tx, sh);
> +						     dev->sector, tx, sh, 0);
>  				rbi = r5_next_bio(rbi, dev->sector);
>  			}
>  		}
> @@ -1293,7 +1316,8 @@ static int set_syndrome_sources(struct page **srcs,
>  		if (i == sh->qd_idx || i == sh->pd_idx ||
>  		    (srctype == SYNDROME_SRC_ALL) ||
>  		    (srctype == SYNDROME_SRC_WANT_DRAIN &&
> -		     test_bit(R5_Wantdrain, &dev->flags)) ||
> +		     (test_bit(R5_Wantdrain, &dev->flags) ||
> +		      test_bit(R5_InCache, &dev->flags))) ||
>  		    (srctype == SYNDROME_SRC_WRITTEN &&
>  		     dev->written))
>  			srcs[slot] = sh->dev[i].page;
> @@ -1472,9 +1496,25 @@ ops_run_compute6_2(struct stripe_head *sh, struct raid5_percpu *percpu)
>  static void ops_complete_prexor(void *stripe_head_ref)
>  {
>  	struct stripe_head *sh = stripe_head_ref;
> +	int i;
>  
>  	pr_debug("%s: stripe %llu\n", __func__,
>  		(unsigned long long)sh->sector);
> +
> +	if (!r5c_is_writeback(sh->raid_conf->log))
> +		return;
> +
> +	/*
> +	 * raid5-cache write back uses orig_page during prexor. after prexor,
> +	 * it is time to free orig_page
> +	 */
> +	for (i = sh->disks; i--; )
> +		if (sh->dev[i].page != sh->dev[i].orig_page) {
> +			struct page *p = sh->dev[i].page;
> +
> +			sh->dev[i].page = sh->dev[i].orig_page;
> +			put_page(p);
> +		}
>  }
>  
>  static struct dma_async_tx_descriptor *
> @@ -1496,7 +1536,8 @@ ops_run_prexor5(struct stripe_head *sh, struct raid5_percpu *percpu,
>  	for (i = disks; i--; ) {
>  		struct r5dev *dev = &sh->dev[i];
>  		/* Only process blocks that are known to be uptodate */
> -		if (test_bit(R5_Wantdrain, &dev->flags))
> +		if (test_bit(R5_Wantdrain, &dev->flags) ||
> +		    test_bit(R5_InCache, &dev->flags))
>  			xor_srcs[count++] = dev->page;
>  	}
>  
> @@ -1547,6 +1588,10 @@ ops_run_biodrain(struct stripe_head *sh, struct dma_async_tx_descriptor *tx)
>  
>  again:
>  			dev = &sh->dev[i];
> +			if (test_and_clear_bit(R5_InCache, &dev->flags)) {
> +				BUG_ON(atomic_read(&sh->dev_in_cache) == 0);
> +				atomic_dec(&sh->dev_in_cache);
> +			}
>  			spin_lock_irq(&sh->stripe_lock);
>  			chosen = dev->towrite;
>  			dev->towrite = NULL;
> @@ -1566,8 +1611,10 @@ again:
>  					set_bit(R5_Discard, &dev->flags);
>  				else {
>  					tx = async_copy_data(1, wbi, &dev->page,
> -						dev->sector, tx, sh);
> -					if (dev->page != dev->orig_page) {
> +							     dev->sector, tx, sh,
> +							     test_bit(R5_Wantcache, &dev->flags));
> +					if (dev->page != dev->orig_page &&
> +					    !test_bit(R5_Wantcache, &dev->flags)) {
>  						set_bit(R5_SkipCopy, &dev->flags);
>  						clear_bit(R5_UPTODATE, &dev->flags);
>  						clear_bit(R5_OVERWRITE, &dev->flags);
> @@ -1675,7 +1722,8 @@ again:
>  		xor_dest = xor_srcs[count++] = sh->dev[pd_idx].page;
>  		for (i = disks; i--; ) {
>  			struct r5dev *dev = &sh->dev[i];
> -			if (head_sh->dev[i].written)
> +			if (head_sh->dev[i].written ||
> +			    test_bit(R5_InCache, &head_sh->dev[i].flags))
>  				xor_srcs[count++] = dev->page;
>  		}
>  	} else {
> @@ -1930,6 +1978,7 @@ static struct stripe_head *alloc_stripe(struct kmem_cache *sc, gfp_t gfp,
>  		INIT_LIST_HEAD(&sh->batch_list);
>  		INIT_LIST_HEAD(&sh->lru);
>  		atomic_set(&sh->count, 1);
> +		atomic_set(&sh->dev_in_cache, 0);
>  		for (i = 0; i < disks; i++) {
>  			struct r5dev *dev = &sh->dev[i];
>  
> @@ -2810,12 +2859,30 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
>  		for (i = disks; i--; ) {
>  			struct r5dev *dev = &sh->dev[i];
>  
> +			/*
> +			 * Initially, handle_stripe_dirtying decided to run rmw
> +			 * and allocates extra page for prexor. However, rcw is
> +			 * cheaper later on. We need to free the extra page
> +			 * now, because we won't be able to do that in
> +			 * ops_complete_prexor().
> +			 */
> +			if (sh->dev[i].page != sh->dev[i].orig_page) {
> +				struct page *p = sh->dev[i].page;
> +
> +				p = sh->dev[i].page;
> +				sh->dev[i].page = sh->dev[i].orig_page;
> +				put_page(p);
> +			}
> +
>  			if (dev->towrite) {
>  				set_bit(R5_LOCKED, &dev->flags);
>  				set_bit(R5_Wantdrain, &dev->flags);
>  				if (!expand)
>  					clear_bit(R5_UPTODATE, &dev->flags);
>  				s->locked++;
> +			} else if (test_bit(R5_InCache, &dev->flags)) {
> +				set_bit(R5_LOCKED, &dev->flags);
> +				s->locked++;
>  			}
>  		}
>  		/* if we are not expanding this is a proper write request, and
> @@ -2855,6 +2922,9 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
>  				set_bit(R5_LOCKED, &dev->flags);
>  				clear_bit(R5_UPTODATE, &dev->flags);
>  				s->locked++;
> +			} else if (test_bit(R5_InCache, &dev->flags)) {
> +				set_bit(R5_LOCKED, &dev->flags);
> +				s->locked++;
>  			}
>  		}
>  		if (!s->locked)
> @@ -3529,9 +3599,12 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  	} else for (i = disks; i--; ) {
>  		/* would I have to read this buffer for read_modify_write */
>  		struct r5dev *dev = &sh->dev[i];
> -		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
> +		if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx ||
> +		     test_bit(R5_InCache, &dev->flags)) &&
>  		    !test_bit(R5_LOCKED, &dev->flags) &&
> -		    !(test_bit(R5_UPTODATE, &dev->flags) ||
> +		    !((test_bit(R5_UPTODATE, &dev->flags) &&
> +		       (!test_bit(R5_InCache, &dev->flags) ||
> +			dev->page != dev->orig_page)) ||
>  		      test_bit(R5_Wantcompute, &dev->flags))) {
>  			if (test_bit(R5_Insync, &dev->flags))
>  				rmw++;
> @@ -3543,13 +3616,15 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  		    i != sh->pd_idx && i != sh->qd_idx &&
>  		    !test_bit(R5_LOCKED, &dev->flags) &&
>  		    !(test_bit(R5_UPTODATE, &dev->flags) ||
> -		    test_bit(R5_Wantcompute, &dev->flags))) {
> +		      test_bit(R5_InCache, &dev->flags) ||
> +		      test_bit(R5_Wantcompute, &dev->flags))) {
>  			if (test_bit(R5_Insync, &dev->flags))
>  				rcw++;
>  			else
>  				rcw += 2*disks;
>  		}
>  	}
> +
>  	pr_debug("for sector %llu, rmw=%d rcw=%d\n",
>  		(unsigned long long)sh->sector, rmw, rcw);
>  	set_bit(STRIPE_HANDLE, &sh->state);
> @@ -3561,10 +3636,18 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  					  (unsigned long long)sh->sector, rmw);
>  		for (i = disks; i--; ) {
>  			struct r5dev *dev = &sh->dev[i];
> -			if ((dev->towrite || i == sh->pd_idx || i == sh->qd_idx) &&
> +			if (test_bit(R5_InCache, &dev->flags) &&
> +			    dev->page == dev->orig_page)
> +				dev->page = alloc_page(GFP_NOIO);  /* prexor */
> +
> +			if ((dev->towrite ||
> +			     i == sh->pd_idx || i == sh->qd_idx ||
> +			     test_bit(R5_InCache, &dev->flags)) &&
>  			    !test_bit(R5_LOCKED, &dev->flags) &&
> -			    !(test_bit(R5_UPTODATE, &dev->flags) ||
> -			    test_bit(R5_Wantcompute, &dev->flags)) &&
> +			    !((test_bit(R5_UPTODATE, &dev->flags) &&
> +			       (!test_bit(R5_InCache, &dev->flags) ||
> +				dev->page != dev->orig_page)) ||
> +			      test_bit(R5_Wantcompute, &dev->flags)) &&
>  			    test_bit(R5_Insync, &dev->flags)) {
>  				if (test_bit(STRIPE_PREREAD_ACTIVE,
>  					     &sh->state)) {
> @@ -3590,6 +3673,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  			    i != sh->pd_idx && i != sh->qd_idx &&
>  			    !test_bit(R5_LOCKED, &dev->flags) &&
>  			    !(test_bit(R5_UPTODATE, &dev->flags) ||
> +			      test_bit(R5_InCache, &dev->flags) ||
>  			      test_bit(R5_Wantcompute, &dev->flags))) {
>  				rcw++;
>  				if (test_bit(R5_Insync, &dev->flags) &&
> @@ -3629,7 +3713,7 @@ static void handle_stripe_dirtying(struct r5conf *conf,
>  	 */
>  	if ((s->req_compute || !test_bit(STRIPE_COMPUTE_RUN, &sh->state)) &&
>  	    (s->locked == 0 && (rcw == 0 || rmw == 0) &&
> -	    !test_bit(STRIPE_BIT_DELAY, &sh->state)))
> +	     !test_bit(STRIPE_BIT_DELAY, &sh->state)))
>  		schedule_reconstruction(sh, s, rcw == 0, 0);
>  }
>  
> @@ -4120,6 +4204,10 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
>  			if (rdev && !test_bit(Faulty, &rdev->flags))
>  				do_recovery = 1;
>  		}
> +		if (test_bit(R5_InCache, &dev->flags) && dev->written)
> +			s->just_cached++;
> +		if (test_bit(R5_Wantcache, &dev->flags) && dev->written)
> +			s->want_cache++;
>  	}
>  	if (test_bit(STRIPE_SYNCING, &sh->state)) {
>  		/* If there is a failed device being replaced,
> @@ -4285,6 +4373,17 @@ static void handle_stripe(struct stripe_head *sh)
>  
>  	analyse_stripe(sh, &s);
>  
> +	if (s.want_cache) {
> +		/* In last run of handle_stripe, we have finished
> +		 * r5c_handle_stripe_dirtying and ops_run_biodrain, but
> +		 * r5c_cache_data didn't finish because the journal device
> +		 * didn't have enough space. This time we should continue
> +		 * r5c_cache_data
> +		 */
> +		s.to_cache = s.want_cache;
> +		goto finish;
> +	}
> +
>  	if (test_bit(STRIPE_LOG_TRAPPED, &sh->state))
>  		goto finish;
>  
> @@ -4348,7 +4447,7 @@ static void handle_stripe(struct stripe_head *sh)
>  			struct r5dev *dev = &sh->dev[i];
>  			if (test_bit(R5_LOCKED, &dev->flags) &&
>  				(i == sh->pd_idx || i == sh->qd_idx ||
> -				 dev->written)) {
> +				 dev->written || test_bit(R5_InCache, &dev->flags))) {
>  				pr_debug("Writing block %d\n", i);
>  				set_bit(R5_Wantwrite, &dev->flags);
>  				if (prexor)
> @@ -4388,6 +4487,10 @@ static void handle_stripe(struct stripe_head *sh)
>  				 test_bit(R5_Discard, &qdev->flags))))))
>  		handle_stripe_clean_event(conf, sh, disks, &s.return_bi);
>  
> +	if (s.just_cached)
> +		r5c_handle_cached_data_endio(conf, sh, disks, &s.return_bi);
> +	r5l_stripe_write_finished(sh);
> +
>  	/* Now we might consider reading some blocks, either to check/generate
>  	 * parity, or to satisfy requests
>  	 * or to load a block that is being partially written.
> @@ -6526,6 +6629,11 @@ static struct r5conf *setup_conf(struct mddev *mddev)
>  	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++)
>  		INIT_LIST_HEAD(conf->temp_inactive_list + i);
>  
> +	atomic_set(&conf->r5c_cached_full_stripes, 0);
> +	INIT_LIST_HEAD(&conf->r5c_full_stripe_list);
> +	atomic_set(&conf->r5c_cached_partial_stripes, 0);
> +	INIT_LIST_HEAD(&conf->r5c_partial_stripe_list);
> +
>  	conf->level = mddev->new_level;
>  	conf->chunk_sectors = mddev->new_chunk_sectors;
>  	if (raid5_alloc_percpu(conf) != 0)
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 8bae64b..ac6d7c7 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -226,6 +226,7 @@ struct stripe_head {
>  
>  	struct r5l_io_unit	*log_io;
>  	struct list_head	log_list;
> +	atomic_t		dev_in_cache;
>  	/**
>  	 * struct stripe_operations
>  	 * @target - STRIPE_OP_COMPUTE_BLK target
> @@ -263,6 +264,7 @@ struct stripe_head_state {
>  	 */
>  	int syncing, expanding, expanded, replacing;
>  	int locked, uptodate, to_read, to_write, failed, written;
> +	int to_cache, want_cache, just_cached;
>  	int to_fill, compute, req_compute, non_overwrite;
>  	int failed_num[2];
>  	int p_failed, q_failed;
> @@ -313,6 +315,8 @@ enum r5dev_flags {
>  			 */
>  	R5_Discard,	/* Discard the stripe */
>  	R5_SkipCopy,	/* Don't copy data from bio to stripe cache */
> +	R5_Wantcache,	/* Want write data to write cache */
> +	R5_InCache,	/* Data in cache */
>  };
>  
>  /*
> @@ -348,6 +352,12 @@ enum {
>  	STRIPE_LOG_TRAPPED,	/* trapped into log */
>  	STRIPE_R5C_FROZEN,	/* r5c_cache frozen and being written out */
>  	STRIPE_R5C_WRITTEN,	/* ready for r5c_handle_stripe_written() */
> +	STRIPE_R5C_PARTIAL_STRIPE,	/* in r5c cache (to-be/being handled or
> +					 * in conf->r5c_partial_stripe_list)
> +					 */
> +	STRIPE_R5C_FULL_STRIPE,	/* in r5c cache (to-be/being handled or
> +				 * in conf->r5c_full_stripe_list)
> +				 */
>  };
>  
>  #define STRIPE_EXPAND_SYNC_FLAGS \
> @@ -600,6 +610,12 @@ struct r5conf {
>  	 */
>  	atomic_t		active_stripes;
>  	struct list_head	inactive_list[NR_STRIPE_HASH_LOCKS];
> +
> +	atomic_t		r5c_cached_full_stripes;
> +	struct list_head	r5c_full_stripe_list;
> +	atomic_t		r5c_cached_partial_stripes;
> +	struct list_head	r5c_partial_stripe_list;
> +
>  	atomic_t		empty_inactive_list_nr;
>  	struct llist_head	released_stripes;
>  	wait_queue_head_t	wait_for_quiescent;
> @@ -720,4 +736,10 @@ r5c_handle_stripe_dirtying(struct r5conf *conf, struct stripe_head *sh,
>  			   struct stripe_head_state *s, int disks);
>  extern void
>  r5c_handle_stripe_written(struct r5conf *conf, struct stripe_head *sh);
> +extern void r5l_wake_reclaim(struct r5l_log *log, sector_t space);
> +extern void r5c_handle_cached_data_endio(struct r5conf *conf,
> +	struct stripe_head *sh, int disks, struct bio_list *return_bi);
> +extern int r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
> +			  struct stripe_head_state *s);
> +extern void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh);
>  #endif
> -- 
> 2.9.3

This mostly seems good, though obviously I found a few little issues.
My eyes started to glaze over towards the end though ... maybe next time
I should read the patch from the bottom upwards :-)

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode
  2016-10-14  6:13   ` NeilBrown
@ 2016-10-14  7:03     ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2016-10-14  7:03 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Shaohua Li, Kernel Team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan


> On Oct 13, 2016, at 11:13 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Oct 13 2016, Song Liu wrote:
> 
>> 
>> /*
>> * raid5 cache state machine
>> *
>> * The RAID cache works in two major states for each stripe: write state
>> * and reclaim state. These states are controlled by flags STRIPE_R5C_FROZEN
>> * and STRIPE_R5C_WRITTEN
> 
> Hi,
> It would really help at this point to understand exactly what it is
> that is "FROZEN".  I guess the contents of the stripe_head are frozen?
> so that nothing changes between writing to the journal and writing to
> the RAID?  Making that explicit would help.

I guess I (wrongly) borrowed the frozen name from Shaohua's earlier
patch, where frozen means "no more writes until flush done". 

In this implementation, STRIPE_R5C_FROZEN is to differentiate write state
and reclaim state (reclaim may not be a good name here though). 

In write state, data writes are handled as:
    write to journal
    return IO
    write to journal
    return IO, etc

In reclaim state, the state machine works as
     calculate parity
     write parity (maybe also some data) to journal
     write data and parity to raid disks
     (maybe) return IO

The reclaim state is very similar to the current write journal (or write
through cache) behavior, while the write state only applies to the write
back cache.

This particular patch does NOT add the detailed logic of the write back
cache, so the stripe enters the reclaim state (setting STRIPE_R5C_FROZEN)
immediately on every write (in r5c_handle_stripe_dirtying).


> Only... the stripe_head gets "frozen" before that.  It gets frozen
> before the parity is calculated, and not released until the data and
> parity is written to the RAID.  This new "FROZEN" state is a subset of
> that - yes?

Yes, the stripe gets frozen before parity is calculated, and released after
data and parity are written to the RAID.

The timeline is like:
    write data to journal, return IO
    write data to journal, return IO
    ....
    frozen
    calculate parity
    write parity to journal
    write data and parity to RAID
    release or un-frozen or melt.
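
A small user-space sketch of that lifecycle (only the flag names
STRIPE_R5C_FROZEN / STRIPE_R5C_WRITTEN follow the patch, everything
else is illustrative):

#include <stdio.h>

#define STRIPE_R5C_FROZEN	(1u << 0)
#define STRIPE_R5C_WRITTEN	(1u << 1)

static unsigned int state;

/* write state: data goes to the journal and the bio completes */
static void cache_data_write(void)
{
	if (!(state & STRIPE_R5C_FROZEN))
		printf("write data to journal, return IO\n");
}

/* enter reclaim state: parity is computed and journaled */
static void freeze_for_reclaim(void)
{
	state |= STRIPE_R5C_FROZEN;
	printf("calculate parity, write parity to journal\n");
	state |= STRIPE_R5C_WRITTEN;	/* data + parity safe in journal */
}

/* back to write state once data and parity reach the RAID disks */
static void stripe_written_to_raid(void)
{
	if (state & STRIPE_R5C_WRITTEN) {
		printf("write data and parity to RAID, release stripe\n");
		state &= ~(STRIPE_R5C_FROZEN | STRIPE_R5C_WRITTEN);
	}
}

int main(void)
{
	cache_data_write();
	cache_data_write();
	freeze_for_reclaim();
	stripe_written_to_raid();
	return 0;
}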

> 
> Given that STRIPE_R5C_FROZEN is set when the stripe is in "reclaim
> state", I wonder why you didn't call it "STRIPE_R5C_RECLAIM"
> .... though I'm not sure "reclaim" is a good word.  Reclaim is
> something that happens to the journal after the data has been written
> to the RAID.
Reclaim might be a good name here. The flag indicates that the state machine
is working towards "write all data and parity to RAID".

>> 
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index 7557791b..9e05850 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -40,6 +40,47 @@
>>  */
>> #define R5L_POOL_SIZE	4
>> 
>> +enum r5c_state {
>> +	R5C_STATE_NO_CACHE = 0,
>> +	R5C_STATE_WRITE_THROUGH = 1,
>> +	R5C_STATE_WRITE_BACK = 2,
>> +	R5C_STATE_CACHE_BROKEN = 3,
> 
> How is CACHE_BROKEN different from NO_CACHE??
> Maybe a latter patch will tell me, but I'd rather know now :-)

Yeah, a later patch (sysfs entry r5c_state) has more details. In the current
write journal implementation, we force the array read-only when the journal
device is broken. The array stays read-only even after re-assembly, because
it still has the JOURNAL bit in its feature map. That is CACHE_BROKEN.

> As this state is stored in the r5l_log, and as we don't allocate that
> when there is no cache, why have a NO_CACHE state?

We only store "write through" or "write back" in r5l_log. However, the r5c_state sysfs
entry has 2 functionalities:
     1. switching between "write through" and "write back";
     2. putting the array from CACHE_BROKEN to NO_CACHE (removing the JOURNAL bit from 
         the feature map); see the sketch below. 
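
A rough sketch of the store side implementing those two functionalities; the
function name, the integer input format, and the error codes are assumptions
for illustration, not the actual patch:

static ssize_t r5c_state_store_sketch(struct mddev *mddev,
				      const char *page, size_t len)
{
	struct r5conf *conf = mddev->private;
	int val;

	if (kstrtoint(page, 10, &val))
		return -EINVAL;

	switch (val) {
	case R5C_STATE_WRITE_THROUGH:
	case R5C_STATE_WRITE_BACK:
		if (!conf->log)
			return -ENODEV;		/* no journal to configure */
		conf->log->r5c_state = val;	/* functionality 1 */
		break;
	case R5C_STATE_NO_CACHE:
		/*
		 * Functionality 2: only meaningful when leaving CACHE_BROKEN;
		 * this is where the JOURNAL feature bit would be cleared.
		 */
		break;
	default:
		return -EINVAL;
	}
	return len;
}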

>> 
>> +/*
>> + * Freeze the stripe, thus send the stripe into reclaim path.
>> + *
>> + * In current implementation, STRIPE_R5C_FROZEN is also set in write through
>> + * mode (in r5c_handle_stripe_dirtying). This does not change the behavior of
>> + * for write through mode.
>> + */
>> +void r5c_freeze_stripe_for_reclaim(struct stripe_head *sh)
> 
> "freeze_stripe_for_reclaim" seems an odd name.  How does "reclaim" fit?
> You are freezing the stripe so it can be written to the journal, and
> then eventually to the RAID are you not?
> Or are you freezing it so the space used in the journal cannot be reclaimed?
> 
> Maybe I misunderstand the "FROZEN" concept still.

It is more like "frozen" so that we kick off the reclaim process. 

> 
>> +{
>> +	struct r5conf *conf = sh->raid_conf;
>> +	struct r5l_log *log = conf->log;
>> +
>> +	if (!log)
>> +		return;
>> +	WARN_ON(test_bit(STRIPE_R5C_FROZEN, &sh->state));
>> +	set_bit(STRIPE_R5C_FROZEN, &sh->state);
>> +}
>> +
>> +static void r5c_finish_cache_stripe(struct stripe_head *sh)
> 
> A comment just before this function saying when it is called would help.
> "Call when a stripe_head is safe in the journal, but has not yet been
> written to the RAID" ??

Yes, exactly "Call when a stripe_head is safe in the journal, but has not yet 
been written to the RAID" 

>> +{
>> +	struct r5l_log *log = sh->raid_conf->log;
>> +
>> +	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
>> +		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>> +		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
>> +	} else
>> +		BUG(); /* write back logic in next patch */
>> +}
>> +
>> static void r5l_io_run_stripes(struct r5l_io_unit *io)
>> {
>> 	struct stripe_head *sh, *next;
>> 
>> 	list_for_each_entry_safe(sh, next, &io->stripe_list, log_list) {
>> 		list_del_init(&sh->log_list);
>> +
>> +		r5c_finish_cache_stripe(sh);
>> +
>> 		set_bit(STRIPE_HANDLE, &sh->state);
>> 		raid5_release_stripe(sh);
>> 	}
>> @@ -412,18 +493,19 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>> 		r5l_append_payload_page(log, sh->dev[i].page);
>> 	}
>> 
>> -	if (sh->qd_idx >= 0) {
>> +	if (parity_pages == 2) {
>> 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>> 					sh->sector, sh->dev[sh->pd_idx].log_checksum,
>> 					sh->dev[sh->qd_idx].log_checksum, true);
>> 		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
>> 		r5l_append_payload_page(log, sh->dev[sh->qd_idx].page);
>> -	} else {
>> +	} else if (parity_pages == 1) {
>> 		r5l_append_payload_meta(log, R5LOG_PAYLOAD_PARITY,
>> 					sh->sector, sh->dev[sh->pd_idx].log_checksum,
>> 					0, false);
>> 		r5l_append_payload_page(log, sh->dev[sh->pd_idx].page);
>> -	}
>> +	} else
>> +		BUG_ON(parity_pages != 0);
> 
> 
> Why this change?  I don't think it is wrong, but I don't see why it
> belongs here.  How does it help?

In the journal-only implementation, parity_pages can only be 1 or 2. With the
write-back cache, parity_pages can also be 0 (data only), so this is a new 
case to be handled here. The BUG_ON part is not really necessary. 

> 
>> +
>> +	if (!log || test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		return -EAGAIN;
>> +
>> +	if (conf->log->r5c_state == R5C_STATE_WRITE_THROUGH ||
>> +	    conf->mddev->degraded != 0) {
> 
> Why do you disable write-back when the array is degraded?

I think this is not really necessary. Our original thought was to keep things simple.
Since the array is already much slower in degraded mode, the write-back cache
cannot make it awesome anyway.

I think we can enable write-back in degraded mode later on. 

>> 
>> +	/*
>> +	 * Now to consider new write requests, cache write back and what else,
>> +	 * if anything should be read.  We do not handle new writes when:
>> 	 * 1/ A 'write' operation (copy+xor) is already in flight.
>> 	 * 2/ A 'check' operation is in flight, as it may clobber the parity
>> 	 *    block.
>> +	 * 3/ A r5c cache log write is in flight.
>> 	 */
>> -	if (s.to_write && !sh->reconstruct_state && !sh->check_state)
>> +	if ((s.to_write || test_bit(STRIPE_R5C_FROZEN, &sh->state)) &&
>> +	    !sh->reconstruct_state && !sh->check_state && !sh->log_io)
>> 		handle_stripe_dirtying(conf, sh, &s, disks);
> 
> OK, I'm definitely confused now.
> If the stripe is FROZEN, surely you don't want to call
> handle_stripe_dirtying().

handle_stripe_dirtying() is the beginning of the original write path: shall we do 
rcw or rmw? Once frozen, we need to calculate parity and write to the RAID, 
so we need handle_stripe_dirtying() here. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/8] md/r5cache: write part of r5cache
  2016-10-14  6:53   ` NeilBrown
@ 2016-10-14  7:33     ` Song Liu
  2016-10-19  0:53       ` NeilBrown
  0 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2016-10-14  7:33 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Shaohua Li, Kernel Team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan


> On Oct 13, 2016, at 11:53 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Oct 13 2016, Song Liu wrote:
>> 
>> For RMW, the code allocates an extra page for each data block
>> being updated.  This is stored in r5dev->page and the old data
>> is read into it.  Then the prexor calculation subtracts ->page
>> from the parity block, and the reconstruct calculation adds the
>> ->orig_page data back into the parity block.
> 
> What happens if the alloc_page() fails?

That will be tough, but solvable. We can
    read old data to page
    do prexor 
    read new data from the journal device to page
    do xor 
    do the rest of the work. 

Or we can force the code to rcw, which does not need an extra page. 
But rcw does not always work in degraded mode. So this is a good 
reason not to do write-back in degraded mode...

>> 
>> This patch includes a fix proposed by ZhengYuan Liu
>> <liuzhengyuan@kylinos.cn>
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
> 
> You introduce a new flag:
>> +	R5_InCache,	/* Data in cache */
> 
> but don't document it at all (apart from those three words).
> I must confess that I find it confusing when you talk about the
> "journal" as the "cache".  I understand why, but in my mind, the "cache"
> is the in-memory cache of stripe_heads.
> We now have something that we sometimes call a "journal", sometimes call
> a "log" and sometimes call a "cache".  That is a recipe for confusion.
> 
> A sentence or two explaining how this flag will be used would help in
> reading the rest of the code.

I will try to unify the names here, and add more comments. 

> 
>> +
>> +void r5c_handle_cached_data_endio(struct r5conf *conf,
>> +	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
>> +{
>> +	int i;
>> +
>> +	for (i = sh->disks; i--; ) {
>> +		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
>> +		    sh->dev[i].written) {
> 
> Is it possible for R5_InCache to be set, but 'written' to be NULL ???

Yes, it is possible. A stripe may go through "write data to journal, return IO" multiple
times before parity calculation. When it comes here the second time, a dev written in the 
first pass will have R5_InCache set, but its written will be NULL. 

> 
>> 
>> static void r5c_finish_cache_stripe(struct stripe_head *sh)
>> @@ -242,8 +323,10 @@ static void r5c_finish_cache_stripe(struct stripe_head *sh)
>> 	if (log->r5c_state == R5C_STATE_WRITE_THROUGH) {
>> 		BUG_ON(!test_bit(STRIPE_R5C_FROZEN, &sh->state));
>> 		set_bit(STRIPE_R5C_WRITTEN, &sh->state);
>> -	} else
>> -		BUG(); /* write back logic in next patch */
>> +	} else if (test_bit(STRIPE_R5C_FROZEN, &sh->state))
>> +		r5c_handle_parity_cached(sh);
>> +	else
>> +		r5c_handle_data_cached(sh);
>> }
>> 
>> static void r5l_io_run_stripes(struct r5l_io_unit *io)
>> @@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>> 	io = log->current_io;
>> 
>> 	for (i = 0; i < sh->disks; i++) {
>> -		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>> +		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
>> +		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>> 			continue;
> 
> If changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
> that were destined for the journal, then this would just be
> 
> 		if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))
> 
> which would make lots of sense...  Just a thought.

We set R5_Wantwrite in multiple places. If we simplify the code here, we will need to make
those places aware of the journal. I guess that is not ideal either? 

 
> 
>> }
>> 
>> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>> +void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
> 
> Why are you removing the 'static' here?  You don't call it from any
> other file.

In the next patch, it is called from raid5.c. 

> 
>> 
>> 
>> static int r5l_load_log(struct r5l_log *log)
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 2e3e61a..0539f34 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -245,8 +245,25 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
>> 			    < IO_THRESHOLD)
>> 				md_wakeup_thread(conf->mddev->thread);
>> 		atomic_dec(&conf->active_stripes);
>> -		if (!test_bit(STRIPE_EXPANDING, &sh->state))
>> -			list_add_tail(&sh->lru, temp_inactive_list);
>> +		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
>> +			if (atomic_read(&sh->dev_in_cache) == 0) {
>> +				list_add_tail(&sh->lru, temp_inactive_list);
>> +			} else if (atomic_read(&sh->dev_in_cache) ==
>> +				   conf->raid_disks - conf->max_degraded) {
> 
> Is this really the test that you want?
> Presumably "full stripes" are those that can be written out without
> reading anything, while "partial stripes" will need something read in
> first.
> It is possible that the stripe head contains data for some devices which
> is not in the cache, possibly because a READ request filled it in.
> So we should really be counting the number of devices with R5_UPTODATE.
> ??

Yes, counting R5_UPTODATE will be better. Let me check whether there are
corner cases we need to cover. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/8] md/r5cache: write part of r5cache
  2016-10-14  7:33     ` Song Liu
@ 2016-10-19  0:53       ` NeilBrown
  2016-10-21 20:42         ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2016-10-19  0:53 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, Shaohua Li, Kernel Team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan

[-- Attachment #1: Type: text/plain, Size: 4061 bytes --]

On Fri, Oct 14 2016, Song Liu wrote:

>> On Oct 13, 2016, at 11:53 PM, NeilBrown <neilb@suse.com> wrote:
>> 
>> On Thu, Oct 13 2016, Song Liu wrote:
>>> 
>>> For RMW, the code allocates an extra page for each data block
>>> being updated.  This is stored in r5dev->page and the old data
>>> is read into it.  Then the prexor calculation subtracts ->page
>>> from the parity block, and the reconstruct calculation adds the
>>> ->orig_page data back into the parity block.
>> 
>> What happens if the alloc_page() fails?
>
> That will be tough, but solvable. We can
>     read old data to page
>     do prexor 
>     read new data from the journal device to page
>     do xor 
>     do the rest of the work. 
>
> Or we can force the code to rcw, which does not need an extra page. 
> But rcw does not always work in degraded mode. So this is a good 
> reason not to do write-back in degraded mode...

Prohibiting write-back in degraded mode would not be enough to ensure
that you can always use rcw.  The array can become degraded after you
make the decision to use caching, and before you need to read old data
for rmw.

I would suggest a small (2 entry?) mempool where each entry in the
mempool holds enough pages to complete an rmw.  Only use the mempool if
an alloc_page() fails.

>> 
>>> +
>>> +void r5c_handle_cached_data_endio(struct r5conf *conf,
>>> +	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
>>> +{
>>> +	int i;
>>> +
>>> +	for (i = sh->disks; i--; ) {
>>> +		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
>>> +		    sh->dev[i].written) {
>> 
>> Is it possible for R5_InCache to be set, but 'written' to be NULL ???
>
> Yes, it is possible. A stripe may go through "write data to journal, return IO" multiple
> times before parity calculation. When it comes here the second time, a dev written in the 
> first pass will have R5_InCache set, but its written will be NULL. 

OK, that makes sense.
So is it possible for 'written' to be set, but R5_InCache to be clear?
i.e. do we really need to test R5_InCache here?

>>> 
>>> static void r5l_io_run_stripes(struct r5l_io_unit *io)
>>> @@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>> 	io = log->current_io;
>>> 
>>> 	for (i = 0; i < sh->disks; i++) {
>>> -		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>>> +		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
>>> +		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>>> 			continue;
>> 
>> If changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
>> that were destined for the journal, then this would just be
>> 
>> 		if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))
>> 
>> which would make lots of sense...  Just a thought.
>
> We set R5_Wantwrite in multiple places. If we simplify the code here, we will need to make
> those places aware of the journal. I guess that is not ideal either? 

Maybe...
We have so many state flags that I like to be very cautious about adding
more, and to make sure they have a very well defined meaning that
doesn't overlap with other flags too much.
The above code suggests that Wantwrite and Wantcache overlap to some
extent.

Could we discard Wantcache and just use Wantwrite combined with InCache?
Wantwrite means that the block needed to be written to the RAID.
If InCache isn't set, it also needs to be written to the cache (if the
cache is being used).
Then the above code would be
   if (!test_bit(R5_Wantwrite) || test_bit(R5_InCache))
      continue;

which means "if we don't want to write this, or if it is already in the
cache, then nothing to do here".

Maybe.

>
>  
>> 
>>> }
>>> 
>>> -static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>> +void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>> 
>> Why are you removing the 'static' here?  You don't call it from any
>> other file.
>
> In the next patch, it is called from raid5.c.

So remove 'static' in the next patch please.


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 5/8] md/r5cache: reclaim support
  2016-10-13  5:49 ` [PATCH v5 5/8] md/r5cache: reclaim support Song Liu
@ 2016-10-19  2:03   ` NeilBrown
  2016-10-21 21:04     ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2016-10-19  2:03 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

[-- Attachment #1: Type: text/plain, Size: 7199 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> There are two limited resources, stripe cache and journal disk space.
> For better performance, we prioritize reclaim of full stripe writes.
> To free up more journal space, we free the earliest data on the journal.
>
> In current implementation, reclaim happens when:
> 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 5 seconds) reclaim
>    if there is no reclaim in the past 5 seconds.

5 seconds is an arbitrary number.  I don't think it needs to be made
configurable, but it might help to justify it.
I would probably make it closer to 30 seconds.  There really isn't any
rush.  If there is much load, it will never wait this long.  If there is
not much load ... then it probably doesn't matter.


> 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (32) cached full stripes
>    (r5c_check_cached_full_stripe)

This is another arbitrary number, but I think it needs more
justification.  Why 32?  That's 128K per device, which isn't very much.
It might make sense to set the batch size based on the stripe size
> (e.g. at least one stripe?) or on the cache size.

> 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
> 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
>
> r5c_do_reclaim() contains new logic of reclaim.
>
> For stripe cache:
>
> When stripe cache pressure is high (more than 3/4 stripes are cached,

You say "3/4" here bit I think the code says "1/2"
> +	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
But there is also
> +	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
> +	    atomic_read(&conf->empty_inactive_list_nr) > 0)

so maybe you do mean 3/4..  Maybe some comments closer to the different
fractions would help.

> @@ -141,6 +150,12 @@ struct r5l_log {
>  
>  	/* for r5c_cache */
>  	enum r5c_state r5c_state;
> +
> +	/* all stripes in r5cache, in the order of seq at sh->log_start */
> +	struct list_head stripe_in_cache_list;
> +
> +	spinlock_t stripe_in_cache_lock;

You use the _irqsave / _irqrestore versions of spinlock for this lock,
but I cannot see where it is taken from interrupt-context.
What did I miss?

> +/* evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL */
> +static inline void r5c_update_log_state(struct r5l_log *log)
> +{
> +	struct r5conf *conf = log->rdev->mddev->private;
> +	sector_t free_space = r5l_ring_distance(log, log->log_start,
> +						log->last_checkpoint);
> +	sector_t reclaim_space = r5c_log_required_to_flush_cache(conf);
> +
> +	if (!r5c_is_writeback(log))
> +		return;
> +	else if (free_space < 2 * reclaim_space) {
> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	} else if (free_space < 3 * reclaim_space) {
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +	} else if (free_space > 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);

Hmmm...
We set LOG_CRITICAL if free_space < 2*reclaim_space, else we clear it.
We set LOG_TIGHT if free_space < 3*reclaim_space and clear it if > 4*reclaim_space

Would the code be clearer if you just made that explicit.

At the very least, change
> +	} else if (free_space > 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);

to

> +	} else if (free_space < 4 * reclaim_space) {
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +	} else {
> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> +     }

so the order of the branches is consistent and all the tests use '<'.
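
Spelled out fully, a sketch of that restructuring (illustrative only, not the
final code) could read:

	if (free_space < 2 * reclaim_space) {
		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
	} else if (free_space < 3 * reclaim_space) {
		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
	} else if (free_space < 4 * reclaim_space) {
		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
		/* R5C_LOG_TIGHT left unchanged: hysteresis between 3x and 4x */
	} else {
		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
	}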

>  
> +/*
> + * do not release a stripe to cached lists in quiesce
> + */

You tell me what this function doesn't do, but not what it does.

This function is only called once, and I think the code would be
clearer if this was just included in-line in that one place.

> +void r5c_prepare_stripe_for_release_in_quiesce(struct stripe_head *sh)
> +{
> +	if (!test_bit(STRIPE_HANDLE, &sh->state) &&
> +	    atomic_read(&sh->dev_in_cache) != 0) {
> +		if (!test_bit(STRIPE_R5C_FROZEN, &sh->state))
> +			r5c_freeze_stripe_for_reclaim(sh);
> +		set_bit(STRIPE_HANDLE, &sh->state);
> +	}
> +}
> +
>  
> @@ -709,6 +867,7 @@ static void r5l_run_no_space_stripes(struct r5l_log *log)
>  	while (!list_empty(&log->no_space_stripes)) {
>  		sh = list_first_entry(&log->no_space_stripes,
>  				      struct stripe_head, log_list);
> +		set_bit(STRIPE_PREREAD_ACTIVE, &sh->state);
>  		list_del_init(&sh->log_list);
>  		set_bit(STRIPE_HANDLE, &sh->state);
>  		raid5_release_stripe(sh);

This chunk doesn't seem to belong here.  It looks like it might be a
bugfix that is quite independent of the rest of the code.  Is it?
If it is, it probably belongs in a separate patch.


>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
> +	struct r5conf *conf = log->rdev->mddev->private;
>  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
>  	sector_t reclaimable;
>  	sector_t next_checkpoint;
> -	u64 next_cp_seq;
>  
>  	spin_lock_irq(&log->io_list_lock);
>  	/*
> @@ -924,8 +1115,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  				    log->io_list_lock);
>  	}
>  
> -	next_checkpoint = log->next_checkpoint;
> -	next_cp_seq = log->next_cp_seq;
> +	next_checkpoint = r5c_calculate_new_cp(conf);
>  	spin_unlock_irq(&log->io_list_lock);
>  
>  	BUG_ON(reclaimable < 0);
> @@ -941,7 +1131,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  
>  	mutex_lock(&log->io_mutex);
>  	log->last_checkpoint = next_checkpoint;
> -	log->last_cp_seq = next_cp_seq;

Why don't we update last_cp_seq here any more?


> +
> +/*
> + * if num < 0, flush all stripes
> + * if num == 0, flush all full stripes
> + * if num > 0, flush all full stripes. If less than num full stripes are
> + *             flushed, flush some partial stripes until totally num stripes are
> + *             flushed or there is no more cached stripes.
> + */

I'm not very keen on the idea of having "num < 0".

I'd rather the meaning was:
  flush all full stripes, and a total of at least num stripes.

To flush all stripes, pass MAX_INT for num (or similar).

> +void r5c_flush_cache(struct r5conf *conf, int num)
> +{
> +	int count;
> +	struct stripe_head *sh, *next;
> +
> +	assert_spin_locked(&conf->device_lock);
> +	if (!conf->log)
> +		return;
> +
> +	count = 0;
> +	list_for_each_entry_safe(sh, next, &conf->r5c_full_stripe_list, lru) {
> +		r5c_flush_stripe(conf, sh);
> +		count++;
> +	}
> +
> +	if (num == 0 || (num > 0 && count >= num))
> +		return;
> +	list_for_each_entry_safe(sh, next,
> +				 &conf->r5c_partial_stripe_list, lru) {
> +		r5c_flush_stripe(conf, sh);
> +		if (num > 0 && count++ >= num)
> +			break;
> +	}
> +}
> +
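
For illustration, here is a sketch of the function quoted above rewritten with
the suggested semantics (flush all full stripes, then keep flushing partial
stripes until at least num stripes have been flushed in total; callers that
want everything pass INT_MAX).  A sketch only, not the eventual patch:

void r5c_flush_cache(struct r5conf *conf, int num)
{
	struct stripe_head *sh, *next;
	int count = 0;

	assert_spin_locked(&conf->device_lock);
	if (!conf->log)
		return;

	/* full stripes are always flushed */
	list_for_each_entry_safe(sh, next, &conf->r5c_full_stripe_list, lru) {
		r5c_flush_stripe(conf, sh);
		count++;
	}

	/* top up with partial stripes until at least num are flushed */
	list_for_each_entry_safe(sh, next, &conf->r5c_partial_stripe_list, lru) {
		if (count >= num)
			break;
		r5c_flush_stripe(conf, sh);
		count++;
	}
}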


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state
  2016-10-13  5:49 ` [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state Song Liu
@ 2016-10-19  2:19   ` NeilBrown
  2016-10-21 21:20     ` Song Liu
  0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2016-10-19  2:19 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

[-- Attachment #1: Type: text/plain, Size: 726 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> r5c_state has 4 states:
> * no-cache;
> * write-through (write journal only);
> * write-back (w/ write cache);
> * cache-broken (journal missing or Faulty)

I don't like this.

We already have a mechanism for removing a failed device from
an array: write 'remove' to the 'state' file in the relevant dev-*
subdirectory.
You can also use that file to tell if the journal has failed.

Please call the file something a little more obvious than 'r5c_state',
maybe 'journal_mode', and use it only to switch between write-through
and write-back.
If there is no journal, then either remove the file, or have writes fail
and reads return something obvious ... maybe ENODEV.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 7/8] md/r5cache: r5c recovery
  2016-10-13  5:49 ` [PATCH v5 7/8] md/r5cache: r5c recovery Song Liu
@ 2016-10-19  2:32   ` NeilBrown
  2016-10-20  3:03   ` NeilBrown
  1 sibling, 0 replies; 23+ messages in thread
From: NeilBrown @ 2016-10-19  2:32 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

[-- Attachment #1: Type: text/plain, Size: 1748 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> This is the recovery part of raid5-cache.
>
> With cache feature, there are 2 different scenarios of recovery:
> 1. Data-Parity stripe: a stripe with complete parity in journal.
> 2. Data-Only stripe: a stripe with only data in journal (or partial
>    parity).
>
> The code differentiates Data-Parity stripes from Data-Only stripes with a
> flag (STRIPE_R5C_WRITTEN).
>
> For Data-Parity stripes, we use the same procedure as raid5 journal,
> where all the data and parity are replayed to the RAID devices.
>
> For Data-Only stripes, we need to complete the parity calculation and
> finish the full reconstruct write or RMW write. For simplicity, in
> the recovery, we load the stripe into the stripe cache. Once the array is
> started, the stripe cache state machine will handle these stripes
> through the normal write path.
>
> r5c_recovery_flush_log contains the main procedure of recovery. The
> recovery code first scans through the journal and loads data to
> stripe cache. The code keeps track of all these stripes in a list
> (using sh->lru and ctx->cached_list); stripes in the list are
> organized in the order of their first appearance in the journal.
> During the scan, the recovery code assesses each stripe as
> Data-Parity or Data-Only.
>
> During scan, the array may run out of stripe cache. In these cases,
> the recovery code will also call raid5_set_cache_size to increase
> stripe cache size.

What if this fails?  Maybe the array was created on a machine with lots
of memory, but that machine died and you are trying to recover your data
on a much less capable machine.

I don't think there is an easy answer, but at least you need to fail
gracefully.

I'll try to look at the rest tomorrow.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 7/8] md/r5cache: r5c recovery
  2016-10-13  5:49 ` [PATCH v5 7/8] md/r5cache: r5c recovery Song Liu
  2016-10-19  2:32   ` NeilBrown
@ 2016-10-20  3:03   ` NeilBrown
  2016-10-21 21:12     ` Song Liu
  1 sibling, 1 reply; 23+ messages in thread
From: NeilBrown @ 2016-10-20  3:03 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuang521,
	liuzhengyuan, Song Liu

[-- Attachment #1: Type: text/plain, Size: 3060 bytes --]

On Thu, Oct 13 2016, Song Liu wrote:

> This is the recovery part of raid5-cache.
>
> With cache feature, there are 2 different scenarios of recovery:
> 1. Data-Parity stripe: a stripe with complete parity in journal.
> 2. Data-Only stripe: a stripe with only data in journal (or partial
>    parity).
>
> The code differentiates Data-Parity stripes from Data-Only stripes with a
> flag (STRIPE_R5C_WRITTEN).
>
> For Data-Parity stripes, we use the same procedure as raid5 journal,
> where all the data and parity are replayed to the RAID devices.
>
> For Data-Only stripes, we need to complete the parity calculation and
> finish the full reconstruct write or RMW write. For simplicity, in
> the recovery, we load the stripe into the stripe cache. Once the array is
> started, the stripe cache state machine will handle these stripes
> through the normal write path.
>
> r5c_recovery_flush_log contains the main procedure of recovery. The
> recovery code first scans through the journal and loads data to
> stripe cache. The code keeps track of all these stripes in a list
> (using sh->lru and ctx->cached_list); stripes in the list are
> organized in the order of their first appearance in the journal.
> During the scan, the recovery code assesses each stripe as
> Data-Parity or Data-Only.
>
> During scan, the array may run out of stripe cache. In these cases,
> the recovery code will also call raid5_set_cache_size to increase
> stripe cache size.
>
> At the end of scan, the recovery code replays all Data-Parity
> stripes, and sets proper states for Data-Only stripes. The recovery
> code also increases the seq number by 10 and rewrites all Data-Only
> stripes to the journal. This is to avoid confusion after repeated
> crashes. More details are explained in raid5-cache.c before
> r5c_recovery_rewrite_data_only_stripes().
>
> Signed-off-by: Song Liu <songliubraving@fb.com>


This patch seems to do a number of different things.
I think it re-factors the journal reading code.
It adds code to write a new "empty" journal metadata block
and it adds support for recovery of cached data.

All this together makes it hard to review.  I'd rather see more, smaller
patches.


> +	/* stripes only have parity are already flushed to RAID */
> +	if (data_count == 0)
> +		goto out;

Can you explain why that is?  When were they flushed to the RAID, and
how was the parity determined?


> +
> +static void
> +r5l_recovery_create_emtpy_meta_block(struct r5l_log *log,

"empty"

> +				     struct page *page,
> +				     sector_t pos, u64 seq)

> +/* returns 0 for match; 1 for mismatch */

No, please don't do that.
You can return an negative error if you like, and call it as
   function_name() < 0
 or
   function_name() == 0

or give the function a name that describes what it tests
i.e.
   r5l_data_checksum_is_correct()
or
   r5l_data_checksum_does_not_match()

and then return 0 if the test fails and 1 if it succeeds.

> +static int
> +r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
> +				  sector_t log_offset, __le32 log_checksum)



Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 4/8] md/r5cache: write part of r5cache
  2016-10-19  0:53       ` NeilBrown
@ 2016-10-21 20:42         ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2016-10-21 20:42 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Shaohua Li, Kernel Team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan


> On Oct 18, 2016, at 5:53 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Fri, Oct 14 2016, Song Liu wrote:
> 
>>> On Oct 13, 2016, at 11:53 PM, NeilBrown <neilb@suse.com> wrote:
>> [...] do the rest of the work. 
>> 
>> Or we can force the code to rcw, which does not need an extra page. 
>> But rcw does not always work in degraded mode. So this is a good 
>> reason not to do write-back in degraded mode...
> 
> Prohibiting write-back in degraded mode would not be enough to ensure
> that you can always use rcw.  The array can become degraded after you
> make the decision to use caching, and before you need to read old data
> for rmw.
> 
> I would suggest a small (2 entry?) mempool where each entry in the
> mempool holds enough pages to complete an rmw.  Only use the mempool if
> an alloc_page() fails.
> 
This is a great idea. I will try it. 
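
A minimal sketch of what such a fallback pool could look like; the names
(r5c_rmw_pool, r5c_init_rmw_pool, r5c_alloc_rmw_pages) and the sizing are
illustrative assumptions, not the eventual implementation:

#include <linux/mempool.h>
#include <linux/mm.h>

static mempool_t *r5c_rmw_pool;

/* two entries, each an order-N block large enough for the pages of one rmw */
static int r5c_init_rmw_pool(int pages_per_rmw)
{
	int order = get_order(pages_per_rmw << PAGE_SHIFT);

	r5c_rmw_pool = mempool_create_page_pool(2, order);
	return r5c_rmw_pool ? 0 : -ENOMEM;
}

/* only reached when alloc_page() has already failed */
static struct page *r5c_alloc_rmw_pages(void)
{
	return mempool_alloc(r5c_rmw_pool, GFP_NOIO);
}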


>>> 
>>>> +
>>>> +void r5c_handle_cached_data_endio(struct r5conf *conf,
>>>> +	  struct stripe_head *sh, int disks, struct bio_list *return_bi)
>>>> +{
>>>> +	int i;
>>>> +
>>>> +	for (i = sh->disks; i--; ) {
>>>> +		if (test_bit(R5_InCache, &sh->dev[i].flags) &&
>>>> +		    sh->dev[i].written) {
>>> 
>>> Is it possible for R5_InCache to be set, but 'written' to be NULL ???
>> 
>> Yes, it is possible. A stripe may go through "write data to journal, return IO" multiple
>> times before parity calculation. When it comes here the second time, a dev written in the 
>> first pass will have R5_InCache set, but its written will be NULL. 
> 
> OK, that makes sense.
> So is it possible for 'written' to be set, but R5_InCache to be clear?
> i.e. do we really need to test R5_InCache here?

In the current implementation, there is only one log_io per sh, so we should be fine just 
checking 'written'. Let me double-check. 

> 
>>>> 
>>>> static void r5l_io_run_stripes(struct r5l_io_unit *io)
>>>> @@ -483,7 +566,8 @@ static int r5l_log_stripe(struct r5l_log *log, struct stripe_head *sh,
>>>> 	io = log->current_io;
>>>> 
>>>> 	for (i = 0; i < sh->disks; i++) {
>>>> -		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags))
>>>> +		if (!test_bit(R5_Wantwrite, &sh->dev[i].flags) &&
>>>> +		    !test_bit(R5_Wantcache, &sh->dev[i].flags))
>>>> 			continue;
>>> 
>>> If changed R5_Wantcache to R5_Wantjournal, and always set it on blocks
>>> that were destined for the journal, then this would just be
>>> 
>>> 		if (!test_bit(R5_Wantjournal, &sh->dev[i].flags))
>>> 
>>> which would make lots of sense...  Just a thought.
>> 
>> We set R5_Wantwrite in multiple places. If we simplify the code here, we will need to make
>> those places aware of the journal. I guess that is not ideal either? 
> 
> Maybe...
> We have so many state flags that I like to be very cautious about adding
> more, and to make sure they have a very well defined meaning that
> doesn't overlap with other flags too much.
> The above code suggests that Wantwrite and Wantcache overlap to some
> extent.
> 
> Could we discard Wantcache and just use Wantwrite combined with InCache?
> Wantwrite means that the block needed to be written to the RAID.
> If InCache isn't set, it also needs to be written to the cache (if the
> cache is being used).
> Then the above code would be
>   if (!test_bit(R5_Wantwrite) || test_bit(R5_InCache))
>      continue;
> 
> which means "if we don't want to write this, or if it is already in the
> cache, then nothing to do here".
> 
> Maybe.

I am not sure whether we can totally remove Wantcache. Let me try it. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 5/8] md/r5cache: reclaim support
  2016-10-19  2:03   ` NeilBrown
@ 2016-10-21 21:04     ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2016-10-21 21:04 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Shaohua Li, Kernel Team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan


> On Oct 18, 2016, at 7:03 PM, NeilBrown <neilb@suse.com> wrote:
> 
>> 1. Periodically (every R5C_RECLAIM_WAKEUP_INTERVAL, 5 seconds) reclaim
>>   if there is no reclaim in the past 5 seconds.
> 
> 5 seconds is an arbitrary number.  I don't think it needs to be made
> configurable, but it might help to justify it.
> I would probably make it closer to 30 seconds.  There really isn't any
> rush.  If there is much load, it will never wait this long.  If there is
> not much load ... then it probably doesn't matter.

Yeah, 30 seconds should work here. 
> 
> 
>> 2. when there are R5C_FULL_STRIPE_FLUSH_BATCH (32) cached full stripes
>>   (r5c_check_cached_full_stripe)
> 
> This is another arbitrary number, but I think it needs more
> justification.  Why 32?  That's 128K per device, which isn't very much.
> It might make sense to set the batch size based on the stripe size
> (e.g. at least one stripe?) or on the cache size.

How about we use something like min(stripe_cache_size / 8, 256)? Pressure on
journal space will still trigger reclaim via the other conditions (#4 below). 
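
As a sketch, that could look like the following; the macro name is illustrative,
and which conf field actually reflects stripe_cache_size is an assumption here:

/* flush full stripes in batches that scale with the stripe cache size */
#define R5C_FULL_STRIPE_FLUSH_BATCH(conf) \
	min_t(int, (conf)->max_nr_stripes / 8, 256)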

> 
>> 3. when there is pressure on stripe cache (r5c_check_stripe_cache_usage)
>> 4. when there is pressure on journal space (r5l_write_stripe, r5c_cache_data)
>> 
>> r5c_do_reclaim() contains new logic of reclaim.
>> 
>> For stripe cache:
>> 
>> When stripe cache pressure is high (more than 3/4 stripes are cached,
> 
> You say "3/4" here bit I think the code says "1/2"
>> +	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
> But there is also
>> +	if (total_cached > conf->min_nr_stripes * 3 / 4 ||
>> +	    atomic_read(&conf->empty_inactive_list_nr) > 0)
> 
> so maybe you do mean 3/4..  Maybe some comments closer to the different
> fractions would help.

There are two levels of cache pressure:
high pressure: total_cached > 3/4 * min_nr_stripes || empty_inactive_list_nr > 0
moderate pressure: total_cached > 1/2 * min_nr_stripes. 

The condition:
+	if (total_cached > conf->min_nr_stripes * 1 / 2 ||
+	    atomic_read(&conf->empty_inactive_list_nr) > 0)

covers "either high or moderate pressure". 


>> @@ -141,6 +150,12 @@ struct r5l_log {
>> 
>> 	/* for r5c_cache */
>> 	enum r5c_state r5c_state;
>> +
>> +	/* all stripes in r5cache, in the order of seq at sh->log_start */
>> +	struct list_head stripe_in_cache_list;
>> +
>> +	spinlock_t stripe_in_cache_lock;
> 
> You use the _irqsave / _irqrestore versions of spinlock for this lock,
> but I cannot see where it is taken from interrupt-context.
> What did I miss?

Let me fix them. 

> 
>> +/* evaluate log space usage and update R5C_LOG_TIGHT and R5C_LOG_CRITICAL */
>> +static inline void r5c_update_log_state(struct r5l_log *log)
>> +{
>> +	struct r5conf *conf = log->rdev->mddev->private;
>> +	sector_t free_space = r5l_ring_distance(log, log->log_start,
>> +						log->last_checkpoint);
>> +	sector_t reclaim_space = r5c_log_required_to_flush_cache(conf);
>> +
>> +	if (!r5c_is_writeback(log))
>> +		return;
>> +	else if (free_space < 2 * reclaim_space) {
>> +		set_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +	} else if (free_space < 3 * reclaim_space) {
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +		set_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +	} else if (free_space > 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> 
> Hmmm...
> We set LOG_CRITICAL if free_space < 2*reclaim_space, else we clear it.
> We set LOG_TIGHT if free_space < 3*reclaim_space and clear it if > 4*reclaim_space
> 
> Would the code be clearer if you just made that explicit.
> 
> At the very least, change
>> +	} else if (free_space > 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
> 
> to
> 
>> +	} else if (free_space < 4 * reclaim_space) {
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +	} else {
>> +		clear_bit(R5C_LOG_TIGHT, &conf->cache_state);
>> +		clear_bit(R5C_LOG_CRITICAL, &conf->cache_state);
>> +     }
> 
> so the order of the branches is consistent and all the tests use '<'.

I will revise the code and the comments. 

> 
>> 
>> 	BUG_ON(reclaimable < 0);
>> @@ -941,7 +1131,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>> 
>> 	mutex_lock(&log->io_mutex);
>> 	log->last_checkpoint = next_checkpoint;
>> -	log->last_cp_seq = next_cp_seq;
> 
> Why don't we update last_cp_seq here any more?

There are two reasons: 
1. We are not using last_cp_seq after recovery is done, so it is not necessary
    to update it. Shaohua had some idea to use it for reclaim (to keep the ordering of 
    stripes), but my initial implementation uses sh->log_start and 
    r5l_ring_distance() instead (see the sketch below). 
2. With the write-back cache, last_cp_seq would be the seq of the first stripe in 
    stripe_in_cache_list. To track last_cp_seq, we would need an extra first_seq_in_cache
    in stripe_head, and I feel that is a waste of memory. 
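
A sketch of that calculation, roughly what r5c_calculate_new_cp() mentioned in
the patch could do; the list linkage field on stripe_head and the locking are
simplified assumptions here, not the actual code:

static sector_t r5c_calculate_new_cp_sketch(struct r5conf *conf)
{
	struct r5l_log *log = conf->log;
	struct stripe_head *sh;

	if (!r5c_is_writeback(log) ||
	    list_empty(&log->stripe_in_cache_list))
		return log->next_checkpoint;

	/* the oldest cached stripe limits how far the log tail may advance */
	sh = list_first_entry(&log->stripe_in_cache_list,
			      struct stripe_head, log_list);
	return sh->log_start;
}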
   
> 
>> +
>> +/*
>> + * if num < 0, flush all stripes
>> + * if num == 0, flush all full stripes
>> + * if num > 0, flush all full stripes. If less than num full stripes are
>> + *             flushed, flush some partial stripes until totally num stripes are
>> + *             flushed or there is no more cached stripes.
>> + */
> 
> I'm not very keen on the idea of having "num < 0".
> 
> I'd rather the meaning was:
>  flush all full stripes, and a total of at least num stripes.
> 
> To flush all stripes, pass MAX_INT for num (or similar).

This is a great idea. I will fix it. 

Thanks,
Song


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 7/8] md/r5cache: r5c recovery
  2016-10-20  3:03   ` NeilBrown
@ 2016-10-21 21:12     ` Song Liu
  2016-10-26  1:18       ` NeilBrown
  0 siblings, 1 reply; 23+ messages in thread
From: Song Liu @ 2016-10-21 21:12 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Shaohua Li, Kernel Team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan


> On Oct 19, 2016, at 8:03 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Oct 13 2016, Song Liu wrote:
> 
>> This is the recovery part of raid5-cache.
>> 
>> With cache feature, there are 2 different scenarios of recovery:
>> 1. Data-Parity stripe: a stripe with complete parity in journal.
>> 2. Data-Only stripe: a stripe with only data in journal (or partial
>>   parity).
>> 
>> The code differentiates Data-Parity stripes from Data-Only stripes with a
>> flag (STRIPE_R5C_WRITTEN).
>> 
>> For Data-Parity stripes, we use the same procedure as raid5 journal,
>> where all the data and parity are replayed to the RAID devices.
>> 
>> For Data-Only stripes, we need to complete the parity calculation and
>> finish the full reconstruct write or RMW write. For simplicity, in
>> the recovery, we load the stripe into the stripe cache. Once the array is
>> started, the stripe cache state machine will handle these stripes
>> through the normal write path.
>> 
>> r5c_recovery_flush_log contains the main procedure of recovery. The
>> recovery code first scans through the journal and loads data to
>> stripe cache. The code keeps track of all these stripes in a list
>> (using sh->lru and ctx->cached_list); stripes in the list are
>> organized in the order of their first appearance in the journal.
>> During the scan, the recovery code assesses each stripe as
>> Data-Parity or Data-Only.
>> 
>> During scan, the array may run out of stripe cache. In these cases,
>> the recovery code will also call raid5_set_cache_size to increase
>> stripe cache size.
>> 
>> At the end of scan, the recovery code replays all Data-Parity
>> stripes, and sets proper states for Data-Only stripes. The recovery
>> code also increases the seq number by 10 and rewrites all Data-Only
>> stripes to the journal. This is to avoid confusion after repeated
>> crashes. More details are explained in raid5-cache.c before
>> r5c_recovery_rewrite_data_only_stripes().
>> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
> 
> 
> This patch seems to do a number of different things.
> I think it re-factors the journal reading code.
> It adds code to write a new "empty" journal metadata block
> and it adds support for recovery of cached data.
> 
> All this together makes it hard to review.  I'd rather see more, smaller
> patches.
> 

I will try to split the patch into meaningful smaller patches. 
> 
>> +	/* stripes only have parity are already flushed to RAID */
>> +	if (data_count == 0)
>> +		goto out;
> 
> Can you explain why that is?  When were they flushed to the RAID, and
> how was the parity determined?

It happens like this: say there are two stripes in the journal, 100 and 200. The data (D)
and parity (P) pages are stored in the journal as:

      --->  D100   D200   P100  P200 ---->  newer data

Before we flush D100, journal_start points at D100. Then we flush D100, 
and the new journal_start points at D200. Now the system fails, so the next 
recovery starts from D200. The recovery code will find that stripe 100 only has
parity. This means stripe 100 was already flushed to the RAID, so we can ignore it. 

> 
>> +/* returns 0 for match; 1 for mismatch */
> 
> No, please don't do that.
> You can return an negative error if you like, and call it as
>   function_name() < 0
> or
>   function_name() == 0
> 
> or give the function a name that describes what it tests
> i.e.
>   r5l_data_checksum_is_correct()
> or
>   r5l_data_checksum_does_not_match()
> 
> and then return 0 if the test fails and 1 if it succeeds.
> 
>> +static int
>> +r5l_recovery_verify_data_checksum(struct r5l_log *log, struct page *page,
>> +				  sector_t log_offset, __le32 log_checksum)

I will fix this. 

Thanks,
Song

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state
  2016-10-19  2:19   ` NeilBrown
@ 2016-10-21 21:20     ` Song Liu
  0 siblings, 0 replies; 23+ messages in thread
From: Song Liu @ 2016-10-21 21:20 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Shaohua Li, Kernel Team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan


> On Oct 18, 2016, at 7:19 PM, NeilBrown <neilb@suse.com> wrote:
> 
> On Thu, Oct 13 2016, Song Liu wrote:
> 
>> r5c_state has 4 states:
>> * no-cache;
>> * write-through (write journal only);
>> * write-back (w/ write cache);
>> * cache-broken (journal missing or Faulty)
> 
> I don't like this.
> 
> We already have a mechanism for removing a failed device from
> an array: write 'remove' to the 'state' file in the relevant dev-*
> subdirectory.
> You can also use that file to tell if the journal has failed.

I think we would like a separate mechanism to remove the "journal feature",
not the "journal device".  Maybe we should put it in a different sysfs file. 


> Please call the file something a little more obvious than 'r5c_state',
> maybe 'journal_mode', and use it only to switch between write-through
> and write-back.
> If there is no journal, then either remove the file, or have writes fail
> and reads return something obvious ... maybe ENODEV.

Thanks,
Song


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH v5 7/8] md/r5cache: r5c recovery
  2016-10-21 21:12     ` Song Liu
@ 2016-10-26  1:18       ` NeilBrown
  0 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2016-10-26  1:18 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, Shaohua Li, Kernel Team, dan.j.williams, hch,
	liuzhengyuang521, liuzhengyuan

[-- Attachment #1: Type: text/plain, Size: 1202 bytes --]

On Sat, Oct 22 2016, Song Liu wrote:

>>> +	/* stripes only have parity are already flushed to RAID */
>>> +	if (data_count == 0)
>>> +		goto out;
>> 
>> Can you explain why that is?  When were they flushed to the RAID, and
>> how was the parity determined?
>
> It happens like this: say there are two stripes in the journal, 100 and 200. The data (D)
> and parity (P) pages are stored in the journal as:
>
>       --->  D100   D200   P100  P200 ---->  newer data
>
> Before we flush D100, journal_start points at D100. Then we flush D100, 
> and the new journal_start points at D200. Now the system fails, so the next 
> recovery starts from D200. The recovery code will find that stripe 100 only has
> parity. This means stripe 100 was already flushed to the RAID, so we can ignore it. 

OK, I see.  Thanks.

So the data for the stripe had previously been flushed before the crash
that is currently being recovered from.  It might help to make the time
of the flush more explicit in the comment:

   /* stripes that only have parity must have been flushed
    * before the crash that we are now recovering from, so
    * there is nothing more to recover.
    */

Something like that.

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2016-10-26  1:18 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-13  5:49 [PATCH v5 0/8] raid5-cache: enabling cache features Song Liu
2016-10-13  5:49 ` [PATCH v5 1/8] md/r5cache: Check array size in r5l_init_log Song Liu
2016-10-13  5:49 ` [PATCH v5 2/8] md/r5cache: move some code to raid5.h Song Liu
2016-10-13  5:49 ` [PATCH v5 3/8] md/r5cache: State machine for raid5-cache write back mode Song Liu
2016-10-14  6:13   ` NeilBrown
2016-10-14  7:03     ` Song Liu
2016-10-13  5:49 ` [PATCH v5 4/8] md/r5cache: write part of r5cache Song Liu
2016-10-14  6:53   ` NeilBrown
2016-10-14  7:33     ` Song Liu
2016-10-19  0:53       ` NeilBrown
2016-10-21 20:42         ` Song Liu
2016-10-13  5:49 ` [PATCH v5 5/8] md/r5cache: reclaim support Song Liu
2016-10-19  2:03   ` NeilBrown
2016-10-21 21:04     ` Song Liu
2016-10-13  5:49 ` [PATCH v5 6/8] md/r5cache: sysfs entry r5c_state Song Liu
2016-10-19  2:19   ` NeilBrown
2016-10-21 21:20     ` Song Liu
2016-10-13  5:49 ` [PATCH v5 7/8] md/r5cache: r5c recovery Song Liu
2016-10-19  2:32   ` NeilBrown
2016-10-20  3:03   ` NeilBrown
2016-10-21 21:12     ` Song Liu
2016-10-26  1:18       ` NeilBrown
2016-10-13  5:49 ` [PATCH v5 8/8] md/r5cache: handle SYNC and FUA Song Liu
