* [PATCH 1/2] md/r5cache: disable write back for degraded raid6
From: Song Liu @ 2017-01-18 23:56 UTC
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen

raid6 handles writes differently in degraded mode. Specifically,
handle_stripe_fill() is called for writes. As a result, the
write-back cache has very little performance benefit for degraded
raid6. (On the other hand, the write-back cache does help
sequential writes on degraded raid4 and raid5.)

The write-back cache in degraded mode also introduces data
integrity corner cases, mostly because handle_stripe_fill() is
called on writes. To avoid handling these corner cases, this patch
disables the write-back cache for degraded raid6.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 4957297..b31ae41 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -2371,6 +2371,16 @@ int r5c_try_caching_write(struct r5conf *conf,
 		set_bit(STRIPE_R5C_CACHING, &sh->state);
 	}
 
+	/*
+	 * When a raid6 array runs in degraded mode, handle_stripe_fill()
+	 * is called on every write, so the write-back cache does not help
+	 * performance. To keep the code simple, use write-through.
+	 */
+	if (conf->level == 6 && s->failed) {
+		r5c_make_stripe_write_out(sh);
+		return -EAGAIN;
+	}
+
 	for (i = disks; i--; ) {
 		dev = &sh->dev[i];
 		/* if non-overwrite, use writing-out phase */
-- 
2.9.3



* [PATCH 2/2] md/r5cache: improve journal device efficiency
From: Song Liu @ 2017-01-18 23:56 UTC
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen

It is important to be able to flush all stripes in raid5-cache.
Therefore, we need to reserve some space on the journal device for
these flushes. If the flush operation includes pending writes to the
stripe, we need to reserve (conf->raid_disks + 1) pages per stripe
for the flush out, which reduces the efficiency of the journal
space. If we exclude these pending writes from the flush operation,
we only need (conf->max_degraded + 1) pages per stripe.

With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
pending writes are excluded from the stripe flush out. This reduces
the reserved space needed for flushes and thus improves journal
device efficiency.
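
As a worked example (the 8-disk geometry and stripe count are
assumptions for illustration; BLOCK_SECTORS is 8 in raid5-cache.c,
i.e. one 4KiB page):

#include <stdio.h>

#define BLOCK_SECTORS 8	/* 4KiB page = 8 x 512-byte sectors */

int main(void)
{
	int raid_disks = 8, max_degraded = 2;	/* assumed raid6 */
	long stripes = 1024;	/* stripes currently in the journal */

	long before = (long)BLOCK_SECTORS * (raid_disks + 1) * stripes;
	long after  = (long)BLOCK_SECTORS * (max_degraded + 1) * stripes;

	/* prints 73728 sectors (36 MiB) vs 24576 sectors (12 MiB) */
	printf("before: %ld sectors (%ld MiB)\n", before, before / 2048);
	printf("after:  %ld sectors (%ld MiB)\n", after, after / 2048);
	return 0;
}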

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 15 +++------------
 drivers/md/raid5.c       | 42 ++++++++++++++++++++++++++++++++----------
 2 files changed, 35 insertions(+), 22 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index b31ae41..c027f1b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -387,17 +387,8 @@ void r5c_check_cached_full_stripe(struct r5conf *conf)
 /*
  * Total log space (in sectors) needed to flush all data in cache
  *
- * Currently, writing-out phase automatically includes all pending writes
- * to the same sector. So the reclaim of each stripe takes up to
- * (conf->raid_disks + 1) pages of log space.
- *
- * To totally avoid deadlock due to log space, the code reserves
- * (conf->raid_disks + 1) pages for each stripe in cache, which is not
- * necessary in most cases.
- *
- * To improve this, we will need writing-out phase to be able to NOT include
- * pending writes, which will reduce the requirement to
- * (conf->max_degraded + 1) pages per stripe in cache.
+ * To flush all stripes in cache, we need (conf->max_degraded + 1)
+ * pages per stripe in cache.
  */
 static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
 {
@@ -406,7 +397,7 @@ static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
 	if (!r5c_is_writeback(log))
 		return 0;
 
-	return BLOCK_SECTORS * (conf->raid_disks + 1) *
+	return BLOCK_SECTORS * (conf->max_degraded + 1) *
 		atomic_read(&log->stripe_in_journal_count);
 }
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 193acd3..55a6156 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2953,13 +2953,35 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
  *      like to flush data in journal to RAID disks first, so complex rmw
  *      is handled in the write patch (handle_stripe_dirtying).
  *
- *   2. to be added
+ *   2. when journal space is critical (R5C_LOG_CRITICAL=1)
+ *
+ *      It is important to be able to flush all stripes in raid5-cache.
+ *      Therefore, we need to reserve some space on the journal device
+ *      for these flushes. If the flush operation includes pending
+ *      writes to the stripe, we need to reserve (conf->raid_disks + 1)
+ *      pages per stripe for the flush out. If we exclude these pending
+ *      writes from the flush operation, we only need
+ *      (conf->max_degraded + 1) pages per stripe. Excluding pending
+ *      writes in this case thus makes more efficient use of the journal.
+ *
+ *      Note: To make sure the stripe makes progress, we only delay
+ *      towrite for stripes with data already in the journal
+ *      (injournal > 0). When LOG_CRITICAL, stripes with injournal == 0
+ *      will be sent to the no_space_stripes list.
  */
-static inline bool delay_towrite(struct r5dev *dev,
-				   struct stripe_head_state *s)
+static inline bool delay_towrite(struct r5conf *conf,
+				 struct r5dev *dev,
+				 struct stripe_head_state *s)
 {
-	return dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
-		!test_bit(R5_Insync, &dev->flags) && s->injournal;
+	/* case 1 above */
+	if (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags) &&
+	    !test_bit(R5_Insync, &dev->flags) && s->injournal)
+		return true;
+	/* case 2 above */
+	if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
+	    s->injournal > 0)
+		return true;
+	return false;
 }
 
 static void
@@ -2982,7 +3004,7 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 
-			if (dev->towrite && !delay_towrite(dev, s)) {
+			if (dev->towrite && !delay_towrite(conf, dev, s)) {
 				set_bit(R5_LOCKED, &dev->flags);
 				set_bit(R5_Wantdrain, &dev->flags);
 				if (!expand)
@@ -3733,7 +3755,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 	} else for (i = disks; i--; ) {
 		/* would I have to read this buffer for read_modify_write */
 		struct r5dev *dev = &sh->dev[i];
-		if (((dev->towrite && !delay_towrite(dev, s)) ||
+		if (((dev->towrite && !delay_towrite(conf, dev, s)) ||
 		     i == sh->pd_idx || i == sh->qd_idx ||
 		     test_bit(R5_InJournal, &dev->flags)) &&
 		    !test_bit(R5_LOCKED, &dev->flags) &&
@@ -3757,8 +3779,8 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 		}
 	}
 
-	pr_debug("for sector %llu, rmw=%d rcw=%d\n",
-		(unsigned long long)sh->sector, rmw, rcw);
+	pr_debug("for sector %llu state 0x%lx, rmw=%d rcw=%d\n",
+		 (unsigned long long)sh->sector, sh->state, rmw, rcw);
 	set_bit(STRIPE_HANDLE, &sh->state);
 	if ((rmw < rcw || (rmw == rcw && conf->rmw_level == PARITY_PREFER_RMW)) && rmw > 0) {
 		/* prefer read-modify-write, but need to get some data */
@@ -3798,7 +3820,7 @@ static int handle_stripe_dirtying(struct r5conf *conf,
 
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
-			if (((dev->towrite && !delay_towrite(dev, s)) ||
+			if (((dev->towrite && !delay_towrite(conf, dev, s)) ||
 			     i == sh->pd_idx || i == sh->qd_idx ||
 			     test_bit(R5_InJournal, &dev->flags)) &&
 			    !test_bit(R5_LOCKED, &dev->flags) &&
-- 
2.9.3



* Re: [PATCH 1/2] md/r5cache: disable write back for degraded raid6
From: Shaohua Li @ 2017-01-21 18:42 UTC
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuan, liuyun01, Jes.Sorensen

On Wed, Jan 18, 2017 at 03:56:49PM -0800, Song Liu wrote:
> raid6 handles writes differently in degraded mode. Specifically,
> handle_stripe_fill() is called for writes. As a result, the
> write-back cache has very little performance benefit for degraded
> raid6. (On the other hand, the write-back cache does help
> sequential writes on degraded raid4 and raid5.)

To be honest, I really hate the idea of write-back for a degraded
array. It adds a lot of complexity. It may improve write performance,
but read performance on a degraded array is always bad, so overall
disk performance is poor either way; improving writes alone doesn't
change that. I can't imagine who cares about the performance of a
degraded array. When a disk is broken, the first thing people should
do is take action to avoid further disk loss. In other words, you are
optimizing a corner case. I hope we can delete this logic later.

> The write-back cache in degraded mode also introduces data
> integrity corner cases, mostly because handle_stripe_fill() is
> called on writes. To avoid handling these corner cases, this patch
> disables the write-back cache for degraded raid6.
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 4957297..b31ae41 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -2371,6 +2371,16 @@ int r5c_try_caching_write(struct r5conf *conf,
>  		set_bit(STRIPE_R5C_CACHING, &sh->state);
>  	}
>  
> +	/*
> +	 * When a raid6 array runs in degraded mode, handle_stripe_fill()
> +	 * is called on every write, so the write-back cache does not help
> +	 * performance. To keep the code simple, use write-through.
> +	 */
> +	if (conf->level == 6 && s->failed) {
> +		r5c_make_stripe_write_out(sh);
> +		return -EAGAIN;
> +	}

Instead of this ad-hoc handling, why don't we fully switch to
write-through mode after a disk is broken?
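
One way that could look (a sketch only; calc_degraded() exists in
raid5.c and r5c_journal_mode in raid5-cache.c, but the work item and
the hook point below are assumptions for illustration, not code in
the tree):

/* Hypothetical: force write-through for the whole array once it
 * degrades, instead of checking per stripe in r5c_try_caching_write().
 * Call this from the rdev error path.
 */
static void r5c_update_on_rdev_error(struct mddev *mddev)
{
	struct r5conf *conf = mddev->private;
	struct r5l_log *log = conf->log;

	if (!log)
		return;

	/* switching modes may have to flush cached stripes first, so
	 * defer the actual switch to a work item rather than doing it
	 * in the error path
	 */
	if (calc_degraded(conf) > 0 &&
	    log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK)
		schedule_work(&log->disable_writeback_work);
}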

Thanks,
Shaohua


* Re: [PATCH 2/2] md/r5cache: improve journal device efficiency
From: Shaohua Li @ 2017-01-21 18:54 UTC
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuan, liuyun01, Jes.Sorensen

On Wed, Jan 18, 2017 at 03:56:50PM -0800, Song Liu wrote:
> It is important to be able to flush all stripes in raid5-cache.
> Therefore, we need to reserve some space on the journal device for
> these flushes. If the flush operation includes pending writes to the
> stripe, we need to reserve (conf->raid_disks + 1) pages per stripe
> for the flush out, which reduces the efficiency of the journal
> space. If we exclude these pending writes from the flush operation,
> we only need (conf->max_degraded + 1) pages per stripe.
> 
> With this patch, when log space is critical (R5C_LOG_CRITICAL=1),
> pending writes are excluded from the stripe flush out. This reduces
> the reserved space needed for flushes and thus improves journal
> device efficiency.

So this improves space efficiency, but it harms performance. If we
don't delay 'towrite', we only need to flush the stripe once in
reclaim. With this, we need to flush the stripe twice, which creates
more IO.

> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 15 +++------------
>  drivers/md/raid5.c       | 42 ++++++++++++++++++++++++++++++++----------
>  2 files changed, 35 insertions(+), 22 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index b31ae41..c027f1b 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -387,17 +387,8 @@ void r5c_check_cached_full_stripe(struct r5conf *conf)
>  /*
>   * Total log space (in sectors) needed to flush all data in cache
>   *
> - * Currently, writing-out phase automatically includes all pending writes
> - * to the same sector. So the reclaim of each stripe takes up to
> - * (conf->raid_disks + 1) pages of log space.
> - *
> - * To totally avoid deadlock due to log space, the code reserves
> - * (conf->raid_disks + 1) pages for each stripe in cache, which is not
> - * necessary in most cases.
> - *
> - * To improve this, we will need writing-out phase to be able to NOT include
> - * pending writes, which will reduce the requirement to
> - * (conf->max_degraded + 1) pages per stripe in cache.
> + * To flush all stripes in cache, we need (conf->max_degraded + 1)
> + * pages per stripe in cache.
>   */
>  static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
>  {
> @@ -406,7 +397,7 @@ static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
>  	if (!r5c_is_writeback(log))
>  		return 0;
>  
> -	return BLOCK_SECTORS * (conf->raid_disks + 1) *
> +	return BLOCK_SECTORS * (conf->max_degraded + 1) *
>  		atomic_read(&log->stripe_in_journal_count);
>  }

Is this really safe? The check of R5C_LOG_CRITICAL isn't synchronized
with this one. If R5C_LOG_CRITICAL isn't set at the beginning but
gets set in the middle of handling a stripe, could we run out of
space?

I think we can probably calculate the reserved space as
'max_degraded + 1 + pending_write_disks'. We could maintain a count
of pending_write_disks, increasing it on new writes and decreasing it
after the data is written to the journal. I don't think we need to
worry about delaying towrite; it should happen rarely.
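
A minimal sketch of that accounting (pending_write_pages is an
assumed new atomic_t in struct r5l_log, bumped when a write is
attached to a cached stripe and dropped once its data reaches the
journal; the name and placement are illustrative):

static sector_t r5c_log_required_to_flush_cache(struct r5conf *conf)
{
	struct r5l_log *log = conf->log;

	if (!r5c_is_writeback(log))
		return 0;

	/* per the patch's reasoning: one metadata page plus
	 * max_degraded parity pages per cached stripe, plus one
	 * page per still-pending write
	 */
	return BLOCK_SECTORS *
		((sector_t)(conf->max_degraded + 1) *
		 atomic_read(&log->stripe_in_journal_count) +
		 atomic_read(&log->pending_write_pages));
}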

Thanks,
Shaohua

