* [PATCH v2 1/2] EXPORT_SYMBOL radix_tree_lookup_slot
@ 2017-01-11  1:42 Song Liu
  2017-01-11  1:42 ` [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache Song Liu
  2017-01-11 17:54 ` [PATCH v2 1/2] EXPORT_SYMBOL radix_tree_lookup_slot Shaohua Li
  0 siblings, 2 replies; 6+ messages in thread
From: Song Liu @ 2017-01-11  1:42 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen

It will be used in drivers/md/raid5-cache.c

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 lib/radix-tree.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 6f382e0..1ee7449 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -1099,6 +1099,7 @@ void radix_tree_replace_slot(struct radix_tree_root *root,
 {
 	replace_slot(root, NULL, slot, item, true);
 }
+EXPORT_SYMBOL(radix_tree_replace_slot);
 
 /**
  * radix_tree_iter_replace - replace item in a slot
-- 
2.9.3



* [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache
  2017-01-11  1:42 [PATCH v2 1/2] EXPORT_SYMBOL radix_tree_lookup_slot Song Liu
@ 2017-01-11  1:42 ` Song Liu
  2017-01-11  4:10   ` NeilBrown
  2017-01-11 18:02   ` Shaohua Li
  2017-01-11 17:54 ` [PATCH v2 1/2] EXPORT_SYMBOL radix_tree_lookup_slot Shaohua Li
  1 sibling, 2 replies; 6+ messages in thread
From: Song Liu @ 2017-01-11  1:42 UTC (permalink / raw)
  To: linux-raid
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen

Chunk aligned read significantly reduces CPU usage of raid456.
However, it is not safe to fully bypass the write back cache.
This patch enables chunk aligned read with write back cache.

For chunk aligned read, we track stripes in the write back cache at
a bigger granularity, "big_stripe". Each chunk may contain more
than one stripe (for example, a 256kB chunk contains 64 4kB pages,
so it contains 64 stripes). For chunk_aligned_read, these
stripes are grouped into one big_stripe, so we only need one lookup
for the whole chunk.
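
For example, a sketch of the key arithmetic (values illustrative;
r5c_tree_index() in the patch computes the same quotient with
sector_div()):

	sector_t sect = 1234;		/* e.g. align_bi->bi_iter.bi_sector */
	int chunk_sectors = 512;	/* 256kB chunk */
	sector_div(sect, chunk_sectors);	/* quotient is left in sect */
	/* sect == 2: sectors 1024..1535 share big_stripe key 2 */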

For each big_stripe, we count how many of its stripes are in the
write back cache. These counters are tracked in a radix tree
(big_stripe_tree), with the radix_tree item pointer used as the
counter. r5c_tree_index() is used to calculate keys for the radix
tree.
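
The counter lives in the radix_tree item pointer itself; because it
only ever holds multiples of 4, the low two bits stay clear, as the
radix tree requires. A minimal sketch of the encoding:

	void *item = (void *)(uintptr_t)R5C_RADIX_COUNT_UNIT;	  /* 1 stripe  */
	item = (void *)((uintptr_t)item + R5C_RADIX_COUNT_UNIT); /* 2 stripes */
	unsigned long nr = (uintptr_t)item / R5C_RADIX_COUNT_UNIT; /* == 2 */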

chunk_aligned_read() calls r5c_big_stripe_cached() to look up the
big_stripe of each chunk in the tree. If the big_stripe is in the
tree, chunk_aligned_read() aborts. This lookup is protected by
rcu_read_lock().

It is necessary to remember whether a stripe is counted in
big_stripe_tree. Instead of adding a new flag, we reuse existing
flags: STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either
of these two flags is set, the stripe is counted in big_stripe_tree.
This requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
r5c_try_caching_write(), and moving the clear_bit calls for
STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
r5c_finish_stripe_write_out().

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 164 ++++++++++++++++++++++++++++++++++++++++++-----
 drivers/md/raid5.c       |  19 ++++--
 drivers/md/raid5.h       |   1 +
 3 files changed, 160 insertions(+), 24 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 3e3e5dc..2ff2510 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -20,6 +20,7 @@
 #include <linux/crc32c.h>
 #include <linux/random.h>
 #include <linux/kthread.h>
+#include <linux/types.h>
 #include "md.h"
 #include "raid5.h"
 #include "bitmap.h"
@@ -162,9 +163,59 @@ struct r5l_log {
 
 	/* to submit async io_units, to fulfill ordering of flush */
 	struct work_struct deferred_io_work;
+
+	/* for chunk_aligned_read in writeback mode, details below */
+	spinlock_t tree_lock;
+	struct radix_tree_root big_stripe_tree;
 };
 
 /*
+ * Enable chunk_aligned_read() with write back cache.
+ *
+ * Each chunk may contain more than one stripe (for example, a 256kB
+ * chunk contains 64 4kB pages, so it contains 64 stripes). For
+ * chunk_aligned_read, these stripes are grouped into one "big_stripe".
+ * For each big_stripe, we count how many of its stripes are in the
+ * write back cache. These counts are tracked in a radix tree
+ * (big_stripe_tree). We use the radix_tree item pointer as the counter.
+ * r5c_tree_index() is used to calculate keys for the radix tree.
+ *
+ * chunk_aligned_read() calls r5c_big_stripe_cached() to look up the
+ * big_stripe of each chunk in the tree. If the big_stripe is in the
+ * tree, chunk_aligned_read() aborts. This lookup is protected by
+ * rcu_read_lock().
+ *
+ * It is necessary to remember whether a stripe is counted in
+ * big_stripe_tree. Instead of adding a new flag, we reuse existing
+ * flags: STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If
+ * either of these two flags is set, the stripe is counted in
+ * big_stripe_tree. This requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE)
+ * to r5c_try_caching_write(), and moving the clear_bit calls for
+ * STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
+ * r5c_finish_stripe_write_out().
+ */
+
+/*
+ * the radix tree requires the lowest 2 bits of the data pointer to be
+ * 2'b00, so we add 4 for each stripe
+ */
+#define R5C_RADIX_COUNT_UNIT 4
+
+/*
+ * calculate key for big_stripe_tree
+ *
+ * sect: align_bi->bi_iter.bi_sector or sh->sector
+ */
+static inline sector_t r5c_tree_index(struct r5conf *conf,
+				      sector_t sect)
+{
+	sector_t offset;
+
+	offset = sector_div(sect, conf->chunk_sectors);
+	return sect;
+}
+
+/*
  * an IO range starts from a meta data block and end at the next meta data
  * block. The io unit's the meta data block tracks data/parity followed it. io
  * unit is written to log disk with normal write, as we always flush log disk
@@ -410,16 +461,6 @@ void r5c_make_stripe_write_out(struct stripe_head *sh)
 
 	if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 		atomic_inc(&conf->preread_active_stripes);
-
-	if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
-		BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
-		atomic_dec(&conf->r5c_cached_partial_stripes);
-	}
-
-	if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
-		BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
-		atomic_dec(&conf->r5c_cached_full_stripes);
-	}
 }
 
 static void r5c_handle_data_cached(struct stripe_head *sh)
@@ -2303,6 +2344,10 @@ int r5c_try_caching_write(struct r5conf *conf,
 	int i;
 	struct r5dev *dev;
 	int to_cache = 0;
+	void **pslot;
+	sector_t tree_index;
+	int ret;
+	uintptr_t refcount;
 
 	BUG_ON(!r5c_is_writeback(log));
 
@@ -2337,6 +2382,40 @@ int r5c_try_caching_write(struct r5conf *conf,
 		}
 	}
 
+	/* if the stripe is not counted in big_stripe_tree, add it now */
+	if (!test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state) &&
+	    !test_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+		tree_index = r5c_tree_index(conf, sh->sector);
+		spin_lock(&log->tree_lock);
+		pslot = radix_tree_lookup_slot(&log->big_stripe_tree,
+					       tree_index);
+		if (pslot) {
+			refcount = (uintptr_t)radix_tree_deref_slot(pslot);
+			radix_tree_replace_slot(
+				&log->big_stripe_tree, pslot,
+				(void *)(refcount + R5C_RADIX_COUNT_UNIT));
+		} else {
+			/* this radix_tree_insert can fail safely, so no
+			 * need to call radix_tree_preload()
+			 */
+			ret = radix_tree_insert(
+				&log->big_stripe_tree, tree_index,
+				(void *)R5C_RADIX_COUNT_UNIT);
+			if (ret) {
+				spin_unlock(&log->tree_lock);
+				r5c_make_stripe_write_out(sh);
+				return -EAGAIN;
+			}
+		}
+		spin_unlock(&log->tree_lock);
+
+		/* set STRIPE_R5C_PARTIAL_STRIPE, this shows the stripe is
+		 * counted in the radix tree
+		 */
+		set_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state);
+		atomic_inc(&conf->r5c_cached_partial_stripes);
+	}
+
 	for (i = disks; i--; ) {
 		dev = &sh->dev[i];
 		if (dev->towrite) {
@@ -2411,17 +2490,20 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 				 struct stripe_head *sh,
 				 struct stripe_head_state *s)
 {
+	struct r5l_log *log = conf->log;
 	int i;
 	int do_wakeup = 0;
+	sector_t tree_index;
+	void **pslot;
+	uintptr_t refcount;
 
-	if (!conf->log ||
-	    !test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags))
+	if (!log || !test_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags))
 		return;
 
 	WARN_ON(test_bit(STRIPE_R5C_CACHING, &sh->state));
 	clear_bit(R5_InJournal, &sh->dev[sh->pd_idx].flags);
 
-	if (conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
+	if (log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_THROUGH)
 		return;
 
 	for (i = sh->disks; i--; ) {
@@ -2443,12 +2525,41 @@ void r5c_finish_stripe_write_out(struct r5conf *conf,
 	if (do_wakeup)
 		wake_up(&conf->wait_for_overlap);
 
-	spin_lock_irq(&conf->log->stripe_in_journal_lock);
+	spin_lock_irq(&log->stripe_in_journal_lock);
 	list_del_init(&sh->r5c);
-	spin_unlock_irq(&conf->log->stripe_in_journal_lock);
+	spin_unlock_irq(&log->stripe_in_journal_lock);
 	sh->log_start = MaxSector;
-	atomic_dec(&conf->log->stripe_in_journal_count);
-	r5c_update_log_state(conf->log);
+
+	atomic_dec(&log->stripe_in_journal_count);
+	r5c_update_log_state(log);
+
+	/* stop counting this stripe in big_stripe_tree */
+	if (test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state) ||
+	    test_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+		tree_index = r5c_tree_index(conf, sh->sector);
+		spin_lock(&log->tree_lock);
+		pslot = radix_tree_lookup_slot(&log->big_stripe_tree,
+					       tree_index);
+		BUG_ON(pslot == NULL);
+		refcount = (uintptr_t)radix_tree_deref_slot(pslot);
+		if (refcount == R5C_RADIX_COUNT_UNIT)
+			radix_tree_delete(&log->big_stripe_tree, tree_index);
+		else
+			radix_tree_replace_slot(
+				&log->big_stripe_tree, pslot,
+				(void *)(refcount - R5C_RADIX_COUNT_UNIT));
+		spin_unlock(&log->tree_lock);
+	}
+
+	if (test_and_clear_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state)) {
+		BUG_ON(atomic_read(&conf->r5c_cached_partial_stripes) == 0);
+		atomic_dec(&conf->r5c_cached_partial_stripes);
+	}
+
+	if (test_and_clear_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
+		BUG_ON(atomic_read(&conf->r5c_cached_full_stripes) == 0);
+		atomic_dec(&conf->r5c_cached_full_stripes);
+	}
 }
 
 int
@@ -2508,6 +2619,22 @@ r5c_cache_data(struct r5l_log *log, struct stripe_head *sh,
 	return 0;
 }
 
+/* check whether this big stripe is in write back cache. */
+bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect)
+{
+	struct r5l_log *log = conf->log;
+	sector_t tree_index;
+	void **pslot;
+
+	if (!log)
+		return false;
+
+	WARN_ON_ONCE(!rcu_read_lock_held());
+	tree_index = r5c_tree_index(conf, sect);
+	pslot = radix_tree_lookup_slot(&log->big_stripe_tree, tree_index);
+	return pslot != NULL;
+}
+
 static int r5l_load_log(struct r5l_log *log)
 {
 	struct md_rdev *rdev = log->rdev;
@@ -2641,6 +2768,9 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 	if (!log->meta_pool)
 		goto out_mempool;
 
+	spin_lock_init(&log->tree_lock);
+	INIT_RADIX_TREE(&log->big_stripe_tree, GFP_ATOMIC);
+
 	log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 						 log->rdev->mddev, "reclaim");
 	if (!log->reclaim_thread)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5a42f4b..4394ebc 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -287,13 +287,12 @@ static void do_release_stripe(struct r5conf *conf, struct stripe_head *sh,
 						atomic_dec(&conf->r5c_cached_partial_stripes);
 					list_add_tail(&sh->lru, &conf->r5c_full_stripe_list);
 					r5c_check_cached_full_stripe(conf);
-				} else {
-					/* partial stripe */
-					if (!test_and_set_bit(STRIPE_R5C_PARTIAL_STRIPE,
-							      &sh->state))
-						atomic_inc(&conf->r5c_cached_partial_stripes);
+				} else
+					/* STRIPE_R5C_PARTIAL_STRIPE is set in
+					 * r5c_try_caching_write(). No need to
+					 * set it again.
+					 */
 					list_add_tail(&sh->lru, &conf->r5c_partial_stripe_list);
-				}
 			}
 		}
 	}
@@ -5059,6 +5058,13 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 		      rdev->recovery_offset >= end_sector)))
 			rdev = NULL;
 	}
+
+	if (r5c_big_stripe_cached(conf, align_bi->bi_iter.bi_sector)) {
+		rcu_read_unlock();
+		bio_put(align_bi);
+		return 0;
+	}
+
 	if (rdev) {
 		sector_t first_bad;
 		int bad_sectors;
@@ -5415,7 +5421,6 @@ static void raid5_make_request(struct mddev *mddev, struct bio * bi)
 	 * data on failed drives.
 	 */
 	if (rw == READ && mddev->degraded == 0 &&
-	    !r5c_is_writeback(conf->log) &&
 	    mddev->reshape_position == MaxSector) {
 		bi = chunk_aligned_read(mddev, bi);
 		if (!bi)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 876c75f..06b240e 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -795,4 +795,5 @@ extern void r5c_flush_cache(struct r5conf *conf, int num);
 extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 extern struct md_sysfs_entry r5c_journal_mode;
+extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
 #endif
-- 
2.9.3



* Re: [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache
  2017-01-11  1:42 ` [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache Song Liu
@ 2017-01-11  4:10   ` NeilBrown
  2017-01-11 18:02   ` Shaohua Li
  1 sibling, 0 replies; 6+ messages in thread
From: NeilBrown @ 2017-01-11  4:10 UTC (permalink / raw)
  To: linux-raid
  Cc: shli, kernel-team, dan.j.williams, hch, liuzhengyuan, liuyun01,
	Song Liu, Jes.Sorensen

On Wed, Jan 11 2017, Song Liu wrote:

> Chunk aligned read significantly reduces CPU usage of raid456.
> However, it is not safe to fully bypass the write back cache.
> This patch enables chunk aligned read with write back cache.
>
> [...]
>
> Signed-off-by: Song Liu <songliubraving@fb.com>

Thanks, this looks quite good.

One thing I wonder about is reshape.  If the chunk size is being
reshaped, that would confuse things.
But maybe reshape isn't supported when the journal is in use, in
which case it wouldn't matter.

Reviewed-by: NeilBrown <neilb@suse.com>

Thanks,
NeilBrown


* Re: [PATCH v2 1/2] EXPORT_SYMBOL radix_tree_lookup_slot
  2017-01-11  1:42 [PATCH v2 1/2] EXPORT_SYMBOL radix_tree_lookup_slot Song Liu
  2017-01-11  1:42 ` [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache Song Liu
@ 2017-01-11 17:54 ` Shaohua Li
  1 sibling, 0 replies; 6+ messages in thread
From: Shaohua Li @ 2017-01-11 17:54 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuan, liuyun01, Jes.Sorensen

On Tue, Jan 10, 2017 at 05:42:50PM -0800, Song Liu wrote:
> It will be used in drivers/md/raid5-cache.c

title and patch don't match: the subject says radix_tree_lookup_slot,
but the patch exports radix_tree_replace_slot
 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
> [...]
> +EXPORT_SYMBOL(radix_tree_replace_slot);


* Re: [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache
  2017-01-11  1:42 ` [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache Song Liu
  2017-01-11  4:10   ` NeilBrown
@ 2017-01-11 18:02   ` Shaohua Li
  1 sibling, 0 replies; 6+ messages in thread
From: Shaohua Li @ 2017-01-11 18:02 UTC (permalink / raw)
  To: Song Liu
  Cc: linux-raid, neilb, shli, kernel-team, dan.j.williams, hch,
	liuzhengyuan, liuyun01, Jes.Sorensen

On Tue, Jan 10, 2017 at 05:42:51PM -0800, Song Liu wrote:
> [...]
>
> +/*
> + * the radix tree requires the lowest 2 bits of the data pointer to be
> + * 2'b00, so we add 4 for each stripe
> + */
> +#define R5C_RADIX_COUNT_UNIT 4

I'd use bit shift here. To increase/decrease refcount, write (refcount +/- 1)
<< 2. It's much more readable than refcount +/- R5C_RADIX_COUNT_UNIT.
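
A sketch of that encoding (the shift macro name is illustrative):

	#define R5C_RADIX_COUNT_SHIFT 2

	refcount = (uintptr_t)radix_tree_deref_slot(pslot)
			>> R5C_RADIX_COUNT_SHIFT;
	radix_tree_replace_slot(&log->big_stripe_tree, pslot,
			(void *)((refcount + 1) << R5C_RADIX_COUNT_SHIFT));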

> +/* check whether this big stripe is in write back cache. */
> +bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect)
> +{
> +	struct r5l_log *log = conf->log;
> +	sector_t tree_index;
> +	void **pslot;
> +
> +	if (!log)
> +		return false;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +	tree_index = r5c_tree_index(conf, sect);
> +	pslot = radix_tree_lookup_slot(&log->big_stripe_tree, tree_index);

The comment above radix_tree_lookup_slot explains:
 *	This function can be called under rcu_read_lock iff the slot is not
 *	modified by radix_tree_replace_slot, otherwise it must be called
 *	exclusive from other writers. 

That's not the case here, since other threads add/delete items.
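
radix_tree_lookup() is documented to be safe under rcu_read_lock() even
with concurrent inserts and deletes, so one option is (a sketch):

	return radix_tree_lookup(&log->big_stripe_tree, tree_index) != NULL;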

> +	return pslot != NULL;
> +}
> +
>  static int r5l_load_log(struct r5l_log *log)
>  {
>  	struct md_rdev *rdev = log->rdev;
> @@ -2641,6 +2768,9 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  	if (!log->meta_pool)
>  		goto out_mempool;
>  
> +	spin_lock_init(&log->tree_lock);
> +	INIT_RADIX_TREE(&log->big_stripe_tree, GFP_ATOMIC);

Since the allocation can fail safely, this should be GFP_NOWAIT | __GFP_NOWARN.
GFP_ATOMIC can use reserved memory, which is unnecessary here.
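
That is (a sketch):

	INIT_RADIX_TREE(&log->big_stripe_tree, GFP_NOWAIT | __GFP_NOWARN);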

Thanks,
Shaohua


* Re: [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache
@ 2017-01-11 20:47 Julia Lawall
  0 siblings, 0 replies; 6+ messages in thread
From: Julia Lawall @ 2017-01-11 20:47 UTC (permalink / raw)
  Cc: neilb, shli, kernel-team, dan.j.williams, hch, liuzhengyuan,
	liuyun01, Song Liu, Jes.Sorensen, linux-raid, kbuild-all

Please check on the type of reclaimable.  With respect to the test on
line 1437, it should not be unsigned.
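
A minimal illustration of the warning (assuming reclaimable is a
sector_t, as in r5l_do_reclaim()):

	sector_t reclaimable = r5l_reclaimable_space(log);
	BUG_ON(reclaimable < 0);	/* sector_t is unsigned: always false */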

julia

---------- Forwarded message ----------
Date: Thu, 12 Jan 2017 02:18:47 +0800
From: kbuild test robot <fengguang.wu@intel.com>
To: kbuild@01.org
Cc: Julia Lawall <julia.lawall@lip6.fr>
Subject: Re: [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write
    back cache

In-Reply-To: <20170111014251.3236610-2-songliubraving@fb.com>

Hi Song,

[auto build test WARNING on next-20170111]
[cannot apply to linus/master linux/master v4.9-rc8 v4.9-rc7 v4.9-rc6 v4.10-rc3]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Song-Liu/EXPORT_SYMBOL-radix_tree_lookup_slot/20170112-000003
:::::: branch date: 2 hours ago
:::::: commit date: 2 hours ago

>> drivers/md/raid5-cache.c:1437:8-19: WARNING: Unsigned expression compared with zero: reclaimable < 0

git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout 325354ecd953e312b0109229d654ef52db1f35a7
vim +1437 drivers/md/raid5-cache.c

17036461 Christoph Hellwig 2015-10-05  1421  		if (reclaimable >= reclaim_target ||
0576b1c6 Shaohua Li        2015-08-13  1422  		    (list_empty(&log->running_ios) &&
0576b1c6 Shaohua Li        2015-08-13  1423  		     list_empty(&log->io_end_ios) &&
a8c34f91 Shaohua Li        2015-09-02  1424  		     list_empty(&log->flushing_ios) &&
04732f74 Christoph Hellwig 2015-10-05  1425  		     list_empty(&log->finished_ios)))
0576b1c6 Shaohua Li        2015-08-13  1426  			break;
0576b1c6 Shaohua Li        2015-08-13  1427
17036461 Christoph Hellwig 2015-10-05  1428  		md_wakeup_thread(log->rdev->mddev->thread);
17036461 Christoph Hellwig 2015-10-05  1429  		wait_event_lock_irq(log->iounit_wait,
17036461 Christoph Hellwig 2015-10-05  1430  				    r5l_reclaimable_space(log) > reclaimable,
17036461 Christoph Hellwig 2015-10-05  1431  				    log->io_list_lock);
0576b1c6 Shaohua Li        2015-08-13  1432  	}
17036461 Christoph Hellwig 2015-10-05  1433
a39f7afd Song Liu          2016-11-17  1434  	next_checkpoint = r5c_calculate_new_cp(conf);
0576b1c6 Shaohua Li        2015-08-13  1435  	spin_unlock_irq(&log->io_list_lock);
0576b1c6 Shaohua Li        2015-08-13  1436
17036461 Christoph Hellwig 2015-10-05 @1437  	BUG_ON(reclaimable < 0);
a39f7afd Song Liu          2016-11-17  1438
a39f7afd Song Liu          2016-11-17  1439  	if (reclaimable == 0 || !write_super)
0576b1c6 Shaohua Li        2015-08-13  1440  		return;
0576b1c6 Shaohua Li        2015-08-13  1441
0576b1c6 Shaohua Li        2015-08-13  1442  	/*
0576b1c6 Shaohua Li        2015-08-13  1443  	 * write_super will flush cache of each raid disk. We must write super
0576b1c6 Shaohua Li        2015-08-13  1444  	 * here, because the log area might be reused soon and we don't want to
0576b1c6 Shaohua Li        2015-08-13  1445  	 * confuse recovery

:::::: The code at line 1437 was first introduced by commit
:::::: 170364619ac21c2b14869571eeaf767ae825f96c raid5-cache: free I/O units earlier

:::::: TO: Christoph Hellwig <hch@lst.de>
:::::: CC: NeilBrown <neilb@suse.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

