From: Shaohua Li <shli@kernel.org>
To: Song Liu <songliubraving@fb.com>
Cc: linux-raid@vger.kernel.org, neilb@suse.com, shli@fb.com,
	kernel-team@fb.com, dan.j.williams@intel.com, hch@infradead.org,
	liuzhengyuan@kylinos.cn, liuyun01@kylinos.cn,
	Jes.Sorensen@redhat.com
Subject: Re: [PATCH v2 2/2] md/r5cache: enable chunk_aligned_read with write back cache
Date: Wed, 11 Jan 2017 10:02:11 -0800	[thread overview]
Message-ID: <20170111180211.2qfiav4lxd7inyrx@kernel.org> (raw)
In-Reply-To: <20170111014251.3236610-2-songliubraving@fb.com>

On Tue, Jan 10, 2017 at 05:42:51PM -0800, Song Liu wrote:
> Chunk aligned read significantly reduces CPU usage of raid456.
> However, it is not safe to fully bypass the write back cache.
> This patch enables chunk aligned read with write back cache.
> 
> For chunk aligned read, we track stripes in write back cache at
> a bigger granularity, "big_stripe". Each chunk may contain more
> than one stripe (for example, a 256kB chunk contains 64 4kB pages,
> so it contains 64 stripes). For chunk_aligned_read, these
> stripes are grouped into one big_stripe, so we only need one lookup
> for the whole chunk.
> 
> For each big_stripe, we count how many stripes of this big_stripe
> are in the write back cache. These counters are tracked in a radix
> tree (big_stripe_tree), using the radix_tree item pointer as the
> counter.
> r5c_tree_index() is used to calculate keys for the radix tree.
> 
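A minimal sketch of such a key calculation (assuming conf->chunk_sectors
holds the chunk size in sectors; the chunk index of a sector serves as
the radix tree key):

	static sector_t r5c_tree_index(struct r5conf *conf, sector_t sect)
	{
		/* the quotient of sect / chunk_sectors is the chunk index */
		sector_div(sect, conf->chunk_sectors);
		return sect;
	}
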
> chunk_aligned_read() calls r5c_big_stripe_cached() to look up
> big_stripe of each chunk in the tree. If this big_stripe is in the
> tree, chunk_aligned_read() aborts. This lookup is protected by
> rcu_read_lock().
> 
> It is necessary to remember whether a stripe is counted in
> big_stripe_tree. Instead of adding a new flag, we reuse existing flags:
> STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
> two flags is set, the stripe is counted in big_stripe_tree. This
> requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
> r5c_try_caching_write(); and moving clear_bit of
> STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
> r5c_finish_stripe_write_out().
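
As a hedged illustration of that rule (r5c_big_stripe_inc() is a
hypothetical helper, not part of this patch), the counter update could
be guarded like:

	/* a stripe is counted in big_stripe_tree iff either flag is set */
	if (!test_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state) &&
	    !test_bit(STRIPE_R5C_FULL_STRIPE, &sh->state)) {
		r5c_big_stripe_inc(conf, sh->sector);	/* hypothetical */
		set_bit(STRIPE_R5C_PARTIAL_STRIPE, &sh->state);
	}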
> 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 164 ++++++++++++++++++++++++++++++++++++++++++-----
>  drivers/md/raid5.c       |  19 ++++--
>  drivers/md/raid5.h       |   1 +
>  3 files changed, 160 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 3e3e5dc..2ff2510 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -20,6 +20,7 @@
>  #include <linux/crc32c.h>
>  #include <linux/random.h>
>  #include <linux/kthread.h>
> +#include <linux/types.h>
>  #include "md.h"
>  #include "raid5.h"
>  #include "bitmap.h"
> @@ -162,9 +163,59 @@ struct r5l_log {
>  
>  	/* to submit async io_units, to fulfill ordering of flush */
>  	struct work_struct deferred_io_work;
> +
> +	/* for chunk_aligned_read in writeback mode, details below */
> +	spinlock_t tree_lock;
> +	struct radix_tree_root big_stripe_tree;
>  };
>  
>  /*
> + * Enable chunk_aligned_read() with write back cache.
> + *
> + * Each chunk may contain more than one stripe (for example, a 256kB
> + * chunk contains 64 4kB pages, so it contains 64 stripes). For
> + * chunk_aligned_read, these stripes are grouped into one "big_stripe".
> + * For each big_stripe, we count how many stripes of this big_stripe
> + * are in the write back cache. These counters are tracked in a radix
> + * tree (big_stripe_tree), using the radix_tree item pointer as the counter.
> + * r5c_tree_index() is used to calculate keys for the radix tree.
> + *
> + * chunk_aligned_read() calls r5c_big_stripe_cached() to look up
> + * big_stripe of each chunk in the tree. If this big_stripe is in the
> + * tree, chunk_aligned_read() aborts. This lookup is protected by
> + * rcu_read_lock().
> + *
> + * It is necessary to remember whether a stripe is counted in
> + * big_stripe_tree. Instead of adding a new flag, we reuse existing flags:
> + * STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE. If either of these
> + * two flags is set, the stripe is counted in big_stripe_tree. This
> + * requires moving set_bit(STRIPE_R5C_PARTIAL_STRIPE) to
> + * r5c_try_caching_write(); and moving clear_bit of
> + * STRIPE_R5C_PARTIAL_STRIPE and STRIPE_R5C_FULL_STRIPE to
> + * r5c_finish_stripe_write_out().
> + */
> +
> +/*
> + * the radix tree requires the lowest 2 bits of the data pointer to be
> + * zero (2'b00), so we add 4 for each stripe
> + */
> +#define R5C_RADIX_COUNT_UNIT 4

I'd use a bit shift here. To increase/decrease the refcount, write
(refcount +/- 1) << 2. That's much more readable than
refcount +/- R5C_RADIX_COUNT_UNIT.
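
Something along these lines (sketch only; the helper names are made up):

	#define R5C_RADIX_COUNT_SHIFT 2

	/* keep the low 2 bits of the item pointer zero, as the radix tree
	 * requires; the refcount lives in the remaining bits */
	static inline void *r5c_count_to_item(unsigned long count)
	{
		return (void *)(count << R5C_RADIX_COUNT_SHIFT);
	}

	static inline unsigned long r5c_item_to_count(void *item)
	{
		return (unsigned long)item >> R5C_RADIX_COUNT_SHIFT;
	}

An increment then reads r5c_count_to_item(r5c_item_to_count(item) + 1),
which makes the intent obvious.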

> +/* check whether this big stripe is in write back cache. */
> +bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect)
> +{
> +	struct r5l_log *log = conf->log;
> +	sector_t tree_index;
> +	void **pslot;
> +
> +	if (!log)
> +		return false;
> +
> +	WARN_ON_ONCE(!rcu_read_lock_held());
> +	tree_index = r5c_tree_index(conf, sect);
> +	pslot = radix_tree_lookup_slot(&log->big_stripe_tree, tree_index);

The comment above radix_tree_lookup_slot explains:
 *	This function can be called under rcu_read_lock iff the slot is not
 *	modified by radix_tree_replace_slot, otherwise it must be called
 *	exclusive from other writers. 

That's not the case here, since other threads add/delete items concurrently.
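
One possible fix (sketch, untested): plain radix_tree_lookup() is
documented to be safe under rcu_read_lock(), so the body could become:

	/* NULL means this big_stripe is not in the write back cache */
	return radix_tree_lookup(&log->big_stripe_tree,
				 r5c_tree_index(conf, sect)) != NULL;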

> +	return pslot != NULL;
> +}
> +
>  static int r5l_load_log(struct r5l_log *log)
>  {
>  	struct md_rdev *rdev = log->rdev;
> @@ -2641,6 +2768,9 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
>  	if (!log->meta_pool)
>  		goto out_mempool;
>  
> +	spin_lock_init(&log->tree_lock);
> +	INIT_RADIX_TREE(&log->big_stripe_tree, GFP_ATOMIC);

Since the allocation can fail safely, this should be GFP_NOWAIT | __GFP_NOWARN.
GFP_ATOMIC can use reserved memory, which is unnecessary here.
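
I.e. something like (sketch):

	/* node allocations may fail under memory pressure; callers of
	 * radix_tree_insert() must then tolerate -ENOMEM */
	INIT_RADIX_TREE(&log->big_stripe_tree, GFP_NOWAIT | __GFP_NOWARN);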

Thanks,
Shaohua
