* [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
@ 2016-12-27 15:47 Coly Li
  2016-12-27 15:47 ` [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Coly Li @ 2016-12-27 15:47 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")
introduces a sliding resync window for the raid1 I/O barrier. This idea
limits I/O barriers to happen only inside the sliding resync window;
regular I/Os outside this resync window do not need to wait for the
barrier any more. On a large raid1 device, it helps a lot to improve
parallel write throughput when background resync I/O is running at the
same time.

The idea of the sliding resync window is awesome, but it brings several
challenges that are very difficult to solve,
 - code complexity
   The sliding resync window requires several variables to work
   collectively; this is complex and very hard to get correct. Just grep
   "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches
   fixing the original resync window patch, and this is not the end, any
   further related modification may easily introduce more regressions.
 - multiple sliding resync windows
   Currently the raid1 code has only a single sliding resync window, so we
   cannot do parallel resync with the current I/O barrier implementation.
   Implementing multiple resync windows is much more complex, and very
   hard to get correct.

Therefore I decided to implement a much simpler raid1 I/O barrier; by
removing the resync window code, I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a global unique resync window.
 - Use multiple hash buckets to reduce I/O barrier conflicts; a regular
   I/O only has to wait for a resync I/O when both of them have the same
   barrier bucket index, and vice versa.
 - I/O barrier conflicts can be reduced to an acceptable number if there
   are enough barrier buckets.

Here I explain how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, each of size BARRIER_UNIT_SECTOR_SIZE.
   A bio request won't cross the border of a barrier unit, which means the
   maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
        #define BARRIER_BUCKETS_NR_BITS   9
        #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
   If multiple I/O requests hit different barrier units, each of them only
   needs to compete for the I/O barrier with other I/Os that hit the same
   barrier bucket index. The index of the barrier bucket which a bio
   should look for is calculated by,
        int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)
   where sector_nr is the start sector number of a bio. The function
   align_to_barrier_unit_end() calculates the number of sectors from
   sector_nr to the next barrier unit boundary; if the requesting bio
   goes across the boundary, we split the bio in raid1_make_request() to
   make sure the final bio sent into generic_make_request() won't exceed
   the barrier unit boundary.
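
As an illustration only (this is not part of the patch, and hash_long()
below is a simplified stand-in for the kernel helper in <linux/hash.h>;
the boundary arithmetic follows the description above rather than the
exact patch hunk), a minimal standalone sketch of how a bio's start
sector maps to a bucket index, and how many sectors remain before the
next barrier unit boundary, could look like this,

        #include <stdio.h>

        #define BARRIER_UNIT_SECTOR_BITS  17
        #define BARRIER_UNIT_SECTOR_SIZE  (1ULL << BARRIER_UNIT_SECTOR_BITS) /* 64MB in 512B sectors */
        #define BARRIER_BUCKETS_NR_BITS   9
        #define BARRIER_BUCKETS_NR        (1UL << BARRIER_BUCKETS_NR_BITS)

        typedef unsigned long long sector_t;

        /* simplified stand-in for the kernel's hash_long() */
        static unsigned int hash_long(sector_t val, unsigned int bits)
        {
                return (unsigned int)((val * 0x61C8864680B583EBULL) >> (64 - bits));
        }

        /* sectors from start_sector to the end of its barrier unit,
         * capped at sectors
         */
        static sector_t sectors_to_unit_end(sector_t start_sector, sector_t sectors)
        {
                sector_t len = BARRIER_UNIT_SECTOR_SIZE -
                               (start_sector & (BARRIER_UNIT_SECTOR_SIZE - 1));

                return len < sectors ? len : sectors;
        }

        int main(void)
        {
                sector_t start = 130000, nr = 4096;   /* an arbitrary example bio */
                unsigned int idx = hash_long(start, BARRIER_BUCKETS_NR_BITS);

                printf("bucket idx=%u, sectors before split=%llu\n",
                       idx, sectors_to_unit_end(start, nr));
                return 0;
        }

Running this little program for the example bio shows which bucket the
bio competes in and how many sectors raid1_make_request() would keep
before splitting at the barrier unit boundary.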

Compared to the single sliding resync window,
 - Currently resync I/O grows linearly, therefore regular and resync I/O
   will conflict within a single barrier unit. So this is similar to the
   single sliding resync window.
 - But a barrier bucket is shared by all barrier units with an identical
   barrier bucket index, so the probability of conflict might be higher
   than with the single sliding resync window, in the case that write I/Os
   always hit barrier units which have the same barrier bucket index as
   the resync I/Os. This is a very rare condition in real I/O workloads;
   I cannot imagine how it could happen in practice.
 - Therefore we can achieve a good enough low conflict rate with a much
   simpler barrier algorithm and implementation.

If a user has a (really) large raid1 device, for example 10PB in size, we
may just increase the bucket number BARRIER_BUCKETS_NR. For now this is a
macro; it could become a variable defined at raid1 creation time in the
future.

There are a few changes that should be noticed,
 - In raid1d(), I change the code so that conf->nr_queued[idx] is
   decreased inside a single loop, it looks like this,
        spin_lock_irqsave(&conf->device_lock, flags);
        conf->nr_queued[idx]--;
        spin_unlock_irqrestore(&conf->device_lock, flags);
   This change generates more spin lock operations, but in the next patch
   of this patch set it will be replaced by a single line of code,
        atomic_dec(&conf->nr_queued[idx]);
   So we don't need to worry about the spin lock cost here.
 - The original function raid1_make_request() is split into two functions,
   - raid1_make_read_request(): handles regular read requests and calls
     wait_read_barrier() for the I/O barrier.
   - raid1_make_write_request(): handles regular write requests and calls
     wait_barrier() for the I/O barrier.
   The difference is that wait_read_barrier() only waits if the array is
   frozen; using different barrier functions in different code paths makes
   the code cleaner and easier to read.
 - align_to_barrier_unit_end() is called to make sure both regular and
   resync I/O won't go across a barrier unit boundary.

Changelog
V1:
- Original RFC patch for comments
V2:
- Use bio_split() to split the original bio if it goes across a barrier
  unit boundary, to make the code simpler, as suggested by Shaohua and
  Neil.
- Use hash_long() to replace the original linear hash, to avoid a possible
  conflict between resync I/O and sequential write I/O, as suggested by
  Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to
  control the number of parallel sync I/O barriers, as suggested by
  Shaohua.
- In the V1 patch the barrier bucket related members of r1conf shown below
  were allocated in a memory page. To make the code simpler, the V2 patch
  moves them into struct r1conf directly, like this,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     nr_pending[BARRIER_BUCKETS_NR];
        +       int                     nr_waiting[BARRIER_BUCKETS_NR];
        +       int                     nr_queued[BARRIER_BUCKETS_NR];
        +       int                     barrier[BARRIER_BUCKETS_NR];
  This change is as suggested by Shaohua.
- Remove some irrelevant code comments, as suggested by Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
  raid1_make_write_request().

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 485 ++++++++++++++++++++++++++++++-----------------------
 drivers/md/raid1.h |  37 ++--
 2 files changed, 291 insertions(+), 231 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index a1f3fbe..5813656 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -67,9 +67,8 @@
  */
 static int max_queued_requests = 1024;
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector);
-static void lower_barrier(struct r1conf *conf);
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
 
 #define raid1_log(md, fmt, args...)				\
 	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
@@ -96,7 +95,6 @@ static void r1bio_pool_free(void *r1_bio, void *data)
 #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
 #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
-#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -211,7 +209,7 @@ static void put_buf(struct r1bio *r1_bio)
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
-	lower_barrier(conf);
+	lower_barrier(conf, r1_bio->sector);
 }
 
 static void reschedule_retry(struct r1bio *r1_bio)
@@ -219,10 +217,12 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	unsigned long flags;
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
+	int idx;
 
+	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued ++;
+	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -239,8 +239,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	struct bio *bio = r1_bio->master_bio;
 	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
-	sector_t start_next_window = r1_bio->start_next_window;
-	sector_t bi_sector = bio->bi_iter.bi_sector;
 
 	if (bio->bi_phys_segments) {
 		unsigned long flags;
@@ -265,7 +263,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
 		 * Wake up any possible resync thread that waits for the device
 		 * to go idle.
 		 */
-		allow_barrier(conf, start_next_window, bi_sector);
+		allow_barrier(conf, bio->bi_iter.bi_sector);
 	}
 }
 
@@ -513,6 +511,25 @@ static void raid1_end_write_request(struct bio *bio)
 		bio_put(to_put);
 }
 
+static sector_t align_to_barrier_unit_end(sector_t start_sector,
+					  sector_t sectors)
+{
+	sector_t len;
+
+	WARN_ON(sectors == 0);
+	/* len is the number of sectors from start_sector to end of the
+	 * barrier unit which start_sector belongs to.
+	 */
+	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
+	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
+	      start_sector;
+
+	if (len > sectors)
+		len = sectors;
+
+	return len;
+}
+
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -809,168 +826,179 @@ static void flush_pending_writes(struct r1conf *conf)
  */
 static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 {
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
+
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier++;
-	conf->next_resync = sector_nr;
+	conf->barrier[idx]++;
+	conf->total_barriers++;
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
-	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
-	 *    the max count which allowed.
-	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
-	 *    next resync will reach to the window which normal bios are
-	 *    handling.
-	 * D: while there are any active requests in the current window.
+	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
+	 *    existing in sector number ranges corresponding to idx.
+	 * C: while conf->total_barriers >= RESYNC_DEPTH, meaning resync reach
+	 *    the max count which allowed on the whole raid1 device.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->array_frozen &&
-			    conf->barrier < RESYNC_DEPTH &&
-			    conf->current_window_requests == 0 &&
-			    (conf->start_next_window >=
-			     conf->next_resync + RESYNC_SECTORS),
+			     !conf->nr_pending[idx] &&
+			     conf->total_barriers < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void lower_barrier(struct r1conf *conf)
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	unsigned long flags;
-	BUG_ON(conf->barrier <= 0);
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
+
+	BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0);
+
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
-	conf->nr_pending--;
+	conf->barrier[idx]--;
+	conf->total_barriers--;
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
-static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	bool wait = false;
-
-	if (conf->array_frozen || !bio)
-		wait = true;
-	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
-		if ((conf->mddev->curr_resync_completed
-		     >= bio_end_sector(bio)) ||
-		    (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-		     <= bio->bi_iter.bi_sector))
-			wait = false;
-		else
-			wait = true;
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->array_frozen || conf->barrier[idx]) {
+		conf->nr_waiting[idx]++;
+		/* Wait for the barrier to drop. */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen && !conf->barrier[idx],
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	return wait;
+	conf->nr_pending[idx]++;
+	spin_unlock_irq(&conf->resync_lock);
 }
 
-static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	sector_t sector = 0;
+	long idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
 
 	spin_lock_irq(&conf->resync_lock);
-	if (need_to_wait_for_sync(conf, bio)) {
-		conf->nr_waiting++;
-		/* Wait for the barrier to drop.
-		 * However if there are already pending
-		 * requests (preventing the barrier from
-		 * rising completely), and the
-		 * per-process bio queue isn't empty,
-		 * then don't wait, as we need to empty
-		 * that queue to allow conf->start_next_window
-		 * to increase.
-		 */
-		raid1_log(conf->mddev, "wait barrier");
-		wait_event_lock_irq(conf->wait_barrier,
-				    !conf->array_frozen &&
-				    (!conf->barrier ||
-				     ((conf->start_next_window <
-				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
-		conf->nr_waiting--;
-	}
-
-	if (bio && bio_data_dir(bio) == WRITE) {
-		if (bio->bi_iter.bi_sector >= conf->next_resync) {
-			if (conf->start_next_window == MaxSector)
-				conf->start_next_window =
-					conf->next_resync +
-					NEXT_NORMALIO_DISTANCE;
-
-			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
-			    <= bio->bi_iter.bi_sector)
-				conf->next_window_requests++;
-			else
-				conf->current_window_requests++;
-			sector = conf->start_next_window;
-		}
+	if (conf->array_frozen) {
+		conf->nr_waiting[idx]++;
+		/* Wait for array to unfreeze */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen,
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
-	return sector;
 }
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector)
+static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
+
+	_wait_barrier(conf, idx);
+}
+
+static void wait_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_wait_barrier(conf, idx);
+}
+
+static void _allow_barrier(struct r1conf *conf, int idx)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending--;
-	if (start_next_window) {
-		if (start_next_window == conf->start_next_window) {
-			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-			    <= bi_sector)
-				conf->next_window_requests--;
-			else
-				conf->current_window_requests--;
-		} else
-			conf->current_window_requests--;
-
-		if (!conf->current_window_requests) {
-			if (conf->next_window_requests) {
-				conf->current_window_requests =
-					conf->next_window_requests;
-				conf->next_window_requests = 0;
-				conf->start_next_window +=
-					NEXT_NORMALIO_DISTANCE;
-			} else
-				conf->start_next_window = MaxSector;
-		}
-	}
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
+
+	_allow_barrier(conf, idx);
+}
+
+static void allow_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_allow_barrier(conf, idx);
+}
+
+/* conf->resync_lock should be held */
+static int get_all_pendings(struct r1conf *conf)
+{
+	int idx, ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_pending[idx];
+	return ret;
+}
+
+/* conf->resync_lock should be held */
+static int get_all_queued(struct r1conf *conf)
+{
+	int idx, ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_queued[idx];
+	return ret;
+}
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
-	/* stop syncio and normal IO and wait for everything to
+	/* Stop sync I/O and normal I/O and wait for everything to
 	 * go quite.
-	 * We wait until nr_pending match nr_queued+extra
-	 * This is called in the context of one normal IO request
-	 * that has failed. Thus any sync request that might be pending
-	 * will be blocked by nr_pending, and we need to wait for
-	 * pending IO requests to complete or be queued for re-try.
-	 * Thus the number queued (nr_queued) plus this request (extra)
-	 * must match the number of pending IOs (nr_pending) before
-	 * we continue.
+	 * This is called in two situations:
+	 * 1) management command handlers (reshape, remove disk, quiesce).
+	 * 2) one normal I/O request failed.
+
+	 * After array_frozen is set to 1, new sync IO will be blocked at
+	 * raise_barrier(), and new normal I/O will blocked at _wait_barrier().
+	 * The flying I/Os will either complete or be queued. When everything
+	 * goes quite, there are only queued I/Os left.
+
+	 * Every flying I/O contributes to a conf->nr_pending[idx], idx is the
+	 * barrier bucket index which this I/O request hits. When all sync and
+	 * normal I/O are queued, sum of all conf->nr_pending[] will match sum
+	 * of all conf->nr_queued[]. But normal I/O failure is an exception,
+	 * in handle_read_error(), we may call freeze_array() before trying to
+	 * fix the read error. In this case, the error read I/O is not queued,
+	 * so get_all_pending() == get_all_queued() + 1.
+	 *
+	 * Therefore before this function returns, we need to wait until
+	 * get_all_pendings(conf) gets equal to get_all_queued(conf)+extra. For
+	 * normal I/O context, extra is 1, in rested situations extra is 0.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 1;
 	raid1_log(conf->mddev, "wait freeze");
-	wait_event_lock_irq_cmd(conf->wait_barrier,
-				conf->nr_pending == conf->nr_queued+extra,
-				conf->resync_lock,
-				flush_pending_writes(conf));
+	wait_event_lock_irq_cmd(
+		conf->wait_barrier,
+		get_all_pendings(conf) == get_all_queued(conf)+extra,
+		conf->resync_lock,
+		flush_pending_writes(conf));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(struct r1conf *conf)
@@ -1066,64 +1094,23 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
-static void raid1_make_request(struct mddev *mddev, struct bio * bio)
+static void raid1_make_read_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
 	struct raid1_info *mirror;
 	struct r1bio *r1_bio;
 	struct bio *read_bio;
-	int i, disks;
 	struct bitmap *bitmap;
-	unsigned long flags;
 	const int op = bio_op(bio);
-	const int rw = bio_data_dir(bio);
 	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
-	const unsigned long do_flush_fua = (bio->bi_opf &
-						(REQ_PREFLUSH | REQ_FUA));
-	struct md_rdev *blocked_rdev;
-	struct blk_plug_cb *cb;
-	struct raid1_plug_cb *plug = NULL;
-	int first_clone;
 	int sectors_handled;
 	int max_sectors;
-	sector_t start_next_window;
+	int rdisk;
 
-	/*
-	 * Register the new request and wait if the reconstruction
-	 * thread has put up a bar for new requests.
-	 * Continue immediately if no resync is active currently.
+	/* Still need barrier for READ in case that whole
+	 * array is frozen.
 	 */
-
-	md_write_start(mddev, bio); /* wait on superblock update early */
-
-	if (bio_data_dir(bio) == WRITE &&
-	    ((bio_end_sector(bio) > mddev->suspend_lo &&
-	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
-	    (mddev_is_clustered(mddev) &&
-	     md_cluster_ops->area_resyncing(mddev, WRITE,
-		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
-		/* As the suspend_* range is controlled by
-		 * userspace, we want an interruptible
-		 * wait.
-		 */
-		DEFINE_WAIT(w);
-		for (;;) {
-			flush_signals(current);
-			prepare_to_wait(&conf->wait_barrier,
-					&w, TASK_INTERRUPTIBLE);
-			if (bio_end_sector(bio) <= mddev->suspend_lo ||
-			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
-			    (mddev_is_clustered(mddev) &&
-			     !md_cluster_ops->area_resyncing(mddev, WRITE,
-				     bio->bi_iter.bi_sector, bio_end_sector(bio))))
-				break;
-			schedule();
-		}
-		finish_wait(&conf->wait_barrier, &w);
-	}
-
-	start_next_window = wait_barrier(conf, bio);
-
+	wait_read_barrier(conf, bio->bi_iter.bi_sector);
 	bitmap = mddev->bitmap;
 
 	/*
@@ -1149,12 +1136,9 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	bio->bi_phys_segments = 0;
 	bio_clear_flag(bio, BIO_SEG_VALID);
 
-	if (rw == READ) {
 		/*
 		 * read balancing logic:
 		 */
-		int rdisk;
-
 read_again:
 		rdisk = read_balance(conf, r1_bio, &max_sectors);
 
@@ -1176,7 +1160,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 				   atomic_read(&bitmap->behind_writes) == 0);
 		}
 		r1_bio->read_disk = rdisk;
-		r1_bio->start_next_window = 0;
 
 		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 		bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
@@ -1232,11 +1215,89 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 		} else
 			generic_make_request(read_bio);
 		return;
+}
+
+static void raid1_make_write_request(struct mddev *mddev, struct bio *bio)
+{
+	struct r1conf *conf = mddev->private;
+	struct r1bio *r1_bio;
+	int i, disks;
+	struct bitmap *bitmap;
+	unsigned long flags;
+	const int op = bio_op(bio);
+	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
+	const unsigned long do_flush_fua = (bio->bi_opf &
+						(REQ_PREFLUSH | REQ_FUA));
+	struct md_rdev *blocked_rdev;
+	struct blk_plug_cb *cb;
+	struct raid1_plug_cb *plug = NULL;
+	int first_clone;
+	int sectors_handled;
+	int max_sectors;
+
+	/*
+	 * Register the new request and wait if the reconstruction
+	 * thread has put up a bar for new requests.
+	 * Continue immediately if no resync is active currently.
+	 */
+
+	md_write_start(mddev, bio); /* wait on superblock update early */
+
+	if (((bio_end_sector(bio) > mddev->suspend_lo &&
+	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
+	    (mddev_is_clustered(mddev) &&
+	     md_cluster_ops->area_resyncing(mddev, WRITE,
+		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
+		/* As the suspend_* range is controlled by
+		 * userspace, we want an interruptible
+		 * wait.
+		 */
+		DEFINE_WAIT(w);
+
+		for (;;) {
+			flush_signals(current);
+			prepare_to_wait(&conf->wait_barrier,
+					&w, TASK_INTERRUPTIBLE);
+			if (bio_end_sector(bio) <= mddev->suspend_lo ||
+			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
+			    (mddev_is_clustered(mddev) &&
+			     !md_cluster_ops->area_resyncing(
+						mddev,
+						WRITE,
+						bio->bi_iter.bi_sector,
+						bio_end_sector(bio))))
+				break;
+			schedule();
+		}
+		finish_wait(&conf->wait_barrier, &w);
 	}
 
+	wait_barrier(conf, bio->bi_iter.bi_sector);
+	bitmap = mddev->bitmap;
+
 	/*
-	 * WRITE:
+	 * make_request() can abort the operation when read-ahead is being
+	 * used and no empty request is available.
+	 *
+	 */
+	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio);
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector;
+
+	/* We might need to issue multiple reads to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of reads in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
 	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
+
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
 		raid1_log(mddev, "wait queued");
@@ -1256,7 +1317,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 
 	disks = conf->raid_disks * 2;
  retry_write:
-	r1_bio->start_next_window = start_next_window;
 	blocked_rdev = NULL;
 	rcu_read_lock();
 	max_sectors = r1_bio->sectors;
@@ -1324,25 +1384,15 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
 				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
 		r1_bio->state = 0;
-		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+		allow_barrier(conf, bio->bi_iter.bi_sector);
 		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
-		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
+		wait_barrier(conf, bio->bi_iter.bi_sector);
 		goto retry_write;
 	}
 
@@ -1464,6 +1514,31 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 	wake_up(&conf->wait_barrier);
 }
 
+static void raid1_make_request(struct mddev *mddev, struct bio *bio)
+{
+	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
+	struct bio *split;
+	sector_t sectors;
+
+	make_request_fn = (bio_data_dir(bio) == READ) ?
+			  raid1_make_read_request :
+			  raid1_make_write_request;
+
+	/* if bio exceeds barrier unit boundary, split it */
+	do {
+		sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
+						    bio_sectors(bio));
+		if (sectors < bio_sectors(bio)) {
+			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+			bio_chain(split, bio);
+		} else {
+			split = bio;
+		}
+
+		make_request_fn(mddev, split);
+	} while (split != bio);
+}
+
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
 {
 	struct r1conf *conf = mddev->private;
@@ -1552,19 +1627,11 @@ static void print_conf(struct r1conf *conf)
 
 static void close_sync(struct r1conf *conf)
 {
-	wait_barrier(conf, NULL);
-	allow_barrier(conf, 0, 0);
+	wait_all_barriers(conf);
+	allow_all_barriers(conf);
 
 	mempool_destroy(conf->r1buf_pool);
 	conf->r1buf_pool = NULL;
-
-	spin_lock_irq(&conf->resync_lock);
-	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests +=
-		conf->next_window_requests;
-	conf->next_window_requests = 0;
-	spin_unlock_irq(&conf->resync_lock);
 }
 
 static int raid1_spare_active(struct mddev *mddev)
@@ -2311,8 +2378,9 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
 
 static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 {
-	int m;
+	int m, idx;
 	bool fail = false;
+
 	for (m = 0; m < conf->raid_disks * 2 ; m++)
 		if (r1_bio->bios[m] == IO_MADE_GOOD) {
 			struct md_rdev *rdev = conf->mirrors[m].rdev;
@@ -2338,7 +2406,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 	if (fail) {
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		conf->nr_queued++;
+		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
+		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2460,6 +2529,7 @@ static void raid1d(struct md_thread *thread)
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
+	int idx;
 
 	md_check_recovery(mddev);
 
@@ -2467,17 +2537,18 @@ static void raid1d(struct md_thread *thread)
 	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
-		}
+		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+			list_splice_init(&conf->bio_end_io_list, &tmp);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			idx = hash_long(r1_bio->sector,
+					BARRIER_BUCKETS_NR_BITS);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued[idx]--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2498,7 +2569,8 @@ static void raid1d(struct md_thread *thread)
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		conf->nr_queued--;
+		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
+		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2537,7 +2609,6 @@ static int init_resync(struct r1conf *conf)
 					  conf->poolinfo);
 	if (!conf->r1buf_pool)
 		return -ENOMEM;
-	conf->next_resync = 0;
 	return 0;
 }
 
@@ -2566,6 +2637,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	int still_degraded = 0;
 	int good_sectors = RESYNC_SECTORS;
 	int min_bad = 0; /* number of sectors that are bad in all devices */
+	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
 
 	if (!conf->r1buf_pool)
 		if (init_resync(conf))
@@ -2615,7 +2687,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting)
+	if (conf->nr_waiting[idx])
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2642,6 +2714,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
+	/* make sure good_sectors won't go across barrier unit boundary */
+	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct md_rdev *rdev;
@@ -2927,9 +3001,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests = conf->next_window_requests = 0;
-
 	err = -EIO;
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index c52ef42..817115d 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -1,6 +1,14 @@
 #ifndef _RAID1_H
 #define _RAID1_H
 
+/* each barrier unit size is 64MB fow now
+ * note: it must be larger than RESYNC_DEPTH
+ */
+#define BARRIER_UNIT_SECTOR_BITS	17
+#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
+#define BARRIER_BUCKETS_NR_BITS		9
+#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)
+
 struct raid1_info {
 	struct md_rdev	*rdev;
 	sector_t	head_position;
@@ -35,25 +43,6 @@ struct r1conf {
 						 */
 	int			raid_disks;
 
-	/* During resync, read_balancing is only allowed on the part
-	 * of the array that has been resynced.  'next_resync' tells us
-	 * where that is.
-	 */
-	sector_t		next_resync;
-
-	/* When raid1 starts resync, we divide array into four partitions
-	 * |---------|--------------|---------------------|-------------|
-	 *        next_resync   start_next_window       end_window
-	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
-	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
-	 * current_window_requests means the count of normalIO between
-	 *   start_next_window and end_window.
-	 * next_window_requests means the count of normalIO after end_window.
-	 * */
-	sector_t		start_next_window;
-	int			current_window_requests;
-	int			next_window_requests;
-
 	spinlock_t		device_lock;
 
 	/* list of 'struct r1bio' that need to be processed by raid1d,
@@ -79,10 +68,11 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending;
-	int			nr_waiting;
-	int			nr_queued;
-	int			barrier;
+	int			nr_pending[BARRIER_BUCKETS_NR];
+	int			nr_waiting[BARRIER_BUCKETS_NR];
+	int			nr_queued[BARRIER_BUCKETS_NR];
+	int			barrier[BARRIER_BUCKETS_NR];
+	int			total_barriers;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
@@ -135,7 +125,6 @@ struct r1bio {
 						 * in this BehindIO request
 						 */
 	sector_t		sector;
-	sector_t		start_next_window;
 	int			sectors;
 	unsigned long		state;
 	struct mddev		*mddev;
-- 
2.6.6



* [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2016-12-27 15:47 [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
@ 2016-12-27 15:47 ` Coly Li
  2017-01-04 19:59   ` Shaohua Li
  2017-01-06  1:52   ` NeilBrown
  2017-01-04 19:35 ` [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
  2017-01-05 23:08 ` NeilBrown
  2 siblings, 2 replies; 8+ messages in thread
From: Coly Li @ 2016-12-27 15:47 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

When I ran a parallel read performance test on an md raid1 device with
two NVMe SSDs, I was surprised to observe very bad throughput: with fio
using 64KB block size, 40 sequential read I/O jobs and 128 iodepth, the
overall throughput is only 2.7GB/s, which is around 50% of the ideal
performance number.

perf reports that lock contention happens in the allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is that these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
always hold conf->resync_lock first, even when there are only regular
read I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code, which
holds conf->resync_lock only when it is really necessary.

The original idea is from Hannes Reinecke, and Neil Brown provided
comments to improve it. Now I have written the patch based on the new,
simpler raid1 I/O barrier code.

In the new simpler raid1 I/O barrier implementation, there are two
wait barrier functions,
 - wait_barrier()
   This in turn calls _wait_barrier() and is used for regular write I/O.
   If there is resync I/O happening on the same barrier bucket index, or
   the whole array is frozen, the task will wait until there is no barrier
   on the same bucket index, or the whole array is unfrozen.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   will make sure only up-to-date data is read out), it is unnecessary to
   wait for the barrier in regular read I/Os; they only have to wait when
   the whole array is frozen.
The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and
conf->barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once
conf->nr_pending[idx] is increased, a resync I/O with the same barrier
bucket index has to wait in raise_barrier(). Then, in _wait_barrier() or
wait_read_barrier(), if no barrier is raised in the same barrier bucket
index and the array is not frozen, the regular I/O doesn't need to hold
conf->resync_lock; it can just increase conf->nr_pending[idx] and return
to its caller. For heavy parallel read I/O, the lockless I/O barrier code
almost completely gets rid of the spin lock cost.
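
To make this ordering concrete, here is a minimal standalone sketch of
the _wait_barrier() fast path. This is an illustration only: it uses C11
atomics instead of the kernel's atomic_t API, the names simply mirror the
description above, and the slow-path locking and sleeping are omitted,

        #include <stdatomic.h>
        #include <stdbool.h>

        #define BARRIER_BUCKETS_NR (1 << 9)

        static atomic_int nr_pending[BARRIER_BUCKETS_NR];
        static atomic_int nr_waiting[BARRIER_BUCKETS_NR];
        static atomic_int barrier[BARRIER_BUCKETS_NR];
        static atomic_int array_frozen;

        /* Returns true when a regular I/O may proceed without taking
         * conf->resync_lock; on false the caller must take the lock and
         * sleep on wait_barrier until the barrier drops or the array is
         * unfrozen.
         */
        static bool wait_barrier_fast_path(int idx)
        {
                /* 1) announce the regular I/O first, so raise_barrier()
                 *    on the same bucket is blocked by nr_pending[idx]
                 */
                atomic_fetch_add(&nr_pending[idx], 1);

                /* 2) check the frozen state first, then the per-bucket
                 *    barrier
                 */
                if (!atomic_load(&array_frozen) && !atomic_load(&barrier[idx]))
                        return true;    /* fast path: no lock taken at all */

                /* slow path: undo step 1 before sleeping, mirroring the
                 * patch, so a waiting raise_barrier() is not blocked
                 * forever
                 */
                atomic_fetch_add(&nr_waiting[idx], 1);
                atomic_fetch_sub(&nr_pending[idx], 1);
                return false;
        }

        int main(void)
        {
                /* with no resync raised and the array not frozen, a
                 * write to bucket 5 takes the fast path
                 */
                return wait_barrier_fast_path(5) ? 0 : 1;
        }

The resync side in raise_barrier() is the mirror image: it increases
barrier[idx] under conf->resync_lock and then waits (among other
conditions) until conf->nr_pending[idx] drops to zero.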

This patch significantly improves raid1 read performance. In my testing,
on a raid1 device built from two NVMe SSDs, running fio with 64KB block
size, 40 sequential read I/O jobs and 128 iodepth, the overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).

Open question:
Shaohua points out that memory barriers should be added to some atomic
operations. I am now reading the documentation to learn how to add the
memory barriers correctly. Anyway, if anyone has a suggestion, please
don't hesitate to let me know.

Changelog
V1:
- Original RFC patch for comments.
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no race when checking two
  atomic_t variables at the same time.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 134 +++++++++++++++++++++++++++++++----------------------
 drivers/md/raid1.h |  12 ++---
 2 files changed, 85 insertions(+), 61 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 5813656..b1fb4c1 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -222,7 +222,7 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued[idx]++;
+	atomic_inc(&conf->nr_queued[idx]);
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -831,13 +831,13 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->nr_waiting[idx]),
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier[idx]++;
-	conf->total_barriers++;
-
+	atomic_inc(&conf->barrier[idx]);
+	atomic_inc(&conf->total_barriers);
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
 	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
@@ -846,44 +846,69 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	 *    the max count which allowed on the whole raid1 device.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->array_frozen &&
-			     !conf->nr_pending[idx] &&
-			     conf->total_barriers < RESYNC_DEPTH,
+			    !atomic_read(&conf->array_frozen) &&
+			     !atomic_read(&conf->nr_pending[idx]) &&
+			     atomic_read(&conf->total_barriers) < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
 static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	unsigned long flags;
 	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
 
-	BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0);
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier[idx]--;
-	conf->total_barriers--;
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
+	BUG_ON(atomic_read(&conf->total_barriers) <= 0);
+	atomic_dec(&conf->barrier[idx]);
+	atomic_dec(&conf->total_barriers);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
 static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen || conf->barrier[idx]) {
-		conf->nr_waiting[idx]++;
-		/* Wait for the barrier to drop. */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen && !conf->barrier[idx],
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/* We need to increase conf->nr_pending[idx] very early here,
+	 * then raise_barrier() can be blocked when it waits for
+	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
+	 * conf->resync_lock when there is no barrier raised in same
+	 * barrier unit bucket. Also if the array is frozen, I/O
+	 * should be blocked until array is unfrozen.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+
+	/* Don't worry about checking two atomic_t variables at same time
+	 * here. conf->array_frozen MUST be checked firstly, The logic is,
+	 * if the array is frozen, no matter there is any barrier or not,
+	 * all I/O should be blocked. If there is no barrier in current
+	 * barrier bucket, we still need to check whether the array is frozen,
+	 * otherwise I/O will happen on frozen array, that's buggy.
+	 * If during we check conf->barrier[idx], the array is frozen (a.k.a
+	 * conf->array_frozen is set), and chonf->barrier[idx] is 0, it is
+	 * safe to return and make the I/O continue. Because the array is
+	 * frozen, all I/O returned here will eventually complete or be
+	 * queued, see code comment in frozen_array().
+	 */
+	if (!atomic_read(&conf->array_frozen) &&
+	    !atomic_read(&conf->barrier[idx]))
+		return;
 
-	conf->nr_pending[idx]++;
+	/* After holding conf->resync_lock, conf->nr_pending[idx]
+	 * should be decreased before waiting for barrier to drop.
+	 * Otherwise, we may encounter a race condition because
+	 * raise_barrer() might be waiting for conf->nr_pending[idx]
+	 * to be 0 at same time.
+	 */
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/* Wait for the barrier in same barrier unit bucket to drop. */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->array_frozen) &&
+			     !atomic_read(&conf->barrier[idx]),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -891,18 +916,23 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	long idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
 
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen) {
-		conf->nr_waiting[idx]++;
-		/* Wait for array to unfreeze */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen,
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/* Very similar to _wait_barrier(). The difference is, for read
+	 * I/O we don't need wait for sync I/O, but if the whole array
+	 * is frozen, the read I/O still has to wait until the array is
+	 * unfrozen.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+	if (!atomic_read(&conf->array_frozen))
+		return;
 
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/* Wait for array to be unfrozen */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->array_frozen),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -923,11 +953,7 @@ static void wait_all_barriers(struct r1conf *conf)
 
 static void _allow_barrier(struct r1conf *conf, int idx)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -952,7 +978,7 @@ static int get_all_pendings(struct r1conf *conf)
 	int idx, ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_pending[idx];
+		ret += atomic_read(&conf->nr_pending[idx]);
 	return ret;
 }
 
@@ -962,7 +988,7 @@ static int get_all_queued(struct r1conf *conf)
 	int idx, ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_queued[idx];
+		ret += atomic_read(&conf->nr_queued[idx]);
 	return ret;
 }
 
@@ -992,7 +1018,7 @@ static void freeze_array(struct r1conf *conf, int extra)
 	 * normal I/O context, extra is 1, in rested situations extra is 0.
 	 */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 1;
+	atomic_set(&conf->array_frozen, 1);
 	raid1_log(conf->mddev, "wait freeze");
 	wait_event_lock_irq_cmd(
 		conf->wait_barrier,
@@ -1005,7 +1031,7 @@ static void unfreeze_array(struct r1conf *conf)
 {
 	/* reverse the effect of the freeze */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 0;
+	atomic_set(&conf->array_frozen, 0);
 	wake_up(&conf->wait_barrier);
 	spin_unlock_irq(&conf->resync_lock);
 }
@@ -2407,7 +2433,7 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
 		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
-		conf->nr_queued[idx]++;
+		atomic_inc(&conf->nr_queued[idx]);
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2546,9 +2572,7 @@ static void raid1d(struct md_thread *thread)
 			list_del(&r1_bio->retry_list);
 			idx = hash_long(r1_bio->sector,
 					BARRIER_BUCKETS_NR_BITS);
-			spin_lock_irqsave(&conf->device_lock, flags);
-			conf->nr_queued[idx]--;
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			atomic_dec(&conf->nr_queued[idx]);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2570,7 +2594,7 @@ static void raid1d(struct md_thread *thread)
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
 		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
-		conf->nr_queued[idx]--;
+		atomic_dec(&conf->nr_queued[idx]);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2687,7 +2711,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting[idx])
+	if (atomic_read(&conf->nr_waiting[idx]))
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -3316,7 +3340,7 @@ static void *raid1_takeover(struct mddev *mddev)
 		conf = setup_conf(mddev);
 		if (!IS_ERR(conf)) {
 			/* Array must appear to be quiesced */
-			conf->array_frozen = 1;
+			atomic_set(&conf->array_frozen, 1);
 			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
 			clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
 		}
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index 817115d..bbe65f7 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -68,12 +68,12 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending[BARRIER_BUCKETS_NR];
-	int			nr_waiting[BARRIER_BUCKETS_NR];
-	int			nr_queued[BARRIER_BUCKETS_NR];
-	int			barrier[BARRIER_BUCKETS_NR];
-	int			total_barriers;
-	int			array_frozen;
+	atomic_t		nr_pending[BARRIER_BUCKETS_NR];
+	atomic_t		nr_waiting[BARRIER_BUCKETS_NR];
+	atomic_t		nr_queued[BARRIER_BUCKETS_NR];
+	atomic_t		barrier[BARRIER_BUCKETS_NR];
+	atomic_t		total_barriers;
+	atomic_t		array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
 	 * Cleared when a sync completes.
-- 
2.6.6



* Re: [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-12-27 15:47 [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
  2016-12-27 15:47 ` [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
@ 2017-01-04 19:35 ` Shaohua Li
  2017-01-16  6:08   ` Coly Li
  2017-01-05 23:08 ` NeilBrown
  2 siblings, 1 reply; 8+ messages in thread
From: Shaohua Li @ 2017-01-04 19:35 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On Tue, Dec 27, 2016 at 11:47:37PM +0800, Coly Li wrote:
> [...]
> ---
>  
> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
> +					  sector_t sectors)
> +{
> +	sector_t len;
> +
> +	WARN_ON(sectors == 0);
> +	/* len is the number of sectors from start_sector to end of the
> +	 * barrier unit which start_sector belongs to.
> +	 */

The correct format for comments is:
/*
 * something
 */

There are some other places with the same issue

> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
> +	      start_sector;

This one makes me nervous. Shouldn't this be:
 +	len = ((start_sector +  (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
 +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
 +	      start_sector;

And you can use round_up()

>  
> -static void raid1_make_request(struct mddev *mddev, struct bio * bio)
> +static void raid1_make_read_request(struct mddev *mddev, struct bio *bio)
>  {

Please rebase the patches onto the latest md-next. raid1_make_request() was
already split into read/write code paths there recently.

Otherwise, the patch looks good. After these are fixed, I'll add it for 4.11

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2016-12-27 15:47 ` [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
@ 2017-01-04 19:59   ` Shaohua Li
  2017-01-06  1:52   ` NeilBrown
  1 sibling, 0 replies; 8+ messages in thread
From: Shaohua Li @ 2017-01-04 19:59 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-raid, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

On Tue, Dec 27, 2016 at 11:47:38PM +0800, Coly Li wrote:
> When I run a parallel read performance test on an md raid1 device with
> two NVMe SSDs, I observe surprisingly bad throughput: with fio, 64KB
> block size, 40 seq read I/O jobs and 128 iodepth, overall throughput is
> only 2.7GB/s, which is around 50% of the ideal performance number.
> 
> The perf reports locking contention happens at allow_barrier() and
> wait_barrier() code,
>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
>          + 89.92% allow_barrier
>          + 9.34% __wake_up
>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>    - _raw_spin_lock_irq
>          - 100.00% wait_barrier
> 
> The reason is, in these I/O barrier related functions,
>  - raise_barrier()
>  - lower_barrier()
>  - wait_barrier()
>  - allow_barrier()
> They always hold conf->resync_lock first, even when there are only regular
> read I/Os and no resync I/O at all. This is a huge performance penalty.
> 
> The solution is a lockless-like algorithm in I/O barrier code, and only
> holding conf->resync_lock when it is really necessary.
> 
> The original idea is from Hannes Reinecke, and Neil Brown provides
> comments to improve it. Now I write the patch based on new simpler raid1
> I/O barrier code.
> 
> In the new simpler raid1 I/O barrier implementation, there are two
> wait barrier functions,
>  - wait_barrier()
>    Which in turn calls _wait_barrier(), is used for regular write I/O.
>    If there is resync I/O happening on the same barrier bucket index, or
>    the whole array is frozen, the task will wait until there is no barrier
>    on the same bucket index, or the whole array is unfrozen.
>  - wait_read_barrier()
>    Since regular read I/O won't interfere with resync I/O (read_balance()
>    will make sure only up-to-date data is read out), it is unnecessary to
>    wait for a barrier in regular read I/Os; they only have to wait when
>    the whole array is frozen.
> The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
> barrier[idx] are very carefully designed in raise_barrier(),
> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
> avoid unnecessary spin locks in these functions. Once conf->
> nr_pending[idx] is increased, a resync I/O with the same barrier bucket index
> has to wait in raise_barrier(). Then in _wait_barrier() or
> wait_read_barrier() if no barrier raised in same barrier bucket index or
> array is not frozen, the regular I/O doesn't need to hold conf->
> resync_lock, it can just increase conf->nr_pending[idx], and return to its
> caller. For heavy parallel read I/Os, the lockless I/O barrier code
> gets rid of almost all of the spin lock cost.
> 
> This patch significantly improves raid1 read performance. In my
> testing, on a raid1 device built from two NVMe SSDs, running fio with 64KB
> blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
> increases from 2.7GB/s to 4.6GB/s (+70%).
> 
> Open question:
> Shaohua points out the memory barrier should be added to some atomic
> operations. Now I am reading the documentation to learn how to add the
> memory barriers correctly. Anyway, if anyone has suggestions, please don't
> hesitate to let me know.

Yes, because raise_barrier()/_wait_barrier() depend on the order of the atomic
operations, while atomic_inc()/atomic_read() don't imply a barrier.
 
> @@ -1005,7 +1031,7 @@ static void unfreeze_array(struct r1conf *conf)
>  {
>  	/* reverse the effect of the freeze */
>  	spin_lock_irq(&conf->resync_lock);
> -	conf->array_frozen = 0;
> +	atomic_set(&conf->array_frozen, 0);
>  	wake_up(&conf->wait_barrier);
>  	spin_unlock_irq(&conf->resync_lock);
>  }

Nitpick: This one doesn't need the lock.
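
i.e. it could just be (untested sketch):

	static void unfreeze_array(struct r1conf *conf)
	{
		/* reverse the effect of the freeze */
		atomic_set(&conf->array_frozen, 0);
		wake_up(&conf->wait_barrier);
	}

relying on the usual set-condition-then-wake_up() pattern for ordering.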

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-12-27 15:47 [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
  2016-12-27 15:47 ` [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
  2017-01-04 19:35 ` [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
@ 2017-01-05 23:08 ` NeilBrown
  2017-01-16  9:06   ` Coly Li
  2 siblings, 1 reply; 8+ messages in thread
From: NeilBrown @ 2017-01-05 23:08 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 40643 bytes --]

On Wed, Dec 28 2016, Coly Li wrote:

> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> I/O barriers to happen only inside a slidingresync window, for regular
> I/Os out of this resync window they don't need to wait for barrier any
> more. On large raid1 device, it helps a lot to improve parallel writing
> I/O throughput when there are background resync I/Os performing at
> same time.
>
> The idea of sliding resync widow is awesome, but there are several
> challenges are very difficult to solve,
>  - code complexity
>    Sliding resync window requires several veriables to work collectively,
>    this is complexed and very hard to make it work correctly. Just grep
>    "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>    the original resync window patch. This is not the end, any further
>    related modification may easily introduce more regreassion.
>  - multiple sliding resync windows
>    Currently raid1 code only has a single sliding resync window, we cannot
>    do parallel resync with current I/O barrier implementation.
>    Implementing multiple resync windows are much more complexed, and very
>    hard to make it correctly.

I think I've asked this before, but why do you think that parallel
resync might ever be a useful idea?  I don't think it makes any sense, so
it is wrong for you to use it as part of the justification for this patch.
Just don't mention it at all unless you have a genuine expectation that
it would really be a good thing, in which case: explain the value.

>
> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> removing resync window code, I believe life will be much easier.
>
> The brief idea of the simpler barrier is,
>  - Do not maintain a logbal unique resync window
>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>    I/O only has to wait for a resync I/O when both them have same barrier
>    bucket index, vice versa.
>  - I/O barrier can be recuded to an acceptable number if there are enought
>    barrier buckets
>
> Here I explain how the barrier buckets are designed,
>  - BARRIER_UNIT_SECTOR_SIZE
>    The whole LBA address space of a raid1 device is divided into multiple
>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>    Bio request won't go across border of barrier unit size, that means
>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.

It would be good to say here what number you chose, and why you chose
it.
You have picked 64MB.  This divides a 1TB device into 16384 regions.
Any write request must fit into one of these regions, so we mustn't make
the region too small, else we would lose the benefit of sending large
requests down.

We want the resync to move from region to region fairly quickly so that
the slowness caused by having to synchronize with the resync is averaged
out over a fairly small time frame.  At full speed, 64MB should take less
than 1 second.  When resync is competing with other IO, it could easily
take up to a minute(?).  I think that is a fairly good range.

So I think 64MB is probably a very good choice.  I just would like to
see the justification clearly stated.

>  - BARRIER_BUCKETS_NR
>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>         #define BARRIER_BUCKETS_NR_BITS   9
>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)

Why 512 buckets?  What are the tradeoffs?
More buckets means more memory consumed for counters.
Fewer buckets means more false sharing.
With 512 buckets, a request which is smaller than the region size has a
0.2% chance of having to wait for resync to pause.  I think that is
quite a small enough fraction.
I think you originally chose the number of buckets so that a set of
4-byte counters fits exactly into a page.  I think that is still a good
guideline, so I would have
	#define BARRIER_BUCKETS_NR_BITS	(PAGE_SHIFT - 2)
(which makes it 10 ...).

>    if multiple I/O requests hit different barrier units, they only need
>    to compete I/O barrier with other I/Os which hit the same barrier
>    bucket index with each other. The index of a barrier bucket which a
>    bio should look for is calculated by,
>         int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)

This isn't right.  You have to divide by BARRIER_UNIT_SECTOR_SIZE first.
	int idx = hash_long(sector_nr >> BARRIER_UNIT_SECTOR_BITS, BARRIER_BUCKETS_NR_BITS);
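
For consistency it may be worth wrapping this in a tiny helper and using it
at every hash_long() call site; a sketch (the helper name is my own):

	static inline int sector_to_idx(sector_t sector)
	{
		return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
				 BARRIER_BUCKETS_NR_BITS);
	}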

>    that sector_nr is the start sector number of a bio. We use function
>    align_to_barrier_unit_end() to calculate sectors number from sector_nr
>    to the next barrier unit size boundary, if the requesting bio size
>    goes across the boundary, we split the bio in raid1_make_request(), to
>    make sure the finall bio sent into generic_make_request() won't exceed
>    barrier unit boundary.
>
> Comparing to single sliding resync window,
>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>    will conflict within a single barrier unit. So it is similar to the
>    single sliding resync window.
>  - But a barrier bucket is shared by all barrier units with the identical
>    barrier bucket index, so the probability of conflict might be higher
>    than with the single sliding resync window, in the condition that
>    writing I/Os always hit barrier units which have the identical barrier
>    bucket index as the resync I/Os. This is a very rare condition in real
>    I/O workloads, and I cannot imagine how it could happen in practice.
>  - Therefore we can achieve a sufficiently low conflict rate with a much
>    simpler barrier algorithm and implementation.
>
> If a user has a (really) large raid1 device, for example 10PB in size, we
> may just increase the bucket number BARRIER_BUCKETS_NR. Now this is a macro,
> it could be turned into a variable defined at raid1 creation time in the future.

Why?  Why would a large array require more buckets?  Are you just
guessing, or do you see some concrete reason for there to be a
relationship between the size of the array and the number of buckets?
If you can see a connection, please state it.  If not, don't mention it.

>
> There are a few changes that should be noted,
>  - In raid1d(), I change the code so that conf->nr_queued[idx] is decreased
>    inside the single processing loop; it looks like this,
>         spin_lock_irqsave(&conf->device_lock, flags);
>         conf->nr_queued[idx]--;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>    This change generates more spin lock operations, but in the next patch
>    of this patch set it will be replaced by a single line of code,
>         atomic_dec(&conf->nr_queued[idx]);
>    so we don't need to worry about the spin lock cost here.
>  - Original function raid1_make_request() is split into two functions,
>    - raid1_make_read_request(): handles regular read request and calls
>      wait_read_barrier() for I/O barrier.
>    - raid1_make_write_request(): handles regular write request and calls
>      wait_barrier() for I/O barrier.
>    The difference is that wait_read_barrier() only waits if the array is
>    frozen; using different barrier functions in different code paths makes
>    the code cleaner and easier to read.
>  - align_to_barrier_unit_end() is called to make sure both regular and
>    resync I/O won't go across a barrier unit boundary.
>
> Changelog
> V1:
> - Original RFC patch for comments
> V2:
> - Use bio_split() to split the original bio if it goes across a barrier
>   unit boundary, to make the code simpler, as suggested by Shaohua and
>   Neil.
> - Use hash_long() to replace the original linear hash, to avoid a possible
>   conflict between resync I/O and sequential write I/O, as suggested by
>   Shaohua.
> - Add conf->total_barriers to record barrier depth, which is used to
>   control the number of parallel sync I/O barriers, as suggested by Shaohua.

I really don't think this is needed.
As long as RESYNC_DEPTH * RESYNC_SECTORS is less than BARRIER_UNIT_SECTOR_SIZE
just testing again ->barrier[idx] will ensure the number of barrier
requests never exceeds RESYNC_DEPTH*2.  That is sufficient.

Also, I think the reason for imposing the RESYNC_DEPTH limit is to make
sure regular IO never has to wait too long for pending resync requests
to flush.  With the simple test, regular IO will never need to wait for
more than RESYNC_DEPTH requests to complete.

So I think having this field brings no value, and is potentially confusing.

> - In the V1 patch the barrier-bucket related members of r1conf listed below
>   were allocated in a memory page. To make the code simpler, the V2 patch
>   moves them into struct r1conf, like this,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>         +       int                     barrier[BARRIER_BUCKETS_NR];

I don't like this.  It makes the r1conf 4 pages in size, most of which
is wasted.  A 4-page allocation is more likely to fail than a few 1-page
allocations.
I think these should be:
>         +       int                     *nr_pending;
>         +       int                     *nr_waiting;
>         +       int                     *nr_queued;
>         +       int                     *barrier;

Then use kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL)
to allocate each array.   I think this approach addresses Shaohua's
concerns without requiring a multi-page allocation.
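
Roughly, in setup_conf() (sketch only, assuming the existing abort: error
path, with matching kfree() calls on teardown):

	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	conf->nr_queued  = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	conf->barrier    = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	if (!conf->nr_pending || !conf->nr_waiting ||
	    !conf->nr_queued || !conf->barrier)
		goto abort;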

>   This change was suggested by Shaohua.
> - Remove some irrelevant code comments, as suggested by Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write, in
>   raid1_make_write_request().
>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/raid1.c | 485 ++++++++++++++++++++++++++++++-----------------------
>  drivers/md/raid1.h |  37 ++--
>  2 files changed, 291 insertions(+), 231 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index a1f3fbe..5813656 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -67,9 +67,8 @@
>   */
>  static int max_queued_requests = 1024;
>  
> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> -			  sector_t bi_sector);
> -static void lower_barrier(struct r1conf *conf);
> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>  
>  #define raid1_log(md, fmt, args...)				\
>  	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
> @@ -96,7 +95,6 @@ static void r1bio_pool_free(void *r1_bio, void *data)
>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>  
>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>  {
> @@ -211,7 +209,7 @@ static void put_buf(struct r1bio *r1_bio)
>  
>  	mempool_free(r1_bio, conf->r1buf_pool);
>  
> -	lower_barrier(conf);
> +	lower_barrier(conf, r1_bio->sector);
>  }
>  
>  static void reschedule_retry(struct r1bio *r1_bio)
> @@ -219,10 +217,12 @@ static void reschedule_retry(struct r1bio *r1_bio)
>  	unsigned long flags;
>  	struct mddev *mddev = r1_bio->mddev;
>  	struct r1conf *conf = mddev->private;
> +	int idx;
>  
> +	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
>  	spin_lock_irqsave(&conf->device_lock, flags);
>  	list_add(&r1_bio->retry_list, &conf->retry_list);
> -	conf->nr_queued ++;
> +	conf->nr_queued[idx]++;
>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>  
>  	wake_up(&conf->wait_barrier);
> @@ -239,8 +239,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
>  	struct bio *bio = r1_bio->master_bio;
>  	int done;
>  	struct r1conf *conf = r1_bio->mddev->private;
> -	sector_t start_next_window = r1_bio->start_next_window;
> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>  
>  	if (bio->bi_phys_segments) {
>  		unsigned long flags;
> @@ -265,7 +263,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
>  		 * Wake up any possible resync thread that waits for the device
>  		 * to go idle.
>  		 */
> -		allow_barrier(conf, start_next_window, bi_sector);
> +		allow_barrier(conf, bio->bi_iter.bi_sector);

Why did you change this to use "bio->bi_iter.bi_sector" instead of
"bi_sector"?

I assume you thought it was an optimization that you would just slip
in.  Can't hurt, right?

Just before this line is:
		bio_endio(bio);
and that might cause the bio to be freed.  So your code could
access freed memory.

Please be *very* cautious when making changes that are not directly
related to the purpose of the patch.
                

>  	}
>  }
>  
> @@ -513,6 +511,25 @@ static void raid1_end_write_request(struct bio *bio)
>  		bio_put(to_put);
>  }
>  
> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
> +					  sector_t sectors)
> +{
> +	sector_t len;
> +
> +	WARN_ON(sectors == 0);
> +	/* len is the number of sectors from start_sector to end of the
> +	 * barrier unit which start_sector belongs to.
> +	 */
> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
> +	      start_sector;

This would be better as

    len = round_up(start_sector+1, BARRIER_UNIT_SECTOR_SIZE) - start_sector;
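
i.e. the whole helper could shrink to something like this (sketch, also
using the comment style Shaohua asked for):

	static sector_t align_to_barrier_unit_end(sector_t start_sector,
						  sector_t sectors)
	{
		sector_t len;

		WARN_ON(sectors == 0);
		/*
		 * len is the number of sectors from start_sector to the end
		 * of the barrier unit which start_sector belongs to.
		 */
		len = round_up(start_sector + 1, BARRIER_UNIT_SECTOR_SIZE) -
		      start_sector;

		return min(len, sectors);
	}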


> +
> +	if (len > sectors)
> +		len = sectors;
> +
> +	return len;
> +}
> +
>  /*
>   * This routine returns the disk from which the requested read should
>   * be done. There is a per-array 'next expected sequential IO' sector
> @@ -809,168 +826,179 @@ static void flush_pending_writes(struct r1conf *conf)
>   */
>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
> +
>  	spin_lock_irq(&conf->resync_lock);
>  
>  	/* Wait until no block IO is waiting */
> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>  			    conf->resync_lock);
>  
>  	/* block any new IO from starting */
> -	conf->barrier++;
> -	conf->next_resync = sector_nr;
> +	conf->barrier[idx]++;
> +	conf->total_barriers++;
>  
>  	/* For these conditions we must wait:
>  	 * A: while the array is in frozen state
> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
> -	 *    the max count which allowed.
> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
> -	 *    next resync will reach to the window which normal bios are
> -	 *    handling.
> -	 * D: while there are any active requests in the current window.
> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
> +	 *    existing in sector number ranges corresponding to idx.
> +	 * C: while conf->total_barriers >= RESYNC_DEPTH, meaning resync reach
> +	 *    the max count which allowed on the whole raid1 device.
>  	 */
>  	wait_event_lock_irq(conf->wait_barrier,
>  			    !conf->array_frozen &&
> -			    conf->barrier < RESYNC_DEPTH &&
> -			    conf->current_window_requests == 0 &&
> -			    (conf->start_next_window >=
> -			     conf->next_resync + RESYNC_SECTORS),
> +			     !conf->nr_pending[idx] &&
> +			     conf->total_barriers < RESYNC_DEPTH,
>  			    conf->resync_lock);
>  
> -	conf->nr_pending++;
> +	conf->nr_pending[idx]++;
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> -static void lower_barrier(struct r1conf *conf)
> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
>  	unsigned long flags;
> -	BUG_ON(conf->barrier <= 0);
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
> +
> +	BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0);
> +
>  	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->barrier--;
> -	conf->nr_pending--;
> +	conf->barrier[idx]--;
> +	conf->total_barriers--;
> +	conf->nr_pending[idx]--;
>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>  	wake_up(&conf->wait_barrier);
>  }
>  
> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
> +static void _wait_barrier(struct r1conf *conf, int idx)
>  {
> -	bool wait = false;
> -
> -	if (conf->array_frozen || !bio)
> -		wait = true;
> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
> -		if ((conf->mddev->curr_resync_completed
> -		     >= bio_end_sector(bio)) ||
> -		    (conf->start_next_window + NEXT_NORMALIO_DISTANCE
> -		     <= bio->bi_iter.bi_sector))
> -			wait = false;
> -		else
> -			wait = true;
> +	spin_lock_irq(&conf->resync_lock);
> +	if (conf->array_frozen || conf->barrier[idx]) {
> +		conf->nr_waiting[idx]++;
> +		/* Wait for the barrier to drop. */
> +		wait_event_lock_irq(
> +			conf->wait_barrier,
> +			!conf->array_frozen && !conf->barrier[idx],
> +			conf->resync_lock);
> +		conf->nr_waiting[idx]--;
>  	}
>  
> -	return wait;
> +	conf->nr_pending[idx]++;
> +	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
> -	sector_t sector = 0;
> +	long idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>  
>  	spin_lock_irq(&conf->resync_lock);
> -	if (need_to_wait_for_sync(conf, bio)) {
> -		conf->nr_waiting++;
> -		/* Wait for the barrier to drop.
> -		 * However if there are already pending
> -		 * requests (preventing the barrier from
> -		 * rising completely), and the
> -		 * per-process bio queue isn't empty,
> -		 * then don't wait, as we need to empty
> -		 * that queue to allow conf->start_next_window
> -		 * to increase.
> -		 */
> -		raid1_log(conf->mddev, "wait barrier");
> -		wait_event_lock_irq(conf->wait_barrier,
> -				    !conf->array_frozen &&
> -				    (!conf->barrier ||
> -				     ((conf->start_next_window <
> -				       conf->next_resync + RESYNC_SECTORS) &&
> -				      current->bio_list &&
> -				      !bio_list_empty(current->bio_list))),
> -				    conf->resync_lock);
> -		conf->nr_waiting--;
> -	}
> -
> -	if (bio && bio_data_dir(bio) == WRITE) {
> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
> -			if (conf->start_next_window == MaxSector)
> -				conf->start_next_window =
> -					conf->next_resync +
> -					NEXT_NORMALIO_DISTANCE;
> -
> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
> -			    <= bio->bi_iter.bi_sector)
> -				conf->next_window_requests++;
> -			else
> -				conf->current_window_requests++;
> -			sector = conf->start_next_window;
> -		}
> +	if (conf->array_frozen) {
> +		conf->nr_waiting[idx]++;
> +		/* Wait for array to unfreeze */
> +		wait_event_lock_irq(
> +			conf->wait_barrier,
> +			!conf->array_frozen,
> +			conf->resync_lock);
> +		conf->nr_waiting[idx]--;
>  	}
>  
> -	conf->nr_pending++;
> +	conf->nr_pending[idx]++;
>  	spin_unlock_irq(&conf->resync_lock);
> -	return sector;
>  }
>  
> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> -			  sector_t bi_sector)
> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
> +{
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
> +
> +	_wait_barrier(conf, idx);
> +}
> +
> +static void wait_all_barriers(struct r1conf *conf)
> +{
> +	int idx;
> +
> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		_wait_barrier(conf, idx);
> +}
> +
> +static void _allow_barrier(struct r1conf *conf, int idx)
>  {
>  	unsigned long flags;
>  
>  	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->nr_pending--;
> -	if (start_next_window) {
> -		if (start_next_window == conf->start_next_window) {
> -			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
> -			    <= bi_sector)
> -				conf->next_window_requests--;
> -			else
> -				conf->current_window_requests--;
> -		} else
> -			conf->current_window_requests--;
> -
> -		if (!conf->current_window_requests) {
> -			if (conf->next_window_requests) {
> -				conf->current_window_requests =
> -					conf->next_window_requests;
> -				conf->next_window_requests = 0;
> -				conf->start_next_window +=
> -					NEXT_NORMALIO_DISTANCE;
> -			} else
> -				conf->start_next_window = MaxSector;
> -		}
> -	}
> +	conf->nr_pending[idx]--;
>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>  	wake_up(&conf->wait_barrier);
>  }
>  
> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
> +{
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
> +
> +	_allow_barrier(conf, idx);
> +}
> +
> +static void allow_all_barriers(struct r1conf *conf)
> +{
> +	int idx;
> +
> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		_allow_barrier(conf, idx);
> +}
> +
> +/* conf->resync_lock should be held */
> +static int get_all_pendings(struct r1conf *conf)
> +{
> +	int idx, ret;
> +
> +	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		ret += conf->nr_pending[idx];
> +	return ret;
> +}
> +
> +/* conf->resync_lock should be held */
> +static int get_all_queued(struct r1conf *conf)
> +{
> +	int idx, ret;
> +
> +	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		ret += conf->nr_queued[idx];
> +	return ret;
> +}
> +
>  static void freeze_array(struct r1conf *conf, int extra)
>  {
> -	/* stop syncio and normal IO and wait for everything to
> +	/* Stop sync I/O and normal I/O and wait for everything to
>  	 * go quite.
> -	 * We wait until nr_pending match nr_queued+extra
> -	 * This is called in the context of one normal IO request
> -	 * that has failed. Thus any sync request that might be pending
> -	 * will be blocked by nr_pending, and we need to wait for
> -	 * pending IO requests to complete or be queued for re-try.
> -	 * Thus the number queued (nr_queued) plus this request (extra)
> -	 * must match the number of pending IOs (nr_pending) before
> -	 * we continue.
> +	 * This is called in two situations:
> +	 * 1) management command handlers (reshape, remove disk, quiesce).
> +	 * 2) one normal I/O request failed.
> +
> +	 * After array_frozen is set to 1, new sync IO will be blocked at
> +	 * raise_barrier(), and new normal I/O will blocked at _wait_barrier().
> +	 * The flying I/Os will either complete or be queued. When everything
> +	 * goes quite, there are only queued I/Os left.
> +
> +	 * Every flying I/O contributes to a conf->nr_pending[idx], idx is the
> +	 * barrier bucket index which this I/O request hits. When all sync and
> +	 * normal I/O are queued, sum of all conf->nr_pending[] will match sum
> +	 * of all conf->nr_queued[]. But normal I/O failure is an exception,
> +	 * in handle_read_error(), we may call freeze_array() before trying to
> +	 * fix the read error. In this case, the error read I/O is not queued,
> +	 * so get_all_pending() == get_all_queued() + 1.
> +	 *
> +	 * Therefore before this function returns, we need to wait until
> +	 * get_all_pendings(conf) gets equal to get_all_queued(conf)+extra. For
> +	 * normal I/O context, extra is 1, in rested situations extra is 0.
>  	 */
>  	spin_lock_irq(&conf->resync_lock);
>  	conf->array_frozen = 1;
>  	raid1_log(conf->mddev, "wait freeze");
> -	wait_event_lock_irq_cmd(conf->wait_barrier,
> -				conf->nr_pending == conf->nr_queued+extra,
> -				conf->resync_lock,
> -				flush_pending_writes(conf));
> +	wait_event_lock_irq_cmd(
> +		conf->wait_barrier,
> +		get_all_pendings(conf) == get_all_queued(conf)+extra,
> +		conf->resync_lock,
> +		flush_pending_writes(conf));
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  static void unfreeze_array(struct r1conf *conf)
> @@ -1066,64 +1094,23 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
>  	kfree(plug);
>  }
>  
> -static void raid1_make_request(struct mddev *mddev, struct bio * bio)
> +static void raid1_make_read_request(struct mddev *mddev, struct bio *bio)
>  {
>  	struct r1conf *conf = mddev->private;
>  	struct raid1_info *mirror;
>  	struct r1bio *r1_bio;
>  	struct bio *read_bio;
> -	int i, disks;
>  	struct bitmap *bitmap;
> -	unsigned long flags;
>  	const int op = bio_op(bio);
> -	const int rw = bio_data_dir(bio);
>  	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
> -	const unsigned long do_flush_fua = (bio->bi_opf &
> -						(REQ_PREFLUSH | REQ_FUA));
> -	struct md_rdev *blocked_rdev;
> -	struct blk_plug_cb *cb;
> -	struct raid1_plug_cb *plug = NULL;
> -	int first_clone;
>  	int sectors_handled;
>  	int max_sectors;
> -	sector_t start_next_window;
> +	int rdisk;
>  
> -	/*
> -	 * Register the new request and wait if the reconstruction
> -	 * thread has put up a bar for new requests.
> -	 * Continue immediately if no resync is active currently.
> +	/* Still need barrier for READ in case that whole
> +	 * array is frozen.
>  	 */
> -
> -	md_write_start(mddev, bio); /* wait on superblock update early */
> -
> -	if (bio_data_dir(bio) == WRITE &&
> -	    ((bio_end_sector(bio) > mddev->suspend_lo &&
> -	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
> -	    (mddev_is_clustered(mddev) &&
> -	     md_cluster_ops->area_resyncing(mddev, WRITE,
> -		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
> -		/* As the suspend_* range is controlled by
> -		 * userspace, we want an interruptible
> -		 * wait.
> -		 */
> -		DEFINE_WAIT(w);
> -		for (;;) {
> -			flush_signals(current);
> -			prepare_to_wait(&conf->wait_barrier,
> -					&w, TASK_INTERRUPTIBLE);
> -			if (bio_end_sector(bio) <= mddev->suspend_lo ||
> -			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
> -			    (mddev_is_clustered(mddev) &&
> -			     !md_cluster_ops->area_resyncing(mddev, WRITE,
> -				     bio->bi_iter.bi_sector, bio_end_sector(bio))))
> -				break;
> -			schedule();
> -		}
> -		finish_wait(&conf->wait_barrier, &w);
> -	}
> -
> -	start_next_window = wait_barrier(conf, bio);
> -
> +	wait_read_barrier(conf, bio->bi_iter.bi_sector);
>  	bitmap = mddev->bitmap;
>  
>  	/*
> @@ -1149,12 +1136,9 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  	bio->bi_phys_segments = 0;
>  	bio_clear_flag(bio, BIO_SEG_VALID);
>  
> -	if (rw == READ) {
>  		/*
>  		 * read balancing logic:
>  		 */
> -		int rdisk;
> -
>  read_again:
>  		rdisk = read_balance(conf, r1_bio, &max_sectors);
>  
> @@ -1176,7 +1160,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  				   atomic_read(&bitmap->behind_writes) == 0);
>  		}
>  		r1_bio->read_disk = rdisk;
> -		r1_bio->start_next_window = 0;
>  
>  		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
>  		bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
> @@ -1232,11 +1215,89 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  		} else
>  			generic_make_request(read_bio);
>  		return;
> +}
> +
> +static void raid1_make_write_request(struct mddev *mddev, struct bio *bio)
> +{
> +	struct r1conf *conf = mddev->private;
> +	struct r1bio *r1_bio;
> +	int i, disks;
> +	struct bitmap *bitmap;
> +	unsigned long flags;
> +	const int op = bio_op(bio);
> +	const unsigned long do_sync = (bio->bi_opf & REQ_SYNC);
> +	const unsigned long do_flush_fua = (bio->bi_opf &
> +						(REQ_PREFLUSH | REQ_FUA));
> +	struct md_rdev *blocked_rdev;
> +	struct blk_plug_cb *cb;
> +	struct raid1_plug_cb *plug = NULL;
> +	int first_clone;
> +	int sectors_handled;
> +	int max_sectors;
> +
> +	/*
> +	 * Register the new request and wait if the reconstruction
> +	 * thread has put up a bar for new requests.
> +	 * Continue immediately if no resync is active currently.
> +	 */
> +
> +	md_write_start(mddev, bio); /* wait on superblock update early */
> +
> +	if (((bio_end_sector(bio) > mddev->suspend_lo &&
> +	    bio->bi_iter.bi_sector < mddev->suspend_hi) ||
> +	    (mddev_is_clustered(mddev) &&
> +	     md_cluster_ops->area_resyncing(mddev, WRITE,
> +		     bio->bi_iter.bi_sector, bio_end_sector(bio))))) {
> +		/* As the suspend_* range is controlled by
> +		 * userspace, we want an interruptible
> +		 * wait.
> +		 */
> +		DEFINE_WAIT(w);
> +
> +		for (;;) {
> +			flush_signals(current);
> +			prepare_to_wait(&conf->wait_barrier,
> +					&w, TASK_INTERRUPTIBLE);
> +			if (bio_end_sector(bio) <= mddev->suspend_lo ||
> +			    bio->bi_iter.bi_sector >= mddev->suspend_hi ||
> +			    (mddev_is_clustered(mddev) &&
> +			     !md_cluster_ops->area_resyncing(
> +						mddev,
> +						WRITE,
> +						bio->bi_iter.bi_sector,
> +						bio_end_sector(bio))))
> +				break;
> +			schedule();
> +		}
> +		finish_wait(&conf->wait_barrier, &w);
>  	}
>  
> +	wait_barrier(conf, bio->bi_iter.bi_sector);
> +	bitmap = mddev->bitmap;
> +
>  	/*
> -	 * WRITE:
> +	 * make_request() can abort the operation when read-ahead is being
> +	 * used and no empty request is available.
> +	 *
> +	 */
> +	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> +
> +	r1_bio->master_bio = bio;
> +	r1_bio->sectors = bio_sectors(bio);
> +	r1_bio->state = 0;
> +	r1_bio->mddev = mddev;
> +	r1_bio->sector = bio->bi_iter.bi_sector;
> +
> +	/* We might need to issue multiple reads to different
> +	 * devices if there are bad blocks around, so we keep
> +	 * track of the number of reads in bio->bi_phys_segments.
> +	 * If this is 0, there is only one r1_bio and no locking
> +	 * will be needed when requests complete.  If it is
> +	 * non-zero, then it is the number of not-completed requests.

This comment mentions "reads".  It should probably be changed to discuss
what happens to "writes" since this is raid1_make_write_request().

>  	 */
> +	bio->bi_phys_segments = 0;
> +	bio_clear_flag(bio, BIO_SEG_VALID);
> +
>  	if (conf->pending_count >= max_queued_requests) {
>  		md_wakeup_thread(mddev->thread);
>  		raid1_log(mddev, "wait queued");
> @@ -1256,7 +1317,6 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  
>  	disks = conf->raid_disks * 2;
>   retry_write:
> -	r1_bio->start_next_window = start_next_window;
>  	blocked_rdev = NULL;
>  	rcu_read_lock();
>  	max_sectors = r1_bio->sectors;
> @@ -1324,25 +1384,15 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  	if (unlikely(blocked_rdev)) {
>  		/* Wait for this device to become unblocked */
>  		int j;
> -		sector_t old = start_next_window;
>  
>  		for (j = 0; j < i; j++)
>  			if (r1_bio->bios[j])
>  				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
>  		r1_bio->state = 0;
> -		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
> +		allow_barrier(conf, bio->bi_iter.bi_sector);
>  		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
>  		md_wait_for_blocked_rdev(blocked_rdev, mddev);
> -		start_next_window = wait_barrier(conf, bio);
> -		/*
> -		 * We must make sure the multi r1bios of bio have
> -		 * the same value of bi_phys_segments
> -		 */
> -		if (bio->bi_phys_segments && old &&
> -		    old != start_next_window)
> -			/* Wait for the former r1bio(s) to complete */
> -			wait_event(conf->wait_barrier,
> -				   bio->bi_phys_segments == 1);
> +		wait_barrier(conf, bio->bi_iter.bi_sector);
>  		goto retry_write;
>  	}
>  
> @@ -1464,6 +1514,31 @@ static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>  	wake_up(&conf->wait_barrier);
>  }
>  
> +static void raid1_make_request(struct mddev *mddev, struct bio *bio)
> +{
> +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
> +	struct bio *split;
> +	sector_t sectors;
> +
> +	make_request_fn = (bio_data_dir(bio) == READ) ?
> +			  raid1_make_read_request :
> +			  raid1_make_write_request;
> +
> +	/* if bio exceeds barrier unit boundary, split it */
> +	do {
> +		sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
> +						    bio_sectors(bio));
> +		if (sectors < bio_sectors(bio)) {
> +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> +			bio_chain(split, bio);
> +		} else {
> +			split = bio;
> +		}
> +
> +		make_request_fn(mddev, split);
> +	} while (split != bio);
> +}
> +
>  static void raid1_status(struct seq_file *seq, struct mddev *mddev)
>  {
>  	struct r1conf *conf = mddev->private;
> @@ -1552,19 +1627,11 @@ static void print_conf(struct r1conf *conf)
>  
>  static void close_sync(struct r1conf *conf)
>  {
> -	wait_barrier(conf, NULL);
> -	allow_barrier(conf, 0, 0);
> +	wait_all_barriers(conf);
> +	allow_all_barriers(conf);
>  
>  	mempool_destroy(conf->r1buf_pool);
>  	conf->r1buf_pool = NULL;
> -
> -	spin_lock_irq(&conf->resync_lock);
> -	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
> -	conf->start_next_window = MaxSector;
> -	conf->current_window_requests +=
> -		conf->next_window_requests;
> -	conf->next_window_requests = 0;
> -	spin_unlock_irq(&conf->resync_lock);
>  }
>  
>  static int raid1_spare_active(struct mddev *mddev)
> @@ -2311,8 +2378,9 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
>  
>  static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>  {
> -	int m;
> +	int m, idx;
>  	bool fail = false;
> +
>  	for (m = 0; m < conf->raid_disks * 2 ; m++)
>  		if (r1_bio->bios[m] == IO_MADE_GOOD) {
>  			struct md_rdev *rdev = conf->mirrors[m].rdev;
> @@ -2338,7 +2406,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>  	if (fail) {
>  		spin_lock_irq(&conf->device_lock);
>  		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
> -		conf->nr_queued++;
> +		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
> +		conf->nr_queued[idx]++;
>  		spin_unlock_irq(&conf->device_lock);
>  		md_wakeup_thread(conf->mddev->thread);
>  	} else {
> @@ -2460,6 +2529,7 @@ static void raid1d(struct md_thread *thread)
>  	struct r1conf *conf = mddev->private;
>  	struct list_head *head = &conf->retry_list;
>  	struct blk_plug plug;
> +	int idx;
>  
>  	md_check_recovery(mddev);
>  
> @@ -2467,17 +2537,18 @@ static void raid1d(struct md_thread *thread)
>  	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
>  		LIST_HEAD(tmp);
>  		spin_lock_irqsave(&conf->device_lock, flags);
> -		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
> -			while (!list_empty(&conf->bio_end_io_list)) {
> -				list_move(conf->bio_end_io_list.prev, &tmp);
> -				conf->nr_queued--;
> -			}
> -		}
> +		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
> +			list_splice_init(&conf->bio_end_io_list, &tmp);
>  		spin_unlock_irqrestore(&conf->device_lock, flags);
>  		while (!list_empty(&tmp)) {
>  			r1_bio = list_first_entry(&tmp, struct r1bio,
>  						  retry_list);
>  			list_del(&r1_bio->retry_list);
> +			idx = hash_long(r1_bio->sector,
> +					BARRIER_BUCKETS_NR_BITS);
> +			spin_lock_irqsave(&conf->device_lock, flags);
> +			conf->nr_queued[idx]--;
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
>  			if (mddev->degraded)
>  				set_bit(R1BIO_Degraded, &r1_bio->state);
>  			if (test_bit(R1BIO_WriteError, &r1_bio->state))
> @@ -2498,7 +2569,8 @@ static void raid1d(struct md_thread *thread)
>  		}
>  		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
>  		list_del(head->prev);
> -		conf->nr_queued--;
> +		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
> +		conf->nr_queued[idx]--;
>  		spin_unlock_irqrestore(&conf->device_lock, flags);
>  
>  		mddev = r1_bio->mddev;
> @@ -2537,7 +2609,6 @@ static int init_resync(struct r1conf *conf)
>  					  conf->poolinfo);
>  	if (!conf->r1buf_pool)
>  		return -ENOMEM;
> -	conf->next_resync = 0;
>  	return 0;
>  }
>  
> @@ -2566,6 +2637,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  	int still_degraded = 0;
>  	int good_sectors = RESYNC_SECTORS;
>  	int min_bad = 0; /* number of sectors that are bad in all devices */
> +	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>  
>  	if (!conf->r1buf_pool)
>  		if (init_resync(conf))
> @@ -2615,7 +2687,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  	 * If there is non-resync activity waiting for a turn, then let it
>  	 * though before starting on this new sync request.
>  	 */
> -	if (conf->nr_waiting)
> +	if (conf->nr_waiting[idx])
>  		schedule_timeout_uninterruptible(1);
>  
>  	/* we are incrementing sector_nr below. To be safe, we check against
> @@ -2642,6 +2714,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  	r1_bio->sector = sector_nr;
>  	r1_bio->state = 0;
>  	set_bit(R1BIO_IsSync, &r1_bio->state);
> +	/* make sure good_sectors won't go across barrier unit boundary */
> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>  
>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>  		struct md_rdev *rdev;
> @@ -2927,9 +3001,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
>  	conf->pending_count = 0;
>  	conf->recovery_disabled = mddev->recovery_disabled - 1;
>  
> -	conf->start_next_window = MaxSector;
> -	conf->current_window_requests = conf->next_window_requests = 0;
> -
>  	err = -EIO;
>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>  
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index c52ef42..817115d 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -1,6 +1,14 @@
>  #ifndef _RAID1_H
>  #define _RAID1_H
>  
> +/* each barrier unit size is 64MB fow now
> + * note: it must be larger than RESYNC_DEPTH
> + */
> +#define BARRIER_UNIT_SECTOR_BITS	17
> +#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
> +#define BARRIER_BUCKETS_NR_BITS		9
> +#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)
> +
>  struct raid1_info {
>  	struct md_rdev	*rdev;
>  	sector_t	head_position;
> @@ -35,25 +43,6 @@ struct r1conf {
>  						 */
>  	int			raid_disks;
>  
> -	/* During resync, read_balancing is only allowed on the part
> -	 * of the array that has been resynced.  'next_resync' tells us
> -	 * where that is.
> -	 */
> -	sector_t		next_resync;
> -
> -	/* When raid1 starts resync, we divide array into four partitions
> -	 * |---------|--------------|---------------------|-------------|
> -	 *        next_resync   start_next_window       end_window
> -	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
> -	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
> -	 * current_window_requests means the count of normalIO between
> -	 *   start_next_window and end_window.
> -	 * next_window_requests means the count of normalIO after end_window.
> -	 * */
> -	sector_t		start_next_window;
> -	int			current_window_requests;
> -	int			next_window_requests;
> -
>  	spinlock_t		device_lock;
>  
>  	/* list of 'struct r1bio' that need to be processed by raid1d,
> @@ -79,10 +68,11 @@ struct r1conf {
>  	 */
>  	wait_queue_head_t	wait_barrier;
>  	spinlock_t		resync_lock;
> -	int			nr_pending;
> -	int			nr_waiting;
> -	int			nr_queued;
> -	int			barrier;
> +	int			nr_pending[BARRIER_BUCKETS_NR];
> +	int			nr_waiting[BARRIER_BUCKETS_NR];
> +	int			nr_queued[BARRIER_BUCKETS_NR];
> +	int			barrier[BARRIER_BUCKETS_NR];
> +	int			total_barriers;
>  	int			array_frozen;
>  
>  	/* Set to 1 if a full sync is needed, (fresh device added).
> @@ -135,7 +125,6 @@ struct r1bio {
>  						 * in this BehindIO request
>  						 */
>  	sector_t		sector;
> -	sector_t		start_next_window;
>  	int			sectors;
>  	unsigned long		state;
>  	struct mddev		*mddev;
> -- 
> 2.6.6


Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2016-12-27 15:47 ` [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
  2017-01-04 19:59   ` Shaohua Li
@ 2017-01-06  1:52   ` NeilBrown
  1 sibling, 0 replies; 8+ messages in thread
From: NeilBrown @ 2017-01-06  1:52 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 18858 bytes --]

On Wed, Dec 28 2016, Coly Li wrote:

> When I run a parallel read performance test on an md raid1 device with
> two NVMe SSDs, I observe surprisingly bad throughput: with fio, 64KB
> block size, 40 seq read I/O jobs and 128 iodepth, overall throughput is
> only 2.7GB/s, which is around 50% of the ideal performance number.
>
> The perf reports locking contention happens at allow_barrier() and
> wait_barrier() code,
>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
>          + 89.92% allow_barrier
>          + 9.34% __wake_up
>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>    - _raw_spin_lock_irq
>          - 100.00% wait_barrier
>
> The reason is, in these I/O barrier related functions,
>  - raise_barrier()
>  - lower_barrier()
>  - wait_barrier()
>  - allow_barrier()
> They always hold conf->resync_lock first, even when there are only regular
> read I/Os and no resync I/O at all. This is a huge performance penalty.
>
> The solution is a lockless-like algorithm in I/O barrier code, and only
> holding conf->resync_lock when it is really necessary.
>
> The original idea is from Hannes Reinecke, and Neil Brown provides
> comments to improve it. Now I write the patch based on new simpler raid1
> I/O barrier code.
>
> In the new simpler raid1 I/O barrier implementation, there are two
> wait barrier functions,
>  - wait_barrier()
>    Which in turn calls _wait_barrier(), is used for regular write I/O.
>    If there is resync I/O happening on the same barrier bucket index, or
>    the whole array is frozen, the task will wait until there is no barrier
>    on the same bucket index, or the whole array is unfrozen.
>  - wait_read_barrier()
>    Since regular read I/O won't interfere with resync I/O (read_balance()
>    will make sure only up-to-date data is read out), it is unnecessary to
>    wait for a barrier in regular read I/Os; they only have to wait when
>    the whole array is frozen.
> The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
> barrier[idx] are very carefully designed in raise_barrier(),
> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
> avoid unnecessary spin locks in these functions. Once conf->
> nr_pending[idx] is increased, a resync I/O with the same barrier bucket index
> has to wait in raise_barrier(). Then in _wait_barrier() or
> wait_read_barrier() if no barrier raised in same barrier bucket index or
> array is not frozen, the regular I/O doesn't need to hold conf->
> resync_lock, it can just increase conf->nr_pending[idx], and return to its
> caller. For heavy parallel read I/Os, the lockless I/O barrier code
> gets rid of almost all of the spin lock cost.
>
> This patch significantly improves raid1 read performance. In my
> testing, on a raid1 device built from two NVMe SSDs, running fio with 64KB
> blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
> increases from 2.7GB/s to 4.6GB/s (+70%).
>
> Open question:
> Shaohua points out the memory barrier should be added to some atomic
> operations. Now I am reading the documentation to learn how to add the
> memory barriers correctly. Anyway, if anyone has suggestions, please don't
> hesitate to let me know.

When converting code from the use of spinlocks to the use of atomics,
the most important consideration is to understand the changed ordering
requirements.
When spinlocks are used, a sequence of operations within a spinlocked
region is indivisible with respect to other threads running spinlocked
code, so the order within those regions is not relevant.  When you drop
the spinlocks, the order might become relevant, and so needs to be
understood.
The most obvious ordering requirement that your code shows is in
the need to move the increment of nr_pending[] before checking for
barrier[] to be zero.  There might be others.

Once you understand the ordering requirements, you can then determine
if any barriers might be needed to make sure the ordering is globally
visible.

>
> Changelog
> V1:
> - Original RFC patch for comments.
> V2:
> - Remove a spin_lock/unlock pair in raid1d().
> - Add more code comments to explain why there is no race when checking two
>   atomic_t variables at the same time.
>
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/raid1.c | 134 +++++++++++++++++++++++++++++++----------------------
>  drivers/md/raid1.h |  12 ++---
>  2 files changed, 85 insertions(+), 61 deletions(-)
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 5813656..b1fb4c1 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -222,7 +222,7 @@ static void reschedule_retry(struct r1bio *r1_bio)
>  	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
>  	spin_lock_irqsave(&conf->device_lock, flags);
>  	list_add(&r1_bio->retry_list, &conf->retry_list);
> -	conf->nr_queued[idx]++;
> +	atomic_inc(&conf->nr_queued[idx]);
>  	spin_unlock_irqrestore(&conf->device_lock, flags);

nr_queued is only tested in freeze_array().
freeze_array() might be waiting for nr_queued to be incremented so that
it matches nr_pending.  That implies that all pending requests are
in a queue, and so are not active.
So the increment must happen before the wake_up.  That is the only
ordering requirement on nr_queued.
In every case, nr_queued is still incremented inside the
device_lock spinlocked region, so the spin_unlock() will ensure the new
value is visible.
There is one case (in raid1d) where nr_queued is decremented outside of
any spinlock, but nothing is waiting for nr_queued to be decremented,
so that cannot matter.

I do wonder if nr_queued really needs to be an atomic_t.  It would be
quite easy to leave it as a simple 'int'.  Maybe not helpful though.


>  
>  	wake_up(&conf->wait_barrier);
> @@ -831,13 +831,13 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>  	spin_lock_irq(&conf->resync_lock);
>  
>  	/* Wait until no block IO is waiting */
> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
> +	wait_event_lock_irq(conf->wait_barrier,
> +			    !atomic_read(&conf->nr_waiting[idx]),
>  			    conf->resync_lock);
>  
>  	/* block any new IO from starting */
> -	conf->barrier[idx]++;
> -	conf->total_barriers++;
> -
> +	atomic_inc(&conf->barrier[idx]);
> +	atomic_inc(&conf->total_barriers);
>  	/* For these conditions we must wait:
>  	 * A: while the array is in frozen state
>  	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
> @@ -846,44 +846,69 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>  	 *    the max count which allowed on the whole raid1 device.
>  	 */
>  	wait_event_lock_irq(conf->wait_barrier,
> -			    !conf->array_frozen &&
> -			     !conf->nr_pending[idx] &&
> -			     conf->total_barriers < RESYNC_DEPTH,
> +			    !atomic_read(&conf->array_frozen) &&
> +			     !atomic_read(&conf->nr_pending[idx]) &&
> +			     atomic_read(&conf->total_barriers) < RESYNC_DEPTH,
>  			    conf->resync_lock);
>  
> -	conf->nr_pending[idx]++;
> +	atomic_inc(&conf->nr_pending[idx]);
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
>  static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
> -	unsigned long flags;
>  	int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>  
> -	BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0);
> -
> -	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->barrier[idx]--;
> -	conf->total_barriers--;
> -	conf->nr_pending[idx]--;
> -	spin_unlock_irqrestore(&conf->resync_lock, flags);
> +	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
> +	BUG_ON(atomic_read(&conf->total_barriers) <= 0);
> +	atomic_dec(&conf->barrier[idx]);
> +	atomic_dec(&conf->total_barriers);
> +	atomic_dec(&conf->nr_pending[idx]);
>  	wake_up(&conf->wait_barrier);

This is the first place where you remove a spin_lock call, so it
is worth looking at closely.

The only code that really cares about the ->barrier and ->nr_pending
values is the wait_event*() calls in freeze_array(), raise_barrier(),
wait_barrier(). As long as the change happens *before* the wake_up()
these changes are safe, and the wake_up() provides all barriers that
were needed.

>  }
>  
>  static void _wait_barrier(struct r1conf *conf, int idx)
>  {
> -	spin_lock_irq(&conf->resync_lock);
> -	if (conf->array_frozen || conf->barrier[idx]) {
> -		conf->nr_waiting[idx]++;
> -		/* Wait for the barrier to drop. */
> -		wait_event_lock_irq(
> -			conf->wait_barrier,
> -			!conf->array_frozen && !conf->barrier[idx],
> -			conf->resync_lock);
> -		conf->nr_waiting[idx]--;
> -	}
> +	/* We need to increase conf->nr_pending[idx] very early here,
> +	 * then raise_barrier() can be blocked when it waits for
> +	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
> +	 * conf->resync_lock when there is no barrier raised in same
> +	 * barrier unit bucket. Also if the array is frozen, I/O
> +	 * should be blocked until array is unfrozen.
> +	 */
> +	atomic_inc(&conf->nr_pending[idx]);

It is important that this atomic_inc() happens *before* the
atomic_read() of ->barrier below.
If the order is reversed then immediately after we read ->barrier,
raise_barrier() might increment ->barrier, then test ->nr_pending and find
it to be zero - just before we increment it.
In that case, both wait_barrier() and raise_barrier() would proceed
without waiting.
So it is important that the atomic_inc() is visible to raise_barrier()
before we read conf->barrier.
It is equally important that the atomic_inc of conf->barrer in
raise_barrier() is visible here before raise_barrier() reads
conf->nr_pending.

It would be nice if we could call
         atomic_inc_acquire(&conf->nr_pending[idx]);

but there is no atomic_inc_acquire().  The closest is
atomic_inc_return_acquire(), but we don't want the return value.

We could either use it, and just ignore the return value, or
put an explicit smp_mb__after_atomic() after the atomic_inc().

Whatever is done here should also be done in raise_barrier().
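
Something like this (sketch) in _wait_barrier(), mirrored after the
atomic_inc(&conf->barrier[idx]) in raise_barrier():

	atomic_inc(&conf->nr_pending[idx]);
	/*
	 * Make the increment of nr_pending[idx] visible before we read
	 * ->barrier[idx]; pairs with the equivalent barrier after the
	 * increment of ->barrier[idx] in raise_barrier().
	 */
	smp_mb__after_atomic();

	if (!atomic_read(&conf->array_frozen) &&
	    !atomic_read(&conf->barrier[idx]))
		return;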


> +
> +	/* Don't worry about checking two atomic_t variables at same time
> +	 * here. conf->array_frozen MUST be checked firstly, The logic is,
> +	 * if the array is frozen, no matter there is any barrier or not,
> +	 * all I/O should be blocked. If there is no barrier in current
> +	 * barrier bucket, we still need to check whether the array is frozen,
> +	 * otherwise I/O will happen on frozen array, that's buggy.
> +	 * If during we check conf->barrier[idx], the array is frozen (a.k.a
> +	 * conf->array_frozen is set), and chonf->barrier[idx] is 0, it is
> +	 * safe to return and make the I/O continue. Because the array is
> +	 * frozen, all I/O returned here will eventually complete or be
> +	 * queued, see code comment in frozen_array().
> +	 */
> +	if (!atomic_read(&conf->array_frozen) &&
> +	    !atomic_read(&conf->barrier[idx]))
> +		return;
>  
> -	conf->nr_pending[idx]++;
> +	/* After holding conf->resync_lock, conf->nr_pending[idx]
> +	 * should be decreased before waiting for barrier to drop.
> +	 * Otherwise, we may encounter a race condition because
> +	 * raise_barrer() might be waiting for conf->nr_pending[idx]
> +	 * to be 0 at same time.
> +	 */

You say "After holding conf->resync_lock", but you aren't holding
conf->resync_lock here.

> +	atomic_inc(&conf->nr_waiting[idx]);
> +	atomic_dec(&conf->nr_pending[idx]);

There is an ordering dependency between this atomic_inc() and the
atomic_dec().  If the spinlock is held here (which I think is your
intention) then you won't need any extra barrier.

> +	/* Wait for the barrier in same barrier unit bucket to drop. */
> +	wait_event_lock_irq(conf->wait_barrier,
> +			    !atomic_read(&conf->array_frozen) &&
> +			     !atomic_read(&conf->barrier[idx]),
> +			    conf->resync_lock);
> +	atomic_inc(&conf->nr_pending[idx]);
> +	atomic_dec(&conf->nr_waiting[idx]);
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> @@ -891,18 +916,23 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
>  	long idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>  
> -	spin_lock_irq(&conf->resync_lock);
> -	if (conf->array_frozen) {
> -		conf->nr_waiting[idx]++;
> -		/* Wait for array to unfreeze */
> -		wait_event_lock_irq(
> -			conf->wait_barrier,
> -			!conf->array_frozen,
> -			conf->resync_lock);
> -		conf->nr_waiting[idx]--;
> -	}
> +	/* Very similar to _wait_barrier(). The difference is, for read
> +	 * I/O we don't need to wait for sync I/O, but if the whole array
> +	 * is frozen, the read I/O still has to wait until the array is
> +	 * unfrozen.
> +	 */
> +	atomic_inc(&conf->nr_pending[idx]);

I think the atomic_inc() above should be atomic_inc_return_acquire()
(or an atomic_inc() plus an explicit barrier) too.

> +	if (!atomic_read(&conf->array_frozen))
> +		return;
>  
> -	conf->nr_pending[idx]++;

Again, a spin_lock_irq(&conf->resync_lock) is missing here.

> +	atomic_inc(&conf->nr_waiting[idx]);
> +	atomic_dec(&conf->nr_pending[idx]);
> +	/* Wait for array to be unfrozen */
> +	wait_event_lock_irq(conf->wait_barrier,
> +			    !atomic_read(&conf->array_frozen),
> +			    conf->resync_lock);
> +	atomic_inc(&conf->nr_pending[idx]);
> +	atomic_dec(&conf->nr_waiting[idx]);
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> @@ -923,11 +953,7 @@ static void wait_all_barriers(struct r1conf *conf)
>  
>  static void _allow_barrier(struct r1conf *conf, int idx)
>  {
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->nr_pending[idx]--;
> -	spin_unlock_irqrestore(&conf->resync_lock, flags);
> +	atomic_dec(&conf->nr_pending[idx]);
>  	wake_up(&conf->wait_barrier);
>  }
>  
> @@ -952,7 +978,7 @@ static int get_all_pendings(struct r1conf *conf)
>  	int idx, ret;
>  
>  	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> -		ret += conf->nr_pending[idx];
> +		ret += atomic_read(&conf->nr_pending[idx]);
>  	return ret;
>  }
>  
> @@ -962,7 +988,7 @@ static int get_all_queued(struct r1conf *conf)
>  	int idx, ret;
>  
>  	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> -		ret += conf->nr_queued[idx];
> +		ret += atomic_read(&conf->nr_queued[idx]);
>  	return ret;
>  }

I don't really like these get_all_pendings() and get_all_queued()
helpers, but I can see why they are needed.
Maybe just have a single get_unqueued_pending() which does:

  	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
		ret += atomic_read(&conf->nr_pending[idx]) - atomic_read(&conf->nr_queued[idx]);

then have freeze_array() wait for  get_unqueued_pending() == extra.
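
I.e. something like (just a sketch):

	static int get_unqueued_pending(struct r1conf *conf)
	{
		int idx, ret;

		for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
			ret += atomic_read(&conf->nr_pending[idx]) -
			       atomic_read(&conf->nr_queued[idx]);
		return ret;
	}

and in freeze_array():

	wait_event_lock_irq_cmd(conf->wait_barrier,
				get_unqueued_pending(conf) == extra,
				conf->resync_lock,
				flush_pending_writes(conf));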


>  
> @@ -992,7 +1018,7 @@ static void freeze_array(struct r1conf *conf, int extra)
>  	 * normal I/O context, extra is 1, in rested situations extra is 0.
>  	 */
>  	spin_lock_irq(&conf->resync_lock);
> -	conf->array_frozen = 1;
> +	atomic_set(&conf->array_frozen, 1);

There is no reason for ->array_frozen to be an atomic_t.
Just leave it as an 'int'.
You may still need to think about what ordering dependencies it has.

>  	raid1_log(conf->mddev, "wait freeze");
>  	wait_event_lock_irq_cmd(
>  		conf->wait_barrier,
> @@ -1005,7 +1031,7 @@ static void unfreeze_array(struct r1conf *conf)
>  {
>  	/* reverse the effect of the freeze */
>  	spin_lock_irq(&conf->resync_lock);
> -	conf->array_frozen = 0;
> +	atomic_set(&conf->array_frozen, 0);
>  	wake_up(&conf->wait_barrier);
>  	spin_unlock_irq(&conf->resync_lock);
>  }
> @@ -2407,7 +2433,7 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>  		spin_lock_irq(&conf->device_lock);
>  		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
>  		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
> -		conf->nr_queued[idx]++;
> +		atomic_inc(&conf->nr_queued[idx]);
>  		spin_unlock_irq(&conf->device_lock);
>  		md_wakeup_thread(conf->mddev->thread);
>  	} else {
> @@ -2546,9 +2572,7 @@ static void raid1d(struct md_thread *thread)
>  			list_del(&r1_bio->retry_list);
>  			idx = hash_long(r1_bio->sector,
>  					BARRIER_BUCKETS_NR_BITS);
> -			spin_lock_irqsave(&conf->device_lock, flags);
> -			conf->nr_queued[idx]--;
> -			spin_unlock_irqrestore(&conf->device_lock, flags);
> +			atomic_dec(&conf->nr_queued[idx]);
>  			if (mddev->degraded)
>  				set_bit(R1BIO_Degraded, &r1_bio->state);
>  			if (test_bit(R1BIO_WriteError, &r1_bio->state))
> @@ -2570,7 +2594,7 @@ static void raid1d(struct md_thread *thread)
>  		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
>  		list_del(head->prev);
>  		idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS);
> -		conf->nr_queued[idx]--;
> +		atomic_dec(&conf->nr_queued[idx]);
>  		spin_unlock_irqrestore(&conf->device_lock, flags);
>  
>  		mddev = r1_bio->mddev;
> @@ -2687,7 +2711,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
>  	 * If there is non-resync activity waiting for a turn, then let it
>  	 * though before starting on this new sync request.
>  	 */
> -	if (conf->nr_waiting[idx])
> +	if (atomic_read(&conf->nr_waiting[idx]))
>  		schedule_timeout_uninterruptible(1);
>  
>  	/* we are incrementing sector_nr below. To be safe, we check against
> @@ -3316,7 +3340,7 @@ static void *raid1_takeover(struct mddev *mddev)
>  		conf = setup_conf(mddev);
>  		if (!IS_ERR(conf)) {
>  			/* Array must appear to be quiesced */
> -			conf->array_frozen = 1;
> +			atomic_set(&conf->array_frozen, 1);
>  			clear_bit(MD_HAS_JOURNAL, &mddev->flags);
>  			clear_bit(MD_JOURNAL_CLEAN, &mddev->flags);
>  		}
> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index 817115d..bbe65f7 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -68,12 +68,12 @@ struct r1conf {
>  	 */
>  	wait_queue_head_t	wait_barrier;
>  	spinlock_t		resync_lock;
> -	int			nr_pending[BARRIER_BUCKETS_NR];
> -	int			nr_waiting[BARRIER_BUCKETS_NR];
> -	int			nr_queued[BARRIER_BUCKETS_NR];
> -	int			barrier[BARRIER_BUCKETS_NR];
> -	int			total_barriers;
> -	int			array_frozen;
> +	atomic_t		nr_pending[BARRIER_BUCKETS_NR];
> +	atomic_t		nr_waiting[BARRIER_BUCKETS_NR];
> +	atomic_t		nr_queued[BARRIER_BUCKETS_NR];
> +	atomic_t		barrier[BARRIER_BUCKETS_NR];
> +	atomic_t		total_barriers;
> +	atomic_t		array_frozen;
>  
>  	/* Set to 1 if a full sync is needed, (fresh device added).
>  	 * Cleared when a sync completes.
> -- 
> 2.6.6

Thanks,
NeilBrown



* Re: [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-01-04 19:35 ` [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
@ 2017-01-16  6:08   ` Coly Li
  0 siblings, 0 replies; 8+ messages in thread
From: Coly Li @ 2017-01-16  6:08 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On 2017/1/5 3:35 AM, Shaohua Li wrote:
> On Tue, Dec 27, 2016 at 11:47:37PM +0800, Coly Li wrote:
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> I/O barriers to happen only inside a slidingresync window, for regular
>> I/Os out of this resync window they don't need to wait for barrier any
>> more. On large raid1 device, it helps a lot to improve parallel writing
>> I/O throughput when there are background resync I/Os performing at
>> same time.
>>
>> The idea of sliding resync widow is awesome, but there are several
>> challenges are very difficult to solve,
>>  - code complexity
>>    Sliding resync window requires several veriables to work collectively,
>>    this is complexed and very hard to make it work correctly. Just grep
>>    "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>>    the original resync window patch. This is not the end, any further
>>    related modification may easily introduce more regreassion.
>>  - multiple sliding resync windows
>>    Currently raid1 code only has a single sliding resync window, we cannot
>>    do parallel resync with current I/O barrier implementation.
>>    Implementing multiple resync windows are much more complexed, and very
>>    hard to make it correctly.
>>
>> Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> removing resync window code, I believe life will be much easier.
>>
>> The brief idea of the simpler barrier is,
>>  - Do not maintain a logbal unique resync window
>>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>>    I/O only has to wait for a resync I/O when both them have same barrier
>>    bucket index, vice versa.
>>  - I/O barrier can be recuded to an acceptable number if there are enought
>>    barrier buckets
>>
>> Here I explain how the barrier buckets are designed,
>>  - BARRIER_UNIT_SECTOR_SIZE
>>    The whole LBA address space of a raid1 device is divided into multiple
>>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>>    Bio request won't go across border of barrier unit size, that means
>>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
>>  - BARRIER_BUCKETS_NR
>>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>>         #define BARRIER_BUCKETS_NR_BITS   9
>>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>>    if multiple I/O requests hit different barrier units, they only need
>>    to compete I/O barrier with other I/Os which hit the same barrier
>>    bucket index with each other. The index of a barrier bucket which a
>>    bio should look for is calculated by,
>>         int idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)
>>    that sector_nr is the start sector number of a bio. We use function
>>    align_to_barrier_unit_end() to calculate sectors number from sector_nr
>>    to the next barrier unit size boundary, if the requesting bio size
>>    goes across the boundary, we split the bio in raid1_make_request(), to
>>    make sure the finall bio sent into generic_make_request() won't exceed
>>    barrier unit boundary.
>>
>> Comparing to single sliding resync window,
>>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>>    will have confliction within a single barrier units. So it is similar to
>>    single sliding resync window.
>>  - But a barrier unit bucket is shared by all barrier units with identical
>>    barrier uinit index, the probability of confliction might be higher
>>    than single sliding resync window, in condition that writing I/Os
>>    always hit barrier units which have identical barrier bucket index with
>>    the resync I/Os. This is a very rare condition in real I/O work loads,
>>    I cannot imagine how it could happen in practice.
>>  - Therefore we can achieve a good enough low confliction rate with much
>>    simpler barrier algorithm and implementation.
>>
>> If user has a (realy) large raid1 device, for example 10PB size, we may
>> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
>> it is possible to be a raid1-created-time-defined variable in future.
>>
>> There are two changes should be noticed,
>>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>>    single loop, it looks like this,
>>         spin_lock_irqsave(&conf->device_lock, flags);
>>         conf->nr_queued[idx]--;
>>         spin_unlock_irqrestore(&conf->device_lock, flags);
>>    This change generates more spin lock operations, but in next patch of
>>    this patch set, it will be replaced by a single line code,
>>         atomic_dec(conf->nr_queueud[idx]);
>>    So we don't need to worry about spin lock cost here.
>>  - Original function raid1_make_request() is split into two functions,
>>    - raid1_make_read_request(): handles regular read request and calls
>>      wait_read_barrier() for I/O barrier.
>>    - raid1_make_write_request(): handles regular write request and calls
>>      wait_barrier() for I/O barrier.
>>    The differnece is wait_read_barrier() only waits if array is frozen,
>>    using different barrier function in different code path makes the code
>>    more clean and easy to read.
>>  - align_to_barrier_unit_end() is called to make sure both regular and
>>    resync I/O won't go across a barrier unit boundary.
>>
>> Changelog
>> V1:
>> - Original RFC patch for comments
>> V2:
>> - Use bio_split() to split the orignal bio if it goes across barrier unit
>>   bounday, to make the code more simple, by suggestion from Shaohua and
>>   Neil.
>> - Use hash_long() to replace original linear hash, to avoid a possible
>>   confilict between resync I/O and sequential write I/O, by suggestion from
>>   Shaohua.
>> - Add conf->total_barriers to record barrier depth, which is used to
>>   control number of parallel sync I/O barriers, by suggestion from Shaohua.
>> - In V1 patch the bellowed barrier buckets related members in r1conf are
>>   allocated in memory page. To make the code more simple, V2 patch moves
>>   the memory space into struct r1conf, like this,
>>         -       int                     nr_pending;
>>         -       int                     nr_waiting;
>>         -       int                     nr_queued;
>>         -       int                     barrier;
>>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>>         +       int                     barrier[BARRIER_BUCKETS_NR];
>>   This change is by the suggestion from Shaohua.
>> - Remove some inrelavent code comments, by suggestion from Guoqing.
>> - Add a missing wait_barrier() before jumping to retry_write, in
>>   raid1_make_write_request().
>>
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Shaohua Li <shli@fb.com>
>> Cc: Neil Brown <neilb@suse.de>
>> Cc: Johannes Thumshirn <jthumshirn@suse.de>
>> Cc: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>  
>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>> +					  sector_t sectors)
>> +{
>> +	sector_t len;
>> +
>> +	WARN_ON(sectors == 0);
>> +	/* len is the number of sectors from start_sector to end of the
>> +	 * barrier unit which start_sector belongs to.
>> +	 */
> 
> The correct format for comments is:
> /*
>  * something
>  */
> 

Copied, I will modify this.

> There are some other places with the same issue
> 
>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>> +	      start_sector;
> 
> This one makes me nervous. shouldn't this be:
>  +	len = ((start_sector +  (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>  +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>  +	      start_sector;
> 

If start_sector is aligned to the barrier unit size, the above
modification will assign 0 to len. But in that case len should be
BARRIER_UNIT_SECTOR_SIZE.

> And you can use round_up()

round_up() has a similar problem. For example, if we use
	len = round_up(start_sector, BARRIER_UNIT_SECTOR_SIZE) -
	      start_sector,
and start_sector is 0, round_up() will return 0, and len will be 0 as
well. But in this case the correct value of len should be
BARRIER_UNIT_SECTOR_SIZE.
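
To make the corner case concrete (my own numbers, with
BARRIER_UNIT_SECTOR_SIZE = 1<<17 = 131072 sectors), the suggested
change evaluates to:

	start_sector = 131072 (already aligned), sectors = 8:
	((131072 + 131072 - 1) & ~(131072 - 1)) - 131072 = 131072 - 131072 = 0

so len would be 0, although the correct distance from an aligned
start_sector to the end of its barrier unit is a full unit
(131072 sectors), which is then capped to sectors (8).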

>>  
>> -static void raid1_make_request(struct mddev *mddev, struct bio * bio)
>> +static void raid1_make_read_request(struct mddev *mddev, struct bio *bio)
>>  {
> 
> Please rebase the patches to latest md-next. The raid1_make_request already
> split for read/write code path recently.
> 

Yes, I will do it.

> Otherwise, the patch looks good. After these are fixed, I'll add it for 4.11
> 

I will send out another version, with review comments from you and Neil.

Thanks!

Coly







* Re: [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-01-05 23:08 ` NeilBrown
@ 2017-01-16  9:06   ` Coly Li
  0 siblings, 0 replies; 8+ messages in thread
From: Coly Li @ 2017-01-16  9:06 UTC (permalink / raw)
  To: NeilBrown
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On 2017/1/6 7:08 AM, NeilBrown wrote:
> On Wed, Dec 28 2016, Coly Li wrote:
> 
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of
>> iobarrier.")' introduces a sliding resync window for raid1 I/O
>> barrier, this idea limits I/O barriers to happen only inside a
>> slidingresync window, for regular I/Os out of this resync window
>> they don't need to wait for barrier any more. On large raid1
>> device, it helps a lot to improve parallel writing I/O throughput
>> when there are background resync I/Os performing at same time.
>> 
>> The idea of sliding resync widow is awesome, but there are
>> several challenges are very difficult to solve, - code
>> complexity Sliding resync window requires several veriables to
>> work collectively, this is complexed and very hard to make it
>> work correctly. Just grep "Fixes: 79ef3a8aa1" in kernel git log,
>> there are 8 more patches to fix the original resync window patch.
>> This is not the end, any further related modification may easily
>> introduce more regreassion. - multiple sliding resync windows 
>> Currently raid1 code only has a single sliding resync window, we
>> cannot do parallel resync with current I/O barrier
>> implementation. Implementing multiple resync windows are much
>> more complexed, and very hard to make it correctly.
> 
> I think I've asked this before, but why do you think that parallel 
> resync might ever be a useful idea?  I don't think it makes any
> sense, so it is wrong for you use it as part of the justification
> for this patch. Just don't mention it at all unless you have a
> genuine expectation that it would really be a good thing, in which
> case: explain the value.
> 

I will remove this from the patch log. Thanks for your suggestion.


>> 
>> Therefore I decide to implement a much simpler raid1 I/O barrier,
>> by removing resync window code, I believe life will be much
>> easier.
>> 
>> The brief idea of the simpler barrier is, - Do not maintain a
>> logbal unique resync window - Use multiple hash buckets to reduce
>> I/O barrier conflictions, regular I/O only has to wait for a
>> resync I/O when both them have same barrier bucket index, vice
>> versa. - I/O barrier can be recuded to an acceptable number if
>> there are enought barrier buckets
>> 
>> Here I explain how the barrier buckets are designed, -
>> BARRIER_UNIT_SECTOR_SIZE The whole LBA address space of a raid1
>> device is divided into multiple barrier units, by the size of
>> BARRIER_UNIT_SECTOR_SIZE. Bio request won't go across border of
>> barrier unit size, that means maximum bio size is
>> BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
> 
> It would be good to say here what number you chose, and why you
> chose it. You have picked 64MB.  This divides a 1TB device into
> 4096 regions. Any write request must fit into one of these regions,
> so we mustn't make the region too small, else we would get the
> benefits for sending large requests down.
> 
> We want the resync to move from region to region fairly quickly so
> that the slowness caused by having to synchronize with the resync
> is averaged out overa fairly small time frame.  At full speed, 64MB
> should take less than 1 second.  When resync is competing with
> other IO, it could easily take up to a minute(?).  I think that is
> a fairly good range.
> 
> So I think 64MB is probably a very good choice.  I just would like
> to see the justification clearly stated.

I see, I will add text to explain why I chose a 64MB bucket unit size.

A reason for 64MB is just as you mentioned: it is the trade-off
between memory consumption and hash conflict rate. I did some
calculations for md raid1 target sizes from 1TB to 10PB, and for the
maximum I/O throughput of NVMe SSDs; finally I decided that a bucket
unit size between 64~128MB and a bucket number between 512~1024 are
proper numbers.

I will explain the calculation in detail in the next version patch.
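
For illustration, the kind of back-of-the-envelope numbers involved
(not the exact calculation, which will go into the patch log):

 - memory cost: one counter per bucket per array, so 512 buckets x 4
   bytes = 2KB per array, 8KB for the four arrays (nr_pending,
   nr_waiting, nr_queued, barrier); with 1024 buckets each array is
   exactly one 4KB page.
 - conflict rate: resync raises a barrier on one bucket at a time, so
   a regular write landing in a different barrier unit collides with
   the resync bucket with probability about 1/512 (~0.2%), or 1/1024
   (~0.1%) with 1024 buckets.
 - scale: a 10PB array has 10PB / 64MB = ~160 million barrier units;
   with 512 buckets each bucket covers ~320 thousand of them, and
   hash_long() spreads them evenly across the buckets.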

> 
>> - BARRIER_BUCKETS_NR There are BARRIER_BUCKETS_NR buckets in
>> total, which is defined by, #define BARRIER_BUCKETS_NR_BITS   9 
>> #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
> 
> Why 512 buckets?  What are the tradeoffs? More buckets means more
> memory consumed for counters. Fewer buckets means more false
> sharing. With 512 buckets, a request which is smaller than the
> region size has a 0.2% chance of having to wait for resync to
> pause.  I think that is quite a small enough fraction. I think you
> originally chose the number of buckets so that a set of 4-byte
> counters fits exactly into a page.  I think that is still a good 
> guideline, so I would have #define BARRIER_BUCKETS_NR_BITS
> (PAGE_SHIFT - 2) (which makes it 10 ...).
> 

Good suggestion, 1024 buckets means fewer hash conflicts in each
bucket. I will change it in the next version patch.
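
Something like the following should do it (a sketch, assuming 4-byte
counters and 4KB pages):

	/* one page of counters per array */
	#define BARRIER_BUCKETS_NR_BITS	(PAGE_SHIFT - 2)
	#define BARRIER_BUCKETS_NR	(1<<BARRIER_BUCKETS_NR_BITS)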


>> if multiple I/O requests hit different barrier units, they only
>> need to compete I/O barrier with other I/Os which hit the same
>> barrier bucket index with each other. The index of a barrier
>> bucket which a bio should look for is calculated by, int idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS)
> 
> This isn't right.  You have to divide by BARRIER_UNIT_SECTOR_SIZE
> first. int idx = hash_long(sector_nr >> BARRIER_UNIT_SECTOR_BITS,
> BARRIER_BUCKETS_NR_BITS);
> 

Oops, thanks for catching this. I will fix it in the next version patch.
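
For example, the fixed calculation could be wrapped in a small helper
(a sketch; the helper name is my own):

	static inline int sector_to_idx(sector_t sector)
	{
		return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
				 BARRIER_BUCKETS_NR_BITS);
	}

so every caller hashes the barrier unit number instead of the raw
sector number.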


>> that sector_nr is the start sector number of a bio. We use
>> function align_to_barrier_unit_end() to calculate sectors number
>> from sector_nr to the next barrier unit size boundary, if the
>> requesting bio size goes across the boundary, we split the bio in
>> raid1_make_request(), to make sure the finall bio sent into
>> generic_make_request() won't exceed barrier unit boundary.
>> 
>> Comparing to single sliding resync window, - Currently resync I/O
>> grows linearly, therefore regular and resync I/O will have
>> confliction within a single barrier units. So it is similar to 
>> single sliding resync window. - But a barrier unit bucket is
>> shared by all barrier units with identical barrier uinit index,
>> the probability of confliction might be higher than single
>> sliding resync window, in condition that writing I/Os always hit
>> barrier units which have identical barrier bucket index with the
>> resync I/Os. This is a very rare condition in real I/O work
>> loads, I cannot imagine how it could happen in practice. -
>> Therefore we can achieve a good enough low confliction rate with
>> much simpler barrier algorithm and implementation.
>> 
>> If user has a (realy) large raid1 device, for example 10PB size,
>> we may just increase the buckets number BARRIER_BUCKETS_NR. Now
>> this is a macro, it is possible to be a
>> raid1-created-time-defined variable in future.
> 
> Why?  Why would a large array require more buckets?  Are you just 
> guessing, or do you see some concrete reason for there to be a 
> relationship between the size of the array and the number of
> buckets? If you can see a connection, please state it.  If not,
> don't mention it.
> 

This is an assumption. OK, I will remove this text from the patch log.


>> 
>> There are two changes should be noticed, - In raid1d(), I change
>> the code to decrease conf->nr_pending[idx] into single loop, it
>> looks like this, spin_lock_irqsave(&conf->device_lock, flags); 
>> conf->nr_queued[idx]--; 
>> spin_unlock_irqrestore(&conf->device_lock, flags); This change
>> generates more spin lock operations, but in next patch of this
>> patch set, it will be replaced by a single line code, 
>> atomic_dec(conf->nr_queueud[idx]); So we don't need to worry
>> about spin lock cost here. - Original function
>> raid1_make_request() is split into two functions, -
>> raid1_make_read_request(): handles regular read request and
>> calls wait_read_barrier() for I/O barrier. -
>> raid1_make_write_request(): handles regular write request and
>> calls wait_barrier() for I/O barrier. The differnece is
>> wait_read_barrier() only waits if array is frozen, using
>> different barrier function in different code path makes the code 
>> more clean and easy to read. - align_to_barrier_unit_end() is
>> called to make sure both regular and resync I/O won't go across a
>> barrier unit boundary.
>> 
>> Changelog V1: - Original RFC patch for comments V2: - Use
>> bio_split() to split the orignal bio if it goes across barrier
>> unit bounday, to make the code more simple, by suggestion from
>> Shaohua and Neil. - Use hash_long() to replace original linear
>> hash, to avoid a possible confilict between resync I/O and
>> sequential write I/O, by suggestion from Shaohua. - Add
>> conf->total_barriers to record barrier depth, which is used to 
>> control number of parallel sync I/O barriers, by suggestion from
>> Shaohua.
> 
> I really don't think this is needed. As long as RESYNC_DEPTH *
> RESYNC_SECTORS is less than BARRIER_UNIT_SECTOR_SIZE just testing
> again ->barrier[idx] will ensure the number of barrier requests
> never exceeds RESYNC_DEPTH*2.  That is sufficient.
> 
> Also, I think the reason for imposing the RESYNC_DEPTH limit is to
> make sure regular IO never has to wait too long for pending resync
> requests to flush.  With the simple test, regular IO will never
> need to wait for more than RESYNC_DEPTH requests to complete.
> 
> So I think have this field brings no valid, and is potentially
> confusing.
> 

Ok, I will remove conf->total_barriers and go back to the original
implementation. IMHO, this is a threshold tuned for hard disks; for
SSDs the limitation could be much larger, which is why there was no
conf->total_barriers in the original version.

I will fix it in the next version patch.
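
With total_barriers gone, the wait condition in raise_barrier() would
become something like (a sketch, per-bucket depth only):

	wait_event_lock_irq(conf->wait_barrier,
			    !conf->array_frozen &&
			    !conf->nr_pending[idx] &&
			    conf->barrier[idx] < RESYNC_DEPTH,
			    conf->resync_lock);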


>> - In V1 patch the bellowed barrier buckets related members in
>> r1conf are allocated in memory page. To make the code more
>> simple, V2 patch moves the memory space into struct r1conf, like
>> this, -       int                     nr_pending; -       int
>> nr_waiting; -       int                     nr_queued; -
>> int                     barrier; +       int
>> nr_pending[BARRIER_BUCKETS_NR]; +       int
>> nr_waiting[BARRIER_BUCKETS_NR]; +       int
>> nr_queued[BARRIER_BUCKETS_NR]; +       int
>> barrier[BARRIER_BUCKETS_NR];
> 
> I don't like this.  It makes the r1conf 4 pages is size, most of
> which is wasted.  A 4-page allocation is more likely to fail than a
> few 1-page allocations. I think these should be:
>> +       int                     *nr_pending; +       int
>> *nr_waiting; +       int                     *nr_queued; +
>> int                     *barrier;
> 
> Then use kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL) to
> allocate each array.   I think this approach addresses Shaohua's 
> concerns without requiring a multi-page allocation.
> 

Very constructive suggestion. Yes, I will make this change in the
next version patch.
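
A sketch of what setup_conf() would then do (assuming the usual abort
path frees whatever was allocated):

	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	conf->nr_queued  = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	conf->barrier    = kcalloc(BARRIER_BUCKETS_NR, sizeof(int), GFP_KERNEL);
	if (!conf->nr_pending || !conf->nr_waiting ||
	    !conf->nr_queued || !conf->barrier)
		goto abort;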

>> This change is by the suggestion from Shaohua. - Remove some
>> inrelavent code comments, by suggestion from Guoqing. - Add a
>> missing wait_barrier() before jumping to retry_write, in 
>> raid1_make_write_request().
>> 
>> Signed-off-by: Coly Li <colyli@suse.de> Cc: Shaohua Li
>> <shli@fb.com> Cc: Neil Brown <neilb@suse.de> Cc: Johannes
>> Thumshirn <jthumshirn@suse.de> Cc: Guoqing Jiang
>> <gqjiang@suse.com> --- drivers/md/raid1.c | 485
>> ++++++++++++++++++++++++++++++----------------------- 
>> drivers/md/raid1.h |  37 ++-- 2 files changed, 291 insertions(+),
>> 231 deletions(-)
>> 
>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index
>> a1f3fbe..5813656 100644 --- a/drivers/md/raid1.c +++
>> b/drivers/md/raid1.c @@ -67,9 +67,8 @@ */ static int
>> max_queued_requests = 1024;
>> 
>> -static void allow_barrier(struct r1conf *conf, sector_t
>> start_next_window, -			  sector_t bi_sector); -static void
>> lower_barrier(struct r1conf *conf); +static void
>> allow_barrier(struct r1conf *conf, sector_t sector_nr); +static
>> void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>> 
>> #define raid1_log(md, fmt, args...)				\ do { if ((md)->queue)
>> blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while
>> (0) @@ -96,7 +95,6 @@ static void r1bio_pool_free(void *r1_bio,
>> void *data) #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9) 
>> #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW) #define
>> CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9) 
>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>> 
>> static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data) { @@
>> -211,7 +209,7 @@ static void put_buf(struct r1bio *r1_bio)
>> 
>> mempool_free(r1_bio, conf->r1buf_pool);
>> 
>> -	lower_barrier(conf); +	lower_barrier(conf, r1_bio->sector); }
>> 
>> static void reschedule_retry(struct r1bio *r1_bio) @@ -219,10
>> +217,12 @@ static void reschedule_retry(struct r1bio *r1_bio) 
>> unsigned long flags; struct mddev *mddev = r1_bio->mddev; struct
>> r1conf *conf = mddev->private; +	int idx;
>> 
>> +	idx = hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); 
>> spin_lock_irqsave(&conf->device_lock, flags); 
>> list_add(&r1_bio->retry_list, &conf->retry_list); -
>> conf->nr_queued ++; +	conf->nr_queued[idx]++; 
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> 
>>  	wake_up(&conf->wait_barrier);
>> @@ -239,8 +239,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
>>  	struct bio *bio = r1_bio->master_bio;
>>  	int done;
>>  	struct r1conf *conf = r1_bio->mddev->private;
>> -	sector_t start_next_window = r1_bio->start_next_window;
>> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>> 
>>  	if (bio->bi_phys_segments) {
>>  		unsigned long flags;
>> @@ -265,7 +263,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
>>  		 * Wake up any possible resync thread that waits for the device
>>  		 * to go idle.
>>  		 */
>> -		allow_barrier(conf, start_next_window, bi_sector);
>> +		allow_barrier(conf, bio->bi_iter.bi_sector);
> 
> Why did you change this to use "bio->bi_iter.bi_sector" instead of 
> "bi_sector"?
> 
> I assume you thought it was an optimization that you would just
> slip in.  Can't hurt, right?
> 
> Just before this line is: bio_endio(bio); and that might cause the
> bio to be freed.  So your code could access freed memory.
> 
> Please be *very* cautious when making changes that are not
> directly related to the purpose of the patch.

Copied, I will fix it in the next version patch. This is a regression
I introduced when I used bio_split() and moved the location of
allow_barrier(). Thanks for catching this.
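
The fix is simply to keep using a local copy of the sector taken
before bio_endio() (a sketch):

	sector_t bi_sector = bio->bi_iter.bi_sector;
	...
	bio_endio(bio);
	/* bio may already be freed here; only use the saved value */
	allow_barrier(conf, bi_sector);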

>> } }
>> 
>> @@ -513,6 +511,25 @@ static void raid1_end_write_request(struct
>> bio *bio) bio_put(to_put); }
>> 
>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>> +					  sector_t sectors)
>> +{
>> +	sector_t len;
>> +
>> +	WARN_ON(sectors == 0);
>> +	/* len is the number of sectors from start_sector to end of the
>> +	 * barrier unit which start_sector belongs to.
>> +	 */
>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>> +	      start_sector;
> 
> This would be better as
> 
> 	len = round_up(start_sector+1, BARRIER_UNIT_SECTOR_SIZE) - start_sector;
> 

Aha! Yes, I will modify the code this way. Thanks for the suggestion.

> 
>> + +	if (len > sectors) +		len = sectors; + +	return len; +} + /* 
>> * This routine returns the disk from which the requested read
>> should * be done. There is a per-array 'next expected sequential
>> IO' sector @@ -809,168 +826,179 @@ static void
>> flush_pending_writes(struct r1conf *conf) */ static void
>> raise_barrier(struct r1conf *conf, sector_t sector_nr) { +	int
>> idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + 
>> spin_lock_irq(&conf->resync_lock);
>> 
>> /* Wait until no block IO is waiting */ -
>> wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting, +
>> wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx], 
>> conf->resync_lock);
>> 
>> /* block any new IO from starting */ -	conf->barrier++; -
>> conf->next_resync = sector_nr; +	conf->barrier[idx]++; +
>> conf->total_barriers++;
>> 
>> /* For these conditions we must wait: * A: while the array is in
>> frozen state -	 * B: while barrier >= RESYNC_DEPTH, meaning
>> resync reach -	 *    the max count which allowed. -	 * C:
>> next_resync + RESYNC_SECTORS > start_next_window, meaning -	 *
>> next resync will reach to the window which normal bios are -	 *
>> handling. -	 * D: while there are any active requests in the
>> current window. +	 * B: while conf->nr_pending[idx] is not 0,
>> meaning regular I/O +	 *    existing in sector number ranges
>> corresponding to idx. +	 * C: while conf->total_barriers >=
>> RESYNC_DEPTH, meaning resync reach +	 *    the max count which
>> allowed on the whole raid1 device. */ 
>> wait_event_lock_irq(conf->wait_barrier, !conf->array_frozen && -
>> conf->barrier < RESYNC_DEPTH && -
>> conf->current_window_requests == 0 && -
>> (conf->start_next_window >= -			     conf->next_resync +
>> RESYNC_SECTORS), +			     !conf->nr_pending[idx] && +
>> conf->total_barriers < RESYNC_DEPTH, conf->resync_lock);
>> 
>> -	conf->nr_pending++; +	conf->nr_pending[idx]++; 
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> -static void lower_barrier(struct r1conf *conf) +static void
>> lower_barrier(struct r1conf *conf, sector_t sector_nr) { unsigned
>> long flags; -	BUG_ON(conf->barrier <= 0); +	int idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + +
>> BUG_ON((conf->barrier[idx] <= 0) || conf->total_barriers <= 0); 
>> + spin_lock_irqsave(&conf->resync_lock, flags); -
>> conf->barrier--; -	conf->nr_pending--; +	conf->barrier[idx]--; +
>> conf->total_barriers--; +	conf->nr_pending[idx]--; 
>> spin_unlock_irqrestore(&conf->resync_lock, flags); 
>> wake_up(&conf->wait_barrier); }
>> 
>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct
>> bio *bio) +static void _wait_barrier(struct r1conf *conf, int
>> idx) { -	bool wait = false; - -	if (conf->array_frozen || !bio) -
>> wait = true; -	else if (conf->barrier && bio_data_dir(bio) ==
>> WRITE) { -		if ((conf->mddev->curr_resync_completed -		     >=
>> bio_end_sector(bio)) || -		    (conf->start_next_window +
>> NEXT_NORMALIO_DISTANCE -		     <= bio->bi_iter.bi_sector)) -
>> wait = false; -		else -			wait = true; +
>> spin_lock_irq(&conf->resync_lock); +	if (conf->array_frozen ||
>> conf->barrier[idx]) { +		conf->nr_waiting[idx]++; +		/* Wait for
>> the barrier to drop. */ +		wait_event_lock_irq( +
>> conf->wait_barrier, +			!conf->array_frozen &&
>> !conf->barrier[idx], +			conf->resync_lock); +
>> conf->nr_waiting[idx]--; }
>> 
>> -	return wait; +	conf->nr_pending[idx]++; +
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> -static sector_t wait_barrier(struct r1conf *conf, struct bio
>> *bio) +static void wait_read_barrier(struct r1conf *conf,
>> sector_t sector_nr) { -	sector_t sector = 0; +	long idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>> 
>> spin_lock_irq(&conf->resync_lock); -	if
>> (need_to_wait_for_sync(conf, bio)) { -		conf->nr_waiting++; -		/*
>> Wait for the barrier to drop. -		 * However if there are already
>> pending -		 * requests (preventing the barrier from -		 * rising
>> completely), and the -		 * per-process bio queue isn't empty, -
>> * then don't wait, as we need to empty -		 * that queue to allow
>> conf->start_next_window -		 * to increase. -		 */ -
>> raid1_log(conf->mddev, "wait barrier"); -
>> wait_event_lock_irq(conf->wait_barrier, -
>> !conf->array_frozen && -				    (!conf->barrier || -
>> ((conf->start_next_window < -				       conf->next_resync +
>> RESYNC_SECTORS) && -				      current->bio_list && -
>> !bio_list_empty(current->bio_list))), -
>> conf->resync_lock); -		conf->nr_waiting--; -	} - -	if (bio &&
>> bio_data_dir(bio) == WRITE) { -		if (bio->bi_iter.bi_sector >=
>> conf->next_resync) { -			if (conf->start_next_window ==
>> MaxSector) -				conf->start_next_window = -					conf->next_resync
>> + -					NEXT_NORMALIO_DISTANCE; - -			if
>> ((conf->start_next_window + NEXT_NORMALIO_DISTANCE) -			    <=
>> bio->bi_iter.bi_sector) -				conf->next_window_requests++; -
>> else -				conf->current_window_requests++; -			sector =
>> conf->start_next_window; -		} +	if (conf->array_frozen) { +
>> conf->nr_waiting[idx]++; +		/* Wait for array to unfreeze */ +
>> wait_event_lock_irq( +			conf->wait_barrier, +
>> !conf->array_frozen, +			conf->resync_lock); +
>> conf->nr_waiting[idx]--; }
>> 
>> -	conf->nr_pending++; +	conf->nr_pending[idx]++; 
>> spin_unlock_irq(&conf->resync_lock); -	return sector; }
>> 
>> -static void allow_barrier(struct r1conf *conf, sector_t
>> start_next_window, -			  sector_t bi_sector) +static void
>> wait_barrier(struct r1conf *conf, sector_t sector_nr) +{ +	int
>> idx = hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS); + +
>> _wait_barrier(conf, idx); +} + +static void
>> wait_all_barriers(struct r1conf *conf) +{ +	int idx; + +	for (idx
>> = 0; idx < BARRIER_BUCKETS_NR; idx++) +		_wait_barrier(conf,
>> idx); +} + +static void _allow_barrier(struct r1conf *conf, int
>> idx) { unsigned long flags;
>> 
>> spin_lock_irqsave(&conf->resync_lock, flags); -
>> conf->nr_pending--; -	if (start_next_window) { -		if
>> (start_next_window == conf->start_next_window) { -			if
>> (conf->start_next_window + NEXT_NORMALIO_DISTANCE -			    <=
>> bi_sector) -				conf->next_window_requests--; -			else -
>> conf->current_window_requests--; -		} else -
>> conf->current_window_requests--; - -		if
>> (!conf->current_window_requests) { -			if
>> (conf->next_window_requests) { -				conf->current_window_requests
>> = -					conf->next_window_requests; -
>> conf->next_window_requests = 0; -				conf->start_next_window += -
>> NEXT_NORMALIO_DISTANCE; -			} else -				conf->start_next_window =
>> MaxSector; -		} -	} +	conf->nr_pending[idx]--; 
>> spin_unlock_irqrestore(&conf->resync_lock, flags); 
>> wake_up(&conf->wait_barrier); }
>> 
>> +static void allow_barrier(struct r1conf *conf, sector_t
>> sector_nr) +{ +	int idx = hash_long(sector_nr,
>> BARRIER_BUCKETS_NR_BITS); + +	_allow_barrier(conf, idx); +} + 
>> +static void allow_all_barriers(struct r1conf *conf) +{ +	int
>> idx; + +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++) +
>> _allow_barrier(conf, idx); +} + +/* conf->resync_lock should be
>> held */ +static int get_all_pendings(struct r1conf *conf) +{ +
>> int idx, ret; + +	for (ret = 0, idx = 0; idx <
>> BARRIER_BUCKETS_NR; idx++) +		ret += conf->nr_pending[idx]; +
>> return ret; +} + +/* conf->resync_lock should be held */ +static
>> int get_all_queued(struct r1conf *conf) +{ +	int idx, ret; + +
>> for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++) +		ret +=
>> conf->nr_queued[idx]; +	return ret; +} + static void
>> freeze_array(struct r1conf *conf, int extra) { -	/* stop syncio
>> and normal IO and wait for everything to +	/* Stop sync I/O and
>> normal I/O and wait for everything to * go quite. -	 * We wait
>> until nr_pending match nr_queued+extra -	 * This is called in the
>> context of one normal IO request -	 * that has failed. Thus any
>> sync request that might be pending -	 * will be blocked by
>> nr_pending, and we need to wait for -	 * pending IO requests to
>> complete or be queued for re-try. -	 * Thus the number queued
>> (nr_queued) plus this request (extra) -	 * must match the number
>> of pending IOs (nr_pending) before -	 * we continue. +	 * This is
>> called in two situations: +	 * 1) management command handlers
>> (reshape, remove disk, quiesce). +	 * 2) one normal I/O request
>> failed. + +	 * After array_frozen is set to 1, new sync IO will
>> be blocked at +	 * raise_barrier(), and new normal I/O will
>> blocked at _wait_barrier(). +	 * The flying I/Os will either
>> complete or be queued. When everything +	 * goes quite, there are
>> only queued I/Os left. + +	 * Every flying I/O contributes to a
>> conf->nr_pending[idx], idx is the +	 * barrier bucket index which
>> this I/O request hits. When all sync and +	 * normal I/O are
>> queued, sum of all conf->nr_pending[] will match sum +	 * of all
>> conf->nr_queued[]. But normal I/O failure is an exception, +	 *
>> in handle_read_error(), we may call freeze_array() before trying
>> to +	 * fix the read error. In this case, the error read I/O is
>> not queued, +	 * so get_all_pending() == get_all_queued() + 1. +
>> * +	 * Therefore before this function returns, we need to wait
>> until +	 * get_all_pendings(conf) gets equal to
>> get_all_queued(conf)+extra. For +	 * normal I/O context, extra is
>> 1, in rested situations extra is 0. */ 
>> spin_lock_irq(&conf->resync_lock); conf->array_frozen = 1; 
>> raid1_log(conf->mddev, "wait freeze"); -
>> wait_event_lock_irq_cmd(conf->wait_barrier, -				conf->nr_pending
>> == conf->nr_queued+extra, -				conf->resync_lock, -
>> flush_pending_writes(conf)); +	wait_event_lock_irq_cmd( +
>> conf->wait_barrier, +		get_all_pendings(conf) ==
>> get_all_queued(conf)+extra, +		conf->resync_lock, +
>> flush_pending_writes(conf)); 
>> spin_unlock_irq(&conf->resync_lock); } static void
>> unfreeze_array(struct r1conf *conf) @@ -1066,64 +1094,23 @@
>> static void raid1_unplug(struct blk_plug_cb *cb, bool
>> from_schedule) kfree(plug); }
>> 
>> -static void raid1_make_request(struct mddev *mddev, struct bio *
>> bio) +static void raid1_make_read_request(struct mddev *mddev,
>> struct bio *bio) { struct r1conf *conf = mddev->private; struct
>> raid1_info *mirror; struct r1bio *r1_bio; struct bio *read_bio; -
>> int i, disks; struct bitmap *bitmap; -	unsigned long flags; const
>> int op = bio_op(bio); -	const int rw = bio_data_dir(bio); const
>> unsigned long do_sync = (bio->bi_opf & REQ_SYNC); -	const
>> unsigned long do_flush_fua = (bio->bi_opf & -						(REQ_PREFLUSH
>> | REQ_FUA)); -	struct md_rdev *blocked_rdev; -	struct blk_plug_cb
>> *cb; -	struct raid1_plug_cb *plug = NULL; -	int first_clone; int
>> sectors_handled; int max_sectors; -	sector_t start_next_window; +
>> int rdisk;
>> 
>> -	/* -	 * Register the new request and wait if the
>> reconstruction -	 * thread has put up a bar for new requests. -
>> * Continue immediately if no resync is active currently. +	/*
>> Still need barrier for READ in case that whole +	 * array is
>> frozen. */ - -	md_write_start(mddev, bio); /* wait on superblock
>> update early */ - -	if (bio_data_dir(bio) == WRITE && -
>> ((bio_end_sector(bio) > mddev->suspend_lo && -
>> bio->bi_iter.bi_sector < mddev->suspend_hi) || -
>> (mddev_is_clustered(mddev) && -
>> md_cluster_ops->area_resyncing(mddev, WRITE, -
>> bio->bi_iter.bi_sector, bio_end_sector(bio))))) { -		/* As the
>> suspend_* range is controlled by -		 * userspace, we want an
>> interruptible -		 * wait. -		 */ -		DEFINE_WAIT(w); -		for (;;)
>> { -			flush_signals(current); -
>> prepare_to_wait(&conf->wait_barrier, -					&w,
>> TASK_INTERRUPTIBLE); -			if (bio_end_sector(bio) <=
>> mddev->suspend_lo || -			    bio->bi_iter.bi_sector >=
>> mddev->suspend_hi || -			    (mddev_is_clustered(mddev) && -
>> !md_cluster_ops->area_resyncing(mddev, WRITE, -
>> bio->bi_iter.bi_sector, bio_end_sector(bio)))) -				break; -
>> schedule(); -		} -		finish_wait(&conf->wait_barrier, &w); -	} - -
>> start_next_window = wait_barrier(conf, bio); - +
>> wait_read_barrier(conf, bio->bi_iter.bi_sector); bitmap =
>> mddev->bitmap;
>> 
>> /* @@ -1149,12 +1136,9 @@ static void raid1_make_request(struct
>> mddev *mddev, struct bio * bio) bio->bi_phys_segments = 0; 
>> bio_clear_flag(bio, BIO_SEG_VALID);
>> 
>> -	if (rw == READ) { /* * read balancing logic: */ -		int rdisk; 
>> - read_again: rdisk = read_balance(conf, r1_bio, &max_sectors);
>> 
>> @@ -1176,7 +1160,6 @@ static void raid1_make_request(struct mddev
>> *mddev, struct bio * bio) atomic_read(&bitmap->behind_writes) ==
>> 0); } r1_bio->read_disk = rdisk; -		r1_bio->start_next_window =
>> 0;
>> 
>> read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev); 
>> bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector, @@
>> -1232,11 +1215,89 @@ static void raid1_make_request(struct mddev
>> *mddev, struct bio * bio) } else generic_make_request(read_bio); 
>> return; +} + +static void raid1_make_write_request(struct mddev
>> *mddev, struct bio *bio) +{ +	struct r1conf *conf =
>> mddev->private; +	struct r1bio *r1_bio; +	int i, disks; +	struct
>> bitmap *bitmap; +	unsigned long flags; +	const int op =
>> bio_op(bio); +	const unsigned long do_sync = (bio->bi_opf &
>> REQ_SYNC); +	const unsigned long do_flush_fua = (bio->bi_opf & +
>> (REQ_PREFLUSH | REQ_FUA)); +	struct md_rdev *blocked_rdev; +
>> struct blk_plug_cb *cb; +	struct raid1_plug_cb *plug = NULL; +
>> int first_clone; +	int sectors_handled; +	int max_sectors; + +
>> /* +	 * Register the new request and wait if the reconstruction +
>> * thread has put up a bar for new requests. +	 * Continue
>> immediately if no resync is active currently. +	 */ + +
>> md_write_start(mddev, bio); /* wait on superblock update early
>> */ + +	if (((bio_end_sector(bio) > mddev->suspend_lo && +
>> bio->bi_iter.bi_sector < mddev->suspend_hi) || +
>> (mddev_is_clustered(mddev) && +
>> md_cluster_ops->area_resyncing(mddev, WRITE, +
>> bio->bi_iter.bi_sector, bio_end_sector(bio))))) { +		/* As the
>> suspend_* range is controlled by +		 * userspace, we want an
>> interruptible +		 * wait. +		 */ +		DEFINE_WAIT(w); + +		for (;;)
>> { +			flush_signals(current); +
>> prepare_to_wait(&conf->wait_barrier, +					&w,
>> TASK_INTERRUPTIBLE); +			if (bio_end_sector(bio) <=
>> mddev->suspend_lo || +			    bio->bi_iter.bi_sector >=
>> mddev->suspend_hi || +			    (mddev_is_clustered(mddev) && +
>> !md_cluster_ops->area_resyncing( +						mddev, +						WRITE, +
>> bio->bi_iter.bi_sector, +						bio_end_sector(bio)))) +
>> break; +			schedule(); +		} +		finish_wait(&conf->wait_barrier,
>> &w); }
>> 
>> +	wait_barrier(conf, bio->bi_iter.bi_sector); +	bitmap =
>> mddev->bitmap; + /* -	 * WRITE: +	 * make_request() can abort the
>> operation when read-ahead is being +	 * used and no empty request
>> is available. +	 * +	 */ +	r1_bio =
>> mempool_alloc(conf->r1bio_pool, GFP_NOIO); + +	r1_bio->master_bio
>> = bio; +	r1_bio->sectors = bio_sectors(bio); +	r1_bio->state =
>> 0; +	r1_bio->mddev = mddev; +	r1_bio->sector =
>> bio->bi_iter.bi_sector; + +	/* We might need to issue multiple
>> reads to different +	 * devices if there are bad blocks around,
>> so we keep +	 * track of the number of reads in
>> bio->bi_phys_segments. +	 * If this is 0, there is only one
>> r1_bio and no locking +	 * will be needed when requests complete.
>> If it is +	 * non-zero, then it is the number of not-completed
>> requests.
> 
> This comment mentions "reads".  It should probably be changed to
> discuss what happens to "writes" since this is
> raid1_make_write_request().
> 

Yes, I will fix this.


>> */ +	bio->bi_phys_segments = 0; +	bio_clear_flag(bio,
>> BIO_SEG_VALID); + if (conf->pending_count >= max_queued_requests)
>> { md_wakeup_thread(mddev->thread); raid1_log(mddev, "wait
>> queued"); @@ -1256,7 +1317,6 @@ static void
>> raid1_make_request(struct mddev *mddev, struct bio * bio)
>> 
>> disks = conf->raid_disks * 2; retry_write: -
>> r1_bio->start_next_window = start_next_window; blocked_rdev =
>> NULL; rcu_read_lock(); max_sectors = r1_bio->sectors; @@ -1324,25
>> +1384,15 @@ static void raid1_make_request(struct mddev *mddev,
>> struct bio * bio) if (unlikely(blocked_rdev)) { /* Wait for this
>> device to become unblocked */ int j; -		sector_t old =
>> start_next_window;
>> 
>> for (j = 0; j < i; j++) if (r1_bio->bios[j]) 
>> rdev_dec_pending(conf->mirrors[j].rdev, mddev); r1_bio->state =
>> 0; -		allow_barrier(conf, start_next_window,
>> bio->bi_iter.bi_sector); +		allow_barrier(conf,
>> bio->bi_iter.bi_sector); raid1_log(mddev, "wait rdev %d blocked",
>> blocked_rdev->raid_disk); md_wait_for_blocked_rdev(blocked_rdev,
>> mddev); -		start_next_window = wait_barrier(conf, bio); -		/* -
>> * We must make sure the multi r1bios of bio have -		 * the same
>> value of bi_phys_segments -		 */ -		if (bio->bi_phys_segments &&
>> old && -		    old != start_next_window) -			/* Wait for the
>> former r1bio(s) to complete */ -
>> wait_event(conf->wait_barrier, -				   bio->bi_phys_segments ==
>> 1); +		wait_barrier(conf, bio->bi_iter.bi_sector); goto
>> retry_write; }
>> 
>> @@ -1464,6 +1514,31 @@ static void raid1_make_request(struct
>> mddev *mddev, struct bio * bio) wake_up(&conf->wait_barrier); }
>> 
>> +static void raid1_make_request(struct mddev *mddev, struct bio
>> *bio) +{ +	void (*make_request_fn)(struct mddev *mddev, struct
>> bio *bio); +	struct bio *split; +	sector_t sectors; + +
>> make_request_fn = (bio_data_dir(bio) == READ) ? +
>> raid1_make_read_request : +			  raid1_make_write_request; + +	/*
>> if bio exceeds barrier unit boundary, split it */ +	do { +
>> sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector, +
>> bio_sectors(bio)); +		if (sectors < bio_sectors(bio)) { +			split
>> = bio_split(bio, sectors, GFP_NOIO, fs_bio_set); +
>> bio_chain(split, bio); +		} else { +			split = bio; +		} + +
>> make_request_fn(mddev, split); +	} while (split != bio); +} + 
>> static void raid1_status(struct seq_file *seq, struct mddev
>> *mddev) { struct r1conf *conf = mddev->private; @@ -1552,19
>> +1627,11 @@ static void print_conf(struct r1conf *conf)
>> 
>> static void close_sync(struct r1conf *conf) { -
>> wait_barrier(conf, NULL); -	allow_barrier(conf, 0, 0); +
>> wait_all_barriers(conf); +	allow_all_barriers(conf);
>> 
>> mempool_destroy(conf->r1buf_pool); conf->r1buf_pool = NULL; - -
>> spin_lock_irq(&conf->resync_lock); -	conf->next_resync =
>> MaxSector - 2 * NEXT_NORMALIO_DISTANCE; -	conf->start_next_window
>> = MaxSector; -	conf->current_window_requests += -
>> conf->next_window_requests; -	conf->next_window_requests = 0; -
>> spin_unlock_irq(&conf->resync_lock); }
>> 
>> static int raid1_spare_active(struct mddev *mddev) @@ -2311,8
>> +2378,9 @@ static void handle_sync_write_finished(struct r1conf
>> *conf, struct r1bio *r1_bio
>> 
>> static void handle_write_finished(struct r1conf *conf, struct
>> r1bio *r1_bio) { -	int m; +	int m, idx; bool fail = false; + for
>> (m = 0; m < conf->raid_disks * 2 ; m++) if (r1_bio->bios[m] ==
>> IO_MADE_GOOD) { struct md_rdev *rdev = conf->mirrors[m].rdev; @@
>> -2338,7 +2406,8 @@ static void handle_write_finished(struct
>> r1conf *conf, struct r1bio *r1_bio) if (fail) { 
>> spin_lock_irq(&conf->device_lock); list_add(&r1_bio->retry_list,
>> &conf->bio_end_io_list); -		conf->nr_queued++; +		idx =
>> hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); +
>> conf->nr_queued[idx]++; spin_unlock_irq(&conf->device_lock); 
>> md_wakeup_thread(conf->mddev->thread); } else { @@ -2460,6
>> +2529,7 @@ static void raid1d(struct md_thread *thread) struct
>> r1conf *conf = mddev->private; struct list_head *head =
>> &conf->retry_list; struct blk_plug plug; +	int idx;
>> 
>> md_check_recovery(mddev);
>> 
>> @@ -2467,17 +2537,18 @@ static void raid1d(struct md_thread
>> *thread) !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { 
>> LIST_HEAD(tmp); spin_lock_irqsave(&conf->device_lock, flags); -
>> if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) { -
>> while (!list_empty(&conf->bio_end_io_list)) { -
>> list_move(conf->bio_end_io_list.prev, &tmp); -
>> conf->nr_queued--; -			} -		} +		if
>> (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) +
>> list_splice_init(&conf->bio_end_io_list, &tmp); 
>> spin_unlock_irqrestore(&conf->device_lock, flags); while
>> (!list_empty(&tmp)) { r1_bio = list_first_entry(&tmp, struct
>> r1bio, retry_list); list_del(&r1_bio->retry_list); +			idx =
>> hash_long(r1_bio->sector, +					BARRIER_BUCKETS_NR_BITS); +
>> spin_lock_irqsave(&conf->device_lock, flags); +
>> conf->nr_queued[idx]--; +
>> spin_unlock_irqrestore(&conf->device_lock, flags); if
>> (mddev->degraded) set_bit(R1BIO_Degraded, &r1_bio->state); if
>> (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2498,7 +2569,8
>> @@ static void raid1d(struct md_thread *thread) } r1_bio =
>> list_entry(head->prev, struct r1bio, retry_list); 
>> list_del(head->prev); -		conf->nr_queued--; +		idx =
>> hash_long(r1_bio->sector, BARRIER_BUCKETS_NR_BITS); +
>> conf->nr_queued[idx]--; 
>> spin_unlock_irqrestore(&conf->device_lock, flags);
>> 
>> mddev = r1_bio->mddev; @@ -2537,7 +2609,6 @@ static int
>> init_resync(struct r1conf *conf) conf->poolinfo); if
>> (!conf->r1buf_pool) return -ENOMEM; -	conf->next_resync = 0; 
>> return 0; }
>> 
>> @@ -2566,6 +2637,7 @@ static sector_t raid1_sync_request(struct
>> mddev *mddev, sector_t sector_nr, int still_degraded = 0; int
>> good_sectors = RESYNC_SECTORS; int min_bad = 0; /* number of
>> sectors that are bad in all devices */ +	int idx =
>> hash_long(sector_nr, BARRIER_BUCKETS_NR_BITS);
>> 
>> if (!conf->r1buf_pool) if (init_resync(conf)) @@ -2615,7 +2687,7
>> @@ static sector_t raid1_sync_request(struct mddev *mddev,
>> sector_t sector_nr, * If there is non-resync activity waiting for
>> a turn, then let it * though before starting on this new sync
>> request. */ -	if (conf->nr_waiting) +	if (conf->nr_waiting[idx]) 
>> schedule_timeout_uninterruptible(1);
>> 
>> /* we are incrementing sector_nr below. To be safe, we check
>> against @@ -2642,6 +2714,8 @@ static sector_t
>> raid1_sync_request(struct mddev *mddev, sector_t sector_nr, 
>> r1_bio->sector = sector_nr; r1_bio->state = 0; 
>> set_bit(R1BIO_IsSync, &r1_bio->state); +	/* make sure
>> good_sectors won't go across barrier unit boundary */ +
>> good_sectors = align_to_barrier_unit_end(sector_nr,
>> good_sectors);
>> 
>> for (i = 0; i < conf->raid_disks * 2; i++) { struct md_rdev
>> *rdev; @@ -2927,9 +3001,6 @@ static struct r1conf
>> *setup_conf(struct mddev *mddev) conf->pending_count = 0; 
>> conf->recovery_disabled = mddev->recovery_disabled - 1;
>> 
>> -	conf->start_next_window = MaxSector; -
>> conf->current_window_requests = conf->next_window_requests = 0; 
>> - err = -EIO; for (i = 0; i < conf->raid_disks * 2; i++) {
>> 
>> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h index
>> c52ef42..817115d 100644 --- a/drivers/md/raid1.h +++
>> b/drivers/md/raid1.h @@ -1,6 +1,14 @@ #ifndef _RAID1_H #define
>> _RAID1_H
>> 
>> +/* each barrier unit size is 64MB fow now + * note: it must be
>> larger than RESYNC_DEPTH + */ +#define BARRIER_UNIT_SECTOR_BITS
>> 17 +#define BARRIER_UNIT_SECTOR_SIZE	(1<<17) +#define
>> BARRIER_BUCKETS_NR_BITS		9 +#define BARRIER_BUCKETS_NR
>> (1<<BARRIER_BUCKETS_NR_BITS) + struct raid1_info { struct md_rdev
>> *rdev; sector_t	head_position; @@ -35,25 +43,6 @@ struct r1conf
>> { */ int			raid_disks;
>> 
>> -	/* During resync, read_balancing is only allowed on the part -
>> * of the array that has been resynced.  'next_resync' tells us -
>> * where that is. -	 */ -	sector_t		next_resync; - -	/* When raid1
>> starts resync, we divide array into four partitions -	 *
>> |---------|--------------|---------------------|-------------| -
>> *        next_resync   start_next_window       end_window -	 *
>> start_next_window = next_resync + NEXT_NORMALIO_DISTANCE -	 *
>> end_window = start_next_window + NEXT_NORMALIO_DISTANCE -	 *
>> current_window_requests means the count of normalIO between -	 *
>> start_next_window and end_window. -	 * next_window_requests means
>> the count of normalIO after end_window. -	 * */ -	sector_t
>> start_next_window; -	int			current_window_requests; -	int
>> next_window_requests; - spinlock_t		device_lock;
>> 
>> /* list of 'struct r1bio' that need to be processed by raid1d, @@
>> -79,10 +68,11 @@ struct r1conf { */ wait_queue_head_t
>> wait_barrier; spinlock_t		resync_lock; -	int			nr_pending; -	int
>> nr_waiting; -	int			nr_queued; -	int			barrier; +	int
>> nr_pending[BARRIER_BUCKETS_NR]; +	int
>> nr_waiting[BARRIER_BUCKETS_NR]; +	int
>> nr_queued[BARRIER_BUCKETS_NR]; +	int
>> barrier[BARRIER_BUCKETS_NR]; +	int			total_barriers; int
>> array_frozen;
>> 
>> /* Set to 1 if a full sync is needed, (fresh device added). @@
>> -135,7 +125,6 @@ struct r1bio { * in this BehindIO request */ 
>> sector_t		sector; -	sector_t		start_next_window; int			sectors; 
>> unsigned long		state; struct mddev		*mddev; --

Thanks for your review, I will include all the fixes in the next version patch.

Coly



end of thread, other threads:[~2017-01-16  9:06 UTC | newest]

Thread overview: 8+ messages
2016-12-27 15:47 [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
2016-12-27 15:47 ` [v2 PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
2017-01-04 19:59   ` Shaohua Li
2017-01-06  1:52   ` NeilBrown
2017-01-04 19:35 ` [v2 PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
2017-01-16  6:08   ` Coly Li
2017-01-05 23:08 ` NeilBrown
2017-01-16  9:06   ` Coly Li
