* [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
@ 2016-11-21 21:54 Coly Li
  2016-11-21 21:54 ` [RFC PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Coly Li @ 2016-11-21 21:54 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")
introduces a sliding resync window for the raid1 I/O barrier. The idea is
to limit I/O barriers to requests inside the sliding resync window, so
regular I/Os outside of the resync window don't need to wait for the
barrier any more. On a large raid1 device, this helps a lot to improve
parallel write throughput when background resync I/O is running at the
same time.

The idea of the sliding resync window is awesome, but there are several
challenges which are very difficult to solve,
 - code complexity
   The sliding resync window requires several variables to work
   collectively; this is complex and very hard to get right. Just grep
   "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches
   fixing the original resync window patch, and this is not the end, any
   further related modification may easily introduce more regressions.
 - multiple sliding resync windows
   Currently the raid1 code has only a single sliding resync window, so we
   cannot do parallel resync with the current I/O barrier implementation.
   Implementing multiple resync windows would be much more complex and
   very hard to get right.

Therefore I decided to implement a much simpler raid1 I/O barrier by
removing the resync window code; I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a global unique resync window.
 - Use multiple hash buckets to reduce I/O barrier conflicts: a regular
   I/O only has to wait for a resync I/O when both of them hit the same
   barrier bucket index, and vice versa.
 - I/O barrier conflicts can be reduced to an acceptable level if there
   are enough barrier buckets.

Here is how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, each of size BARRIER_UNIT_SECTOR_SIZE sectors.
   A bio request won't go across the border of a barrier unit, which means
   the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total; if multiple I/O requests
   hit different barrier units, they only need to compete for the I/O
   barrier with other I/Os which hit the same barrier bucket index. The
   index of the barrier bucket which a bio should look at is calculated by
   get_barrier_bucket_idx(),
	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
   where sector is the start sector number of the bio, and
   align_to_barrier_unit_end() makes sure the final bio sent into
   generic_make_request() won't exceed the border of the barrier unit.
   The number of barrier buckets is defined by,
	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
   For a 4KB page size, there are 512 buckets for each raid1 device, so
   the probability of a fully random I/O barrier conflict may be reduced
   to roughly 1/512. A short sketch of the mapping follows this list.
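
For clarity, here is the bucket mapping collected in one place. These are
the definitions this patch adds to raid1.h; only the worked example in the
comment is added here for illustration,
	#define BARRIER_UNIT_SECTOR_BITS	17	/* 64MB units, in 512B sectors */
	#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
	#define BARRIER_BUCKETS_NR		(PAGE_SIZE/sizeof(long))

	static inline long get_barrier_bucket_idx(sector_t sector)
	{
		/* e.g. a bio starting at sector 33554432 (16GB) sits in
		 * barrier unit 33554432>>17 = 256, thus in bucket 256 % 512
		 */
		return (long)(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR;
	}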

Comparing to the single sliding resync window,
 - Currently resync I/O grows linearly, so regular and resync I/O only
   conflict within a single barrier unit; in this respect it is similar
   to the single sliding resync window.
 - But a barrier bucket is shared by all barrier units with an identical
   barrier bucket index, so the probability of conflict might be higher
   than with the single sliding resync window, in the case that write
   I/Os always hit barrier units which have the same barrier bucket index
   as the resync I/O. This is a very rare condition in real I/O
   workloads; I cannot imagine how it could happen in practice.
 - Therefore we can achieve a good enough low conflict rate with a much
   simpler barrier algorithm and implementation.

If a user has a (really) large raid1 device, for example 10PB in size, we
may just increase the bucket number BARRIER_BUCKETS_NR. For now this is a
macro; it could become a variable defined at raid1 creation time in the
future.

There are a few changes that should be noticed,
 - In raid1d(), the code that decreases conf->nr_queued[idx] is moved into
   the single loop which handles each r1_bio; it looks like this,
	spin_lock_irqsave(&conf->device_lock, flags);
	conf->nr_queued[idx]--;
	spin_unlock_irqrestore(&conf->device_lock, flags);
   This change generates more spin lock operations, but in the next patch
   of this patch set it will be replaced by a single line of code,
	atomic_dec(&conf->nr_queued[idx]);
   so we don't need to worry about the spin lock cost here.
 - In raid1_make_request(), wait_barrier() is replaced by,
   a) wait_read_barrier(): wait for the barrier in the regular read I/O
      code path
   b) wait_barrier(): wait for the barrier in the regular write I/O code
      path
   The difference is that wait_read_barrier() only waits if the array is
   frozen. I am not able to combine them into one function, because they
   must receive different data types in their argument lists.
 - align_to_barrier_unit_end() is called to make sure both regular and
   resync I/O won't go across the border of a barrier unit; the write
   path sketch below shows where it fits.
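
For reference, the regular write path now interacts with the barrier code
roughly like this (a condensed pseudo-flow of the diff below, not a
literal excerpt),
	/* in raid1_make_request(): only the bucket of r1_bio->sector
	 * is involved
	 */
	wait_barrier(conf, r1_bio->sector);
	/* never let one r1_bio cross a 64MB barrier unit border */
	max_sectors = align_to_barrier_unit_end(r1_bio->sector, max_sectors);
	...
	/* later, when the write completes, raid_end_bio_io() drops the
	 * bucket reference again
	 */
	allow_barrier(conf, r1_bio->sector);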
 
Open question:
 - This needs review from an md clustering developer; I don't touch the
   related code for now.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
 drivers/md/raid1.h |  42 +++++-------
 2 files changed, 242 insertions(+), 189 deletions(-)

Index: linux-raid1/drivers/md/raid1.c
===================================================================
--- linux-raid1.orig/drivers/md/raid1.c
+++ linux-raid1/drivers/md/raid1.c
@@ -66,9 +66,8 @@
  */
 static int max_queued_requests = 1024;
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector);
-static void lower_barrier(struct r1conf *conf);
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
 
 static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
 #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
 #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
-#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
 {
 	struct r1conf *conf = r1_bio->mddev->private;
 	int i;
+	sector_t sector_nr = r1_bio->sector;
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct bio *bio = r1_bio->bios[i];
@@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
-	lower_barrier(conf);
+	lower_barrier(conf, sector_nr);
 }
 
 static void reschedule_retry(struct r1bio *r1_bio)
@@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
 	unsigned long flags;
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
+	sector_t sector_nr;
+	long idx;
+
+	sector_nr = r1_bio->sector;
+	idx = get_barrier_bucket_idx(sector_nr);
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued ++;
+	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
 	struct bio *bio = r1_bio->master_bio;
 	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
-	sector_t start_next_window = r1_bio->start_next_window;
-	sector_t bi_sector = bio->bi_iter.bi_sector;
 
 	if (bio->bi_phys_segments) {
 		unsigned long flags;
@@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
 	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
 		bio->bi_error = -EIO;
 
-	if (done) {
+	if (done)
 		bio_endio(bio);
-		/*
-		 * Wake up any possible resync thread that waits for the device
-		 * to go idle.
-		 */
-		allow_barrier(conf, start_next_window, bi_sector);
-	}
 }
 
 static void raid_end_bio_io(struct r1bio *r1_bio)
 {
 	struct bio *bio = r1_bio->master_bio;
+	struct r1conf *conf = r1_bio->mddev->private;
 
 	/* if nobody has done the final endio yet, do it now */
 	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
@@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
 
 		call_bio_endio(r1_bio);
 	}
+
+	/*
+	 * Wake up any possible resync thread that waits for the device
+	 * to go idle.
+	 */
+	allow_barrier(conf, r1_bio->sector);
 	free_r1bio(r1_bio);
 }
 
@@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
 	return mirror;
 }
 
+/* bi_end_io callback for regular READ bio */
 static void raid1_end_read_request(struct bio *bio)
 {
 	int uptodate = !bio->bi_error;
@@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
 		bio_put(to_put);
 }
 
+static sector_t align_to_barrier_unit_end(sector_t start_sector,
+					  sector_t sectors)
+{
+	sector_t len;
+
+	WARN_ON(sectors == 0);
+	/* len is the number of sectors from start_sector to end of the
+	 * barrier unit which start_sector belongs to.
+	 */
+	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
+	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
+	      start_sector;
+
+	if (len > sectors)
+		len = sectors;
+
+	return len;
+}
+
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
 		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
 	}
 	rcu_read_unlock();
+	sectors = align_to_barrier_unit_end(this_sector, sectors);
 	*max_sectors = sectors;
 
 	return best_disk;
@@ -779,168 +803,174 @@ static void flush_pending_writes(struct
  *    there is no normal IO happeing.  It must arrange to call
  *    lower_barrier when the particular background IO completes.
  */
+
 static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 {
+	long idx = get_barrier_bucket_idx(sector_nr);
+
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier++;
-	conf->next_resync = sector_nr;
+	conf->barrier[idx]++;
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
-	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
-	 *    the max count which allowed.
-	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
-	 *    next resync will reach to the window which normal bios are
-	 *    handling.
-	 * D: while there are any active requests in the current window.
+	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
+	 *    existing in sector number ranges corresponding to idx.
+	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
+	 *    the max count which allowed in sector number ranges
+	 *    corresponding to idx.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->array_frozen &&
-			    conf->barrier < RESYNC_DEPTH &&
-			    conf->current_window_requests == 0 &&
-			    (conf->start_next_window >=
-			     conf->next_resync + RESYNC_SECTORS),
+			    !conf->array_frozen && !conf->nr_pending[idx] &&
+			    conf->barrier[idx] < RESYNC_DEPTH,
 			    conf->resync_lock);
-
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void lower_barrier(struct r1conf *conf)
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	unsigned long flags;
-	BUG_ON(conf->barrier <= 0);
+	long idx = get_barrier_bucket_idx(sector_nr);
+
+	BUG_ON(conf->barrier[idx] <= 0);
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
-	conf->nr_pending--;
+	conf->barrier[idx]--;
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
-static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+/* A regular I/O should wait when,
+ * - The whole array is frozen (both READ and WRITE)
+ * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
+ */
+static void _wait_barrier(struct r1conf *conf, long idx)
 {
-	bool wait = false;
-
-	if (conf->array_frozen || !bio)
-		wait = true;
-	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
-		if ((conf->mddev->curr_resync_completed
-		     >= bio_end_sector(bio)) ||
-		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
-		     <= bio->bi_iter.bi_sector))
-			wait = false;
-		else
-			wait = true;
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->array_frozen || conf->barrier[idx]) {
+		conf->nr_waiting[idx]++;
+		/* Wait for the barrier to drop. */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen && !conf->barrier[idx],
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	return wait;
+	conf->nr_pending[idx]++;
+	spin_unlock_irq(&conf->resync_lock);
 }
 
-static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	sector_t sector = 0;
+	long idx = get_barrier_bucket_idx(sector_nr);
 
 	spin_lock_irq(&conf->resync_lock);
-	if (need_to_wait_for_sync(conf, bio)) {
-		conf->nr_waiting++;
-		/* Wait for the barrier to drop.
-		 * However if there are already pending
-		 * requests (preventing the barrier from
-		 * rising completely), and the
-		 * per-process bio queue isn't empty,
-		 * then don't wait, as we need to empty
-		 * that queue to allow conf->start_next_window
-		 * to increase.
-		 */
-		wait_event_lock_irq(conf->wait_barrier,
-				    !conf->array_frozen &&
-				    (!conf->barrier ||
-				     ((conf->start_next_window <
-				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
-		conf->nr_waiting--;
-	}
-
-	if (bio && bio_data_dir(bio) == WRITE) {
-		if (bio->bi_iter.bi_sector >= conf->next_resync) {
-			if (conf->start_next_window == MaxSector)
-				conf->start_next_window =
-					conf->next_resync +
-					NEXT_NORMALIO_DISTANCE;
-
-			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
-			    <= bio->bi_iter.bi_sector)
-				conf->next_window_requests++;
-			else
-				conf->current_window_requests++;
-			sector = conf->start_next_window;
-		}
+	if (conf->array_frozen) {
+		conf->nr_waiting[idx]++;
+		/* Wait for array to unfreeze */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen,
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
-
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
-	return sector;
 }
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector)
+static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	long idx = get_barrier_bucket_idx(sector_nr);
+
+	_wait_barrier(conf, idx);
+}
+
+static void wait_all_barriers(struct r1conf *conf)
+{
+	long idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_wait_barrier(conf, idx);
+}
+
+static void _allow_barrier(struct r1conf *conf, long idx)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending--;
-	if (start_next_window) {
-		if (start_next_window == conf->start_next_window) {
-			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-			    <= bi_sector)
-				conf->next_window_requests--;
-			else
-				conf->current_window_requests--;
-		} else
-			conf->current_window_requests--;
-
-		if (!conf->current_window_requests) {
-			if (conf->next_window_requests) {
-				conf->current_window_requests =
-					conf->next_window_requests;
-				conf->next_window_requests = 0;
-				conf->start_next_window +=
-					NEXT_NORMALIO_DISTANCE;
-			} else
-				conf->start_next_window = MaxSector;
-		}
-	}
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	long idx = get_barrier_bucket_idx(sector_nr);
+
+	_allow_barrier(conf, idx);
+}
+
+static void allow_all_barriers(struct r1conf *conf)
+{
+	long idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_allow_barrier(conf, idx);
+}
+
+
+/* conf->resync_lock should be held */
+static int get_all_pendings(struct r1conf *conf)
+{
+	long idx;
+	int ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_pending[idx];
+	return ret;
+}
+
+/* conf->resync_lock should be held */
+static int get_all_queued(struct r1conf *conf)
+{
+	long idx;
+	int  ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_queued[idx];
+	return ret;
+}
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
 	/* stop syncio and normal IO and wait for everything to
 	 * go quite.
-	 * We wait until nr_pending match nr_queued+extra
+	 * We wait until get_all_pending() matches get_all_queued()+extra,
+	 * which means sum of conf->nr_pending[] matches sum of
+	 * conf->nr_queued[] plus extra (which might be 0 or 1).
 	 * This is called in the context of one normal IO request
 	 * that has failed. Thus any sync request that might be pending
-	 * will be blocked by nr_pending, and we need to wait for
+	 * will be blocked by a conf->nr_pending[idx] which the idx depends
+	 * on the request's sector number, and we need to wait for
 	 * pending IO requests to complete or be queued for re-try.
-	 * Thus the number queued (nr_queued) plus this request (extra)
-	 * must match the number of pending IOs (nr_pending) before
-	 * we continue.
+	 * Thus the number queued (sum of conf->nr_queued[]) plus this
+	 * request (extra) must match the number of pending IOs (sum
+	 * of conf->nr_pending[]) before we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 1;
-	wait_event_lock_irq_cmd(conf->wait_barrier,
-				conf->nr_pending == conf->nr_queued+extra,
-				conf->resync_lock,
-				flush_pending_writes(conf));
+	wait_event_lock_irq_cmd(
+		conf->wait_barrier,
+		get_all_pendings(conf) == get_all_queued(conf)+extra,
+		conf->resync_lock,
+		flush_pending_writes(conf));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(struct r1conf *conf)
@@ -1031,6 +1061,7 @@ static void raid1_unplug(struct blk_plug
 	kfree(plug);
 }
 
+
 static void raid1_make_request(struct mddev *mddev, struct bio * bio)
 {
 	struct r1conf *conf = mddev->private;
@@ -1051,7 +1082,6 @@ static void raid1_make_request(struct md
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
-	sector_t start_next_window;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -1087,8 +1117,6 @@ static void raid1_make_request(struct md
 		finish_wait(&conf->wait_barrier, &w);
 	}
 
-	start_next_window = wait_barrier(conf, bio);
-
 	bitmap = mddev->bitmap;
 
 	/*
@@ -1121,6 +1149,14 @@ static void raid1_make_request(struct md
 		int rdisk;
 
 read_again:
+		/* Still need barrier for READ in case that whole
+		 * array is frozen.
+		 */
+		wait_read_barrier(conf, r1_bio->sector);
+		/* max_sectors from read_balance() is modified to not
+		 * go across the border of the barrier unit which
+		 * r1_bio->sector is in.
+		 */
 		rdisk = read_balance(conf, r1_bio, &max_sectors);
 
 		if (rdisk < 0) {
@@ -1140,7 +1176,6 @@ read_again:
 				   atomic_read(&bitmap->behind_writes) == 0);
 		}
 		r1_bio->read_disk = rdisk;
-		r1_bio->start_next_window = 0;
 
 		read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 		bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
@@ -1211,7 +1246,7 @@ read_again:
 
 	disks = conf->raid_disks * 2;
  retry_write:
-	r1_bio->start_next_window = start_next_window;
+	wait_barrier(conf, r1_bio->sector);
 	blocked_rdev = NULL;
 	rcu_read_lock();
 	max_sectors = r1_bio->sectors;
@@ -1279,27 +1314,17 @@ read_again:
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
 				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
 		r1_bio->state = 0;
-		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+		allow_barrier(conf, r1_bio->sector);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
-		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
 		goto retry_write;
 	}
 
+	max_sectors = align_to_barrier_unit_end(r1_bio->sector, max_sectors);
 	if (max_sectors < r1_bio->sectors) {
 		/* We are splitting this write into multiple parts, so
 		 * we need to prepare for allocating another r1_bio.
@@ -1495,19 +1520,11 @@ static void print_conf(struct r1conf *co
 
 static void close_sync(struct r1conf *conf)
 {
-	wait_barrier(conf, NULL);
-	allow_barrier(conf, 0, 0);
+	wait_all_barriers(conf);
+	allow_all_barriers(conf);
 
 	mempool_destroy(conf->r1buf_pool);
 	conf->r1buf_pool = NULL;
-
-	spin_lock_irq(&conf->resync_lock);
-	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests +=
-		conf->next_window_requests;
-	conf->next_window_requests = 0;
-	spin_unlock_irq(&conf->resync_lock);
 }
 
 static int raid1_spare_active(struct mddev *mddev)
@@ -1787,7 +1804,7 @@ static int fix_sync_read_error(struct r1
 	struct bio *bio = r1_bio->bios[r1_bio->read_disk];
 	sector_t sect = r1_bio->sector;
 	int sectors = r1_bio->sectors;
-	int idx = 0;
+	long idx = 0;
 
 	while(sectors) {
 		int s = sectors;
@@ -1983,6 +2000,14 @@ static void process_checks(struct r1bio
 	}
 }
 
+/* If no error is encountered during sync write out, there are two
+ * places to destroy r1_bio:
+ *  - sync_request_write(): if all wbio complete even before returning
+ *    back to its caller.
+ *  - end_sync_write(): when all remaining sync writes are done.
+ * When an error is encountered in the above functions, r1_bio will
+ * be handed to handle_sync_write_finished() by reschedule_retry().
+ */
 static void sync_request_write(struct mddev *mddev, struct r1bio *r1_bio)
 {
 	struct r1conf *conf = mddev->private;
@@ -2244,6 +2269,9 @@ static void handle_write_finished(struct
 {
 	int m;
 	bool fail = false;
+	sector_t sector_nr;
+	long idx;
+
 	for (m = 0; m < conf->raid_disks * 2 ; m++)
 		if (r1_bio->bios[m] == IO_MADE_GOOD) {
 			struct md_rdev *rdev = conf->mirrors[m].rdev;
@@ -2269,7 +2297,9 @@ static void handle_write_finished(struct
 	if (fail) {
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		conf->nr_queued++;
+		sector_nr = r1_bio->sector;
+		idx = get_barrier_bucket_idx(sector_nr);
+		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2380,6 +2410,8 @@ static void raid1d(struct md_thread *thr
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
+	sector_t sector_nr;
+	long idx;
 
 	md_check_recovery(mddev);
 
@@ -2387,17 +2419,19 @@ static void raid1d(struct md_thread *thr
 	    !test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
-		}
+		if (!test_bit(MD_CHANGE_PENDING, &mddev->flags))
+			list_splice_init(&conf->bio_end_io_list, &tmp);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
+
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			sector_nr = r1_bio->sector;
+			idx = get_barrier_bucket_idx(sector_nr);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued[idx]--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2418,7 +2452,9 @@ static void raid1d(struct md_thread *thr
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		conf->nr_queued--;
+		sector_nr = r1_bio->sector;
+		idx = get_barrier_bucket_idx(sector_nr);
+		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2457,7 +2493,6 @@ static int init_resync(struct r1conf *co
 					  conf->poolinfo);
 	if (!conf->r1buf_pool)
 		return -ENOMEM;
-	conf->next_resync = 0;
 	return 0;
 }
 
@@ -2486,6 +2521,7 @@ static sector_t raid1_sync_request(struc
 	int still_degraded = 0;
 	int good_sectors = RESYNC_SECTORS;
 	int min_bad = 0; /* number of sectors that are bad in all devices */
+	long idx = get_barrier_bucket_idx(sector_nr);
 
 	if (!conf->r1buf_pool)
 		if (init_resync(conf))
@@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting)
+	if (conf->nr_waiting[idx])
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
+	/* make sure good_sectors won't go across barrier unit border */
+	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct md_rdev *rdev;
@@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
 	if (!conf)
 		goto abort;
 
+	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!conf->nr_pending)
+		goto abort;
+
+	conf->nr_waiting = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!conf->nr_waiting)
+		goto abort;
+
+	conf->nr_queued = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!conf->nr_queued)
+		goto abort;
+
+	conf->barrier = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	if (!conf->barrier)
+		goto abort;
+
 	conf->mirrors = kzalloc(sizeof(struct raid1_info)
 				* mddev->raid_disks * 2,
 				 GFP_KERNEL);
@@ -2841,9 +2895,6 @@ static struct r1conf *setup_conf(struct
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests = conf->next_window_requests = 0;
-
 	err = -EIO;
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 
@@ -2890,6 +2941,10 @@ static struct r1conf *setup_conf(struct
 		kfree(conf->mirrors);
 		safe_put_page(conf->tmppage);
 		kfree(conf->poolinfo);
+		kfree(conf->nr_pending);
+		kfree(conf->nr_waiting);
+		kfree(conf->nr_queued);
+		kfree(conf->barrier);
 		kfree(conf);
 	}
 	return ERR_PTR(err);
@@ -2992,6 +3047,10 @@ static void raid1_free(struct mddev *mdd
 	kfree(conf->mirrors);
 	safe_put_page(conf->tmppage);
 	kfree(conf->poolinfo);
+	kfree(conf->nr_pending);
+	kfree(conf->nr_waiting);
+	kfree(conf->nr_queued);
+	kfree(conf->barrier);
 	kfree(conf);
 }
 
Index: linux-raid1/drivers/md/raid1.h
===================================================================
--- linux-raid1.orig/drivers/md/raid1.h
+++ linux-raid1/drivers/md/raid1.h
@@ -1,6 +1,20 @@
 #ifndef _RAID1_H
 #define _RAID1_H
 
+/* each barrier unit size is 64MB for now
+ * note: it must be larger than RESYNC_DEPTH
+ */
+#define BARRIER_UNIT_SECTOR_BITS	17
+#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
+#define BARRIER_BUCKETS_NR		(PAGE_SIZE/sizeof(long))
+
+/* will use bit shift later */
+static inline long get_barrier_bucket_idx(sector_t sector)
+{
+	return (long)(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR;
+
+}
+
 struct raid1_info {
 	struct md_rdev	*rdev;
 	sector_t	head_position;
@@ -35,25 +49,6 @@ struct r1conf {
 						 */
 	int			raid_disks;
 
-	/* During resync, read_balancing is only allowed on the part
-	 * of the array that has been resynced.  'next_resync' tells us
-	 * where that is.
-	 */
-	sector_t		next_resync;
-
-	/* When raid1 starts resync, we divide array into four partitions
-	 * |---------|--------------|---------------------|-------------|
-	 *        next_resync   start_next_window       end_window
-	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
-	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
-	 * current_window_requests means the count of normalIO between
-	 *   start_next_window and end_window.
-	 * next_window_requests means the count of normalIO after end_window.
-	 * */
-	sector_t		start_next_window;
-	int			current_window_requests;
-	int			next_window_requests;
-
 	spinlock_t		device_lock;
 
 	/* list of 'struct r1bio' that need to be processed by raid1d,
@@ -79,10 +74,10 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending;
-	int			nr_waiting;
-	int			nr_queued;
-	int			barrier;
+	int			*nr_pending;
+	int			*nr_waiting;
+	int			*nr_queued;
+	int			*barrier;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
@@ -135,7 +130,6 @@ struct r1bio {
 						 * in this BehindIO request
 						 */
 	sector_t		sector;
-	sector_t		start_next_window;
 	int			sectors;
 	unsigned long		state;
 	struct mddev		*mddev;


* [RFC PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2016-11-21 21:54 [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
@ 2016-11-21 21:54 ` Coly Li
  2016-11-22 21:58   ` Shaohua Li
  2016-11-22 21:35 ` [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
  2016-11-24  7:34 ` Guoqing Jiang
  2 siblings, 1 reply; 15+ messages in thread
From: Coly Li @ 2016-11-21 21:54 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

When I run a parallel read performance test on an md raid1 device built
from two NVMe SSDs, I observe surprisingly bad throughput: with fio, 64KB
block size, 40 sequential read I/O jobs and 128 iodepth, the overall
throughput is only 2.7GB/s, which is around 50% of the ideal performance
number.

perf reports that the lock contention happens in allow_barrier() and
wait_barrier(),
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
	 + 89.92% allow_barrier
	 + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
	 - 100.00% wait_barrier 

The reason is, in these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
They always hold conf->resync_lock first, even when there are only regular
read I/Os and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code, which
only holds conf->resync_lock when it is really necessary.

The original idea is from Hannes Reinecke, and Neil Brown provided
comments to improve it. Now I write the patch based on the new simpler
raid1 I/O barrier code.

In the new simpler raid1 I/O barrier implementation, there are two
wait-barrier functions,
 - wait_barrier()
   This in turn calls _wait_barrier() and is used for regular write I/O.
   If there is resync I/O happening on the same barrier bucket index, or
   the whole array is frozen, the task will wait until no barrier is
   raised on the same bucket index and the whole array is unfrozen.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   makes sure only up-to-date data will be read out), it is unnecessary
   to wait for the barrier in regular read I/Os; they only have to wait
   when the whole array is frozen.
The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and
conf->barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once
conf->nr_pending[idx] is increased, a resync I/O with the same barrier
bucket index has to wait in raise_barrier(). Then, in _wait_barrier() or
wait_read_barrier(), if no barrier is raised on the same barrier bucket
index and the array is not frozen, the regular I/O doesn't need to hold
conf->resync_lock at all; it just increases conf->nr_pending[idx] and
returns to its caller. For heavy parallel read I/O, the lockless I/O
barrier code almost completely gets rid of the spin lock cost; see the
sketch of the fast path below.
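
For reference, the fast path of _wait_barrier() looks like this (trimmed
from the diff below),
	atomic_inc(&conf->nr_pending[idx]);
	if (!atomic_read(&conf->barrier[idx]) &&
	    !atomic_read(&conf->array_frozen))
		/* no resync in this bucket, array not frozen: done,
		 * and conf->resync_lock was never taken
		 */
		return;

	/* otherwise fall back to the slow path: undo the increment, count
	 * ourselves in conf->nr_waiting[idx] and sleep under
	 * conf->resync_lock (see the full diff below)
	 */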

This patch significantly improves raid1 read performance. In my testing,
on a raid1 device built from two NVMe SSDs, fio with 64KB block size, 40
sequential read I/O jobs and 128 iodepth sees the overall throughput
increase from 2.7GB/s to 4.6GB/s (+70%).

Open question:
 - I am not comfortable with freeze_array() and unfreeze_array(): for
   write I/Os, if devices fail, wait_barrier() may race with
   freeze_array(). I am still looking for a solution.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 125 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------------------------------------------------
 drivers/md/raid1.h |  12 +++++------
 2 files changed, 75 insertions(+), 62 deletions(-)

Index: linux-raid1/drivers/md/raid1.c
===================================================================
--- linux-raid1.orig/drivers/md/raid1.c
+++ linux-raid1/drivers/md/raid1.c
@@ -222,8 +222,8 @@ static void reschedule_retry(struct r1bi
 
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
+	atomic_inc(&conf->nr_queued[idx]);
 
 	wake_up(&conf->wait_barrier);
 	md_wakeup_thread(mddev->thread);
@@ -811,11 +811,12 @@ static void raise_barrier(struct r1conf
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->nr_waiting[idx]),
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier[idx]++;
+	atomic_inc(&conf->barrier[idx]);
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
@@ -826,23 +827,21 @@ static void raise_barrier(struct r1conf
 	 *    corresponding to idx.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->array_frozen && !conf->nr_pending[idx] &&
-			    conf->barrier[idx] < RESYNC_DEPTH,
+			    !atomic_read(&conf->array_frozen) &&
+			    !atomic_read(&conf->nr_pending[idx]) &&
+			    atomic_read(&conf->barrier[idx]) < RESYNC_DEPTH,
 			    conf->resync_lock);
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
 static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	unsigned long flags;
 	long idx = get_barrier_bucket_idx(sector_nr);
 
-	BUG_ON(conf->barrier[idx] <= 0);
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier[idx]--;
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
+	atomic_dec(&conf->barrier[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -852,36 +851,55 @@ static void lower_barrier(struct r1conf
  */
 static void _wait_barrier(struct r1conf *conf, long idx)
 {
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen || conf->barrier[idx]) {
-		conf->nr_waiting[idx]++;
-		/* Wait for the barrier to drop. */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen && !conf->barrier[idx],
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/* We need to increase conf->nr_pending[idx] very early here,
+	 * then raise_barrier() can be blocked when it waits for
+	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
+	 * conf->resync_lock when there is no barrier raised in same
+	 * barrier unit bucket.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+	if (!atomic_read(&conf->barrier[idx]) &&
+	    !atomic_read(&conf->array_frozen))
+		return;
 
-	conf->nr_pending[idx]++;
+	/* After holding conf->resync_lock, conf->nr_pending[idx]
+	 * should be decreased before waiting for barrier to drop.
+	 * Otherwise, we may encounter a race condition because
+	 * raise_barrier() might be waiting for conf->nr_pending[idx]
+	 * to be 0 at same time.
+	 */
+	spin_lock_irq(&conf->resync_lock);
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/* Wait for the barrier in same barrier unit bucket to drop. */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->array_frozen) &&
+			    !atomic_read(&conf->barrier[idx]),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
+/* Very similar to _wait_barrier(), but only wait if array is frozen.
+ */
 static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	long idx = get_barrier_bucket_idx(sector_nr);
 
+	atomic_inc(&conf->nr_pending[idx]);
+	if (!atomic_read(&conf->array_frozen))
+		return;
+
 	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen) {
-		conf->nr_waiting[idx]++;
-		/* Wait for array to unfreeze */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen,
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
-	conf->nr_pending[idx]++;
+	/* Wait for array to unfreeze */
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->array_frozen),
+			    conf->resync_lock);
+	atomic_dec(&conf->nr_waiting[idx]);
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -902,11 +920,7 @@ static void wait_all_barriers(struct r1c
 
 static void _allow_barrier(struct r1conf *conf, long idx)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -933,7 +947,7 @@ static int get_all_pendings(struct r1con
 	int ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_pending[idx];
+		ret += atomic_read(&conf->nr_pending[idx]);
 	return ret;
 }
 
@@ -944,7 +958,7 @@ static int get_all_queued(struct r1conf
 	int  ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_queued[idx];
+		ret += atomic_read(&conf->nr_queued[idx]);
 	return ret;
 }
 
@@ -965,7 +979,7 @@ static void freeze_array(struct r1conf *
 	 * of conf->nr_pending[]) before we continue.
 	 */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 1;
+	atomic_set(&conf->array_frozen, 1);
 	wait_event_lock_irq_cmd(
 		conf->wait_barrier,
 		get_all_pendings(conf) == get_all_queued(conf)+extra,
@@ -977,7 +991,7 @@ static void unfreeze_array(struct r1conf
 {
 	/* reverse the effect of the freeze */
 	spin_lock_irq(&conf->resync_lock);
-	conf->array_frozen = 0;
+	atomic_set(&conf->array_frozen, 0);
 	wake_up(&conf->wait_barrier);
 	spin_unlock_irq(&conf->resync_lock);
 }
@@ -2295,12 +2309,16 @@ static void handle_write_finished(struct
 					 conf->mddev);
 		}
 	if (fail) {
+		/* set sector_nr before r1_bio add into conf->bio_end_io_list,
+		 * we can't touch r1_bio once it is in this list, because
+		 * it might be freed by raid_end_bio_io() in raid1d()
+		 */
+		sector_nr = r1_bio->sector;
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		sector_nr = r1_bio->sector;
-		idx = get_barrier_bucket_idx(sector_nr);
-		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
+		idx = get_barrier_bucket_idx(sector_nr);
+		atomic_inc(&conf->nr_queued[idx]);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
 		if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2410,7 +2428,6 @@ static void raid1d(struct md_thread *thr
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
-	sector_t sector_nr;
 	long idx;
 
 	md_check_recovery(mddev);
@@ -2427,11 +2444,8 @@ static void raid1d(struct md_thread *thr
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
-			sector_nr = r1_bio->sector;
-			idx = get_barrier_bucket_idx(sector_nr);
-			spin_lock_irqsave(&conf->device_lock, flags);
-			conf->nr_queued[idx]--;
-			spin_unlock_irqrestore(&conf->device_lock, flags);
+			idx = get_barrier_bucket_idx(r1_bio->sector);
+			atomic_dec(&conf->nr_queued[idx]);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2452,10 +2466,9 @@ static void raid1d(struct md_thread *thr
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		sector_nr = r1_bio->sector;
-		idx = get_barrier_bucket_idx(sector_nr);
-		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
+		idx = get_barrier_bucket_idx(r1_bio->sector);
+		atomic_dec(&conf->nr_queued[idx]);
 
 		mddev = r1_bio->mddev;
 		conf = mddev->private;
@@ -2571,7 +2584,7 @@ static sector_t raid1_sync_request(struc
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting[idx])
+	if (atomic_read(&conf->nr_waiting[idx]))
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -3224,7 +3237,7 @@ static void *raid1_takeover(struct mddev
 		conf = setup_conf(mddev);
 		if (!IS_ERR(conf))
 			/* Array must appear to be quiesced */
-			conf->array_frozen = 1;
+			atomic_set(&conf->array_frozen, 1);
 		return conf;
 	}
 	return ERR_PTR(-EINVAL);
Index: linux-raid1/drivers/md/raid1.h
===================================================================
--- linux-raid1.orig/drivers/md/raid1.h
+++ linux-raid1/drivers/md/raid1.h
@@ -6,7 +6,7 @@
  */
 #define BARRIER_UNIT_SECTOR_BITS	17
 #define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
-#define BARRIER_BUCKETS_NR		(PAGE_SIZE/sizeof(long))
+#define BARRIER_BUCKETS_NR		(PAGE_SIZE/sizeof(atomic_t))
 
 /* will use bit shift later */
 static inline long get_barrier_bucket_idx(sector_t sector)
@@ -74,11 +74,11 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			*nr_pending;
-	int			*nr_waiting;
-	int			*nr_queued;
-	int			*barrier;
-	int			array_frozen;
+	atomic_t		*nr_pending;
+	atomic_t		*nr_waiting;
+	atomic_t		*nr_queued;
+	atomic_t		*barrier;
+	atomic_t		array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
 	 * Cleared when a sync completes.


* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-21 21:54 [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
  2016-11-21 21:54 ` [RFC PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
@ 2016-11-22 21:35 ` Shaohua Li
  2016-11-23  9:05   ` Guoqing Jiang
                     ` (2 more replies)
  2016-11-24  7:34 ` Guoqing Jiang
  2 siblings, 3 replies; 15+ messages in thread
From: Shaohua Li @ 2016-11-22 21:35 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
> Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")
> introduces a sliding resync window for the raid1 I/O barrier. The idea is
> to limit I/O barriers to requests inside the sliding resync window, so
> regular I/Os outside of the resync window don't need to wait for the
> barrier any more. On a large raid1 device, this helps a lot to improve
> parallel write throughput when background resync I/O is running at the
> same time.
> 
> The idea of the sliding resync window is awesome, but there are several
> challenges which are very difficult to solve,
>  - code complexity
>    The sliding resync window requires several variables to work
>    collectively; this is complex and very hard to get right. Just grep
>    "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches
>    fixing the original resync window patch, and this is not the end, any
>    further related modification may easily introduce more regressions.
>  - multiple sliding resync windows
>    Currently the raid1 code has only a single sliding resync window, so we
>    cannot do parallel resync with the current I/O barrier implementation.
>    Implementing multiple resync windows would be much more complex and
>    very hard to get right.
> 
> Therefore I decided to implement a much simpler raid1 I/O barrier by
> removing the resync window code; I believe life will be much easier.
> 
> The brief idea of the simpler barrier is,
>  - Do not maintain a global unique resync window.
>  - Use multiple hash buckets to reduce I/O barrier conflicts: a regular
>    I/O only has to wait for a resync I/O when both of them hit the same
>    barrier bucket index, and vice versa.
>  - I/O barrier conflicts can be reduced to an acceptable level if there
>    are enough barrier buckets.
> 
> Here is how the barrier buckets are designed,
>  - BARRIER_UNIT_SECTOR_SIZE
>    The whole LBA address space of a raid1 device is divided into multiple
>    barrier units, each of size BARRIER_UNIT_SECTOR_SIZE sectors.
>    A bio request won't go across the border of a barrier unit, which means
>    the maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 bytes.
>  - BARRIER_BUCKETS_NR
>    There are BARRIER_BUCKETS_NR buckets in total; if multiple I/O requests
>    hit different barrier units, they only need to compete for the I/O
>    barrier with other I/Os which hit the same barrier bucket index. The
>    index of the barrier bucket which a bio should look at is calculated by
>    get_barrier_bucket_idx(),
> 	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>    where sector is the start sector number of the bio, and
>    align_to_barrier_unit_end() makes sure the final bio sent into
>    generic_make_request() won't exceed the border of the barrier unit.
>    The number of barrier buckets is defined by,
> 	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
>    For a 4KB page size, there are 512 buckets for each raid1 device, so
>    the probability of a fully random I/O barrier conflict may be reduced
>    to roughly 1/512.

Thanks! The idea is awesome and does make the code easier to understand.

For hash conflicts, I'm worried about one case: resync does sequential access.
Say we also have a sequential normal IO. If the first normal IO and the resync
IO conflict, it's likely they will keep conflicting afterwards, since both are
doing sequential access. We could change the hash to something like this:

for the first 64M * 512, the hash is 0, 1, ... 511
For the second 64M * 512, the hash is 1, 3, 5 ...
The third 64M * 512, the hash is 2, 5, 8 ...

It should be very easy to implement something like this, and it should reduce
the conflicts for sequential access; one possible reading of the scheme is
sketched below.
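
One way that could look, as a rough sketch (the region/offset naming here is
made up for illustration, it is not from the patch):

	/* rotate the bucket index with a per-region stride, so two
	 * sequential streams which collide in one 64M*512 range stop
	 * colliding in the next range
	 */
	static inline long get_barrier_bucket_idx(sector_t sector)
	{
		long unit   = (long)(sector >> BARRIER_UNIT_SECTOR_BITS);
		long region = unit / BARRIER_BUCKETS_NR; /* which 64M*512 range */
		long offset = unit % BARRIER_BUCKETS_NR; /* unit inside the range */

		/* region 0: 0,1,2,...  region 1: 1,3,5,...  region 2: 2,5,8,... */
		return (region + offset * (region + 1)) % BARRIER_BUCKETS_NR;
	}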
 
> Comparing to the single sliding resync window,
>  - Currently resync I/O grows linearly, so regular and resync I/O only
>    conflict within a single barrier unit; in this respect it is similar
>    to the single sliding resync window.
>  - But a barrier bucket is shared by all barrier units with an identical
>    barrier bucket index, so the probability of conflict might be higher
>    than with the single sliding resync window, in the case that write
>    I/Os always hit barrier units which have the same barrier bucket index
>    as the resync I/O. This is a very rare condition in real I/O
>    workloads; I cannot imagine how it could happen in practice.
>  - Therefore we can achieve a good enough low conflict rate with a much
>    simpler barrier algorithm and implementation.
> 
> If a user has a (really) large raid1 device, for example 10PB in size, we
> may just increase the bucket number BARRIER_BUCKETS_NR. For now this is a
> macro; it could become a variable defined at raid1 creation time in the
> future.
> 
> There are a few changes that should be noticed,
>  - In raid1d(), the code that decreases conf->nr_queued[idx] is moved into
>    the single loop which handles each r1_bio; it looks like this,
> 	spin_lock_irqsave(&conf->device_lock, flags);
> 	conf->nr_queued[idx]--;
> 	spin_unlock_irqrestore(&conf->device_lock, flags);
>    This change generates more spin lock operations, but in the next patch
>    of this patch set it will be replaced by a single line of code,
> 	atomic_dec(&conf->nr_queued[idx]);
>    so we don't need to worry about the spin lock cost here.
>  - In raid1_make_request(), wait_barrier() is replaced by,
>    a) wait_read_barrier(): wait for the barrier in the regular read I/O
>       code path
>    b) wait_barrier(): wait for the barrier in the regular write I/O code
>       path
>    The difference is that wait_read_barrier() only waits if the array is
>    frozen. I am not able to combine them into one function, because they
>    must receive different data types in their argument lists.
>  - align_to_barrier_unit_end() is called to make sure both regular and
>    resync I/O won't go across the border of a barrier unit.
>  
> Open question:
>  - This needs review from an md clustering developer; I don't touch the
>    related code for now.

Don't think it matters, but please keep your eyes on it, Guoqing!
 
> Signed-off-by: Coly Li <colyli@suse.de>
> Cc: Shaohua Li <shli@fb.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> Cc: Guoqing Jiang <gqjiang@suse.com>
> ---
>  drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
>  drivers/md/raid1.h |  42 +++++-------
>  2 files changed, 242 insertions(+), 189 deletions(-)
> 
> Index: linux-raid1/drivers/md/raid1.c
> ===================================================================
> --- linux-raid1.orig/drivers/md/raid1.c
> +++ linux-raid1/drivers/md/raid1.c
> @@ -66,9 +66,8 @@
>   */
>  static int max_queued_requests = 1024;
>  
> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> -			  sector_t bi_sector);
> -static void lower_barrier(struct r1conf *conf);
> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>  
>  static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
>  {
> @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>  
>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>  {
> @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
>  {
>  	struct r1conf *conf = r1_bio->mddev->private;
>  	int i;
> +	sector_t sector_nr = r1_bio->sector;
>  
>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>  		struct bio *bio = r1_bio->bios[i];
> @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
>  
>  	mempool_free(r1_bio, conf->r1buf_pool);
>  
> -	lower_barrier(conf);
> +	lower_barrier(conf, sector_nr);
>  }
>  
>  static void reschedule_retry(struct r1bio *r1_bio)
> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>  	unsigned long flags;
>  	struct mddev *mddev = r1_bio->mddev;
>  	struct r1conf *conf = mddev->private;
> +	sector_t sector_nr;
> +	long idx;
> +
> +	sector_nr = r1_bio->sector;
> +	idx = get_barrier_bucket_idx(sector_nr);
>  
>  	spin_lock_irqsave(&conf->device_lock, flags);
>  	list_add(&r1_bio->retry_list, &conf->retry_list);
> -	conf->nr_queued ++;
> +	conf->nr_queued[idx]++;
>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>  
>  	wake_up(&conf->wait_barrier);
> @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
>  	struct bio *bio = r1_bio->master_bio;
>  	int done;
>  	struct r1conf *conf = r1_bio->mddev->private;
> -	sector_t start_next_window = r1_bio->start_next_window;
> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>  
>  	if (bio->bi_phys_segments) {
>  		unsigned long flags;
> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>  		bio->bi_error = -EIO;
>  
> -	if (done) {
> +	if (done)
>  		bio_endio(bio);
> -		/*
> -		 * Wake up any possible resync thread that waits for the device
> -		 * to go idle.
> -		 */
> -		allow_barrier(conf, start_next_window, bi_sector);
> -	}
>  }
>  
>  static void raid_end_bio_io(struct r1bio *r1_bio)
>  {
>  	struct bio *bio = r1_bio->master_bio;
> +	struct r1conf *conf = r1_bio->mddev->private;
>  
>  	/* if nobody has done the final endio yet, do it now */
>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>  
>  		call_bio_endio(r1_bio);
>  	}
> +
> +	/*
> +	 * Wake up any possible resync thread that waits for the device
> +	 * to go idle.
> +	 */
> +	allow_barrier(conf, r1_bio->sector);
Why this change?

>  	free_r1bio(r1_bio);
>  }
>  
> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>  	return mirror;
>  }
>  
> +/* bi_end_io callback for regular READ bio */
>  static void raid1_end_read_request(struct bio *bio)
>  {
>  	int uptodate = !bio->bi_error;
> @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
>  		bio_put(to_put);
>  }
>  
> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
> +					  sector_t sectors)
> +{
> +	sector_t len;
> +
> +	WARN_ON(sectors == 0);
> +	/* len is the number of sectors from start_sector to end of the
> +	 * barrier unit which start_sector belongs to.
> +	 */
> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
> +	      start_sector;
> +
> +	if (len > sectors)
> +		len = sectors;
> +
> +	return len;
> +}

I'd prefer calling bio_split() early in raid1_make_request() and splitting the
bio if it crosses the bucket boundary; please see raid10_make_request() for an
example. Resync will not cross the boundary, so the rest of the code would not
need to worry about the boundary at all. I believe this will make the code
simpler (all the align_to_barrier_unit_end() calls can be removed) and easier
to understand, e.g. something like the sketch below.
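
A rough, untested sketch of that approach at the top of raid1_make_request(),
following the bio_split()/bio_chain() pattern raid10 uses (details would need
checking against the real code):

	/* split bios on barrier unit borders up front, so nothing
	 * downstream ever sees a bio spanning two buckets
	 */
	if ((bio->bi_iter.bi_sector & (BARRIER_UNIT_SECTOR_SIZE - 1)) +
	    bio_sectors(bio) > BARRIER_UNIT_SECTOR_SIZE) {
		struct bio *split;

		split = bio_split(bio, BARRIER_UNIT_SECTOR_SIZE -
				  (bio->bi_iter.bi_sector &
				   (BARRIER_UNIT_SECTOR_SIZE - 1)),
				  GFP_NOIO, fs_bio_set);
		bio_chain(split, bio);
		/* hand the remainder back to the block layer */
		generic_make_request(bio);
		/* continue with the front piece, which stays in one bucket */
		bio = split;
	}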

> +
>  /*
>   * This routine returns the disk from which the requested read should
>   * be done. There is a per-array 'next expected sequential IO' sector
> @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
>  		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>  	}
>  	rcu_read_unlock();
> +	sectors = align_to_barrier_unit_end(this_sector, sectors);
>  	*max_sectors = sectors;
>  
>  	return best_disk;
> @@ -779,168 +803,174 @@ static void flush_pending_writes(struct
>   *    there is no normal IO happeing.  It must arrange to call
>   *    lower_barrier when the particular background IO completes.
>   */
> +
>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
> +	long idx = get_barrier_bucket_idx(sector_nr);
> +
>  	spin_lock_irq(&conf->resync_lock);
>  
>  	/* Wait until no block IO is waiting */
> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>  			    conf->resync_lock);
>  
>  	/* block any new IO from starting */
> -	conf->barrier++;
> -	conf->next_resync = sector_nr;
> +	conf->barrier[idx]++;
>  
>  	/* For these conditions we must wait:
>  	 * A: while the array is in frozen state
> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
> -	 *    the max count which allowed.
> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
> -	 *    next resync will reach to the window which normal bios are
> -	 *    handling.
> -	 * D: while there are any active requests in the current window.
> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
> +	 *    existing in sector number ranges corresponding to idx.
> +	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
> +	 *    the max count which allowed in sector number ranges
> +	 *    conrresponding to idx.
>  	 */
>  	wait_event_lock_irq(conf->wait_barrier,
> -			    !conf->array_frozen &&
> -			    conf->barrier < RESYNC_DEPTH &&
> -			    conf->current_window_requests == 0 &&
> -			    (conf->start_next_window >=
> -			     conf->next_resync + RESYNC_SECTORS),
> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
> +			    conf->barrier[idx] < RESYNC_DEPTH,
>  			    conf->resync_lock);

So there is a slight behavior change. Originally we guarantee no more than
RESYNC_DEPTH outstanding sync requests. Now this only checks
'conf->barrier[idx] < RESYNC_DEPTH'. How about barrier[idx-1],
barrier[idx-2]...? It's possible that conf->barrier[idx] < RESYNC_DEPTH, but
barrier[idx] + barrier[idx-1] > RESYNC_DEPTH. Not sure how important this
is, but at least we can check barrier[idx] + barrier[idx-1], roughly as
sketched below.
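Purely illustrative, reusing the names from the patch (the modulo wrap of idx
at bucket 0 is my assumption):

	/* Hypothetical variant of the wait condition above: also count
	 * the previous bucket, so that two adjacent buckets together
	 * stay below RESYNC_DEPTH.
	 */
	long prev = (idx + BARRIER_BUCKETS_NR - 1) % BARRIER_BUCKETS_NR;

	wait_event_lock_irq(conf->wait_barrier,
			    !conf->array_frozen && !conf->nr_pending[idx] &&
			    (conf->barrier[idx] + conf->barrier[prev] <
			     RESYNC_DEPTH),
			    conf->resync_lock);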

> -
> -	conf->nr_pending++;
> +	conf->nr_pending[idx]++;
>  	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> -static void lower_barrier(struct r1conf *conf)
> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
>  	unsigned long flags;
> -	BUG_ON(conf->barrier <= 0);
> +	long idx = get_barrier_bucket_idx(sector_nr);
> +
> +	BUG_ON(conf->barrier[idx] <= 0);
>  	spin_lock_irqsave(&conf->resync_lock, flags);
> -	conf->barrier--;
> -	conf->nr_pending--;
> +	conf->barrier[idx]--;
> +	conf->nr_pending[idx]--;
>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>  	wake_up(&conf->wait_barrier);
>  }
>  
> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
> +/* A regular I/O should wait when,
> + * - The whole array is frozen (both READ and WRITE)
> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
> + */
> +static void _wait_barrier(struct r1conf *conf, long idx)
>  {
> -	bool wait = false;
> -
> -	if (conf->array_frozen || !bio)
> -		wait = true;
> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
> -		if ((conf->mddev->curr_resync_completed
> -		     >= bio_end_sector(bio)) ||
> -		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
> -		     <= bio->bi_iter.bi_sector))
> -			wait = false;
> -		else
> -			wait = true;
> +	spin_lock_irq(&conf->resync_lock);
> +	if (conf->array_frozen || conf->barrier[idx]) {
> +		conf->nr_waiting[idx]++;
> +		/* Wait for the barrier to drop. */
> +		wait_event_lock_irq(
> +			conf->wait_barrier,
> +			!conf->array_frozen && !conf->barrier[idx],
> +			conf->resync_lock);
> +		conf->nr_waiting[idx]--;
>  	}
>  
> -	return wait;
> +	conf->nr_pending[idx]++;
> +	spin_unlock_irq(&conf->resync_lock);
>  }
>  
> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>  {
> -	sector_t sector = 0;
> +	long idx = get_barrier_bucket_idx(sector_nr);
>  
>  	spin_lock_irq(&conf->resync_lock);
> -	if (need_to_wait_for_sync(conf, bio)) {
> -		conf->nr_waiting++;
> -		/* Wait for the barrier to drop.
> -		 * However if there are already pending
> -		 * requests (preventing the barrier from
> -		 * rising completely), and the
> -		 * per-process bio queue isn't empty,
> -		 * then don't wait, as we need to empty
> -		 * that queue to allow conf->start_next_window
> -		 * to increase.
> -		 */
> -		wait_event_lock_irq(conf->wait_barrier,
> -				    !conf->array_frozen &&
> -				    (!conf->barrier ||
> -				     ((conf->start_next_window <
> -				       conf->next_resync + RESYNC_SECTORS) &&
> -				      current->bio_list &&
> -				      !bio_list_empty(current->bio_list))),
> -				    conf->resync_lock);
> -		conf->nr_waiting--;
> -	}
> -
> -	if (bio && bio_data_dir(bio) == WRITE) {
> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
> -			if (conf->start_next_window == MaxSector)
> -				conf->start_next_window =
> -					conf->next_resync +
> -					NEXT_NORMALIO_DISTANCE;
> -
> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
> -			    <= bio->bi_iter.bi_sector)
> -				conf->next_window_requests++;
> -			else
> -				conf->current_window_requests++;
> -			sector = conf->start_next_window;
> -		}
> +	if (conf->array_frozen) {
> +		conf->nr_waiting[idx]++;
> +		/* Wait for array to unfreeze */
> +		wait_event_lock_irq(
> +			conf->wait_barrier,
> +			!conf->array_frozen,
> +			conf->resync_lock);
> +		conf->nr_waiting[idx]--;
>  	}
> -
> -	conf->nr_pending++;
> +	conf->nr_pending[idx]++;
>  	spin_unlock_irq(&conf->resync_lock);
> -	return sector;
>  }
>  
> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> -			  sector_t bi_sector)
> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
> +{
> +	long idx = get_barrier_bucket_idx(sector_nr);
> +
> +	_wait_barrier(conf, idx);
> +}
> +
> +static void wait_all_barriers(struct r1conf *conf)
> +{
> +	long idx;
> +
> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> +		_wait_barrier(conf, idx);

This is racy. Since resync is sequential, ideally we only need to check the
buckets sequentially. The compiler could do _wait_barrier(conf, 1) and then
_wait_barrier(conf, 0). It's possible that bucket 1 gets a barrier right
after the check of bucket 0. Even if the compiler doesn't reorder, there is
a case where bucket 511 should be checked before bucket 0, if the resync is
just crossing the 512 buckets. It would be better if we checked the resync
position and adjusted the bucket wait order according to that position,
roughly as sketched below.
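Something like this (next_resync here just stands for whatever records the
current resync position; the exact field to use is an assumption on my side):

	long start = get_barrier_bucket_idx(conf->next_resync);
	long i;

	/* Wait for the buckets in the order the resync will reach them,
	 * starting from the bucket the resync is currently in.
	 */
	for (i = 0; i < BARRIER_BUCKETS_NR; i++)
		_wait_barrier(conf, (start + i) % BARRIER_BUCKETS_NR);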

>  	int good_sectors = RESYNC_SECTORS;
>  	int min_bad = 0; /* number of sectors that are bad in all devices */
> +	long idx = get_barrier_bucket_idx(sector_nr);
>  
>  	if (!conf->r1buf_pool)
>  		if (init_resync(conf))
> @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
>  	 * If there is non-resync activity waiting for a turn, then let it
>  	 * though before starting on this new sync request.
>  	 */
> -	if (conf->nr_waiting)
> +	if (conf->nr_waiting[idx])
>  		schedule_timeout_uninterruptible(1);
>  
>  	/* we are incrementing sector_nr below. To be safe, we check against
> @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
>  	r1_bio->sector = sector_nr;
>  	r1_bio->state = 0;
>  	set_bit(R1BIO_IsSync, &r1_bio->state);
> +	/* make sure good_sectors won't go across barrier unit border */
> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>  
>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>  		struct md_rdev *rdev;
> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>  	if (!conf)
>  		goto abort;
>  
> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +	if (!conf->nr_pending)
> +		goto abort;

This allocation is a little weird. I'd define a count and use
sizeof(nr_pending) * count to do the allocation. There is nothing related to
PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.
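For example, a minimal sketch of the count-based style (BARRIER_BUCKETS_NR
and the r1conf fields are from your patch; kcalloc() here only illustrates
the idea, not a request for that exact call):

	/* Allocate the per-bucket counters by element count rather
	 * than by PAGE_SIZE.
	 */
	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
				   sizeof(*conf->nr_pending), GFP_KERNEL);
	if (!conf->nr_pending)
		goto abort;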

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2016-11-21 21:54 ` [RFC PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
@ 2016-11-22 21:58   ` Shaohua Li
  0 siblings, 0 replies; 15+ messages in thread
From: Shaohua Li @ 2016-11-22 21:58 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-raid, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

On Tue, Nov 22, 2016 at 05:54:01AM +0800, Coly Li wrote:
> When I run a parallel reading performance test on an md raid1 device with
> two NVMe SSDs, I observe surprisingly bad throughput: with fio at 64KB
> block size, 40 seq read I/O jobs and 128 iodepth, overall throughput is
> only 2.7GB/s, which is around 50% of the ideal performance number.
> 
> perf reports that the locking contention happens in the allow_barrier()
> and wait_barrier() code,
>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
> 	 + 89.92% allow_barrier
> 	 + 9.34% __wake_up
>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>    - _raw_spin_lock_irq
> 	 - 100.00% wait_barrier 
> 
> The reason is that these I/O barrier related functions,
>  - raise_barrier()
>  - lower_barrier()
>  - wait_barrier()
>  - allow_barrier()
> always hold conf->resync_lock first, even when there are only regular
> read I/Os and no resync I/O at all. This is a huge performance penalty.
> 
> The solution is a lockless-like algorithm in the I/O barrier code, which
> only holds conf->resync_lock when it is really necessary.
> 
> The original idea is from Hannes Reinecke, and Neil Brown provides
> comments to improve it. Now I write the patch based on new simpler raid1
> I/O barrier code.
> 
> In the new simpler raid1 I/O barrier implementation, there are two
> wait barrier functions,
>  - wait_barrier()
>    Which in turn calls _wait_barrier(), and is used for regular write I/O.
>    If there is resync I/O happening on the same barrier bucket index, or
>    the whole array is frozen, the task will wait until there is no barrier
>    on the same bucket index, or the whole array is unfrozen.
>  - wait_read_barrier()
>    Since regular read I/O won't interfere with resync I/O (read_balance()
>    will make sure only up-to-date data is read out), it is unnecessary to
>    wait for the barrier in regular read I/Os; they only have to wait when
>    the whole array is frozen.
> The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and
> conf->barrier[idx] are very carefully designed in raise_barrier(),
> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
> avoid unnecessary spin locks in these functions. Once
> conf->nr_pending[idx] is increased, a resync I/O with the same barrier
> bucket index has to wait in raise_barrier(). Then in _wait_barrier() or
> wait_read_barrier(), if no barrier is raised on the same barrier bucket
> index and the array is not frozen, the regular I/O doesn't need to hold
> conf->resync_lock; it can just increase conf->nr_pending[idx] and return
> to its caller. For heavily parallel read I/Os, the lockless I/O barrier
> code gets rid of almost all spin lock cost.
> 
> This patch significantly improves raid1 read performance. In my testing,
> a raid1 device built from two NVMe SSDs, running fio with 64KB blocksize,
> 40 seq read I/O jobs and 128 iodepth, sees overall throughput increase
> from 2.7GB/s to 4.6GB/s (+70%).
> 
> Open question:
>  - I am not comfortable with freeze_array() and unfreeze_array(); for
>    write I/Os when a device fails, wait_barrier() may race with
>    freeze_array(). I am still looking for a solution.

I have only done a rough review so far; this one needs more time to look at.

Since you make all the counts atomic, is it safe to check several atomics at
the same time? For example, in freeze_array() we check both nr_queued and
nr_pending, but they are updated without a lock, so the updates could be
observed in any order. Please add comments to explain.

Also, you need to add smp_mb__after_atomic() and friends to maintain
ordering. An example is _wait_barrier():
'atomic_inc(&conf->nr_pending[idx])' should happen before the read of
'barrier', roughly as sketched below.
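Assuming the counters become atomic_t as in this patch (the structure of the
slow path here is only illustrative):

	static void _wait_barrier(struct r1conf *conf, long idx)
	{
		/* Publish this regular I/O before looking at the barrier. */
		atomic_inc(&conf->nr_pending[idx]);
		/*
		 * Order the increment before the reads below; this needs a
		 * pairing barrier after conf->barrier[idx] is increased in
		 * raise_barrier().
		 */
		smp_mb__after_atomic();

		if (!READ_ONCE(conf->array_frozen) &&
		    !atomic_read(&conf->barrier[idx]))
			return;	/* lockless fast path */

		/* Slow path: undo the count and wait under resync_lock. */
		atomic_dec(&conf->nr_pending[idx]);
		spin_lock_irq(&conf->resync_lock);
		atomic_inc(&conf->nr_waiting[idx]);
		wait_event_lock_irq(conf->wait_barrier,
				    !conf->array_frozen &&
				    !atomic_read(&conf->barrier[idx]),
				    conf->resync_lock);
		atomic_inc(&conf->nr_pending[idx]);
		atomic_dec(&conf->nr_waiting[idx]);
		spin_unlock_irq(&conf->resync_lock);
	}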

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-22 21:35 ` [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
@ 2016-11-23  9:05   ` Guoqing Jiang
  2016-11-24  5:45   ` NeilBrown
  2016-11-28  6:42   ` Coly Li
  2 siblings, 0 replies; 15+ messages in thread
From: Guoqing Jiang @ 2016-11-23  9:05 UTC (permalink / raw)
  To: Shaohua Li, Coly Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn



On 11/23/2016 05:35 AM, Shaohua Li wrote:
> On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> I/O barriers to happen only inside a slidingresync window, for regular
>> I/Os out of this resync window they don't need to wait for barrier any
>> more. On large raid1 device, it helps a lot to improve parallel writing
>> I/O throughput when there are background resync I/Os performing at
>> same time.
>>
>> The idea of sliding resync widow is awesome, but there are several
>> challenges are very difficult to solve,
>>   - code complexity
>>     Sliding resync window requires several veriables to work collectively,
>>     this is complexed and very hard to make it work correctly. Just grep
>>     "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>>     the original resync window patch. This is not the end, any further
>>     related modification may easily introduce more regreassion.
>>   - multiple sliding resync windows
>>     Currently raid1 code only has a single sliding resync window, we cannot
>>     do parallel resync with current I/O barrier implementation.
>>     Implementing multiple resync windows are much more complexed, and very
>>     hard to make it correctly.
>>
>> Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> removing resync window code, I believe life will be much easier.
>>
>> The brief idea of the simpler barrier is,
>>   - Do not maintain a logbal unique resync window
>>   - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>>     I/O only has to wait for a resync I/O when both them have same barrier
>>     bucket index, vice versa.
>>   - I/O barrier can be recuded to an acceptable number if there are enought
>>     barrier buckets
>>
>> Here I explain how the barrier buckets are designed,
>>   - BARRIER_UNIT_SECTOR_SIZE
>>     The whole LBA address space of a raid1 device is divided into multiple
>>     barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>>     Bio request won't go across border of barrier unit size, that means
>>     maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
>>   - BARRIER_BUCKETS_NR
>>     There are BARRIER_BUCKETS_NR buckets in total, if multiple I/O requests
>>     hit different barrier units, they only need to compete I/O barrier with
>>     other I/Os which hit the same barrier bucket index with each other. The
>>     index of a barrier bucket which a bio should look for is calculated by
>>     get_barrier_bucket_idx(),
>> 	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>>     sector is the start sector number of a bio. align_to_barrier_unit_end()
>>     will make sure the finall bio sent into generic_make_request() won't
>>     exceed border of the barrier unit size.
>>   - RRIER_BUCKETS_NR
>>     Number of barrier buckets is defined by,
>> 	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
>>     For 4KB page size, there are 512 buckets for each raid1 device. That
>>     means the propobility of full random I/O barrier confliction may be
>>     reduced down to 1/512.
> Thanks! The idea is awesome and does make the code easier to understand.

Fully agree!

> Open question:
>   - Need review from md clustring developer, I don't touch related code now.
> Don't think it matters, but please open eyes, Guoqing!

Thanks for the reminder, I agree.

Anyway, I will try to comment on it, though I am stuck with lvm2 bugs now,
and I will run some tests with the two patches applied.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-22 21:35 ` [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
  2016-11-23  9:05   ` Guoqing Jiang
@ 2016-11-24  5:45   ` NeilBrown
  2016-11-24  6:05     ` Guoqing Jiang
  2016-11-28  6:59     ` Coly Li
  2016-11-28  6:42   ` Coly Li
  2 siblings, 2 replies; 15+ messages in thread
From: NeilBrown @ 2016-11-24  5:45 UTC (permalink / raw)
  To: Shaohua Li, Coly Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 7494 bytes --]

On Wed, Nov 23 2016, Shaohua Li wrote:
>
> Thanks! The idea is awesome and does make the code easier to understand.

This is a key point - simpler code!!

>
> For hash conflict, I'm worrying about one case. resync does sequential access.
> Say we have a sequential normal IO. If the first normal IO and resync IO have
> conflict, it's possible they will have conflict subsequently, since both are
> doing sequential access. We can change the hash to something like this:
>
> for the first 64M * 512, the hash is 0, 1, ... 511
> For the second 64M * 512, the hash is 1, 3, 5 ...
> The third 64M * 512, the hash is 2, 5, 8 ...

This would happen when the write and the resync are at different
addresses which happen to collide in the hash.
They would only remain in sync if the two proceeded at the same speed.
Normally resync is throttled when there is competing IO, so that is
fairly unlikely.

If this really were important, it might make sense to just use hash_long()
to turn the relevant bits of the sector number into an index, for example
as below. (i.e. keep the code simple)
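A minimal sketch (BARRIER_BUCKETS_NR_BITS is a name I am inventing here for
log2 of BARRIER_BUCKETS_NR; with 512 buckets it would be 9):

	#include <linux/hash.h>

	#define BARRIER_BUCKETS_NR_BITS	9
	#define BARRIER_BUCKETS_NR	(1 << BARRIER_BUCKETS_NR_BITS)

	static inline long get_barrier_bucket_idx(sector_t sector)
	{
		/* Hash the barrier-unit number rather than taking it
		 * modulo BARRIER_BUCKETS_NR, so that sequential units
		 * spread across the buckets.
		 */
		return (long)hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
				       BARRIER_BUCKETS_NR_BITS);
	}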

>> 
>> If user has a (realy) large raid1 device, for example 10PB size, we may
>> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
>> it is possible to be a raid1-created-time-defined variable in future.

I don't think the size of the array is very relevant.  No matter how
big the array is, resync can be expected to directly interfere with 0.2%
of all writes.  So, assuming a fairly random distribution over times,
most writes will most often not be blocked by resync at all.

I wouldn't give any thought to any problems like this until they are
actually measured.

>>  static void reschedule_retry(struct r1bio *r1_bio)
>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>  	unsigned long flags;
>>  	struct mddev *mddev = r1_bio->mddev;
>>  	struct r1conf *conf = mddev->private;
>> +	sector_t sector_nr;
>> +	long idx;

I wonder about the use of "long" for "idx".
As that is an array index, it would have to be a REALLY big array before
"long" is better than "int".

>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>  		bio->bi_error = -EIO;
>>  
>> -	if (done) {
>> +	if (done)
>>  		bio_endio(bio);
>> -		/*
>> -		 * Wake up any possible resync thread that waits for the device
>> -		 * to go idle.
>> -		 */
>> -		allow_barrier(conf, start_next_window, bi_sector);
>> -	}
>>  }
>>  
>>  static void raid_end_bio_io(struct r1bio *r1_bio)
>>  {
>>  	struct bio *bio = r1_bio->master_bio;
>> +	struct r1conf *conf = r1_bio->mddev->private;
>>  
>>  	/* if nobody has done the final endio yet, do it now */
>>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>  
>>  		call_bio_endio(r1_bio);
>>  	}
>> +
>> +	/*
>> +	 * Wake up any possible resync thread that waits for the device
>> +	 * to go idle.
>> +	 */
>> +	allow_barrier(conf, r1_bio->sector);
> Why this change?

I wondered too. I think it may be correct, but it should be in a
separate patch.
When you have a write-mostly device, I think the current code will call
allow_barrier() before the writes to the write-mostly devices have
completed.

>>  
>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>> +					  sector_t sectors)
>> +{
>> +	sector_t len;
>> +
>> +	WARN_ON(sectors == 0);
>> +	/* len is the number of sectors from start_sector to end of the
>> +	 * barrier unit which start_sector belongs to.
>> +	 */
>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>> +	      start_sector;
>> +
>> +	if (len > sectors)
>> +		len = sectors;
>> +
>> +	return len;
>> +}
>
> I'd prefer calling bio_split at the early of raid1_make_request and split the
> bio if it crosses the bucket boundary. please see raid10_make_request for
> example. resync will not cross the boundary. So the code will not need to worry
> about the boundary. I believe this will make the code simpiler (all the
> align_to_barrier_unit_end calling can removed) and easy to understand.

align_to_barrier_unit_end() is only called twice, once for writes and
once for resync/recovery.  If bio_split() were used, you would still
need the same number of calls.

Changing raid1.c to use bio_split() more may well be a valuable
simplification, but I think it is a separate issue.

>>  	wait_event_lock_irq(conf->wait_barrier,
>> -			    !conf->array_frozen &&
>> -			    conf->barrier < RESYNC_DEPTH &&
>> -			    conf->current_window_requests == 0 &&
>> -			    (conf->start_next_window >=
>> -			     conf->next_resync + RESYNC_SECTORS),
>> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
>> +			    conf->barrier[idx] < RESYNC_DEPTH,
>>  			    conf->resync_lock);
>
> So there is a slight behavior change. Originally we guarautee no more than
> RESYNC_DEPTH sync. Now this only checks 'conf->barrier[idx] < RESYNC_DEPTH'.
> How about barrier[idx-1], barrier[idx-2]...? It's possible conf->barrier[idx] <
> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH. Not sure how
> important this is, but at least we can check barrier[idx] + barrier[idx-1].

I confess that I'm not entirely sure of the point of RESYNC_DEPTH - it was
already in the code when I started working on it.
My best guess is that it limits the number of requests we need to wait
for before regular writes can be permitted in the resync region.
In that case, the current code is good enough.

Your "barrier[idx] + barrier[idx-1]" idea is interesting, but would only
make a difference 1/1024 of the time (bucket is 64Meg, the
RESYNC_BLOCK_SIZE is 64K).


>> +
>> +static void wait_all_barriers(struct r1conf *conf)
>> +{
>> +	long idx;
>> +
>> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>> +		_wait_barrier(conf, idx);
>
> This is racy. Since resync is sequential, idealy we only need to check the
> bucket sequentially. The compiler could do  _wait_barrier(conf, 1) and then
> _wait_barrier(conf, 0). It's possible the bucket 1 has barrier right after the
> check of bucket 0. Even the compiler doesn't reorder, there is case the bucket
> 511 should be check before bucket 0 if the resync is just crossing 512 buckets.
> It would be better we check the resync position and adjust the bucket wait
> order according to the position.

I cannot see how it is racy.  No new sync requests will be issued at
this time, so once ->barrier[idx] reaches 0, it will stay at zero.

>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>  	if (!conf)
>>  		goto abort;
>>  
>> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> +	if (!conf->nr_pending)
>> +		goto abort;
>
> This allocation is a little wierd. I'd define a count and uses
> sizeof(nr_pending) * count to do the allocation. There is nothing related to
> PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.

The thing about PAGE_SIZE is that the allocation will succeed and won't
waste space.
If you make all the arrays simple members of r1conf, r1conf will be
about 4 pages larger, and so will require an 8-page allocation, which is more
likely to fail.
I agree that it seems a little weird, but I think it results in the best
use of memory.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 800 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-24  5:45   ` NeilBrown
@ 2016-11-24  6:05     ` Guoqing Jiang
  2016-11-28  6:59     ` Coly Li
  1 sibling, 0 replies; 15+ messages in thread
From: Guoqing Jiang @ 2016-11-24  6:05 UTC (permalink / raw)
  To: NeilBrown, Shaohua Li, Coly Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn



On 11/24/2016 01:45 PM, NeilBrown wrote:
>
>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>>   	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>>   		bio->bi_error = -EIO;
>>>   
>>> -	if (done) {
>>> +	if (done)
>>>   		bio_endio(bio);
>>> -		/*
>>> -		 * Wake up any possible resync thread that waits for the device
>>> -		 * to go idle.
>>> -		 */
>>> -		allow_barrier(conf, start_next_window, bi_sector);
>>> -	}
>>>   }
>>>   
>>>   static void raid_end_bio_io(struct r1bio *r1_bio)
>>>   {
>>>   	struct bio *bio = r1_bio->master_bio;
>>> +	struct r1conf *conf = r1_bio->mddev->private;
>>>   
>>>   	/* if nobody has done the final endio yet, do it now */
>>>   	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>>   
>>>   		call_bio_endio(r1_bio);
>>>   	}
>>> +
>>> +	/*
>>> +	 * Wake up any possible resync thread that waits for the device
>>> +	 * to go idle.
>>> +	 */
>>> +	allow_barrier(conf, r1_bio->sector);
>> Why this change?
> I wondered too.  I think it may be correct, but it should be in a
> separate patch.
> When you have a write-mostly device, I think the current code will
> allow_barrier() before the writes to the write-mostly devices have
> completed.
>

It seems the change is moved from call_bio_endio(), but call_bio_endio() is
also called from raid1_end_write_request(), so I think it is better to keep
the original code.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-21 21:54 [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
  2016-11-21 21:54 ` [RFC PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
  2016-11-22 21:35 ` [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
@ 2016-11-24  7:34 ` Guoqing Jiang
  2016-11-28  7:33   ` Coly Li
  2 siblings, 1 reply; 15+ messages in thread
From: Guoqing Jiang @ 2016-11-24  7:34 UTC (permalink / raw)
  To: Coly Li, linux-raid; +Cc: Shaohua Li, Neil Brown, Johannes Thumshirn

Hi Coly,

Please see below comments, just FYI.

On 11/22/2016 05:54 AM, Coly Li wrote:
>   - In raid1_make_request(), wait_barrier() is replaced by,
>     a) wait_read_barrier(): wait barrier in regular read I/O code path
>     b) wait_barrier(): wait barrier in regular write I/O code path
>     The differnece is wait_read_barrier() only waits if array is frozen, I
>     am not able to combile them into one function, because they must receive
>     differnet data types in their arguments list.

Maybe it is possible to add a parameter to distinguish read and write, so
that the two functions can be unified.

>   - align_to_barrier_unit_end() is called to make sure both regular and
>     resync I/O won't go across the border of a barrier unit size.
>   
> Open question:
>   - Need review from md clustring developer, I don't touch related code now.

I have not found any problems in my tests so far.

>   static void reschedule_retry(struct r1bio *r1_bio)
> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>   	unsigned long flags;
>   	struct mddev *mddev = r1_bio->mddev;
>   	struct r1conf *conf = mddev->private;
> +	sector_t sector_nr;
> +	long idx;
> +
> +	sector_nr = r1_bio->sector;
> +	idx = get_barrier_bucket_idx(sector_nr);

Isn't "idx = get_barrier_bucket_idx(r1_bio->sector)" enough here?

>   @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>   	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>   		bio->bi_error = -EIO;
>   
> -	if (done) {
> +	if (done)
>   		bio_endio(bio);
> -		/*
> -		 * Wake up any possible resync thread that waits for the device
> -		 * to go idle.
> -		 */
> -		allow_barrier(conf, start_next_window, bi_sector);
> -	}
>   }
>   
>   static void raid_end_bio_io(struct r1bio *r1_bio)
>   {
>   	struct bio *bio = r1_bio->master_bio;
> +	struct r1conf *conf = r1_bio->mddev->private;
>   
>   	/* if nobody has done the final endio yet, do it now */
>   	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>   
>   		call_bio_endio(r1_bio);
>   	}
> +
> +	/*
> +	 * Wake up any possible resync thread that waits for the device
> +	 * to go idle.
> +	 */
> +	allow_barrier(conf, r1_bio->sector);
>   	free_r1bio(r1_bio);
>   }

I am not sure the above change is safe, since call_bio_endio() is not only
called from raid_end_bio_io().

>   
> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>   	return mirror;
>   }
>   
> +/* bi_end_io callback for regular READ bio */

This is not related to the patch itself, but it would be better to make
similar changes in separate patches.

> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
> +/* A regular I/O should wait when,
> + * - The whole array is frozen (both READ and WRITE)
> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
> + */
> +static void _wait_barrier(struct r1conf *conf, long idx)
>   {
> -	bool wait = false;
> -
> -	if (conf->array_frozen || !bio)
> -		wait = true;
> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
> -		if ((conf->mddev->curr_resync_completed
> -		     >= bio_end_sector(bio)) ||
> -		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
> -		     <= bio->bi_iter.bi_sector))
> -			wait = false;
> -		else
> -			wait = true;
> +	spin_lock_irq(&conf->resync_lock);
> +	if (conf->array_frozen || conf->barrier[idx]) {
> +		conf->nr_waiting[idx]++;
> +		/* Wait for the barrier to drop. */
> +		wait_event_lock_irq(
> +			conf->wait_barrier,
> +			!conf->array_frozen && !conf->barrier[idx],
> +			conf->resync_lock);
> +		conf->nr_waiting[idx]--;
>   	}
>   
> -	return wait;
> +	conf->nr_pending[idx]++;
> +	spin_unlock_irq(&conf->resync_lock);
>   }
>   
> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>   {
> -	sector_t sector = 0;
> +	long idx = get_barrier_bucket_idx(sector_nr);
>   
>   	spin_lock_irq(&conf->resync_lock);
> -	if (need_to_wait_for_sync(conf, bio)) {
> -		conf->nr_waiting++;
> -		/* Wait for the barrier to drop.
> -		 * However if there are already pending
> -		 * requests (preventing the barrier from
> -		 * rising completely), and the
> -		 * per-process bio queue isn't empty,
> -		 * then don't wait, as we need to empty
> -		 * that queue to allow conf->start_next_window
> -		 * to increase.
> -		 */
> -		wait_event_lock_irq(conf->wait_barrier,
> -				    !conf->array_frozen &&
> -				    (!conf->barrier ||
> -				     ((conf->start_next_window <
> -				       conf->next_resync + RESYNC_SECTORS) &&
> -				      current->bio_list &&
> -				      !bio_list_empty(current->bio_list))),
> -				    conf->resync_lock);
> -		conf->nr_waiting--;
> -	}
> -
> -	if (bio && bio_data_dir(bio) == WRITE) {
> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
> -			if (conf->start_next_window == MaxSector)
> -				conf->start_next_window =
> -					conf->next_resync +
> -					NEXT_NORMALIO_DISTANCE;
> -
> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
> -			    <= bio->bi_iter.bi_sector)
> -				conf->next_window_requests++;
> -			else
> -				conf->current_window_requests++;
> -			sector = conf->start_next_window;
> -		}
> +	if (conf->array_frozen) {
> +		conf->nr_waiting[idx]++;
> +		/* Wait for array to unfreeze */
> +		wait_event_lock_irq(
> +			conf->wait_barrier,
> +			!conf->array_frozen,
> +			conf->resync_lock);
> +		conf->nr_waiting[idx]--;
>   	}
> -
> -	conf->nr_pending++;
> +	conf->nr_pending[idx]++;
>   	spin_unlock_irq(&conf->resync_lock);
> -	return sector;
>   }
>   
> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> -			  sector_t bi_sector)
> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
> +{
> +	long idx = get_barrier_bucket_idx(sector_nr);
> +
> +	_wait_barrier(conf, idx);
> +}
> +

I personally prefer to use one wait_barrier to cover both read and 
write, something like:

wait_barrier(struct r1conf *conf, long idx, int direction)

BTW: there are some unnecessary changes inside the patch; maybe you can
remove them or introduce separate patches for them.

Regards,
Guoqing

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-22 21:35 ` [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
  2016-11-23  9:05   ` Guoqing Jiang
  2016-11-24  5:45   ` NeilBrown
@ 2016-11-28  6:42   ` Coly Li
  2016-11-29 19:29     ` Shaohua Li
  2 siblings, 1 reply; 15+ messages in thread
From: Coly Li @ 2016-11-28  6:42 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On 2016/11/23 上午5:35, Shaohua Li wrote:
> On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> I/O barriers to happen only inside a slidingresync window, for regular
>> I/Os out of this resync window they don't need to wait for barrier any
>> more. On large raid1 device, it helps a lot to improve parallel writing
>> I/O throughput when there are background resync I/Os performing at
>> same time. 
>>
>> The idea of sliding resync widow is awesome, but there are several
>> challenges are very difficult to solve,
>>  - code complexity
>>    Sliding resync window requires several veriables to work collectively,
>>    this is complexed and very hard to make it work correctly. Just grep
>>    "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>>    the original resync window patch. This is not the end, any further
>>    related modification may easily introduce more regreassion.
>>  - multiple sliding resync windows
>>    Currently raid1 code only has a single sliding resync window, we cannot
>>    do parallel resync with current I/O barrier implementation.
>>    Implementing multiple resync windows are much more complexed, and very
>>    hard to make it correctly.
>>
>> Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> removing resync window code, I believe life will be much easier.
>>
>> The brief idea of the simpler barrier is,
>>  - Do not maintain a logbal unique resync window
>>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>>    I/O only has to wait for a resync I/O when both them have same barrier
>>    bucket index, vice versa.
>>  - I/O barrier can be recuded to an acceptable number if there are enought
>>    barrier buckets
>>
>> Here I explain how the barrier buckets are designed,
>>  - BARRIER_UNIT_SECTOR_SIZE
>>    The whole LBA address space of a raid1 device is divided into multiple
>>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>>    Bio request won't go across border of barrier unit size, that means
>>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
>>  - BARRIER_BUCKETS_NR
>>    There are BARRIER_BUCKETS_NR buckets in total, if multiple I/O requests
>>    hit different barrier units, they only need to compete I/O barrier with
>>    other I/Os which hit the same barrier bucket index with each other. The
>>    index of a barrier bucket which a bio should look for is calculated by
>>    get_barrier_bucket_idx(),
>> 	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>>    sector is the start sector number of a bio. align_to_barrier_unit_end()
>>    will make sure the finall bio sent into generic_make_request() won't
>>    exceed border of the barrier unit size.
>>  - RRIER_BUCKETS_NR
>>    Number of barrier buckets is defined by,
>> 	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
>>    For 4KB page size, there are 512 buckets for each raid1 device. That
>>    means the propobility of full random I/O barrier confliction may be
>>    reduced down to 1/512.
> 
> Thanks! The idea is awesome and does make the code easier to understand.
> 
> For hash conflict, I'm worrying about one case. resync does sequential access.
> Say we have a sequential normal IO. If the first normal IO and resync IO have
> conflict, it's possible they will have conflict subsequently, since both are
> doing sequential access. We can change the hash to something like this:
> 
> for the first 64M * 512, the hash is 0, 1, ... 511
> For the second 64M * 512, the hash is 1, 3, 5 ...
> The third 64M * 512, the hash is 2, 5, 8 ...
> 
> It should be very easy to implement something like this, and this should reduce
> the conflict of sequential access.
>  

Hi Shaohua,

What I was concerned about is memory footprint. For very fast sequential
I/O, a linear mapping of buckets means each (64 byte) cache line contains 8
long integers, which is very friendly for CPU caching,
 - sequentially writing 8 barrier units only requires 1 memory fetch for
each of the barrier variable arrays (barrier[], nr_pending[], nr_waiting[],
nr_queued[]).
 - memory prefetch may have a positive effect on sequential I/O.

It will be very rare for resync I/O and regular I/O to always conflict in
the same barrier bucket: resync I/O throughput usually is slower than
regular I/O, so it is very hard to keep them always conflicting in the same
bucket. If this condition really happens, the current sliding resync window
implementation may also have a similar conflict (regular I/O always hits the
resync window). So the barrier buckets don't make things worse.

If we used a non-contiguous mapping, the memory footprint of the buckets
would be quite scattered, and a sequential I/O would occupy more cache
lines, which would increase the probability of cache bouncing when multiple
sequential I/Os (at different offsets) happen on different CPU cores.

This is why I use very simple linear bucket hashing, illustrated below.
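Here is a tiny user-space demo of the linear mapping (the 64MB unit size and
512 buckets are the values discussed in this thread, so
BARRIER_UNIT_SECTOR_BITS is 17 here; this is not kernel code, just the index
math):

	#include <stdio.h>
	#include <stdint.h>

	#define BARRIER_UNIT_SECTOR_BITS 17	/* 2^17 sectors * 512B = 64MB */
	#define BARRIER_BUCKETS_NR	 512

	static long get_barrier_bucket_idx(uint64_t sector)
	{
		return (sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR;
	}

	int main(void)
	{
		uint64_t sector;

		/* Sequential I/O walks adjacent buckets: 0, 1, 2, ...
		 * so the counters it touches share cache lines.
		 */
		for (sector = 0; sector < (8ULL << BARRIER_UNIT_SECTOR_BITS);
		     sector += 1ULL << BARRIER_UNIT_SECTOR_BITS)
			printf("sector %llu -> bucket %ld\n",
			       (unsigned long long)sector,
			       get_barrier_bucket_idx(sector));
		return 0;
	}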


>> Comparing to single sliding resync window,
>>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>>    will have confliction within a single barrier units. So it is similar to
>>    single sliding resync window.
>>  - But a barrier unit bucket is shared by all barrier units with identical
>>    barrier uinit index, the probability of confliction might be higher
>>    than single sliding resync window, in condition that writing I/Os
>>    always hit barrier units which have identical barrier bucket index with
>>    the resync I/Os. This is a very rare condition in real I/O work loads,
>>    I cannot imagine how it could happen in practice.
>>  - Therefore we can achieve a good enough low confliction rate with much
>>    simpler barrier algorithm and implementation.
>>
>> If user has a (realy) large raid1 device, for example 10PB size, we may
>> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
>> it is possible to be a raid1-created-time-defined variable in future.
>>
>> There are two changes should be noticed,
>>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>>    single loop, it looks like this,
>> 	spin_lock_irqsave(&conf->device_lock, flags);
>> 	conf->nr_queued[idx]--;
>> 	spin_unlock_irqrestore(&conf->device_lock, flags);
>>    This change generates more spin lock operations, but in next patch of
>>    this patch set, it will be replaced by a single line code,
>> 	atomic_dec(conf->nr_queueud[idx]);
>>    So we don't need to worry about spin lock cost here.
>>  - In raid1_make_request(), wait_barrier() is replaced by,
>>    a) wait_read_barrier(): wait barrier in regular read I/O code path
>>    b) wait_barrier(): wait barrier in regular write I/O code path
>>    The differnece is wait_read_barrier() only waits if array is frozen, I
>>    am not able to combile them into one function, because they must receive
>>    differnet data types in their arguments list.
>>  - align_to_barrier_unit_end() is called to make sure both regular and
>>    resync I/O won't go across the border of a barrier unit size.
>>  
>> Open question:
>>  - Need review from md clustring developer, I don't touch related code now.
> 
> Don't think it matters, but please open eyes, Guoqing!
>  
>> Signed-off-by: Coly Li <colyli@suse.de>
>> Cc: Shaohua Li <shli@fb.com>
>> Cc: Neil Brown <neilb@suse.de>
>> Cc: Johannes Thumshirn <jthumshirn@suse.de>
>> Cc: Guoqing Jiang <gqjiang@suse.com>
>> ---
>>  drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
>>  drivers/md/raid1.h |  42 +++++-------
>>  2 files changed, 242 insertions(+), 189 deletions(-)
>>
>> Index: linux-raid1/drivers/md/raid1.c
>> ===================================================================
>> --- linux-raid1.orig/drivers/md/raid1.c
>> +++ linux-raid1/drivers/md/raid1.c
>> @@ -66,9 +66,8 @@
>>   */
>>  static int max_queued_requests = 1024;
>>  
>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>> -			  sector_t bi_sector);
>> -static void lower_barrier(struct r1conf *conf);
>> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>  
>>  static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
>>  {
>> @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
>>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>>  
>>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>>  {
>> @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
>>  {
>>  	struct r1conf *conf = r1_bio->mddev->private;
>>  	int i;
>> +	sector_t sector_nr = r1_bio->sector;
>>  
>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>  		struct bio *bio = r1_bio->bios[i];
>> @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
>>  
>>  	mempool_free(r1_bio, conf->r1buf_pool);
>>  
>> -	lower_barrier(conf);
>> +	lower_barrier(conf, sector_nr);
>>  }
>>  
>>  static void reschedule_retry(struct r1bio *r1_bio)
>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>  	unsigned long flags;
>>  	struct mddev *mddev = r1_bio->mddev;
>>  	struct r1conf *conf = mddev->private;
>> +	sector_t sector_nr;
>> +	long idx;
>> +
>> +	sector_nr = r1_bio->sector;
>> +	idx = get_barrier_bucket_idx(sector_nr);
>>  
>>  	spin_lock_irqsave(&conf->device_lock, flags);
>>  	list_add(&r1_bio->retry_list, &conf->retry_list);
>> -	conf->nr_queued ++;
>> +	conf->nr_queued[idx]++;
>>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>>  
>>  	wake_up(&conf->wait_barrier);
>> @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
>>  	struct bio *bio = r1_bio->master_bio;
>>  	int done;
>>  	struct r1conf *conf = r1_bio->mddev->private;
>> -	sector_t start_next_window = r1_bio->start_next_window;
>> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>>  
>>  	if (bio->bi_phys_segments) {
>>  		unsigned long flags;
>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>  		bio->bi_error = -EIO;
>>  
>> -	if (done) {
>> +	if (done)
>>  		bio_endio(bio);
>> -		/*
>> -		 * Wake up any possible resync thread that waits for the device
>> -		 * to go idle.
>> -		 */
>> -		allow_barrier(conf, start_next_window, bi_sector);
>> -	}
>>  }
>>  
>>  static void raid_end_bio_io(struct r1bio *r1_bio)
>>  {
>>  	struct bio *bio = r1_bio->master_bio;
>> +	struct r1conf *conf = r1_bio->mddev->private;
>>  
>>  	/* if nobody has done the final endio yet, do it now */
>>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>  
>>  		call_bio_endio(r1_bio);
>>  	}
>> +
>> +	/*
>> +	 * Wake up any possible resync thread that waits for the device
>> +	 * to go idle.
>> +	 */
>> +	allow_barrier(conf, r1_bio->sector);
> Why this change?
> 

I moved allow_barrier() here on purpose: the current raid1 code raises
nr_pending for each master bio, while the barrier buckets patch raises
nr_pending[idx] for each r1_bio.

The current code increases nr_pending for each master bio (the bio structure
received by raid1_make_request()). A master bio may be split into multiple
r1_bio structures; when all the r1_bios complete, nr_pending is decreased
for the original master bio.

Since the master bio may span multiple barrier units, in this patch the
master bio is split into one r1_bio structure per barrier unit. In this
case, we need to call wait_barrier() for each r1_bio, and call
allow_barrier() on each r1_bio completion. This is why allow_barrier() is
moved from master bio completion time to r1_bio completion time.

A master bio may also be split into multiple r1_bios if bad blocks are
encountered, and these r1_bios may stay in the same barrier unit. In this
case the different r1_bios each increase nr_pending[] for the same bucket
index, so we should also call wait_barrier() for each r1_bio and call
allow_barrier() at its completion time.



>>  	free_r1bio(r1_bio);
>>  }
>>  
>> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>>  	return mirror;
>>  }
>>  
>> +/* bi_end_io callback for regular READ bio */
>>  static void raid1_end_read_request(struct bio *bio)
>>  {
>>  	int uptodate = !bio->bi_error;
>> @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
>>  		bio_put(to_put);
>>  }
>>  
>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>> +					  sector_t sectors)
>> +{
>> +	sector_t len;
>> +
>> +	WARN_ON(sectors == 0);
>> +	/* len is the number of sectors from start_sector to end of the
>> +	 * barrier unit which start_sector belongs to.
>> +	 */
>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>> +	      start_sector;
>> +
>> +	if (len > sectors)
>> +		len = sectors;
>> +
>> +	return len;
>> +}
> 
> I'd prefer calling bio_split at the early of raid1_make_request and split the
> bio if it crosses the bucket boundary. please see raid10_make_request for
> example. resync will not cross the boundary. So the code will not need to worry
> about the boundary. I believe this will make the code simpiler (all the
> align_to_barrier_unit_end calling can removed) and easy to understand.
> 

Indeed, my first implementation used bio_split(). The reasons why I did not
keep it are,
- I need to write more code.
- I can't simply reuse the existing r1_bio completion code that calls
bio_endio() on the master bio when bio->bi_phys_segments reaches zero (see
call_bio_endio()), because r1_bio->master_bio would then be the split bio,
not the original master bio received by raid1_make_request().

Therefore I finally decided to use align_to_barrier_unit_end() to generate
more r1_bio structures if the original master bio goes across the barrier
unit size; it requires fewer modifications in this patch.
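For reference, a rough sketch of what the bio_split() approach would look
like, modelled on raid10_make_request(); __make_request() here is a
hypothetical name for the existing per-bio code path, and the details are
only an assumption:

	static void raid1_make_request(struct mddev *mddev, struct bio *bio)
	{
		struct bio *split;

		do {
			sector_t offset = bio->bi_iter.bi_sector &
					  (BARRIER_UNIT_SECTOR_SIZE - 1);

			if (offset + bio_sectors(bio) >
			    BARRIER_UNIT_SECTOR_SIZE) {
				/* Cut the bio at the barrier unit border. */
				split = bio_split(bio,
						  BARRIER_UNIT_SECTOR_SIZE -
						  offset,
						  GFP_NOIO, fs_bio_set);
				bio_chain(split, bio);
			} else {
				split = bio;
			}

			__make_request(mddev, split);
		} while (split != bio);
	}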

>> +
>>  /*
>>   * This routine returns the disk from which the requested read should
>>   * be done. There is a per-array 'next expected sequential IO' sector
>> @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
>>  		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>>  	}
>>  	rcu_read_unlock();
>> +	sectors = align_to_barrier_unit_end(this_sector, sectors);
>>  	*max_sectors = sectors;
>>  
>>  	return best_disk;
>> @@ -779,168 +803,174 @@ static void flush_pending_writes(struct
>>   *    there is no normal IO happeing.  It must arrange to call
>>   *    lower_barrier when the particular background IO completes.
>>   */
>> +
>>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>>  {
>> +	long idx = get_barrier_bucket_idx(sector_nr);
>> +
>>  	spin_lock_irq(&conf->resync_lock);
>>  
>>  	/* Wait until no block IO is waiting */
>> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
>> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>>  			    conf->resync_lock);
>>  
>>  	/* block any new IO from starting */
>> -	conf->barrier++;
>> -	conf->next_resync = sector_nr;
>> +	conf->barrier[idx]++;
>>  
>>  	/* For these conditions we must wait:
>>  	 * A: while the array is in frozen state
>> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
>> -	 *    the max count which allowed.
>> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
>> -	 *    next resync will reach to the window which normal bios are
>> -	 *    handling.
>> -	 * D: while there are any active requests in the current window.
>> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
>> +	 *    existing in sector number ranges corresponding to idx.
>> +	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
>> +	 *    the max count which allowed in sector number ranges
>> +	 *    conrresponding to idx.
>>  	 */
>>  	wait_event_lock_irq(conf->wait_barrier,
>> -			    !conf->array_frozen &&
>> -			    conf->barrier < RESYNC_DEPTH &&
>> -			    conf->current_window_requests == 0 &&
>> -			    (conf->start_next_window >=
>> -			     conf->next_resync + RESYNC_SECTORS),
>> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
>> +			    conf->barrier[idx] < RESYNC_DEPTH,
>>  			    conf->resync_lock);
> 
> So there is a slight behavior change. Originally we guarautee no more than
> RESYNC_DEPTH sync. Now this only checks 'conf->barrier[idx] < RESYNC_DEPTH'.
> How about barrier[idx-1], barrier[idx-2]...? It's possible conf->barrier[idx] <
> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH. Not sure how
> important this is, but at least we can check barrier[idx] + barrier[idx-1].

The original code wants to raise multiple barriers, but fewer than
RESYNC_DEPTH. It helps resync I/O throughput with less conflict between
resync I/O and regular I/O. Considering conf->barrier is a global barrier, a
large RESYNC_DEPTH would starve regular I/O, so it is only set to 32 for the
whole raid1 device.

In the barrier buckets patch, conf->barrier[idx] only takes effect in a
single bucket. More barriers raised in one barrier bucket won't interfere
with regular I/O in other barrier buckets, therefore we can raise many more
resync I/O barriers, which is good for resync I/O throughput without
starving more regular I/Os.

If we have 512 barrier buckets, that means 512*RESYNC_DEPTH barriers can be
raised on the whole raid1 device, which allows more parallel resync I/O.

Before implementing parallel resync, I keep RESYNC_DEPTH as 32 in this
patch. Because only one resync I/O stream hits one barrier bucket, the same
RESYNC_DEPTH won't change barrier behavior now. Later, when we have parallel
resync I/O, let's see whether we need to modify this value, or still keep it
as 32.


>> -
>> -	conf->nr_pending++;
>> +	conf->nr_pending[idx]++;
>>  	spin_unlock_irq(&conf->resync_lock);
>>  }
>>  
>> -static void lower_barrier(struct r1conf *conf)
>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>>  {
>>  	unsigned long flags;
>> -	BUG_ON(conf->barrier <= 0);
>> +	long idx = get_barrier_bucket_idx(sector_nr);
>> +
>> +	BUG_ON(conf->barrier[idx] <= 0);
>>  	spin_lock_irqsave(&conf->resync_lock, flags);
>> -	conf->barrier--;
>> -	conf->nr_pending--;
>> +	conf->barrier[idx]--;
>> +	conf->nr_pending[idx]--;
>>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>>  	wake_up(&conf->wait_barrier);
>>  }
>>  
>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>> +/* A regular I/O should wait when,
>> + * - The whole array is frozen (both READ and WRITE)
>> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
>> + */
>> +static void _wait_barrier(struct r1conf *conf, long idx)
>>  {
>> -	bool wait = false;
>> -
>> -	if (conf->array_frozen || !bio)
>> -		wait = true;
>> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
>> -		if ((conf->mddev->curr_resync_completed
>> -		     >= bio_end_sector(bio)) ||
>> -		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
>> -		     <= bio->bi_iter.bi_sector))
>> -			wait = false;
>> -		else
>> -			wait = true;
>> +	spin_lock_irq(&conf->resync_lock);
>> +	if (conf->array_frozen || conf->barrier[idx]) {
>> +		conf->nr_waiting[idx]++;
>> +		/* Wait for the barrier to drop. */
>> +		wait_event_lock_irq(
>> +			conf->wait_barrier,
>> +			!conf->array_frozen && !conf->barrier[idx],
>> +			conf->resync_lock);
>> +		conf->nr_waiting[idx]--;
>>  	}
>>  
>> -	return wait;
>> +	conf->nr_pending[idx]++;
>> +	spin_unlock_irq(&conf->resync_lock);
>>  }
>>  
>> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>>  {
>> -	sector_t sector = 0;
>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>  
>>  	spin_lock_irq(&conf->resync_lock);
>> -	if (need_to_wait_for_sync(conf, bio)) {
>> -		conf->nr_waiting++;
>> -		/* Wait for the barrier to drop.
>> -		 * However if there are already pending
>> -		 * requests (preventing the barrier from
>> -		 * rising completely), and the
>> -		 * per-process bio queue isn't empty,
>> -		 * then don't wait, as we need to empty
>> -		 * that queue to allow conf->start_next_window
>> -		 * to increase.
>> -		 */
>> -		wait_event_lock_irq(conf->wait_barrier,
>> -				    !conf->array_frozen &&
>> -				    (!conf->barrier ||
>> -				     ((conf->start_next_window <
>> -				       conf->next_resync + RESYNC_SECTORS) &&
>> -				      current->bio_list &&
>> -				      !bio_list_empty(current->bio_list))),
>> -				    conf->resync_lock);
>> -		conf->nr_waiting--;
>> -	}
>> -
>> -	if (bio && bio_data_dir(bio) == WRITE) {
>> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
>> -			if (conf->start_next_window == MaxSector)
>> -				conf->start_next_window =
>> -					conf->next_resync +
>> -					NEXT_NORMALIO_DISTANCE;
>> -
>> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
>> -			    <= bio->bi_iter.bi_sector)
>> -				conf->next_window_requests++;
>> -			else
>> -				conf->current_window_requests++;
>> -			sector = conf->start_next_window;
>> -		}
>> +	if (conf->array_frozen) {
>> +		conf->nr_waiting[idx]++;
>> +		/* Wait for array to unfreeze */
>> +		wait_event_lock_irq(
>> +			conf->wait_barrier,
>> +			!conf->array_frozen,
>> +			conf->resync_lock);
>> +		conf->nr_waiting[idx]--;
>>  	}
>> -
>> -	conf->nr_pending++;
>> +	conf->nr_pending[idx]++;
>>  	spin_unlock_irq(&conf->resync_lock);
>> -	return sector;
>>  }
>>  
>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>> -			  sector_t bi_sector)
>> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>> +{
>> +	long idx = get_barrier_bucket_idx(sector_nr);
>> +
>> +	_wait_barrier(conf, idx);
>> +}
>> +
>> +static void wait_all_barriers(struct r1conf *conf)
>> +{
>> +	long idx;
>> +
>> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>> +		_wait_barrier(conf, idx);
> 
> This is racy. Since resync is sequential, idealy we only need to check the
> bucket sequentially. The compiler could do  _wait_barrier(conf, 1) and then
> _wait_barrier(conf, 0). It's possible the bucket 1 has barrier right after the
> check of bucket 0. Even the compiler doesn't reorder, there is case the bucket
> 511 should be check before bucket 0 if the resync is just crossing 512 buckets.
> It would be better we check the resync position and adjust the bucket wait
> order according to the position.
> 

wait_all_barriers() and allow_all_barriers() are used in close_sync() only,
 static void close_sync(struct r1conf *conf)
 {
-	wait_barrier(conf, NULL);
-	allow_barrier(conf, 0, 0);
+	wait_all_barriers(conf);
+	allow_all_barriers(conf);
[snip]
 }

wait_barrier() and allow_barrier() are called here for synchronization
purposes, to make sure that before close_sync() returns there is no resync
I/O left.

close_sync() is called in raid1_sync_request() when resync I/O exceeds the
size of the raid1 device, and resync should be closed. In this condition,
there won't be any new resync I/O generated. Once a conf->barrier[idx]
reaches 0, it won't increase again before a new resync restarts. Therefore
it is safe to call _wait_barrier() one by one, until every
conf->barrier[idx] reaches 0.

This is a usage of wait/allow_barrier() which is not mentioned in the code
comments. They are not for regular I/O waiting for resync I/O; they are used
as a synchronization point to make sure all in-flight resync I/O has
completed.


>>  	int good_sectors = RESYNC_SECTORS;
>>  	int min_bad = 0; /* number of sectors that are bad in all devices */
>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>  
>>  	if (!conf->r1buf_pool)
>>  		if (init_resync(conf))
>> @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
>>  	 * If there is non-resync activity waiting for a turn, then let it
>>  	 * though before starting on this new sync request.
>>  	 */
>> -	if (conf->nr_waiting)
>> +	if (conf->nr_waiting[idx])
>>  		schedule_timeout_uninterruptible(1);
>>  
>>  	/* we are incrementing sector_nr below. To be safe, we check against
>> @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
>>  	r1_bio->sector = sector_nr;
>>  	r1_bio->state = 0;
>>  	set_bit(R1BIO_IsSync, &r1_bio->state);
>> +	/* make sure good_sectors won't go across barrier unit border */
>> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>>  
>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>  		struct md_rdev *rdev;
>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>  	if (!conf)
>>  		goto abort;
>>  
>> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>> +	if (!conf->nr_pending)
>> +		goto abort;
> 
> This allocation is a little wierd. I'd define a count and uses
> sizeof(nr_pending) * count to do the allocation. There is nothing related to
> PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.

This is just for better memory usage: r1conf is allocated by the slab
allocator, and a large, unaligned size may cause internal memory
fragmentation inside the slab's pages. Calling kzalloc() with PAGE_SIZE
avoids such an issue.

Coly


* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-24  5:45   ` NeilBrown
  2016-11-24  6:05     ` Guoqing Jiang
@ 2016-11-28  6:59     ` Coly Li
  1 sibling, 0 replies; 15+ messages in thread
From: Coly Li @ 2016-11-28  6:59 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, linux-raid, Shaohua Li, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

On 2016/11/24 at 1:45 PM, NeilBrown wrote:
> On Wed, Nov 23 2016, Shaohua Li wrote:
>> 
>> Thanks! The idea is awesome and does makes the code easier to
>> understand.
> 
> This is a key point - simpler code!!
> 
>> 
>> For hash conflict, I'm worrying about one case. resync does
>> sequential access. Say we have a sequential normal IO. If the
>> first normal IO and resync IO have conflict, it's possible they
>> will have conflict subsequently, since both are doing sequential
>> access. We can change the hash to something like this:
>> 
>> for the first 64M * 512, the hash is 0, 1, ... 511 For the second
>> 64M * 512, the hash is 1, 3, 5 ... The third 64M * 512, the hash
>> is 2, 5, 8 ...
> 
> This would happen when the write and the resync are at different 
> addresses which happen to collide in the hash. They would only
> remain in sync if the two proceeded at the same speed. Normally
> resync is throttled when there is competing IO, so that is fairly
> unlikely.
> 
> If this really was important, it might make sense to just use 
> hash_long() to make the relevant bits of the sector number into
> and index. (i.e. keep the code simple)
> 
>>> 
>>> If user has a (realy) large raid1 device, for example 10PB
>>> size, we may just increase the buckets number
>>> BARRIER_BUCKETS_NR. Now this is a macro, it is possible to be a
>>> raid1-created-time-defined variable in future.
> 
> I don't think the size of the array is very relevant.  No matter
> how big the array is, resync can be expected to directly interfere
> with 0.2% of all writes.  So, assuming a fairly random distribution
> over times, most writes will most often not be blocked by resync at
> all.
> 
> I wouldn't give any thought to any problems like this until they
> are actually measured.
> 

Yes, I agree with you. If we do have to use more buckets in the future,
the current patch makes such a modification easier.


>>>  static void reschedule_retry(struct r1bio *r1_bio)
>>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>>  	unsigned long flags;
>>>  	struct mddev *mddev = r1_bio->mddev;
>>>  	struct r1conf *conf = mddev->private;
>>> +	sector_t sector_nr;
>>> +	long idx;
> 
> I wonder about the use of "long" for "idx". As that is an array
> index, it would have to be a REALLY big array before "long" is
> better than "int".
> 

Good suggestion! I will modify it to int in the next version.


>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>>  		bio->bi_error = -EIO;
>>> 
>>> -	if (done) {
>>> +	if (done)
>>>  		bio_endio(bio);
>>> -		/*
>>> -		 * Wake up any possible resync thread that waits for the device
>>> -		 * to go idle.
>>> -		 */
>>> -		allow_barrier(conf, start_next_window, bi_sector);
>>> -	}
>>>  }
>>> 
>>>  static void raid_end_bio_io(struct r1bio *r1_bio)
>>>  {
>>>  	struct bio *bio = r1_bio->master_bio;
>>> +	struct r1conf *conf = r1_bio->mddev->private;
>>> 
>>>  	/* if nobody has done the final endio yet, do it now */
>>>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>> 
>>>  		call_bio_endio(r1_bio);
>>>  	}
>>> +
>>> +	/*
>>> +	 * Wake up any possible resync thread that waits for the device
>>> +	 * to go idle.
>>> +	 */
>>> +	allow_barrier(conf, r1_bio->sector);
>> Why this change?
> 
> I wondered too.  I think it may be correct, but it should be in a 
> separate patch. When you have a write-mostly device, I think the
> current code will allow_barrier() before the writes to the
> write-mostly devices have completed.
> 

I explained this in the reply to Shaohua's email. The current upstream
code increases/decreases conf->nr_pending for each bio (a.k.a.
r1_bio->master_bio) received by raid1_make_request(). In this patch,
conf->nr_pending[idx] is increased/decreased for each r1_bio; this is the
reason why allow_barrier() moves to raid_end_bio_io(). You may also find
that wait_barrier() changes its location and is replaced by
wait_barrier() and wait_read_barrier() in raid1_make_request().

>>> 
>>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>>> +					  sector_t sectors)
>>> +{
>>> +	sector_t len;
>>> +
>>> +	WARN_ON(sectors == 0);
>>> +	/* len is the number of sectors from start_sector to end of the
>>> +	 * barrier unit which start_sector belongs to.
>>> +	 */
>>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>>> +	      start_sector;
>>> +
>>> +	if (len > sectors)
>>> +		len = sectors;
>>> +
>>> +	return len;
>>> +}
>> 
>> I'd prefer calling bio_split at the early of raid1_make_request
>> and split the bio if it crosses the bucket boundary. please see
>> raid10_make_request for example. resync will not cross the
>> boundary. So the code will not need to worry about the boundary.
>> I believe this will make the code simpiler (all the 
>> align_to_barrier_unit_end calling can removed) and easy to
>> understand.
> 
> align_to_barrier_unit_end() is only called twice, once for writes
> and once for resync/recovery.  If bio_split() were used, you would
> still need the same number of calls.
> 
> Changing raid1.c to use bio_split() more may well be a valuable 
> simplification, but I think it is a separate issue.
> 

Using bio_split() won't be simpler; indeed it is a little more complex,
because then r1_bio->master_bio would be the split bio, not the original
master bio. That's why I decided not to use it.


>>>  	wait_event_lock_irq(conf->wait_barrier,
>>> -			    !conf->array_frozen &&
>>> -			    conf->barrier < RESYNC_DEPTH &&
>>> -			    conf->current_window_requests == 0 &&
>>> -			    (conf->start_next_window >=
>>> -			     conf->next_resync + RESYNC_SECTORS),
>>> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
>>> +			    conf->barrier[idx] < RESYNC_DEPTH,
>>>  			    conf->resync_lock);
>> 
>> So there is a slight behavior change. Originally we guarautee no
>> more than RESYNC_DEPTH sync. Now this only checks
>> 'conf->barrier[idx] < RESYNC_DEPTH'. How about barrier[idx-1],
>> barrier[idx-2]...? It's possible conf->barrier[idx] < 
>> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH.
>> Not sure how important this is, but at least we can check
>> barrier[idx] + barrier[idx-1].
> 
> I confess that I'm not entirely sure the point of RESYNC_DEPTH - it
> was already in the code when I started working on it. My best guess
> is that it limits the number of requests we need to wait for before
> regular writes can be permitted in the resync region. In that case,
> the current code is good enough.
> 
> Your "barrier[idx] + barrier[idx-1]" idea is interesting, but would
> only make a difference 1/1024 of the time (bucket is 64Meg, the 
> RESYNC_BLOCK_SIZE is 64K).
> 
> 

IMHO RESYNC_DEPTH is for better resync throughput. Since resync is
currently linear and there is only one resync window, I doubt whether it
still takes much effect. With only one sliding resync window, checking
the resync depth within that window is very similar to checking it within
a single barrier bucket, and a resync I/O won't hit more than one barrier
bucket now.

In the future, when parallel resync is implemented, the modification here
may increase the whole raid1 device's resync depth to BARRIER_BUCKETS_NR *
RESYNC_DEPTH, which results in better resync I/O throughput when regular
I/O is idle.



>>> +
>>> +static void wait_all_barriers(struct r1conf *conf)
>>> +{
>>> +	long idx;
>>> +
>>> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>>> +		_wait_barrier(conf, idx);
>> 
>> This is racy. Since resync is sequential, idealy we only need to
>> check the bucket sequentially. The compiler could do
>> _wait_barrier(conf, 1) and then _wait_barrier(conf, 0). It's
>> possible the bucket 1 has barrier right after the check of bucket
>> 0. Even the compiler doesn't reorder, there is case the bucket 
>> 511 should be check before bucket 0 if the resync is just
>> crossing 512 buckets. It would be better we check the resync
>> position and adjust the bucket wait order according to the
>> position.
> 
> I cannot see how it is racy.  No new sync requests will be issued
> at this time, so once ->barrier[idx] reaches 0, it will stay at
> zero.
> 

Yes, this is what I mean.


>>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>>  	if (!conf)
>>>  		goto abort;
>>> 
>>> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>>> +	if (!conf->nr_pending)
>>> +		goto abort;
>> 
>> This allocation is a little wierd. I'd define a count and uses 
>> sizeof(nr_pending) * count to do the allocation. There is nothing
>> related to PAGE_SIZE. Alternatively, just make nr_pending an
>> array in r1conf.
> 
> The thing about PAGE_SIZE is that the allocation will succeed and
> won't waste space. If you make all the arrays simple members of
> r1conf, r1conf will be about 4pages larger, so will require an
> 8-page allocation, which is more likely to fail. I agree that it
> seems a little weird, but I think it results in the best use of
> memory.

Yes, this is what I mean.

Thanks for your comments!

Coly





* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-24  7:34 ` Guoqing Jiang
@ 2016-11-28  7:33   ` Coly Li
  2016-11-30  6:37     ` Guoqing Jiang
  0 siblings, 1 reply; 15+ messages in thread
From: Coly Li @ 2016-11-28  7:33 UTC (permalink / raw)
  To: Guoqing Jiang, linux-raid; +Cc: Shaohua Li, Neil Brown, Johannes Thumshirn

On 2016/11/24 at 3:34 PM, Guoqing Jiang wrote:
> Hi Coly,
> 
> Please see below comments, just FYI.
> 
> On 11/22/2016 05:54 AM, Coly Li wrote:
>>   - In raid1_make_request(), wait_barrier() is replaced by,
>>     a) wait_read_barrier(): wait barrier in regular read I/O code path
>>     b) wait_barrier(): wait barrier in regular write I/O code path
>>     The differnece is wait_read_barrier() only waits if array is
>> frozen, I
>>     am not able to combile them into one function, because they must
>> receive
>>     differnet data types in their arguments list.
> 
> Maybe it is possible to add a parameter to distinguish read and write, then
> the two functions can be unified.

Hi Guoqing,

read_balance() will make sure a regular read won't access a non-synced
disk, therefore a regular read only needs to wait when the whole raid1
device is frozen. That's why wait_read_barrier() only waits for
!conf->array_frozen, while _wait_barrier() waits for both
!conf->array_frozen and !conf->barrier[idx].

Combining them is possible, but the code won't be simpler, because
- inside the function I still need two wait_event_lock_irq() calls for
the different wait conditions, and if I use sector_t as the function
parameter, in wait_all_barriers() I have to send a bogus sector number
(calculated from the bucket index) into _wait_barrier().

That's why I chose the current implementation.
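
For illustration, a combined helper would look roughly like this (a
sketch only; the rw flag and the function name are made up here), and it
still ends up with two wait_event_lock_irq() calls for the two different
conditions:

static void wait_barrier_common(struct r1conf *conf, sector_t sector_nr,
				int rw)
{
	long idx = get_barrier_bucket_idx(sector_nr);

	spin_lock_irq(&conf->resync_lock);
	if (rw == WRITE && (conf->array_frozen || conf->barrier[idx])) {
		conf->nr_waiting[idx]++;
		/* writes wait for unfreeze and for the barrier to drop */
		wait_event_lock_irq(conf->wait_barrier,
				    !conf->array_frozen &&
				    !conf->barrier[idx],
				    conf->resync_lock);
		conf->nr_waiting[idx]--;
	} else if (rw == READ && conf->array_frozen) {
		conf->nr_waiting[idx]++;
		/* reads only wait for the array to unfreeze */
		wait_event_lock_irq(conf->wait_barrier,
				    !conf->array_frozen,
				    conf->resync_lock);
		conf->nr_waiting[idx]--;
	}
	conf->nr_pending[idx]++;
	spin_unlock_irq(&conf->resync_lock);
}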

> 
>>   - align_to_barrier_unit_end() is called to make sure both regular and
>>     resync I/O won't go across the border of a barrier unit size.
>>   Open question:
>>   - Need review from md clustring developer, I don't touch related
>> code now.
> 
> I don't find problems with some tests so far.

In raid1_sync_request(), I see,
	conf->cluster_sync_low = mddev->curr_resync_completed;
	conf->cluster_sync_high = conf->cluster_sync_low +
				  CLUSTER_RESYNC_WINDOW_SECTORS;

Is it possible that the LBA range [conf->cluster_sync_low,
conf->cluster_sync_high] goes across the border of a barrier unit?


> 
>>   static void reschedule_retry(struct r1bio *r1_bio)
>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>       unsigned long flags;
>>       struct mddev *mddev = r1_bio->mddev;
>>       struct r1conf *conf = mddev->private;
>> +    sector_t sector_nr;
>> +    long idx;
>> +
>> +    sector_nr = r1_bio->sector;
>> +    idx = get_barrier_bucket_idx(sector_nr);
> 
> Isn't "idx = get_barrier_bucket_idx(r1_bio->sector)" enough here?

Nice catch! I will modify this in the next version.

> 
>>   @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>       if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>           bio->bi_error = -EIO;
>>   -    if (done) {
>> +    if (done)
>>           bio_endio(bio);
>> -        /*
>> -         * Wake up any possible resync thread that waits for the device
>> -         * to go idle.
>> -         */
>> -        allow_barrier(conf, start_next_window, bi_sector);
>> -    }
>>   }
>>     static void raid_end_bio_io(struct r1bio *r1_bio)
>>   {
>>       struct bio *bio = r1_bio->master_bio;
>> +    struct r1conf *conf = r1_bio->mddev->private;
>>         /* if nobody has done the final endio yet, do it now */
>>       if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>             call_bio_endio(r1_bio);
>>       }
>> +
>> +    /*
>> +     * Wake up any possible resync thread that waits for the device
>> +     * to go idle.
>> +     */
>> +    allow_barrier(conf, r1_bio->sector);
>>       free_r1bio(r1_bio);
>>   }
> 
> I am not sure it is safe for above changes since call_bio_endio is not only
> called within raid_end_bio_io.
> 

This is safe. I moved it here because the location of wait_barrier()
changed: it is replaced by wait_read_barrier() for READ I/O and
wait_barrier() for WRITE I/O. In this patch, conf->nr_pending[idx] is
raised for each r1_bio, so allow_barrier() should be called when each
r1_bio is destroyed, to decrease the corresponding conf->nr_pending[idx].


>>   @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>>       return mirror;
>>   }
>>   +/* bi_end_io callback for regular READ bio */
> 
> Not related to the patch itself, it would be better to make the similar
> changes in other patches.
> 

OK, I will move it into a separate patch.


>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>> +/* A regular I/O should wait when,
>> + * - The whole array is frozen (both READ and WRITE)
>> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
>> + */
>> +static void _wait_barrier(struct r1conf *conf, long idx)
>>   {
>> -    bool wait = false;
>> -
>> -    if (conf->array_frozen || !bio)
>> -        wait = true;
>> -    else if (conf->barrier && bio_data_dir(bio) == WRITE) {
>> -        if ((conf->mddev->curr_resync_completed
>> -             >= bio_end_sector(bio)) ||
>> -            (conf->next_resync + NEXT_NORMALIO_DISTANCE
>> -             <= bio->bi_iter.bi_sector))
>> -            wait = false;
>> -        else
>> -            wait = true;
>> +    spin_lock_irq(&conf->resync_lock);
>> +    if (conf->array_frozen || conf->barrier[idx]) {
>> +        conf->nr_waiting[idx]++;
>> +        /* Wait for the barrier to drop. */
>> +        wait_event_lock_irq(
>> +            conf->wait_barrier,
>> +            !conf->array_frozen && !conf->barrier[idx],
>> +            conf->resync_lock);
>> +        conf->nr_waiting[idx]--;
>>       }
>>   -    return wait;
>> +    conf->nr_pending[idx]++;
>> +    spin_unlock_irq(&conf->resync_lock);
>>   }
>>   -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>>   {
>> -    sector_t sector = 0;
>> +    long idx = get_barrier_bucket_idx(sector_nr);
>>         spin_lock_irq(&conf->resync_lock);
>> -    if (need_to_wait_for_sync(conf, bio)) {
>> -        conf->nr_waiting++;
>> -        /* Wait for the barrier to drop.
>> -         * However if there are already pending
>> -         * requests (preventing the barrier from
>> -         * rising completely), and the
>> -         * per-process bio queue isn't empty,
>> -         * then don't wait, as we need to empty
>> -         * that queue to allow conf->start_next_window
>> -         * to increase.
>> -         */
>> -        wait_event_lock_irq(conf->wait_barrier,
>> -                    !conf->array_frozen &&
>> -                    (!conf->barrier ||
>> -                     ((conf->start_next_window <
>> -                       conf->next_resync + RESYNC_SECTORS) &&
>> -                      current->bio_list &&
>> -                      !bio_list_empty(current->bio_list))),
>> -                    conf->resync_lock);
>> -        conf->nr_waiting--;
>> -    }
>> -
>> -    if (bio && bio_data_dir(bio) == WRITE) {
>> -        if (bio->bi_iter.bi_sector >= conf->next_resync) {
>> -            if (conf->start_next_window == MaxSector)
>> -                conf->start_next_window =
>> -                    conf->next_resync +
>> -                    NEXT_NORMALIO_DISTANCE;
>> -
>> -            if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
>> -                <= bio->bi_iter.bi_sector)
>> -                conf->next_window_requests++;
>> -            else
>> -                conf->current_window_requests++;
>> -            sector = conf->start_next_window;
>> -        }
>> +    if (conf->array_frozen) {
>> +        conf->nr_waiting[idx]++;
>> +        /* Wait for array to unfreeze */
>> +        wait_event_lock_irq(
>> +            conf->wait_barrier,
>> +            !conf->array_frozen,
>> +            conf->resync_lock);
>> +        conf->nr_waiting[idx]--;
>>       }
>> -
>> -    conf->nr_pending++;
>> +    conf->nr_pending[idx]++;
>>       spin_unlock_irq(&conf->resync_lock);
>> -    return sector;
>>   }
>>   -static void allow_barrier(struct r1conf *conf, sector_t
>> start_next_window,
>> -              sector_t bi_sector)
>> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>> +{
>> +    long idx = get_barrier_bucket_idx(sector_nr);
>> +
>> +    _wait_barrier(conf, idx);
>> +}
>> +
> 
> I personally prefer to use one wait_barrier to cover both read and
> write, something like:
> 
> wait_barrier(struct r1conf *conf, long idx, int direction)
> 

As I explained previously, combining them won't make the code simpler,
and may make it more complex. Let me see whether there is a better
solution to make the current patch simpler.


> BTW: there are some unnecessary changes inside the patch, maybe you can
> remove it
> or introduce other patches for them.

Yes, great suggestion, I will do this in the next version.

Thanks for your review!

Coly



* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-28  6:42   ` Coly Li
@ 2016-11-29 19:29     ` Shaohua Li
  2016-11-30  2:57       ` Coly Li
  0 siblings, 1 reply; 15+ messages in thread
From: Shaohua Li @ 2016-11-29 19:29 UTC (permalink / raw)
  To: Coly Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On Mon, Nov 28, 2016 at 02:42:22PM +0800, Coly Li wrote:
> On 2016/11/23 上午5:35, Shaohua Li wrote:
> > On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
> >> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> >> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> >> I/O barriers to happen only inside a slidingresync window, for regular
> >> I/Os out of this resync window they don't need to wait for barrier any
> >> more. On large raid1 device, it helps a lot to improve parallel writing
> >> I/O throughput when there are background resync I/Os performing at
> >> same time. 
> >>
> >> The idea of sliding resync widow is awesome, but there are several
> >> challenges are very difficult to solve,
> >>  - code complexity
> >>    Sliding resync window requires several veriables to work collectively,
> >>    this is complexed and very hard to make it work correctly. Just grep
> >>    "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
> >>    the original resync window patch. This is not the end, any further
> >>    related modification may easily introduce more regreassion.
> >>  - multiple sliding resync windows
> >>    Currently raid1 code only has a single sliding resync window, we cannot
> >>    do parallel resync with current I/O barrier implementation.
> >>    Implementing multiple resync windows are much more complexed, and very
> >>    hard to make it correctly.
> >>
> >> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> >> removing resync window code, I believe life will be much easier.
> >>
> >> The brief idea of the simpler barrier is,
> >>  - Do not maintain a logbal unique resync window
> >>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
> >>    I/O only has to wait for a resync I/O when both them have same barrier
> >>    bucket index, vice versa.
> >>  - I/O barrier can be recuded to an acceptable number if there are enought
> >>    barrier buckets
> >>
> >> Here I explain how the barrier buckets are designed,
> >>  - BARRIER_UNIT_SECTOR_SIZE
> >>    The whole LBA address space of a raid1 device is divided into multiple
> >>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
> >>    Bio request won't go across border of barrier unit size, that means
> >>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
> >>  - BARRIER_BUCKETS_NR
> >>    There are BARRIER_BUCKETS_NR buckets in total, if multiple I/O requests
> >>    hit different barrier units, they only need to compete I/O barrier with
> >>    other I/Os which hit the same barrier bucket index with each other. The
> >>    index of a barrier bucket which a bio should look for is calculated by
> >>    get_barrier_bucket_idx(),
> >> 	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
> >>    sector is the start sector number of a bio. align_to_barrier_unit_end()
> >>    will make sure the finall bio sent into generic_make_request() won't
> >>    exceed border of the barrier unit size.
> >>  - RRIER_BUCKETS_NR
> >>    Number of barrier buckets is defined by,
> >> 	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
> >>    For 4KB page size, there are 512 buckets for each raid1 device. That
> >>    means the propobility of full random I/O barrier confliction may be
> >>    reduced down to 1/512.
> > 
> > Thanks! The idea is awesome and does makes the code easier to understand.
> > 
> > For hash conflict, I'm worrying about one case. resync does sequential access.
> > Say we have a sequential normal IO. If the first normal IO and resync IO have
> > conflict, it's possible they will have conflict subsequently, since both are
> > doing sequential access. We can change the hash to something like this:
> > 
> > for the first 64M * 512, the hash is 0, 1, ... 511
> > For the second 64M * 512, the hash is 1, 3, 5 ...
> > The third 64M * 512, the hash is 2, 5, 8 ...
> > 
> > It should be very easy to implement something like this, and this should reduce
> > the conflict of sequential access.
> >  
> 
> Hi Shaohua,
> 
> What I concerned was memory footprint. For very fast sequential I/O,
> lineal mapping buckets means each (64 bytes) cache line contains 8 long
> integer, it is very friendly for CPU caching,
>  - sequential writing 8 barrier units only requires 1 memory fetching
> for each barrier variables (barrier[], nr_pending[], nr_waiting[],
> nr_queued[]).
>  - memory prefetch may have positive effect in sequential I/O.
> 
> It will be very rare that resync I/O and regular I/O are always
> conflicted in same barrier bucket: resync I/O throughput usually is
> slower than regular I/O, it is very hard to keep them always conflicting
> in same bucket. If this condition really happens, current sliding resync
> window implementation may also have a similar conflict (regular I/O
> always hits resync window). So the barrier buckets don't make things worse.
> 
> If we use a non-continuous mapping, memory fingerprint of the buckets
> will be quite distributed, and occupies more cache lines for a
> sequential I/O, which will increase the probability of more cache bounce
> if there are multiple sequential I/O (on different offset) happen on
> different CPU cores.
> 
> This is why I use a very simple linear buckets hashing.

I don't think the memory footprint matters much, and wasting a little
memory (e.g. making the barrier and related fields cacheline aligned)
doesn't matter here either, if necessary. If we can reduce the hash
conflict in an easy way, why not? I think Neil's suggestion (using
hash_long) makes sense.
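
Something like the following would do (a sketch only, assuming
BARRIER_BUCKETS_NR stays a power of two; it needs <linux/hash.h> and
<linux/log2.h>):

static inline long get_barrier_bucket_idx(sector_t sector)
{
	/* hash the barrier unit number into [0, BARRIER_BUCKETS_NR) */
	return (long)hash_long((unsigned long)(sector >> BARRIER_UNIT_SECTOR_BITS),
			       ilog2(BARRIER_BUCKETS_NR));
}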
 
> >> Comparing to single sliding resync window,
> >>  - Currently resync I/O grows linearly, therefore regular and resync I/O
> >>    will have confliction within a single barrier units. So it is similar to
> >>    single sliding resync window.
> >>  - But a barrier unit bucket is shared by all barrier units with identical
> >>    barrier uinit index, the probability of confliction might be higher
> >>    than single sliding resync window, in condition that writing I/Os
> >>    always hit barrier units which have identical barrier bucket index with
> >>    the resync I/Os. This is a very rare condition in real I/O work loads,
> >>    I cannot imagine how it could happen in practice.
> >>  - Therefore we can achieve a good enough low confliction rate with much
> >>    simpler barrier algorithm and implementation.
> >>
> >> If user has a (realy) large raid1 device, for example 10PB size, we may
> >> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
> >> it is possible to be a raid1-created-time-defined variable in future.
> >>
> >> There are two changes should be noticed,
> >>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
> >>    single loop, it looks like this,
> >> 	spin_lock_irqsave(&conf->device_lock, flags);
> >> 	conf->nr_queued[idx]--;
> >> 	spin_unlock_irqrestore(&conf->device_lock, flags);
> >>    This change generates more spin lock operations, but in next patch of
> >>    this patch set, it will be replaced by a single line code,
> >> 	atomic_dec(conf->nr_queueud[idx]);
> >>    So we don't need to worry about spin lock cost here.
> >>  - In raid1_make_request(), wait_barrier() is replaced by,
> >>    a) wait_read_barrier(): wait barrier in regular read I/O code path
> >>    b) wait_barrier(): wait barrier in regular write I/O code path
> >>    The differnece is wait_read_barrier() only waits if array is frozen, I
> >>    am not able to combile them into one function, because they must receive
> >>    differnet data types in their arguments list.
> >>  - align_to_barrier_unit_end() is called to make sure both regular and
> >>    resync I/O won't go across the border of a barrier unit size.
> >>  
> >> Open question:
> >>  - Need review from md clustring developer, I don't touch related code now.
> > 
> > Don't think it matters, but please open eyes, Guoqing!
> >  
> >> Signed-off-by: Coly Li <colyli@suse.de>
> >> Cc: Shaohua Li <shli@fb.com>
> >> Cc: Neil Brown <neilb@suse.de>
> >> Cc: Johannes Thumshirn <jthumshirn@suse.de>
> >> Cc: Guoqing Jiang <gqjiang@suse.com>
> >> ---
> >>  drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
> >>  drivers/md/raid1.h |  42 +++++-------
> >>  2 files changed, 242 insertions(+), 189 deletions(-)
> >>
> >> Index: linux-raid1/drivers/md/raid1.c
> >> ===================================================================
> >> --- linux-raid1.orig/drivers/md/raid1.c
> >> +++ linux-raid1/drivers/md/raid1.c
> >> @@ -66,9 +66,8 @@
> >>   */
> >>  static int max_queued_requests = 1024;
> >>  
> >> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> >> -			  sector_t bi_sector);
> >> -static void lower_barrier(struct r1conf *conf);
> >> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
> >> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
> >>  
> >>  static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
> >>  {
> >> @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
> >>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
> >>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
> >>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
> >> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
> >>  
> >>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
> >>  {
> >> @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
> >>  {
> >>  	struct r1conf *conf = r1_bio->mddev->private;
> >>  	int i;
> >> +	sector_t sector_nr = r1_bio->sector;
> >>  
> >>  	for (i = 0; i < conf->raid_disks * 2; i++) {
> >>  		struct bio *bio = r1_bio->bios[i];
> >> @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
> >>  
> >>  	mempool_free(r1_bio, conf->r1buf_pool);
> >>  
> >> -	lower_barrier(conf);
> >> +	lower_barrier(conf, sector_nr);
> >>  }
> >>  
> >>  static void reschedule_retry(struct r1bio *r1_bio)
> >> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
> >>  	unsigned long flags;
> >>  	struct mddev *mddev = r1_bio->mddev;
> >>  	struct r1conf *conf = mddev->private;
> >> +	sector_t sector_nr;
> >> +	long idx;
> >> +
> >> +	sector_nr = r1_bio->sector;
> >> +	idx = get_barrier_bucket_idx(sector_nr);
> >>  
> >>  	spin_lock_irqsave(&conf->device_lock, flags);
> >>  	list_add(&r1_bio->retry_list, &conf->retry_list);
> >> -	conf->nr_queued ++;
> >> +	conf->nr_queued[idx]++;
> >>  	spin_unlock_irqrestore(&conf->device_lock, flags);
> >>  
> >>  	wake_up(&conf->wait_barrier);
> >> @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
> >>  	struct bio *bio = r1_bio->master_bio;
> >>  	int done;
> >>  	struct r1conf *conf = r1_bio->mddev->private;
> >> -	sector_t start_next_window = r1_bio->start_next_window;
> >> -	sector_t bi_sector = bio->bi_iter.bi_sector;
> >>  
> >>  	if (bio->bi_phys_segments) {
> >>  		unsigned long flags;
> >> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
> >>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
> >>  		bio->bi_error = -EIO;
> >>  
> >> -	if (done) {
> >> +	if (done)
> >>  		bio_endio(bio);
> >> -		/*
> >> -		 * Wake up any possible resync thread that waits for the device
> >> -		 * to go idle.
> >> -		 */
> >> -		allow_barrier(conf, start_next_window, bi_sector);
> >> -	}
> >>  }
> >>  
> >>  static void raid_end_bio_io(struct r1bio *r1_bio)
> >>  {
> >>  	struct bio *bio = r1_bio->master_bio;
> >> +	struct r1conf *conf = r1_bio->mddev->private;
> >>  
> >>  	/* if nobody has done the final endio yet, do it now */
> >>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
> >> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
> >>  
> >>  		call_bio_endio(r1_bio);
> >>  	}
> >> +
> >> +	/*
> >> +	 * Wake up any possible resync thread that waits for the device
> >> +	 * to go idle.
> >> +	 */
> >> +	allow_barrier(conf, r1_bio->sector);
> > Why this change?
> > 
> 
> I move allow_barrier() here on purpose. Because current raid1 code
> raises nr_pending for each master bio, the barrier buckets patch raises
> nr_pending[idx] for each r1_bio.
> 
> Current code increases nr_pending for each master bio (the bio structure
> received by raid1_make_request()). A master bio may be split by multiple
> r1_bio structures, when all r1_bio completes, nr_pending is decreased
> for the original master bio.
> 
> Since the master bio may go across multiple barrier units, in this patch
> master bio will be split into multiple r1_bio structures for each
> barrier units. In this case, we need to call wait_barrer() for each
> r1_bio, and call allow_barrier() on each r1_bio completion. This is why
> allow_barrier() is moved form master bio completion time to r1_bio
> completion time.
> 
> A master bio may also be split into multiple r1_bio if bad blocks
> encountered, and these r1_bio may stay in same barrier unit. In this
> case the different r1_bio does increase nr_pending[] for same bucket
> index, we should also call wait_barrier() for each r1_bio and call
> allow_barrier() at its completion time.

I think if you do bio_split before raid1_make_request, these issues go away.

> >>  	free_r1bio(r1_bio);
> >>  }
> >>  
> >> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
> >>  	return mirror;
> >>  }
> >>  
> >> +/* bi_end_io callback for regular READ bio */
> >>  static void raid1_end_read_request(struct bio *bio)
> >>  {
> >>  	int uptodate = !bio->bi_error;
> >> @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
> >>  		bio_put(to_put);
> >>  }
> >>  
> >> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
> >> +					  sector_t sectors)
> >> +{
> >> +	sector_t len;
> >> +
> >> +	WARN_ON(sectors == 0);
> >> +	/* len is the number of sectors from start_sector to end of the
> >> +	 * barrier unit which start_sector belongs to.
> >> +	 */
> >> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
> >> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
> >> +	      start_sector;
> >> +
> >> +	if (len > sectors)
> >> +		len = sectors;
> >> +
> >> +	return len;
> >> +}
> > 
> > I'd prefer calling bio_split at the early of raid1_make_request and split the
> > bio if it crosses the bucket boundary. please see raid10_make_request for
> > example. resync will not cross the boundary. So the code will not need to worry
> > about the boundary. I believe this will make the code simpiler (all the
> > align_to_barrier_unit_end calling can removed) and easy to understand.
> > 
> 
> Indeed, my first implementation uses bio_split(). The reasons why I
> don't use it later are,
> - indeed I need to write more code.
> - I can't simply use existing r1_bio completion code to call
> bio_end_io() to the master bio when bio->bi_phys_segments is zero (see
> call_bio_endio()). Because r1_bio->master_bio is the split bio, not the
> original master bio received by raid1_make_request()
> 
> Therefore finally I decide to use align_to_barrier_unit_end() to
> generate more r1_bio structures if the original master bio goes across
> barrier unit size, it makes less modification in this patch.

That bi_phys_segments issue doesn't make sense to me. Please see how
raid10 uses bio_split:
- rename raid1_make_request to __raid1_make_request
- add a raid1_make_request, split the bio there, and then call
  __raid1_make_request

The __raid1_make_request does everything it currently does, so the
bi_phys_segments handling should still work. This might add more code,
but then we don't need to worry about the boundary problem anywhere
except the new raid1_make_request. For example, the allow_barrier issue
goes away with this approach. This will make the code more understandable
and less error prone.
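
Roughly like this (a sketch only; the split-point computation reuses
align_to_barrier_unit_end() just for illustration, and which bio_set to
pass is a detail to sort out):

static void raid1_make_request(struct mddev *mddev, struct bio *bio)
{
	sector_t sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
						     bio_sectors(bio));

	if (sectors < bio_sectors(bio)) {
		/* bio crosses a barrier unit border: split it, resubmit
		 * the remainder, and handle the first part below
		 */
		struct bio *split = bio_split(bio, sectors, GFP_NOIO,
					      mddev->bio_set);

		bio_chain(split, bio);
		generic_make_request(bio);
		bio = split;
	}

	__raid1_make_request(mddev, bio);
}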
 
> >> +
> >>  /*
> >>   * This routine returns the disk from which the requested read should
> >>   * be done. There is a per-array 'next expected sequential IO' sector
> >> @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
> >>  		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
> >>  	}
> >>  	rcu_read_unlock();
> >> +	sectors = align_to_barrier_unit_end(this_sector, sectors);
> >>  	*max_sectors = sectors;
> >>  
> >>  	return best_disk;
> >> @@ -779,168 +803,174 @@ static void flush_pending_writes(struct
> >>   *    there is no normal IO happeing.  It must arrange to call
> >>   *    lower_barrier when the particular background IO completes.
> >>   */
> >> +
> >>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
> >>  {
> >> +	long idx = get_barrier_bucket_idx(sector_nr);
> >> +
> >>  	spin_lock_irq(&conf->resync_lock);
> >>  
> >>  	/* Wait until no block IO is waiting */
> >> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
> >> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
> >>  			    conf->resync_lock);
> >>  
> >>  	/* block any new IO from starting */
> >> -	conf->barrier++;
> >> -	conf->next_resync = sector_nr;
> >> +	conf->barrier[idx]++;
> >>  
> >>  	/* For these conditions we must wait:
> >>  	 * A: while the array is in frozen state
> >> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
> >> -	 *    the max count which allowed.
> >> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
> >> -	 *    next resync will reach to the window which normal bios are
> >> -	 *    handling.
> >> -	 * D: while there are any active requests in the current window.
> >> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
> >> +	 *    existing in sector number ranges corresponding to idx.
> >> +	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
> >> +	 *    the max count which allowed in sector number ranges
> >> +	 *    conrresponding to idx.
> >>  	 */
> >>  	wait_event_lock_irq(conf->wait_barrier,
> >> -			    !conf->array_frozen &&
> >> -			    conf->barrier < RESYNC_DEPTH &&
> >> -			    conf->current_window_requests == 0 &&
> >> -			    (conf->start_next_window >=
> >> -			     conf->next_resync + RESYNC_SECTORS),
> >> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
> >> +			    conf->barrier[idx] < RESYNC_DEPTH,
> >>  			    conf->resync_lock);
> > 
> > So there is a slight behavior change. Originally we guarautee no more than
> > RESYNC_DEPTH sync. Now this only checks 'conf->barrier[idx] < RESYNC_DEPTH'.
> > How about barrier[idx-1], barrier[idx-2]...? It's possible conf->barrier[idx] <
> > RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH. Not sure how
> > important this is, but at least we can check barrier[idx] + barrier[idx-1].
> 
> The original code wants to rise more multiple barriers, but less then
> RESYNC_DEPTH. It helps resync I/O throughput with less resync I/O and
> regular I/O conflict. Considering conf->barrier is a global barrier,
> large RESYNC_DEPTH will starve regular I/O, it is only set to 32 for
> whole raid1 device.
> 
> In the barrier buckets patch, conf->barrier[idx] only takes effect in a
> single bucket. More barriers raised in this barrier buckets won't
> interfere regular I/O in other barrier buckets, therefore we can have
> much more resync I/O barriers to raise, which is good for resync I/O
> throughput without starve more regular I/Os.
> 
> If we have 512 barrier buckets, that means on the whole raid1 device
> there are 512*RESYNC_DEPTH barriers can be raised, and allows more
> parallel resync I/O.
> 
> Before implementing parallel resync, I keep RESYNC_DEPTH as 32 in this
> patch, because we only have a resync I/O hits one barrier buckets, same
> RESYNC_DEPTH won't change barrier behavior now. Latter when we have
> parallel resync I/O, let's see whether we need to modify this value,
> also still keep it as 32.

The problem is that resync writes could harm the latency of regular IO.
Even though we have a resync limitation mechanism, sending a lot of
resync writes out could harm regular IO latency in the worst case. If the
application has P99 metrics, more resync IO will definitely harm them,
though this will not be easy to observe. Could we have a global pending
counter to record barrier IO and use it in the same way as the old code?
That said, I'm not sure how important this is, but this could make things
worse, so I'd like to avoid the regression.
 
> >> -
> >> -	conf->nr_pending++;
> >> +	conf->nr_pending[idx]++;
> >>  	spin_unlock_irq(&conf->resync_lock);
> >>  }
> >>  
> >> -static void lower_barrier(struct r1conf *conf)
> >> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
> >>  {
> >>  	unsigned long flags;
> >> -	BUG_ON(conf->barrier <= 0);
> >> +	long idx = get_barrier_bucket_idx(sector_nr);
> >> +
> >> +	BUG_ON(conf->barrier[idx] <= 0);
> >>  	spin_lock_irqsave(&conf->resync_lock, flags);
> >> -	conf->barrier--;
> >> -	conf->nr_pending--;
> >> +	conf->barrier[idx]--;
> >> +	conf->nr_pending[idx]--;
> >>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
> >>  	wake_up(&conf->wait_barrier);
> >>  }
> >>  
> >> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
> >> +/* A regular I/O should wait when,
> >> + * - The whole array is frozen (both READ and WRITE)
> >> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
> >> + */
> >> +static void _wait_barrier(struct r1conf *conf, long idx)
> >>  {
> >> -	bool wait = false;
> >> -
> >> -	if (conf->array_frozen || !bio)
> >> -		wait = true;
> >> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
> >> -		if ((conf->mddev->curr_resync_completed
> >> -		     >= bio_end_sector(bio)) ||
> >> -		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
> >> -		     <= bio->bi_iter.bi_sector))
> >> -			wait = false;
> >> -		else
> >> -			wait = true;
> >> +	spin_lock_irq(&conf->resync_lock);
> >> +	if (conf->array_frozen || conf->barrier[idx]) {
> >> +		conf->nr_waiting[idx]++;
> >> +		/* Wait for the barrier to drop. */
> >> +		wait_event_lock_irq(
> >> +			conf->wait_barrier,
> >> +			!conf->array_frozen && !conf->barrier[idx],
> >> +			conf->resync_lock);
> >> +		conf->nr_waiting[idx]--;
> >>  	}
> >>  
> >> -	return wait;
> >> +	conf->nr_pending[idx]++;
> >> +	spin_unlock_irq(&conf->resync_lock);
> >>  }
> >>  
> >> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
> >> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
> >>  {
> >> -	sector_t sector = 0;
> >> +	long idx = get_barrier_bucket_idx(sector_nr);
> >>  
> >>  	spin_lock_irq(&conf->resync_lock);
> >> -	if (need_to_wait_for_sync(conf, bio)) {
> >> -		conf->nr_waiting++;
> >> -		/* Wait for the barrier to drop.
> >> -		 * However if there are already pending
> >> -		 * requests (preventing the barrier from
> >> -		 * rising completely), and the
> >> -		 * per-process bio queue isn't empty,
> >> -		 * then don't wait, as we need to empty
> >> -		 * that queue to allow conf->start_next_window
> >> -		 * to increase.
> >> -		 */
> >> -		wait_event_lock_irq(conf->wait_barrier,
> >> -				    !conf->array_frozen &&
> >> -				    (!conf->barrier ||
> >> -				     ((conf->start_next_window <
> >> -				       conf->next_resync + RESYNC_SECTORS) &&
> >> -				      current->bio_list &&
> >> -				      !bio_list_empty(current->bio_list))),
> >> -				    conf->resync_lock);
> >> -		conf->nr_waiting--;
> >> -	}
> >> -
> >> -	if (bio && bio_data_dir(bio) == WRITE) {
> >> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
> >> -			if (conf->start_next_window == MaxSector)
> >> -				conf->start_next_window =
> >> -					conf->next_resync +
> >> -					NEXT_NORMALIO_DISTANCE;
> >> -
> >> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
> >> -			    <= bio->bi_iter.bi_sector)
> >> -				conf->next_window_requests++;
> >> -			else
> >> -				conf->current_window_requests++;
> >> -			sector = conf->start_next_window;
> >> -		}
> >> +	if (conf->array_frozen) {
> >> +		conf->nr_waiting[idx]++;
> >> +		/* Wait for array to unfreeze */
> >> +		wait_event_lock_irq(
> >> +			conf->wait_barrier,
> >> +			!conf->array_frozen,
> >> +			conf->resync_lock);
> >> +		conf->nr_waiting[idx]--;
> >>  	}
> >> -
> >> -	conf->nr_pending++;
> >> +	conf->nr_pending[idx]++;
> >>  	spin_unlock_irq(&conf->resync_lock);
> >> -	return sector;
> >>  }
> >>  
> >> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
> >> -			  sector_t bi_sector)
> >> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
> >> +{
> >> +	long idx = get_barrier_bucket_idx(sector_nr);
> >> +
> >> +	_wait_barrier(conf, idx);
> >> +}
> >> +
> >> +static void wait_all_barriers(struct r1conf *conf)
> >> +{
> >> +	long idx;
> >> +
> >> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> >> +		_wait_barrier(conf, idx);
> > 
> > This is racy. Since resync is sequential, idealy we only need to check the
> > bucket sequentially. The compiler could do  _wait_barrier(conf, 1) and then
> > _wait_barrier(conf, 0). It's possible the bucket 1 has barrier right after the
> > check of bucket 0. Even the compiler doesn't reorder, there is case the bucket
> > 511 should be check before bucket 0 if the resync is just crossing 512 buckets.
> > It would be better we check the resync position and adjust the bucket wait
> > order according to the position.
> > 
> 
> wait_all_barriers() and allow_all_barriers() are used in close_sync() only,
>  static void close_sync(struct r1conf *conf)
>  {
> -	wait_barrier(conf, NULL);
> -	allow_barrier(conf, 0, 0);
> +	wait_all_barriers(conf);
> +	allow_all_barriers(conf);
> [snip]
>  }
> 
> wait_barrier() and allow_barrier() called here is for a synchronization
> purpose, to make sure before close_sync() returns there is no resync I/O
> existing.
> 
> close_sync() is called in raid1_sync_request() when resync I/O exceeds
> the size of raid1 device, and resync should be closed. In this
> condition, there won't be any new resync I/O generated. If a
> conf->barrier[idx] reaches 0, it won't increase before a new resync
> restarts. Therefore it is safe to call _wait_barrier() one by one, until
> every conf->barrier[idx] reaches to 0.
> 
> This is a usage of wait/allow_barrier() which is not mentioned in the
> code. They are not for regular I/O waiting for resync I/O, they are used
> as a synchronization to make sure all on-flying resync I/O to complete.

thanks, this makes sense.

> >>  	int good_sectors = RESYNC_SECTORS;
> >>  	int min_bad = 0; /* number of sectors that are bad in all devices */
> >> +	long idx = get_barrier_bucket_idx(sector_nr);
> >>  
> >>  	if (!conf->r1buf_pool)
> >>  		if (init_resync(conf))
> >> @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
> >>  	 * If there is non-resync activity waiting for a turn, then let it
> >>  	 * though before starting on this new sync request.
> >>  	 */
> >> -	if (conf->nr_waiting)
> >> +	if (conf->nr_waiting[idx])
> >>  		schedule_timeout_uninterruptible(1);
> >>  
> >>  	/* we are incrementing sector_nr below. To be safe, we check against
> >> @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
> >>  	r1_bio->sector = sector_nr;
> >>  	r1_bio->state = 0;
> >>  	set_bit(R1BIO_IsSync, &r1_bio->state);
> >> +	/* make sure good_sectors won't go across barrier unit border */
> >> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
> >>  
> >>  	for (i = 0; i < conf->raid_disks * 2; i++) {
> >>  		struct md_rdev *rdev;
> >> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
> >>  	if (!conf)
> >>  		goto abort;
> >>  
> >> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
> >> +	if (!conf->nr_pending)
> >> +		goto abort;
> > 
> > This allocation is a little wierd. I'd define a count and uses
> > sizeof(nr_pending) * count to do the allocation. There is nothing related to
> > PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.
> 
> This is just for better memory usage, r1conf is allocated by slab
> allocator, large not aligned size may cause internal memory
> fragmentation inside slab's pages. call kzalloc() with PAGE_SIZE will
> avoid such issue.

Not sure if slab has a close-enough size class for this, but that's not
important. I think sizeof(nr_pending) * count makes more sense here, even
if sizeof(nr_pending) * count == PAGE_SIZE.
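
I.e. something along these lines (a sketch):

	/* size the array explicitly instead of assuming PAGE_SIZE */
	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
				   sizeof(*conf->nr_pending), GFP_KERNEL);
	if (!conf->nr_pending)
		goto abort;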

Thanks,
Shaohua


* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-29 19:29     ` Shaohua Li
@ 2016-11-30  2:57       ` Coly Li
  0 siblings, 0 replies; 15+ messages in thread
From: Coly Li @ 2016-11-30  2:57 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On 2016/11/30 at 3:29 AM, Shaohua Li wrote:
> On Mon, Nov 28, 2016 at 02:42:22PM +0800, Coly Li wrote:
>> On 2016/11/23 上午5:35, Shaohua Li wrote:
>>> On Tue, Nov 22, 2016 at 05:54:00AM +0800, Coly Li wrote:
>>>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>>>> introduces a sliding resync window for raid1 I/O barrier, this idea limits
>>>> I/O barriers to happen only inside a slidingresync window, for regular
>>>> I/Os out of this resync window they don't need to wait for barrier any
>>>> more. On large raid1 device, it helps a lot to improve parallel writing
>>>> I/O throughput when there are background resync I/Os performing at
>>>> same time. 
>>>>
>>>> The idea of sliding resync widow is awesome, but there are several
>>>> challenges are very difficult to solve,
>>>>  - code complexity
>>>>    Sliding resync window requires several veriables to work collectively,
>>>>    this is complexed and very hard to make it work correctly. Just grep
>>>>    "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches to fix
>>>>    the original resync window patch. This is not the end, any further
>>>>    related modification may easily introduce more regreassion.
>>>>  - multiple sliding resync windows
>>>>    Currently raid1 code only has a single sliding resync window, we cannot
>>>>    do parallel resync with current I/O barrier implementation.
>>>>    Implementing multiple resync windows are much more complexed, and very
>>>>    hard to make it correctly.
>>>>
>>>> Therefore I decide to implement a much simpler raid1 I/O barrier, by
>>>> removing resync window code, I believe life will be much easier.
>>>>
>>>> The brief idea of the simpler barrier is,
>>>>  - Do not maintain a logbal unique resync window
>>>>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>>>>    I/O only has to wait for a resync I/O when both them have same barrier
>>>>    bucket index, vice versa.
>>>>  - I/O barrier can be recuded to an acceptable number if there are enought
>>>>    barrier buckets
>>>>
>>>> Here I explain how the barrier buckets are designed,
>>>>  - BARRIER_UNIT_SECTOR_SIZE
>>>>    The whole LBA address space of a raid1 device is divided into multiple
>>>>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>>>>    Bio request won't go across border of barrier unit size, that means
>>>>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 in bytes.
>>>>  - BARRIER_BUCKETS_NR
>>>>    There are BARRIER_BUCKETS_NR buckets in total, if multiple I/O requests
>>>>    hit different barrier units, they only need to compete I/O barrier with
>>>>    other I/Os which hit the same barrier bucket index with each other. The
>>>>    index of a barrier bucket which a bio should look for is calculated by
>>>>    get_barrier_bucket_idx(),
>>>> 	(sector >> BARRIER_UNIT_SECTOR_BITS) % BARRIER_BUCKETS_NR
>>>>    sector is the start sector number of a bio. align_to_barrier_unit_end()
>>>>    will make sure the finall bio sent into generic_make_request() won't
>>>>    exceed border of the barrier unit size.
>>>>  - RRIER_BUCKETS_NR
>>>>    Number of barrier buckets is defined by,
>>>> 	#define BARRIER_BUCKETS_NR	(PAGE_SIZE/sizeof(long))
>>>>    For 4KB page size, there are 512 buckets for each raid1 device. That
>>>>    means the propobility of full random I/O barrier confliction may be
>>>>    reduced down to 1/512.
>>>
>>> Thanks! The idea is awesome and does makes the code easier to understand.
>>>
>>> For hash conflict, I'm worrying about one case. resync does sequential access.
>>> Say we have a sequential normal IO. If the first normal IO and resync IO have
>>> conflict, it's possible they will have conflict subsequently, since both are
>>> doing sequential access. We can change the hash to something like this:
>>>
>>> for the first 64M * 512, the hash is 0, 1, ... 511
>>> For the second 64M * 512, the hash is 1, 3, 5 ...
>>> The third 64M * 512, the hash is 2, 5, 8 ...
>>>
>>> It should be very easy to implement something like this, and this should reduce
>>> the conflict of sequential access.
>>>  
>>
>> Hi Shaohua,
>>
>> What I concerned was memory footprint. For very fast sequential I/O,
>> lineal mapping buckets means each (64 bytes) cache line contains 8 long
>> integer, it is very friendly for CPU caching,
>>  - sequential writing 8 barrier units only requires 1 memory fetching
>> for each barrier variables (barrier[], nr_pending[], nr_waiting[],
>> nr_queued[]).
>>  - memory prefetch may have positive effect in sequential I/O.
>>
>> It will be very rare that resync I/O and regular I/O are always
>> conflicted in same barrier bucket: resync I/O throughput usually is
>> slower than regular I/O, it is very hard to keep them always conflicting
>> in same bucket. If this condition really happens, current sliding resync
>> window implementation may also have a similar conflict (regular I/O
>> always hits resync window). So the barrier buckets don't make things worse.
>>
>> If we use a non-continuous mapping, memory fingerprint of the buckets
>> will be quite distributed, and occupies more cache lines for a
>> sequential I/O, which will increase the probability of more cache bounce
>> if there are multiple sequential I/O (on different offset) happen on
>> different CPU cores.
>>
>> This is why I use a very simple linear buckets hashing.
> 
> Don't think memory footprint matters much. And wasting a little memory (eg,
> make the barrier and stuffes cacheline aligned) doesn't matter here too if
> necessary. If we could reduce the hash confilict in an easy way, why not? I
> think Neil's suggestion (using hash_long) makes sense.
>  

Okay, I will do it in the next version.


>>>> Comparing to single sliding resync window,
>>>>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>>>>    will have confliction within a single barrier units. So it is similar to
>>>>    single sliding resync window.
>>>>  - But a barrier unit bucket is shared by all barrier units with identical
>>>>    barrier uinit index, the probability of confliction might be higher
>>>>    than single sliding resync window, in condition that writing I/Os
>>>>    always hit barrier units which have identical barrier bucket index with
>>>>    the resync I/Os. This is a very rare condition in real I/O work loads,
>>>>    I cannot imagine how it could happen in practice.
>>>>  - Therefore we can achieve a good enough low confliction rate with much
>>>>    simpler barrier algorithm and implementation.
>>>>
>>>> If user has a (realy) large raid1 device, for example 10PB size, we may
>>>> just increase the buckets number BARRIER_BUCKETS_NR. Now this is a macro,
>>>> it is possible to be a raid1-created-time-defined variable in future.
>>>>
>>>> There are two changes that should be noticed,
>>>>  - In raid1d(), I change the code that decreases conf->nr_queued[idx]
>>>>    into a single loop; it looks like this,
>>>> 	spin_lock_irqsave(&conf->device_lock, flags);
>>>> 	conf->nr_queued[idx]--;
>>>> 	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>    This change generates more spin lock operations, but in the next patch
>>>>    of this patch set it will be replaced by a single line of code,
>>>> 	atomic_dec(conf->nr_queued[idx]);
>>>>    So we don't need to worry about spin lock cost here.
>>>>  - In raid1_make_request(), wait_barrier() is replaced by,
>>>>    a) wait_read_barrier(): wait for the barrier in the regular read I/O
>>>>       code path
>>>>    b) wait_barrier(): wait for the barrier in the regular write I/O code
>>>>       path
>>>>    The difference is that wait_read_barrier() only waits if the array is
>>>>    frozen. I am not able to combine them into one function, because they
>>>>    must receive different data types in their argument lists.
>>>>  - align_to_barrier_unit_end() is called to make sure neither regular nor
>>>>    resync I/O goes across the border of the barrier unit size.
>>>>  
>>>> Open question:
>>>>  - Need review from an md clustering developer; I don't touch the related
>>>>    code for now.
>>>
>>> Don't think it matters, but please keep an eye on it, Guoqing!
>>>  
>>>> Signed-off-by: Coly Li <colyli@suse.de>
>>>> Cc: Shaohua Li <shli@fb.com>
>>>> Cc: Neil Brown <neilb@suse.de>
>>>> Cc: Johannes Thumshirn <jthumshirn@suse.de>
>>>> Cc: Guoqing Jiang <gqjiang@suse.com>
>>>> ---
>>>>  drivers/md/raid1.c | 389 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------------------
>>>>  drivers/md/raid1.h |  42 +++++-------
>>>>  2 files changed, 242 insertions(+), 189 deletions(-)
>>>>
>>>> Index: linux-raid1/drivers/md/raid1.c
>>>> ===================================================================
>>>> --- linux-raid1.orig/drivers/md/raid1.c
>>>> +++ linux-raid1/drivers/md/raid1.c
>>>> @@ -66,9 +66,8 @@
>>>>   */
>>>>  static int max_queued_requests = 1024;
>>>>  
>>>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>>> -			  sector_t bi_sector);
>>>> -static void lower_barrier(struct r1conf *conf);
>>>> +static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
>>>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
>>>>  
>>>>  static void * r1bio_pool_alloc(gfp_t gfp_flags, void *data)
>>>>  {
>>>> @@ -92,7 +91,6 @@ static void r1bio_pool_free(void *r1_bio
>>>>  #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
>>>>  #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
>>>>  #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
>>>> -#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
>>>>  
>>>>  static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
>>>>  {
>>>> @@ -198,6 +196,7 @@ static void put_buf(struct r1bio *r1_bio
>>>>  {
>>>>  	struct r1conf *conf = r1_bio->mddev->private;
>>>>  	int i;
>>>> +	sector_t sector_nr = r1_bio->sector;
>>>>  
>>>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>>>  		struct bio *bio = r1_bio->bios[i];
>>>> @@ -207,7 +206,7 @@ static void put_buf(struct r1bio *r1_bio
>>>>  
>>>>  	mempool_free(r1_bio, conf->r1buf_pool);
>>>>  
>>>> -	lower_barrier(conf);
>>>> +	lower_barrier(conf, sector_nr);
>>>>  }
>>>>  
>>>>  static void reschedule_retry(struct r1bio *r1_bio)
>>>> @@ -215,10 +214,15 @@ static void reschedule_retry(struct r1bi
>>>>  	unsigned long flags;
>>>>  	struct mddev *mddev = r1_bio->mddev;
>>>>  	struct r1conf *conf = mddev->private;
>>>> +	sector_t sector_nr;
>>>> +	long idx;
>>>> +
>>>> +	sector_nr = r1_bio->sector;
>>>> +	idx = get_barrier_bucket_idx(sector_nr);
>>>>  
>>>>  	spin_lock_irqsave(&conf->device_lock, flags);
>>>>  	list_add(&r1_bio->retry_list, &conf->retry_list);
>>>> -	conf->nr_queued ++;
>>>> +	conf->nr_queued[idx]++;
>>>>  	spin_unlock_irqrestore(&conf->device_lock, flags);
>>>>  
>>>>  	wake_up(&conf->wait_barrier);
>>>> @@ -235,8 +239,6 @@ static void call_bio_endio(struct r1bio
>>>>  	struct bio *bio = r1_bio->master_bio;
>>>>  	int done;
>>>>  	struct r1conf *conf = r1_bio->mddev->private;
>>>> -	sector_t start_next_window = r1_bio->start_next_window;
>>>> -	sector_t bi_sector = bio->bi_iter.bi_sector;
>>>>  
>>>>  	if (bio->bi_phys_segments) {
>>>>  		unsigned long flags;
>>>> @@ -255,19 +257,14 @@ static void call_bio_endio(struct r1bio
>>>>  	if (!test_bit(R1BIO_Uptodate, &r1_bio->state))
>>>>  		bio->bi_error = -EIO;
>>>>  
>>>> -	if (done) {
>>>> +	if (done)
>>>>  		bio_endio(bio);
>>>> -		/*
>>>> -		 * Wake up any possible resync thread that waits for the device
>>>> -		 * to go idle.
>>>> -		 */
>>>> -		allow_barrier(conf, start_next_window, bi_sector);
>>>> -	}
>>>>  }
>>>>  
>>>>  static void raid_end_bio_io(struct r1bio *r1_bio)
>>>>  {
>>>>  	struct bio *bio = r1_bio->master_bio;
>>>> +	struct r1conf *conf = r1_bio->mddev->private;
>>>>  
>>>>  	/* if nobody has done the final endio yet, do it now */
>>>>  	if (!test_and_set_bit(R1BIO_Returned, &r1_bio->state)) {
>>>> @@ -278,6 +275,12 @@ static void raid_end_bio_io(struct r1bio
>>>>  
>>>>  		call_bio_endio(r1_bio);
>>>>  	}
>>>> +
>>>> +	/*
>>>> +	 * Wake up any possible resync thread that waits for the device
>>>> +	 * to go idle.
>>>> +	 */
>>>> +	allow_barrier(conf, r1_bio->sector);
>>> Why this change?
>>>
>>
>> I move allow_barrier() here on purpose. The current raid1 code raises
>> nr_pending for each master bio, while the barrier buckets patch raises
>> nr_pending[idx] for each r1_bio.
>>
>> The current code increases nr_pending for each master bio (the bio
>> structure received by raid1_make_request()). A master bio may be split
>> into multiple r1_bio structures; when all r1_bios complete, nr_pending
>> is decreased for the original master bio.
>>
>> Since a master bio may go across multiple barrier units, in this patch
>> the master bio will be split into multiple r1_bio structures, one for
>> each barrier unit. In this case we need to call wait_barrier() for each
>> r1_bio, and call allow_barrier() on each r1_bio completion. This is why
>> allow_barrier() is moved from master bio completion time to r1_bio
>> completion time.
>>
>> A master bio may also be split into multiple r1_bios if bad blocks are
>> encountered, and these r1_bios may stay in the same barrier unit. In
>> this case the different r1_bios do increase nr_pending[] for the same
>> bucket index, so we should still call wait_barrier() for each r1_bio and
>> call allow_barrier() at its completion time.
> 
> I think if you do bio_split before raid1_make_request, these issues go away.
> 

Let me try it :-)

>>>>  	free_r1bio(r1_bio);
>>>>  }
>>>>  
>>>> @@ -311,6 +314,7 @@ static int find_bio_disk(struct r1bio *r
>>>>  	return mirror;
>>>>  }
>>>>  
>>>> +/* bi_end_io callback for regular READ bio */
>>>>  static void raid1_end_read_request(struct bio *bio)
>>>>  {
>>>>  	int uptodate = !bio->bi_error;
>>>> @@ -490,6 +494,25 @@ static void raid1_end_write_request(stru
>>>>  		bio_put(to_put);
>>>>  }
>>>>  
>>>> +static sector_t align_to_barrier_unit_end(sector_t start_sector,
>>>> +					  sector_t sectors)
>>>> +{
>>>> +	sector_t len;
>>>> +
>>>> +	WARN_ON(sectors == 0);
>>>> +	/* len is the number of sectors from start_sector to end of the
>>>> +	 * barrier unit which start_sector belongs to.
>>>> +	 */
>>>> +	len = ((start_sector + sectors + (1<<BARRIER_UNIT_SECTOR_BITS) - 1) &
>>>> +	       (~(BARRIER_UNIT_SECTOR_SIZE - 1))) -
>>>> +	      start_sector;
>>>> +
>>>> +	if (len > sectors)
>>>> +		len = sectors;
>>>> +
>>>> +	return len;
>>>> +}
>>>
>>> I'd prefer calling bio_split early in raid1_make_request and splitting the
>>> bio if it crosses the bucket boundary. Please see raid10_make_request for
>>> an example. Resync will not cross the boundary, so the code will not need
>>> to worry about the boundary. I believe this will make the code simpler
>>> (all the align_to_barrier_unit_end calls can be removed) and easier to
>>> understand.
>>>
>>
>> Indeed, my first implementation used bio_split(). The reasons why I
>> dropped it later are,
>> - indeed I need to write more code.
>> - I can't simply use the existing r1_bio completion code to call
>> bio_end_io() on the master bio when bio->bi_phys_segments is zero (see
>> call_bio_endio()), because r1_bio->master_bio is the split bio, not the
>> original master bio received by raid1_make_request().
>>
>> Therefore I finally decided to use align_to_barrier_unit_end() to
>> generate more r1_bio structures when the original master bio goes across
>> the barrier unit size; it requires fewer modifications in this patch.
> 
> That bi_phys_segments issue doesn't make sense to me. Please see how raid10
> uses bio_split.
> - rename raid1_make_request to __raid1_make_request
> - add a raid1_make_request, split the bio there, and then call
>   __raid1_make_request
> 
> __raid1_make_request does everything it currently does, so bi_phys_segments
> should still work. This might add more code, but now we don't need to worry
> about the boundary problem anywhere else except the new raid1_make_request.
> For example, the allow_barrier issue goes away with this approach. This will
> make the code more understandable and less error prone.
>  

Thanks for the hint. I will use this method in the next version.
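
For what it is worth, a rough sketch of that structure follows the
raid10_make_request pattern (function name __raid1_make_request and the
details below are assumptions from this discussion, not the final patch):

	/* Sketch only: split the incoming bio at the barrier unit border,
	 * then let __raid1_make_request() (the renamed current function)
	 * handle each piece, so it never sees a bio crossing a border.
	 */
	static void raid1_make_request(struct mddev *mddev, struct bio *bio)
	{
		struct bio *split;
		sector_t sectors;

		do {
			/* sectors left inside the current barrier unit */
			sectors = align_to_barrier_unit_end(
					bio->bi_iter.bi_sector,
					bio_sectors(bio));
			if (sectors < bio_sectors(bio)) {
				split = bio_split(bio, sectors, GFP_NOIO,
						  fs_bio_set);
				bio_chain(split, bio);
			} else {
				split = bio;
			}

			__raid1_make_request(mddev, split);
		} while (split != bio);
	}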


>>>> +
>>>>  /*
>>>>   * This routine returns the disk from which the requested read should
>>>>   * be done. There is a per-array 'next expected sequential IO' sector
>>>> @@ -691,6 +714,7 @@ static int read_balance(struct r1conf *c
>>>>  		conf->mirrors[best_disk].next_seq_sect = this_sector + sectors;
>>>>  	}
>>>>  	rcu_read_unlock();
>>>> +	sectors = align_to_barrier_unit_end(this_sector, sectors);
>>>>  	*max_sectors = sectors;
>>>>  
>>>>  	return best_disk;
>>>> @@ -779,168 +803,174 @@ static void flush_pending_writes(struct
>>>>   *    there is no normal IO happeing.  It must arrange to call
>>>>   *    lower_barrier when the particular background IO completes.
>>>>   */
>>>> +
>>>>  static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>>  	spin_lock_irq(&conf->resync_lock);
>>>>  
>>>>  	/* Wait until no block IO is waiting */
>>>> -	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
>>>> +	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
>>>>  			    conf->resync_lock);
>>>>  
>>>>  	/* block any new IO from starting */
>>>> -	conf->barrier++;
>>>> -	conf->next_resync = sector_nr;
>>>> +	conf->barrier[idx]++;
>>>>  
>>>>  	/* For these conditions we must wait:
>>>>  	 * A: while the array is in frozen state
>>>> -	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
>>>> -	 *    the max count which allowed.
>>>> -	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
>>>> -	 *    next resync will reach to the window which normal bios are
>>>> -	 *    handling.
>>>> -	 * D: while there are any active requests in the current window.
>>>> +	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
>>>> +	 *    existing in sector number ranges corresponding to idx.
>>>> +	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning resync reach
>>>> +	 *    the max count which allowed in sector number ranges
>>>> +	 *    conrresponding to idx.
>>>>  	 */
>>>>  	wait_event_lock_irq(conf->wait_barrier,
>>>> -			    !conf->array_frozen &&
>>>> -			    conf->barrier < RESYNC_DEPTH &&
>>>> -			    conf->current_window_requests == 0 &&
>>>> -			    (conf->start_next_window >=
>>>> -			     conf->next_resync + RESYNC_SECTORS),
>>>> +			    !conf->array_frozen && !conf->nr_pending[idx] &&
>>>> +			    conf->barrier[idx] < RESYNC_DEPTH,
>>>>  			    conf->resync_lock);
>>>
>>> So there is a slight behavior change. Originally we guarantee no more than
>>> RESYNC_DEPTH syncs. Now this only checks 'conf->barrier[idx] < RESYNC_DEPTH'.
>>> How about barrier[idx-1], barrier[idx-2]...? It's possible conf->barrier[idx] <
>>> RESYNC_DEPTH, but barrier[idx] + barrier[idx-1] > RESYNC_DEPTH. Not sure how
>>> important this is, but at least we can check barrier[idx] + barrier[idx-1].
>>
>> The original code wants to raise multiple barriers, but fewer than
>> RESYNC_DEPTH. It helps resync I/O throughput while keeping resync I/O and
>> regular I/O conflicts low. Since conf->barrier is a global barrier, a
>> large RESYNC_DEPTH would starve regular I/O, so it is set to only 32 for
>> the whole raid1 device.
>>
>> In the barrier buckets patch, conf->barrier[idx] only takes effect in a
>> single bucket. More barriers raised in one barrier bucket won't
>> interfere with regular I/O in other barrier buckets, therefore we can
>> raise many more resync I/O barriers, which is good for resync I/O
>> throughput without starving more regular I/Os.
>>
>> If we have 512 barrier buckets, that means on the whole raid1 device up
>> to 512*RESYNC_DEPTH barriers can be raised, which allows more parallel
>> resync I/O.
>>
>> Before implementing parallel resync, I keep RESYNC_DEPTH as 32 in this
>> patch, because we only have one resync I/O hitting one barrier bucket, so
>> the same RESYNC_DEPTH won't change barrier behavior now. Later, when we
>> have parallel resync I/O, let's see whether we need to modify this value
>> or still keep it at 32.
> 
> The problem is that resync writes could harm the latency of regular IO. Even
> though we have a resync limitation mechanism, sending a lot of resync writes
> out could harm regular IO latency in the worst case. If the application has
> P99 metrics, more resync IO will definitely harm it. This will not be easy
> to observe though. Could we have a global pending counter to record barrier
> IO and use it in the same way as the old code? That said, I'm not sure how
> important this is, but this could make things worse, so I'd like to avoid
> the regression.
>  

Okay, I will maintain a global barrier depth counter, and keep the global
barrier depth at 32, as the current upstream code does.

>>>> -
>>>> -	conf->nr_pending++;
>>>> +	conf->nr_pending[idx]++;
>>>>  	spin_unlock_irq(&conf->resync_lock);
>>>>  }
>>>>  
>>>> -static void lower_barrier(struct r1conf *conf)
>>>> +static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>>  	unsigned long flags;
>>>> -	BUG_ON(conf->barrier <= 0);
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>> +	BUG_ON(conf->barrier[idx] <= 0);
>>>>  	spin_lock_irqsave(&conf->resync_lock, flags);
>>>> -	conf->barrier--;
>>>> -	conf->nr_pending--;
>>>> +	conf->barrier[idx]--;
>>>> +	conf->nr_pending[idx]--;
>>>>  	spin_unlock_irqrestore(&conf->resync_lock, flags);
>>>>  	wake_up(&conf->wait_barrier);
>>>>  }
>>>>  
>>>> -static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
>>>> +/* A regular I/O should wait when,
>>>> + * - The whole array is frozen (both READ and WRITE)
>>>> + * - bio is WRITE and in same barrier bucket conf->barrier[idx] raised
>>>> + */
>>>> +static void _wait_barrier(struct r1conf *conf, long idx)
>>>>  {
>>>> -	bool wait = false;
>>>> -
>>>> -	if (conf->array_frozen || !bio)
>>>> -		wait = true;
>>>> -	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
>>>> -		if ((conf->mddev->curr_resync_completed
>>>> -		     >= bio_end_sector(bio)) ||
>>>> -		    (conf->next_resync + NEXT_NORMALIO_DISTANCE
>>>> -		     <= bio->bi_iter.bi_sector))
>>>> -			wait = false;
>>>> -		else
>>>> -			wait = true;
>>>> +	spin_lock_irq(&conf->resync_lock);
>>>> +	if (conf->array_frozen || conf->barrier[idx]) {
>>>> +		conf->nr_waiting[idx]++;
>>>> +		/* Wait for the barrier to drop. */
>>>> +		wait_event_lock_irq(
>>>> +			conf->wait_barrier,
>>>> +			!conf->array_frozen && !conf->barrier[idx],
>>>> +			conf->resync_lock);
>>>> +		conf->nr_waiting[idx]--;
>>>>  	}
>>>>  
>>>> -	return wait;
>>>> +	conf->nr_pending[idx]++;
>>>> +	spin_unlock_irq(&conf->resync_lock);
>>>>  }
>>>>  
>>>> -static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
>>>> +static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
>>>>  {
>>>> -	sector_t sector = 0;
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>>  
>>>>  	spin_lock_irq(&conf->resync_lock);
>>>> -	if (need_to_wait_for_sync(conf, bio)) {
>>>> -		conf->nr_waiting++;
>>>> -		/* Wait for the barrier to drop.
>>>> -		 * However if there are already pending
>>>> -		 * requests (preventing the barrier from
>>>> -		 * rising completely), and the
>>>> -		 * per-process bio queue isn't empty,
>>>> -		 * then don't wait, as we need to empty
>>>> -		 * that queue to allow conf->start_next_window
>>>> -		 * to increase.
>>>> -		 */
>>>> -		wait_event_lock_irq(conf->wait_barrier,
>>>> -				    !conf->array_frozen &&
>>>> -				    (!conf->barrier ||
>>>> -				     ((conf->start_next_window <
>>>> -				       conf->next_resync + RESYNC_SECTORS) &&
>>>> -				      current->bio_list &&
>>>> -				      !bio_list_empty(current->bio_list))),
>>>> -				    conf->resync_lock);
>>>> -		conf->nr_waiting--;
>>>> -	}
>>>> -
>>>> -	if (bio && bio_data_dir(bio) == WRITE) {
>>>> -		if (bio->bi_iter.bi_sector >= conf->next_resync) {
>>>> -			if (conf->start_next_window == MaxSector)
>>>> -				conf->start_next_window =
>>>> -					conf->next_resync +
>>>> -					NEXT_NORMALIO_DISTANCE;
>>>> -
>>>> -			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
>>>> -			    <= bio->bi_iter.bi_sector)
>>>> -				conf->next_window_requests++;
>>>> -			else
>>>> -				conf->current_window_requests++;
>>>> -			sector = conf->start_next_window;
>>>> -		}
>>>> +	if (conf->array_frozen) {
>>>> +		conf->nr_waiting[idx]++;
>>>> +		/* Wait for array to unfreeze */
>>>> +		wait_event_lock_irq(
>>>> +			conf->wait_barrier,
>>>> +			!conf->array_frozen,
>>>> +			conf->resync_lock);
>>>> +		conf->nr_waiting[idx]--;
>>>>  	}
>>>> -
>>>> -	conf->nr_pending++;
>>>> +	conf->nr_pending[idx]++;
>>>>  	spin_unlock_irq(&conf->resync_lock);
>>>> -	return sector;
>>>>  }
>>>>  
>>>> -static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
>>>> -			  sector_t bi_sector)
>>>> +static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
>>>> +{
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>> +
>>>> +	_wait_barrier(conf, idx);
>>>> +}
>>>> +
>>>> +static void wait_all_barriers(struct r1conf *conf)
>>>> +{
>>>> +	long idx;
>>>> +
>>>> +	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
>>>> +		_wait_barrier(conf, idx);
>>>
>>> This is racy. Since resync is sequential, ideally we only need to check
>>> the buckets sequentially. The compiler could do _wait_barrier(conf, 1) and
>>> then _wait_barrier(conf, 0). It's possible that bucket 1 gets a barrier
>>> right after the check of bucket 0. Even if the compiler doesn't reorder,
>>> there is a case where bucket 511 should be checked before bucket 0, if the
>>> resync is just crossing 512 buckets. It would be better to check the
>>> resync position and adjust the bucket wait order according to the position.
>>>
>>
>> wait_all_barriers() and allow_all_barriers() are used in close_sync() only,
>>  static void close_sync(struct r1conf *conf)
>>  {
>> -	wait_barrier(conf, NULL);
>> -	allow_barrier(conf, 0, 0);
>> +	wait_all_barriers(conf);
>> +	allow_all_barriers(conf);
>> [snip]
>>  }
>>
>> wait_barrier() and allow_barrier() are called here for synchronization
>> purposes, to make sure that no resync I/O is still in flight before
>> close_sync() returns.
>>
>> close_sync() is called in raid1_sync_request() when the resync I/O
>> exceeds the size of the raid1 device and resync should be closed. In
>> this condition, no new resync I/O will be generated. Once a
>> conf->barrier[idx] reaches 0, it won't increase again before a new
>> resync restarts. Therefore it is safe to call _wait_barrier() one by
>> one, until every conf->barrier[idx] reaches 0.
>>
>> This is a usage of wait/allow_barrier() which is not mentioned in the
>> code. Here they are not for regular I/O waiting for resync I/O; they
>> are used as a synchronization point to make sure all in-flight resync
>> I/O has completed.
> 
> thanks, this makes sense.
> 
>>>>  	int good_sectors = RESYNC_SECTORS;
>>>>  	int min_bad = 0; /* number of sectors that are bad in all devices */
>>>> +	long idx = get_barrier_bucket_idx(sector_nr);
>>>>  
>>>>  	if (!conf->r1buf_pool)
>>>>  		if (init_resync(conf))
>>>> @@ -2535,7 +2571,7 @@ static sector_t raid1_sync_request(struc
>>>>  	 * If there is non-resync activity waiting for a turn, then let it
>>>>  	 * though before starting on this new sync request.
>>>>  	 */
>>>> -	if (conf->nr_waiting)
>>>> +	if (conf->nr_waiting[idx])
>>>>  		schedule_timeout_uninterruptible(1);
>>>>  
>>>>  	/* we are incrementing sector_nr below. To be safe, we check against
>>>> @@ -2562,6 +2598,8 @@ static sector_t raid1_sync_request(struc
>>>>  	r1_bio->sector = sector_nr;
>>>>  	r1_bio->state = 0;
>>>>  	set_bit(R1BIO_IsSync, &r1_bio->state);
>>>> +	/* make sure good_sectors won't go across barrier unit border */
>>>> +	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
>>>>  
>>>>  	for (i = 0; i < conf->raid_disks * 2; i++) {
>>>>  		struct md_rdev *rdev;
>>>> @@ -2786,6 +2824,22 @@ static struct r1conf *setup_conf(struct
>>>>  	if (!conf)
>>>>  		goto abort;
>>>>  
>>>> +	conf->nr_pending = kzalloc(PAGE_SIZE, GFP_KERNEL);
>>>> +	if (!conf->nr_pending)
>>>> +		goto abort;
>>>
>>> This allocation is a little wierd. I'd define a count and uses
>>> sizeof(nr_pending) * count to do the allocation. There is nothing related to
>>> PAGE_SIZE. Alternatively, just make nr_pending an array in r1conf.
>>
>> This is just for better memory usage. r1conf is allocated by the slab
>> allocator, and a large, non-aligned size may cause internal memory
>> fragmentation inside the slab's pages. Calling kzalloc() with PAGE_SIZE
>> avoids such an issue.
> 
> Not sure if slab has a close-enough size class for this, but that's not
> important. I think sizeof(nr_pending) * count makes more sense here, even if
> sizeof(nr_pending) * count == PAGE_SIZE.
> 

Okay, I will move these counters into r1conf in the next version.
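
Concretely, that could be either of the following (illustrative only; the
final field types may differ):

	/* a) fixed-size arrays embedded in struct r1conf, no extra
	 *    allocation or error handling needed:
	 *
	 *	int	nr_pending[BARRIER_BUCKETS_NR];
	 *	int	nr_waiting[BARRIER_BUCKETS_NR];
	 *	int	nr_queued[BARRIER_BUCKETS_NR];
	 *	int	barrier[BARRIER_BUCKETS_NR];
	 *
	 * b) or, if kept as a separate allocation in setup_conf(), size it
	 *    explicitly instead of by PAGE_SIZE:
	 */
	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
				   sizeof(*conf->nr_pending), GFP_KERNEL);
	if (!conf->nr_pending)
		goto abort;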

Thanks for your review and suggestions!

Coly



* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-28  7:33   ` Coly Li
@ 2016-11-30  6:37     ` Guoqing Jiang
  2016-11-30  7:19       ` Coly Li
  0 siblings, 1 reply; 15+ messages in thread
From: Guoqing Jiang @ 2016-11-30  6:37 UTC (permalink / raw)
  To: Coly Li, linux-raid; +Cc: Shaohua Li, Neil Brown, Johannes Thumshirn



On 11/28/2016 03:33 PM, Coly Li wrote:
> In raid1_sync_request(), I see,
> conf->cluster_sync_low = mddev->curr_resync_completed;
> conf->cluster_sync_high = conf->cluster_sync_low +
> CLUSTER_RESYNC_WINDOW_SECTORS;
>
> Is it possible that the LBA range [conf->cluster_sync_low,
> conf->cluster_sync_high] goes across the border of a barrier unit?

Not quite sure about it, but since the cluster's resync window is 32MB, which
is less than BARRIER_UNIT_SECTOR_SIZE, I guess it would not cause trouble.

Thanks,
Guoqing


* Re: [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2016-11-30  6:37     ` Guoqing Jiang
@ 2016-11-30  7:19       ` Coly Li
  0 siblings, 0 replies; 15+ messages in thread
From: Coly Li @ 2016-11-30  7:19 UTC (permalink / raw)
  To: Guoqing Jiang, linux-raid; +Cc: Shaohua Li, Neil Brown, Johannes Thumshirn

On 2016/11/30 2:37 PM, Guoqing Jiang wrote:
> 
> 
> On 11/28/2016 03:33 PM, Coly Li wrote:
>> In raid1_sync_request(), I see,
>> conf->cluster_sync_low = mddev->curr_resync_completed;
>> conf->cluster_sync_high = conf->cluster_sync_low +
>> CLUSTER_RESYNC_WINDOW_SECTORS;
>>
>> Is it possible that the LBA range [conf->cluster_sync_low,
>> conf->cluster_sync_high] goes across the border of a barrier unit?
> 
> Not quite sure about it, but since the cluster's resync window is 32MB, which
> is less than BARRIER_UNIT_SECTOR_SIZE, I guess it would not cause trouble.

Thanks for the hint. So BARRIER_UNIT_SECTOR_SIZE is 32MB-aligned, and the
cluster's resync window won't go across the boundary.


Coly


Thread overview: 15+ messages
2016-11-21 21:54 [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Coly Li
2016-11-21 21:54 ` [RFC PATCH 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code Coly Li
2016-11-22 21:58   ` Shaohua Li
2016-11-22 21:35 ` [RFC PATCH 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
2016-11-23  9:05   ` Guoqing Jiang
2016-11-24  5:45   ` NeilBrown
2016-11-24  6:05     ` Guoqing Jiang
2016-11-28  6:59     ` Coly Li
2016-11-28  6:42   ` Coly Li
2016-11-29 19:29     ` Shaohua Li
2016-11-30  2:57       ` Coly Li
2016-11-24  7:34 ` Guoqing Jiang
2016-11-28  7:33   ` Coly Li
2016-11-30  6:37     ` Guoqing Jiang
2016-11-30  7:19       ` Coly Li
