* [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
@ 2017-02-15 16:35 colyli
  2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
                   ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: colyli @ 2017-02-15 16:35 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")
introduces a sliding resync window for the raid1 I/O barrier. The idea is
to make I/O barriers happen only inside the sliding resync window, so
regular I/Os outside the resync window no longer need to wait for the
barrier. On a large raid1 device, this helps a lot to improve parallel
write I/O throughput when background resync I/O is running at the same
time.

The idea of the sliding resync window is awesome, but the code complexity
is a challenge. The sliding resync window requires several variables to
work collectively, which is complex and very hard to get correct. Just
grep "Fixes: 79ef3a8aa1" in the kernel git log: there are 8 more patches
fixing the original resync window patch, and that is not the end, since
any further related modification may easily introduce more regressions.

Therefore I decided to implement a much simpler raid1 I/O barrier by
removing the resync window code; I believe life will be much easier.

The brief idea of the simpler barrier is,
 - Do not maintain a global unique resync window.
 - Use multiple hash buckets to reduce I/O barrier conflicts; a regular
   I/O only has to wait for a resync I/O when both of them have the same
   barrier bucket index, and vice versa.
 - I/O barrier conflicts can be reduced to an acceptable number if there
   are enough barrier buckets.

Here I explain how the barrier buckets are designed,
 - BARRIER_UNIT_SECTOR_SIZE
   The whole LBA address space of a raid1 device is divided into multiple
   barrier units, each of size BARRIER_UNIT_SECTOR_SIZE.
   A bio request won't cross the border of a barrier unit, which means the
   maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
   For random I/O 64MB is large enough for both read and write requests;
   for sequential I/O, considering that the underlying block layer may
   merge them into larger requests, 64MB is still good enough.
   Neil also points out that for the resync operation, "we want the resync
   to move from region to region fairly quickly so that the slowness caused
   by having to synchronize with the resync is averaged out over a fairly
   small time frame". For a full speed resync, 64MB should take less than 1
   second. When resync is competing with other I/O, it could take up to a
   few minutes. Therefore 64MB is a fairly good range for resync.
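   For reference, a minimal userspace sketch (not part of the patch) that
   derives the 64MB figure from the unit size constants defined later in
   raid1.h:
        #include <stdio.h>

        #define BARRIER_UNIT_SECTOR_BITS  17
        #define BARRIER_UNIT_SECTOR_SIZE  (1UL << BARRIER_UNIT_SECTOR_BITS)

        int main(void)
        {
                /* 131072 sectors * 512 bytes = 67108864 bytes = 64MB */
                printf("unit = %lu sectors = %lu MB\n",
                       BARRIER_UNIT_SECTOR_SIZE,
                       (BARRIER_UNIT_SECTOR_SIZE << 9) >> 20);
                return 0;
        }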

 - BARRIER_BUCKETS_NR
   There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
        #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
        #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
   this patch changes the following members of struct r1conf from integers
   into pointers to arrays of integers,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     *nr_pending;
        +       int                     *nr_waiting;
        +       int                     *nr_queued;
        +       int                     *barrier;
   the number of array elements is defined as BARRIER_BUCKETS_NR. For a 4KB
   kernel space page size, (PAGE_SHIFT - 2) means there are 1024 I/O
   barrier buckets, and each array of integers occupies a single memory
   page. With 1024 buckets, a request which is smaller than the I/O barrier
   unit size has only a ~0.1% chance of having to wait for resync to pause,
   which is a small enough fraction. Also, requesting a single memory page
   is friendlier to the kernel page allocator than a larger allocation.
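   A small userspace sketch (assuming PAGE_SHIFT == 12, i.e. 4KB pages, and
   4-byte int) showing why each per-bucket array fits exactly in one page
   and where the ~0.1% figure comes from:
        #include <stdio.h>

        #define PAGE_SHIFT              12    /* assumed 4KB page size */
        #define BARRIER_BUCKETS_NR_BITS (PAGE_SHIFT - 2)
        #define BARRIER_BUCKETS_NR      (1 << BARRIER_BUCKETS_NR_BITS)

        int main(void)
        {
                /* 1024 buckets * sizeof(int) = 4096 bytes = one page */
                printf("%d buckets, %zu bytes per array\n",
                       BARRIER_BUCKETS_NR,
                       BARRIER_BUCKETS_NR * sizeof(int));
                /* chance a small request shares a bucket with resync */
                printf("~%.2f%% chance to hit the resync bucket\n",
                       100.0 / BARRIER_BUCKETS_NR);
                return 0;
        }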

 - I/O barrier bucket is indexed by bio start sector
   If multiple I/O requests hit different I/O barrier units, they only need
   to compete for the I/O barrier with other I/Os which hit the same I/O
   barrier bucket index. The index of the barrier bucket which a bio should
   look at is calculated by sector_to_idx(), which is defined in raid1.h as
   an inline function,
        static inline int sector_to_idx(sector_t sector)
        {
                return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
                                BARRIER_BUCKETS_NR_BITS);
        }
   Here the sector argument is the start sector number of a bio.
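   For illustration, a userspace sketch of how a bio start sector maps to a
   bucket. hash_long() is replaced by a stand-in multiplicative hash here
   (the in-kernel hash differs); the point is that sectors inside the same
   64MB barrier unit always land in the same bucket:
        #include <stdio.h>

        #define BARRIER_UNIT_SECTOR_BITS 17
        #define BARRIER_BUCKETS_NR       1024  /* assumes 4KB pages */

        /* stand-in for the kernel's hash_long(), for illustration only */
        static int hash_unit(unsigned long unit)
        {
                return (unit * 2654435761UL) % BARRIER_BUCKETS_NR;
        }

        int main(void)
        {
                unsigned long sectors[] = { 0, 131072, 131073, 262144 };
                int i;

                for (i = 0; i < 4; i++) {
                        unsigned long unit =
                                sectors[i] >> BARRIER_UNIT_SECTOR_BITS;
                        printf("sector %lu -> unit %lu -> bucket %d\n",
                               sectors[i], unit, hash_unit(unit));
                }
                return 0;
        }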

 - A single bio won't cross the boundary of an I/O barrier unit
   If a request crosses the boundary of a barrier unit, it will be split. A
   bio may be split in raid1_make_request() or raid1_sync_request(), if the
   number of sectors returned by align_to_barrier_unit_end() is smaller
   than the original bio size.
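   A userspace sketch of the split decision, reusing the logic of
   align_to_barrier_unit_end() from this patch (round_up() is expanded by
   hand, and the sample sector numbers are only for illustration):
        #include <stdio.h>

        #define BARRIER_UNIT_SECTOR_SIZE (1UL << 17)

        /* sectors from start_sector to the end of its barrier unit, capped */
        static unsigned long align_to_unit_end(unsigned long start_sector,
                                               unsigned long sectors)
        {
                unsigned long end = (start_sector + BARRIER_UNIT_SECTOR_SIZE) /
                                    BARRIER_UNIT_SECTOR_SIZE *
                                    BARRIER_UNIT_SECTOR_SIZE;
                unsigned long len = end - start_sector;

                return len > sectors ? sectors : len;
        }

        int main(void)
        {
                /* a 1024-sector bio starting 100 sectors before a boundary */
                unsigned long start = BARRIER_UNIT_SECTOR_SIZE - 100;
                unsigned long nr = 1024;
                unsigned long first = align_to_unit_end(start, nr);

                /* prints "split: 100 + 924 sectors" */
                printf("split: %lu + %lu sectors\n", first, nr - first);
                return 0;
        }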

Comparing to the single sliding resync window,
 - Currently resync I/O grows linearly, therefore regular and resync I/O
   will conflict within a single barrier unit. So the I/O behavior is
   similar to the single sliding resync window.
 - But a barrier bucket is shared by all barrier units with an identical
   barrier bucket index, so the probability of conflict might be higher
   than with a single sliding resync window, in the case that write I/Os
   always hit barrier units which have identical barrier bucket indexes to
   the resync I/Os. This is a very rare condition in real I/O workloads; I
   cannot imagine how it could happen in practice.
 - Therefore we can achieve a sufficiently low conflict rate with a much
   simpler barrier algorithm and implementation.

There are three changes that should be noticed,
 - In raid1d(), I change the code to decrease conf->nr_queued[idx] inside
   the loop, it looks like this,
        spin_lock_irqsave(&conf->device_lock, flags);
        conf->nr_queued[idx]--;
        spin_unlock_irqrestore(&conf->device_lock, flags);
   This change generates more spin lock operations, but in the next patch
   of this patch set it will be replaced by a single line of code,
        atomic_dec(&conf->nr_queued[idx]);
   So we don't need to worry about the spin lock cost here.
 - Mainline raid1 code splits the original raid1_make_request() into
   raid1_read_request() and raid1_write_request(). If the original bio
   crosses an I/O barrier unit boundary, it will be split before calling
   raid1_read_request() or raid1_write_request(); this makes the code
   logic simpler and clearer.
 - In this patch wait_barrier() is moved from raid1_make_request() to
   raid1_write_request(). In raid1_read_request(), the original
   wait_barrier() is replaced by wait_read_barrier().
   The difference is that wait_read_barrier() only waits if the array is
   frozen; using a different barrier function in each code path makes the
   code cleaner and easier to read.
Changelog
V3:
- Rebase the patch against the latest upstream kernel code.
- Many fixes for review comments from Neil,
  - Go back to using pointers instead of arrays in struct r1conf
  - Remove total_barriers from struct r1conf
  - Add more patch comments to explain how/why the values of
    BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
  - Use get_unqueued_pending() to replace get_all_pendings() and
    get_all_queued()
  - Increase bucket number from 512 to 1024
- Change code comment format per review from Shaohua.
V2:
- Use bio_split() to split the original bio if it crosses a barrier unit
  boundary, to make the code simpler, by suggestion from Shaohua and
  Neil.
- Use hash_long() to replace the original linear hash, to avoid a possible
  conflict between resync I/O and sequential write I/O, by suggestion from
  Shaohua.
- Add conf->total_barriers to record barrier depth, which is used to
  control the number of parallel sync I/O barriers, by suggestion from
  Shaohua.
- In the V1 patch, the barrier-bucket related members of r1conf listed
  below were allocated in a separate memory page. To make the code simpler,
  the V2 patch moves the memory into struct r1conf, like this,
        -       int                     nr_pending;
        -       int                     nr_waiting;
        -       int                     nr_queued;
        -       int                     barrier;
        +       int                     nr_pending[BARRIER_BUCKETS_NR];
        +       int                     nr_waiting[BARRIER_BUCKETS_NR];
        +       int                     nr_queued[BARRIER_BUCKETS_NR];
        +       int                     barrier[BARRIER_BUCKETS_NR];
  This change is by suggestion from Shaohua.
- Remove some irrelevant code comments, by suggestion from Guoqing.
- Add a missing wait_barrier() before jumping to retry_write, in
  raid1_make_write_request().
V1:
- Original RFC patch for comments

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 447 ++++++++++++++++++++++++++++++-----------------------
 drivers/md/raid1.h |  42 +++--
 2 files changed, 275 insertions(+), 214 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 7b0f647..4234494 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -71,9 +71,8 @@
  */
 static int max_queued_requests = 1024;
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector);
-static void lower_barrier(struct r1conf *conf);
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr);
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr);
 
 #define raid1_log(md, fmt, args...)				\
 	do { if ((md)->queue) blk_add_trace_msg((md)->queue, "raid1 " fmt, ##args); } while (0)
@@ -100,7 +99,6 @@ static void r1bio_pool_free(void *r1_bio, void *data)
 #define RESYNC_WINDOW_SECTORS (RESYNC_WINDOW >> 9)
 #define CLUSTER_RESYNC_WINDOW (16 * RESYNC_WINDOW)
 #define CLUSTER_RESYNC_WINDOW_SECTORS (CLUSTER_RESYNC_WINDOW >> 9)
-#define NEXT_NORMALIO_DISTANCE (3 * RESYNC_WINDOW_SECTORS)
 
 static void * r1buf_pool_alloc(gfp_t gfp_flags, void *data)
 {
@@ -215,7 +213,7 @@ static void put_buf(struct r1bio *r1_bio)
 
 	mempool_free(r1_bio, conf->r1buf_pool);
 
-	lower_barrier(conf);
+	lower_barrier(conf, r1_bio->sector);
 }
 
 static void reschedule_retry(struct r1bio *r1_bio)
@@ -223,10 +221,12 @@ static void reschedule_retry(struct r1bio *r1_bio)
 	unsigned long flags;
 	struct mddev *mddev = r1_bio->mddev;
 	struct r1conf *conf = mddev->private;
+	int idx;
 
+	idx = sector_to_idx(r1_bio->sector);
 	spin_lock_irqsave(&conf->device_lock, flags);
 	list_add(&r1_bio->retry_list, &conf->retry_list);
-	conf->nr_queued ++;
+	conf->nr_queued[idx]++;
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	wake_up(&conf->wait_barrier);
@@ -243,7 +243,6 @@ static void call_bio_endio(struct r1bio *r1_bio)
 	struct bio *bio = r1_bio->master_bio;
 	int done;
 	struct r1conf *conf = r1_bio->mddev->private;
-	sector_t start_next_window = r1_bio->start_next_window;
 	sector_t bi_sector = bio->bi_iter.bi_sector;
 
 	if (bio->bi_phys_segments) {
@@ -269,7 +268,7 @@ static void call_bio_endio(struct r1bio *r1_bio)
 		 * Wake up any possible resync thread that waits for the device
 		 * to go idle.
 		 */
-		allow_barrier(conf, start_next_window, bi_sector);
+		allow_barrier(conf, bi_sector);
 	}
 }
 
@@ -517,6 +516,25 @@ static void raid1_end_write_request(struct bio *bio)
 		bio_put(to_put);
 }
 
+static sector_t align_to_barrier_unit_end(sector_t start_sector,
+					  sector_t sectors)
+{
+	sector_t len;
+
+	WARN_ON(sectors == 0);
+	/*
+	 * len is the number of sectors from start_sector to end of the
+	 * barrier unit which start_sector belongs to.
+	 */
+	len = round_up(start_sector + 1, BARRIER_UNIT_SECTOR_SIZE) -
+	      start_sector;
+
+	if (len > sectors)
+		len = sectors;
+
+	return len;
+}
+
 /*
  * This routine returns the disk from which the requested read should
  * be done. There is a per-array 'next expected sequential IO' sector
@@ -813,168 +831,168 @@ static void flush_pending_writes(struct r1conf *conf)
  */
 static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 {
+	int idx = sector_to_idx(sector_nr);
+
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting,
+	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier++;
-	conf->next_resync = sector_nr;
+	conf->barrier[idx]++;
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
-	 * B: while barrier >= RESYNC_DEPTH, meaning resync reach
-	 *    the max count which allowed.
-	 * C: next_resync + RESYNC_SECTORS > start_next_window, meaning
-	 *    next resync will reach to the window which normal bios are
-	 *    handling.
-	 * D: while there are any active requests in the current window.
+	 * B: while conf->nr_pending[idx] is not 0, meaning regular I/O
+	 *    existing in corresponding I/O barrier bucket.
+	 * C: while conf->barrier[idx] >= RESYNC_DEPTH, meaning reaches
+	 *    max resync count which allowed on current I/O barrier bucket.
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->array_frozen &&
-			    conf->barrier < RESYNC_DEPTH &&
-			    conf->current_window_requests == 0 &&
-			    (conf->start_next_window >=
-			     conf->next_resync + RESYNC_SECTORS),
+			     !conf->nr_pending[idx] &&
+			     conf->barrier[idx] < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
-static void lower_barrier(struct r1conf *conf)
+static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	unsigned long flags;
-	BUG_ON(conf->barrier <= 0);
+	int idx = sector_to_idx(sector_nr);
+
+	BUG_ON(conf->barrier[idx] <= 0);
+
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
-	conf->nr_pending--;
+	conf->barrier[idx]--;
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
-static bool need_to_wait_for_sync(struct r1conf *conf, struct bio *bio)
+static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	bool wait = false;
-
-	if (conf->array_frozen || !bio)
-		wait = true;
-	else if (conf->barrier && bio_data_dir(bio) == WRITE) {
-		if ((conf->mddev->curr_resync_completed
-		     >= bio_end_sector(bio)) ||
-		    (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-		     <= bio->bi_iter.bi_sector))
-			wait = false;
-		else
-			wait = true;
+	spin_lock_irq(&conf->resync_lock);
+	if (conf->array_frozen || conf->barrier[idx]) {
+		conf->nr_waiting[idx]++;
+		/* Wait for the barrier to drop. */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen && !conf->barrier[idx],
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	return wait;
+	conf->nr_pending[idx]++;
+	spin_unlock_irq(&conf->resync_lock);
 }
 
-static sector_t wait_barrier(struct r1conf *conf, struct bio *bio)
+static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	sector_t sector = 0;
+	int idx = sector_to_idx(sector_nr);
 
 	spin_lock_irq(&conf->resync_lock);
-	if (need_to_wait_for_sync(conf, bio)) {
-		conf->nr_waiting++;
-		/* Wait for the barrier to drop.
-		 * However if there are already pending
-		 * requests (preventing the barrier from
-		 * rising completely), and the
-		 * per-process bio queue isn't empty,
-		 * then don't wait, as we need to empty
-		 * that queue to allow conf->start_next_window
-		 * to increase.
-		 */
-		raid1_log(conf->mddev, "wait barrier");
-		wait_event_lock_irq(conf->wait_barrier,
-				    !conf->array_frozen &&
-				    (!conf->barrier ||
-				     ((conf->start_next_window <
-				       conf->next_resync + RESYNC_SECTORS) &&
-				      current->bio_list &&
-				      !bio_list_empty(current->bio_list))),
-				    conf->resync_lock);
-		conf->nr_waiting--;
-	}
-
-	if (bio && bio_data_dir(bio) == WRITE) {
-		if (bio->bi_iter.bi_sector >= conf->next_resync) {
-			if (conf->start_next_window == MaxSector)
-				conf->start_next_window =
-					conf->next_resync +
-					NEXT_NORMALIO_DISTANCE;
-
-			if ((conf->start_next_window + NEXT_NORMALIO_DISTANCE)
-			    <= bio->bi_iter.bi_sector)
-				conf->next_window_requests++;
-			else
-				conf->current_window_requests++;
-			sector = conf->start_next_window;
-		}
+	if (conf->array_frozen) {
+		conf->nr_waiting[idx]++;
+		/* Wait for array to unfreeze */
+		wait_event_lock_irq(
+			conf->wait_barrier,
+			!conf->array_frozen,
+			conf->resync_lock);
+		conf->nr_waiting[idx]--;
 	}
 
-	conf->nr_pending++;
+	conf->nr_pending[idx]++;
 	spin_unlock_irq(&conf->resync_lock);
-	return sector;
 }
 
-static void allow_barrier(struct r1conf *conf, sector_t start_next_window,
-			  sector_t bi_sector)
+static void wait_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = sector_to_idx(sector_nr);
+
+	_wait_barrier(conf, idx);
+}
+
+static void wait_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_wait_barrier(conf, idx);
+}
+
+static void _allow_barrier(struct r1conf *conf, int idx)
 {
 	unsigned long flags;
 
 	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending--;
-	if (start_next_window) {
-		if (start_next_window == conf->start_next_window) {
-			if (conf->start_next_window + NEXT_NORMALIO_DISTANCE
-			    <= bi_sector)
-				conf->next_window_requests--;
-			else
-				conf->current_window_requests--;
-		} else
-			conf->current_window_requests--;
-
-		if (!conf->current_window_requests) {
-			if (conf->next_window_requests) {
-				conf->current_window_requests =
-					conf->next_window_requests;
-				conf->next_window_requests = 0;
-				conf->start_next_window +=
-					NEXT_NORMALIO_DISTANCE;
-			} else
-				conf->start_next_window = MaxSector;
-		}
-	}
+	conf->nr_pending[idx]--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
+static void allow_barrier(struct r1conf *conf, sector_t sector_nr)
+{
+	int idx = sector_to_idx(sector_nr);
+
+	_allow_barrier(conf, idx);
+}
+
+static void allow_all_barriers(struct r1conf *conf)
+{
+	int idx;
+
+	for (idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		_allow_barrier(conf, idx);
+}
+
+/* conf->resync_lock should be held */
+static int get_unqueued_pending(struct r1conf *conf)
+{
+	int idx, ret;
+
+	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
+		ret += conf->nr_pending[idx] - conf->nr_queued[idx];
+
+	return ret;
+}
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
-	/* stop syncio and normal IO and wait for everything to
+	/* Stop sync I/O and normal I/O and wait for everything to
 	 * go quite.
-	 * We wait until nr_pending match nr_queued+extra
-	 * This is called in the context of one normal IO request
-	 * that has failed. Thus any sync request that might be pending
-	 * will be blocked by nr_pending, and we need to wait for
-	 * pending IO requests to complete or be queued for re-try.
-	 * Thus the number queued (nr_queued) plus this request (extra)
-	 * must match the number of pending IOs (nr_pending) before
-	 * we continue.
+	 * This is called in two situations:
+	 * 1) management command handlers (reshape, remove disk, quiesce).
+	 * 2) one normal I/O request failed.
+
+	 * After array_frozen is set to 1, new sync IO will be blocked at
+	 * raise_barrier(), and new normal I/O will blocked at _wait_barrier()
+	 * or wait_read_barrier(). The flying I/Os will either complete or be
+	 * queued. When everything goes quite, there are only queued I/Os left.
+
+	 * Every flying I/O contributes to a conf->nr_pending[idx], idx is the
+	 * barrier bucket index which this I/O request hits. When all sync and
+	 * normal I/O are queued, sum of all conf->nr_pending[] will match sum
+	 * of all conf->nr_queued[]. But normal I/O failure is an exception,
+	 * in handle_read_error(), we may call freeze_array() before trying to
+	 * fix the read error. In this case, the error read I/O is not queued,
+	 * so get_unqueued_pending() == 1.
+	 *
+	 * Therefore before this function returns, we need to wait until
+	 * get_unqueued_pendings(conf) gets equal to extra. For
+	 * normal I/O context, extra is 1, in rested situations extra is 0.
 	 */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 1;
 	raid1_log(conf->mddev, "wait freeze");
-	wait_event_lock_irq_cmd(conf->wait_barrier,
-				conf->nr_pending == conf->nr_queued+extra,
-				conf->resync_lock,
-				flush_pending_writes(conf));
+	wait_event_lock_irq_cmd(
+		conf->wait_barrier,
+		get_unqueued_pending(conf) == extra,
+		conf->resync_lock,
+		flush_pending_writes(conf));
 	spin_unlock_irq(&conf->resync_lock);
 }
 static void unfreeze_array(struct r1conf *conf)
@@ -1070,11 +1088,11 @@ static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	kfree(plug);
 }
 
-static void raid1_read_request(struct mddev *mddev, struct bio *bio,
-				 struct r1bio *r1_bio)
+static void raid1_read_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
 	struct raid1_info *mirror;
+	struct r1bio *r1_bio;
 	struct bio *read_bio;
 	struct bitmap *bitmap = mddev->bitmap;
 	const int op = bio_op(bio);
@@ -1083,7 +1101,34 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 	int max_sectors;
 	int rdisk;
 
-	wait_barrier(conf, bio);
+	/*
+	 * Still need barrier for READ in case that whole
+	 * array is frozen.
+	 */
+	wait_read_barrier(conf, bio->bi_iter.bi_sector);
+	bitmap = mddev->bitmap;
+
+	/*
+	 * make_request() can abort the operation when read-ahead is being
+	 * used and no empty request is available.
+	 *
+	 */
+	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio);
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector;
+	/*
+	 * We might need to issue multiple reads to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of reads in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
+	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
 
 read_again:
 	rdisk = read_balance(conf, r1_bio, &max_sectors);
@@ -1106,7 +1151,6 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 			   atomic_read(&bitmap->behind_writes) == 0);
 	}
 	r1_bio->read_disk = rdisk;
-	r1_bio->start_next_window = 0;
 
 	read_bio = bio_clone_mddev(bio, GFP_NOIO, mddev);
 	bio_trim(read_bio, r1_bio->sector - bio->bi_iter.bi_sector,
@@ -1163,10 +1207,10 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
 		generic_make_request(read_bio);
 }
 
-static void raid1_write_request(struct mddev *mddev, struct bio *bio,
-				struct r1bio *r1_bio)
+static void raid1_write_request(struct mddev *mddev, struct bio *bio)
 {
 	struct r1conf *conf = mddev->private;
+	struct r1bio *r1_bio;
 	int i, disks;
 	struct bitmap *bitmap = mddev->bitmap;
 	unsigned long flags;
@@ -1180,7 +1224,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
-	sector_t start_next_window;
 
 	/*
 	 * Register the new request and wait if the reconstruction
@@ -1216,7 +1259,29 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 		}
 		finish_wait(&conf->wait_barrier, &w);
 	}
-	start_next_window = wait_barrier(conf, bio);
+	wait_barrier(conf, bio->bi_iter.bi_sector);
+
+	/*
+	 * make_request() can abort the operation when read-ahead is being
+	 * used and no empty request is available.
+	 *
+	 */
+	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+	r1_bio->master_bio = bio;
+	r1_bio->sectors = bio_sectors(bio);
+	r1_bio->state = 0;
+	r1_bio->mddev = mddev;
+	r1_bio->sector = bio->bi_iter.bi_sector;
+	/*
+	 * We might need to issue multiple writes to different
+	 * devices if there are bad blocks around, so we keep
+	 * track of the number of writes in bio->bi_phys_segments.
+	 * If this is 0, there is only one r1_bio and no locking
+	 * will be needed when requests complete.  If it is
+	 * non-zero, then it is the number of not-completed requests.
+	 */
+	bio->bi_phys_segments = 0;
+	bio_clear_flag(bio, BIO_SEG_VALID);
 
 	if (conf->pending_count >= max_queued_requests) {
 		md_wakeup_thread(mddev->thread);
@@ -1237,7 +1302,6 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 
 	disks = conf->raid_disks * 2;
  retry_write:
-	r1_bio->start_next_window = start_next_window;
 	blocked_rdev = NULL;
 	rcu_read_lock();
 	max_sectors = r1_bio->sectors;
@@ -1304,25 +1368,15 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 	if (unlikely(blocked_rdev)) {
 		/* Wait for this device to become unblocked */
 		int j;
-		sector_t old = start_next_window;
 
 		for (j = 0; j < i; j++)
 			if (r1_bio->bios[j])
 				rdev_dec_pending(conf->mirrors[j].rdev, mddev);
 		r1_bio->state = 0;
-		allow_barrier(conf, start_next_window, bio->bi_iter.bi_sector);
+		allow_barrier(conf, bio->bi_iter.bi_sector);
 		raid1_log(mddev, "wait rdev %d blocked", blocked_rdev->raid_disk);
 		md_wait_for_blocked_rdev(blocked_rdev, mddev);
-		start_next_window = wait_barrier(conf, bio);
-		/*
-		 * We must make sure the multi r1bios of bio have
-		 * the same value of bi_phys_segments
-		 */
-		if (bio->bi_phys_segments && old &&
-		    old != start_next_window)
-			/* Wait for the former r1bio(s) to complete */
-			wait_event(conf->wait_barrier,
-				   bio->bi_phys_segments == 1);
+		wait_barrier(conf, bio->bi_iter.bi_sector);
 		goto retry_write;
 	}
 
@@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
 
 static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 {
-	struct r1conf *conf = mddev->private;
-	struct r1bio *r1_bio;
+	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
+	struct bio *split;
+	sector_t sectors;
 
-	/*
-	 * make_request() can abort the operation when read-ahead is being
-	 * used and no empty request is available.
-	 *
-	 */
-	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+	make_request_fn = (bio_data_dir(bio) == READ) ?
+			  raid1_read_request : raid1_write_request;
 
-	r1_bio->master_bio = bio;
-	r1_bio->sectors = bio_sectors(bio);
-	r1_bio->state = 0;
-	r1_bio->mddev = mddev;
-	r1_bio->sector = bio->bi_iter.bi_sector;
-
-	/*
-	 * We might need to issue multiple reads to different devices if there
-	 * are bad blocks around, so we keep track of the number of reads in
-	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
-	 * no locking will be needed when requests complete.  If it is
-	 * non-zero, then it is the number of not-completed requests.
-	 */
-	bio->bi_phys_segments = 0;
-	bio_clear_flag(bio, BIO_SEG_VALID);
+	/* if bio exceeds barrier unit boundary, split it */
+	do {
+		sectors = align_to_barrier_unit_end(
+				bio->bi_iter.bi_sector, bio_sectors(bio));
+		if (sectors < bio_sectors(bio)) {
+			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+			bio_chain(split, bio);
+		} else {
+			split = bio;
+		}
 
-	if (bio_data_dir(bio) == READ)
-		raid1_read_request(mddev, bio, r1_bio);
-	else
-		raid1_write_request(mddev, bio, r1_bio);
+		make_request_fn(mddev, split);
+	} while (split != bio);
 }
 
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
@@ -1567,19 +1611,11 @@ static void print_conf(struct r1conf *conf)
 
 static void close_sync(struct r1conf *conf)
 {
-	wait_barrier(conf, NULL);
-	allow_barrier(conf, 0, 0);
+	wait_all_barriers(conf);
+	allow_all_barriers(conf);
 
 	mempool_destroy(conf->r1buf_pool);
 	conf->r1buf_pool = NULL;
-
-	spin_lock_irq(&conf->resync_lock);
-	conf->next_resync = MaxSector - 2 * NEXT_NORMALIO_DISTANCE;
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests +=
-		conf->next_window_requests;
-	conf->next_window_requests = 0;
-	spin_unlock_irq(&conf->resync_lock);
 }
 
 static int raid1_spare_active(struct mddev *mddev)
@@ -2326,8 +2362,9 @@ static void handle_sync_write_finished(struct r1conf *conf, struct r1bio *r1_bio
 
 static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 {
-	int m;
+	int m, idx;
 	bool fail = false;
+
 	for (m = 0; m < conf->raid_disks * 2 ; m++)
 		if (r1_bio->bios[m] == IO_MADE_GOOD) {
 			struct md_rdev *rdev = conf->mirrors[m].rdev;
@@ -2353,7 +2390,8 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 	if (fail) {
 		spin_lock_irq(&conf->device_lock);
 		list_add(&r1_bio->retry_list, &conf->bio_end_io_list);
-		conf->nr_queued++;
+		idx = sector_to_idx(r1_bio->sector);
+		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -2475,6 +2513,7 @@ static void raid1d(struct md_thread *thread)
 	struct r1conf *conf = mddev->private;
 	struct list_head *head = &conf->retry_list;
 	struct blk_plug plug;
+	int idx;
 
 	md_check_recovery(mddev);
 
@@ -2482,17 +2521,17 @@ static void raid1d(struct md_thread *thread)
 	    !test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
 		LIST_HEAD(tmp);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags)) {
-			while (!list_empty(&conf->bio_end_io_list)) {
-				list_move(conf->bio_end_io_list.prev, &tmp);
-				conf->nr_queued--;
-			}
-		}
+		if (!test_bit(MD_SB_CHANGE_PENDING, &mddev->sb_flags))
+			list_splice_init(&conf->bio_end_io_list, &tmp);
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 		while (!list_empty(&tmp)) {
 			r1_bio = list_first_entry(&tmp, struct r1bio,
 						  retry_list);
 			list_del(&r1_bio->retry_list);
+			idx = sector_to_idx(r1_bio->sector);
+			spin_lock_irqsave(&conf->device_lock, flags);
+			conf->nr_queued[idx]--;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2513,7 +2552,8 @@ static void raid1d(struct md_thread *thread)
 		}
 		r1_bio = list_entry(head->prev, struct r1bio, retry_list);
 		list_del(head->prev);
-		conf->nr_queued--;
+		idx = sector_to_idx(r1_bio->sector);
+		conf->nr_queued[idx]--;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
 		mddev = r1_bio->mddev;
@@ -2552,7 +2592,6 @@ static int init_resync(struct r1conf *conf)
 					  conf->poolinfo);
 	if (!conf->r1buf_pool)
 		return -ENOMEM;
-	conf->next_resync = 0;
 	return 0;
 }
 
@@ -2581,6 +2620,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	int still_degraded = 0;
 	int good_sectors = RESYNC_SECTORS;
 	int min_bad = 0; /* number of sectors that are bad in all devices */
+	int idx = sector_to_idx(sector_nr);
 
 	if (!conf->r1buf_pool)
 		if (init_resync(conf))
@@ -2630,7 +2670,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting)
+	if (conf->nr_waiting[idx])
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2657,6 +2697,8 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	r1_bio->sector = sector_nr;
 	r1_bio->state = 0;
 	set_bit(R1BIO_IsSync, &r1_bio->state);
+	/* make sure good_sectors won't go across barrier unit boundary */
+	good_sectors = align_to_barrier_unit_end(sector_nr, good_sectors);
 
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 		struct md_rdev *rdev;
@@ -2887,6 +2929,26 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	if (!conf)
 		goto abort;
 
+	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
+				   sizeof(int), GFP_KERNEL);
+	if (!conf->nr_pending)
+		goto abort;
+
+	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR,
+				   sizeof(int), GFP_KERNEL);
+	if (!conf->nr_waiting)
+		goto abort;
+
+	conf->nr_queued = kcalloc(BARRIER_BUCKETS_NR,
+				  sizeof(int), GFP_KERNEL);
+	if (!conf->nr_queued)
+		goto abort;
+
+	conf->barrier = kcalloc(BARRIER_BUCKETS_NR,
+				sizeof(int), GFP_KERNEL);
+	if (!conf->barrier)
+		goto abort;
+
 	conf->mirrors = kzalloc(sizeof(struct raid1_info)
 				* mddev->raid_disks * 2,
 				 GFP_KERNEL);
@@ -2942,9 +3004,6 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
-	conf->start_next_window = MaxSector;
-	conf->current_window_requests = conf->next_window_requests = 0;
-
 	err = -EIO;
 	for (i = 0; i < conf->raid_disks * 2; i++) {
 
@@ -2987,6 +3046,10 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		kfree(conf->mirrors);
 		safe_put_page(conf->tmppage);
 		kfree(conf->poolinfo);
+		kfree(conf->nr_pending);
+		kfree(conf->nr_waiting);
+		kfree(conf->nr_queued);
+		kfree(conf->barrier);
 		kfree(conf);
 	}
 	return ERR_PTR(err);
@@ -3088,6 +3151,10 @@ static void raid1_free(struct mddev *mddev, void *priv)
 	kfree(conf->mirrors);
 	safe_put_page(conf->tmppage);
 	kfree(conf->poolinfo);
+	kfree(conf->nr_pending);
+	kfree(conf->nr_waiting);
+	kfree(conf->nr_queued);
+	kfree(conf->barrier);
 	kfree(conf);
 }
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index c52ef42..d3faf30 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -1,6 +1,14 @@
 #ifndef _RAID1_H
 #define _RAID1_H
 
+/* each barrier unit size is 64MB fow now
+ * note: it must be larger than RESYNC_DEPTH
+ */
+#define BARRIER_UNIT_SECTOR_BITS	17
+#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
+#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - 2)
+#define BARRIER_BUCKETS_NR		(1<<BARRIER_BUCKETS_NR_BITS)
+
 struct raid1_info {
 	struct md_rdev	*rdev;
 	sector_t	head_position;
@@ -35,25 +43,6 @@ struct r1conf {
 						 */
 	int			raid_disks;
 
-	/* During resync, read_balancing is only allowed on the part
-	 * of the array that has been resynced.  'next_resync' tells us
-	 * where that is.
-	 */
-	sector_t		next_resync;
-
-	/* When raid1 starts resync, we divide array into four partitions
-	 * |---------|--------------|---------------------|-------------|
-	 *        next_resync   start_next_window       end_window
-	 * start_next_window = next_resync + NEXT_NORMALIO_DISTANCE
-	 * end_window = start_next_window + NEXT_NORMALIO_DISTANCE
-	 * current_window_requests means the count of normalIO between
-	 *   start_next_window and end_window.
-	 * next_window_requests means the count of normalIO after end_window.
-	 * */
-	sector_t		start_next_window;
-	int			current_window_requests;
-	int			next_window_requests;
-
 	spinlock_t		device_lock;
 
 	/* list of 'struct r1bio' that need to be processed by raid1d,
@@ -79,10 +68,10 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			nr_pending;
-	int			nr_waiting;
-	int			nr_queued;
-	int			barrier;
+	int			*nr_pending;
+	int			*nr_waiting;
+	int			*nr_queued;
+	int			*barrier;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
@@ -135,7 +124,6 @@ struct r1bio {
 						 * in this BehindIO request
 						 */
 	sector_t		sector;
-	sector_t		start_next_window;
 	int			sectors;
 	unsigned long		state;
 	struct mddev		*mddev;
@@ -185,4 +173,10 @@ enum r1bio_state {
 	R1BIO_WriteError,
 	R1BIO_FailFast,
 };
+
+static inline int sector_to_idx(sector_t sector)
+{
+	return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
+			 BARRIER_BUCKETS_NR_BITS);
+}
 #endif
-- 
2.6.6



* [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2017-02-15 16:35 [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window colyli
@ 2017-02-15 16:35 ` colyli
  2017-02-15 17:15   ` Coly Li
                     ` (2 more replies)
  2017-02-16  2:22 ` [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
  2017-02-16  7:04 ` NeilBrown
  2 siblings, 3 replies; 43+ messages in thread
From: colyli @ 2017-02-15 16:35 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

When I ran a parallel read performance test on a md raid1 device built
from two NVMe SSDs, I was surprised to observe very poor throughput: with
fio at 64KB block size, 40 sequential read I/O jobs and 128 iodepth, the
overall throughput was only 2.7GB/s, which is around 50% of the ideal
performance number.

perf reports that lock contention happens in the allow_barrier() and
wait_barrier() code,
 - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
   - _raw_spin_lock_irqsave
         + 89.92% allow_barrier
         + 9.34% __wake_up
 - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
   - _raw_spin_lock_irq
         - 100.00% wait_barrier

The reason is that these I/O barrier related functions,
 - raise_barrier()
 - lower_barrier()
 - wait_barrier()
 - allow_barrier()
always hold conf->resync_lock first, even when there is only regular read
I/O and no resync I/O at all. This is a huge performance penalty.

The solution is a lockless-like algorithm in the I/O barrier code, which
only holds conf->resync_lock when it has to.

The original idea is from Hannes Reinecke, and Neil Brown provided
comments to improve it. I continued to work on it and brought the patch
into its current form.

In the new, simpler raid1 I/O barrier implementation, there are two
wait-barrier functions,
 - wait_barrier()
   Which calls _wait_barrier() and is used for regular write I/O. If there
   is resync I/O happening on the same I/O barrier bucket, or the whole
   array is frozen, the task will wait until no barrier is raised on the
   same barrier bucket, or the whole array is unfrozen.
 - wait_read_barrier()
   Since regular read I/O won't interfere with resync I/O (read_balance()
   will make sure only up-to-date data is read out), it is unnecessary to
   wait for a barrier in regular read I/O; waiting is only necessary when
   the whole array is frozen.

The operations on conf->nr_pending[idx], conf->nr_waiting[idx] and
conf->barrier[idx] are very carefully designed in raise_barrier(),
lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
avoid unnecessary spin locks in these functions. Once conf->nr_pending[idx]
is increased, a resync I/O with the same barrier bucket index has to wait
in raise_barrier(). Then in _wait_barrier(), if no barrier is raised on
the same barrier bucket index and the array is not frozen, the regular I/O
doesn't need to hold conf->resync_lock; it can just increase
conf->nr_pending[idx] and return to its caller. wait_read_barrier() is
very similar to _wait_barrier(); the only difference is that it only waits
when the array is frozen. For heavy parallel read I/O, the lockless I/O
barrier code gets rid of almost all spin lock cost.
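For illustration only, here is a tiny userspace model (C11 atomics, not
the kernel code in the hunks below) of the inc-then-check pairing
described above; the names mirror the r1conf fields, everything else is an
assumption of the sketch:
        #include <stdatomic.h>
        #include <stdio.h>

        static atomic_int nr_pending;   /* models conf->nr_pending[idx] */
        static atomic_int barrier;      /* models conf->barrier[idx] */

        /* modeled on the _wait_barrier() fast path */
        static int write_fast_path(void)
        {
                atomic_fetch_add(&nr_pending, 1);  /* inc, then check */
                if (atomic_load(&barrier) == 0)
                        return 1;  /* no resync in this bucket, go on */
                atomic_fetch_sub(&nr_pending, 1);  /* take the slow path */
                return 0;
        }

        /* modeled on raise_barrier() */
        static int resync_may_start(void)
        {
                atomic_fetch_add(&barrier, 1);     /* inc, then check */
                if (atomic_load(&nr_pending) == 0)
                        return 1;  /* no regular I/O in this bucket */
                return 0;          /* must wait for pending I/O */
        }

        int main(void)
        {
                /* with seq-cst atomics at least one side sees the other */
                printf("resync_may_start: %d\n", resync_may_start());
                printf("write_fast_path: %d\n", write_fast_path());
                return 0;
        }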

This patch significantly improves raid1 read performance. In my testing,
on a raid1 device built from two NVMe SSDs, running fio with 64KB block
size, 40 sequential read I/O jobs and 128 iodepth, the overall throughput
increases from 2.7GB/s to 4.6GB/s (+70%).
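For reference, the workload above roughly corresponds to an fio invocation
like the following (the md device path and runtime are assumptions of this
sketch, not taken from the original test):
        fio --name=seqread --filename=/dev/md0 --direct=1 \
            --ioengine=libaio --rw=read --bs=64k --numjobs=40 \
            --iodepth=128 --runtime=60 --time_based --group_reporting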

Changelog
V3:
- Add smp_mb__after_atomic() as Shaohua and Neil suggested.
- Change conf->nr_queued[] from atomic_t to int.
- Change conf->array_frozen from atomic_t back to int, and use
  READ_ONCE(conf->array_frozen) to check the value of conf->array_frozen
  in _wait_barrier() and wait_read_barrier().
- In _wait_barrier() and wait_read_barrier(), add a call to
  wake_up(&conf->wait_barrier) after atomic_dec(&conf->nr_pending[idx]),
  to fix a deadlock between _wait_barrier()/wait_read_barrier() and
  freeze_array().
V2:
- Remove a spin_lock/unlock pair in raid1d().
- Add more code comments to explain why there is no race when checking two
  atomic_t variables at the same time.
V1:
- Original RFC patch for comments.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Shaohua Li <shli@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: Guoqing Jiang <gqjiang@suse.com>
---
 drivers/md/raid1.c | 157 +++++++++++++++++++++++++++++++++++++----------------
 drivers/md/raid1.h |   6 +-
 2 files changed, 114 insertions(+), 49 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 4234494..2ac2650 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -836,11 +836,21 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	spin_lock_irq(&conf->resync_lock);
 
 	/* Wait until no block IO is waiting */
-	wait_event_lock_irq(conf->wait_barrier, !conf->nr_waiting[idx],
+	wait_event_lock_irq(conf->wait_barrier,
+			    !atomic_read(&conf->nr_waiting[idx]),
 			    conf->resync_lock);
 
 	/* block any new IO from starting */
-	conf->barrier[idx]++;
+	atomic_inc(&conf->barrier[idx]);
+	/*
+	 * In raise_barrier() we firstly increase conf->barrier[idx] then
+	 * check conf->nr_pending[idx]. In _wait_barrier() we firstly
+	 * increase conf->nr_pending[idx] then check conf->barrier[idx].
+	 * A memory barrier here to make sure conf->nr_pending[idx] won't
+	 * be fetched before conf->barrier[idx] is increased. Otherwise
+	 * there will be a race between raise_barrier() and _wait_barrier().
+	 */
+	smp_mb__after_atomic();
 
 	/* For these conditions we must wait:
 	 * A: while the array is in frozen state
@@ -851,42 +861,81 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 	 */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->array_frozen &&
-			     !conf->nr_pending[idx] &&
-			     conf->barrier[idx] < RESYNC_DEPTH,
+			     !atomic_read(&conf->nr_pending[idx]) &&
+			     atomic_read(&conf->barrier[idx]) < RESYNC_DEPTH,
 			    conf->resync_lock);
 
-	conf->nr_pending[idx]++;
+	atomic_inc(&conf->nr_pending[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
 static void lower_barrier(struct r1conf *conf, sector_t sector_nr)
 {
-	unsigned long flags;
 	int idx = sector_to_idx(sector_nr);
 
-	BUG_ON(conf->barrier[idx] <= 0);
+	BUG_ON(atomic_read(&conf->barrier[idx]) <= 0);
 
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier[idx]--;
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->barrier[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
 static void _wait_barrier(struct r1conf *conf, int idx)
 {
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen || conf->barrier[idx]) {
-		conf->nr_waiting[idx]++;
-		/* Wait for the barrier to drop. */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen && !conf->barrier[idx],
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/*
+	 * We need to increase conf->nr_pending[idx] very early here,
+	 * then raise_barrier() can be blocked when it waits for
+	 * conf->nr_pending[idx] to be 0. Then we can avoid holding
+	 * conf->resync_lock when there is no barrier raised in same
+	 * barrier unit bucket. Also if the array is frozen, I/O
+	 * should be blocked until array is unfrozen.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
+	/*
+	 * In _wait_barrier() we firstly increase conf->nr_pending[idx], then
+	 * check conf->barrier[idx]. In raise_barrier() we firstly increase
+	 * conf->barrier[idx], then check conf->nr_pending[idx]. A memory
+	 * barrier is necessary here to make sure conf->barrier[idx] won't be
+	 * fetched before conf->nr_pending[idx] is increased. Otherwise there
+	 * will be a race between _wait_barrier() and raise_barrier().
+	 */
+	smp_mb__after_atomic();
+
+	/*
+	 * Don't worry about checking two atomic_t variables at same time
+	 * here. If during we check conf->barrier[idx], the array is
+	 * frozen (conf->array_frozen is 1), and chonf->barrier[idx] is
+	 * 0, it is safe to return and make the I/O continue. Because the
+	 * array is frozen, all I/O returned here will eventually complete
+	 * or be queued, no race will happen. See code comment in
+	 * frozen_array().
+	 */
+	if (!READ_ONCE(conf->array_frozen) &&
+	    !atomic_read(&conf->barrier[idx]))
+		return;
 
-	conf->nr_pending[idx]++;
+	/*
+	 * After holding conf->resync_lock, conf->nr_pending[idx]
+	 * should be decreased before waiting for barrier to drop.
+	 * Otherwise, we may encounter a race condition because
+	 * raise_barrer() might be waiting for conf->nr_pending[idx]
+	 * to be 0 at same time.
+	 */
+	spin_lock_irq(&conf->resync_lock);
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/*
+	 * In case freeze_array() is waiting for
+	 * get_unqueued_pending() == extra
+	 */
+	wake_up(&conf->wait_barrier);
+	/* Wait for the barrier in same barrier unit bucket to drop. */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !conf->array_frozen &&
+			     !atomic_read(&conf->barrier[idx]),
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -894,18 +943,32 @@ static void wait_read_barrier(struct r1conf *conf, sector_t sector_nr)
 {
 	int idx = sector_to_idx(sector_nr);
 
-	spin_lock_irq(&conf->resync_lock);
-	if (conf->array_frozen) {
-		conf->nr_waiting[idx]++;
-		/* Wait for array to unfreeze */
-		wait_event_lock_irq(
-			conf->wait_barrier,
-			!conf->array_frozen,
-			conf->resync_lock);
-		conf->nr_waiting[idx]--;
-	}
+	/*
+	 * Very similar to _wait_barrier(). The difference is, for read
+	 * I/O we don't need wait for sync I/O, but if the whole array
+	 * is frozen, the read I/O still has to wait until the array is
+	 * unfrozen. Since there is no ordering requirement with
+	 * conf->barrier[idx] here, memory barrier is unnecessary as well.
+	 */
+	atomic_inc(&conf->nr_pending[idx]);
 
-	conf->nr_pending[idx]++;
+	if (!READ_ONCE(conf->array_frozen))
+		return;
+
+	spin_lock_irq(&conf->resync_lock);
+	atomic_inc(&conf->nr_waiting[idx]);
+	atomic_dec(&conf->nr_pending[idx]);
+	/*
+	 * In case freeze_array() is waiting for
+	 * get_unqueued_pending() == extra
+	 */
+	wake_up(&conf->wait_barrier);
+	/* Wait for array to be unfrozen */
+	wait_event_lock_irq(conf->wait_barrier,
+			    !conf->array_frozen,
+			    conf->resync_lock);
+	atomic_inc(&conf->nr_pending[idx]);
+	atomic_dec(&conf->nr_waiting[idx]);
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -926,11 +989,7 @@ static void wait_all_barriers(struct r1conf *conf)
 
 static void _allow_barrier(struct r1conf *conf, int idx)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->nr_pending[idx]--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+	atomic_dec(&conf->nr_pending[idx]);
 	wake_up(&conf->wait_barrier);
 }
 
@@ -955,11 +1014,14 @@ static int get_unqueued_pending(struct r1conf *conf)
 	int idx, ret;
 
 	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
-		ret += conf->nr_pending[idx] - conf->nr_queued[idx];
+		ret += atomic_read(&conf->nr_pending[idx]) -
+			conf->nr_queued[idx];
 
 	return ret;
 }
 
+#define FREEZE_TIMEOUT_JIFFIES 10
+
 static void freeze_array(struct r1conf *conf, int extra)
 {
 	/* Stop sync I/O and normal I/O and wait for everything to
@@ -1000,8 +1062,8 @@ static void unfreeze_array(struct r1conf *conf)
 	/* reverse the effect of the freeze */
 	spin_lock_irq(&conf->resync_lock);
 	conf->array_frozen = 0;
-	wake_up(&conf->wait_barrier);
 	spin_unlock_irq(&conf->resync_lock);
+	wake_up(&conf->wait_barrier);
 }
 
 /* duplicate the data pages for behind I/O
@@ -2393,6 +2455,11 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
 		idx = sector_to_idx(r1_bio->sector);
 		conf->nr_queued[idx]++;
 		spin_unlock_irq(&conf->device_lock);
+		/*
+		 * In case freeze_array() is waiting for condition
+		 * get_unqueued_pending() == extra to be true.
+		 */
+		wake_up(&conf->wait_barrier);
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
 		if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2529,9 +2596,7 @@ static void raid1d(struct md_thread *thread)
 						  retry_list);
 			list_del(&r1_bio->retry_list);
 			idx = sector_to_idx(r1_bio->sector);
-			spin_lock_irqsave(&conf->device_lock, flags);
 			conf->nr_queued[idx]--;
-			spin_unlock_irqrestore(&conf->device_lock, flags);
 			if (mddev->degraded)
 				set_bit(R1BIO_Degraded, &r1_bio->state);
 			if (test_bit(R1BIO_WriteError, &r1_bio->state))
@@ -2670,7 +2735,7 @@ static sector_t raid1_sync_request(struct mddev *mddev, sector_t sector_nr,
 	 * If there is non-resync activity waiting for a turn, then let it
 	 * though before starting on this new sync request.
 	 */
-	if (conf->nr_waiting[idx])
+	if (atomic_read(&conf->nr_waiting[idx]))
 		schedule_timeout_uninterruptible(1);
 
 	/* we are incrementing sector_nr below. To be safe, we check against
@@ -2930,12 +2995,12 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		goto abort;
 
 	conf->nr_pending = kcalloc(BARRIER_BUCKETS_NR,
-				   sizeof(int), GFP_KERNEL);
+				   sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->nr_pending)
 		goto abort;
 
 	conf->nr_waiting = kcalloc(BARRIER_BUCKETS_NR,
-				   sizeof(int), GFP_KERNEL);
+				   sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->nr_waiting)
 		goto abort;
 
@@ -2945,7 +3010,7 @@ static struct r1conf *setup_conf(struct mddev *mddev)
 		goto abort;
 
 	conf->barrier = kcalloc(BARRIER_BUCKETS_NR,
-				sizeof(int), GFP_KERNEL);
+				sizeof(atomic_t), GFP_KERNEL);
 	if (!conf->barrier)
 		goto abort;
 
diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
index d3faf30..1d9d279 100644
--- a/drivers/md/raid1.h
+++ b/drivers/md/raid1.h
@@ -68,10 +68,10 @@ struct r1conf {
 	 */
 	wait_queue_head_t	wait_barrier;
 	spinlock_t		resync_lock;
-	int			*nr_pending;
-	int			*nr_waiting;
+	atomic_t		*nr_pending;
+	atomic_t		*nr_waiting;
 	int			*nr_queued;
-	int			*barrier;
+	atomic_t		*barrier;
 	int			array_frozen;
 
 	/* Set to 1 if a full sync is needed, (fresh device added).
-- 
2.6.6



* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
@ 2017-02-15 17:15   ` Coly Li
  2017-02-16  2:25   ` Shaohua Li
  2017-02-16  7:04   ` NeilBrown
  2 siblings, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-15 17:15 UTC (permalink / raw)
  To: linux-raid
  Cc: Shaohua Li, Hannes Reinecke, Neil Brown, Johannes Thumshirn,
	Guoqing Jiang

On 2017/2/16 12:35 AM, colyli@suse.de wrote:
[snip]
>  
> @@ -955,11 +1014,14 @@ static int get_unqueued_pending(struct r1conf *conf)
>  	int idx, ret;
>  
>  	for (ret = 0, idx = 0; idx < BARRIER_BUCKETS_NR; idx++)
> -		ret += conf->nr_pending[idx] - conf->nr_queued[idx];
> +		ret += atomic_read(&conf->nr_pending[idx]) -
> +			conf->nr_queued[idx];
>  
>  	return ret;
>  }
>  
> +#define FREEZE_TIMEOUT_JIFFIES 10
> +

The above line will be removed in the next version; please ignore it.

[snip]

Coly


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-15 16:35 [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window colyli
  2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
@ 2017-02-16  2:22 ` Shaohua Li
  2017-02-16 17:05   ` Coly Li
  2017-02-16  7:04 ` NeilBrown
  2 siblings, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-16  2:22 UTC (permalink / raw)
  To: colyli
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On Thu, Feb 16, 2017 at 12:35:22AM +0800, colyli@suse.de wrote:
> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> I/O barriers to happen only inside a slidingresync window, for regular
> I/Os out of this resync window they don't need to wait for barrier any
> more. On large raid1 device, it helps a lot to improve parallel writing
> I/O throughput when there are background resync I/Os performing at
> same time.
> 
> The idea of sliding resync widow is awesome, but code complexity is a
> challenge. Sliding resync window requires several veriables to work
> collectively, this is complexed and very hard to make it work correctly.
> Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
> to fix the original resync window patch. This is not the end, any further
> related modification may easily introduce more regreassion.
> 
> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> removing resync window code, I believe life will be much easier.
> 
> The brief idea of the simpler barrier is,
>  - Do not maintain a logbal unique resync window
>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>    I/O only has to wait for a resync I/O when both them have same barrier
>    bucket index, vice versa.
>  - I/O barrier can be recuded to an acceptable number if there are enought
>    barrier buckets
> 
> Here I explain how the barrier buckets are designed,
>  - BARRIER_UNIT_SECTOR_SIZE
>    The whole LBA address space of a raid1 device is divided into multiple
>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>    Bio request won't go across border of barrier unit size, that means
>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>    For random I/O 64MB is large enough for both read and write requests,
>    for sequential I/O considering underlying block layer may merge them
>    into larger requests, 64MB is still good enough.
>    Neil also points out that for resync operation, "we want the resync to
>    move from region to region fairly quickly so that the slowness caused
>    by having to synchronize with the resync is averaged out over a fairly
>    small time frame". For full speed resync, 64MB should take less then 1
>    second. When resync is competing with other I/O, it could take up a few
>    minutes. Therefore 64MB size is fairly good range for resync.
> 
>  - BARRIER_BUCKETS_NR
>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>    this patch makes the bellowed members of struct r1conf from integer
>    to array of integers,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     *nr_pending;
>         +       int                     *nr_waiting;
>         +       int                     *nr_queued;
>         +       int                     *barrier;
>    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
>    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
>    barrier buckets, and each array of integers occupies single memory page.
>    1024 means for a request which is smaller than the I/O barrier unit size
>    has ~0.1% chance to wait for resync to pause, which is quite a small
>    enough fraction. Also requesting single memory page is more friendly to
>    kernel page allocator than larger memory size.
> 
>  - I/O barrier bucket is indexed by bio start sector
>    If multiple I/O requests hit different I/O barrier units, they only need
>    to compete I/O barrier with other I/Os which hit the same I/O barrier
>    bucket index with each other. The index of a barrier bucket which a
>    bio should look for is calculated by sector_to_idx() which is defined
>    in raid1.h as an inline function,
>         static inline int sector_to_idx(sector_t sector)
>         {
>                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>                                 BARRIER_BUCKETS_NR_BITS);
>         }
>    Here sector_nr is the start sector number of a bio.
> 
>  - Single bio won't go across boundary of a I/O barrier unit
>    If a request goes across boundary of barrier unit, it will be split. A
>    bio may be split in raid1_make_request() or raid1_sync_request(), if
>    sectors returned by align_to_barrier_unit_end() is small than original
>    bio size.
> 
> Comparing to single sliding resync window,
>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>    will have confliction within a single barrier units. So the I/O
>    behavior is similar to single sliding resync window.
>  - But a barrier unit bucket is shared by all barrier units with identical
>    barrier uinit index, the probability of confliction might be higher
>    than single sliding resync window, in condition that writing I/Os
>    always hit barrier units which have identical barrier bucket indexs with
>    the resync I/Os. This is a very rare condition in real I/O work loads,
>    I cannot imagine how it could happen in practice.
>  - Therefore we can achieve a good enough low confliction rate with much
>    simpler barrier algorithm and implementation.
> 
> There are two changes should be noticed,
>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>    single loop, it looks like this,
>         spin_lock_irqsave(&conf->device_lock, flags);
>         conf->nr_queued[idx]--;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>    This change generates more spin lock operations, but in next patch of
>    this patch set, it will be replaced by a single line code,
>         atomic_dec(&conf->nr_queueud[idx]);
>    So we don't need to worry about spin lock cost here.
>  - Mainline raid1 code split original raid1_make_request() into
>    raid1_read_request() and raid1_write_request(). If the original bio
>    goes across an I/O barrier unit size, this bio will be split before
>    calling raid1_read_request() or raid1_write_request(),  this change
>    the code logic more simple and clear.
>  - In this patch wait_barrier() is moved from raid1_make_request() to
>    raid1_write_request(). In raid_read_request(), original wait_barrier()
>    is replaced by raid1_read_request().
>    The differnece is wait_read_barrier() only waits if array is frozen,
>    using different barrier function in different code path makes the code
>    more clean and easy to read.
> Changelog
> V3:
> - Rebase the patch against latest upstream kernel code.
> - Many fixes by review comments from Neil,
>   - Back to use pointers to replace arraries in struct r1conf
>   - Remove total_barriers from struct r1conf
>   - Add more patch comments to explain how/why the values of
>     BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
>   - Use get_unqueued_pending() to replace get_all_pendings() and
>     get_all_queued()
>   - Increase bucket number from 512 to 1024
> - Change code comments format by review from Shaohua.
> V2:
> - Use bio_split() to split the orignal bio if it goes across barrier unit
>   bounday, to make the code more simple, by suggestion from Shaohua and
>   Neil.
> - Use hash_long() to replace original linear hash, to avoid a possible
>   confilict between resync I/O and sequential write I/O, by suggestion from
>   Shaohua.
> - Add conf->total_barriers to record barrier depth, which is used to
>   control number of parallel sync I/O barriers, by suggestion from Shaohua.
> - In V1 patch the bellowed barrier buckets related members in r1conf are
>   allocated in memory page. To make the code more simple, V2 patch moves
>   the memory space into struct r1conf, like this,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>         +       int                     barrier[BARRIER_BUCKETS_NR];
>   This change is by the suggestion from Shaohua.
> - Remove some inrelavent code comments, by suggestion from Guoqing.
> - Add a missing wait_barrier() before jumping to retry_write, in
>   raid1_make_write_request().
> V1:
> - Original RFC patch for comments

Looks good, two minor issues.

>  
> -static void raid1_read_request(struct mddev *mddev, struct bio *bio,
> -				 struct r1bio *r1_bio)
> +static void raid1_read_request(struct mddev *mddev, struct bio *bio)
>  {
>  	struct r1conf *conf = mddev->private;
>  	struct raid1_info *mirror;
> +	struct r1bio *r1_bio;
>  	struct bio *read_bio;
>  	struct bitmap *bitmap = mddev->bitmap;
>  	const int op = bio_op(bio);
> @@ -1083,7 +1101,34 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>  	int max_sectors;
>  	int rdisk;
>  
> -	wait_barrier(conf, bio);
> +	/*
> +	 * Still need barrier for READ in case that whole
> +	 * array is frozen.
> +	 */
> +	wait_read_barrier(conf, bio->bi_iter.bi_sector);
> +	bitmap = mddev->bitmap;
> +
> +	/*
> +	 * make_request() can abort the operation when read-ahead is being
> +	 * used and no empty request is available.
> +	 *
> +	 */
> +	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> +	r1_bio->master_bio = bio;
> +	r1_bio->sectors = bio_sectors(bio);
> +	r1_bio->state = 0;
> +	r1_bio->mddev = mddev;
> +	r1_bio->sector = bio->bi_iter.bi_sector;

This part looks unnecessarily complicated. If you rename raid1_make_request to
something like __raid1_make_request, add a new raid1_make_request and do the bio
split there, then call __raid1_make_request for each split bio, you don't need to
duplicate the r1_bio allocation parts for read/write.

> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
> index c52ef42..d3faf30 100644
> --- a/drivers/md/raid1.h
> +++ b/drivers/md/raid1.h
> @@ -1,6 +1,14 @@
>  #ifndef _RAID1_H
>  #define _RAID1_H
>  
> +/* each barrier unit size is 64MB fow now
> + * note: it must be larger than RESYNC_DEPTH
> + */
> +#define BARRIER_UNIT_SECTOR_BITS	17
> +#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
> +#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - 2)

Maybe write this as (PAGE_SHIFT - ilog2(sizeof(int)))? To be honest, I don't
think it really matters whether the array is exactly PAGE_SIZE long; maybe just
specify a constant here.
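
For reference, a minimal sketch of the macro spelling I mean (ilog2() comes from
<linux/log2.h>; this just sizes each per-bucket array to exactly one page,
whatever sizeof(int) is):

	/* size each per-bucket counter array to one page, independent of sizeof(int) */
	#define BARRIER_BUCKETS_NR_BITS	(PAGE_SHIFT - ilog2(sizeof(int)))
	#define BARRIER_BUCKETS_NR	(1 << BARRIER_BUCKETS_NR_BITS)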

Thanks,
Shaohua


* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
  2017-02-15 17:15   ` Coly Li
@ 2017-02-16  2:25   ` Shaohua Li
  2017-02-17 18:42     ` Coly Li
  2017-02-16  7:04   ` NeilBrown
  2 siblings, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-16  2:25 UTC (permalink / raw)
  To: colyli
  Cc: linux-raid, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

On Thu, Feb 16, 2017 at 12:35:23AM +0800, colyli@suse.de wrote:
> When I run a parallel reading performan testing on a md raid1 device with
> two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
> block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
> only 2.7GB/s, this is around 50% of the idea performance number.
> 
> The perf reports locking contention happens at allow_barrier() and
> wait_barrier() code,
>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>    - _raw_spin_lock_irqsave
>          + 89.92% allow_barrier
>          + 9.34% __wake_up
>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>    - _raw_spin_lock_irq
>          - 100.00% wait_barrier
> 
> The reason is, in these I/O barrier related functions,
>  - raise_barrier()
>  - lower_barrier()
>  - wait_barrier()
>  - allow_barrier()
> They always hold conf->resync_lock firstly, even there are only regular
> reading I/Os and no resync I/O at all. This is a huge performance penalty.
> 
> The solution is a lockless-like algorithm in I/O barrier code, and only
> holding conf->resync_lock when it has to.
> 
> The original idea is from Hannes Reinecke, and Neil Brown provides
> comments to improve it. I continue to work on it, and make the patch into
> current form.
> 
> In the new simpler raid1 I/O barrier implementation, there are two
> wait barrier functions,
>  - wait_barrier()
>    Which calls _wait_barrier(), is used for regular write I/O. If there is
>    resync I/O happening on the same I/O barrier bucket, or the whole
>    array is frozen, task will wait until no barrier on same barrier bucket,
>    or the whold array is unfreezed.
>  - wait_read_barrier()
>    Since regular read I/O won't interfere with resync I/O (read_balance()
>    will make sure only uptodate data will be read out), it is unnecessary
>    to wait for barrier in regular read I/Os, waiting in only necessary
>    when the whole array is frozen.
> 
> The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
> barrier[idx] are very carefully designed in raise_barrier(),
> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
> avoid unnecessary spin locks in these functions. Once conf->
> nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
> has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
> raised in same barrier bucket index and array is not frozen, the regular
> I/O doesn't need to hold conf->resync_lock, it can just increase
> conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
> very similar to _wait_barrier(), the only difference is it only waits when
> array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
> code almostly gets rid of all spin lock cost.
> 
> This patch significantly improves raid1 reading peroformance. From my
> testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
> blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
> increases from 2.7GB/s to 4.6GB/s (+70%).
> 
> Changelog
> V3:
> - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
> - Change conf->nr_queued[] from atomic_t to int.

I missed this part. In the code, nr_queued is sometimes protected by
device_lock and sometimes (in raid1d) not protected at all. Can you explain this?

Thanks,
Shaohua


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-15 16:35 [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window colyli
  2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
  2017-02-16  2:22 ` [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
@ 2017-02-16  7:04 ` NeilBrown
  2017-02-17  6:56   ` Coly Li
  2017-02-17 19:41   ` Shaohua Li
  2 siblings, 2 replies; 43+ messages in thread
From: NeilBrown @ 2017-02-16  7:04 UTC (permalink / raw)
  To: linux-raid; +Cc: Coly Li, Shaohua Li, Johannes Thumshirn, Guoqing Jiang


On Thu, Feb 16 2017, colyli@suse.de wrote:

> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> introduces a sliding resync window for raid1 I/O barrier, this idea limits
> I/O barriers to happen only inside a slidingresync window, for regular
> I/Os out of this resync window they don't need to wait for barrier any
> more. On large raid1 device, it helps a lot to improve parallel writing
> I/O throughput when there are background resync I/Os performing at
> same time.
>
> The idea of sliding resync widow is awesome, but code complexity is a
> challenge. Sliding resync window requires several veriables to work

variables

> collectively, this is complexed and very hard to make it work correctly.
> Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
> to fix the original resync window patch. This is not the end, any further
> related modification may easily introduce more regreassion.
>
> Therefore I decide to implement a much simpler raid1 I/O barrier, by
> removing resync window code, I believe life will be much easier.
>
> The brief idea of the simpler barrier is,
>  - Do not maintain a logbal unique resync window

global

>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular

conflicts

>    I/O only has to wait for a resync I/O when both them have same barrier
>    bucket index, vice versa.
>  - I/O barrier can be recuded to an acceptable number if there are enought

reduced
enough

>    barrier buckets
>
> Here I explain how the barrier buckets are designed,
>  - BARRIER_UNIT_SECTOR_SIZE
>    The whole LBA address space of a raid1 device is divided into multiple
>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>    Bio request won't go across border of barrier unit size, that means

requests

>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>    For random I/O 64MB is large enough for both read and write requests,
>    for sequential I/O considering underlying block layer may merge them
>    into larger requests, 64MB is still good enough.
>    Neil also points out that for resync operation, "we want the resync to
>    move from region to region fairly quickly so that the slowness caused
>    by having to synchronize with the resync is averaged out over a fairly
>    small time frame". For full speed resync, 64MB should take less then 1
>    second. When resync is competing with other I/O, it could take up a few
>    minutes. Therefore 64MB size is fairly good range for resync.
>
>  - BARRIER_BUCKETS_NR
>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>    this patch makes the bellowed members of struct r1conf from integer
>    to array of integers,
>         -       int                     nr_pending;
>         -       int                     nr_waiting;
>         -       int                     nr_queued;
>         -       int                     barrier;
>         +       int                     *nr_pending;
>         +       int                     *nr_waiting;
>         +       int                     *nr_queued;
>         +       int                     *barrier;
>    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
>    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
>    barrier buckets, and each array of integers occupies single memory page.
>    1024 means for a request which is smaller than the I/O barrier unit size
>    has ~0.1% chance to wait for resync to pause, which is quite a small
>    enough fraction. Also requesting single memory page is more friendly to
>    kernel page allocator than larger memory size.
>
>  - I/O barrier bucket is indexed by bio start sector
>    If multiple I/O requests hit different I/O barrier units, they only need
>    to compete I/O barrier with other I/Os which hit the same I/O barrier
>    bucket index with each other. The index of a barrier bucket which a
>    bio should look for is calculated by sector_to_idx() which is defined
>    in raid1.h as an inline function,
>         static inline int sector_to_idx(sector_t sector)
>         {
>                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>                                 BARRIER_BUCKETS_NR_BITS);
>         }
>    Here sector_nr is the start sector number of a bio.

"hash_long() is used so that sequential writes in are region of the
array which is not being resynced will not consistently align with
the buckets that are being sequentially resynced, as described below"

>
>  - Single bio won't go across boundary of a I/O barrier unit
>    If a request goes across boundary of barrier unit, it will be split. A
>    bio may be split in raid1_make_request() or raid1_sync_request(), if
>    sectors returned by align_to_barrier_unit_end() is small than original

smaller

>    bio size.
>
> Comparing to single sliding resync window,
>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>    will have confliction within a single barrier units. So the I/O

... will conflict within ...

>    behavior is similar to single sliding resync window.
>  - But a barrier unit bucket is shared by all barrier units with identical
>    barrier uinit index, the probability of confliction might be higher
>    than single sliding resync window, in condition that writing I/Os
>    always hit barrier units which have identical barrier bucket indexs with
>    the resync I/Os. This is a very rare condition in real I/O work loads,
>    I cannot imagine how it could happen in practice.
>  - Therefore we can achieve a good enough low confliction rate with much

... low conflict rate ...

>    simpler barrier algorithm and implementation.
>
> There are two changes should be noticed,
>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>    single loop, it looks like this,
>         spin_lock_irqsave(&conf->device_lock, flags);
>         conf->nr_queued[idx]--;
>         spin_unlock_irqrestore(&conf->device_lock, flags);
>    This change generates more spin lock operations, but in next patch of
>    this patch set, it will be replaced by a single line code,
>         atomic_dec(&conf->nr_queueud[idx]);
>    So we don't need to worry about spin lock cost here.
>  - Mainline raid1 code split original raid1_make_request() into
>    raid1_read_request() and raid1_write_request(). If the original bio
>    goes across an I/O barrier unit size, this bio will be split before
>    calling raid1_read_request() or raid1_write_request(),  this change
>    the code logic more simple and clear.
>  - In this patch wait_barrier() is moved from raid1_make_request() to
>    raid1_write_request(). In raid_read_request(), original wait_barrier()
>    is replaced by raid1_read_request().
>    The differnece is wait_read_barrier() only waits if array is frozen,
>    using different barrier function in different code path makes the code
>    more clean and easy to read.

Thank you for putting the effort into writing a comprehensive change
description.  I really appreciate it.

>  
> @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>  
>  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>  {
> -	struct r1conf *conf = mddev->private;
> -	struct r1bio *r1_bio;
> +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
> +	struct bio *split;
> +	sector_t sectors;
>  
> -	/*
> -	 * make_request() can abort the operation when read-ahead is being
> -	 * used and no empty request is available.
> -	 *
> -	 */
> -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> +	make_request_fn = (bio_data_dir(bio) == READ) ?
> +			  raid1_read_request : raid1_write_request;
>  
> -	r1_bio->master_bio = bio;
> -	r1_bio->sectors = bio_sectors(bio);
> -	r1_bio->state = 0;
> -	r1_bio->mddev = mddev;
> -	r1_bio->sector = bio->bi_iter.bi_sector;
> -
> -	/*
> -	 * We might need to issue multiple reads to different devices if there
> -	 * are bad blocks around, so we keep track of the number of reads in
> -	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
> -	 * no locking will be needed when requests complete.  If it is
> -	 * non-zero, then it is the number of not-completed requests.
> -	 */
> -	bio->bi_phys_segments = 0;
> -	bio_clear_flag(bio, BIO_SEG_VALID);
> +	/* if bio exceeds barrier unit boundary, split it */
> +	do {
> +		sectors = align_to_barrier_unit_end(
> +				bio->bi_iter.bi_sector, bio_sectors(bio));
> +		if (sectors < bio_sectors(bio)) {
> +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> +			bio_chain(split, bio);
> +		} else {
> +			split = bio;
> +		}
>  
> -	if (bio_data_dir(bio) == READ)
> -		raid1_read_request(mddev, bio, r1_bio);
> -	else
> -		raid1_write_request(mddev, bio, r1_bio);
> +		make_request_fn(mddev, split);
> +	} while (split != bio);
>  }

I know you are going to change this as Shaohua wants the splitting
to happen in a separate function, which I agree with, but there is
something else wrong here.
Calling bio_split/bio_chain repeatedly in a loop is dangerous.
It is OK for simple devices, but when one request can wait for another
request to the same device it can deadlock.
This can happen with raid1.  If a resync request calls raise_barrier()
between one request and the next, then the next has to wait for the
resync request, which has to wait for the first request.
As the first request will be stuck in the queue in
generic_make_request(), you get a deadlock.
It is much safer to:

    if (need to split) {
        split = bio_split(bio, ...)
        bio_chain(...)
        make_request_fn(split);
        generic_make_request(bio);
   } else
        make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices.  These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first.  Then it will process the remainder
of the original bio once the first section has been fully processed.
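
Roughly, a sketch of that shape (illustrative only; __raid1_make_request here
just stands for the read/write dispatch Shaohua suggested splitting out, it is
not final code):

	static void raid1_make_request(struct mddev *mddev, struct bio *bio)
	{
		sector_t sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
							     bio_sectors(bio));

		if (sectors < bio_sectors(bio)) {
			/* handle the part before the barrier-unit boundary now */
			struct bio *split = bio_split(bio, sectors, GFP_NOIO,
						      fs_bio_set);

			bio_chain(split, bio);
			__raid1_make_request(mddev, split);
			/*
			 * The remainder is appended to current->bio_list and is
			 * only handled after the requests generated for 'split'
			 * have been popped off and issued, so a raise_barrier()
			 * in between cannot deadlock against a request still
			 * parked on the list.
			 */
			generic_make_request(bio);
		} else {
			__raid1_make_request(mddev, bio);
		}
	}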


Thanks,
NeilBrown



* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
  2017-02-15 17:15   ` Coly Li
  2017-02-16  2:25   ` Shaohua Li
@ 2017-02-16  7:04   ` NeilBrown
  2017-02-17  7:56     ` Coly Li
  2 siblings, 1 reply; 43+ messages in thread
From: NeilBrown @ 2017-02-16  7:04 UTC (permalink / raw)
  To: linux-raid
  Cc: Coly Li, Shaohua Li, Hannes Reinecke, Johannes Thumshirn, Guoqing Jiang


On Thu, Feb 16 2017, colyli@suse.de wrote:

> @@ -2393,6 +2455,11 @@ static void handle_write_finished(struct r1conf *conf, struct r1bio *r1_bio)
>  		idx = sector_to_idx(r1_bio->sector);
>  		conf->nr_queued[idx]++;
>  		spin_unlock_irq(&conf->device_lock);
> +		/*
> +		 * In case freeze_array() is waiting for condition
> +		 * get_unqueued_pending() == extra to be true.
> +		 */
> +		wake_up(&conf->wait_barrier);
>  		md_wakeup_thread(conf->mddev->thread);
>  	} else {
>  		if (test_bit(R1BIO_WriteError, &r1_bio->state))
> @@ -2529,9 +2596,7 @@ static void raid1d(struct md_thread *thread)
>  						  retry_list);
>  			list_del(&r1_bio->retry_list);
>  			idx = sector_to_idx(r1_bio->sector);
> -			spin_lock_irqsave(&conf->device_lock, flags);
>  			conf->nr_queued[idx]--;
> -			spin_unlock_irqrestore(&conf->device_lock, flags);

Why do you think it is safe to decrement nr_queued without holding the
lock?
Surely this could race with handle_write_finished, and an update could
be lost.

Otherwise, looks good.

Thanks,
NeilBrown



* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-16  2:22 ` [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
@ 2017-02-16 17:05   ` Coly Li
  2017-02-17 12:40     ` Coly Li
  0 siblings, 1 reply; 43+ messages in thread
From: Coly Li @ 2017-02-16 17:05 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On 2017/2/16 10:22 AM, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 12:35:22AM +0800, colyli@suse.de wrote:
>> 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> I/O barriers to happen only inside a slidingresync window, for regular
>> I/Os out of this resync window they don't need to wait for barrier any
>> more. On large raid1 device, it helps a lot to improve parallel writing
>> I/O throughput when there are background resync I/Os performing at
>> same time.
>>
>> The idea of sliding resync widow is awesome, but code complexity is a
>> challenge. Sliding resync window requires several veriables to work
>> collectively, this is complexed and very hard to make it work correctly.
>> Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
>> to fix the original resync window patch. This is not the end, any further
>> related modification may easily introduce more regreassion.
>>
>> Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> removing resync window code, I believe life will be much easier.
>>
>> The brief idea of the simpler barrier is,
>>  - Do not maintain a logbal unique resync window
>>  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>>    I/O only has to wait for a resync I/O when both them have same barrier
>>    bucket index, vice versa.
>>  - I/O barrier can be recuded to an acceptable number if there are enought
>>    barrier buckets
>>
>> Here I explain how the barrier buckets are designed,
>>  - BARRIER_UNIT_SECTOR_SIZE
>>    The whole LBA address space of a raid1 device is divided into multiple
>>    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>>    Bio request won't go across border of barrier unit size, that means
>>    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>>    For random I/O 64MB is large enough for both read and write requests,
>>    for sequential I/O considering underlying block layer may merge them
>>    into larger requests, 64MB is still good enough.
>>    Neil also points out that for resync operation, "we want the resync to
>>    move from region to region fairly quickly so that the slowness caused
>>    by having to synchronize with the resync is averaged out over a fairly
>>    small time frame". For full speed resync, 64MB should take less then 1
>>    second. When resync is competing with other I/O, it could take up a few
>>    minutes. Therefore 64MB size is fairly good range for resync.
>>
>>  - BARRIER_BUCKETS_NR
>>    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>>         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>>         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>>    this patch makes the bellowed members of struct r1conf from integer
>>    to array of integers,
>>         -       int                     nr_pending;
>>         -       int                     nr_waiting;
>>         -       int                     nr_queued;
>>         -       int                     barrier;
>>         +       int                     *nr_pending;
>>         +       int                     *nr_waiting;
>>         +       int                     *nr_queued;
>>         +       int                     *barrier;
>>    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
>>    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
>>    barrier buckets, and each array of integers occupies single memory page.
>>    1024 means for a request which is smaller than the I/O barrier unit size
>>    has ~0.1% chance to wait for resync to pause, which is quite a small
>>    enough fraction. Also requesting single memory page is more friendly to
>>    kernel page allocator than larger memory size.
>>
>>  - I/O barrier bucket is indexed by bio start sector
>>    If multiple I/O requests hit different I/O barrier units, they only need
>>    to compete I/O barrier with other I/Os which hit the same I/O barrier
>>    bucket index with each other. The index of a barrier bucket which a
>>    bio should look for is calculated by sector_to_idx() which is defined
>>    in raid1.h as an inline function,
>>         static inline int sector_to_idx(sector_t sector)
>>         {
>>                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>>                                 BARRIER_BUCKETS_NR_BITS);
>>         }
>>    Here sector_nr is the start sector number of a bio.
>>
>>  - Single bio won't go across boundary of a I/O barrier unit
>>    If a request goes across boundary of barrier unit, it will be split. A
>>    bio may be split in raid1_make_request() or raid1_sync_request(), if
>>    sectors returned by align_to_barrier_unit_end() is small than original
>>    bio size.
>>
>> Comparing to single sliding resync window,
>>  - Currently resync I/O grows linearly, therefore regular and resync I/O
>>    will have confliction within a single barrier units. So the I/O
>>    behavior is similar to single sliding resync window.
>>  - But a barrier unit bucket is shared by all barrier units with identical
>>    barrier uinit index, the probability of confliction might be higher
>>    than single sliding resync window, in condition that writing I/Os
>>    always hit barrier units which have identical barrier bucket indexs with
>>    the resync I/Os. This is a very rare condition in real I/O work loads,
>>    I cannot imagine how it could happen in practice.
>>  - Therefore we can achieve a good enough low confliction rate with much
>>    simpler barrier algorithm and implementation.
>>
>> There are two changes should be noticed,
>>  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>>    single loop, it looks like this,
>>         spin_lock_irqsave(&conf->device_lock, flags);
>>         conf->nr_queued[idx]--;
>>         spin_unlock_irqrestore(&conf->device_lock, flags);
>>    This change generates more spin lock operations, but in next patch of
>>    this patch set, it will be replaced by a single line code,
>>         atomic_dec(&conf->nr_queueud[idx]);
>>    So we don't need to worry about spin lock cost here.
>>  - Mainline raid1 code split original raid1_make_request() into
>>    raid1_read_request() and raid1_write_request(). If the original bio
>>    goes across an I/O barrier unit size, this bio will be split before
>>    calling raid1_read_request() or raid1_write_request(),  this change
>>    the code logic more simple and clear.
>>  - In this patch wait_barrier() is moved from raid1_make_request() to
>>    raid1_write_request(). In raid_read_request(), original wait_barrier()
>>    is replaced by raid1_read_request().
>>    The differnece is wait_read_barrier() only waits if array is frozen,
>>    using different barrier function in different code path makes the code
>>    more clean and easy to read.
>> Changelog
>> V3:
>> - Rebase the patch against latest upstream kernel code.
>> - Many fixes by review comments from Neil,
>>   - Back to use pointers to replace arraries in struct r1conf
>>   - Remove total_barriers from struct r1conf
>>   - Add more patch comments to explain how/why the values of
>>     BARRIER_UNIT_SECTOR_SIZE and BARRIER_BUCKETS_NR are decided.
>>   - Use get_unqueued_pending() to replace get_all_pendings() and
>>     get_all_queued()
>>   - Increase bucket number from 512 to 1024
>> - Change code comments format by review from Shaohua.
>> V2:
>> - Use bio_split() to split the orignal bio if it goes across barrier unit
>>   bounday, to make the code more simple, by suggestion from Shaohua and
>>   Neil.
>> - Use hash_long() to replace original linear hash, to avoid a possible
>>   confilict between resync I/O and sequential write I/O, by suggestion from
>>   Shaohua.
>> - Add conf->total_barriers to record barrier depth, which is used to
>>   control number of parallel sync I/O barriers, by suggestion from Shaohua.
>> - In V1 patch the bellowed barrier buckets related members in r1conf are
>>   allocated in memory page. To make the code more simple, V2 patch moves
>>   the memory space into struct r1conf, like this,
>>         -       int                     nr_pending;
>>         -       int                     nr_waiting;
>>         -       int                     nr_queued;
>>         -       int                     barrier;
>>         +       int                     nr_pending[BARRIER_BUCKETS_NR];
>>         +       int                     nr_waiting[BARRIER_BUCKETS_NR];
>>         +       int                     nr_queued[BARRIER_BUCKETS_NR];
>>         +       int                     barrier[BARRIER_BUCKETS_NR];
>>   This change is by the suggestion from Shaohua.
>> - Remove some inrelavent code comments, by suggestion from Guoqing.
>> - Add a missing wait_barrier() before jumping to retry_write, in
>>   raid1_make_write_request().
>> V1:
>> - Original RFC patch for comments
> 
> Looks good, two minor issues.
> 
>>  
>> -static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>> -				 struct r1bio *r1_bio)
>> +static void raid1_read_request(struct mddev *mddev, struct bio *bio)
>>  {
>>  	struct r1conf *conf = mddev->private;
>>  	struct raid1_info *mirror;
>> +	struct r1bio *r1_bio;
>>  	struct bio *read_bio;
>>  	struct bitmap *bitmap = mddev->bitmap;
>>  	const int op = bio_op(bio);
>> @@ -1083,7 +1101,34 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>  	int max_sectors;
>>  	int rdisk;
>>  
>> -	wait_barrier(conf, bio);
>> +	/*
>> +	 * Still need barrier for READ in case that whole
>> +	 * array is frozen.
>> +	 */
>> +	wait_read_barrier(conf, bio->bi_iter.bi_sector);
>> +	bitmap = mddev->bitmap;
>> +
>> +	/*
>> +	 * make_request() can abort the operation when read-ahead is being
>> +	 * used and no empty request is available.
>> +	 *
>> +	 */
>> +	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>> +	r1_bio->master_bio = bio;
>> +	r1_bio->sectors = bio_sectors(bio);
>> +	r1_bio->state = 0;
>> +	r1_bio->mddev = mddev;
>> +	r1_bio->sector = bio->bi_iter.bi_sector;
> 
> This part looks unnecessary complicated. If you change raid1_make_request to
> something like __raid1_make_reques, add a new raid1_make_request and do bio
> split there, then call __raid1_make_request for each splitted bio, then you
> don't need to duplicate the r1_bio allocation parts for read/write.
> 

Aha, good point! I will do that.

>> diff --git a/drivers/md/raid1.h b/drivers/md/raid1.h
>> index c52ef42..d3faf30 100644
>> --- a/drivers/md/raid1.h
>> +++ b/drivers/md/raid1.h
>> @@ -1,6 +1,14 @@
>>  #ifndef _RAID1_H
>>  #define _RAID1_H
>>  
>> +/* each barrier unit size is 64MB fow now
>> + * note: it must be larger than RESYNC_DEPTH
>> + */
>> +#define BARRIER_UNIT_SECTOR_BITS	17
>> +#define BARRIER_UNIT_SECTOR_SIZE	(1<<17)
>> +#define BARRIER_BUCKETS_NR_BITS		(PAGE_SHIFT - 2)
> 
> maybe write this as (PAGE_SHIFT - ilog2(sizeof(int)))? To be honest, I don't
> think it really matters if the array is PAGE_SIZE length, maybe just specify a
> const here.

Nice catch! It makes sense; I will do that and add some comments to
explain why it is sizeof(int).

Thanks.

Coly


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-16  7:04 ` NeilBrown
@ 2017-02-17  6:56   ` Coly Li
  2017-02-19 23:50     ` NeilBrown
  2017-02-17 19:41   ` Shaohua Li
  1 sibling, 1 reply; 43+ messages in thread
From: Coly Li @ 2017-02-17  6:56 UTC (permalink / raw)
  To: NeilBrown, linux-raid; +Cc: Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/16 3:04 PM, NeilBrown wrote:
> On Thu, Feb 16 2017, colyli@suse.de wrote:

[snip]
>> - I/O barrier bucket is indexed by bio start sector If multiple
>> I/O requests hit different I/O barrier units, they only need to
>> compete I/O barrier with other I/Os which hit the same I/O
>> barrier bucket index with each other. The index of a barrier
>> bucket which a bio should look for is calculated by
>> sector_to_idx() which is defined in raid1.h as an inline
>> function, static inline int sector_to_idx(sector_t sector) { 
>> return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS, 
>> BARRIER_BUCKETS_NR_BITS); } Here sector_nr is the start sector
>> number of a bio.
> 
> "hash_long() is used so that sequential writes in are region of
> the array which is not being resynced will not consistently align
> with the buckets that are being sequentially resynced, as described
> below"

Sorry, this sentence is too long for me to understand ... Could you
please use some simpler and shorter sentences?


[snip]
>> There are two changes should be noticed, - In raid1d(), I change
>> the code to decrease conf->nr_pending[idx] into single loop, it
>> looks like this, spin_lock_irqsave(&conf->device_lock, flags); 
>> conf->nr_queued[idx]--; 
>> spin_unlock_irqrestore(&conf->device_lock, flags); This change
>> generates more spin lock operations, but in next patch of this
>> patch set, it will be replaced by a single line code, 
>> atomic_dec(&conf->nr_queueud[idx]); So we don't need to worry
>> about spin lock cost here. - Mainline raid1 code split original
>> raid1_make_request() into raid1_read_request() and
>> raid1_write_request(). If the original bio goes across an I/O
>> barrier unit size, this bio will be split before calling
>> raid1_read_request() or raid1_write_request(),  this change the
>> code logic more simple and clear. - In this patch wait_barrier()
>> is moved from raid1_make_request() to raid1_write_request(). In
>> raid_read_request(), original wait_barrier() is replaced by
>> raid1_read_request(). The differnece is wait_read_barrier() only
>> waits if array is frozen, using different barrier function in
>> different code path makes the code more clean and easy to read.
> 
> Thank you for putting the effort into writing a comprehensve
> change description.  I really appreciate it.

Neil, thank you so much. I will fix all the above typos in the next version;
once I understand that long sentence I will include it in the patch
comments too.


>> 
>> @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct
>> mddev *mddev, struct bio *bio,
>> 
>> static void raid1_make_request(struct mddev *mddev, struct bio
>> *bio) { -	struct r1conf *conf = mddev->private; -	struct r1bio
>> *r1_bio; +	void (*make_request_fn)(struct mddev *mddev, struct
>> bio *bio); +	struct bio *split; +	sector_t sectors;
>> 
>> -	/* -	 * make_request() can abort the operation when read-ahead
>> is being -	 * used and no empty request is available. -	 * -	 */ 
>> -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO); +
>> make_request_fn = (bio_data_dir(bio) == READ) ? +
>> raid1_read_request : raid1_write_request;
>> 
>> -	r1_bio->master_bio = bio; -	r1_bio->sectors =
>> bio_sectors(bio); -	r1_bio->state = 0; -	r1_bio->mddev = mddev; -
>> r1_bio->sector = bio->bi_iter.bi_sector; - -	/* -	 * We might
>> need to issue multiple reads to different devices if there -	 *
>> are bad blocks around, so we keep track of the number of reads
>> in -	 * bio->bi_phys_segments.  If this is 0, there is only one
>> r1_bio and -	 * no locking will be needed when requests complete.
>> If it is -	 * non-zero, then it is the number of not-completed
>> requests. -	 */ -	bio->bi_phys_segments = 0; -
>> bio_clear_flag(bio, BIO_SEG_VALID); +	/* if bio exceeds barrier
>> unit boundary, split it */ +	do { +		sectors =
>> align_to_barrier_unit_end( +				bio->bi_iter.bi_sector,
>> bio_sectors(bio)); +		if (sectors < bio_sectors(bio)) { +			split
>> = bio_split(bio, sectors, GFP_NOIO, fs_bio_set); +
>> bio_chain(split, bio); +		} else { +			split = bio; +		}
>> 
>> -	if (bio_data_dir(bio) == READ) -		raid1_read_request(mddev,
>> bio, r1_bio); -	else -		raid1_write_request(mddev, bio, r1_bio); 
>> +		make_request_fn(mddev, split); +	} while (split != bio); }
> 
> I know you are going to change this as Shaohua wantsthe spitting to
> happen in a separate function, which I agree with, but there is 
> something else wrong here. Calling bio_split/bio_chain repeatedly
> in a loop is dangerous. It is OK for simple devices, but when one
> request can wait for another request to the same device it can
> deadlock. This can happen with raid1.  If a resync request calls
> raise_barrier() between one request and the next, then the next has
> to wait for the resync request, which has to wait for the first
> request. As the first request will be stuck in the queue in 
> generic_make_request(), you get a deadlock.

For md raid1, when you say the queue in generic_make_request(), can I
understand it as bio_list_on_stack in that function? And for the queue in
the underlying device, can I understand it as data structures like
plug->pending and conf->pending_bio_list?

I still don't get the point of the deadlock; let me try to explain why I
don't see it. If a bio is split, the first part is processed by
make_request_fn(), and then a resync comes and raises a barrier, there
are 3 possible conditions,
- the resync I/O tries to raise the barrier on the same bucket as the first
regular bio. Then the resync task has to wait until the first bio drops
its conf->nr_pending[idx]
- the resync I/O hits a barrier bucket not related to either the first or
the second split regular I/O, and no one will be blocked.
- the resync I/O hits the same barrier bucket which the second split
regular bio will access, then the barrier is raised and the second split
bio will be blocked, but the first split regular I/O will continue to
go ahead.

The first and the second split bios won't hit the same I/O barrier bucket,
and a single resync bio may only interfere with one of the split
regular I/Os. This is why I don't see how the deadlock comes.


And one more question: even if there were only one I/O barrier bucket, I
don't understand how the first split regular I/O could be stuck in the
queue in generic_make_request(); I assume the queue is
bio_list_on_stack. If the first split bio is stuck in
generic_make_request() I can understand there is a deadlock, but I
don't see why it would be blocked there. Could you please give me
more hints?


> It is much safer to:
> 
>     if (need to split) {
>         split = bio_split(bio, ...)
>         bio_chain(...)
>         make_request_fn(split);
>         generic_make_request(bio);
>    } else
>         make_request_fn(mddev, bio);
> 
> This way we first process the initial section of the bio (in
> 'split') which will queue some requests to the underlying devices.
> These requests will be queued in generic_make_request. Then we
> queue the remainder of the bio, which will be added to the end of
> the generic_make_request queue. Then we return. 
> generic_make_request() will pop the lower-level device requests off
> the queue and handle them first.  Then it will process the
> remainder of the original bio once the first section has been fully
> processed.

Do you mean that before the first split bio is fully processed, the
second split bio should not be sent to the underlying device? If that is
what you mean, it seems the split bios won't be merged into larger
requests in the underlying device, which might be a tiny performance loss?

Coly




* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2017-02-16  7:04   ` NeilBrown
@ 2017-02-17  7:56     ` Coly Li
  2017-02-17 18:35       ` Coly Li
  0 siblings, 1 reply; 43+ messages in thread
From: Coly Li @ 2017-02-17  7:56 UTC (permalink / raw)
  To: NeilBrown, linux-raid
  Cc: Shaohua Li, Hannes Reinecke, Johannes Thumshirn, Guoqing Jiang

On 2017/2/16 3:04 PM, NeilBrown wrote:
> On Thu, Feb 16 2017, colyli@suse.de wrote:
> 
>> @@ -2393,6 +2455,11 @@ static void handle_write_finished(struct
>> r1conf *conf, struct r1bio *r1_bio) idx =
>> sector_to_idx(r1_bio->sector); conf->nr_queued[idx]++; 
>> spin_unlock_irq(&conf->device_lock); +		/* +		 * In case
>> freeze_array() is waiting for condition +		 *
>> get_unqueued_pending() == extra to be true. +		 */ +
>> wake_up(&conf->wait_barrier); 
>> md_wakeup_thread(conf->mddev->thread); } else { if
>> (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2529,9 +2596,7
>> @@ static void raid1d(struct md_thread *thread) retry_list); 
>> list_del(&r1_bio->retry_list); idx =
>> sector_to_idx(r1_bio->sector); -
>> spin_lock_irqsave(&conf->device_lock, flags); 
>> conf->nr_queued[idx]--; -
>> spin_unlock_irqrestore(&conf->device_lock, flags);
> 
> Why do you think it is safe to decrement nr_queued without holding
> the lock? Surely this could race with handle_write_finished, and an
> update could be lost.

conf->nr_queued[idx] is an integer aligned to a 4-byte address, so
conf->nr_queued[idx]++ is the same as atomic_inc(&conf->nr_queued[idx]);
it is an atomic operation. And there is no ordering requirement, so I
don't need a memory barrier here. This is why I removed the spin lock and
changed it from atomic_t back to int.


IMHO, the problematic location is not here, but in freeze_array(). Currently
the code assumes the array is frozen when "get_unqueued_pending(conf) ==
extra" becomes true. I think that is incorrect.

After conf->array_frozen is set to 1, the raid1 code may still handle
in-flight requests, so conf->nr_pending[] and conf->nr_queued[] may
both be decreasing. There is a possibility that get_unqueued_pending()
returns 0 for a very short moment, before everything is quiet. If the
wait_event inside freeze_array() catches exactly this moment, sees a
true condition and returns to its caller, unexpected things will happen.

I don't cover this issue in this patch set because I feel this is
another topic. Hmm, maybe I am a little off topic here.

Coly Li


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-16 17:05   ` Coly Li
@ 2017-02-17 12:40     ` Coly Li
  0 siblings, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-17 12:40 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Neil Brown, Johannes Thumshirn, Guoqing Jiang

On 2017/2/17 1:05 AM, Coly Li wrote:
> On 2017/2/16 10:22 AM, Shaohua Li wrote:
>> On Thu, Feb 16, 2017 at 12:35:22AM +0800, colyli@suse.de wrote:

[snip]
>>
>>>  
>>> -static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>> -				 struct r1bio *r1_bio)
>>> +static void raid1_read_request(struct mddev *mddev, struct bio *bio)
>>>  {
>>>  	struct r1conf *conf = mddev->private;
>>>  	struct raid1_info *mirror;
>>> +	struct r1bio *r1_bio;
>>>  	struct bio *read_bio;
>>>  	struct bitmap *bitmap = mddev->bitmap;
>>>  	const int op = bio_op(bio);
>>> @@ -1083,7 +1101,34 @@ static void raid1_read_request(struct mddev *mddev, struct bio *bio,
>>>  	int max_sectors;
>>>  	int rdisk;
>>>  
>>> -	wait_barrier(conf, bio);
>>> +	/*
>>> +	 * Still need barrier for READ in case that whole
>>> +	 * array is frozen.
>>> +	 */
>>> +	wait_read_barrier(conf, bio->bi_iter.bi_sector);
>>> +	bitmap = mddev->bitmap;
>>> +
>>> +	/*
>>> +	 * make_request() can abort the operation when read-ahead is being
>>> +	 * used and no empty request is available.
>>> +	 *
>>> +	 */
>>> +	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>>> +	r1_bio->master_bio = bio;
>>> +	r1_bio->sectors = bio_sectors(bio);
>>> +	r1_bio->state = 0;
>>> +	r1_bio->mddev = mddev;
>>> +	r1_bio->sector = bio->bi_iter.bi_sector;
>>
>> This part looks unnecessary complicated. If you change raid1_make_request to
>> something like __raid1_make_reques, add a new raid1_make_request and do bio
>> split there, then call __raid1_make_request for each splitted bio, then you
>> don't need to duplicate the r1_bio allocation parts for read/write.
>>
> 
> Aha, good point! I will do that.

I find that adding one more wrapper function increases the stack depth in the
raid1 I/O path. So instead of wrapping one more function, I chose to use a
static inline function to allocate the r1bio.


+static inline struct r1bio *
+alloc_r1bio(struct mddev *mddev, struct bio *bio, sector_t sectors_handled)
+{
+       struct r1conf *conf = mddev->private;
+       struct r1bio *r1_bio = NULL;
+
+
+       r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
+       if (!r1_bio) {
+               WARN(1, "%s: fail to allocate r1_bio.\n", __func__);
+               return NULL;
+       }
+
+       r1_bio->master_bio = bio;
+       r1_bio->sectors = bio_sectors(bio) - sectors_handled;
+       r1_bio->state = 0;
+       r1_bio->mddev = mddev;
+       r1_bio->sector = bio->bi_iter.bi_sector + sectors_handled;
+
+       return r1_bio;
+}

A good point is that before jumping to read_again in raid1_read_request() or
to retry_write in raid1_write_request(), this function also helps to remove
more redundant code, like this,
	r1_bio = alloc_r1bio(mddev, bio, sectors_handled);

I don't check whether alloc_r1bio() returns NULL in either
raid1_read_request() or raid1_write_request(). If the allocation
fails, it may cause a NULL pointer dereference, but I don't plan to
fix that in this patch set; I just keep the existing code logic.

Coly




* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2017-02-17  7:56     ` Coly Li
@ 2017-02-17 18:35       ` Coly Li
  0 siblings, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-17 18:35 UTC (permalink / raw)
  To: NeilBrown, linux-raid
  Cc: Shaohua Li, Hannes Reinecke, Johannes Thumshirn, Guoqing Jiang

On 2017/2/17 3:56 PM, Coly Li wrote:
> On 2017/2/16 3:04 PM, NeilBrown wrote:
>> On Thu, Feb 16 2017, colyli@suse.de wrote:
>>
>>> @@ -2393,6 +2455,11 @@ static void handle_write_finished(struct
>>> r1conf *conf, struct r1bio *r1_bio) idx =
>>> sector_to_idx(r1_bio->sector); conf->nr_queued[idx]++; 
>>> spin_unlock_irq(&conf->device_lock); +		/* +		 * In case
>>> freeze_array() is waiting for condition +		 *
>>> get_unqueued_pending() == extra to be true. +		 */ +
>>> wake_up(&conf->wait_barrier); 
>>> md_wakeup_thread(conf->mddev->thread); } else { if
>>> (test_bit(R1BIO_WriteError, &r1_bio->state)) @@ -2529,9 +2596,7
>>> @@ static void raid1d(struct md_thread *thread) retry_list); 
>>> list_del(&r1_bio->retry_list); idx =
>>> sector_to_idx(r1_bio->sector); -
>>> spin_lock_irqsave(&conf->device_lock, flags); 
>>> conf->nr_queued[idx]--; -
>>> spin_unlock_irqrestore(&conf->device_lock, flags);
>>
>> Why do you think it is safe to decrement nr_queued without holding
>> the lock? Surely this could race with handle_write_finished, and an
>> update could be lost.
> 
> conf->nr_queued[idx] is an integer and aligned to 4 bytes address, so
> conf->nr_queued[idx]++ is same to atomic_inc(&conf->nr_queued[idx]),
> it is atomic operation. And there is no ordering requirement, so I
> don't need memory barrier here. This is why I remove spin lock, and
> change it from atomic_t back to int.
> 
> 
> IMHO, the problematic location is not here, but in freeze_array(). Now
> the code assume array is froze when "get_unqueued_pending(conf) ==
> extra" gets true. I think it is incorrect.

Hmm, I was wrong here. conf->nr_queued[idx]++ is not atomic. Yes, it
should be atomic_t; I will fix it in the next version.
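
For illustration only, a tiny userspace toy (C11 + pthreads, not kernel code)
showing why the plain increment is racy while an atomic counter is not -- the
same reason the per-bucket counters need atomic_t:

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static int plain_cnt;			/* like a plain int counter */
	static atomic_int atomic_cnt;		/* like an atomic_t counter */

	static void *worker(void *arg)
	{
		(void)arg;
		for (int i = 0; i < 1000000; i++) {
			plain_cnt++;			  /* read-modify-write, updates can be lost */
			atomic_fetch_add(&atomic_cnt, 1); /* never loses an update */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t t1, t2;

		pthread_create(&t1, NULL, worker, NULL);
		pthread_create(&t2, NULL, worker, NULL);
		pthread_join(t1, NULL);
		pthread_join(t2, NULL);

		/* plain_cnt usually falls short of 2000000, atomic_cnt never does */
		printf("plain: %d  atomic: %d\n", plain_cnt, atomic_load(&atomic_cnt));
		return 0;
	}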

Coly Li



* Re: [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code
  2017-02-16  2:25   ` Shaohua Li
@ 2017-02-17 18:42     ` Coly Li
  0 siblings, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-17 18:42 UTC (permalink / raw)
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, Hannes Reinecke, Neil Brown,
	Johannes Thumshirn, Guoqing Jiang

On 2017/2/16 10:25 AM, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 12:35:23AM +0800, colyli@suse.de wrote:
>> When I run a parallel reading performan testing on a md raid1 device with
>> two NVMe SSDs, I observe very bad throughput in supprise: by fio with 64KB
>> block size, 40 seq read I/O jobs, 128 iodepth, overall throughput is
>> only 2.7GB/s, this is around 50% of the idea performance number.
>>
>> The perf reports locking contention happens at allow_barrier() and
>> wait_barrier() code,
>>  - 41.41%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>>    - _raw_spin_lock_irqsave
>>          + 89.92% allow_barrier
>>          + 9.34% __wake_up
>>  - 37.30%  fio [kernel.kallsyms]     [k] _raw_spin_lock_irq
>>    - _raw_spin_lock_irq
>>          - 100.00% wait_barrier
>>
>> The reason is, in these I/O barrier related functions,
>>  - raise_barrier()
>>  - lower_barrier()
>>  - wait_barrier()
>>  - allow_barrier()
>> They always hold conf->resync_lock firstly, even there are only regular
>> reading I/Os and no resync I/O at all. This is a huge performance penalty.
>>
>> The solution is a lockless-like algorithm in I/O barrier code, and only
>> holding conf->resync_lock when it has to.
>>
>> The original idea is from Hannes Reinecke, and Neil Brown provides
>> comments to improve it. I continue to work on it, and make the patch into
>> current form.
>>
>> In the new simpler raid1 I/O barrier implementation, there are two
>> wait barrier functions,
>>  - wait_barrier()
>>    Which calls _wait_barrier(), is used for regular write I/O. If there is
>>    resync I/O happening on the same I/O barrier bucket, or the whole
>>    array is frozen, task will wait until no barrier on same barrier bucket,
>>    or the whold array is unfreezed.
>>  - wait_read_barrier()
>>    Since regular read I/O won't interfere with resync I/O (read_balance()
>>    will make sure only uptodate data will be read out), it is unnecessary
>>    to wait for barrier in regular read I/Os, waiting in only necessary
>>    when the whole array is frozen.
>>
>> The operations on conf->nr_pending[idx], conf->nr_waiting[idx], conf->
>> barrier[idx] are very carefully designed in raise_barrier(),
>> lower_barrier(), _wait_barrier() and wait_read_barrier(), in order to
>> avoid unnecessary spin locks in these functions. Once conf->
>> nr_pengding[idx] is increased, a resync I/O with same barrier bucket index
>> has to wait in raise_barrier(). Then in _wait_barrier() if no barrier
>> raised in same barrier bucket index and array is not frozen, the regular
>> I/O doesn't need to hold conf->resync_lock, it can just increase
>> conf->nr_pending[idx], and return to its caller. wait_read_barrier() is
>> very similar to _wait_barrier(), the only difference is it only waits when
>> array is frozen. For heavy parallel reading I/Os, the lockless I/O barrier
>> code almostly gets rid of all spin lock cost.
>>
>> This patch significantly improves raid1 reading peroformance. From my
>> testing, a raid1 device built by two NVMe SSD, runs fio with 64KB
>> blocksize, 40 seq read I/O jobs, 128 iodepth, overall throughput
>> increases from 2.7GB/s to 4.6GB/s (+70%).
>>
>> Changelog
>> V3:
>> - Add smp_mb__after_atomic() as Shaohua and Neil suggested.
>> - Change conf->nr_queued[] from atomic_t to int.
> 
> I missed this part. In the code, the nr_queued sometimes is protected by
> device_lock, sometimes (raid1d) no protection at all. Can you explain this?

I made a mistake here: an integer store is atomic, but an integer increment
is not. conf->nr_queued[] must be atomic_t here, otherwise it is racy. I will
fix it in the V4 patch.

Thanks for the review.

Coly



* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-16  7:04 ` NeilBrown
  2017-02-17  6:56   ` Coly Li
@ 2017-02-17 19:41   ` Shaohua Li
  2017-02-18  2:40     ` Coly Li
  2017-02-19 23:42     ` NeilBrown
  1 sibling, 2 replies; 43+ messages in thread
From: Shaohua Li @ 2017-02-17 19:41 UTC (permalink / raw)
  To: NeilBrown
  Cc: colyli, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:
> On Thu, Feb 16 2017, colyli@suse.de wrote:
> 
> > 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
> > introduces a sliding resync window for raid1 I/O barrier, this idea limits
> > I/O barriers to happen only inside a slidingresync window, for regular
> > I/Os out of this resync window they don't need to wait for barrier any
> > more. On large raid1 device, it helps a lot to improve parallel writing
> > I/O throughput when there are background resync I/Os performing at
> > same time.
> >
> > The idea of sliding resync widow is awesome, but code complexity is a
> > challenge. Sliding resync window requires several veriables to work
> 
> variables
> 
> > collectively, this is complexed and very hard to make it work correctly.
> > Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
> > to fix the original resync window patch. This is not the end, any further
> > related modification may easily introduce more regreassion.
> >
> > Therefore I decide to implement a much simpler raid1 I/O barrier, by
> > removing resync window code, I believe life will be much easier.
> >
> > The brief idea of the simpler barrier is,
> >  - Do not maintain a logbal unique resync window
> 
> global
> 
> >  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
> 
> conflicts
> 
> >    I/O only has to wait for a resync I/O when both them have same barrier
> >    bucket index, vice versa.
> >  - I/O barrier can be recuded to an acceptable number if there are enought
> 
> reduced
> enough
> 
> >    barrier buckets
> >
> > Here I explain how the barrier buckets are designed,
> >  - BARRIER_UNIT_SECTOR_SIZE
> >    The whole LBA address space of a raid1 device is divided into multiple
> >    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
> >    Bio request won't go across border of barrier unit size, that means
> 
> requests
> 
> >    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
> >    For random I/O 64MB is large enough for both read and write requests,
> >    for sequential I/O considering underlying block layer may merge them
> >    into larger requests, 64MB is still good enough.
> >    Neil also points out that for resync operation, "we want the resync to
> >    move from region to region fairly quickly so that the slowness caused
> >    by having to synchronize with the resync is averaged out over a fairly
> >    small time frame". For full speed resync, 64MB should take less then 1
> >    second. When resync is competing with other I/O, it could take up a few
> >    minutes. Therefore 64MB size is fairly good range for resync.
> >
> >  - BARRIER_BUCKETS_NR
> >    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
> >         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
> >         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
> >    this patch makes the bellowed members of struct r1conf from integer
> >    to array of integers,
> >         -       int                     nr_pending;
> >         -       int                     nr_waiting;
> >         -       int                     nr_queued;
> >         -       int                     barrier;
> >         +       int                     *nr_pending;
> >         +       int                     *nr_waiting;
> >         +       int                     *nr_queued;
> >         +       int                     *barrier;
> >    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
> >    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
> >    barrier buckets, and each array of integers occupies single memory page.
> >    1024 means for a request which is smaller than the I/O barrier unit size
> >    has ~0.1% chance to wait for resync to pause, which is quite a small
> >    enough fraction. Also requesting single memory page is more friendly to
> >    kernel page allocator than larger memory size.
> >
> >  - I/O barrier bucket is indexed by bio start sector
> >    If multiple I/O requests hit different I/O barrier units, they only need
> >    to compete I/O barrier with other I/Os which hit the same I/O barrier
> >    bucket index with each other. The index of a barrier bucket which a
> >    bio should look for is calculated by sector_to_idx() which is defined
> >    in raid1.h as an inline function,
> >         static inline int sector_to_idx(sector_t sector)
> >         {
> >                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
> >                                 BARRIER_BUCKETS_NR_BITS);
> >         }
> >    Here sector_nr is the start sector number of a bio.
> 
> "hash_long() is used so that sequential writes in are region of the
> array which is not being resynced will not consistently align with
> the buckets that are being sequentially resynced, as described below"
> 
> >
> >  - Single bio won't go across boundary of a I/O barrier unit
> >    If a request goes across boundary of barrier unit, it will be split. A
> >    bio may be split in raid1_make_request() or raid1_sync_request(), if
> >    sectors returned by align_to_barrier_unit_end() is small than original
> 
> smaller
> 
> >    bio size.
> >
> > Comparing to single sliding resync window,
> >  - Currently resync I/O grows linearly, therefore regular and resync I/O
> >    will have confliction within a single barrier units. So the I/O
> 
> ... will conflict within ...
> 
> >    behavior is similar to single sliding resync window.
> >  - But a barrier unit bucket is shared by all barrier units with identical
> >    barrier uinit index, the probability of confliction might be higher
> >    than single sliding resync window, in condition that writing I/Os
> >    always hit barrier units which have identical barrier bucket indexs with
> >    the resync I/Os. This is a very rare condition in real I/O work loads,
> >    I cannot imagine how it could happen in practice.
> >  - Therefore we can achieve a good enough low confliction rate with much
> 
> ... low conflict rate ...
> 
> >    simpler barrier algorithm and implementation.
> >
> > There are two changes should be noticed,
> >  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
> >    single loop, it looks like this,
> >         spin_lock_irqsave(&conf->device_lock, flags);
> >         conf->nr_queued[idx]--;
> >         spin_unlock_irqrestore(&conf->device_lock, flags);
> >    This change generates more spin lock operations, but in next patch of
> >    this patch set, it will be replaced by a single line code,
> >         atomic_dec(&conf->nr_queueud[idx]);
> >    So we don't need to worry about spin lock cost here.
> >  - Mainline raid1 code split original raid1_make_request() into
> >    raid1_read_request() and raid1_write_request(). If the original bio
> >    goes across an I/O barrier unit size, this bio will be split before
> >    calling raid1_read_request() or raid1_write_request(),  this change
> >    the code logic more simple and clear.
> >  - In this patch wait_barrier() is moved from raid1_make_request() to
> >    raid1_write_request(). In raid_read_request(), original wait_barrier()
> >    is replaced by raid1_read_request().
> >    The differnece is wait_read_barrier() only waits if array is frozen,
> >    using different barrier function in different code path makes the code
> >    more clean and easy to read.
> 
> Thank you for putting the effort into writing a comprehensive change
> description.  I really appreciate it.
> 
> >  
> > @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
> >  
> >  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
> >  {
> > -	struct r1conf *conf = mddev->private;
> > -	struct r1bio *r1_bio;
> > +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
> > +	struct bio *split;
> > +	sector_t sectors;
> >  
> > -	/*
> > -	 * make_request() can abort the operation when read-ahead is being
> > -	 * used and no empty request is available.
> > -	 *
> > -	 */
> > -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
> > +	make_request_fn = (bio_data_dir(bio) == READ) ?
> > +			  raid1_read_request : raid1_write_request;
> >  
> > -	r1_bio->master_bio = bio;
> > -	r1_bio->sectors = bio_sectors(bio);
> > -	r1_bio->state = 0;
> > -	r1_bio->mddev = mddev;
> > -	r1_bio->sector = bio->bi_iter.bi_sector;
> > -
> > -	/*
> > -	 * We might need to issue multiple reads to different devices if there
> > -	 * are bad blocks around, so we keep track of the number of reads in
> > -	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
> > -	 * no locking will be needed when requests complete.  If it is
> > -	 * non-zero, then it is the number of not-completed requests.
> > -	 */
> > -	bio->bi_phys_segments = 0;
> > -	bio_clear_flag(bio, BIO_SEG_VALID);
> > +	/* if bio exceeds barrier unit boundary, split it */
> > +	do {
> > +		sectors = align_to_barrier_unit_end(
> > +				bio->bi_iter.bi_sector, bio_sectors(bio));
> > +		if (sectors < bio_sectors(bio)) {
> > +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> > +			bio_chain(split, bio);
> > +		} else {
> > +			split = bio;
> > +		}
> >  
> > -	if (bio_data_dir(bio) == READ)
> > -		raid1_read_request(mddev, bio, r1_bio);
> > -	else
> > -		raid1_write_request(mddev, bio, r1_bio);
> > +		make_request_fn(mddev, split);
> > +	} while (split != bio);
> >  }
> 
> I know you are going to change this as Shaohua wants the splitting
> to happen in a separate function, which I agree with, but there is
> something else wrong here.
> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
> It is OK for simple devices, but when one request can wait for another
> request to the same device it can deadlock.
> This can happen with raid1.  If a resync request calls raise_barrier()
> between one request and the next, then the next has to wait for the
> resync request, which has to wait for the first request.
> As the first request will be stuck in the queue in
> generic_make_request(), you get a deadlock.
> It is much safer to:
> 
>     if (need to split) {
>         split = bio_split(bio, ...)
>         bio_chain(...)
>         make_request_fn(split);
>         generic_make_request(bio);
>    } else
>         make_request_fn(mddev, bio);
> 
> This way we first process the initial section of the bio (in 'split')
> which will queue some requests to the underlying devices.  These
> requests will be queued in generic_make_request.
> Then we queue the remainder of the bio, which will be added to the end
> of the generic_make_request queue.
> Then we return.
> generic_make_request() will pop the lower-level device requests off the
> queue and handle them first.  Then it will process the remainder
> of the original bio once the first section has been fully processed.

Good point! raid10 has the same problem. It looks like this doesn't solve the
issue for devices stacked three levels deep, though.

I know you guys are working on this issue in the block layer. Should we fix the
issue on the MD side (for two-level stacked devices) or wait for the block layer patch?

Thanks,
Shaohua 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-17 19:41   ` Shaohua Li
@ 2017-02-18  2:40     ` Coly Li
  2017-02-19 23:42     ` NeilBrown
  1 sibling, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-18  2:40 UTC (permalink / raw)
  To: Shaohua Li, NeilBrown
  Cc: linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/18 3:41 AM, Shaohua Li wrote:
> On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:

[snip]
>>> @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>>>  
>>>  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>>>  {
>>> -	struct r1conf *conf = mddev->private;
>>> -	struct r1bio *r1_bio;
>>> +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
>>> +	struct bio *split;
>>> +	sector_t sectors;
>>>  
>>> -	/*
>>> -	 * make_request() can abort the operation when read-ahead is being
>>> -	 * used and no empty request is available.
>>> -	 *
>>> -	 */
>>> -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>>> +	make_request_fn = (bio_data_dir(bio) == READ) ?
>>> +			  raid1_read_request : raid1_write_request;
>>>  
>>> -	r1_bio->master_bio = bio;
>>> -	r1_bio->sectors = bio_sectors(bio);
>>> -	r1_bio->state = 0;
>>> -	r1_bio->mddev = mddev;
>>> -	r1_bio->sector = bio->bi_iter.bi_sector;
>>> -
>>> -	/*
>>> -	 * We might need to issue multiple reads to different devices if there
>>> -	 * are bad blocks around, so we keep track of the number of reads in
>>> -	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
>>> -	 * no locking will be needed when requests complete.  If it is
>>> -	 * non-zero, then it is the number of not-completed requests.
>>> -	 */
>>> -	bio->bi_phys_segments = 0;
>>> -	bio_clear_flag(bio, BIO_SEG_VALID);
>>> +	/* if bio exceeds barrier unit boundary, split it */
>>> +	do {
>>> +		sectors = align_to_barrier_unit_end(
>>> +				bio->bi_iter.bi_sector, bio_sectors(bio));
>>> +		if (sectors < bio_sectors(bio)) {
>>> +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
>>> +			bio_chain(split, bio);
>>> +		} else {
>>> +			split = bio;
>>> +		}
>>>  
>>> -	if (bio_data_dir(bio) == READ)
>>> -		raid1_read_request(mddev, bio, r1_bio);
>>> -	else
>>> -		raid1_write_request(mddev, bio, r1_bio);
>>> +		make_request_fn(mddev, split);
>>> +	} while (split != bio);
>>>  }
>>
>> I know you are going to change this as Shaohua wants the splitting
>> to happen in a separate function, which I agree with, but there is
>> something else wrong here.
>> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
>> It is OK for simple devices, but when one request can wait for another
>> request to the same device it can deadlock.
>> This can happen with raid1.  If a resync request calls raise_barrier()
>> between one request and the next, then the next has to wait for the
>> resync request, which has to wait for the first request.
>> As the first request will be stuck in the queue in
>> generic_make_request(), you get a deadlock.
>> It is much safer to:
>>
>>     if (need to split) {
>>         split = bio_split(bio, ...)
>>         bio_chain(...)
>>         make_request_fn(split);
>>         generic_make_request(bio);
>>    } else
>>         make_request_fn(mddev, bio);
>>
>> This way we first process the initial section of the bio (in 'split')
>> which will queue some requests to the underlying devices.  These
>> requests will be queued in generic_make_request.
>> Then we queue the remainder of the bio, which will be added to the end
>> of the generic_make_request queue.
>> Then we return.
>> generic_make_request() will pop the lower-level device requests off the
>> queue and handle them first.  Then it will process the remainder
>> of the original bio once the first section has been fully processed.
> 
> Good point! raid10 has the same problem. It looks like this doesn't solve the
> issue for devices stacked three levels deep, though.
> 
> I know you guys are working on this issue in the block layer. Should we fix the
> issue on the MD side (for two-level stacked devices) or wait for the block layer patch?

Obviously I don't get the point at all ... Could you please explain a
little more about why it is an issue and how it may happen? Thanks a
lot :-)

Coly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-17 19:41   ` Shaohua Li
  2017-02-18  2:40     ` Coly Li
@ 2017-02-19 23:42     ` NeilBrown
  1 sibling, 0 replies; 43+ messages in thread
From: NeilBrown @ 2017-02-19 23:42 UTC (permalink / raw)
  To: Shaohua Li, NeilBrown
  Cc: colyli, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 12744 bytes --]

On Fri, Feb 17 2017, Shaohua Li wrote:

> On Thu, Feb 16, 2017 at 06:04:00PM +1100, NeilBrown wrote:
>> On Thu, Feb 16 2017, colyli@suse.de wrote:
>> 
>> > 'Commit 79ef3a8aa1cb ("raid1: Rewrite the implementation of iobarrier.")'
>> > introduces a sliding resync window for raid1 I/O barrier, this idea limits
>> > I/O barriers to happen only inside a slidingresync window, for regular
>> > I/Os out of this resync window they don't need to wait for barrier any
>> > more. On large raid1 device, it helps a lot to improve parallel writing
>> > I/O throughput when there are background resync I/Os performing at
>> > same time.
>> >
>> > The idea of sliding resync widow is awesome, but code complexity is a
>> > challenge. Sliding resync window requires several veriables to work
>> 
>> variables
>> 
>> > collectively, this is complexed and very hard to make it work correctly.
>> > Just grep "Fixes: 79ef3a8aa1" in kernel git log, there are 8 more patches
>> > to fix the original resync window patch. This is not the end, any further
>> > related modification may easily introduce more regreassion.
>> >
>> > Therefore I decide to implement a much simpler raid1 I/O barrier, by
>> > removing resync window code, I believe life will be much easier.
>> >
>> > The brief idea of the simpler barrier is,
>> >  - Do not maintain a logbal unique resync window
>> 
>> global
>> 
>> >  - Use multiple hash buckets to reduce I/O barrier conflictions, regular
>> 
>> conflicts
>> 
>> >    I/O only has to wait for a resync I/O when both them have same barrier
>> >    bucket index, vice versa.
>> >  - I/O barrier can be recuded to an acceptable number if there are enought
>> 
>> reduced
>> enough
>> 
>> >    barrier buckets
>> >
>> > Here I explain how the barrier buckets are designed,
>> >  - BARRIER_UNIT_SECTOR_SIZE
>> >    The whole LBA address space of a raid1 device is divided into multiple
>> >    barrier units, by the size of BARRIER_UNIT_SECTOR_SIZE.
>> >    Bio request won't go across border of barrier unit size, that means
>> 
>> requests
>> 
>> >    maximum bio size is BARRIER_UNIT_SECTOR_SIZE<<9 (64MB) in bytes.
>> >    For random I/O 64MB is large enough for both read and write requests,
>> >    for sequential I/O considering underlying block layer may merge them
>> >    into larger requests, 64MB is still good enough.
>> >    Neil also points out that for resync operation, "we want the resync to
>> >    move from region to region fairly quickly so that the slowness caused
>> >    by having to synchronize with the resync is averaged out over a fairly
>> >    small time frame". For full speed resync, 64MB should take less then 1
>> >    second. When resync is competing with other I/O, it could take up a few
>> >    minutes. Therefore 64MB size is fairly good range for resync.
>> >
>> >  - BARRIER_BUCKETS_NR
>> >    There are BARRIER_BUCKETS_NR buckets in total, which is defined by,
>> >         #define BARRIER_BUCKETS_NR_BITS   (PAGE_SHIFT - 2)
>> >         #define BARRIER_BUCKETS_NR        (1<<BARRIER_BUCKETS_NR_BITS)
>> >    this patch makes the bellowed members of struct r1conf from integer
>> >    to array of integers,
>> >         -       int                     nr_pending;
>> >         -       int                     nr_waiting;
>> >         -       int                     nr_queued;
>> >         -       int                     barrier;
>> >         +       int                     *nr_pending;
>> >         +       int                     *nr_waiting;
>> >         +       int                     *nr_queued;
>> >         +       int                     *barrier;
>> >    number of the array elements is defined as BARRIER_BUCKETS_NR. For 4KB
>> >    kernel space page size, (PAGE_SHIFT - 2) indecates there are 1024 I/O
>> >    barrier buckets, and each array of integers occupies single memory page.
>> >    1024 means for a request which is smaller than the I/O barrier unit size
>> >    has ~0.1% chance to wait for resync to pause, which is quite a small
>> >    enough fraction. Also requesting single memory page is more friendly to
>> >    kernel page allocator than larger memory size.
>> >
>> >  - I/O barrier bucket is indexed by bio start sector
>> >    If multiple I/O requests hit different I/O barrier units, they only need
>> >    to compete I/O barrier with other I/Os which hit the same I/O barrier
>> >    bucket index with each other. The index of a barrier bucket which a
>> >    bio should look for is calculated by sector_to_idx() which is defined
>> >    in raid1.h as an inline function,
>> >         static inline int sector_to_idx(sector_t sector)
>> >         {
>> >                 return hash_long(sector >> BARRIER_UNIT_SECTOR_BITS,
>> >                                 BARRIER_BUCKETS_NR_BITS);
>> >         }
>> >    Here sector_nr is the start sector number of a bio.
>> 
>> "hash_long() is used so that sequential writes in are region of the
>> array which is not being resynced will not consistently align with
>> the buckets that are being sequentially resynced, as described below"
>> 
>> >
>> >  - Single bio won't go across boundary of a I/O barrier unit
>> >    If a request goes across boundary of barrier unit, it will be split. A
>> >    bio may be split in raid1_make_request() or raid1_sync_request(), if
>> >    sectors returned by align_to_barrier_unit_end() is small than original
>> 
>> smaller
>> 
>> >    bio size.
>> >
>> > Comparing to single sliding resync window,
>> >  - Currently resync I/O grows linearly, therefore regular and resync I/O
>> >    will have confliction within a single barrier units. So the I/O
>> 
>> ... will conflict within ...
>> 
>> >    behavior is similar to single sliding resync window.
>> >  - But a barrier unit bucket is shared by all barrier units with identical
>> >    barrier uinit index, the probability of confliction might be higher
>> >    than single sliding resync window, in condition that writing I/Os
>> >    always hit barrier units which have identical barrier bucket indexs with
>> >    the resync I/Os. This is a very rare condition in real I/O work loads,
>> >    I cannot imagine how it could happen in practice.
>> >  - Therefore we can achieve a good enough low confliction rate with much
>> 
>> ... low conflict rate ...
>> 
>> >    simpler barrier algorithm and implementation.
>> >
>> > There are two changes should be noticed,
>> >  - In raid1d(), I change the code to decrease conf->nr_pending[idx] into
>> >    single loop, it looks like this,
>> >         spin_lock_irqsave(&conf->device_lock, flags);
>> >         conf->nr_queued[idx]--;
>> >         spin_unlock_irqrestore(&conf->device_lock, flags);
>> >    This change generates more spin lock operations, but in next patch of
>> >    this patch set, it will be replaced by a single line code,
>> >         atomic_dec(&conf->nr_queueud[idx]);
>> >    So we don't need to worry about spin lock cost here.
>> >  - Mainline raid1 code split original raid1_make_request() into
>> >    raid1_read_request() and raid1_write_request(). If the original bio
>> >    goes across an I/O barrier unit size, this bio will be split before
>> >    calling raid1_read_request() or raid1_write_request(),  this change
>> >    the code logic more simple and clear.
>> >  - In this patch wait_barrier() is moved from raid1_make_request() to
>> >    raid1_write_request(). In raid_read_request(), original wait_barrier()
>> >    is replaced by raid1_read_request().
>> >    The differnece is wait_read_barrier() only waits if array is frozen,
>> >    using different barrier function in different code path makes the code
>> >    more clean and easy to read.
>> 
>> Thank you for putting the effort into writing a comprehensive change
>> description.  I really appreciate it.
>> 
>> >  
>> > @@ -1447,36 +1501,26 @@ static void raid1_write_request(struct mddev *mddev, struct bio *bio,
>> >  
>> >  static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>> >  {
>> > -	struct r1conf *conf = mddev->private;
>> > -	struct r1bio *r1_bio;
>> > +	void (*make_request_fn)(struct mddev *mddev, struct bio *bio);
>> > +	struct bio *split;
>> > +	sector_t sectors;
>> >  
>> > -	/*
>> > -	 * make_request() can abort the operation when read-ahead is being
>> > -	 * used and no empty request is available.
>> > -	 *
>> > -	 */
>> > -	r1_bio = mempool_alloc(conf->r1bio_pool, GFP_NOIO);
>> > +	make_request_fn = (bio_data_dir(bio) == READ) ?
>> > +			  raid1_read_request : raid1_write_request;
>> >  
>> > -	r1_bio->master_bio = bio;
>> > -	r1_bio->sectors = bio_sectors(bio);
>> > -	r1_bio->state = 0;
>> > -	r1_bio->mddev = mddev;
>> > -	r1_bio->sector = bio->bi_iter.bi_sector;
>> > -
>> > -	/*
>> > -	 * We might need to issue multiple reads to different devices if there
>> > -	 * are bad blocks around, so we keep track of the number of reads in
>> > -	 * bio->bi_phys_segments.  If this is 0, there is only one r1_bio and
>> > -	 * no locking will be needed when requests complete.  If it is
>> > -	 * non-zero, then it is the number of not-completed requests.
>> > -	 */
>> > -	bio->bi_phys_segments = 0;
>> > -	bio_clear_flag(bio, BIO_SEG_VALID);
>> > +	/* if bio exceeds barrier unit boundary, split it */
>> > +	do {
>> > +		sectors = align_to_barrier_unit_end(
>> > +				bio->bi_iter.bi_sector, bio_sectors(bio));
>> > +		if (sectors < bio_sectors(bio)) {
>> > +			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
>> > +			bio_chain(split, bio);
>> > +		} else {
>> > +			split = bio;
>> > +		}
>> >  
>> > -	if (bio_data_dir(bio) == READ)
>> > -		raid1_read_request(mddev, bio, r1_bio);
>> > -	else
>> > -		raid1_write_request(mddev, bio, r1_bio);
>> > +		make_request_fn(mddev, split);
>> > +	} while (split != bio);
>> >  }
>> 
>> I know you are going to change this as Shaohua wants the splitting
>> to happen in a separate function, which I agree with, but there is
>> something else wrong here.
>> Calling bio_split/bio_chain repeatedly in a loop is dangerous.
>> It is OK for simple devices, but when one request can wait for another
>> request to the same device it can deadlock.
>> This can happen with raid1.  If a resync request calls raise_barrier()
>> between one request and the next, then the next has to wait for the
>> resync request, which has to wait for the first request.
>> As the first request will be stuck in the queue in
>> generic_make_request(), you get a deadlock.
>> It is much safer to:
>> 
>>     if (need to split) {
>>         split = bio_split(bio, ...)
>>         bio_chain(...)
>>         make_request_fn(split);
>>         generic_make_request(bio);
>>    } else
>>         make_request_fn(mddev, bio);
>> 
>> This way we first process the initial section of the bio (in 'split')
>> which will queue some requests to the underlying devices.  These
>> requests will be queued in generic_make_request.
>> Then we queue the remainder of the bio, which will be added to the end
>> of the generic_make_request queue.
>> Then we return.
>> generic_make_request() will pop the lower-level device requests off the
>> queue and handle them first.  Then it will process the remainder
>> of the original bio once the first section has been fully processed.
>
> Good point! raid10 has the same problem. It looks like this doesn't solve the
> issue for devices stacked three levels deep, though.
>
> I know you guys are working on this issue in the block layer. Should we fix the
> issue on the MD side (for two-level stacked devices) or wait for the block layer patch?

We cannot fix everything at the block layer, or at the individual device
layer.  We need changes in both.
I think that looping over bios in a device driver is wrong and can
easily lead to deadlocks.  We should remove that from md.
If the block layer gets fixed the way I want it to, then we could move
the generic_make_request() call earlier, so that the above could be

>>     if (need to split) {
>>         split = bio_split(bio, ...)
>>         bio_chain(...)
>>         generic_make_request(bio);
>>         bio = split()
>>    }
>>    make_request_fn(mddev, bio);

which is slightly simpler.  But the original would still work.

So yes, I think we need this change in md/raid1. I suspect that if
you built a kernel with a smaller BARRIER_UNIT_SECTOR_BITS  - e.g. 4 -
you could very easily trigger a deadlock with md/raid1 on scsi.
At 17, it is not quite so easy, but it is a real possibility.
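
For what it is worth, a debug-only sketch of that experiment, assuming the
constants are defined in raid1.h as in the patch (with BARRIER_UNIT_SECTOR_SIZE
derived from BARRIER_UNIT_SECTOR_BITS):

        /*
         * Debug-only sketch, not for a real build: shrink the barrier unit
         * from 2^17 sectors (64MB) to 2^4 sectors (8KB), so regular I/O and
         * resync walk through the buckets much faster and collisions between
         * adjacent barrier units are hit far sooner.
         */
        #define BARRIER_UNIT_SECTOR_BITS        4
        #define BARRIER_UNIT_SECTOR_SIZE        (1 << BARRIER_UNIT_SECTOR_BITS)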

I've had similar deadlocks reported before when the code wasn't quite
careful enough.

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-17  6:56   ` Coly Li
@ 2017-02-19 23:50     ` NeilBrown
  2017-02-20  2:51       ` NeilBrown
  0 siblings, 1 reply; 43+ messages in thread
From: NeilBrown @ 2017-02-19 23:50 UTC (permalink / raw)
  To: Coly Li, NeilBrown, linux-raid
  Cc: Shaohua Li, Johannes Thumshirn, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 2454 bytes --]

On Fri, Feb 17 2017, Coly Li wrote:

> On 2017/2/16 3:04 PM, NeilBrown wrote:
>> I know you are going to change this as Shaohua wants the splitting to
>> happen in a separate function, which I agree with, but there is 
>> something else wrong here. Calling bio_split/bio_chain repeatedly
>> in a loop is dangerous. It is OK for simple devices, but when one
>> request can wait for another request to the same device it can
>> deadlock. This can happen with raid1.  If a resync request calls
>> raise_barrier() between one request and the next, then the next has
>> to wait for the resync request, which has to wait for the first
>> request. As the first request will be stuck in the queue in 
>> generic_make_request(), you get a deadlock.
>
> For md raid1, queue in generic_make_request(), can I understand it as
> bio_list_on_stack in this function? And queue in underlying device,
> can I understand it as the data structures like plug->pending and
> conf->pending_bio_list ?

Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
is the only queue I am talking about.  I'm not referring to
plug->pending or conf->pending_bio_list at all.

>
> I still don't get the point of deadlock, let me try to explain why I
> don't see the possible deadlock. If a bio is split, and the first part
> is processed by make_request_fn(), and then a resync comes and it will
> raise a barrier, there are 3 possible conditions,
> - the resync I/O tries to raise barrier on same bucket of the first
> regular bio. Then the resync task has to wait to the first bio drops
> its conf->nr_pending[idx]

Not quite.
First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
to be zero.  We can assume this happens immediately.
Then the resync_task will increment ->barrier[idx].
Only then will it wait for the first bio to drop ->nr_pending[idx].
The processing of that first bio will have submitted bios to the
underlying device, and they will be in the bio_list_on_stack queue, and
will not be processed until raid1_make_request() completes.

The loop in raid1_make_request() will then call make_request_fn() which
will call wait_barrier(), which will wait for ->barrier[idx] to be zero.

So raid1_make_request is waiting for the resync to progress, and resync
is waiting for a bio which is on bio_list_on_stack which won't be
processed until raid1_make_request() completes.
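
Laid out as a schematic timeline (illustration only, with names as in the
patch; this is not code from any tree):

        /*
         *  raid1_make_request(bio)              resync thread
         *  ------------------------------------------------------------------
         *  split = bio_split(bio); bio_chain()
         *  make_request_fn(split)
         *    _wait_barrier(idx) -> proceeds,
         *      ->nr_pending[idx]++
         *    child bios queued on
         *    current->bio_list (not yet issued)
         *                                       raise_barrier(idx)
         *                                         waits for ->nr_waiting[idx] == 0
         *                                         ->barrier[idx]++
         *                                         waits for ->nr_pending[idx] == 0
         *                                         (held by the first half above)
         *  make_request_fn(remainder of bio)
         *    _wait_barrier(idx)
         *      waits for ->barrier[idx] == 0    <-- deadlock, when both halves
         *                                           hash to the same idx
         */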

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-19 23:50     ` NeilBrown
@ 2017-02-20  2:51       ` NeilBrown
  2017-02-20  7:04         ` Shaohua Li
  0 siblings, 1 reply; 43+ messages in thread
From: NeilBrown @ 2017-02-20  2:51 UTC (permalink / raw)
  To: Coly Li, NeilBrown, linux-raid
  Cc: Shaohua Li, Johannes Thumshirn, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 2936 bytes --]

On Mon, Feb 20 2017, NeilBrown wrote:

> On Fri, Feb 17 2017, Coly Li wrote:
>
>> On 2017/2/16 3:04 PM, NeilBrown wrote:
>>> I know you are going to change this as Shaohua wants the splitting to
>>> happen in a separate function, which I agree with, but there is 
>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>> in a loop is dangerous. It is OK for simple devices, but when one
>>> request can wait for another request to the same device it can
>>> deadlock. This can happen with raid1.  If a resync request calls
>>> raise_barrier() between one request and the next, then the next has
>>> to wait for the resync request, which has to wait for the first
>>> request. As the first request will be stuck in the queue in 
>>> generic_make_request(), you get a deadlock.
>>
>> For md raid1, queue in generic_make_request(), can I understand it as
>> bio_list_on_stack in this function? And queue in underlying device,
>> can I understand it as the data structures like plug->pending and
>> conf->pending_bio_list ?
>
> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
> is the only queue I am talking about.  I'm not referring to
> plug->pending or conf->pending_bio_list at all.
>
>>
>> I still don't get the point of deadlock, let me try to explain why I
>> don't see the possible deadlock. If a bio is split, and the first part
>> is processed by make_request_fn(), and then a resync comes and it will
>> raise a barrier, there are 3 possible conditions,
>> - the resync I/O tries to raise barrier on same bucket of the first
>> regular bio. Then the resync task has to wait to the first bio drops
>> its conf->nr_pending[idx]
>
> Not quite.
> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
> to be zero.  We can assume this happens immediately.
> Then the resync_task will increment ->barrier[idx].
> Only then will it wait for the first bio to drop ->nr_pending[idx].
> The processing of that first bio will have submitted bios to the
> underlying device, and they will be in the bio_list_on_stack queue, and
> will not be processed until raid1_make_request() completes.
>
> The loop in raid1_make_request() will then call make_request_fn() which
> will call wait_barrier(), which will wait for ->barrier[idx] to be
> zero.

Thinking more carefully about this.. the 'idx' that the second bio will
wait for will normally be different, so there won't be a deadlock after
all.

However it is possible for hash_long() to produce the same idx for two
consecutive barrier_units so there is still the possibility of a
deadlock, though it isn't as likely as I thought at first.
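
If it helps to quantify that, a small sketch (illustrative only, assuming
sector_to_idx() and BARRIER_UNIT_SECTOR_SIZE as defined in the patch) which
counts how often two adjacent barrier units map to the same bucket:

        /* Illustrative sketch only: count adjacent barrier units that share a
         * bucket under hash_long().
         */
        static void count_adjacent_bucket_collisions(sector_t array_sectors)
        {
                sector_t s;
                unsigned long hits = 0;

                for (s = 0; s + BARRIER_UNIT_SECTOR_SIZE < array_sectors;
                     s += BARRIER_UNIT_SECTOR_SIZE)
                        if (sector_to_idx(s) ==
                            sector_to_idx(s + BARRIER_UNIT_SECTOR_SIZE))
                                hits++;

                pr_info("adjacent barrier units sharing a bucket: %lu\n", hits);
        }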

NeilBrown


>
> So raid1_make_request is waiting for the resync to progress, and resync
> is waiting for a bio which is on bio_list_on_stack which won't be
> processed until raid1_make_request() completes.
>
> NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-20  2:51       ` NeilBrown
@ 2017-02-20  7:04         ` Shaohua Li
  2017-02-20  8:07           ` Coly Li
  0 siblings, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-20  7:04 UTC (permalink / raw)
  To: NeilBrown
  Cc: Coly Li, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang

On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> On Mon, Feb 20 2017, NeilBrown wrote:
> 
> > On Fri, Feb 17 2017, Coly Li wrote:
> >
> >> On 2017/2/16 3:04 PM, NeilBrown wrote:
> >>> I know you are going to change this as Shaohua wants the splitting to
> >>> happen in a separate function, which I agree with, but there is 
> >>> something else wrong here. Calling bio_split/bio_chain repeatedly
> >>> in a loop is dangerous. It is OK for simple devices, but when one
> >>> request can wait for another request to the same device it can
> >>> deadlock. This can happen with raid1.  If a resync request calls
> >>> raise_barrier() between one request and the next, then the next has
> >>> to wait for the resync request, which has to wait for the first
> >>> request. As the first request will be stuck in the queue in 
> >>> generic_make_request(), you get a deadlock.
> >>
> >> For md raid1, queue in generic_make_request(), can I understand it as
> >> bio_list_on_stack in this function? And queue in underlying device,
> >> can I understand it as the data structures like plug->pending and
> >> conf->pending_bio_list ?
> >
> > Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
> > is the only queue I am talking about.  I'm not referring to
> > plug->pending or conf->pending_bio_list at all.
> >
> >>
> >> I still don't get the point of deadlock, let me try to explain why I
> >> don't see the possible deadlock. If a bio is split, and the first part
> >> is processed by make_request_fn(), and then a resync comes and it will
> >> raise a barrier, there are 3 possible conditions,
> >> - the resync I/O tries to raise barrier on same bucket of the first
> >> regular bio. Then the resync task has to wait to the first bio drops
> >> its conf->nr_pending[idx]
> >
> > Not quite.
> > First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
> > to be zero.  We can assume this happens immediately.
> > Then the resync_task will increment ->barrier[idx].
> > Only then will it wait for the first bio to drop ->nr_pending[idx].
> > The processing of that first bio will have submitted bios to the
> > underlying device, and they will be in the bio_list_on_stack queue, and
> > will not be processed until raid1_make_request() completes.
> >
> > The loop in raid1_make_request() will then call make_request_fn() which
> > will call wait_barrier(), which will wait for ->barrier[idx] to be
> > zero.
> 
> Thinking more carefully about this.. the 'idx' that the second bio will
> wait for will normally be different, so there won't be a deadlock after
> all.
> 
> However it is possible for hash_long() to produce the same idx for two
> consecutive barrier_units so there is still the possibility of a
> deadlock, though it isn't as likely as I thought at first.

I folded the function pointer issue Neil pointed out into Coly's original
patch, and also fixed a 'use-after-free' bug. For the deadlock issue, I'll add
the patch below; please check.

Thanks,
Shaohua

From ee9c98138bcdf8bceef384a68f49258b6b8b8c6d Mon Sep 17 00:00:00 2001
Message-Id: <ee9c98138bcdf8bceef384a68f49258b6b8b8c6d.1487573888.git.shli@fb.com>
From: Shaohua Li <shli@fb.com>
Date: Sun, 19 Feb 2017 22:18:32 -0800
Subject: [PATCH] md/raid1/10: fix potential deadlock

Neil Brown pointed out a potential deadlock in the raid10 code with
bio_split/chain. The raid1 code could have the same issue, but the recent
barrier rework makes it less likely to happen. The deadlock happens in the
sequence below:

1. generic_make_request(bio), this will set current->bio_list
2. raid10_make_request() will split bio into bio1 and bio2
3. __make_request(bio1), wait_barrier(), adds the underlying disk bios to
current->bio_list
4. __make_request(bio2), wait_barrier()

If raise_barrier() happens between 3 & 4, since wait_barrier() ran at 3,
raise_barrier() waits for the IO completion from 3. And since raise_barrier()
sets the barrier, 4 waits for raise_barrier(). But the IO from 3 can't be
dispatched because raid10_make_request() hasn't finished yet.

The solution is to adjust the IO ordering. Quotes from Neil:
"
It is much safer to:

    if (need to split) {
        split = bio_split(bio, ...)
        bio_chain(...)
        make_request_fn(split);
        generic_make_request(bio);
   } else
        make_request_fn(mddev, bio);

This way we first process the initial section of the bio (in 'split')
which will queue some requests to the underlying devices.  These
requests will be queued in generic_make_request.
Then we queue the remainder of the bio, which will be added to the end
of the generic_make_request queue.
Then we return.
generic_make_request() will pop the lower-level device requests off the
queue and handle them first.  Then it will process the remainder
of the original bio once the first section has been fully processed.
"

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
Suggested-by: NeilBrown <neilb@suse.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid1.c  | 28 ++++++++++++++--------------
 drivers/md/raid10.c | 41 ++++++++++++++++++++---------------------
 2 files changed, 34 insertions(+), 35 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 676f72d..e55d865 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -1566,21 +1566,21 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
 	sector_t sectors;
 
 	/* if bio exceeds barrier unit boundary, split it */
-	do {
-		sectors = align_to_barrier_unit_end(
-				bio->bi_iter.bi_sector, bio_sectors(bio));
-		if (sectors < bio_sectors(bio)) {
-			split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
+	sectors = align_to_barrier_unit_end(
+			bio->bi_iter.bi_sector, bio_sectors(bio));
+	if (sectors < bio_sectors(bio)) {
+		split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+	} else {
+		split = bio;
+	}
 
-		if (bio_data_dir(split) == READ)
-			raid1_read_request(mddev, split);
-		else
-			raid1_write_request(mddev, split);
-	} while (split != bio);
+	if (bio_data_dir(split) == READ)
+		raid1_read_request(mddev, split);
+	else
+		raid1_write_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);
 }
 
 static void raid1_status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a1f8e98..b495049 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1551,28 +1551,27 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
 		return;
 	}
 
-	do {
-
-		/*
-		 * If this request crosses a chunk boundary, we need to split
-		 * it.
-		 */
-		if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
-			     bio_sectors(bio) > chunk_sects
-			     && (conf->geo.near_copies < conf->geo.raid_disks
-				 || conf->prev.near_copies <
-				 conf->prev.raid_disks))) {
-			split = bio_split(bio, chunk_sects -
-					  (bio->bi_iter.bi_sector &
-					   (chunk_sects - 1)),
-					  GFP_NOIO, fs_bio_set);
-			bio_chain(split, bio);
-		} else {
-			split = bio;
-		}
+	/*
+	 * If this request crosses a chunk boundary, we need to split
+	 * it.
+	 */
+	if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
+		     bio_sectors(bio) > chunk_sects
+		     && (conf->geo.near_copies < conf->geo.raid_disks
+			 || conf->prev.near_copies <
+			 conf->prev.raid_disks))) {
+		split = bio_split(bio, chunk_sects -
+				  (bio->bi_iter.bi_sector &
+				   (chunk_sects - 1)),
+				  GFP_NOIO, fs_bio_set);
+		bio_chain(split, bio);
+	} else {
+		split = bio;
+	}
 
-		__make_request(mddev, split);
-	} while (split != bio);
+	__make_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);
 
 	/* In case raid10d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
-- 
2.9.3


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-20  7:04         ` Shaohua Li
@ 2017-02-20  8:07           ` Coly Li
  2017-02-20  8:30             ` Coly Li
                               ` (2 more replies)
  0 siblings, 3 replies; 43+ messages in thread
From: Coly Li @ 2017-02-20  8:07 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang




Sent from my iPhone
> On Feb 20, 2017, at 3:04 PM, Shaohua Li <shli@kernel.org> wrote:
> 
>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>> 
>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>> 
>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote:
>>>>> I know you are going to change this as Shaohua wants the splitting to
>>>>> happen in a separate function, which I agree with, but there is 
>>>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>>>> in a loop is dangerous. It is OK for simple devices, but when one
>>>>> request can wait for another request to the same device it can
>>>>> deadlock. This can happen with raid1.  If a resync request calls
>>>>> raise_barrier() between one request and the next, then the next has
>>>>> to wait for the resync request, which has to wait for the first
>>>>> request. As the first request will be stuck in the queue in 
>>>>> generic_make_request(), you get a deadlock.
>>>> 
>>>> For md raid1, queue in generic_make_request(), can I understand it as
>>>> bio_list_on_stack in this function? And queue in underlying device,
>>>> can I understand it as the data structures like plug->pending and
>>>> conf->pending_bio_list ?
>>> 
>>> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
>>> is the only queue I am talking about.  I'm not referring to
>>> plug->pending or conf->pending_bio_list at all.
>>> 
>>>> 
>>>> I still don't get the point of deadlock, let me try to explain why I
>>>> don't see the possible deadlock. If a bio is split, and the first part
>>>> is processed by make_request_fn(), and then a resync comes and it will
>>>> raise a barrier, there are 3 possible conditions,
>>>> - the resync I/O tries to raise barrier on same bucket of the first
>>>> regular bio. Then the resync task has to wait to the first bio drops
>>>> its conf->nr_pending[idx]
>>> 
>>> Not quite.
>>> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
>>> to be zero.  We can assume this happens immediately.
>>> Then the resync_task will increment ->barrier[idx].
>>> Only then will it wait for the first bio to drop ->nr_pending[idx].
>>> The processing of that first bio will have submitted bios to the
>>> underlying device, and they will be in the bio_list_on_stack queue, and
>>> will not be processed until raid1_make_request() completes.
>>> 
>>> The loop in raid1_make_request() will then call make_request_fn() which
>>> will call wait_barrier(), which will wait for ->barrier[idx] to be
>>> zero.
>> 
>> Thinking more carefully about this.. the 'idx' that the second bio will
>> wait for will normally be different, so there won't be a deadlock after
>> all.
>> 
>> However it is possible for hash_long() to produce the same idx for two
>> consecutive barrier_units so there is still the possibility of a
>> deadlock, though it isn't as likely as I thought at first.
> 
> I folded the function pointer issue Neil pointed out into Coly's original
> patch, and also fixed a 'use-after-free' bug. For the deadlock issue, I'll add
> the patch below; please check.
> 
> Thanks,
> Shaohua
> 

Hmm, please hold, I am still thinking about it. With barrier buckets and hash_long(), I don't see the deadlock yet. For raid10 it might happen, but once we have barrier buckets on it, there will be no deadlock.

My question is, this deadlock only happens when a big bio is split, the resulting small bios are contiguous, and the resync I/O visits the barrier buckets in sequential order too. If adjacent split regular bios or resync bios hit the same barrier bucket, it would be a very big failure of the hash design, and should have been found already. But no one has complained about it, so I cannot convince myself that the deadlock is real with I/O barrier buckets (this is what Neil is concerned about).

For the function pointer assignment, it is because I see a branch inside a loop. If I use a function pointer, I can avoid the redundant branch inside the loop. raid1_read_request() and raid1_write_request() are not simple functions; I don't know whether gcc will inline them or not, so I am about to check the disassembled code.

The loop in raid1_make_request() is quite high level; I am not sure whether CPU branch prediction will work correctly, especially when it is a big DISCARD bio. Using a function pointer may remove a possible branch.

So I need to check what we gain and lose by using a function pointer or not. If it is not urgent, please hold this patch for a while.

The only thing I worry about in the patch below is: if a very big DISCARD bio comes in, will the kernel stack tend to overflow?

Thanks.

Coly





> From ee9c98138bcdf8bceef384a68f49258b6b8b8c6d Mon Sep 17 00:00:00 2001
> Message-Id: <ee9c98138bcdf8bceef384a68f49258b6b8b8c6d.1487573888.git.shli@fb.com>
> From: Shaohua Li <shli@fb.com>
> Date: Sun, 19 Feb 2017 22:18:32 -0800
> Subject: [PATCH] md/raid1/10: fix potential deadlock
> 
> Neil Brown pointed out a potential deadlock in the raid10 code with
> bio_split/chain. The raid1 code could have the same issue, but the recent
> barrier rework makes it less likely to happen. The deadlock happens in the
> sequence below:
> 
> 1. generic_make_request(bio), this will set current->bio_list
> 2. raid10_make_request() will split bio into bio1 and bio2
> 3. __make_request(bio1), wait_barrier(), adds the underlying disk bios to
> current->bio_list
> 4. __make_request(bio2), wait_barrier()
> 
> If raise_barrier() happens between 3 & 4, since wait_barrier() ran at 3,
> raise_barrier() waits for the IO completion from 3. And since raise_barrier()
> sets the barrier, 4 waits for raise_barrier(). But the IO from 3 can't be
> dispatched because raid10_make_request() hasn't finished yet.
> 
> The solution is to adjust the IO ordering. Quotes from Neil:
> "
> It is much safer to:
> 
>    if (need to split) {
>        split = bio_split(bio, ...)
>        bio_chain(...)
>        make_request_fn(split);
>        generic_make_request(bio);
>   } else
>        make_request_fn(mddev, bio);
> 
> This way we first process the initial section of the bio (in 'split')
> which will queue some requests to the underlying devices.  These
> requests will be queued in generic_make_request.
> Then we queue the remainder of the bio, which will be added to the end
> of the generic_make_request queue.
> Then we return.
> generic_make_request() will pop the lower-level device requests off the
> queue and handle them first.  Then it will process the remainder
> of the original bio once the first section has been fully processed.
> "
> 
> Cc: Coly Li <colyli@suse.de>
> Cc: stable@vger.kernel.org (v3.14+, only the raid10 part)
> Suggested-by: NeilBrown <neilb@suse.com>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
> drivers/md/raid1.c  | 28 ++++++++++++++--------------
> drivers/md/raid10.c | 41 ++++++++++++++++++++---------------------
> 2 files changed, 34 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 676f72d..e55d865 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1566,21 +1566,21 @@ static void raid1_make_request(struct mddev *mddev, struct bio *bio)
>    sector_t sectors;
> 
>    /* if bio exceeds barrier unit boundary, split it */
> -    do {
> -        sectors = align_to_barrier_unit_end(
> -                bio->bi_iter.bi_sector, bio_sectors(bio));
> -        if (sectors < bio_sectors(bio)) {
> -            split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> -            bio_chain(split, bio);
> -        } else {
> -            split = bio;
> -        }
> +    sectors = align_to_barrier_unit_end(
> +            bio->bi_iter.bi_sector, bio_sectors(bio));
> +    if (sectors < bio_sectors(bio)) {
> +        split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
> +        bio_chain(split, bio);
> +    } else {
> +        split = bio;
> +    }
> 
> -        if (bio_data_dir(split) == READ)
> -            raid1_read_request(mddev, split);
> -        else
> -            raid1_write_request(mddev, split);
> -    } while (split != bio);
> +    if (bio_data_dir(split) == READ)
> +        raid1_read_request(mddev, split);
> +    else
> +        raid1_write_request(mddev, split);
> +    if (split != bio)
> +        generic_make_request(bio);
> }
> 
> static void raid1_status(struct seq_file *seq, struct mddev *mddev)
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index a1f8e98..b495049 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -1551,28 +1551,27 @@ static void raid10_make_request(struct mddev *mddev, struct bio *bio)
>        return;
>    }
> 
> -    do {
> -
> -        /*
> -         * If this request crosses a chunk boundary, we need to split
> -         * it.
> -         */
> -        if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
> -                 bio_sectors(bio) > chunk_sects
> -                 && (conf->geo.near_copies < conf->geo.raid_disks
> -                 || conf->prev.near_copies <
> -                 conf->prev.raid_disks))) {
> -            split = bio_split(bio, chunk_sects -
> -                      (bio->bi_iter.bi_sector &
> -                       (chunk_sects - 1)),
> -                      GFP_NOIO, fs_bio_set);
> -            bio_chain(split, bio);
> -        } else {
> -            split = bio;
> -        }
> +    /*
> +     * If this request crosses a chunk boundary, we need to split
> +     * it.
> +     */
> +    if (unlikely((bio->bi_iter.bi_sector & chunk_mask) +
> +             bio_sectors(bio) > chunk_sects
> +             && (conf->geo.near_copies < conf->geo.raid_disks
> +             || conf->prev.near_copies <
> +             conf->prev.raid_disks))) {
> +        split = bio_split(bio, chunk_sects -
> +                  (bio->bi_iter.bi_sector &
> +                   (chunk_sects - 1)),
> +                  GFP_NOIO, fs_bio_set);
> +        bio_chain(split, bio);
> +    } else {
> +        split = bio;
> +    }
> 
> -        __make_request(mddev, split);
> -    } while (split != bio);
> +    __make_request(mddev, split);
> +    if (split != bio)
> +        generic_make_request(bio);
> 
>    /* In case raid10d snuck in to freeze_array */
>    wake_up(&conf->wait_barrier);
> -- 
> 2.9.3
> 


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-20  8:07           ` Coly Li
@ 2017-02-20  8:30             ` Coly Li
  2017-02-20 18:14             ` Wols Lists
  2017-02-21  0:29             ` NeilBrown
  2 siblings, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-20  8:30 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang


> On Feb 20, 2017, at 4:07 PM, Coly Li <colyli@suse.de> wrote:
> 
> 
> 
> 
> Sent from my iPhone
>>> On Feb 20, 2017, at 3:04 PM, Shaohua Li <shli@kernel.org> wrote:
>>> 
>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>> 
>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>> 
>>>>>> On 2017/2/16 3:04 PM, NeilBrown wrote:
>>>>>> I know you are going to change this as Shaohua wants the splitting to
>>>>>> happen in a separate function, which I agree with, but there is 
>>>>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>>>>> in a loop is dangerous. It is OK for simple devices, but when one
>>>>>> request can wait for another request to the same device it can
>>>>>> deadlock. This can happen with raid1.  If a resync request calls
>>>>>> raise_barrier() between one request and the next, then the next has
>>>>>> to wait for the resync request, which has to wait for the first
>>>>>> request. As the first request will be stuck in the queue in 
>>>>>> generic_make_request(), you get a deadlock.
>>>>> 
>>>>> For md raid1, queue in generic_make_request(), can I understand it as
>>>>> bio_list_on_stack in this function? And queue in underlying device,
>>>>> can I understand it as the data structures like plug->pending and
>>>>> conf->pending_bio_list ?
>>>> 
>>>> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
>>>> is the only queue I am talking about.  I'm not referring to
>>>> plug->pending or conf->pending_bio_list at all.
>>>> 
>>>>> 
>>>>> I still don't get the point of deadlock, let me try to explain why I
>>>>> don't see the possible deadlock. If a bio is split, and the first part
>>>>> is processed by make_request_fn(), and then a resync comes and it will
>>>>> raise a barrier, there are 3 possible conditions,
>>>>> - the resync I/O tries to raise barrier on same bucket of the first
>>>>> regular bio. Then the resync task has to wait to the first bio drops
>>>>> its conf->nr_pending[idx]
>>>> 
>>>> Not quite.
>>>> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
>>>> to be zero.  We can assume this happens immediately.
>>>> Then the resync_task will increment ->barrier[idx].
>>>> Only then will it wait for the first bio to drop ->nr_pending[idx].
>>>> The processing of that first bio will have submitted bios to the
>>>> underlying device, and they will be in the bio_list_on_stack queue, and
>>>> will not be processed until raid1_make_request() completes.
>>>> 
>>>> The loop in raid1_make_request() will then call make_request_fn() which
>>>> will call wait_barrier(), which will wait for ->barrier[idx] to be
>>>> zero.
>>> 
>>> Thinking more carefully about this.. the 'idx' that the second bio will
>>> wait for will normally be different, so there won't be a deadlock after
>>> all.
>>> 
>>> However it is possible for hash_long() to produce the same idx for two
>>> consecutive barrier_units so there is still the possibility of a
>>> deadlock, though it isn't as likely as I thought at first.
>> 
>> I folded the function pointer issue Neil pointed out into Coly's original
>> patch, and also fixed a 'use-after-free' bug. For the deadlock issue, I'll add
>> the patch below; please check.
>> 
>> Thanks,
>> Shaohua
>> 
> 
> Hmm, please hold, I am still thinking about it. With barrier buckets and hash_long(), I don't see the deadlock yet. For raid10 it might happen, but once we have barrier buckets on it, there will be no deadlock.
> 
> My question is, this deadlock only happens when a big bio is split, the resulting small bios are contiguous, and the resync I/O visits the barrier buckets in sequential order too. If adjacent split regular bios or resync bios hit the same barrier bucket, it would be a very big failure of the hash design, and should have been found already. But no one has complained about it, so I cannot convince myself that the deadlock is real with I/O barrier buckets (this is what Neil is concerned about).
> 
> For the function pointer assignment, it is because I see a branch inside a loop. If I use a function pointer, I can avoid the redundant branch inside the loop. raid1_read_request() and raid1_write_request() are not simple functions; I don't know whether gcc will inline them or not, so I am about to check the disassembled code.
> 
> The loop in raid1_make_request() is quite high level; I am not sure whether CPU branch prediction will work correctly, especially when it is a big DISCARD bio. Using a function pointer may remove a possible branch.
> 
> So I need to check what we gain and lose by using a function pointer or not. If it is not urgent, please hold this patch for a while.
> 
> The only thing I worry about in the patch below is: if a very big DISCARD bio comes in, will the kernel stack tend to overflow?
> 

Before calling generic_make_request(), if we could check whether the stack is nearly full, and schedule a short timeout when too many bios are stacked, maybe the stack overflow can be avoided.

Coly




* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-20  8:07           ` Coly Li
  2017-02-20  8:30             ` Coly Li
@ 2017-02-20 18:14             ` Wols Lists
  2017-02-21 11:30               ` Coly Li
  2017-02-21  0:29             ` NeilBrown
  2 siblings, 1 reply; 43+ messages in thread
From: Wols Lists @ 2017-02-20 18:14 UTC (permalink / raw)
  To: Coly Li, Shaohua Li
  Cc: NeilBrown, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang

On 20/02/17 08:07, Coly Li wrote:
> For the function pointer asignment, it is because I see a brach happens in a loop. If I use a function pointer, I can avoid redundant brach inside the loop. raid1_read_request() and raid1_write_request() are not simple functions, I don't know whether gcc may make them inline or not, so I am on the way to check the disassembled code..

Can you force gcc to inline (or not inline) a function? Isn't it dangerous
to rely on default behaviour and assume it won't change when the compiler
is upgraded?
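
For what it's worth, the kernel does provide attributes to take this
decision out of gcc's hands, __always_inline and noinline from
<linux/compiler.h>. A minimal sketch (hypothetical helpers, only to show
the annotations, not code from the patch):

    #include <linux/compiler.h>

    /* inlined regardless of gcc's size/benefit heuristics */
    static __always_inline int fast_check(int x)
    {
            return x & 1;
    }

    /* kept out of line even if gcc would otherwise inline it */
    static noinline int slow_path(int x)
    {
            return x * 3;
    }

That avoids relying on whatever the default heuristic happens to be in a
given compiler release.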

Cheers,
Wol


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-20  8:07           ` Coly Li
  2017-02-20  8:30             ` Coly Li
  2017-02-20 18:14             ` Wols Lists
@ 2017-02-21  0:29             ` NeilBrown
  2017-02-21  9:45               ` Coly Li
  2 siblings, 1 reply; 43+ messages in thread
From: NeilBrown @ 2017-02-21  0:29 UTC (permalink / raw)
  To: Coly Li, Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang


On Mon, Feb 20 2017, Coly Li wrote:

> 发自我的 iPhone
>> 在 2017年2月20日,下午3:04,Shaohua Li <shli@kernel.org> 写道:
>> 
>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>> 
>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>> 
>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote:
>>>>>> I know you are going to change this as Shaohua wantsthe spitting to
>>>>>> happen in a separate function, which I agree with, but there is 
>>>>>> something else wrong here. Calling bio_split/bio_chain repeatedly
>>>>>> in a loop is dangerous. It is OK for simple devices, but when one
>>>>>> request can wait for another request to the same device it can
>>>>>> deadlock. This can happen with raid1.  If a resync request calls
>>>>>> raise_barrier() between one request and the next, then the next has
>>>>>> to wait for the resync request, which has to wait for the first
>>>>>> request. As the first request will be stuck in the queue in 
>>>>>> generic_make_request(), you get a deadlock.
>>>>> 
>>>>> For md raid1, queue in generic_make_request(), can I understand it as
>>>>> bio_list_on_stack in this function? And queue in underlying device,
>>>>> can I understand it as the data structures like plug->pending and
>>>>> conf->pending_bio_list ?
>>>> 
>>>> Yes, the queue in generic_make_request() is the bio_list_on_stack.  That
>>>> is the only queue I am talking about.  I'm not referring to
>>>> plug->pending or conf->pending_bio_list at all.
>>>> 
>>>>> 
>>>>> I still don't get the point of deadlock, let me try to explain why I
>>>>> don't see the possible deadlock. If a bio is split, and the first part
>>>>> is processed by make_request_fn(), and then a resync comes and it will
>>>>> raise a barrier, there are 3 possible conditions,
>>>>> - the resync I/O tries to raise barrier on same bucket of the first
>>>>> regular bio. Then the resync task has to wait to the first bio drops
>>>>> its conf->nr_pending[idx]
>>>> 
>>>> Not quite.
>>>> First, the resync task (in raise_barrier()) will wait for ->nr_waiting[idx]
>>>> to be zero.  We can assume this happens immediately.
>>>> Then the resync_task will increment ->barrier[idx].
>>>> Only then will it wait for the first bio to drop ->nr_pending[idx].
>>>> The processing of that first bio will have submitted bios to the
>>>> underlying device, and they will be in the bio_list_on_stack queue, and
>>>> will not be processed until raid1_make_request() completes.
>>>> 
>>>> The loop in raid1_make_request() will then call make_request_fn() which
>>>> will call wait_barrier(), which will wait for ->barrier[idx] to be
>>>> zero.
>>> 
>>> Thinking more carefully about this.. the 'idx' that the second bio will
>>> wait for will normally be different, so there won't be a deadlock after
>>> all.
>>> 
>>> However it is possible for hash_long() to produce the same idx for two
>>> consecutive barrier_units so there is still the possibility of a
>>> deadlock, though it isn't as likely as I thought at first.
>> 
>> Wrapped the function pointer issue Neil pointed out into Coly's original patch.
>> Also fix a 'use-after-free' bug. For the deadlock issue, I'll add below patch,
>> please check.
>> 
>> Thanks,
>> Shaohua
>> 
>
> Hmm, please hold, I am still thinking of it. With barrier bucket and
> hash_long(), I don't see dead lock yet. For raid10 it might happen,
> but once we have barrier bucket on it , there will no deadlock.
>
> My question is, this deadlock only happens when a big bio is split,
> and the split small bios are continuous, and the resync io visiting
> barrier buckets in sequntial order too. In the case if adjacent split
> regular bios or resync bios hit same barrier bucket, it will be a very
> big failure of hash design, and should have been found already. But no
> one complain it, so I don't convince myself tje deadlock is real with
> io barrier buckets (this is what Neil concerns).

I think you are wrong about the design goal of a hash function.
When fed a sequence of inputs, with any stride (i.e. with any constant
difference between consecutive inputs), the output of the hash function
should appear to be random.
A random sequence can produce the same number twice in a row.
If the hash function produces a number from 0 to N-1, you would expect
two consecutive outputs to be the same about once every N inputs.
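
To put a rough number on that: with 1024 barrier buckets, any two
adjacent barrier units have about a 1 in 1024 chance of hashing to the
same bucket on that idealized-random model, so a request split across k
consecutive units has roughly a (k-1)/1024 chance of two neighbouring
pieces landing in the same bucket. Rare, but not something to rely on
never happening.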

Even if there was no possibility of a deadlock from a resync request
happening between two bios, there are other possibilities.

It is not, in general, safe to call mempool_alloc() twice in a row,
without first ensuring that the first allocation will get freed by some
other thread.  raid1_write_request() allocates from r1bio_pool, and then
submits bios to the underlying device, which get queued on
bio_list_on_stack.  They will not be processed until after
raid1_make_request() completes, so when raid1_make_request loops around
and calls raid1_write_request() again, it will try to allocate another
r1bio from r1bio_pool, and this might end up waiting for the r1bio which
is trapped and cannot complete.

As r1bio_pool preallocates 256 entries, this is unlikely  but not
impossible.  If 256 threads all attempt a write (or read) that crosses a
boundary, then they will consume all 256 preallocated entries, and want
more. If there is no free memory, they will block indefinitely.

bio_alloc_bioset() has punt_bios_to_rescuer() to attempt to work around
a deadlock very similar to this, but it is a very specific solution,
wouldn't help raid1, and is much more complex than just rearranging the
code.


>
> For the function pointer asignment, it is because I see a brach
> happens in a loop. If I use a function pointer, I can avoid redundant
> brach inside the loop. raid1_read_request() and raid1_write_request()
> are not simple functions, I don't know whether gcc may make them
> inline or not, so I am on the way to check the disassembled code..

It is a long time since I studied how CPUs handle different sorts of
machine code, but I'm fairly sure that an indirect branch (i.e. a
branch through a function pointer) is harder to optimize than a
conditional branch.

But I think the readability of the code is more important, and having an
if-then-else in a loop is more familiar to most readers than using a
function pointer like this.
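
Concretely, the two shapes being compared (a sketch, reusing the call
signature from the quoted diff):

    /* conditional branch: both callees are visible to the compiler
     * and the branch is cheap to predict */
    if (bio_data_dir(split) == READ)
            raid1_read_request(mddev, split);
    else
            raid1_write_request(mddev, split);

    /* indirect call: chosen once before the loop, but opaque to
     * inlining and generally harder for the CPU to predict */
    void (*req_fn)(struct mddev *mddev, struct bio *bio) =
            bio_data_dir(bio) == READ ?
                    raid1_read_request : raid1_write_request;
    /* ... then inside the loop ... */
    req_fn(mddev, split);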

>
> The loop in raid1_make_request() is quite high level, I am not sure
> whether CPU brach pridiction may work correctly, especially when it is
> a big DISCARD bio, using function pointer may drop a possible brach.
>
> So I need to check what we get and lose when use function pointer or
> not. If it is not urgent, please hold this patch for a while. 
>
> The only thing I worry in the bellowed patch is, if a very big DISCARD
> bio comes, will the kernel space stack trend to be overflow?

How would a large request cause extra stack space to be used?

Thanks,
NeilBrown



* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-21  0:29             ` NeilBrown
@ 2017-02-21  9:45               ` Coly Li
  2017-02-21 17:45                 ` Shaohua Li
  0 siblings, 1 reply; 43+ messages in thread
From: Coly Li @ 2017-02-21  9:45 UTC (permalink / raw)
  To: NeilBrown
  Cc: Shaohua Li, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/21 8:29 AM, NeilBrown wrote:
> On Mon, Feb 20 2017, Coly Li wrote:
> 
>>> 在 2017年2月20日,下午3:04,Shaohua Li <shli@kernel.org> 写道:
>>> 
>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>> 
>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>> 
>>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote: I know you are
>>>>>>> going to change this as Shaohua wantsthe spitting to 
>>>>>>> happen in a separate function, which I agree with, but
>>>>>>> there is something else wrong here. Calling
>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
>>>>>>> It is OK for simple devices, but when one request can
>>>>>>> wait for another request to the same device it can 
>>>>>>> deadlock. This can happen with raid1.  If a resync
>>>>>>> request calls raise_barrier() between one request and
>>>>>>> the next, then the next has to wait for the resync
>>>>>>> request, which has to wait for the first request. As
>>>>>>> the first request will be stuck in the queue in 
>>>>>>> generic_make_request(), you get a deadlock.
>>>>>> 
>>>>>> For md raid1, queue in generic_make_request(), can I
>>>>>> understand it as bio_list_on_stack in this function? And
>>>>>> queue in underlying device, can I understand it as the
>>>>>> data structures like plug->pending and 
>>>>>> conf->pending_bio_list ?
>>>>> 
>>>>> Yes, the queue in generic_make_request() is the
>>>>> bio_list_on_stack.  That is the only queue I am talking
>>>>> about.  I'm not referring to plug->pending or
>>>>> conf->pending_bio_list at all.
>>>>> 
>>>>>> 
>>>>>> I still don't get the point of deadlock, let me try to
>>>>>> explain why I don't see the possible deadlock. If a bio
>>>>>> is split, and the first part is processed by
>>>>>> make_request_fn(), and then a resync comes and it will 
>>>>>> raise a barrier, there are 3 possible conditions, - the
>>>>>> resync I/O tries to raise barrier on same bucket of the
>>>>>> first regular bio. Then the resync task has to wait to
>>>>>> the first bio drops its conf->nr_pending[idx]
>>>>> 
>>>>> Not quite. First, the resync task (in raise_barrier()) will
>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
>>>>> happens immediately. Then the resync_task will increment
>>>>> ->barrier[idx]. Only then will it wait for the first bio to
>>>>> drop ->nr_pending[idx]. The processing of that first bio
>>>>> will have submitted bios to the underlying device, and they
>>>>> will be in the bio_list_on_stack queue, and will not be
>>>>> processed until raid1_make_request() completes.
>>>>> 
>>>>> The loop in raid1_make_request() will then call
>>>>> make_request_fn() which will call wait_barrier(), which
>>>>> will wait for ->barrier[idx] to be zero.
>>>> 
>>>> Thinking more carefully about this.. the 'idx' that the
>>>> second bio will wait for will normally be different, so there
>>>> won't be a deadlock after all.
>>>> 
>>>> However it is possible for hash_long() to produce the same
>>>> idx for two consecutive barrier_units so there is still the
>>>> possibility of a deadlock, though it isn't as likely as I
>>>> thought at first.
>>> 
>>> Wrapped the function pointer issue Neil pointed out into Coly's
>>> original patch. Also fix a 'use-after-free' bug. For the
>>> deadlock issue, I'll add below patch, please check.
>>> 
>>> Thanks, Shaohua
>>> 
>> 

Neil,

Thanks for your patient explanation, I think I am starting to follow
what you mean. Let me try to re-tell what I understand; correct me if I
am wrong.


>> Hmm, please hold, I am still thinking of it. With barrier bucket
>> and hash_long(), I don't see dead lock yet. For raid10 it might
>> happen, but once we have barrier bucket on it , there will no
>> deadlock.
>> 
>> My question is, this deadlock only happens when a big bio is
>> split, and the split small bios are continuous, and the resync io
>> visiting barrier buckets in sequntial order too. In the case if
>> adjacent split regular bios or resync bios hit same barrier
>> bucket, it will be a very big failure of hash design, and should
>> have been found already. But no one complain it, so I don't
>> convince myself tje deadlock is real with io barrier buckets
>> (this is what Neil concerns).
> 
> I think you are wrong about the design goal of a hash function. 
> When feed a sequence of inputs, with any stride (i.e. with any
> constant difference between consecutive inputs), the output of the
> hash function should appear to be random. A random sequence can
> produce the same number twice in a row. If the hash function
> produces a number from 0 to N-1, you would expect two consecutive
> outputs to be the same about once every N inputs.
> 

Yes, you are right. But when I mentioned hash conflicts, I limited the
integers to the range [0, 1<<38]. 38 is (64-17-9): when a 64-bit LBA
address is divided by the 64MB I/O barrier unit size, its value range is
reduced to [0, 1<<38].

The maximum size of a normal bio is 1MB, so it could be split into 2
bios at most.

For a DISCARD bio, the maximum size is 4GB, so it could be split into 65
bios at most.

Then in this patch, the hash question reduces to: for any 65 consecutive
integers in the range [0, 1<<38], when hash_long() hashes them into the
range [0, 1023], will any hash conflict happen among them?

I tried a half range, [0, 1<<37], to check for hash conflicts, by
writing a simple program to emulate the hash calculation in the new I/O
barrier patch and iterating over all windows of {2, 65, 128, 512}
consecutive integers in [0, 1<<37].

On a 20-core CPU each run took 7+ hours; in the end I found no hash
conflict up to 512 consecutive integers under the above limited
condition. For 1024, a lot of hash conflicts were detected.
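
A rough userspace version of the window-of-2 (adjacent units) check,
assuming the kernel's post-4.7 hash_64() formula and 1024 buckets, with
the range shortened so it finishes in seconds rather than hours:

    #include <stdio.h>
    #include <stdint.h>

    #define GOLDEN_RATIO_64 0x61C8864680B583EBull
    #define BUCKET_BITS     10      /* 1024 barrier buckets */

    /* same shape as the kernel's hash_64(val, bits) */
    static unsigned int hash64(uint64_t val, unsigned int bits)
    {
            return (unsigned int)((val * GOLDEN_RATIO_64) >> (64 - bits));
    }

    int main(void)
    {
            unsigned long long conflicts = 0;
            uint64_t unit;

            for (unit = 0; unit + 1 < (1ULL << 30); unit++)
                    if (hash64(unit, BUCKET_BITS) ==
                        hash64(unit + 1, BUCKET_BITS))
                            conflicts++;

            printf("adjacent-unit conflicts: %llu\n", conflicts);
            return 0;
    }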

The [0, 1<<37] unit range maps back to a [0, 1<<63] byte LBA range,
which is large enough for almost all existing md raid configurations.
So for the current kernel implementation and real world devices, there
is no possible hash conflict for a single bio with the new I/O barrier
patch.

If bi_iter.bi_size changes from unsigned int to unsigned long in the
future, the above assumption will be wrong. There will be hash
conflicts, and a potential deadlock, which is quite implicit. Yes, I
agree with you. No, splitting the bio inside the loop is not perfect.

> Even if there was no possibility of a deadlock from a resync
> request happening between two bios, there are other possibilities.
> 

The text below teaches me more about the raid1 code, but confuses me
more as well. Here come my questions,

> It is not, in general, safe to call mempool_alloc() twice in a
> row, without first ensuring that the first allocation will get
> freed by some other thread.  raid1_write_request() allocates from
> r1bio_pool, and then submits bios to the underlying device, which
> get queued on bio_list_on_stack.  They will not be processed until
> after raid1_make_request() completes, so when raid1_make_request
> loops around and calls raid1_write_request() again, it will try to
> allocate another r1bio from r1bio_pool, and this might end up
> waiting for the r1bio which is trapped and cannot complete.
> 

Can I say that it is because blk_finish_plug() won't be called before
raid1_make_request() returns? Then in raid1_write_request(), mbio
will be added into plug->pending, but it won't be handled before
blk_finish_plug() is called.


> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
> impossible.  If 256 threads all attempt a write (or read) that
> crosses a boundary, then they will consume all 256 preallocated
> entries, and want more. If there is no free memory, they will block
> indefinitely.
> 

If raid1_make_request() is modified in this way,
+	if (bio_data_dir(split) == READ)
+		raid1_read_request(mddev, split);
+	else
+		raid1_write_request(mddev, split);
+	if (split != bio)
+		generic_make_request(bio);

Then the original bio will be added into the bio_list_on_stack of the
top level generic_make_request(), where current->bio_list is
initialized; when generic_make_request() is called nested inside
raid1_make_request(), the split bio will be added into
current->bio_list and nothing else happens.

After the nested generic_make_request() returns, control goes back to
the next lines of generic_make_request(),
2022                         ret = q->make_request_fn(q, bio);
2023
2024                         blk_queue_exit(q);
2025
2026                         bio = bio_list_pop(current->bio_list);

bio_list_pop() will return the second half of the split bio, and it is
then handled in raid1_make_request() by calling,
2022                         ret = q->make_request_fn(q, bio);

Then there is no deadlock at all.
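
Put together, the rearranged entry point would look roughly like this (a
sketch only; align_to_barrier_unit_end() is assumed to return how many
sectors fit before the next 64MB barrier unit boundary):

    static void raid1_make_request(struct mddev *mddev, struct bio *bio)
    {
            struct bio *split = bio;
            sector_t sectors;

            /* clamp the bio to the end of its barrier unit */
            sectors = align_to_barrier_unit_end(bio->bi_iter.bi_sector,
                                                bio_sectors(bio));
            if (sectors < bio_sectors(bio)) {
                    split = bio_split(bio, sectors, GFP_NOIO, fs_bio_set);
                    bio_chain(split, bio);
            }

            if (bio_data_dir(split) == READ)
                    raid1_read_request(mddev, split);
            else
                    raid1_write_request(mddev, split);

            /*
             * The remainder is not handled here: it goes back onto
             * current->bio_list and is popped after we return, so only
             * one r1bio and one barrier bucket are held at a time.
             */
            if (split != bio)
                    generic_make_request(bio);
    }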


> bio_alloc_bioset() has punt_bios_to_rescuer() to attempt to work
> around a deadlock very similar to this, but it is a very specific
> solution, wouldn't help raid1, and is much more complex than just
> rearranging the code.
> 

Yes, it is, definitely.

> 
>> 
>> For the function pointer asignment, it is because I see a brach 
>> happens in a loop. If I use a function pointer, I can avoid
>> redundant brach inside the loop. raid1_read_request() and
>> raid1_write_request() are not simple functions, I don't know
>> whether gcc may make them inline or not, so I am on the way to
>> check the disassembled code..
> 
> It is a long time since I studied how CPUs handle different sorts
> of machine code, but I'm fairly sure that and indirect branch (i.e.
> a branch through a function pointer) is harder to optimize than a 
> conditional branch.
> 

It makes sense to me, yes, you are right.

> But I think the readability of the code is more important, and
> having an if-then-else in a loop is more familiar to most readers
> than using a function pointer like this.
> 

Copied, I agree with you.


>> 
>> The loop in raid1_make_request() is quite high level, I am not
>> sure whether CPU brach pridiction may work correctly, especially
>> when it is a big DISCARD bio, using function pointer may drop a
>> possible brach.
>> 
>> So I need to check what we get and lose when use function pointer
>> or not. If it is not urgent, please hold this patch for a while.
>> 
>> 
>> The only thing I worry in the bellowed patch is, if a very big
>> DISCARD bio comes, will the kernel space stack trend to be
>> overflow?
> 
> How would a large request cause extra stack space to be used?

If my understanding is correct, there is no worry here.

Please correct me.

Coly


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-20 18:14             ` Wols Lists
@ 2017-02-21 11:30               ` Coly Li
  2017-02-21 19:20                 ` Wols Lists
  0 siblings, 1 reply; 43+ messages in thread
From: Coly Li @ 2017-02-21 11:30 UTC (permalink / raw)
  To: Wols Lists
  Cc: Shaohua Li, NeilBrown, NeilBrown, linux-raid, Shaohua Li,
	Johannes Thumshirn, Guoqing Jiang

On 2017/2/21 2:14 AM, Wols Lists wrote:
> On 20/02/17 08:07, Coly Li wrote:
>> For the function pointer asignment, it is because I see a brach happens in a loop. If I use a function pointer, I can avoid redundant brach inside the loop. raid1_read_request() and raid1_write_request() are not simple functions, I don't know whether gcc may make them inline or not, so I am on the way to check the disassembled code..
> 
> Can you force gcc to inline or compile a function? Isn't it dangerous to
> rely on default behaviour and assume it won't change when the compiler
> is upgraded?

I choose to trust the compiler, and trust the people behind gcc.

Coly



* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-21  9:45               ` Coly Li
@ 2017-02-21 17:45                 ` Shaohua Li
  2017-02-21 20:09                   ` Coly Li
  0 siblings, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-21 17:45 UTC (permalink / raw)
  To: Coly Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
> On 2017/2/21 上午8:29, NeilBrown wrote:
> > On Mon, Feb 20 2017, Coly Li wrote:
> > 
> >>> 在 2017年2月20日,下午3:04,Shaohua Li <shli@kernel.org> 写道:
> >>> 
> >>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> >>>>> On Mon, Feb 20 2017, NeilBrown wrote:
> >>>>> 
> >>>>>> On Fri, Feb 17 2017, Coly Li wrote:
> >>>>>> 
> >>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote: I know you are
> >>>>>>> going to change this as Shaohua wantsthe spitting to 
> >>>>>>> happen in a separate function, which I agree with, but
> >>>>>>> there is something else wrong here. Calling
> >>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
> >>>>>>> It is OK for simple devices, but when one request can
> >>>>>>> wait for another request to the same device it can 
> >>>>>>> deadlock. This can happen with raid1.  If a resync
> >>>>>>> request calls raise_barrier() between one request and
> >>>>>>> the next, then the next has to wait for the resync
> >>>>>>> request, which has to wait for the first request. As
> >>>>>>> the first request will be stuck in the queue in 
> >>>>>>> generic_make_request(), you get a deadlock.
> >>>>>> 
> >>>>>> For md raid1, queue in generic_make_request(), can I
> >>>>>> understand it as bio_list_on_stack in this function? And
> >>>>>> queue in underlying device, can I understand it as the
> >>>>>> data structures like plug->pending and 
> >>>>>> conf->pending_bio_list ?
> >>>>> 
> >>>>> Yes, the queue in generic_make_request() is the
> >>>>> bio_list_on_stack.  That is the only queue I am talking
> >>>>> about.  I'm not referring to plug->pending or
> >>>>> conf->pending_bio_list at all.
> >>>>> 
> >>>>>> 
> >>>>>> I still don't get the point of deadlock, let me try to
> >>>>>> explain why I don't see the possible deadlock. If a bio
> >>>>>> is split, and the first part is processed by
> >>>>>> make_request_fn(), and then a resync comes and it will 
> >>>>>> raise a barrier, there are 3 possible conditions, - the
> >>>>>> resync I/O tries to raise barrier on same bucket of the
> >>>>>> first regular bio. Then the resync task has to wait to
> >>>>>> the first bio drops its conf->nr_pending[idx]
> >>>>> 
> >>>>> Not quite. First, the resync task (in raise_barrier()) will
> >>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
> >>>>> happens immediately. Then the resync_task will increment
> >>>>> ->barrier[idx]. Only then will it wait for the first bio to
> >>>>> drop ->nr_pending[idx]. The processing of that first bio
> >>>>> will have submitted bios to the underlying device, and they
> >>>>> will be in the bio_list_on_stack queue, and will not be
> >>>>> processed until raid1_make_request() completes.
> >>>>> 
> >>>>> The loop in raid1_make_request() will then call
> >>>>> make_request_fn() which will call wait_barrier(), which
> >>>>> will wait for ->barrier[idx] to be zero.
> >>>> 
> >>>> Thinking more carefully about this.. the 'idx' that the
> >>>> second bio will wait for will normally be different, so there
> >>>> won't be a deadlock after all.
> >>>> 
> >>>> However it is possible for hash_long() to produce the same
> >>>> idx for two consecutive barrier_units so there is still the
> >>>> possibility of a deadlock, though it isn't as likely as I
> >>>> thought at first.
> >>> 
> >>> Wrapped the function pointer issue Neil pointed out into Coly's
> >>> original patch. Also fix a 'use-after-free' bug. For the
> >>> deadlock issue, I'll add below patch, please check.
> >>> 
> >>> Thanks, Shaohua
> >>> 
> >> 
> 
> Neil,
> 
> Thanks for your patient explanation, I feel I come to follow up what
> you mean. Let me try to re-tell what I understand, correct me if I am
> wrong.
> 
> 
> >> Hmm, please hold, I am still thinking of it. With barrier bucket
> >> and hash_long(), I don't see dead lock yet. For raid10 it might
> >> happen, but once we have barrier bucket on it , there will no
> >> deadlock.
> >> 
> >> My question is, this deadlock only happens when a big bio is
> >> split, and the split small bios are continuous, and the resync io
> >> visiting barrier buckets in sequntial order too. In the case if
> >> adjacent split regular bios or resync bios hit same barrier
> >> bucket, it will be a very big failure of hash design, and should
> >> have been found already. But no one complain it, so I don't
> >> convince myself tje deadlock is real with io barrier buckets
> >> (this is what Neil concerns).
> > 
> > I think you are wrong about the design goal of a hash function. 
> > When feed a sequence of inputs, with any stride (i.e. with any
> > constant difference between consecutive inputs), the output of the
> > hash function should appear to be random. A random sequence can
> > produce the same number twice in a row. If the hash function
> > produces a number from 0 to N-1, you would expect two consecutive
> > outputs to be the same about once every N inputs.
> > 
> 
> Yes, you are right. But when I mentioned hash conflict, I limit the
> integers in range [0, 1<<38]. 38 is (64-17-9), when a 64bit LBA
> address divided by 64MB I/O barrier unit size, its value range is
> reduced to [0, 1<<38].
> 
> Maximum size of normal bio is 1MB, it could be split into 2 bios at most.
> 
> For DISCARD bio, its maximum size is 4GB, it could be split into 65
> bios at most.
> 
> Then in this patch, the hash question is degraded to: for any
> consecutive 65 integers in range [0, 1<<38], use hash_long() to hash
> these 65 integers into range [0, 1023], will any hash conflict happen
> among these integers ?
> 
> I tried a half range [0, 1<<37] to check hash conflict, by writing a
> simple code to emulate hash calculation in the new I/O barrier patch,
> to iterate all consecutive {2, 65, 128, 512} integers in range [0,
> 1<<37] for hash conflict.
> 
> On a 20 core CPU each run spent 7+ hours, finally I find no hash
> conflict detected up to 512 consecutive integers in above limited
> condition. For 1024, there are a lot hash conflict detected.
> 
> [0, 1<<37] range back to [0, 63] LBA range, this is large enough for
> almost all existing md raid configuration. So for current kernel
> implementation and real world device, for a single bio, there is no
> possible hash conflict the new I/O barrier patch.
> 
> If bi_iter.bi_size changes from unsigned int to unsigned long in
> future, the above assumption will be wrong. There will be hash
> conflict, and potential dead lock, which is quite implicit. Yes, I
> agree with you. No, bio split inside loop is not perfect.
> 
> > Even if there was no possibility of a deadlock from a resync
> > request happening between two bios, there are other possibilities.
> > 
> 
> The bellowed text makes me know more about raid1 code, but confuses me
> more as well. Here comes my questions,
> 
> > It is not, in general, safe to call mempool_alloc() twice in a
> > row, without first ensuring that the first allocation will get
> > freed by some other thread.  raid1_write_request() allocates from
> > r1bio_pool, and then submits bios to the underlying device, which
> > get queued on bio_list_on_stack.  They will not be processed until
> > after raid1_make_request() completes, so when raid1_make_request
> > loops around and calls raid1_write_request() again, it will try to
> > allocate another r1bio from r1bio_pool, and this might end up
> > waiting for the r1bio which is trapped and cannot complete.
> > 
> 
> Can I say that it is because blk_finish_plug() won't be called before
> raid1_make_request() returns ? Then in raid1_write_request(), mbio
> will be added into plug->pending, but before blk_finish_plug() is
> called, they won't be handled.

blk_finish_plug() is called if raid1_make_request() sleeps. The bio is
held in current->bio_list, not in the plug list.
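
For reference, that flush is done by the scheduler itself; a simplified
sketch of the relevant hook (sched_submit_work() in kernel/sched/core.c,
quoted from memory, so details may differ):

    static inline void sched_submit_work(struct task_struct *tsk)
    {
            if (!tsk->state || tsk_is_pi_blocked(tsk))
                    return;
            /*
             * If we are going to sleep and we have plugged IO queued,
             * make sure to submit it to avoid deadlocks.
             */
            if (blk_needs_flush_plug(tsk))
                    blk_schedule_flush_plug(tsk);
    }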
 
> > As r1bio_pool preallocates 256 entries, this is unlikely  but not 
> > impossible.  If 256 threads all attempt a write (or read) that
> > crosses a boundary, then they will consume all 256 preallocated
> > entries, and want more. If there is no free memory, they will block
> > indefinitely.
> > 
> 
> If raid1_make_request() is modified into this way,
> +	if (bio_data_dir(split) == READ)
> +		raid1_read_request(mddev, split);
> +	else
> +		raid1_write_request(mddev, split);
> +	if (split != bio)
> +		generic_make_request(bio);
> 
> Then the original bio will be added into the bio_list_on_stack of top
> level generic_make_request(), current->bio_list is initialized, when
> generic_make_request() is called nested in raid1_make_request(), the
> split bio will be added into current->bio_list and nothing else happens.
> 
> After the nested generic_make_request() returns, the code back to next
> code of generic_make_request(),
> 2022                         ret = q->make_request_fn(q, bio);
> 2023
> 2024                         blk_queue_exit(q);
> 2025
> 2026                         bio = bio_list_pop(current->bio_list);
> 
> bio_list_pop() will return the second half of the split bio, and it is

So in the above sequence, current->bio_list will hold bios in the
following order:
bios to the underlying disks, then the second half of the original bio.

bio_list_pop() will pop the bios to the underlying disks first and
handle them, then the second half of the original bio.

That said, this doesn't work for an array stacked 3 layers deep, because
in a 3-layer array, handling the middle layer bio will make the 3rd layer bio hold to
bio_list again.

Thanks,
Shaohua


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-21 11:30               ` Coly Li
@ 2017-02-21 19:20                 ` Wols Lists
  2017-02-21 20:16                   ` Coly Li
  0 siblings, 1 reply; 43+ messages in thread
From: Wols Lists @ 2017-02-21 19:20 UTC (permalink / raw)
  To: Coly Li
  Cc: Shaohua Li, NeilBrown, NeilBrown, linux-raid, Shaohua Li,
	Johannes Thumshirn, Guoqing Jiang

On 21/02/17 11:30, Coly Li wrote:
> On 2017/2/21 上午2:14, Wols Lists wrote:
>> On 20/02/17 08:07, Coly Li wrote:
>>> For the function pointer asignment, it is because I see a brach happens in a loop. If I use a function pointer, I can avoid redundant brach inside the loop. raid1_read_request() and raid1_write_request() are not simple functions, I don't know whether gcc may make them inline or not, so I am on the way to check the disassembled code..
>>
>> Can you force gcc to inline or compile a function? Isn't it dangerous to
>> rely on default behaviour and assume it won't change when the compiler
>> is upgraded?
> 
> I choose to trust compiler, and trust the people behind gcc.
> 
I admire your faith. I seem to remember several occasions where the gcc
people added new optimisations and caused all sorts of subtle havoc with
the kernel where it relied on the old behaviour. Don't forget - the
linux kernel is one of the compiler's most demanding customers. And
don't forget also - there are quite a few people now using llvm to
compile the kernel (it may not yet be fully working - I think it
certainly is for simple use cases), so tests on gcc don't guarantee
it'll work for
everyone ...

I think you can trace the addition of many kernel compile-time flags to
that sort of thing - disabling new optimisations.

Cheers,
Wol



* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-21 17:45                 ` Shaohua Li
@ 2017-02-21 20:09                   ` Coly Li
  2017-02-23  5:54                     ` Coly Li
  0 siblings, 1 reply; 43+ messages in thread
From: Coly Li @ 2017-02-21 20:09 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/22 1:45 AM, Shaohua Li wrote:
> On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
>> On 2017/2/21 上午8:29, NeilBrown wrote:
>>> On Mon, Feb 20 2017, Coly Li wrote:
>>>
>>>>> 在 2017年2月20日,下午3:04,Shaohua Li <shli@kernel.org> 写道:
>>>>>
>>>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>>>>
>>>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>>>>
>>>>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote: I know you are
>>>>>>>>> going to change this as Shaohua wantsthe spitting to 
>>>>>>>>> happen in a separate function, which I agree with, but
>>>>>>>>> there is something else wrong here. Calling
>>>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
>>>>>>>>> It is OK for simple devices, but when one request can
>>>>>>>>> wait for another request to the same device it can 
>>>>>>>>> deadlock. This can happen with raid1.  If a resync
>>>>>>>>> request calls raise_barrier() between one request and
>>>>>>>>> the next, then the next has to wait for the resync
>>>>>>>>> request, which has to wait for the first request. As
>>>>>>>>> the first request will be stuck in the queue in 
>>>>>>>>> generic_make_request(), you get a deadlock.
>>>>>>>>
>>>>>>>> For md raid1, queue in generic_make_request(), can I
>>>>>>>> understand it as bio_list_on_stack in this function? And
>>>>>>>> queue in underlying device, can I understand it as the
>>>>>>>> data structures like plug->pending and 
>>>>>>>> conf->pending_bio_list ?
>>>>>>>
>>>>>>> Yes, the queue in generic_make_request() is the
>>>>>>> bio_list_on_stack.  That is the only queue I am talking
>>>>>>> about.  I'm not referring to plug->pending or
>>>>>>> conf->pending_bio_list at all.
>>>>>>>
>>>>>>>>
>>>>>>>> I still don't get the point of deadlock, let me try to
>>>>>>>> explain why I don't see the possible deadlock. If a bio
>>>>>>>> is split, and the first part is processed by
>>>>>>>> make_request_fn(), and then a resync comes and it will 
>>>>>>>> raise a barrier, there are 3 possible conditions, - the
>>>>>>>> resync I/O tries to raise barrier on same bucket of the
>>>>>>>> first regular bio. Then the resync task has to wait to
>>>>>>>> the first bio drops its conf->nr_pending[idx]
>>>>>>>
>>>>>>> Not quite. First, the resync task (in raise_barrier()) will
>>>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
>>>>>>> happens immediately. Then the resync_task will increment
>>>>>>> ->barrier[idx]. Only then will it wait for the first bio to
>>>>>>> drop ->nr_pending[idx]. The processing of that first bio
>>>>>>> will have submitted bios to the underlying device, and they
>>>>>>> will be in the bio_list_on_stack queue, and will not be
>>>>>>> processed until raid1_make_request() completes.
>>>>>>>
>>>>>>> The loop in raid1_make_request() will then call
>>>>>>> make_request_fn() which will call wait_barrier(), which
>>>>>>> will wait for ->barrier[idx] to be zero.
>>>>>>
>>>>>> Thinking more carefully about this.. the 'idx' that the
>>>>>> second bio will wait for will normally be different, so there
>>>>>> won't be a deadlock after all.
>>>>>>
>>>>>> However it is possible for hash_long() to produce the same
>>>>>> idx for two consecutive barrier_units so there is still the
>>>>>> possibility of a deadlock, though it isn't as likely as I
>>>>>> thought at first.
>>>>>
>>>>> Wrapped the function pointer issue Neil pointed out into Coly's
>>>>> original patch. Also fix a 'use-after-free' bug. For the
>>>>> deadlock issue, I'll add below patch, please check.
>>>>>
>>>>> Thanks, Shaohua
>>>>>
>>>>
>>
>> Neil,
>>
>> Thanks for your patient explanation, I feel I come to follow up what
>> you mean. Let me try to re-tell what I understand, correct me if I am
>> wrong.
>>
>>
>>>> Hmm, please hold, I am still thinking of it. With barrier bucket
>>>> and hash_long(), I don't see dead lock yet. For raid10 it might
>>>> happen, but once we have barrier bucket on it , there will no
>>>> deadlock.
>>>>
>>>> My question is, this deadlock only happens when a big bio is
>>>> split, and the split small bios are continuous, and the resync io
>>>> visiting barrier buckets in sequntial order too. In the case if
>>>> adjacent split regular bios or resync bios hit same barrier
>>>> bucket, it will be a very big failure of hash design, and should
>>>> have been found already. But no one complain it, so I don't
>>>> convince myself tje deadlock is real with io barrier buckets
>>>> (this is what Neil concerns).
>>>
>>> I think you are wrong about the design goal of a hash function. 
>>> When feed a sequence of inputs, with any stride (i.e. with any
>>> constant difference between consecutive inputs), the output of the
>>> hash function should appear to be random. A random sequence can
>>> produce the same number twice in a row. If the hash function
>>> produces a number from 0 to N-1, you would expect two consecutive
>>> outputs to be the same about once every N inputs.
>>>
>>
>> Yes, you are right. But when I mentioned hash conflict, I limit the
>> integers in range [0, 1<<38]. 38 is (64-17-9), when a 64bit LBA
>> address divided by 64MB I/O barrier unit size, its value range is
>> reduced to [0, 1<<38].
>>
>> Maximum size of normal bio is 1MB, it could be split into 2 bios at most.
>>
>> For DISCARD bio, its maximum size is 4GB, it could be split into 65
>> bios at most.
>>
>> Then in this patch, the hash question is degraded to: for any
>> consecutive 65 integers in range [0, 1<<38], use hash_long() to hash
>> these 65 integers into range [0, 1023], will any hash conflict happen
>> among these integers ?
>>
>> I tried a half range [0, 1<<37] to check hash conflict, by writing a
>> simple code to emulate hash calculation in the new I/O barrier patch,
>> to iterate all consecutive {2, 65, 128, 512} integers in range [0,
>> 1<<37] for hash conflict.
>>
>> On a 20 core CPU each run spent 7+ hours, finally I find no hash
>> conflict detected up to 512 consecutive integers in above limited
>> condition. For 1024, there are a lot hash conflict detected.
>>
>> [0, 1<<37] range back to [0, 63] LBA range, this is large enough for
>> almost all existing md raid configuration. So for current kernel
>> implementation and real world device, for a single bio, there is no
>> possible hash conflict the new I/O barrier patch.
>>
>> If bi_iter.bi_size changes from unsigned int to unsigned long in
>> future, the above assumption will be wrong. There will be hash
>> conflict, and potential dead lock, which is quite implicit. Yes, I
>> agree with you. No, bio split inside loop is not perfect.
>>
>>> Even if there was no possibility of a deadlock from a resync
>>> request happening between two bios, there are other possibilities.
>>>
>>
>> The bellowed text makes me know more about raid1 code, but confuses me
>> more as well. Here comes my questions,
>>
>>> It is not, in general, safe to call mempool_alloc() twice in a
>>> row, without first ensuring that the first allocation will get
>>> freed by some other thread.  raid1_write_request() allocates from
>>> r1bio_pool, and then submits bios to the underlying device, which
>>> get queued on bio_list_on_stack.  They will not be processed until
>>> after raid1_make_request() completes, so when raid1_make_request
>>> loops around and calls raid1_write_request() again, it will try to
>>> allocate another r1bio from r1bio_pool, and this might end up
>>> waiting for the r1bio which is trapped and cannot complete.
>>>
>>
>> Can I say that it is because blk_finish_plug() won't be called before
>> raid1_make_request() returns ? Then in raid1_write_request(), mbio
>> will be added into plug->pending, but before blk_finish_plug() is
>> called, they won't be handled.
> 
> blk_finish_plug is called if raid1_make_request sleep. The bio is hold in
> current->bio_list, not in plug list.
>  

Oops, I mixed them up, thank you for clarifying :-)

>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
>>> impossible.  If 256 threads all attempt a write (or read) that
>>> crosses a boundary, then they will consume all 256 preallocated
>>> entries, and want more. If there is no free memory, they will block
>>> indefinitely.
>>>
>>
>> If raid1_make_request() is modified into this way,
>> +	if (bio_data_dir(split) == READ)
>> +		raid1_read_request(mddev, split);
>> +	else
>> +		raid1_write_request(mddev, split);
>> +	if (split != bio)
>> +		generic_make_request(bio);
>>
>> Then the original bio will be added into the bio_list_on_stack of top
>> level generic_make_request(), current->bio_list is initialized, when
>> generic_make_request() is called nested in raid1_make_request(), the
>> split bio will be added into current->bio_list and nothing else happens.
>>
>> After the nested generic_make_request() returns, the code back to next
>> code of generic_make_request(),
>> 2022                         ret = q->make_request_fn(q, bio);
>> 2023
>> 2024                         blk_queue_exit(q);
>> 2025
>> 2026                         bio = bio_list_pop(current->bio_list);
>>
>> bio_list_pop() will return the second half of the split bio, and it is
> 
> So in above sequence, the curent->bio_list will has bios in below sequence:
> bios to underlaying disks, second half of original bio
> 
> bio_list_pop will pop bios to underlaying disks first, handle them, then the
> second half of original bio.
> 
> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
> array, handling the middle layer bio will make the 3rd layer bio hold to
> bio_list again.
> 

Could you please give me more hints?
- What is the meaning of "hold" in "make the 3rd layer bio hold to
bio_list again"?
- Why does a deadlock happen if the 3rd layer bio is held on the
bio_list again?

Thanks in advance.

Coly


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-21 19:20                 ` Wols Lists
@ 2017-02-21 20:16                   ` Coly Li
  0 siblings, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-21 20:16 UTC (permalink / raw)
  To: Wols Lists
  Cc: Shaohua Li, NeilBrown, NeilBrown, linux-raid, Shaohua Li,
	Johannes Thumshirn, Guoqing Jiang

On 2017/2/22 3:20 AM, Wols Lists wrote:
> On 21/02/17 11:30, Coly Li wrote:
>> On 2017/2/21 上午2:14, Wols Lists wrote:
>>> On 20/02/17 08:07, Coly Li wrote:
>>>> For the function pointer asignment, it is because I see a brach happens in a loop. If I use a function pointer, I can avoid redundant brach inside the loop. raid1_read_request() and raid1_write_request() are not simple functions, I don't know whether gcc may make them inline or not, so I am on the way to check the disassembled code..
>>>
>>> Can you force gcc to inline or compile a function? Isn't it dangerous to
>>> rely on default behaviour and assume it won't change when the compiler
>>> is upgraded?
>>
>> I choose to trust compiler, and trust the people behind gcc.
>>
> I admire your faith. I seem to remember several occasions where the gcc
> people added new optimisations and caused all sorts of subtle havoc with
> the kernel where it relied on the old behaviour. Don't forget - the
> linux kernel is one of the compiler's most demanding customers. And
> don't forget also - there are quite a few people now using llvm to
> compile the kernel (it may not yet be working - I think it is certainly
> for simple use cases) so tests on gcc don't guarantee it'll work for
> everyone ...

I know the risk, but I don't think I can figure out where gcc goes wrong
by myself. So I have to choose to trust the compiler developers.

> 
> I think you can trace the addition of many kernel compile-time flags to
> that sort of thing - disabling new optimisations.

Do you suggest that if I look at the kernel compile command lines I will
find many compile-time flags which indeed disable some new gcc
optimization options?

If I understand you correctly, permit me to say this is a good point. I
will watch for these kinds of flags and check what they mean :-)

Thanks.

Coly


* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-21 20:09                   ` Coly Li
@ 2017-02-23  5:54                     ` Coly Li
  2017-02-23 17:34                       ` Shaohua Li
  2017-02-23 23:14                       ` NeilBrown
  0 siblings, 2 replies; 43+ messages in thread
From: Coly Li @ 2017-02-23  5:54 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/22 4:09 AM, Coly Li wrote:
> On 2017/2/22 上午1:45, Shaohua Li wrote:
>> On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
>>> On 2017/2/21 上午8:29, NeilBrown wrote:
>>>> On Mon, Feb 20 2017, Coly Li wrote:
>>>>
>>>>>> 在 2017年2月20日,下午3:04,Shaohua Li <shli@kernel.org> 写道:
>>>>>>
>>>>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
>>>>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
>>>>>>>>
>>>>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
>>>>>>>>>
>>>>>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote: I know you are
>>>>>>>>>> going to change this as Shaohua wantsthe spitting to 
>>>>>>>>>> happen in a separate function, which I agree with, but
>>>>>>>>>> there is something else wrong here. Calling
>>>>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
>>>>>>>>>> It is OK for simple devices, but when one request can
>>>>>>>>>> wait for another request to the same device it can 
>>>>>>>>>> deadlock. This can happen with raid1.  If a resync
>>>>>>>>>> request calls raise_barrier() between one request and
>>>>>>>>>> the next, then the next has to wait for the resync
>>>>>>>>>> request, which has to wait for the first request. As
>>>>>>>>>> the first request will be stuck in the queue in 
>>>>>>>>>> generic_make_request(), you get a deadlock.
>>>>>>>>>
>>>>>>>>> For md raid1, queue in generic_make_request(), can I
>>>>>>>>> understand it as bio_list_on_stack in this function? And
>>>>>>>>> queue in underlying device, can I understand it as the
>>>>>>>>> data structures like plug->pending and 
>>>>>>>>> conf->pending_bio_list ?
>>>>>>>>
>>>>>>>> Yes, the queue in generic_make_request() is the
>>>>>>>> bio_list_on_stack.  That is the only queue I am talking
>>>>>>>> about.  I'm not referring to plug->pending or
>>>>>>>> conf->pending_bio_list at all.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> I still don't get the point of deadlock, let me try to
>>>>>>>>> explain why I don't see the possible deadlock. If a bio
>>>>>>>>> is split, and the first part is processed by
>>>>>>>>> make_request_fn(), and then a resync comes and it will 
>>>>>>>>> raise a barrier, there are 3 possible conditions, - the
>>>>>>>>> resync I/O tries to raise barrier on same bucket of the
>>>>>>>>> first regular bio. Then the resync task has to wait to
>>>>>>>>> the first bio drops its conf->nr_pending[idx]
>>>>>>>>
>>>>>>>> Not quite. First, the resync task (in raise_barrier()) will
>>>>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
>>>>>>>> happens immediately. Then the resync_task will increment
>>>>>>>> ->barrier[idx]. Only then will it wait for the first bio to
>>>>>>>> drop ->nr_pending[idx]. The processing of that first bio
>>>>>>>> will have submitted bios to the underlying device, and they
>>>>>>>> will be in the bio_list_on_stack queue, and will not be
>>>>>>>> processed until raid1_make_request() completes.
>>>>>>>>
>>>>>>>> The loop in raid1_make_request() will then call
>>>>>>>> make_request_fn() which will call wait_barrier(), which
>>>>>>>> will wait for ->barrier[idx] to be zero.
>>>>>>>
>>>>>>> Thinking more carefully about this.. the 'idx' that the
>>>>>>> second bio will wait for will normally be different, so there
>>>>>>> won't be a deadlock after all.
>>>>>>>
>>>>>>> However it is possible for hash_long() to produce the same
>>>>>>> idx for two consecutive barrier_units so there is still the
>>>>>>> possibility of a deadlock, though it isn't as likely as I
>>>>>>> thought at first.
>>>>>>
>>>>>> Wrapped the function pointer issue Neil pointed out into Coly's
>>>>>> original patch. Also fix a 'use-after-free' bug. For the
>>>>>> deadlock issue, I'll add below patch, please check.
>>>>>>
>>>>>> Thanks, Shaohua
>>>>>>
>>>>>
>>>
>>> Neil,
>>>
>>> Thanks for your patient explanation, I feel I come to follow up what
>>> you mean. Let me try to re-tell what I understand, correct me if I am
>>> wrong.
>>>
>>>
>>>>> Hmm, please hold, I am still thinking of it. With barrier bucket
>>>>> and hash_long(), I don't see dead lock yet. For raid10 it might
>>>>> happen, but once we have barrier bucket on it , there will no
>>>>> deadlock.
>>>>>
>>>>> My question is, this deadlock only happens when a big bio is
>>>>> split, and the split small bios are continuous, and the resync io
>>>>> visiting barrier buckets in sequntial order too. In the case if
>>>>> adjacent split regular bios or resync bios hit same barrier
>>>>> bucket, it will be a very big failure of hash design, and should
>>>>> have been found already. But no one complain it, so I don't
>>>>> convince myself tje deadlock is real with io barrier buckets
>>>>> (this is what Neil concerns).
>>>>
>>>> I think you are wrong about the design goal of a hash function. 
>>>> When feed a sequence of inputs, with any stride (i.e. with any
>>>> constant difference between consecutive inputs), the output of the
>>>> hash function should appear to be random. A random sequence can
>>>> produce the same number twice in a row. If the hash function
>>>> produces a number from 0 to N-1, you would expect two consecutive
>>>> outputs to be the same about once every N inputs.
>>>>
>>>
>>> Yes, you are right. But when I mentioned hash conflict, I limit the
>>> integers in range [0, 1<<38]. 38 is (64-17-9), when a 64bit LBA
>>> address divided by 64MB I/O barrier unit size, its value range is
>>> reduced to [0, 1<<38].
>>>
>>> Maximum size of normal bio is 1MB, it could be split into 2 bios at most.
>>>
>>> For DISCARD bio, its maximum size is 4GB, it could be split into 65
>>> bios at most.
>>>
>>> Then in this patch, the hash question is degraded to: for any
>>> consecutive 65 integers in range [0, 1<<38], use hash_long() to hash
>>> these 65 integers into range [0, 1023], will any hash conflict happen
>>> among these integers ?
>>>
>>> I tried a half range [0, 1<<37] to check hash conflict, by writing a
>>> simple code to emulate hash calculation in the new I/O barrier patch,
>>> to iterate all consecutive {2, 65, 128, 512} integers in range [0,
>>> 1<<37] for hash conflict.
>>>
>>> On a 20 core CPU each run spent 7+ hours, finally I find no hash
>>> conflict detected up to 512 consecutive integers in above limited
>>> condition. For 1024, there are a lot hash conflict detected.
>>>
>>> [0, 1<<37] range back to [0, 63] LBA range, this is large enough for
>>> almost all existing md raid configuration. So for current kernel
>>> implementation and real world device, for a single bio, there is no
>>> possible hash conflict the new I/O barrier patch.
>>>
>>> If bi_iter.bi_size changes from unsigned int to unsigned long in
>>> future, the above assumption will be wrong. There will be hash
>>> conflict, and potential dead lock, which is quite implicit. Yes, I
>>> agree with you. No, bio split inside loop is not perfect.
>>>
>>>> Even if there was no possibility of a deadlock from a resync
>>>> request happening between two bios, there are other possibilities.
>>>>
>>>
>>> The bellowed text makes me know more about raid1 code, but confuses me
>>> more as well. Here comes my questions,
>>>
>>>> It is not, in general, safe to call mempool_alloc() twice in a
>>>> row, without first ensuring that the first allocation will get
>>>> freed by some other thread.  raid1_write_request() allocates from
>>>> r1bio_pool, and then submits bios to the underlying device, which
>>>> get queued on bio_list_on_stack.  They will not be processed until
>>>> after raid1_make_request() completes, so when raid1_make_request
>>>> loops around and calls raid1_write_request() again, it will try to
>>>> allocate another r1bio from r1bio_pool, and this might end up
>>>> waiting for the r1bio which is trapped and cannot complete.
>>>>
>>>
>>> Can I say that it is because blk_finish_plug() won't be called before
>>> raid1_make_request() returns ? Then in raid1_write_request(), mbio
>>> will be added into plug->pending, but before blk_finish_plug() is
>>> called, they won't be handled.
>>
>> blk_finish_plug is called if raid1_make_request sleep. The bio is hold in
>> current->bio_list, not in plug list.
>>  
> 
> Oops, I messed them up,  thank you for the clarifying :-)
> 
>>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
>>>> impossible.  If 256 threads all attempt a write (or read) that
>>>> crosses a boundary, then they will consume all 256 preallocated
>>>> entries, and want more. If there is no free memory, they will block
>>>> indefinitely.
>>>>
>>>
>>> If raid1_make_request() is modified into this way,
>>> +	if (bio_data_dir(split) == READ)
>>> +		raid1_read_request(mddev, split);
>>> +	else
>>> +		raid1_write_request(mddev, split);
>>> +	if (split != bio)
>>> +		generic_make_request(bio);
>>>
>>> Then the original bio will be added into the bio_list_on_stack of top
>>> level generic_make_request(), current->bio_list is initialized, when
>>> generic_make_request() is called nested in raid1_make_request(), the
>>> split bio will be added into current->bio_list and nothing else happens.
>>>
>>> After the nested generic_make_request() returns, the code back to next
>>> code of generic_make_request(),
>>> 2022                         ret = q->make_request_fn(q, bio);
>>> 2023
>>> 2024                         blk_queue_exit(q);
>>> 2025
>>> 2026                         bio = bio_list_pop(current->bio_list);
>>>
>>> bio_list_pop() will return the second half of the split bio, and it is
>>
>> So in above sequence, the curent->bio_list will has bios in below sequence:
>> bios to underlaying disks, second half of original bio
>>
>> bio_list_pop will pop bios to underlaying disks first, handle them, then the
>> second half of original bio.
>>
>> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
>> array, handling the middle layer bio will make the 3rd layer bio hold to
>> bio_list again.
>>
> 
> Could you please give me more hint,
> - What is the meaning of "hold" from " make the 3rd layer bio hold to
> bio_list again" ?
> - Why deadlock happens if the 3rd layer bio hold to bio_list again ?

I tried to set up a 4-layer stacked md raid1 and reduced the I/O barrier
bucket size to 8MB; after running for 10 hours, no deadlock was observed.

Here is how the 4-layer stacked raid1 is set up:
- There are 4 NVMe SSDs; on each SSD I create four 500GB partitions:
  /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
  /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
  /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
  /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
- Here is how the 4-layer stacked raid1 is assembled; level 1 is the top
level and level 4 is the bottom level of the stacked devices:
  - level 1:
	/dev/md40: /dev/md30  /dev/md31
  - level 2:
	/dev/md30: /dev/md20  /dev/md21
	/dev/md31: /dev/md22  /dev/md23
  - level 3:
	/dev/md20: /dev/md10  /dev/md11
	/dev/md21: /dev/md12  /dev/md13
	/dev/md22: /dev/md14  /dev/md15
	/dev/md23: /dev/md16  /dev/md17
  - level 4:
	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4

Here is the fio job file:
[global]
direct=1
thread=1
ioengine=libaio

[job]
filename=/dev/md40
readwrite=write
numjobs=10
blocksize=33M
iodepth=128
time_based=1
runtime=10h

I planned to learn how the deadlock arises by analyzing a deadlocked
condition. Maybe that was because an 8MB bucket unit size is still not
small enough to trigger it; now I am trying a 512KB bucket unit size to
see whether I can encounter a deadlock.


=============== P.S ==============
When I ran the stacked raid1 testing, I noticed some suspicious
behavior; it is about resync.

The second time when I rebuilt all the raid1 devices by "mdadm -C
/dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I saw the top-level raid1 device
/dev/md40 had already accomplished 50%+ of its resync. I don't think it
could be that fast...

Coly



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-23  5:54                     ` Coly Li
@ 2017-02-23 17:34                       ` Shaohua Li
  2017-02-23 19:31                         ` Coly Li
  2017-02-23 23:14                       ` NeilBrown
  1 sibling, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-23 17:34 UTC (permalink / raw)
  To: Coly Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
> On 2017/2/22 上午4:09, Coly Li wrote:
> > On 2017/2/22 上午1:45, Shaohua Li wrote:
> >> On Tue, Feb 21, 2017 at 05:45:53PM +0800, Coly Li wrote:
> >>> On 2017/2/21 上午8:29, NeilBrown wrote:
> >>>> On Mon, Feb 20 2017, Coly Li wrote:
> >>>>
> >>>>>> 在 2017年2月20日,下午3:04,Shaohua Li <shli@kernel.org> 写道:
> >>>>>>
> >>>>>>> On Mon, Feb 20, 2017 at 01:51:22PM +1100, Neil Brown wrote:
> >>>>>>>> On Mon, Feb 20 2017, NeilBrown wrote:
> >>>>>>>>
> >>>>>>>>> On Fri, Feb 17 2017, Coly Li wrote:
> >>>>>>>>>
> >>>>>>>>>> On 2017/2/16 下午3:04, NeilBrown wrote: I know you are
> >>>>>>>>>> going to change this as Shaohua wantsthe spitting to 
> >>>>>>>>>> happen in a separate function, which I agree with, but
> >>>>>>>>>> there is something else wrong here. Calling
> >>>>>>>>>> bio_split/bio_chain repeatedly in a loop is dangerous.
> >>>>>>>>>> It is OK for simple devices, but when one request can
> >>>>>>>>>> wait for another request to the same device it can 
> >>>>>>>>>> deadlock. This can happen with raid1.  If a resync
> >>>>>>>>>> request calls raise_barrier() between one request and
> >>>>>>>>>> the next, then the next has to wait for the resync
> >>>>>>>>>> request, which has to wait for the first request. As
> >>>>>>>>>> the first request will be stuck in the queue in 
> >>>>>>>>>> generic_make_request(), you get a deadlock.
> >>>>>>>>>
> >>>>>>>>> For md raid1, queue in generic_make_request(), can I
> >>>>>>>>> understand it as bio_list_on_stack in this function? And
> >>>>>>>>> queue in underlying device, can I understand it as the
> >>>>>>>>> data structures like plug->pending and 
> >>>>>>>>> conf->pending_bio_list ?
> >>>>>>>>
> >>>>>>>> Yes, the queue in generic_make_request() is the
> >>>>>>>> bio_list_on_stack.  That is the only queue I am talking
> >>>>>>>> about.  I'm not referring to plug->pending or
> >>>>>>>> conf->pending_bio_list at all.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> I still don't get the point of deadlock, let me try to
> >>>>>>>>> explain why I don't see the possible deadlock. If a bio
> >>>>>>>>> is split, and the first part is processed by
> >>>>>>>>> make_request_fn(), and then a resync comes and it will 
> >>>>>>>>> raise a barrier, there are 3 possible conditions, - the
> >>>>>>>>> resync I/O tries to raise barrier on same bucket of the
> >>>>>>>>> first regular bio. Then the resync task has to wait to
> >>>>>>>>> the first bio drops its conf->nr_pending[idx]
> >>>>>>>>
> >>>>>>>> Not quite. First, the resync task (in raise_barrier()) will
> >>>>>>>> wait for ->nr_waiting[idx] to be zero.  We can assume this
> >>>>>>>> happens immediately. Then the resync_task will increment
> >>>>>>>> ->barrier[idx]. Only then will it wait for the first bio to
> >>>>>>>> drop ->nr_pending[idx]. The processing of that first bio
> >>>>>>>> will have submitted bios to the underlying device, and they
> >>>>>>>> will be in the bio_list_on_stack queue, and will not be
> >>>>>>>> processed until raid1_make_request() completes.
> >>>>>>>>
> >>>>>>>> The loop in raid1_make_request() will then call
> >>>>>>>> make_request_fn() which will call wait_barrier(), which
> >>>>>>>> will wait for ->barrier[idx] to be zero.
> >>>>>>>
> >>>>>>> Thinking more carefully about this.. the 'idx' that the
> >>>>>>> second bio will wait for will normally be different, so there
> >>>>>>> won't be a deadlock after all.
> >>>>>>>
> >>>>>>> However it is possible for hash_long() to produce the same
> >>>>>>> idx for two consecutive barrier_units so there is still the
> >>>>>>> possibility of a deadlock, though it isn't as likely as I
> >>>>>>> thought at first.
> >>>>>>
> >>>>>> Wrapped the function pointer issue Neil pointed out into Coly's
> >>>>>> original patch. Also fix a 'use-after-free' bug. For the
> >>>>>> deadlock issue, I'll add below patch, please check.
> >>>>>>
> >>>>>> Thanks, Shaohua
> >>>>>>
> >>>>>
> >>>
> >>> Neil,
> >>>
> >>> Thanks for your patient explanation, I feel I come to follow up what
> >>> you mean. Let me try to re-tell what I understand, correct me if I am
> >>> wrong.
> >>>
> >>>
> >>>>> Hmm, please hold, I am still thinking of it. With barrier bucket
> >>>>> and hash_long(), I don't see dead lock yet. For raid10 it might
> >>>>> happen, but once we have barrier bucket on it , there will no
> >>>>> deadlock.
> >>>>>
> >>>>> My question is, this deadlock only happens when a big bio is
> >>>>> split, and the split small bios are continuous, and the resync io
> >>>>> visiting barrier buckets in sequntial order too. In the case if
> >>>>> adjacent split regular bios or resync bios hit same barrier
> >>>>> bucket, it will be a very big failure of hash design, and should
> >>>>> have been found already. But no one complain it, so I don't
> >>>>> convince myself tje deadlock is real with io barrier buckets
> >>>>> (this is what Neil concerns).
> >>>>
> >>>> I think you are wrong about the design goal of a hash function. 
> >>>> When feed a sequence of inputs, with any stride (i.e. with any
> >>>> constant difference between consecutive inputs), the output of the
> >>>> hash function should appear to be random. A random sequence can
> >>>> produce the same number twice in a row. If the hash function
> >>>> produces a number from 0 to N-1, you would expect two consecutive
> >>>> outputs to be the same about once every N inputs.
> >>>>
> >>>
> >>> Yes, you are right. But when I mentioned hash conflict, I limit the
> >>> integers in range [0, 1<<38]. 38 is (64-17-9), when a 64bit LBA
> >>> address divided by 64MB I/O barrier unit size, its value range is
> >>> reduced to [0, 1<<38].
> >>>
> >>> Maximum size of normal bio is 1MB, it could be split into 2 bios at most.
> >>>
> >>> For DISCARD bio, its maximum size is 4GB, it could be split into 65
> >>> bios at most.
> >>>
> >>> Then in this patch, the hash question is degraded to: for any
> >>> consecutive 65 integers in range [0, 1<<38], use hash_long() to hash
> >>> these 65 integers into range [0, 1023], will any hash conflict happen
> >>> among these integers ?
> >>>
> >>> I tried a half range [0, 1<<37] to check hash conflict, by writing a
> >>> simple code to emulate hash calculation in the new I/O barrier patch,
> >>> to iterate all consecutive {2, 65, 128, 512} integers in range [0,
> >>> 1<<37] for hash conflict.
> >>>
> >>> On a 20 core CPU each run spent 7+ hours, finally I find no hash
> >>> conflict detected up to 512 consecutive integers in above limited
> >>> condition. For 1024, there are a lot hash conflict detected.
> >>>
> >>> [0, 1<<37] range back to [0, 63] LBA range, this is large enough for
> >>> almost all existing md raid configuration. So for current kernel
> >>> implementation and real world device, for a single bio, there is no
> >>> possible hash conflict the new I/O barrier patch.
> >>>
> >>> If bi_iter.bi_size changes from unsigned int to unsigned long in
> >>> future, the above assumption will be wrong. There will be hash
> >>> conflict, and potential dead lock, which is quite implicit. Yes, I
> >>> agree with you. No, bio split inside loop is not perfect.
> >>>
> >>>> Even if there was no possibility of a deadlock from a resync
> >>>> request happening between two bios, there are other possibilities.
> >>>>
> >>>
> >>> The bellowed text makes me know more about raid1 code, but confuses me
> >>> more as well. Here comes my questions,
> >>>
> >>>> It is not, in general, safe to call mempool_alloc() twice in a
> >>>> row, without first ensuring that the first allocation will get
> >>>> freed by some other thread.  raid1_write_request() allocates from
> >>>> r1bio_pool, and then submits bios to the underlying device, which
> >>>> get queued on bio_list_on_stack.  They will not be processed until
> >>>> after raid1_make_request() completes, so when raid1_make_request
> >>>> loops around and calls raid1_write_request() again, it will try to
> >>>> allocate another r1bio from r1bio_pool, and this might end up
> >>>> waiting for the r1bio which is trapped and cannot complete.
> >>>>
> >>>
> >>> Can I say that it is because blk_finish_plug() won't be called before
> >>> raid1_make_request() returns ? Then in raid1_write_request(), mbio
> >>> will be added into plug->pending, but before blk_finish_plug() is
> >>> called, they won't be handled.
> >>
> >> blk_finish_plug is called if raid1_make_request sleep. The bio is hold in
> >> current->bio_list, not in plug list.
> >>  
> > 
> > Oops, I messed them up,  thank you for the clarifying :-)
> > 
> >>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
> >>>> impossible.  If 256 threads all attempt a write (or read) that
> >>>> crosses a boundary, then they will consume all 256 preallocated
> >>>> entries, and want more. If there is no free memory, they will block
> >>>> indefinitely.
> >>>>
> >>>
> >>> If raid1_make_request() is modified into this way,
> >>> +	if (bio_data_dir(split) == READ)
> >>> +		raid1_read_request(mddev, split);
> >>> +	else
> >>> +		raid1_write_request(mddev, split);
> >>> +	if (split != bio)
> >>> +		generic_make_request(bio);
> >>>
> >>> Then the original bio will be added into the bio_list_on_stack of top
> >>> level generic_make_request(), current->bio_list is initialized, when
> >>> generic_make_request() is called nested in raid1_make_request(), the
> >>> split bio will be added into current->bio_list and nothing else happens.
> >>>
> >>> After the nested generic_make_request() returns, the code back to next
> >>> code of generic_make_request(),
> >>> 2022                         ret = q->make_request_fn(q, bio);
> >>> 2023
> >>> 2024                         blk_queue_exit(q);
> >>> 2025
> >>> 2026                         bio = bio_list_pop(current->bio_list);
> >>>
> >>> bio_list_pop() will return the second half of the split bio, and it is
> >>
> >> So in above sequence, the curent->bio_list will has bios in below sequence:
> >> bios to underlaying disks, second half of original bio
> >>
> >> bio_list_pop will pop bios to underlaying disks first, handle them, then the
> >> second half of original bio.
> >>
> >> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
> >> array, handling the middle layer bio will make the 3rd layer bio hold to
> >> bio_list again.
> >>
> > 
> > Could you please give me more hint,
> > - What is the meaning of "hold" from " make the 3rd layer bio hold to
> > bio_list again" ?
> > - Why deadlock happens if the 3rd layer bio hold to bio_list again ?
> 
> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
> bucket size to 8MB, running for 10 hours, there is no deadlock observed,
> 
> Here is how the 4 layer stacked raid1 setup,
> - There are 4 NVMe SSDs, on each SSD I create four 500GB partition,
>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
> - Here is how the 4 layer stacked raid1 assembled, level 1 means the top
> level, level 4 means the bottom level in the stacked devices,
>   - level 1:
> 	/dev/md40: /dev/md30  /dev/md31
>   - level 2:
> 	/dev/md30: /dev/md20  /dev/md21
> 	/dev/md31: /dev/md22  /dev/md23
>   - level 3:
> 	/dev/md20: /dev/md10  /dev/md11
> 	/dev/md21: /dev/md12  /dev/md13
> 	/dev/md22: /dev/md14  /dev/md15
> 	/dev/md23: /dev/md16  /dev/md17
>   - level 4:
> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
> 
> Here is the fio job file,
> [global]
> direct=1
> thread=1
> ioengine=libaio
> 
> [job]
> filename=/dev/md40
> readwrite=write
> numjobs=10
> blocksize=33M
> iodepth=128
> time_based=1
> runtime=10h
> 
> I planed to learn how the deadlock comes by analyze a deadlock
> condition. Maybe it was because 8MB bucket unit size is small enough,
> now I try to run with 512K bucket unit size, and see whether I can
> encounter a deadlock.

I don't think raid1 could easily trigger the deadlock; maybe you should try
raid10. The resync case is hard to trigger for raid1, and the memory pressure
case is hard to trigger for both raid1 and raid10, but it is possible to trigger.

The 3-layer case is something like this:
1. in level1, set current->bio_list, split bio to bio1 and bio2
2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
3. queue bio2 in current->bio_list
4. generic_make_request then pops bio1-level2
5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock

The problem is that we add new bios to the tail of current->bio_list.
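
As an illustration of the ordering in steps 1-6, here is a tiny user-space
sketch (my own, not kernel code; the names are made up) that assumes every
split or remapped bio is simply appended to the tail of one
current->bio_list, with no hand-off to a raid1d thread. It only prints the
dispatch order, showing that bio2 is popped before bio1's level-3 child;
if bio2 had to wait on a barrier held until bio1 completes, nothing would
ever dispatch bio1-level2-level3.

#include <stdio.h>
#include <string.h>

#define MAX 16

static const char *fifo[MAX];
static int head, tail;

static void push(const char *bio) { fifo[tail++] = bio; }   /* append to tail */
static const char *pop(void)      { return head < tail ? fifo[head++] : NULL; }

int main(void)
{
	const char *bio;

	/* steps 1-3: level1 splits the original bio and queues both pieces */
	push("bio1-level2");        /* first half, remapped to the level2 disk */
	push("bio2");               /* second half of the original bio */

	while ((bio = pop()) != NULL) {
		printf("dispatch %s\n", bio);
		/* step 5: handling bio1-level2 remaps it again and re-queues
		 * it, so it lands behind bio2 in the list */
		if (!strcmp(bio, "bio1-level2"))
			push("bio1-level2-level3");
	}
	return 0;
}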

> =============== P.S ==============
> When I run the stacked raid1 testing, I feel I see something behavior
> suspiciously, it is resync.
> 
> The second time when I rebuild all the raid1 devices by "mdadm -C
> /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I see the top level raid1 device
> /dev/md40 already accomplished 50%+ resync. I don't think it could be
> that fast...

no idea, is this reproducible?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-23 17:34                       ` Shaohua Li
@ 2017-02-23 19:31                         ` Coly Li
  2017-02-23 19:58                           ` Shaohua Li
  2017-02-24 10:19                           ` 王金浦
  0 siblings, 2 replies; 43+ messages in thread
From: Coly Li @ 2017-02-23 19:31 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/24 上午1:34, Shaohua Li wrote:
> On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
[snip]
>>>>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
>>>>>> impossible.  If 256 threads all attempt a write (or read) that
>>>>>> crosses a boundary, then they will consume all 256 preallocated
>>>>>> entries, and want more. If there is no free memory, they will block
>>>>>> indefinitely.
>>>>>>
>>>>>
>>>>> If raid1_make_request() is modified into this way,
>>>>> +	if (bio_data_dir(split) == READ)
>>>>> +		raid1_read_request(mddev, split);
>>>>> +	else
>>>>> +		raid1_write_request(mddev, split);
>>>>> +	if (split != bio)
>>>>> +		generic_make_request(bio);
>>>>>
>>>>> Then the original bio will be added into the bio_list_on_stack of top
>>>>> level generic_make_request(), current->bio_list is initialized, when
>>>>> generic_make_request() is called nested in raid1_make_request(), the
>>>>> split bio will be added into current->bio_list and nothing else happens.
>>>>>
>>>>> After the nested generic_make_request() returns, the code back to next
>>>>> code of generic_make_request(),
>>>>> 2022                         ret = q->make_request_fn(q, bio);
>>>>> 2023
>>>>> 2024                         blk_queue_exit(q);
>>>>> 2025
>>>>> 2026                         bio = bio_list_pop(current->bio_list);
>>>>>
>>>>> bio_list_pop() will return the second half of the split bio, and it is
>>>>
>>>> So in above sequence, the curent->bio_list will has bios in below sequence:
>>>> bios to underlaying disks, second half of original bio
>>>>
>>>> bio_list_pop will pop bios to underlaying disks first, handle them, then the
>>>> second half of original bio.
>>>>
>>>> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
>>>> array, handling the middle layer bio will make the 3rd layer bio hold to
>>>> bio_list again.
>>>>
>>>
>>> Could you please give me more hint,
>>> - What is the meaning of "hold" from " make the 3rd layer bio hold to
>>> bio_list again" ?
>>> - Why deadlock happens if the 3rd layer bio hold to bio_list again ?
>>
>> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
>> bucket size to 8MB, running for 10 hours, there is no deadlock observed,
>>
>> Here is how the 4 layer stacked raid1 setup,
>> - There are 4 NVMe SSDs, on each SSD I create four 500GB partition,
>>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
>>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
>>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
>>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
>> - Here is how the 4 layer stacked raid1 assembled, level 1 means the top
>> level, level 4 means the bottom level in the stacked devices,
>>   - level 1:
>> 	/dev/md40: /dev/md30  /dev/md31
>>   - level 2:
>> 	/dev/md30: /dev/md20  /dev/md21
>> 	/dev/md31: /dev/md22  /dev/md23
>>   - level 3:
>> 	/dev/md20: /dev/md10  /dev/md11
>> 	/dev/md21: /dev/md12  /dev/md13
>> 	/dev/md22: /dev/md14  /dev/md15
>> 	/dev/md23: /dev/md16  /dev/md17
>>   - level 4:
>> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
>> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
>> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
>> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
>> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
>> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
>> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
>> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
>>
>> Here is the fio job file,
>> [global]
>> direct=1
>> thread=1
>> ioengine=libaio
>>
>> [job]
>> filename=/dev/md40
>> readwrite=write
>> numjobs=10
>> blocksize=33M
>> iodepth=128
>> time_based=1
>> runtime=10h
>>
>> I planed to learn how the deadlock comes by analyze a deadlock
>> condition. Maybe it was because 8MB bucket unit size is small enough,
>> now I try to run with 512K bucket unit size, and see whether I can
>> encounter a deadlock.
> 
> Don't think raid1 could easily trigger the deadlock. Maybe you should try
> raid10. The resync case is hard to trigger for raid1. The memory pressure case
> is hard to trigger for both raid1/10. But it's possible to trigger.
> 
> The 3-layer case is something like this:

Hi Shaohua,

I am trying to catch up with you; let me follow your reasoning for the
split-in-while-loop case (i.e. my new I/O barrier patch). I assume the
original bio is a write bio, and that it is split and handled in a while
loop in raid1_make_request().

> 1. in level1, set current->bio_list, split bio to bio1 and bio2

This is done in level1 raid1_make_request().

> 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list

The remap is done by raid1_write_request(), and bio1_level2 may be added
into one of two lists:
- plug->pending:
  Bios in plug->pending may be handled in raid1_unplug(), or in
flush_pending_writes() of raid1d().
  If the current task is about to be scheduled out, raid1_unplug() will
merge plug->pending's bios into conf->pending_bio_list, and
conf->pending_bio_list will be handled in raid1d.
  If raid1_unplug() is triggered by blk_finish_plug(), the bios are also
handled in raid1d.

- conf->pending_bio_list:
  Bios in this list are handled in raid1d by calling flush_pending_writes().


So the generic_make_request() that handles bio1_level2 can only be called
in the context of the raid1d thread; bio1_level2 is added into raid1d's
bio_list_on_stack, not that of the caller of the level1
generic_make_request().

> 3. queue bio2 in current->bio_list

The same here: bio2_level2 ends up in the level1 raid1d's bio_list_on_stack.
Then we go back to the level1 generic_make_request().

> 4. generic_make_request then pops bio1-level2

At this moment, bio1_level2 and bio2_level2 are in either plug->pending
or conf->pending_bio_list, so bio_list_pop() returns NULL and the level1
generic_make_request() returns to its caller.

If, before bio_list_pop() is called, the kernel thread raid1d wakes up
and iterates conf->pending_bio_list in flush_pending_writes(), or iterates
plug->pending in raid1_unplug() via blk_finish_plug(), that happens on the
level1 raid1d's stack; the bios will not show up in the level1
generic_make_request(), and bio_list_pop() still returns NULL.

> 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list

bio2_level2 is at the head of conf->pending_bio_list or plug->pending, so
bio2_level2 is handled first.

Level1 raid1 calls the level2 generic_make_request(), then level2
raid1_make_request() is called, then level2 raid1_write_request().
bio2_level2 is remapped to bio2_level3 and added into plug->pending
(level1 raid1d's context) or conf->pending_bio_list (level2 raid1's conf);
it will be handled when level2 raid1d wakes up. Then control returns to
level1 raid1, and bio1_level2 is handled by the level2
generic_make_request() and added into the level2 plug->pending or
conf->pending_bio_list. In this case neither bio2_level2 nor bio1_level2
is added into any bio_list_on_stack.

Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
and sleeps.

Then level2 raid1d wakes up and handles bio2_level3 and bio1_level3, by
iterating the level2 plug->pending or conf->pending_bio_list and calling
the level3 generic_make_request().

In the level3 generic_make_request(), because this is the level2 raid1d
context, not the level1 raid1d context, bio2_level3 is sent into
q->make_request_fn() and finally added into the level3 plug->pending or
conf->pending_bio_list, and then we return to the level3
generic_make_request().

Now level2 raid1d's current->bio_list is empty, so the level3
generic_make_request() returns to level2 raid1d, which continues to
iterate and sends bio1_level3 into the level3 generic_make_request().

After all bios are added into level3 plug->pending or
conf->pending_bio_list, level2 raid1d sleeps.

Now level3 raid1d wakes up and continues to iterate the level3
plug->pending or conf->pending_bio_list, calling generic_make_request()
on the underlying devices (which might be real devices).

Throughout the whole path above, each lower-level generic_make_request()
is called in the context of a raid1d thread; no recursive call happens in
the normal code path.

In the raid1 code, a recursive call of generic_make_request() only happens
for READ bios, but if the array is not frozen, no barrier is required, so
it doesn't hurt.


> 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock

As I understand the code, it won't happen either.

> 
> The problem is because we add new bio to current->bio_list tail.

New bios are added into another context's current->bio_list, which is a
different list. If my understanding is correct, a deadlock won't happen
this way.

If my understanding is correct, I suddenly realize why raid1 bios are
handled indirectly in another kernel thread.
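
As an illustration of that design, here is a minimal user-space model
(names like queue_write() and raid1d() are made up for this sketch, and
plugging is ignored): the submitting context only queues the write and
returns, while a separate daemon thread pops the pending list and issues
it to the lower level, so nothing recurses through the submitter's
current->bio_list.

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct bio { int id; struct bio *next; };

static struct bio *pending_head, *pending_tail; /* conf->pending_bio_list stand-in */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
static int stopping;

/* stand-in for raid1_write_request(): queue the bio and return immediately */
static void queue_write(int id)
{
	struct bio *b = malloc(sizeof(*b));

	b->id = id;
	b->next = NULL;
	pthread_mutex_lock(&lock);
	if (pending_tail)
		pending_tail->next = b;
	else
		pending_head = b;
	pending_tail = b;
	pthread_cond_signal(&wake);
	pthread_mutex_unlock(&lock);
}

/* stand-in for raid1d()/flush_pending_writes(): dispatch in its own context */
static void *raid1d(void *arg)
{
	(void)arg;
	for (;;) {
		pthread_mutex_lock(&lock);
		while (!pending_head && !stopping)
			pthread_cond_wait(&wake, &lock);
		struct bio *list = pending_head;
		int stop = stopping;
		pending_head = pending_tail = NULL;
		pthread_mutex_unlock(&lock);

		while (list) {
			struct bio *b = list;
			list = b->next;
			printf("raid1d dispatches bio %d to the lower level\n", b->id);
			free(b);
		}
		if (stop)
			return NULL;
	}
}

int main(void)
{
	pthread_t d;
	int i;

	pthread_create(&d, NULL, raid1d, NULL);
	for (i = 0; i < 4; i++)
		queue_write(i);	/* the submitter never recurses into the lower level */

	pthread_mutex_lock(&lock);
	stopping = 1;
	pthread_cond_signal(&wake);
	pthread_mutex_unlock(&lock);
	pthread_join(d, NULL);
	return 0;
}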

(Just for your information, as I write this, another run of testing has
finished with no deadlock. This time I reduced the I/O barrier bucket
unit size to 512KB and set blocksize to 33MB in the fio job file. It is
really slow (130MB/s), but no deadlock was observed.)


The stacked raid1 devices are really confusing; if I am wrong, any hint
is warmly welcome.

> 
>> =============== P.S ==============
>> When I run the stacked raid1 testing, I feel I see something behavior
>> suspiciously, it is resync.
>>
>> The second time when I rebuild all the raid1 devices by "mdadm -C
>> /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I see the top level raid1 device
>> /dev/md40 already accomplished 50%+ resync. I don't think it could be
>> that fast...
> 
> no idea, is this reproducible?

It can be reliably reproduced. I need to check whether the bitmap is
cleaned when creating a stacked raid1. This is a little off topic for
this thread; once I have some idea, I will start another one. Hopefully
it really is just that fast.

Coly

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-23 19:31                         ` Coly Li
@ 2017-02-23 19:58                           ` Shaohua Li
  2017-02-24 17:02                             ` Coly Li
  2017-02-24 10:19                           ` 王金浦
  1 sibling, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-23 19:58 UTC (permalink / raw)
  To: Coly Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On Fri, Feb 24, 2017 at 03:31:16AM +0800, Coly Li wrote:
> On 2017/2/24 上午1:34, Shaohua Li wrote:
> > On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
> [snip]
> >>>>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
> >>>>>> impossible.  If 256 threads all attempt a write (or read) that
> >>>>>> crosses a boundary, then they will consume all 256 preallocated
> >>>>>> entries, and want more. If there is no free memory, they will block
> >>>>>> indefinitely.
> >>>>>>
> >>>>>
> >>>>> If raid1_make_request() is modified into this way,
> >>>>> +	if (bio_data_dir(split) == READ)
> >>>>> +		raid1_read_request(mddev, split);
> >>>>> +	else
> >>>>> +		raid1_write_request(mddev, split);
> >>>>> +	if (split != bio)
> >>>>> +		generic_make_request(bio);
> >>>>>
> >>>>> Then the original bio will be added into the bio_list_on_stack of top
> >>>>> level generic_make_request(), current->bio_list is initialized, when
> >>>>> generic_make_request() is called nested in raid1_make_request(), the
> >>>>> split bio will be added into current->bio_list and nothing else happens.
> >>>>>
> >>>>> After the nested generic_make_request() returns, the code back to next
> >>>>> code of generic_make_request(),
> >>>>> 2022                         ret = q->make_request_fn(q, bio);
> >>>>> 2023
> >>>>> 2024                         blk_queue_exit(q);
> >>>>> 2025
> >>>>> 2026                         bio = bio_list_pop(current->bio_list);
> >>>>>
> >>>>> bio_list_pop() will return the second half of the split bio, and it is
> >>>>
> >>>> So in above sequence, the curent->bio_list will has bios in below sequence:
> >>>> bios to underlaying disks, second half of original bio
> >>>>
> >>>> bio_list_pop will pop bios to underlaying disks first, handle them, then the
> >>>> second half of original bio.
> >>>>
> >>>> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
> >>>> array, handling the middle layer bio will make the 3rd layer bio hold to
> >>>> bio_list again.
> >>>>
> >>>
> >>> Could you please give me more hint,
> >>> - What is the meaning of "hold" from " make the 3rd layer bio hold to
> >>> bio_list again" ?
> >>> - Why deadlock happens if the 3rd layer bio hold to bio_list again ?
> >>
> >> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
> >> bucket size to 8MB, running for 10 hours, there is no deadlock observed,
> >>
> >> Here is how the 4 layer stacked raid1 setup,
> >> - There are 4 NVMe SSDs, on each SSD I create four 500GB partition,
> >>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
> >>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
> >>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
> >>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
> >> - Here is how the 4 layer stacked raid1 assembled, level 1 means the top
> >> level, level 4 means the bottom level in the stacked devices,
> >>   - level 1:
> >> 	/dev/md40: /dev/md30  /dev/md31
> >>   - level 2:
> >> 	/dev/md30: /dev/md20  /dev/md21
> >> 	/dev/md31: /dev/md22  /dev/md23
> >>   - level 3:
> >> 	/dev/md20: /dev/md10  /dev/md11
> >> 	/dev/md21: /dev/md12  /dev/md13
> >> 	/dev/md22: /dev/md14  /dev/md15
> >> 	/dev/md23: /dev/md16  /dev/md17
> >>   - level 4:
> >> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
> >> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
> >> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
> >> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
> >> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
> >> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
> >> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
> >> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
> >>
> >> Here is the fio job file,
> >> [global]
> >> direct=1
> >> thread=1
> >> ioengine=libaio
> >>
> >> [job]
> >> filename=/dev/md40
> >> readwrite=write
> >> numjobs=10
> >> blocksize=33M
> >> iodepth=128
> >> time_based=1
> >> runtime=10h
> >>
> >> I planed to learn how the deadlock comes by analyze a deadlock
> >> condition. Maybe it was because 8MB bucket unit size is small enough,
> >> now I try to run with 512K bucket unit size, and see whether I can
> >> encounter a deadlock.
> > 
> > Don't think raid1 could easily trigger the deadlock. Maybe you should try
> > raid10. The resync case is hard to trigger for raid1. The memory pressure case
> > is hard to trigger for both raid1/10. But it's possible to trigger.
> > 
> > The 3-layer case is something like this:
> 
> Hi Shaohua,
> 
> I try to catch up with you, let me try to follow your mind by the
> split-in-while-loop condition (this is my new I/O barrier patch). I
> assume the original BIO is a write bio, and original bio is split and
> handled in a while loop in raid1_make_request().
> 
> > 1. in level1, set current->bio_list, split bio to bio1 and bio2
> 
> This is done in level1 raid1_make_request().
> 
> > 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
> 
> Remap is done by raid1_write_request(), and bio1_level may be added into
> one of the two list:
> - plug->pending:
>   bios in plug->pending may be handled in raid1_unplug(), or in
> flush_pending_writes() of raid1d().
>   If current task is about to be scheduled, raid1_unplug() will merge
> plug->pending's bios to conf->pending_bio_list. And
> conf->pending_bio_list will be handled in raid1d.
>   If raid1_unplug() is triggered by blk_finish_plug(), it is also
> handled in raid1d.
> 
> - conf->pending_bio_list:
>   bios in this list is handled in raid1d by calling flush_pending_writes().
> 
> 
> So generic_make_request() to handle bio1_level2 can only be called in
> context of raid1d thread, bio1_level2 is added into raid1d's
> bio_list_on_stack, not caller of level1 generic_make_request().
> 
> > 3. queue bio2 in current->bio_list
> 
> Same, bio2_level2 is in level1 raid1d's bio_list_on_stack.
> Then back to level1 generic_make_request()
> 
> > 4. generic_make_request then pops bio1-level2
> 
> At this moment, bio1_level2 and bio2_level2 are in either plug->pending
> or conf->pending_bio_list, bio_list_pop() returns NULL, and level1
> generic_make_request() returns to its caller.
> 
> If before bio_list_pop() called, kernel thread raid1d wakes up and
> iterates conf->pending_bio_list in flush_pending_writes() or iterate
> plug->pending in raid1_unplug() by blk_finish_plug(), that happens in
> level1 raid1d's stack, bios will not show up in level1
> generic_make_reques(), bio_list_pop() still returns NULL.
> 
> > 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
> 
> bio2_level2 is at head of conf->pending_bio_list or plug->pending, so
> bio2_level2 is handled firstly.
> 
> level1 raid1 calls level2 generic_make_request(), then level2
> raid1_make_request() is called, then level raid1_write_request().
> bio2_level2 is remapped to bio2_level3, added into plug->pending (level1
> raid1d's context) or conf->pending_bio_list (level2 raid1's conf), it
> will be handled by level2 raid1d, when level2 raid1d wakes up.
> Then returns back to level1 raid1, bio1_level2
> is handled by level2 generic_make_request() and added into level2
> plug->pending or conf->pending_bio_list. In this case neither
> bio2_level2 nor bio1_level is added into any bio_list_on_stack.
> 
> Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
> and sleeps.
> 
> Then level2 raid1d wakes up, and handle bio2_level3 and bio1_level3, by
> iterate level2 plug->pending or conf->pending_bio_list, and calling
> level3 generic_make_request().
> 
> In level3 generic_make_request(), because it is level2 raid1d context,
> not level1 raid1d context, bio2_level3 is send into
> q->make_request_fn(), and finally added into level3 plug->pending or
> conf->pending_bio_list, then back to level3 generic_make_reqeust().
> 
> Now level2 raid1d's current->bio_list is empty, so level3
> generic_make_request() returns to level2 raid1d and continue to iterate
> and send bio1_level3 into level3 generic_make_request().
> 
> After all bios are added into level3 plug->pending or
> conf->pending_bio_list, level2 raid1d sleeps.
> 
> Now level3 raid1d wakes up, continue to iterate level3 plug->pending or
> conf->pending_bio_list by calling generic_make_request() to underlying
> devices (which might be a read device).
> 
> On the above whole patch, each lower level generic_make_request() is
> called in context of the lower level raid1d. No recursive call happens
> in normal code path.
> 
> In raid1 code, recursive call of generic_make_request() only happens for
> READ bio, but if array is not frozen, no barrier is required, it doesn't
> hurt.
> 
> 
> > 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock
> 
> As my understand to the code, it won't happen neither.
> 
> > 
> > The problem is because we add new bio to current->bio_list tail.
> 
> New bios are added into other context's current->bio_list, which are
> different lists. If what I understand is correct, a dead lock won't
> happen in this way.
> 
> If my understanding is correct, suddenly I come to realize why raid1
> bios are handled indirectly in another kernel thread.
> 
> (Just for your information, when I write to this location, another run
> of testing finished, no deadlock. This time I reduce I/O barrier bucket
> unit size to 512KB, and set blocksize to 33MB in fio job file. It is
> really slow (130MB/s), but no deadlock observed)
> 
> 
> The stacked raid1 devices are really really confused, if I am wrong, any
> hint is warmly welcome.

Aha, you are correct. I missed that we never directly dispatch bios in a
schedule-based blk-plug flush. I'll drop the patch. Thanks for the
insistence, good discussion!

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-23  5:54                     ` Coly Li
  2017-02-23 17:34                       ` Shaohua Li
@ 2017-02-23 23:14                       ` NeilBrown
  2017-02-24 17:06                         ` Coly Li
  1 sibling, 1 reply; 43+ messages in thread
From: NeilBrown @ 2017-02-23 23:14 UTC (permalink / raw)
  To: Coly Li, Shaohua Li
  Cc: linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

[-- Attachment #1: Type: text/plain, Size: 378 bytes --]

On Thu, Feb 23 2017, Coly Li wrote:

>
> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
> bucket size to 8MB, running for 10 hours, there is no deadlock observed,

Try setting BARRIER_BUCKETS_NR to '1' and BARRIER_UNIT_SECTOR_BITS to 3
and make sure the write requests are larger than 1 page (and have resync
happen at the same time as writes).

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-23 19:31                         ` Coly Li
  2017-02-23 19:58                           ` Shaohua Li
@ 2017-02-24 10:19                           ` 王金浦
  2017-02-28 19:42                             ` Shaohua Li
  1 sibling, 1 reply; 43+ messages in thread
From: 王金浦 @ 2017-02-24 10:19 UTC (permalink / raw)
  To: Coly Li
  Cc: Shaohua Li, NeilBrown, linux-raid, Shaohua Li,
	Johannes Thumshirn, Guoqing Jiang

Hi Coly, Hi Shaohua,


>
> Hi Shaohua,
>
> I try to catch up with you, let me try to follow your mind by the
> split-in-while-loop condition (this is my new I/O barrier patch). I
> assume the original BIO is a write bio, and original bio is split and
> handled in a while loop in raid1_make_request().

It's still possible for a read bio; we hit a deadlock in the past.
See https://patchwork.kernel.org/patch/9498949/

Also:
http://www.spinics.net/lists/raid/msg52792.html

Regards,
Jack

>
>> 1. in level1, set current->bio_list, split bio to bio1 and bio2
>
> This is done in level1 raid1_make_request().
>
>> 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
>
> Remap is done by raid1_write_request(), and bio1_level may be added into
> one of the two list:
> - plug->pending:
>   bios in plug->pending may be handled in raid1_unplug(), or in
> flush_pending_writes() of raid1d().
>   If current task is about to be scheduled, raid1_unplug() will merge
> plug->pending's bios to conf->pending_bio_list. And
> conf->pending_bio_list will be handled in raid1d.
>   If raid1_unplug() is triggered by blk_finish_plug(), it is also
> handled in raid1d.
>
> - conf->pending_bio_list:
>   bios in this list is handled in raid1d by calling flush_pending_writes().
>
>
> So generic_make_request() to handle bio1_level2 can only be called in
> context of raid1d thread, bio1_level2 is added into raid1d's
> bio_list_on_stack, not caller of level1 generic_make_request().
>
>> 3. queue bio2 in current->bio_list
>
> Same, bio2_level2 is in level1 raid1d's bio_list_on_stack.
> Then back to level1 generic_make_request()
>
>> 4. generic_make_request then pops bio1-level2
>
> At this moment, bio1_level2 and bio2_level2 are in either plug->pending
> or conf->pending_bio_list, bio_list_pop() returns NULL, and level1
> generic_make_request() returns to its caller.
>
> If before bio_list_pop() called, kernel thread raid1d wakes up and
> iterates conf->pending_bio_list in flush_pending_writes() or iterate
> plug->pending in raid1_unplug() by blk_finish_plug(), that happens in
> level1 raid1d's stack, bios will not show up in level1
> generic_make_reques(), bio_list_pop() still returns NULL.
>
>> 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
>
> bio2_level2 is at head of conf->pending_bio_list or plug->pending, so
> bio2_level2 is handled firstly.
>
> level1 raid1 calls level2 generic_make_request(), then level2
> raid1_make_request() is called, then level raid1_write_request().
> bio2_level2 is remapped to bio2_level3, added into plug->pending (level1
> raid1d's context) or conf->pending_bio_list (level2 raid1's conf), it
> will be handled by level2 raid1d, when level2 raid1d wakes up.
> Then returns back to level1 raid1, bio1_level2
> is handled by level2 generic_make_request() and added into level2
> plug->pending or conf->pending_bio_list. In this case neither
> bio2_level2 nor bio1_level is added into any bio_list_on_stack.
>
> Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
> and sleeps.
>
> Then level2 raid1d wakes up, and handle bio2_level3 and bio1_level3, by
> iterate level2 plug->pending or conf->pending_bio_list, and calling
> level3 generic_make_request().
>
> In level3 generic_make_request(), because it is level2 raid1d context,
> not level1 raid1d context, bio2_level3 is send into
> q->make_request_fn(), and finally added into level3 plug->pending or
> conf->pending_bio_list, then back to level3 generic_make_reqeust().
>
> Now level2 raid1d's current->bio_list is empty, so level3
> generic_make_request() returns to level2 raid1d and continue to iterate
> and send bio1_level3 into level3 generic_make_request().
>
> After all bios are added into level3 plug->pending or
> conf->pending_bio_list, level2 raid1d sleeps.
>
> Now level3 raid1d wakes up, continue to iterate level3 plug->pending or
> conf->pending_bio_list by calling generic_make_request() to underlying
> devices (which might be a read device).
>
> On the above whole patch, each lower level generic_make_request() is
> called in context of the lower level raid1d. No recursive call happens
> in normal code path.
>
> In raid1 code, recursive call of generic_make_request() only happens for
> READ bio, but if array is not frozen, no barrier is required, it doesn't
> hurt.
>
>
>> 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock
>
> As my understand to the code, it won't happen neither.
>
>>
>> The problem is because we add new bio to current->bio_list tail.
>
> New bios are added into other context's current->bio_list, which are
> different lists. If what I understand is correct, a dead lock won't
> happen in this way.
>
> If my understanding is correct, suddenly I come to realize why raid1
> bios are handled indirectly in another kernel thread.
>
> (Just for your information, when I write to this location, another run
> of testing finished, no deadlock. This time I reduce I/O barrier bucket
> unit size to 512KB, and set blocksize to 33MB in fio job file. It is
> really slow (130MB/s), but no deadlock observed)
>
>
> The stacked raid1 devices are really really confused, if I am wrong, any
> hint is warmly welcome.
>
>>
>>> =============== P.S ==============
>>> When I run the stacked raid1 testing, I feel I see something behavior
>>> suspiciously, it is resync.
>>>
>>> The second time when I rebuild all the raid1 devices by "mdadm -C
>>> /dev/mdXX -l 1 -n 2 /dev/xxx /dev/xxx", I see the top level raid1 device
>>> /dev/md40 already accomplished 50%+ resync. I don't think it could be
>>> that fast...
>>
>> no idea, is this reproducible?
>
> It can be stably reproduced. I need to check whether bitmap is cleaned
> when create a stacked raid1. This is a little off topic in this thread,
> once I have some idea, I will send out another topic. Hope it is just so
> fast.
>
> Coly
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-23 19:58                           ` Shaohua Li
@ 2017-02-24 17:02                             ` Coly Li
  0 siblings, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-24 17:02 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/24 上午3:58, Shaohua Li wrote:
> On Fri, Feb 24, 2017 at 03:31:16AM +0800, Coly Li wrote:
>> On 2017/2/24 上午1:34, Shaohua Li wrote:
>>> On Thu, Feb 23, 2017 at 01:54:47PM +0800, Coly Li wrote:
>> [snip]
>>>>>>>> As r1bio_pool preallocates 256 entries, this is unlikely  but not 
>>>>>>>> impossible.  If 256 threads all attempt a write (or read) that
>>>>>>>> crosses a boundary, then they will consume all 256 preallocated
>>>>>>>> entries, and want more. If there is no free memory, they will block
>>>>>>>> indefinitely.
>>>>>>>>
>>>>>>>
>>>>>>> If raid1_make_request() is modified into this way,
>>>>>>> +	if (bio_data_dir(split) == READ)
>>>>>>> +		raid1_read_request(mddev, split);
>>>>>>> +	else
>>>>>>> +		raid1_write_request(mddev, split);
>>>>>>> +	if (split != bio)
>>>>>>> +		generic_make_request(bio);
>>>>>>>
>>>>>>> Then the original bio will be added into the bio_list_on_stack of top
>>>>>>> level generic_make_request(), current->bio_list is initialized, when
>>>>>>> generic_make_request() is called nested in raid1_make_request(), the
>>>>>>> split bio will be added into current->bio_list and nothing else happens.
>>>>>>>
>>>>>>> After the nested generic_make_request() returns, the code back to next
>>>>>>> code of generic_make_request(),
>>>>>>> 2022                         ret = q->make_request_fn(q, bio);
>>>>>>> 2023
>>>>>>> 2024                         blk_queue_exit(q);
>>>>>>> 2025
>>>>>>> 2026                         bio = bio_list_pop(current->bio_list);
>>>>>>>
>>>>>>> bio_list_pop() will return the second half of the split bio, and it is
>>>>>>
>>>>>> So in above sequence, the curent->bio_list will has bios in below sequence:
>>>>>> bios to underlaying disks, second half of original bio
>>>>>>
>>>>>> bio_list_pop will pop bios to underlaying disks first, handle them, then the
>>>>>> second half of original bio.
>>>>>>
>>>>>> That said, this doesn't work for array stacked 3 layers. Because in 3-layer
>>>>>> array, handling the middle layer bio will make the 3rd layer bio hold to
>>>>>> bio_list again.
>>>>>>
>>>>>
>>>>> Could you please give me more hint,
>>>>> - What is the meaning of "hold" from " make the 3rd layer bio hold to
>>>>> bio_list again" ?
>>>>> - Why deadlock happens if the 3rd layer bio hold to bio_list again ?
>>>>
>>>> I tried to set up a 4 layer stacked md raid1, and reduce I/O barrier
>>>> bucket size to 8MB, running for 10 hours, there is no deadlock observed,
>>>>
>>>> Here is how the 4 layer stacked raid1 setup,
>>>> - There are 4 NVMe SSDs, on each SSD I create four 500GB partition,
>>>>   /dev/nvme0n1:  nvme0n1p1, nvme0n1p2, nvme0n1p3, nvme0n1p4
>>>>   /dev/nvme1n1:  nvme1n1p1, nvme1n1p2, nvme1n1p3, nvme1n1p4
>>>>   /dev/nvme2n1:  nvme2n1p1, nvme2n1p2, nvme2n1p3, nvme2n1p4
>>>>   /dev/nvme3n1:  nvme3n1p1, nvme3n1p2, nvme3n1p3, nvme3n1p4
>>>> - Here is how the 4 layer stacked raid1 assembled, level 1 means the top
>>>> level, level 4 means the bottom level in the stacked devices,
>>>>   - level 1:
>>>> 	/dev/md40: /dev/md30  /dev/md31
>>>>   - level 2:
>>>> 	/dev/md30: /dev/md20  /dev/md21
>>>> 	/dev/md31: /dev/md22  /dev/md23
>>>>   - level 3:
>>>> 	/dev/md20: /dev/md10  /dev/md11
>>>> 	/dev/md21: /dev/md12  /dev/md13
>>>> 	/dev/md22: /dev/md14  /dev/md15
>>>> 	/dev/md23: /dev/md16  /dev/md17
>>>>   - level 4:
>>>> 	/dev/md10: /dev/nvme0n1p1  /dev/nvme1n1p1
>>>> 	/dev/md11: /dev/nvme2n1p1  /dev/nvme3n1p1
>>>> 	/dev/md12: /dev/nvme0n1p2  /dev/nvme1n1p2
>>>> 	/dev/md13: /dev/nvme2n1p2  /dev/nvme3n1p2
>>>> 	/dev/md14: /dev/nvme0n1p3  /dev/nvme1n1p3
>>>> 	/dev/md15: /dev/nvme2n1p3  /dev/nvme3n1p3
>>>> 	/dev/md16: /dev/nvme0n1p4  /dev/nvme1n1p4
>>>> 	/dev/md17: /dev/nvme2n1p4  /dev/nvme3n1p4
>>>>
>>>> Here is the fio job file,
>>>> [global]
>>>> direct=1
>>>> thread=1
>>>> ioengine=libaio
>>>>
>>>> [job]
>>>> filename=/dev/md40
>>>> readwrite=write
>>>> numjobs=10
>>>> blocksize=33M
>>>> iodepth=128
>>>> time_based=1
>>>> runtime=10h
>>>>
>>>> I planed to learn how the deadlock comes by analyze a deadlock
>>>> condition. Maybe it was because 8MB bucket unit size is small enough,
>>>> now I try to run with 512K bucket unit size, and see whether I can
>>>> encounter a deadlock.
>>>
>>> Don't think raid1 could easily trigger the deadlock. Maybe you should try
>>> raid10. The resync case is hard to trigger for raid1. The memory pressure case
>>> is hard to trigger for both raid1/10. But it's possible to trigger.
>>>
>>> The 3-layer case is something like this:
>>
>> Hi Shaohua,
>>
>> I try to catch up with you, let me try to follow your mind by the
>> split-in-while-loop condition (this is my new I/O barrier patch). I
>> assume the original BIO is a write bio, and original bio is split and
>> handled in a while loop in raid1_make_request().
>>
>>> 1. in level1, set current->bio_list, split bio to bio1 and bio2
>>
>> This is done in level1 raid1_make_request().
>>
>>> 2. remap bio1 to level2 disk, and queue bio1-level2 in current->bio_list
>>
>> Remap is done by raid1_write_request(), and bio1_level may be added into
>> one of the two list:
>> - plug->pending:
>>   bios in plug->pending may be handled in raid1_unplug(), or in
>> flush_pending_writes() of raid1d().
>>   If current task is about to be scheduled, raid1_unplug() will merge
>> plug->pending's bios to conf->pending_bio_list. And
>> conf->pending_bio_list will be handled in raid1d.
>>   If raid1_unplug() is triggered by blk_finish_plug(), it is also
>> handled in raid1d.
>>
>> - conf->pending_bio_list:
>>   bios in this list is handled in raid1d by calling flush_pending_writes().
>>
>>
>> So generic_make_request() to handle bio1_level2 can only be called in
>> context of raid1d thread, bio1_level2 is added into raid1d's
>> bio_list_on_stack, not caller of level1 generic_make_request().
>>
>>> 3. queue bio2 in current->bio_list
>>
>> Same, bio2_level2 is in level1 raid1d's bio_list_on_stack.
>> Then back to level1 generic_make_request()
>>
>>> 4. generic_make_request then pops bio1-level2
>>
>> At this moment, bio1_level2 and bio2_level2 are in either plug->pending
>> or conf->pending_bio_list, bio_list_pop() returns NULL, and level1
>> generic_make_request() returns to its caller.
>>
>> If before bio_list_pop() called, kernel thread raid1d wakes up and
>> iterates conf->pending_bio_list in flush_pending_writes() or iterate
>> plug->pending in raid1_unplug() by blk_finish_plug(), that happens in
>> level1 raid1d's stack, bios will not show up in level1
>> generic_make_reques(), bio_list_pop() still returns NULL.
>>
>>> 5. remap bio1-level2 to level3 disk, and queue bio1-level2-level3 in current->bio_list
>>
>> bio2_level2 is at head of conf->pending_bio_list or plug->pending, so
>> bio2_level2 is handled firstly.
>>
>> level1 raid1 calls level2 generic_make_request(), then level2
>> raid1_make_request() is called, then level raid1_write_request().
>> bio2_level2 is remapped to bio2_level3, added into plug->pending (level1
>> raid1d's context) or conf->pending_bio_list (level2 raid1's conf), it
>> will be handled by level2 raid1d, when level2 raid1d wakes up.
>> Then returns back to level1 raid1, bio1_level2
>> is handled by level2 generic_make_request() and added into level2
>> plug->pending or conf->pending_bio_list. In this case neither
>> bio2_level2 nor bio1_level is added into any bio_list_on_stack.
>>
>> Then level1 raid1d handles all bios in level1 conf->pending_bio_list,
>> and sleeps.
>>
>> Then level2 raid1d wakes up, and handle bio2_level3 and bio1_level3, by
>> iterate level2 plug->pending or conf->pending_bio_list, and calling
>> level3 generic_make_request().
>>
>> In level3 generic_make_request(), because it is level2 raid1d context,
>> not level1 raid1d context, bio2_level3 is send into
>> q->make_request_fn(), and finally added into level3 plug->pending or
>> conf->pending_bio_list, then back to level3 generic_make_reqeust().
>>
>> Now level2 raid1d's current->bio_list is empty, so level3
>> generic_make_request() returns to level2 raid1d and continue to iterate
>> and send bio1_level3 into level3 generic_make_request().
>>
>> After all bios are added into level3 plug->pending or
>> conf->pending_bio_list, level2 raid1d sleeps.
>>
>> Now level3 raid1d wakes up, continue to iterate level3 plug->pending or
>> conf->pending_bio_list by calling generic_make_request() to underlying
>> devices (which might be a read device).
>>
>> On the above whole patch, each lower level generic_make_request() is
>> called in context of the lower level raid1d. No recursive call happens
>> in normal code path.
>>
>> In raid1 code, recursive call of generic_make_request() only happens for
>> READ bio, but if array is not frozen, no barrier is required, it doesn't
>> hurt.
>>
>>
>>> 6. generic_make_request then pops bio2, but bio1 hasn't finished yet, deadlock
>>
>> As my understand to the code, it won't happen neither.
>>
>>>
>>> The problem is because we add new bio to current->bio_list tail.
>>
>> New bios are added into other context's current->bio_list, which are
>> different lists. If what I understand is correct, a dead lock won't
>> happen in this way.
>>
>> If my understanding is correct, suddenly I come to realize why raid1
>> bios are handled indirectly in another kernel thread.
>>
>> (Just for your information, when I write to this location, another run
>> of testing finished, no deadlock. This time I reduce I/O barrier bucket
>> unit size to 512KB, and set blocksize to 33MB in fio job file. It is
>> really slow (130MB/s), but no deadlock observed)
>>
>>
>> The stacked raid1 devices are really really confused, if I am wrong, any
>> hint is warmly welcome.
> 
> Aha, you are correct. I missed we never directly dispatch bio in a schedule based
> blk-plug flush. I'll drop the patch. Thanks for the insistence, good discussion!

Thank you for the encouragement :-)
After thinking about your patch again carefully, I suggest letting it go
ahead and keeping your fix in -next. The reasons are:
1) With a 32-bit bi_iter.bi_size it is safe and there is no hash conflict.
But if someone (maybe me) changes bi_iter.bi_size from 32-bit to 64-bit, a
DISCARD bio can very easily be split into more than 512 pieces, and then
there will be a lot of hash conflicts. If a resync is triggered at the
same time, it will very easily run into a deadlock (see the rough numbers
after this list).
  If this deadlock happens in the future, it will be quite hard to find
the root cause. If we take this fix now to avoid future hash conflicts,
life will be easier at that time.
2) If we use the nested generic_make_request() fix in your patch, the
while-loop no longer exists and more CPU cycles are consumed anyway, so it
does not make sense to save a branch with a function pointer and pay the
cost in code readability. So removing the function pointer is better once
we take reason 1).
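
To put rough numbers on reason 1) (my arithmetic, not a measurement from
this thread): with a 32-bit bi_iter.bi_size the largest possible bio is
4GB, and 4GB / 64MB = 64 barrier-unit boundaries, i.e. at most 65 split
pieces, well inside the 512-consecutive-integer window checked earlier;
with a 64-bit bi_size, a hypothetical 1TB DISCARD would already split
into over 16000 pieces, far beyond that window.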

Please keep this patch in -next; it helps to avoid a future bug.

Thanks.

Coly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-23 23:14                       ` NeilBrown
@ 2017-02-24 17:06                         ` Coly Li
  2017-02-24 17:17                           ` Shaohua Li
  0 siblings, 1 reply; 43+ messages in thread
From: Coly Li @ 2017-02-24 17:06 UTC (permalink / raw)
  To: NeilBrown, Shaohua Li
  Cc: linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/24 上午7:14, NeilBrown wrote:
> On Thu, Feb 23 2017, Coly Li wrote:
> 
>> 
>> I tried to set up a 4 layer stacked md raid1, and reduce I/O
>> barrier bucket size to 8MB, running for 10 hours, there is no
>> deadlock observed,
> 
> Try setting BARRIER_BUCKETS_NR to '1' and BARRIER_UNIT_SECTOR_BITS
> to 3 and make sure the write requests are larger than 1 page (and
> have resync happen at the same time as writes).

Hi Neil,

Yes, the above method triggers the deadlock easily. After coming to
understand how bios are handled in stacked raid1 and the relationship
between current->bio_list, plug->pending and conf->pending_bio_list, I
think I now understand what you were worried about and the point of your
fix.

I totally agree, and I understand now that there will be a hash conflict
sooner or later. Yes, we need this fix.

Thanks to you and Shaohua for explaining the details to me and helping me
catch up with your thinking :-)

Coly


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-24 17:06                         ` Coly Li
@ 2017-02-24 17:17                           ` Shaohua Li
  2017-02-24 18:57                             ` Coly Li
  0 siblings, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-24 17:17 UTC (permalink / raw)
  To: Coly Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On Sat, Feb 25, 2017 at 01:06:22AM +0800, Coly Li wrote:
> On 2017/2/24 上午7:14, NeilBrown wrote:
> > On Thu, Feb 23 2017, Coly Li wrote:
> > 
> >> 
> >> I tried to set up a 4 layer stacked md raid1, and reduce I/O
> >> barrier bucket size to 8MB, running for 10 hours, there is no
> >> deadlock observed,
> > 
> > Try setting BARRIER_BUCKETS_NR to '1' and BARRIER_UNIT_SECTOR_BITS
> > to 3 and make sure the write requests are larger than 1 page (and
> > have resync happen at the same time as writes).
> 
> Hi Neil,
> 
> Yes, the above method triggers deadlock easily. After come to
> understand how bios are handled in stacked raid1 and the relationship
> between current->bio_list, plug->pending and conf->pending_bio_list, I
> think I come to understand what you worried and the meaning of your fix.
> 
> I totally agree and understand there will be hash conflict sooner or
> later now. Yes we need this fix.
> 
> Thanks to you and Shaohua, explaining the details to me, and help me
> to catch up your mind :-)

I'm confused. So the deadlock is real? How is it triggered?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-24 17:17                           ` Shaohua Li
@ 2017-02-24 18:57                             ` Coly Li
  2017-02-24 19:02                               ` Shaohua Li
  0 siblings, 1 reply; 43+ messages in thread
From: Coly Li @ 2017-02-24 18:57 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/25 1:17 AM, Shaohua Li wrote:
> On Sat, Feb 25, 2017 at 01:06:22AM +0800, Coly Li wrote:
>> On 2017/2/24 上午7:14, NeilBrown wrote:
>>> On Thu, Feb 23 2017, Coly Li wrote:
>>>
>>>>
>>>> I tried to set up a 4 layer stacked md raid1, and reduce I/O
>>>> barrier bucket size to 8MB, running for 10 hours, there is no
>>>> deadlock observed,
>>>
>>> Try setting BARRIER_BUCKETS_NR to '1' and BARRIER_UNIT_SECTOR_BITS
>>> to 3 and make sure the write requests are larger than 1 page (and
>>> have resync happen at the same time as writes).
>>
>> Hi Neil,
>>
>> Yes, the above method triggers deadlock easily. After come to
>> understand how bios are handled in stacked raid1 and the relationship
>> between current->bio_list, plug->pending and conf->pending_bio_list, I
>> think I come to understand what you worried and the meaning of your fix.
>>
>> I totally agree and understand there will be hash conflict sooner or
>> later now. Yes we need this fix.
>>
>> Thanks to you and Shaohua, explaining the details to me, and help me
>> to catch up your mind :-)
> 
> I'm confused. So the deadlock is real? How is it triggered?

Let me explain,

There is no deadlock now, because,
1) A deadlock is only possible if a hash conflict exists.
2) In the current Linux kernel, a hash conflict won't happen in real life:
   2.1) The maximum size of a regular bio is 2MB, so it is split into at
most 2 bios in raid1_make_request() of my new I/O barrier patch.
   2.2) The maximum size of a DISCARD bio is 4GB, so it is split into at
most 65 bios in raid1_make_request() of my new I/O barrier patch.
   2.3) I verified that, for any 512 consecutive integers in [0, 1<<63],
there is no hash conflict when calling sector_to_idx() (a sketch of the
kind of check I did is below).
   2.4) Currently almost no device provides an LBA range exceeding
(1<<63) bytes. So in the current Linux kernel with my new I/O barrier
patch, no deadlock will happen. The patch in the current Linux kernel is
deadlock free under all the conditions we discussed before.
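
To make 2.3 concrete, below is a hypothetical user-space harness in the
spirit of the check I did (not the exact code I used). It assumes that
hash_long() behaves like the 64-bit golden-ratio hash_64() and that
BARRIER_BUCKETS_NR is 1024 (4KB pages), and it only samples a range of
windows instead of walking the whole space:

        #include <stdio.h>
        #include <stdint.h>

        #define BUCKETS_NR_BITS 10                   /* PAGE_SHIFT(12) - 2 */
        #define BUCKETS_NR      (1U << BUCKETS_NR_BITS)
        #define WINDOW          512

        /* assumed equivalent of hash_long()/hash_64() with GOLDEN_RATIO_64 */
        static unsigned int to_idx(uint64_t unit)
        {
                return (unsigned int)
                        ((unit * 0x61C8864680B583EBULL) >> (64 - BUCKETS_NR_BITS));
        }

        /* do any two of the WINDOW consecutive units from 'start' collide? */
        static int window_collides(uint64_t start)
        {
                unsigned char seen[BUCKETS_NR] = { 0 };
                uint64_t i;

                for (i = 0; i < WINDOW; i++) {
                        unsigned int idx = to_idx(start + i);

                        if (seen[idx])
                                return 1;
                        seen[idx] = 1;
                }
                return 0;
        }

        int main(void)
        {
                uint64_t start;

                /* sample only; walking all of [0, 1<<63] is not feasible */
                for (start = 0; start < (1ULL << 20); start++) {
                        if (window_collides(start)) {
                                printf("collision in window at %llu\n",
                                       (unsigned long long)start);
                                return 1;
                        }
                }
                printf("no collision in any sampled %d-unit window\n", WINDOW);
                return 0;
        }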

The reason why I suggest still taking your patch is,
1) If one day bi_iter.bi_size is extended from 32 bits (unsigned int) to
64 bits (unsigned long), a DISCARD bio can be split into more than 1024
smaller bios.
2) If a DISCARD bio is split into more than 1024 smaller bios, then
sector_to_idx() is called with 1024+ consecutive integers, and since
there are only BARRIER_BUCKETS_NR buckets, a hash conflict is guaranteed.
3) If a hash conflict exists, the deadlock described by Neil becomes
possible.


What I mean is: currently there is no deadlock because bi_iter.bi_size is
32 bits; if bi_iter.bi_size is extended to 64 bits in the future, we can
have a deadlock. Your fix costs almost nothing in performance, improves
code readability and avoids a potential future deadlock (when
bi_iter.bi_size is extended to 64 bits), so why not have it?

Coly

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-24 18:57                             ` Coly Li
@ 2017-02-24 19:02                               ` Shaohua Li
  2017-02-24 19:19                                 ` Coly Li
  0 siblings, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-24 19:02 UTC (permalink / raw)
  To: Coly Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On Sat, Feb 25, 2017 at 02:57:26AM +0800, Coly Li wrote:
> On 2017/2/25 上午1:17, Shaohua Li wrote:
> > On Sat, Feb 25, 2017 at 01:06:22AM +0800, Coly Li wrote:
> >> On 2017/2/24 上午7:14, NeilBrown wrote:
> >>> On Thu, Feb 23 2017, Coly Li wrote:
> >>>
> >>>>
> >>>> I tried to set up a 4 layer stacked md raid1, and reduce I/O
> >>>> barrier bucket size to 8MB, running for 10 hours, there is no
> >>>> deadlock observed,
> >>>
> >>> Try setting BARRIER_BUCKETS_NR to '1' and BARRIER_UNIT_SECTOR_BITS
> >>> to 3 and make sure the write requests are larger than 1 page (and
> >>> have resync happen at the same time as writes).
> >>
> >> Hi Neil,
> >>
> >> Yes, the above method triggers deadlock easily. After come to
> >> understand how bios are handled in stacked raid1 and the relationship
> >> between current->bio_list, plug->pending and conf->pending_bio_list, I
> >> think I come to understand what you worried and the meaning of your fix.
> >>
> >> I totally agree and understand there will be hash conflict sooner or
> >> later now. Yes we need this fix.
> >>
> >> Thanks to you and Shaohua, explaining the details to me, and help me
> >> to catch up your mind :-)
> > 
> > I'm confused. So the deadlock is real? How is it triggered?
> 
> Let me explain,
> 
> There is no deadlock now, because,
> 1) If there is hash conflict existing, a deadlock is possible.
> 2) In current Linux kernel, hash conflict won't happen in real life
>    2.1) regular bio maximum size is 2MB, it can only be split into 2
> bios in raid1_make_request() of my new I/O barrier patch
>    2.2) DISCARD bio maximum size is 4GB, it can be split into 65 bios in
> raid1_make_request() of my new I/O barrier patch.
>    2.3) I verified that, for any consecutive  512 integers in [0,
> 1<<63], there is no hash conflict by calling sector_to_idx().
>    2.4) Currently there is almost no device provides LBA range exceeds
> (1<<63) bytes. So in current Linux kernel with my new I/O barrier patch,
> no dead lock will happen. The patch in current Linux kernel is deadlock
> clean from all conditions we discussed before.
> 
> The reason why I suggest to still have your patch is,
> 1) If one day bi_iter.bi_size is extended from 32bit (unsigned int) to
> 64bit (unsigned long), a DISCARD bio will be split to more than 1024
> smaller bios.
> 2) If a DISCARD bio is split into more then 1024 smaller bios, that
> means sector_to_idx() is called by 1024+ consecutive integers. It is
> 100% possible to have hash conflict.
> 3) If hash conflict exists, the deadlock described by Neil will be passible.
> 
> 
> What I mean is, currently there is no deadlock, because bi_iter.bi_size
> is 32 bit; if in future bi_iter.bi_size extended to 64 bit, we will have
> deadlock. Your fix almost does not hurt performance, improves code
> readability and will avoid a potential deadlock in future (when
> bi_iter.bi_size extended to 64 bit), why not have it ?

Let's assume there is a hash conflict. We have raid10 anyway, which doesn't
have the fancy barrier. When can the deadlock be triggered? My understanding
is that there isn't one, because we are handling bios in raid1d/raid10d.
Anything I missed?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-24 19:02                               ` Shaohua Li
@ 2017-02-24 19:19                                 ` Coly Li
  0 siblings, 0 replies; 43+ messages in thread
From: Coly Li @ 2017-02-24 19:19 UTC (permalink / raw)
  To: Shaohua Li
  Cc: NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn, Guoqing Jiang

On 2017/2/25 3:02 AM, Shaohua Li wrote:
> On Sat, Feb 25, 2017 at 02:57:26AM +0800, Coly Li wrote:
>> On 2017/2/25 上午1:17, Shaohua Li wrote:
>>> On Sat, Feb 25, 2017 at 01:06:22AM +0800, Coly Li wrote:
>>>> On 2017/2/24 上午7:14, NeilBrown wrote:
>>>>> On Thu, Feb 23 2017, Coly Li wrote:
>>>>>
>>>>>>
>>>>>> I tried to set up a 4 layer stacked md raid1, and reduce I/O
>>>>>> barrier bucket size to 8MB, running for 10 hours, there is no
>>>>>> deadlock observed,
>>>>>
>>>>> Try setting BARRIER_BUCKETS_NR to '1' and BARRIER_UNIT_SECTOR_BITS
>>>>> to 3 and make sure the write requests are larger than 1 page (and
>>>>> have resync happen at the same time as writes).
>>>>
>>>> Hi Neil,
>>>>
>>>> Yes, the above method triggers deadlock easily. After come to
>>>> understand how bios are handled in stacked raid1 and the relationship
>>>> between current->bio_list, plug->pending and conf->pending_bio_list, I
>>>> think I come to understand what you worried and the meaning of your fix.
>>>>
>>>> I totally agree and understand there will be hash conflict sooner or
>>>> later now. Yes we need this fix.
>>>>
>>>> Thanks to you and Shaohua, explaining the details to me, and help me
>>>> to catch up your mind :-)
>>>
>>> I'm confused. So the deadlock is real? How is it triggered?
>>
>> Let me explain,
>>
>> There is no deadlock now, because,
>> 1) If there is hash conflict existing, a deadlock is possible.
>> 2) In current Linux kernel, hash conflict won't happen in real life
>>    2.1) regular bio maximum size is 2MB, it can only be split into 2
>> bios in raid1_make_request() of my new I/O barrier patch
>>    2.2) DISCARD bio maximum size is 4GB, it can be split into 65 bios in
>> raid1_make_request() of my new I/O barrier patch.
>>    2.3) I verified that, for any consecutive  512 integers in [0,
>> 1<<63], there is no hash conflict by calling sector_to_idx().
>>    2.4) Currently there is almost no device provides LBA range exceeds
>> (1<<63) bytes. So in current Linux kernel with my new I/O barrier patch,
>> no dead lock will happen. The patch in current Linux kernel is deadlock
>> clean from all conditions we discussed before.
>>
>> The reason why I suggest to still have your patch is,
>> 1) If one day bi_iter.bi_size is extended from 32bit (unsigned int) to
>> 64bit (unsigned long), a DISCARD bio will be split to more than 1024
>> smaller bios.
>> 2) If a DISCARD bio is split into more then 1024 smaller bios, that
>> means sector_to_idx() is called by 1024+ consecutive integers. It is
>> 100% possible to have hash conflict.
>> 3) If hash conflict exists, the deadlock described by Neil will be passible.
>>
>>
>> What I mean is, currently there is no deadlock, because bi_iter.bi_size
>> is 32 bit; if in future bi_iter.bi_size extended to 64 bit, we will have
>> deadlock. Your fix almost does not hurt performance, improves code
>> readability and will avoid a potential deadlock in future (when
>> bi_iter.bi_size extended to 64 bit), why not have it ?
> 
> Let's assume there is hash conflict. We have raid10 anyway, which doesn't have
> the fancy barrier. When can the deadlock be triggered? My understanding is
> there isn't because we are handling bios in raid1/10d. Any thing I missed?
> 

Oh, yeah, when we discussed the indirect bio handling by raid1d, no barrier
bucket idx was mentioned at all. That means even without the new I/O
barrier code, it does not lock up at all.

Yes, you are right; please ignore my noise :-)

Coly




^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-24 10:19                           ` 王金浦
@ 2017-02-28 19:42                             ` Shaohua Li
  2017-03-01 17:01                               ` 王金浦
  0 siblings, 1 reply; 43+ messages in thread
From: Shaohua Li @ 2017-02-28 19:42 UTC (permalink / raw)
  To: 王金浦
  Cc: Coly Li, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang

On Fri, Feb 24, 2017 at 11:19:05AM +0100, 王金浦 wrote:
> Hi Coly, Hi Shaohua,
> 
> 
> >
> > Hi Shaohua,
> >
> > I try to catch up with you, let me try to follow your mind by the
> > split-in-while-loop condition (this is my new I/O barrier patch). I
> > assume the original BIO is a write bio, and original bio is split and
> > handled in a while loop in raid1_make_request().
> 
> It's still possible for read bio. We hit a deadlock in the past.
> See https://patchwork.kernel.org/patch/9498949/
> 
> Also:
> http://www.spinics.net/lists/raid/msg52792.html

Thanks Jinpu. So this is for the read side, where we don't have the plug
mechanism. Yep, that finally makes sense. It's my fault that I didn't look
at the read-side code carefully. Looks like we need the patch Neil
suggested.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window
  2017-02-28 19:42                             ` Shaohua Li
@ 2017-03-01 17:01                               ` 王金浦
  0 siblings, 0 replies; 43+ messages in thread
From: 王金浦 @ 2017-03-01 17:01 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Coly Li, NeilBrown, linux-raid, Shaohua Li, Johannes Thumshirn,
	Guoqing Jiang

2017-02-28 20:42 GMT+01:00 Shaohua Li <shli@kernel.org>:
> On Fri, Feb 24, 2017 at 11:19:05AM +0100, 王金浦 wrote:
>> Hi Coly, Hi Shaohua,
>>
>>
>> >
>> > Hi Shaohua,
>> >
>> > I try to catch up with you, let me try to follow your mind by the
>> > split-in-while-loop condition (this is my new I/O barrier patch). I
>> > assume the original BIO is a write bio, and original bio is split and
>> > handled in a while loop in raid1_make_request().
>>
>> It's still possible for read bio. We hit a deadlock in the past.
>> See https://patchwork.kernel.org/patch/9498949/
>>
>> Also:
>> http://www.spinics.net/lists/raid/msg52792.html
>
> Thanks Jinpu. So this is for the read side, where we don't have plug stuff.
> Yep, that finally makes sense. It's my fault I didn't look at the read side
> code carefully. Looks we need the patch Neil suggested.
>
> Thanks,
> Shaohua

Thanks Shaohua, I will test your patch and report the result.

Regards,
Jinpu

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2017-03-01 17:01 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-15 16:35 [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window colyli
2017-02-15 16:35 ` [PATCH V3 2/2] RAID1: avoid unnecessary spin locks in I/O barrier code colyli
2017-02-15 17:15   ` Coly Li
2017-02-16  2:25   ` Shaohua Li
2017-02-17 18:42     ` Coly Li
2017-02-16  7:04   ` NeilBrown
2017-02-17  7:56     ` Coly Li
2017-02-17 18:35       ` Coly Li
2017-02-16  2:22 ` [PATCH V3 1/2] RAID1: a new I/O barrier implementation to remove resync window Shaohua Li
2017-02-16 17:05   ` Coly Li
2017-02-17 12:40     ` Coly Li
2017-02-16  7:04 ` NeilBrown
2017-02-17  6:56   ` Coly Li
2017-02-19 23:50     ` NeilBrown
2017-02-20  2:51       ` NeilBrown
2017-02-20  7:04         ` Shaohua Li
2017-02-20  8:07           ` Coly Li
2017-02-20  8:30             ` Coly Li
2017-02-20 18:14             ` Wols Lists
2017-02-21 11:30               ` Coly Li
2017-02-21 19:20                 ` Wols Lists
2017-02-21 20:16                   ` Coly Li
2017-02-21  0:29             ` NeilBrown
2017-02-21  9:45               ` Coly Li
2017-02-21 17:45                 ` Shaohua Li
2017-02-21 20:09                   ` Coly Li
2017-02-23  5:54                     ` Coly Li
2017-02-23 17:34                       ` Shaohua Li
2017-02-23 19:31                         ` Coly Li
2017-02-23 19:58                           ` Shaohua Li
2017-02-24 17:02                             ` Coly Li
2017-02-24 10:19                           ` 王金浦
2017-02-28 19:42                             ` Shaohua Li
2017-03-01 17:01                               ` 王金浦
2017-02-23 23:14                       ` NeilBrown
2017-02-24 17:06                         ` Coly Li
2017-02-24 17:17                           ` Shaohua Li
2017-02-24 18:57                             ` Coly Li
2017-02-24 19:02                               ` Shaohua Li
2017-02-24 19:19                                 ` Coly Li
2017-02-17 19:41   ` Shaohua Li
2017-02-18  2:40     ` Coly Li
2017-02-19 23:42     ` NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.