* [PATCH 0/7] retry write on error
@ 2017-11-22  0:35 Liu Bo
  2017-11-22  0:35 ` [PATCH 1/7] Btrfs: keep a copy of bi_iter in btrfs_io_bio Liu Bo
                   ` (7 more replies)
  0 siblings, 8 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-22  0:35 UTC (permalink / raw)
  To: linux-btrfs

If the underlying protocol doesn't support retry and some transient
errors happen somewhere in our IO stack, we'd like to give the IO an
extra chance.  Sometimes you also see btrfs reporting
'wrr 1 flush 0 read 0 blabla' while the disk drive is 100% good; this
retry may help a bit in such cases.

In btrfs, read retry is handled in bio_readpage_error() with a retry
unit of page size.  Write retry, however, is going to be done in a
different way: since a write may consist of several writes onto
different stripes, the retry needs to be done right after the IO on
each stripe completes and arrives at endio.
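
To illustrate the intended write path, here is a rough sketch (made-up
helper names, not the code in the patches below):

	/* endio for one stripe's bio */
	static void write_stripe_end_io(struct bio *bio)
	{
		if (bio->bi_status && op_is_write(bio_op(bio))) {
			/*
			 * endio runs in irq context, so hand the bio to
			 * a workqueue that does the synchronous retry.
			 */
			defer_write_retry(bio);		/* hypothetical */
			return;
		}

		complete_stripe_io(bio);		/* hypothetical */
	}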

Patches 1-3 are the implementation of retry write on error for the
non-raid56 profiles.  Patches 4-6 are for the raid56 profile.  Both
raid56 and non-raid56 share one retry helper function.

Patch 3 does the retry sector by sector, but since this patch set
doesn't include badblocks support, patch 7 changes it back to retrying
the whole bio.  (I didn't fold patch 7 into patch 3 in the hope of
simply reverting patch 7 once badblocks support is done, but I'm open
to it.)

Liu Bo (7):
  Btrfs: keep a copy of bi_iter in btrfs_io_bio
  Btrfs: add helper __btrfs_end_bio
  Btrfs: retry write for non-raid56
  Btrfs: get rbio inside fail_bio_stripe
  Btrfs: add helper __raid_write_end_io
  Btrfs: retry write for raid56
  Btrfs: retry the whole bio on write error

 fs/btrfs/extent_io.c |   2 -
 fs/btrfs/raid56.c    |  73 ++++++++++++++++++++++------
 fs/btrfs/volumes.c   | 133 ++++++++++++++++++++++++++++++++++++++++-----------
 fs/btrfs/volumes.h   |   3 ++
 4 files changed, 167 insertions(+), 44 deletions(-)

-- 
2.9.4



* [PATCH 1/7] Btrfs: keep a copy of bi_iter in btrfs_io_bio
  2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
@ 2017-11-22  0:35 ` Liu Bo
  2017-11-22  0:35 ` [PATCH 2/7] Btrfs: add helper __btrfs_end_bio Liu Bo
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-22  0:35 UTC (permalink / raw)
  To: linux-btrfs

Before this, we only keep a copy of bi_iter for cloned bios so that we
can iterate their bvecs in endio with bio_for_each_segment(), while
normal bios are iterated with bio_for_each_segment_all().

However, the following patches add retry on write errors, where I'd
like to process both cloned bios and normal bios in the same way.

This keeps a copy of bi_iter in btrfs_io_bio before we submit the bio
so that in endio we can iterate the bio with bio_for_each_segment()
only.
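
The idea, sketched with the stock bio iterator helpers (an
illustration only, not the exact patch code):

	/* before submit: bi_iter still describes the whole IO */
	btrfs_io_bio(bio)->iter = bio->bi_iter;
	submit_bio(bio);

	/* in endio: bi_iter may have been advanced by the lower layers,
	 * so walk the saved copy instead */
	struct bio_vec bvec;
	struct bvec_iter iter;

	__bio_for_each_segment(bvec, bio, iter, btrfs_io_bio(bio)->iter)
		handle_segment(&bvec);	/* hypothetical per-segment work */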

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/extent_io.c | 2 --
 fs/btrfs/volumes.c   | 3 ++-
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f64d05a..043b7df 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2689,7 +2689,6 @@ struct bio *btrfs_bio_clone(struct bio *bio)
 	new = bio_clone_fast(bio, GFP_NOFS, btrfs_bioset);
 	btrfs_bio = btrfs_io_bio(new);
 	btrfs_io_bio_init(btrfs_bio);
-	btrfs_bio->iter = bio->bi_iter;
 	return new;
 }
 
@@ -2716,7 +2715,6 @@ struct bio *btrfs_bio_clone_partial(struct bio *orig, int offset, int size)
 	btrfs_io_bio_init(btrfs_bio);
 
 	bio_trim(bio, offset >> 9, size >> 9);
-	btrfs_bio->iter = bio->bi_iter;
 	return bio;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c70fd0f..4835136 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6113,9 +6113,10 @@ static void submit_stripe_bio(struct btrfs_bio *bbio, struct bio *bio,
 	struct btrfs_fs_info *fs_info = bbio->fs_info;
 
 	bio->bi_private = bbio;
-	btrfs_io_bio(bio)->stripe_index = dev_nr;
 	bio->bi_end_io = btrfs_end_bio;
 	bio->bi_iter.bi_sector = physical >> 9;
+	btrfs_io_bio(bio)->stripe_index = dev_nr;
+	btrfs_io_bio(bio)->iter = bio->bi_iter;
 #ifdef DEBUG
 	{
 		struct rcu_string *name;
-- 
2.9.4



* [PATCH 2/7] Btrfs: add helper __btrfs_end_bio
  2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
  2017-11-22  0:35 ` [PATCH 1/7] Btrfs: keep a copy of bi_iter in btrfs_io_bio Liu Bo
@ 2017-11-22  0:35 ` Liu Bo
  2017-11-22  0:35 ` [PATCH 3/7] Btrfs: retry write for non-raid56 Liu Bo
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-22  0:35 UTC (permalink / raw)
  To: linux-btrfs

btrfs_end_bio() does two things: firstly it checks for and prints
errors from the underlying layers, secondly it checks whether this is
the last stripe of the IO (e.g. raid1 issues two stripe writes for a
single IO).

Since I'd like to retry on write errors before failing the write IO,
the above two parts need to be split.

This adds a helper __btrfs_end_bio() to do the second part.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/volumes.c | 60 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 34 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4835136..853f9ce 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5987,36 +5987,11 @@ static inline void btrfs_end_bbio(struct btrfs_bio *bbio, struct bio *bio)
 	btrfs_put_bbio(bbio);
 }
 
-static void btrfs_end_bio(struct bio *bio)
+static void __btrfs_end_bio(struct bio *bio)
 {
 	struct btrfs_bio *bbio = bio->bi_private;
 	int is_orig_bio = 0;
 
-	if (bio->bi_status) {
-		atomic_inc(&bbio->error);
-		if (bio->bi_status == BLK_STS_IOERR ||
-		    bio->bi_status == BLK_STS_TARGET) {
-			unsigned int stripe_index =
-				btrfs_io_bio(bio)->stripe_index;
-			struct btrfs_device *dev;
-
-			BUG_ON(stripe_index >= bbio->num_stripes);
-			dev = bbio->stripes[stripe_index].dev;
-			if (dev->bdev) {
-				if (bio_op(bio) == REQ_OP_WRITE)
-					btrfs_dev_stat_inc(dev,
-						BTRFS_DEV_STAT_WRITE_ERRS);
-				else
-					btrfs_dev_stat_inc(dev,
-						BTRFS_DEV_STAT_READ_ERRS);
-				if (bio->bi_opf & REQ_PREFLUSH)
-					btrfs_dev_stat_inc(dev,
-						BTRFS_DEV_STAT_FLUSH_ERRS);
-				btrfs_dev_stat_print_on_error(dev);
-			}
-		}
-	}
-
 	if (bio == bbio->orig_bio)
 		is_orig_bio = 1;
 
@@ -6048,6 +6023,39 @@ static void btrfs_end_bio(struct bio *bio)
 	}
 }
 
+static void btrfs_end_bio(struct bio *bio)
+{
+	struct btrfs_bio *bbio = bio->bi_private;
+	int is_orig_bio = 0;
+
+	if (bio->bi_status) {
+		atomic_inc(&bbio->error);
+		if (bio->bi_status == BLK_STS_IOERR ||
+		    bio->bi_status == BLK_STS_TARGET) {
+			unsigned int stripe_index =
+				btrfs_io_bio(bio)->stripe_index;
+			struct btrfs_device *dev;
+
+			BUG_ON(stripe_index >= bbio->num_stripes);
+			dev = bbio->stripes[stripe_index].dev;
+			if (dev->bdev) {
+				if (bio_op(bio) == REQ_OP_WRITE)
+					btrfs_dev_stat_inc(dev,
+						BTRFS_DEV_STAT_WRITE_ERRS);
+				else
+					btrfs_dev_stat_inc(dev,
+						BTRFS_DEV_STAT_READ_ERRS);
+				if (bio->bi_opf & REQ_PREFLUSH)
+					btrfs_dev_stat_inc(dev,
+						BTRFS_DEV_STAT_FLUSH_ERRS);
+				btrfs_dev_stat_print_on_error(dev);
+			}
+		}
+	}
+
+	__btrfs_end_bio(bio);
+}
+
 /*
  * see run_scheduled_bios for a description of why bios are collected for
  * async submit.
-- 
2.9.4



* [PATCH 3/7] Btrfs: retry write for non-raid56
  2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
  2017-11-22  0:35 ` [PATCH 1/7] Btrfs: keep a copy of bi_iter in btrfs_io_bio Liu Bo
  2017-11-22  0:35 ` [PATCH 2/7] Btrfs: add helper __btrfs_end_bio Liu Bo
@ 2017-11-22  0:35 ` Liu Bo
  2017-11-22 14:41   ` Nikolay Borisov
  2017-11-22  0:35 ` [PATCH 4/7] Btrfs: get rbio inside fail_bio_stripe Liu Bo
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 21+ messages in thread
From: Liu Bo @ 2017-11-22  0:35 UTC (permalink / raw)
  To: linux-btrfs

If the underlying protocol doesn't support retry and some transient
errors happen somewhere in our IO stack, we'd like to give the IO an
extra chance.

In btrfs, read retry is handled in bio_readpage_error() with a retry
unit of page size.  Write retry, however, is going to be done in a
different way: since a write may consist of several writes onto
different stripes, the retry needs to be done right after the IO on
each stripe completes and reaches endio.

As write errors are really corner cases, performance impact is not
considered.

This adds code to retry writes on error _sector by sector_ in order
to find out which sector is really bad so that we can mark it
somewhere.  And since endio runs in interrupt context, the real retry
work is scheduled on system_unbound_wq in order to get a normal
(process) context.
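
The deferral is the usual work_struct pattern; a trimmed sketch of
what the diff below does in btrfs_reschedule_retry() and
btrfs_handle_bio_error():

	static void handle_write_error(struct work_struct *work)
	{
		struct btrfs_io_bio *io_bio =
			container_of(work, struct btrfs_io_bio, work);
		struct bio *bio = &io_bio->bio;

		/* process context: retries via submit_bio_wait() are ok */
		retry_or_fail(bio);			/* hypothetical */
	}

	/* called from endio, i.e. irq context */
	static void reschedule_retry(struct bio *bio)
	{
		INIT_WORK(&btrfs_io_bio(bio)->work, handle_write_error);
		queue_work(system_unbound_wq, &btrfs_io_bio(bio)->work);
	}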

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/volumes.c | 162 +++++++++++++++++++++++++++++++++++++++++++++--------
 fs/btrfs/volumes.h |   3 +
 2 files changed, 142 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 853f9ce..c11db0b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio)
 	}
 }
 
-static void btrfs_end_bio(struct bio *bio)
+static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
 {
 	struct btrfs_bio *bbio = bio->bi_private;
-	int is_orig_bio = 0;
+	unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
+
+	BUG_ON(stripe_index >= bbio->num_stripes);
+	return bbio->stripes[stripe_index].dev;
+}
+
+/*
+ * return 1 if every sector retry returns successful.
+ * return 0 if one or more sector retries fails.
+ */
+int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
+{
+	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
+	u64 sectors_to_write;
+	u64 offset;
+	u64 orig;
+	u64 unit;
+	u64 block_sectors;
+	int ok = 1;
+	struct bio *wbio;
+
+	/* offset and unit are bytes aligned, not 512-bytes aligned. */
+	sectors_to_write = io_bio->iter.bi_size >> 9;
+	orig = io_bio->iter.bi_sector;
+	offset = 0;
+	block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
+	unit = block_sectors;
+	ASSERT(unit == 1);
+
+	while (1) {
+		if (!sectors_to_write)
+			break;
+		/*
+		 * LIUBO: I don't think unit > sectors_to_write could
+		 * happen, sectors_to_write should be aligned to PAGE_SIZE
+		 * which is > unit.  Just in case.
+		 */
+		if (unit > sectors_to_write) {
+			WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", unit, sectors_to_write);
+			unit = sectors_to_write;
+		}
+
+		/* write @unit bytes at @offset */
+		/* this would never fail, check btrfs_bio_clone(). */
+		wbio = btrfs_bio_clone(bio);
+		wbio->bi_opf = REQ_OP_WRITE;
+		wbio->bi_iter = io_bio->iter;
+
+		bio_trim(wbio, offset, unit);
+		bio_copy_dev(wbio, bio);
+
+		/* submit in sync way */
+		/*
+		 * LIUBO: There is an issue, if this bio is quite
+		 * large, say 1M or 2M, and sector size is just 512,
+		 * then this may take a while.
+		 *
+		 * May need to schedule the job to workqueue.
+		 */
+		if (submit_bio_wait(wbio) < 0) {
+			ok = 0 && ok;
+			/*
+			 * This is not correct if badblocks is enabled
+			 * as we need to record every bad sector by
+			 * trying sectors one by one.
+			 */
+			break;
+		}
+
+		bio_put(wbio);
+		offset += unit;
+		sectors_to_write -= unit;
+		unit = block_sectors;
+	}
+	return ok;
+}
+
+void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
+{
+	if (bio->bi_status == BLK_STS_IOERR ||
+	    bio->bi_status == BLK_STS_TARGET) {
+		if (dev->bdev) {
+			if (bio_data_dir(bio) == WRITE)
+				btrfs_dev_stat_inc(dev,
+						   BTRFS_DEV_STAT_WRITE_ERRS);
+			else
+				btrfs_dev_stat_inc(dev,
+						   BTRFS_DEV_STAT_READ_ERRS);
+			if (bio->bi_opf & REQ_PREFLUSH)
+				btrfs_dev_stat_inc(dev,
+						   BTRFS_DEV_STAT_FLUSH_ERRS);
+			btrfs_dev_stat_print_on_error(dev);
+		}
+	}
+}
+
+static void btrfs_handle_bio_error(struct work_struct *work)
+{
+	struct btrfs_io_bio *io_bio = container_of(work, struct btrfs_io_bio,
+						   work);
+	struct bio *bio = &io_bio->bio;
+	struct btrfs_device *dev = get_device_from_bio(bio);
+
+	ASSERT(bio_data_dir(bio) == WRITE);
+
+	if (!btrfs_narrow_write_error(bio, dev)) {
+		/* inc error counter if narrow (retry) fails */
+		struct btrfs_bio *bbio = bio->bi_private;
 
-	if (bio->bi_status) {
 		atomic_inc(&bbio->error);
-		if (bio->bi_status == BLK_STS_IOERR ||
-		    bio->bi_status == BLK_STS_TARGET) {
-			unsigned int stripe_index =
-				btrfs_io_bio(bio)->stripe_index;
-			struct btrfs_device *dev;
-
-			BUG_ON(stripe_index >= bbio->num_stripes);
-			dev = bbio->stripes[stripe_index].dev;
-			if (dev->bdev) {
-				if (bio_op(bio) == REQ_OP_WRITE)
-					btrfs_dev_stat_inc(dev,
-						BTRFS_DEV_STAT_WRITE_ERRS);
-				else
-					btrfs_dev_stat_inc(dev,
-						BTRFS_DEV_STAT_READ_ERRS);
-				if (bio->bi_opf & REQ_PREFLUSH)
-					btrfs_dev_stat_inc(dev,
-						BTRFS_DEV_STAT_FLUSH_ERRS);
-				btrfs_dev_stat_print_on_error(dev);
-			}
+		btrfs_record_bio_error(bio, dev);
+	}
+
+	__btrfs_end_bio(bio);
+}
+
+/* running in irq context */
+static void btrfs_reschedule_retry(struct bio *bio)
+{
+	INIT_WORK(&btrfs_io_bio(bio)->work, btrfs_handle_bio_error);
+	/* _temporarily_ use system_unbound_wq */
+	queue_work(system_unbound_wq, &btrfs_io_bio(bio)->work);
+}
+
+static void btrfs_end_bio(struct bio *bio)
+{
+	if (bio->bi_status) {
+		struct btrfs_bio *bbio = bio->bi_private;
+
+		if (bio_data_dir(bio) == WRITE) {
+			btrfs_reschedule_retry(bio);
+			return;
 		}
+
+		atomic_inc(&bbio->error);
+
+		/* I don't think we can have REQ_PREFLUSH, but just in case. */
+		WARN_ON_ONCE(bio->bi_opf & REQ_PREFLUSH);
+
+		/* REQ_PREFLUSH */
+		btrfs_record_bio_error(bio, get_device_from_bio(bio));
 	}
 
 	__btrfs_end_bio(bio);
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6108fdf..c1da27f 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -281,6 +281,7 @@ struct btrfs_io_bio {
 	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
 	u8 *csum_allocated;
 	btrfs_io_bio_end_io_t *end_io;
+	struct work_struct work;
 	struct bvec_iter iter;
 	/*
 	 * This member must come last, bio_alloc_bioset will allocate enough
@@ -544,5 +545,7 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
 bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
 void btrfs_report_missing_device(struct btrfs_fs_info *fs_info, u64 devid,
 				 u8 *uuid);
+int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev);
+void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev);
 
 #endif
-- 
2.9.4



* [PATCH 4/7] Btrfs: get rbio inside fail_bio_stripe
  2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
                   ` (2 preceding siblings ...)
  2017-11-22  0:35 ` [PATCH 3/7] Btrfs: retry write for non-raid56 Liu Bo
@ 2017-11-22  0:35 ` Liu Bo
  2017-11-22  0:35 ` [PATCH 5/7] Btrfs: add helper __raid_write_end_io Liu Bo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-22  0:35 UTC (permalink / raw)
  To: linux-btrfs

Because rbio is always available as bio->bi_private, we can remove
@rbio from the arguments.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/raid56.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 24a6222..aa04e5b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -183,7 +183,7 @@ static void rmw_work(struct btrfs_work *work);
 static void read_rebuild_work(struct btrfs_work *work);
 static void async_rmw_stripe(struct btrfs_raid_bio *rbio);
 static void async_read_rebuild(struct btrfs_raid_bio *rbio);
-static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
+static int fail_bio_stripe(struct bio *bio);
 static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
 static void __free_raid_bio(struct btrfs_raid_bio *rbio);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
@@ -898,7 +898,7 @@ static void raid_write_end_io(struct bio *bio)
 	int max_errors;
 
 	if (err)
-		fail_bio_stripe(rbio, bio);
+		fail_bio_stripe(bio);
 
 	bio_put(bio);
 
@@ -1415,9 +1415,9 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
  * helper to fail a stripe based on a physical disk
  * bio.
  */
-static int fail_bio_stripe(struct btrfs_raid_bio *rbio,
-			   struct bio *bio)
+static int fail_bio_stripe(struct bio *bio)
 {
+	struct btrfs_raid_bio *rbio = bio->bi_private;
 	int failed = find_bio_stripe(rbio, bio);
 
 	if (failed < 0)
@@ -1455,7 +1455,7 @@ static void raid_rmw_end_io(struct bio *bio)
 	struct btrfs_raid_bio *rbio = bio->bi_private;
 
 	if (bio->bi_status)
-		fail_bio_stripe(rbio, bio);
+		fail_bio_stripe(bio);
 	else
 		set_bio_pages_uptodate(bio);
 
@@ -1998,7 +1998,7 @@ static void raid_recover_end_io(struct bio *bio)
 	 * up to date if there were no errors
 	 */
 	if (bio->bi_status)
-		fail_bio_stripe(rbio, bio);
+		fail_bio_stripe(bio);
 	else
 		set_bio_pages_uptodate(bio);
 	bio_put(bio);
@@ -2537,7 +2537,7 @@ static void raid56_parity_scrub_end_io(struct bio *bio)
 	struct btrfs_raid_bio *rbio = bio->bi_private;
 
 	if (bio->bi_status)
-		fail_bio_stripe(rbio, bio);
+		fail_bio_stripe(bio);
 	else
 		set_bio_pages_uptodate(bio);
 
-- 
2.9.4



* [PATCH 5/7] Btrfs: add helper __raid_write_end_io
  2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
                   ` (3 preceding siblings ...)
  2017-11-22  0:35 ` [PATCH 4/7] Btrfs: get rbio inside fail_bio_stripe Liu Bo
@ 2017-11-22  0:35 ` Liu Bo
  2017-11-22  0:35 ` [PATCH 6/7] Btrfs: retry write for raid56 Liu Bo
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-22  0:35 UTC (permalink / raw)
  To: linux-btrfs

We'd like to retry writes on error, so this splits raid_write_end_io()
into two parts: a) checking errors and b) the rest, which goes into a
helper __raid_write_end_io().

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/raid56.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index aa04e5b..aebc849 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -887,19 +887,12 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 	}
 }
 
-/*
- * end io function used by finish_rmw.  When we finally
- * get here, we've written a full stripe
- */
-static void raid_write_end_io(struct bio *bio)
+static void __raid_write_end_io(struct bio *bio)
 {
 	struct btrfs_raid_bio *rbio = bio->bi_private;
-	blk_status_t err = bio->bi_status;
+	blk_status_t err;
 	int max_errors;
 
-	if (err)
-		fail_bio_stripe(bio);
-
 	bio_put(bio);
 
 	if (!atomic_dec_and_test(&rbio->stripes_pending))
@@ -917,6 +910,18 @@ static void raid_write_end_io(struct bio *bio)
 }
 
 /*
+ * end io function used by finish_rmw.  When we finally
+ * get here, we've written a full stripe
+ */
+static void raid_write_end_io(struct bio *bio)
+{
+	if (bio->bi_status)
+		fail_bio_stripe(bio);
+
+	__raid_write_end_io(bio);
+}
+
+/*
  * the read/modify/write code wants to use the original bio for
  * any pages it included, and then use the rbio for everything
  * else.  This function decides if a given index (stripe number)
-- 
2.9.4



* [PATCH 6/7] Btrfs: retry write for raid56
  2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
                   ` (4 preceding siblings ...)
  2017-11-22  0:35 ` [PATCH 5/7] Btrfs: add helper __raid_write_end_io Liu Bo
@ 2017-11-22  0:35 ` Liu Bo
  2017-11-22  0:35 ` [PATCH 7/7] Btrfs: retry the whole bio on write error Liu Bo
  2017-11-28 19:22 ` [PATCH 0/7] retry write on error David Sterba
  7 siblings, 0 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-22  0:35 UTC (permalink / raw)
  To: linux-btrfs

Retry raid56's final full stripe writes on error in order to get over
transient errors in our IO stack.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/raid56.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index aebc849..2e182f8 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -909,14 +909,49 @@ static void __raid_write_end_io(struct bio *bio)
 	rbio_orig_end_io(rbio, err);
 }
 
+struct btrfs_device *get_raid_device_from_bio(struct bio *bio)
+{
+	struct btrfs_raid_bio *rbio = bio->bi_private;
+	unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
+
+	BUG_ON(stripe_index >= rbio->bbio->num_stripes);
+	return rbio->bbio->stripes[stripe_index].dev;
+}
+
+static void raid_handle_write_error(struct work_struct *work)
+{
+	struct btrfs_io_bio *io_bio;
+	struct bio *bio;
+	struct btrfs_device *dev;
+
+	io_bio = container_of(work, struct btrfs_io_bio, work);
+	bio = &io_bio->bio;
+	dev = get_raid_device_from_bio(bio);
+
+	if (!btrfs_narrow_write_error(bio, dev)) {
+		fail_bio_stripe(bio);
+		btrfs_record_bio_error(bio, dev);
+	}
+
+	__raid_write_end_io(bio);
+}
+
+static void raid_reschedule_bio(struct bio *bio)
+{
+	INIT_WORK(&btrfs_io_bio(bio)->work, raid_handle_write_error);
+	queue_work(system_unbound_wq, &btrfs_io_bio(bio)->work);
+}
+
 /*
  * end io function used by finish_rmw.  When we finally
  * get here, we've written a full stripe
  */
 static void raid_write_end_io(struct bio *bio)
 {
-	if (bio->bi_status)
-		fail_bio_stripe(bio);
+	if (bio->bi_status) {
+		raid_reschedule_bio(bio);
+		return;
+	}
 
 	__raid_write_end_io(bio);
 }
@@ -1108,6 +1143,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 	bio->bi_iter.bi_size = 0;
 	bio_set_dev(bio, stripe->dev->bdev);
 	bio->bi_iter.bi_sector = disk_start >> 9;
+	btrfs_io_bio(bio)->stripe_index = stripe_nr;
 
 	bio_add_page(bio, page, PAGE_SIZE, 0);
 	bio_list_add(bio_list, bio);
@@ -1324,6 +1360,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 		bio->bi_private = rbio;
 		bio->bi_end_io = raid_write_end_io;
 		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+		btrfs_io_bio(bio)->iter = bio->bi_iter;
 
 		submit_bio(bio);
 	}
@@ -2452,6 +2489,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 		bio->bi_private = rbio;
 		bio->bi_end_io = raid_write_end_io;
 		bio_set_op_attrs(bio, REQ_OP_WRITE, 0);
+		btrfs_io_bio(bio)->iter = bio->bi_iter;
 
 		submit_bio(bio);
 	}
-- 
2.9.4



* [PATCH 7/7] Btrfs: retry the whole bio on write error
  2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
                   ` (5 preceding siblings ...)
  2017-11-22  0:35 ` [PATCH 6/7] Btrfs: retry write for raid56 Liu Bo
@ 2017-11-22  0:35 ` Liu Bo
  2017-11-28 19:22 ` [PATCH 0/7] retry write on error David Sterba
  7 siblings, 0 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-22  0:35 UTC (permalink / raw)
  To: linux-btrfs

Since we haven't included badblocks support in our retry path yet,
retry the whole bio instead of retrying sector by sector, to give it
an extra chance.

Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
---
 fs/btrfs/volumes.c | 72 ++++++++++--------------------------------------------
 1 file changed, 13 insertions(+), 59 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c11db0b..395b03c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6033,74 +6033,28 @@ static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
 }
 
 /*
- * return 1 if every sector retry returns successful.
- * return 0 if one or more sector retries fails.
+ * Return 1 if retry returns successful.
+ * Return 0 otherwise.
  */
 int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
 {
 	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
-	u64 sectors_to_write;
-	u64 offset;
-	u64 orig;
-	u64 unit;
-	u64 block_sectors;
-	int ok = 1;
+	int ret = 1;
 	struct bio *wbio;
 
-	/* offset and unit are bytes aligned, not 512-bytes aligned. */
-	sectors_to_write = io_bio->iter.bi_size >> 9;
-	orig = io_bio->iter.bi_sector;
-	offset = 0;
-	block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
-	unit = block_sectors;
-	ASSERT(unit == 1);
-
-	while (1) {
-		if (!sectors_to_write)
-			break;
-		/*
-		 * LIUBO: I don't think unit > sectors_to_write could
-		 * happen, sectors_to_write should be aligned to PAGE_SIZE
-		 * which is > unit.  Just in case.
-		 */
-		if (unit > sectors_to_write) {
-			WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", unit, sectors_to_write);
-			unit = sectors_to_write;
-		}
-
-		/* write @unit bytes at @offset */
-		/* this would never fail, check btrfs_bio_clone(). */
-		wbio = btrfs_bio_clone(bio);
-		wbio->bi_opf = REQ_OP_WRITE;
-		wbio->bi_iter = io_bio->iter;
+	/* this would never fail, check btrfs_bio_clone(). */
+	wbio = btrfs_bio_clone(bio);
+	wbio->bi_opf = REQ_OP_WRITE;
+	wbio->bi_iter = io_bio->iter;
 
-		bio_trim(wbio, offset, unit);
-		bio_copy_dev(wbio, bio);
+	bio_copy_dev(wbio, bio);
 
-		/* submit in sync way */
-		/*
-		 * LIUBO: There is an issue, if this bio is quite
-		 * large, say 1M or 2M, and sector size is just 512,
-		 * then this may take a while.
-		 *
-		 * May need to schedule the job to workqueue.
-		 */
-		if (submit_bio_wait(wbio) < 0) {
-			ok = 0 && ok;
-			/*
-			 * This is not correct if badblocks is enabled
-			 * as we need to record every bad sector by
-			 * trying sectors one by one.
-			 */
-			break;
-		}
+	/* submit in sync way */
+	if (submit_bio_wait(wbio) < 0)
+		ret = 0;
 
-		bio_put(wbio);
-		offset += unit;
-		sectors_to_write -= unit;
-		unit = block_sectors;
-	}
-	return ok;
+	bio_put(wbio);
+	return ret;
 }
 
 void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
-- 
2.9.4



* Re: [PATCH 3/7] Btrfs: retry write for non-raid56
  2017-11-22  0:35 ` [PATCH 3/7] Btrfs: retry write for non-raid56 Liu Bo
@ 2017-11-22 14:41   ` Nikolay Borisov
  2017-11-28 23:01     ` Liu Bo
  0 siblings, 1 reply; 21+ messages in thread
From: Nikolay Borisov @ 2017-11-22 14:41 UTC (permalink / raw)
  To: Liu Bo, linux-btrfs



On 22.11.2017 02:35, Liu Bo wrote:
> If the underlying protocal doesn't support retry and there are some
> transient errors happening somewhere in our IO stack, we'd like to
> give an extra chance for IO.
> 
> In btrfs, read retry is handled in bio_readpage_error() with the retry
> unit being page size , for write retry however, we're going to do it
> in a different way, as a write may consist of several writes onto
> different stripes, retry write needs to be done right after the IO on
> each stripe completes and reaches to endio.
> 
> As write errors are really corner cases, performance impact is not
> considered.
> 
> This adds code to retry write on errors _sector by sector_ in order to
> find out which sector is really bad so that we can mark it somewhere.
> And since endio is running in interrupt context, the real retry work
> is scheduled in system_unbound_wq in order to get a normal context.
> 
> Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> ---
>  fs/btrfs/volumes.c | 162 +++++++++++++++++++++++++++++++++++++++++++++--------
>  fs/btrfs/volumes.h |   3 +
>  2 files changed, 142 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 853f9ce..c11db0b 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio)
>  	}
>  }
>  
> -static void btrfs_end_bio(struct bio *bio)
> +static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
>  {
>  	struct btrfs_bio *bbio = bio->bi_private;
> -	int is_orig_bio = 0;
> +	unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
> +
> +	BUG_ON(stripe_index >= bbio->num_stripes);
> +	return bbio->stripes[stripe_index].dev;
> +}
> +
> +/*
> + * return 1 if every sector retry returns successful.
> + * return 0 if one or more sector retries fails.
> + */
> +int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
> +{
> +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> +	u64 sectors_to_write;
> +	u64 offset;
> +	u64 orig;
> +	u64 unit;
> +	u64 block_sectors;
> +	int ok = 1;
> +	struct bio *wbio;
> +
> +	/* offset and unit are bytes aligned, not 512-bytes aligned. */
> +	sectors_to_write = io_bio->iter.bi_size >> 9;
> +	orig = io_bio->iter.bi_sector;
> +	offset = 0;
> +	block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
> +	unit = block_sectors;
> +	ASSERT(unit == 1);
> +
> +	while (1) {
> +		if (!sectors_to_write)
> +			break;
> +		/*
> +		 * LIUBO: I don't think unit > sectors_to_write could
> +		 * happen, sectors_to_write should be aligned to PAGE_SIZE
> +		 * which is > unit.  Just in case.
> +		 */
> +		if (unit > sectors_to_write) {
> +			WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", unit, sectors_to_write);
> +			unit = sectors_to_write;
> +		}
> +
> +		/* write @unit bytes at @offset */
> +		/* this would never fail, check btrfs_bio_clone(). */
> +		wbio = btrfs_bio_clone(bio);
> +		wbio->bi_opf = REQ_OP_WRITE;
> +		wbio->bi_iter = io_bio->iter;
> +
> +		bio_trim(wbio, offset, unit);
> +		bio_copy_dev(wbio, bio);
> +
> +		/* submit in sync way */
> +		/*
> +		 * LIUBO: There is an issue, if this bio is quite
> +		 * large, say 1M or 2M, and sector size is just 512,
> +		 * then this may take a while.
> +		 *
> +		 * May need to schedule the job to workqueue.
> +		 */
> +		if (submit_bio_wait(wbio) < 0) {
> +			ok = 0 && ok;
> +			/*
> +			 * This is not correct if badblocks is enabled
> +			 * as we need to record every bad sector by
> +			 * trying sectors one by one.
> +			 */
> +			break;
> +		}
> +
> +		bio_put(wbio);
> +		offset += unit;
> +		sectors_to_write -= unit;
> +		unit = block_sectors;
> +	}
> +	return ok;
> +}
> +
> +void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
> +{
> +	if (bio->bi_status == BLK_STS_IOERR ||
> +	    bio->bi_status == BLK_STS_TARGET) {
> +		if (dev->bdev) {
> +			if (bio_data_dir(bio) == WRITE)
> +				btrfs_dev_stat_inc(dev,
> +						   BTRFS_DEV_STAT_WRITE_ERRS);
> +			else
> +				btrfs_dev_stat_inc(dev,
> +						   BTRFS_DEV_STAT_READ_ERRS);
> +			if (bio->bi_opf & REQ_PREFLUSH)
> +				btrfs_dev_stat_inc(dev,
> +						   BTRFS_DEV_STAT_FLUSH_ERRS);
> +			btrfs_dev_stat_print_on_error(dev);
> +		}
> +	}
> +}
> +
> +static void btrfs_handle_bio_error(struct work_struct *work)
> +{
> +	struct btrfs_io_bio *io_bio = container_of(work, struct btrfs_io_bio,
> +						   work);
> +	struct bio *bio = &io_bio->bio;
> +	struct btrfs_device *dev = get_device_from_bio(bio);
> +
> +	ASSERT(bio_data_dir(bio) == WRITE);
> +
> +	if (!btrfs_narrow_write_error(bio, dev)) {
> +		/* inc error counter if narrow (retry) fails */
> +		struct btrfs_bio *bbio = bio->bi_private;
>  
> -	if (bio->bi_status) {
>  		atomic_inc(&bbio->error);
> -		if (bio->bi_status == BLK_STS_IOERR ||
> -		    bio->bi_status == BLK_STS_TARGET) {
> -			unsigned int stripe_index =
> -				btrfs_io_bio(bio)->stripe_index;
> -			struct btrfs_device *dev;
> -
> -			BUG_ON(stripe_index >= bbio->num_stripes);
> -			dev = bbio->stripes[stripe_index].dev;
> -			if (dev->bdev) {
> -				if (bio_op(bio) == REQ_OP_WRITE)
> -					btrfs_dev_stat_inc(dev,
> -						BTRFS_DEV_STAT_WRITE_ERRS);
> -				else
> -					btrfs_dev_stat_inc(dev,
> -						BTRFS_DEV_STAT_READ_ERRS);
> -				if (bio->bi_opf & REQ_PREFLUSH)
> -					btrfs_dev_stat_inc(dev,
> -						BTRFS_DEV_STAT_FLUSH_ERRS);
> -				btrfs_dev_stat_print_on_error(dev);
> -			}
> +		btrfs_record_bio_error(bio, dev);
> +	}
> +
> +	__btrfs_end_bio(bio);
> +}
> +
> +/* running in irq context */
> +static void btrfs_reschedule_retry(struct bio *bio)
> +{
> +	INIT_WORK(&btrfs_io_bio(bio)->work, btrfs_handle_bio_error);
> +	/* _temporarily_ use system_unbound_wq */
> +	queue_work(system_unbound_wq, &btrfs_io_bio(bio)->work);
> +}
> +
> +static void btrfs_end_bio(struct bio *bio)
> +{
> +	if (bio->bi_status) {
> +		struct btrfs_bio *bbio = bio->bi_private;
> +
> +		if (bio_data_dir(bio) == WRITE) {
> +			btrfs_reschedule_retry(bio);
> +			return;
>  		}
> +
> +		atomic_inc(&bbio->error);
> +
> +		/* I don't think we can have REQ_PREFLUSH, but just in case. */
> +		WARN_ON_ONCE(bio->bi_opf & REQ_PREFLUSH);
> +
> +		/* REQ_PREFLUSH */
> +		btrfs_record_bio_error(bio, get_device_from_bio(bio));
>  	}

This patch adds btrfs_end_bio but the same function is also added by
the previous one.  Perhaps some of the refactoring here should go into
the previous patch, which just juggles existing code, and leave the new
changes for this patch?

>  
>  	__btrfs_end_bio(bio);
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 6108fdf..c1da27f 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -281,6 +281,7 @@ struct btrfs_io_bio {
>  	u8 csum_inline[BTRFS_BIO_INLINE_CSUM_SIZE];
>  	u8 *csum_allocated;
>  	btrfs_io_bio_end_io_t *end_io;
> +	struct work_struct work;
>  	struct bvec_iter iter;
>  	/*
>  	 * This member must come last, bio_alloc_bioset will allocate enough
> @@ -544,5 +545,7 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
>  bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info);
>  void btrfs_report_missing_device(struct btrfs_fs_info *fs_info, u64 devid,
>  				 u8 *uuid);
> +int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev);
> +void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev);
>  
>  #endif
> 


* Re: [PATCH 0/7] retry write on error
  2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
                   ` (6 preceding siblings ...)
  2017-11-22  0:35 ` [PATCH 7/7] Btrfs: retry the whole bio on write error Liu Bo
@ 2017-11-28 19:22 ` David Sterba
  2017-11-28 22:07   ` Edmund Nadolski
  2017-11-29 18:09   ` Liu Bo
  7 siblings, 2 replies; 21+ messages in thread
From: David Sterba @ 2017-11-28 19:22 UTC (permalink / raw)
  To: Liu Bo; +Cc: linux-btrfs

On Tue, Nov 21, 2017 at 05:35:51PM -0700, Liu Bo wrote:
> If the underlying protocal doesn't support retry and there are some
> transient errors happening somewhere in our IO stack, we'd like to
> give an extra chance for IO.  Or sometimes you see btrfs reporting
> 'wrr 1 flush 0 read 0 blabla' but the disk drive is 100% good, this
> retry may help a bit.

A limited number of retries may make sense, though I saw some long
stalls after retries on bad disks. Tracking the retries would be a good
addition to the dev stats, ie. a soft error but still worth reporting.

> In btrfs, read retry is handled in bio_readpage_error() with the retry
> unit being page size, for write retry however, we're going to do it in
> a different way, as a write may consist of several writes onto
> different stripes, retry write needs to be done right after the IO on
> each stripe completes and arrives at endio.
> 
> Patch 1-3 are the implementation of retry write on error for
> non-raid56 profile.  Patch 4-6 are for raid56 profile.  Both raid56
> and non-raid56 shares one retry function helper.
> 
> Patch 3 does retry sector by sector, but since this patch set doesn't
> included badblocks support, patch 7 changes it back to retry the whole
> bio.  (I didn't fold patch 7 to patch 3 in the hope of just reverting
> patch 7 once badblocks support is done, but I'm open to it.)

What does 'badblocks' refer to?  I know about the badblocks utility
that finds and reports bad blocks; possibly ext2 understands that and
avoids allocating them.  Btrfs does not have such support.


* Re: [PATCH 0/7] retry write on error
  2017-11-28 19:22 ` [PATCH 0/7] retry write on error David Sterba
@ 2017-11-28 22:07   ` Edmund Nadolski
  2017-11-28 23:41     ` Peter Grandi
  2017-11-29 18:09   ` Liu Bo
  1 sibling, 1 reply; 21+ messages in thread
From: Edmund Nadolski @ 2017-11-28 22:07 UTC (permalink / raw)
  To: dsterba, Liu Bo, linux-btrfs



On 11/28/2017 12:22 PM, David Sterba wrote:
> On Tue, Nov 21, 2017 at 05:35:51PM -0700, Liu Bo wrote:
>> If the underlying protocal doesn't support retry and there are some
>> transient errors happening somewhere in our IO stack, we'd like to
>> give an extra chance for IO.  Or sometimes you see btrfs reporting
>> 'wrr 1 flush 0 read 0 blabla' but the disk drive is 100% good, this
>> retry may help a bit.
> 
> A limited number of retries may make sense, though I saw some long
> stalls after retries on bad disks. Tracking the retries would be a good
> addition to the dev stats, ie. a soft error but still worth reporting.

Seems preferable to avoid issuing retries when the underlying transport
layer(s) has already done so, but I am not sure there is a way to know
that at the fs level. Likewise for cases where a retry cannot succeed
(transport failures? media errors?).  It can get tricky to know how long
to delay between retries and when it is appropriate to stop retrying.

Ed

>> In btrfs, read retry is handled in bio_readpage_error() with the retry
>> unit being page size, for write retry however, we're going to do it in
>> a different way, as a write may consist of several writes onto
>> different stripes, retry write needs to be done right after the IO on
>> each stripe completes and arrives at endio.
>>
>> Patch 1-3 are the implementation of retry write on error for
>> non-raid56 profile.  Patch 4-6 are for raid56 profile.  Both raid56
>> and non-raid56 shares one retry function helper.
>>
>> Patch 3 does retry sector by sector, but since this patch set doesn't
>> included badblocks support, patch 7 changes it back to retry the whole
>> bio.  (I didn't fold patch 7 to patch 3 in the hope of just reverting
>> patch 7 once badblocks support is done, but I'm open to it.)
> 
> What does 'badblocks' refer to? I know about the badblocks utility that
> find and reportts bad blocks, possibly ext2 understands that and avoids
> allocating them. Btrfs does not have such support.


* Re: [PATCH 3/7] Btrfs: retry write for non-raid56
  2017-11-22 14:41   ` Nikolay Borisov
@ 2017-11-28 23:01     ` Liu Bo
  0 siblings, 0 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-28 23:01 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: linux-btrfs

On Wed, Nov 22, 2017 at 04:41:10PM +0200, Nikolay Borisov wrote:
> 
> 
> On 22.11.2017 02:35, Liu Bo wrote:
> > If the underlying protocal doesn't support retry and there are some
> > transient errors happening somewhere in our IO stack, we'd like to
> > give an extra chance for IO.
> > 
> > In btrfs, read retry is handled in bio_readpage_error() with the retry
> > unit being page size , for write retry however, we're going to do it
> > in a different way, as a write may consist of several writes onto
> > different stripes, retry write needs to be done right after the IO on
> > each stripe completes and reaches to endio.
> > 
> > As write errors are really corner cases, performance impact is not
> > considered.
> > 
> > This adds code to retry write on errors _sector by sector_ in order to
> > find out which sector is really bad so that we can mark it somewhere.
> > And since endio is running in interrupt context, the real retry work
> > is scheduled in system_unbound_wq in order to get a normal context.
> > 
> > Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
> > ---
> >  fs/btrfs/volumes.c | 162 +++++++++++++++++++++++++++++++++++++++++++++--------
> >  fs/btrfs/volumes.h |   3 +
> >  2 files changed, 142 insertions(+), 23 deletions(-)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 853f9ce..c11db0b 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6023,34 +6023,150 @@ static void __btrfs_end_bio(struct bio *bio)
> >  	}
> >  }
> >  
> > -static void btrfs_end_bio(struct bio *bio)
> > +static inline struct btrfs_device *get_device_from_bio(struct bio *bio)
> >  {
> >  	struct btrfs_bio *bbio = bio->bi_private;
> > -	int is_orig_bio = 0;
> > +	unsigned int stripe_index = btrfs_io_bio(bio)->stripe_index;
> > +
> > +	BUG_ON(stripe_index >= bbio->num_stripes);
> > +	return bbio->stripes[stripe_index].dev;
> > +}
> > +
> > +/*
> > + * return 1 if every sector retry returns successful.
> > + * return 0 if one or more sector retries fails.
> > + */
> > +int btrfs_narrow_write_error(struct bio *bio, struct btrfs_device *dev)
> > +{
> > +	struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);
> > +	u64 sectors_to_write;
> > +	u64 offset;
> > +	u64 orig;
> > +	u64 unit;
> > +	u64 block_sectors;
> > +	int ok = 1;
> > +	struct bio *wbio;
> > +
> > +	/* offset and unit are bytes aligned, not 512-bytes aligned. */
> > +	sectors_to_write = io_bio->iter.bi_size >> 9;
> > +	orig = io_bio->iter.bi_sector;
> > +	offset = 0;
> > +	block_sectors = bdev_logical_block_size(dev->bdev) >> 9;
> > +	unit = block_sectors;
> > +	ASSERT(unit == 1);
> > +
> > +	while (1) {
> > +		if (!sectors_to_write)
> > +			break;
> > +		/*
> > +		 * LIUBO: I don't think unit > sectors_to_write could
> > +		 * happen, sectors_to_write should be aligned to PAGE_SIZE
> > +		 * which is > unit.  Just in case.
> > +		 */
> > +		if (unit > sectors_to_write) {
> > +			WARN_ONCE(1, "unit %llu > sectors_to_write (%llu)\n", unit, sectors_to_write);
> > +			unit = sectors_to_write;
> > +		}
> > +
> > +		/* write @unit bytes at @offset */
> > +		/* this would never fail, check btrfs_bio_clone(). */
> > +		wbio = btrfs_bio_clone(bio);
> > +		wbio->bi_opf = REQ_OP_WRITE;
> > +		wbio->bi_iter = io_bio->iter;
> > +
> > +		bio_trim(wbio, offset, unit);
> > +		bio_copy_dev(wbio, bio);
> > +
> > +		/* submit in sync way */
> > +		/*
> > +		 * LIUBO: There is an issue, if this bio is quite
> > +		 * large, say 1M or 2M, and sector size is just 512,
> > +		 * then this may take a while.
> > +		 *
> > +		 * May need to schedule the job to workqueue.
> > +		 */
> > +		if (submit_bio_wait(wbio) < 0) {
> > +			ok = 0 && ok;
> > +			/*
> > +			 * This is not correct if badblocks is enabled
> > +			 * as we need to record every bad sector by
> > +			 * trying sectors one by one.
> > +			 */
> > +			break;
> > +		}
> > +
> > +		bio_put(wbio);
> > +		offset += unit;
> > +		sectors_to_write -= unit;
> > +		unit = block_sectors;
> > +	}
> > +	return ok;
> > +}
> > +
> > +void btrfs_record_bio_error(struct bio *bio, struct btrfs_device *dev)
> > +{
> > +	if (bio->bi_status == BLK_STS_IOERR ||
> > +	    bio->bi_status == BLK_STS_TARGET) {
> > +		if (dev->bdev) {
> > +			if (bio_data_dir(bio) == WRITE)
> > +				btrfs_dev_stat_inc(dev,
> > +						   BTRFS_DEV_STAT_WRITE_ERRS);
> > +			else
> > +				btrfs_dev_stat_inc(dev,
> > +						   BTRFS_DEV_STAT_READ_ERRS);
> > +			if (bio->bi_opf & REQ_PREFLUSH)
> > +				btrfs_dev_stat_inc(dev,
> > +						   BTRFS_DEV_STAT_FLUSH_ERRS);
> > +			btrfs_dev_stat_print_on_error(dev);
> > +		}
> > +	}
> > +}
> > +
> > +static void btrfs_handle_bio_error(struct work_struct *work)
> > +{
> > +	struct btrfs_io_bio *io_bio = container_of(work, struct btrfs_io_bio,
> > +						   work);
> > +	struct bio *bio = &io_bio->bio;
> > +	struct btrfs_device *dev = get_device_from_bio(bio);
> > +
> > +	ASSERT(bio_data_dir(bio) == WRITE);
> > +
> > +	if (!btrfs_narrow_write_error(bio, dev)) {
> > +		/* inc error counter if narrow (retry) fails */
> > +		struct btrfs_bio *bbio = bio->bi_private;
> >  
> > -	if (bio->bi_status) {
> >  		atomic_inc(&bbio->error);
> > -		if (bio->bi_status == BLK_STS_IOERR ||
> > -		    bio->bi_status == BLK_STS_TARGET) {
> > -			unsigned int stripe_index =
> > -				btrfs_io_bio(bio)->stripe_index;
> > -			struct btrfs_device *dev;
> > -
> > -			BUG_ON(stripe_index >= bbio->num_stripes);
> > -			dev = bbio->stripes[stripe_index].dev;
> > -			if (dev->bdev) {
> > -				if (bio_op(bio) == REQ_OP_WRITE)
> > -					btrfs_dev_stat_inc(dev,
> > -						BTRFS_DEV_STAT_WRITE_ERRS);
> > -				else
> > -					btrfs_dev_stat_inc(dev,
> > -						BTRFS_DEV_STAT_READ_ERRS);
> > -				if (bio->bi_opf & REQ_PREFLUSH)
> > -					btrfs_dev_stat_inc(dev,
> > -						BTRFS_DEV_STAT_FLUSH_ERRS);
> > -				btrfs_dev_stat_print_on_error(dev);
> > -			}
> > +		btrfs_record_bio_error(bio, dev);
> > +	}
> > +
> > +	__btrfs_end_bio(bio);
> > +}
> > +
> > +/* running in irq context */
> > +static void btrfs_reschedule_retry(struct bio *bio)
> > +{
> > +	INIT_WORK(&btrfs_io_bio(bio)->work, btrfs_handle_bio_error);
> > +	/* _temporarily_ use system_unbound_wq */
> > +	queue_work(system_unbound_wq, &btrfs_io_bio(bio)->work);
> > +}
> > +
> > +static void btrfs_end_bio(struct bio *bio)
> > +{
> > +	if (bio->bi_status) {
> > +		struct btrfs_bio *bbio = bio->bi_private;
> > +
> > +		if (bio_data_dir(bio) == WRITE) {
> > +			btrfs_reschedule_retry(bio);
> > +			return;
> >  		}
> > +
> > +		atomic_inc(&bbio->error);
> > +
> > +		/* I don't think we can have REQ_PREFLUSH, but just in case. */
> > +		WARN_ON_ONCE(bio->bi_opf & REQ_PREFLUSH);
> > +
> > +		/* REQ_PREFLUSH */
> > +		btrfs_record_bio_error(bio, get_device_from_bio(bio));
> >  	}
> 
> This patch adds btrfs_end_bio but the same function is also added by the
> previous one. Perhaps some of the refactorings here should go into the
> previous patch which just juggles existing code and leave the new
> changes for this patch?

I looked into it; it seems that git diff produced a confusing hunk
because btrfs_end_bio() and btrfs_handle_bio_error() share some code.

Tbh I'm not sure how to refactor it to make git diff happy; as you can
see, patch 2 is too simple to be refactored much further.

Thanks,

-liubo


* Re: [PATCH 0/7] retry write on error
  2017-11-28 22:07   ` Edmund Nadolski
@ 2017-11-28 23:41     ` Peter Grandi
  2017-11-29  4:09       ` Anand Jain
  0 siblings, 1 reply; 21+ messages in thread
From: Peter Grandi @ 2017-11-28 23:41 UTC (permalink / raw)
  To: Linux fs Btrfs

>>> If the underlying protocal doesn't support retry and there
>>> are some transient errors happening somewhere in our IO
>>> stack, we'd like to give an extra chance for IO.

>> A limited number of retries may make sense, though I saw some
>> long stalls after retries on bad disks.

Indeed! One of the major issues in actual storage administration
is to find ways to reliably disable most retries, or to shorten
them, both at the block device level and the device level,
because in almost all cases where storage reliability matters
what is important is simply swapping out the failing device
immediately and then examining and possibly refreshing it
offline.

To the point that many device manufacturers deliberately cripple
in cheaper products retry shortening or disabling options to
force long stalls, so that people who care about reliability
more than price will buy the more expensive version that can
disable or shorten retries.

> Seems preferable to avoid issuing retries when the underlying
> transport layer(s) has already done so, but I am not sure
> there is a way to know that at the fs level.

Indeed, and to use an euphemism, a third layer of retries at the
filesystem level is currently a thoroughly imbecilic idea :-),
as whether retries are worth doing is not a filesystem dependent
issue (but then plugging is done at the block io level when it
is entirely device dependent whether it is worth doing, so there
is famous precedent).

There are excellent reasons why error recovery is in general not
done at the filesystem level since around 20 years ago, which do
not need repeating every time. However one of them is that where
it makes sense device firmware does retries, and the block
device layer does retries too, which is often a bad idea, and
where it is not, the block io level should do that, not the
filesystem.

A large part of the above discussion would not be needed if
Linux kernel "developers" exposed a clear notion of hardware
device and block device state machine and related semantics, or
even knew that it were desirable, but that's an idea that is
only 50 years old, so may not have yet reached popularity :-).


* Re: [PATCH 0/7] retry write on error
  2017-11-28 23:41     ` Peter Grandi
@ 2017-11-29  4:09       ` Anand Jain
  2017-11-29 16:47         ` David Sterba
  0 siblings, 1 reply; 21+ messages in thread
From: Anand Jain @ 2017-11-29  4:09 UTC (permalink / raw)
  To: Peter Grandi, Linux fs Btrfs



On 11/29/2017 07:41 AM, pg@btrfs.list.sabi.co.UK wrote:
>>>> If the underlying protocal doesn't support retry and there
>>>> are some transient errors happening somewhere in our IO
>>>> stack, we'd like to give an extra chance for IO.
> 
>>> A limited number of retries may make sense, though I saw some
>>> long stalls after retries on bad disks.
> 
> Indeed! One of the major issues in actual storage administration
> is to find ways to reliably disable most retries, or to shorten
> them, both at the block device level and the device level,
> because in almost all cases where storage reliability matters
> what is important is simply swapping out the failing device
> immediately and then examining and possible refreshing it
> offline.
> 
> To the point that many device manufacturers deliberately cripple
> in cheaper products retry shortening or disabling options to
> force long stalls, so that people who care about reliability
> more than price will buy the more expensive version that can
> disable or shorten retries.
> 
>> Seems preferable to avoid issuing retries when the underlying
>> transport layer(s) has already done so, but I am not sure
>> there is a way to know that at the fs level.
> 
> Inded, and to use an euphemism, a third layer of retries at the
> filesystem level are currently a thoroughly imbecilic idea :-),
> as whether retries are worth doing is not a filesystem dependent
> issue (but then plugging is done at the block io level when it
> is entirely device dependent whether it is worth doing, so there
> is famous precedent).
> 
> There are excellent reasons why error recovery is in general not
> done at the filesystem level since around 20 years ago, which do
> not need repeating every time. However one of them is that where
> it makes sense device firmware does retries, and the block
> device layer does retries too, which is often a bad idea, and
> where it is not, the block io level should be do that, not the
> filesystem.
> 
> A large part of the above discussion would not be needed if
> Linux kernel "developers" exposed a clear notion of hardware
> device and block device state machine and related semantics, or
> even knew that it were desirable, but that's an idea that is
> only 50 years old, so may not have yet reached popularity :-).


  I agree with Ed and Peter; a similar opinion was posted here [1].

     [1]
     https://www.spinics.net/lists/linux-btrfs/msg70240.html

Thanks, Anand






* Re: [PATCH 0/7] retry write on error
  2017-11-29  4:09       ` Anand Jain
@ 2017-11-29 16:47         ` David Sterba
  2017-11-30 20:22           ` Liu Bo
  0 siblings, 1 reply; 21+ messages in thread
From: David Sterba @ 2017-11-29 16:47 UTC (permalink / raw)
  To: Anand Jain; +Cc: Peter Grandi, Linux fs Btrfs

On Wed, Nov 29, 2017 at 12:09:29PM +0800, Anand Jain wrote:
> On 11/29/2017 07:41 AM, pg@btrfs.list.sabi.co.UK wrote:
> >>>> If the underlying protocal doesn't support retry and there
> >>>> are some transient errors happening somewhere in our IO
> >>>> stack, we'd like to give an extra chance for IO.
> > 
> >>> A limited number of retries may make sense, though I saw some
> >>> long stalls after retries on bad disks.
> > 
> > Indeed! One of the major issues in actual storage administration
> > is to find ways to reliably disable most retries, or to shorten
> > them, both at the block device level and the device level,
> > because in almost all cases where storage reliability matters
> > what is important is simply swapping out the failing device
> > immediately and then examining and possible refreshing it
> > offline.
> > 
> > To the point that many device manufacturers deliberately cripple
> > in cheaper products retry shortening or disabling options to
> > force long stalls, so that people who care about reliability
> > more than price will buy the more expensive version that can
> > disable or shorten retries.
> > 
> >> Seems preferable to avoid issuing retries when the underlying
> >> transport layer(s) has already done so, but I am not sure
> >> there is a way to know that at the fs level.
> > 
> > Inded, and to use an euphemism, a third layer of retries at the
> > filesystem level are currently a thoroughly imbecilic idea :-),
> > as whether retries are worth doing is not a filesystem dependent
> > issue (but then plugging is done at the block io level when it
> > is entirely device dependent whether it is worth doing, so there
> > is famous precedent).
> > 
> > There are excellent reasons why error recovery is in general not
> > done at the filesystem level since around 20 years ago, which do
> > not need repeating every time. However one of them is that where
> > it makes sense device firmware does retries, and the block
> > device layer does retries too, which is often a bad idea, and
> > where it is not, the block io level should be do that, not the
> > filesystem.
> > 
> > A large part of the above discussion would not be needed if
> > Linux kernel "developers" exposed a clear notion of hardware
> > device and block device state machine and related semantics, or
> > even knew that it were desirable, but that's an idea that is
> > only 50 years old, so may not have yet reached popularity :-).
> 
>   I agree with Ed and Peter, similar opinion was posted here [1].
>      https://www.spinics.net/lists/linux-btrfs/msg70240.html

All the points in this thread speak against retries on the filesystem
level and I agree. Without an interface to query the block layer if the
retries make sense, it's just guessing, likely to be wrong.


* Re: [PATCH 0/7] retry write on error
  2017-11-28 19:22 ` [PATCH 0/7] retry write on error David Sterba
  2017-11-28 22:07   ` Edmund Nadolski
@ 2017-11-29 18:09   ` Liu Bo
  1 sibling, 0 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-29 18:09 UTC (permalink / raw)
  To: dsterba, linux-btrfs

On Tue, Nov 28, 2017 at 08:22:36PM +0100, David Sterba wrote:
> On Tue, Nov 21, 2017 at 05:35:51PM -0700, Liu Bo wrote:
> > If the underlying protocal doesn't support retry and there are some
> > transient errors happening somewhere in our IO stack, we'd like to
> > give an extra chance for IO.  Or sometimes you see btrfs reporting
> > 'wrr 1 flush 0 read 0 blabla' but the disk drive is 100% good, this
> > retry may help a bit.
> 
> A limited number of retries may make sense, though I saw some long
> stalls after retries on bad disks. Tracking the retries would be a good
> addition to the dev stats, ie. a soft error but still worth reporting.
>
> > In btrfs, read retry is handled in bio_readpage_error() with the retry
> > unit being page size, for write retry however, we're going to do it in
> > a different way, as a write may consist of several writes onto
> > different stripes, retry write needs to be done right after the IO on
> > each stripe completes and arrives at endio.
> > 
> > Patch 1-3 are the implementation of retry write on error for
> > non-raid56 profile.  Patch 4-6 are for raid56 profile.  Both raid56
> > and non-raid56 shares one retry function helper.
> > 
> > Patch 3 does retry sector by sector, but since this patch set doesn't
> > included badblocks support, patch 7 changes it back to retry the whole
> > bio.  (I didn't fold patch 7 to patch 3 in the hope of just reverting
> > patch 7 once badblocks support is done, but I'm open to it.)
> 
> What does 'badblocks' refer to? I know about the badblocks utility that
> find and reportts bad blocks, possibly ext2 understands that and avoids
> allocating them. Btrfs does not have such support.

The same thing: badblocks refers to block/badblocks.c, which derives
from md's badblocks table and now also serves to track bad cells in
pmem. And yes, btrfs doesn't have that yet.
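
For reference, a rough sketch of how that interface is consumed
(signatures as in include/linux/badblocks.h of that era; the btrfs-side
wiring here is entirely hypothetical, since btrfs has no badblocks
table yet):

#include <linux/badblocks.h>

/* hypothetical per-device table, e.g. hung off btrfs_device */
static struct badblocks example_bb;

static int example_init(void)
{
        /* 1 == enable bad-blocks accounting from the start */
        return badblocks_init(&example_bb, 1);
}

/* a write retry ultimately failed: remember the bad range */
static int example_note_write_failure(sector_t s, int sectors)
{
        return badblocks_set(&example_bb, s, sectors, 0 /* unacked */);
}

/* before reusing a range, check it against the recorded bad blocks */
static bool example_range_is_clean(sector_t s, int sectors)
{
        sector_t first_bad;
        int bad_sectors;

        return badblocks_check(&example_bb, s, sectors,
                               &first_bad, &bad_sectors) == 0;
}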

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/7] retry write on error
  2017-11-29 16:47         ` David Sterba
@ 2017-11-30 20:22           ` Liu Bo
  2017-12-03 21:00             ` Peter Grandi
                               ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Liu Bo @ 2017-11-30 20:22 UTC (permalink / raw)
  To: dsterba, Anand Jain, Peter Grandi, Linux fs Btrfs

On Wed, Nov 29, 2017 at 05:47:08PM +0100, David Sterba wrote:
> On Wed, Nov 29, 2017 at 12:09:29PM +0800, Anand Jain wrote:
> > On 11/29/2017 07:41 AM, pg@btrfs.list.sabi.co.UK wrote:
> > >>>> If the underlying protocal doesn't support retry and there
> > >>>> are some transient errors happening somewhere in our IO
> > >>>> stack, we'd like to give an extra chance for IO.
> > > 
> > >>> A limited number of retries may make sense, though I saw some
> > >>> long stalls after retries on bad disks.
> > > 
> > > Indeed! One of the major issues in actual storage administration
> > > is to find ways to reliably disable most retries, or to shorten
> > > them, both at the block device level and the device level,
> > > because in almost all cases where storage reliability matters
> > > what is important is simply swapping out the failing device
> > > immediately and then examining and possibly refreshing it
> > > offline.
> > > 
> > > To the point that many device manufacturers deliberately cripple
> > > retry shortening or disabling options in cheaper products to
> > > force long stalls, so that people who care about reliability
> > > more than price will buy the more expensive version that can
> > > disable or shorten retries.
> > > 
> > >> Seems preferable to avoid issuing retries when the underlying
> > >> transport layer(s) has already done so, but I am not sure
> > >> there is a way to know that at the fs level.
> > > 
> > > Indeed, and to use a euphemism, a third layer of retries at the
> > > filesystem level is currently a thoroughly imbecilic idea :-),
> > > as whether retries are worth doing is not a filesystem dependent
> > > issue (but then plugging is done at the block io level when it
> > > is entirely device dependent whether it is worth doing, so there
> > > is famous precedent).
> > > 
> > > There are excellent reasons why error recovery has in general not
> > > been done at the filesystem level for around 20 years, which do
> > > not need repeating every time. However, one of them is that where
> > > it makes sense device firmware does retries, and the block
> > > device layer does retries too, which is often a bad idea, and
> > > where it is not, the block io level should do that, not the
> > > filesystem.
> > > 
> > > A large part of the above discussion would not be needed if
> > > Linux kernel "developers" exposed a clear notion of hardware
> > > device and block device state machine and related semantics, or
> > > even knew that it was desirable, but that's an idea that is
> > > only 50 years old, so may not have yet reached popularity :-).
> > 
> >   I agree with Ed and Peter; a similar opinion was posted here [1].
> >      https://www.spinics.net/lists/linux-btrfs/msg70240.html
> 
> All the points in this thread speak against retries at the filesystem
> level, and I agree. Without an interface to query the block layer
> whether the retries make sense, it's just guessing, and likely to be
> wrong.

I do agree that a filesystem doesn't need to retry at its own level,
but btrfs incorporates disk management, which is effectively its own
version of the md layer; adding retries at that layer can gain
robustness for us.

It's true that scsi's sd layer already does SD_MAX_RETRIES (5) retries,
but can we really depend on other layers for all robustness?

In the raid56 scenario, a typical filesystem doesn't fail the
raid/disk, but btrfs does; doing a retry _may_ save us a lot of time
rebuilding the raid.

Anyway, this is for a corner case, not for everyone. I think I need to
make it configurable so that at least we can provide some extra
robustness for people who really care about their data.
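
To make that concrete, here is a minimal sketch (not code from this
series) of a bounded, per-stripe retry driven from write endio. The
max_write_retries knob and the retries counter in btrfs_io_bio are
made-up names, the call to __btrfs_end_bio() assumes the helper added
earlier in the series performs the normal completion, and a real
implementation would likely defer the resubmission to a workqueue
rather than doing it from endio context:

static int max_write_retries = 2;       /* hypothetical, e.g. a mount option */

static void btrfs_retry_write_end_io(struct bio *bio)
{
        struct btrfs_io_bio *io_bio = btrfs_io_bio(bio);

        if (bio->bi_status && io_bio->retries++ < max_write_retries) {
                /* restore the iterator we stashed at submit time */
                bio->bi_iter = io_bio->iter;
                bio->bi_status = BLK_STS_OK;
                submit_bio(bio);        /* one more attempt on this stripe */
                return;
        }

        /* success, or retries exhausted: run the normal completion */
        __btrfs_end_bio(bio);
}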

Thanks,

-liubo

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/7] retry write on error
  2017-11-30 20:22           ` Liu Bo
@ 2017-12-03 21:00             ` Peter Grandi
  2017-12-04  9:14             ` Anand Jain
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 21+ messages in thread
From: Peter Grandi @ 2017-12-03 21:00 UTC (permalink / raw)
  To: Linux fs Btrfs

> [ ... ] btrfs incorporates disk management which is actually a
> version of md layer, [ ... ]

As far as I know, Btrfs has no disk management, and was wisely
designed without any, just like MD: Btrfs volumes and MD sets
can be composed from "block devices", not disks, and block
devices are quite high-level abstractions, as they closely mimic
the semantics of a UNIX file, not a physical device.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/7] retry write on error
  2017-11-30 20:22           ` Liu Bo
  2017-12-03 21:00             ` Peter Grandi
@ 2017-12-04  9:14             ` Anand Jain
  2017-12-04 20:49             ` Edmund Nadolski
  2017-12-05 18:57             ` David Sterba
  3 siblings, 0 replies; 21+ messages in thread
From: Anand Jain @ 2017-12-04  9:14 UTC (permalink / raw)
  To: bo.li.liu; +Cc: dsterba, Peter Grandi, Linux fs Btrfs




>>>>>>> If the underlying protocal doesn't support retry
>>>>>>> and there
>>>>>>> are some transient errors happening somewhere in our IO
>>>>>>> stack, we'd like to give an extra chance for IO.


> It's true that scsi's sd layer already does SD_MAX_RETRIES (5) retries,
> but can we really depend on other layers for all robustness?


  You are contradicting your own problem statement above.


Thanks, Anand




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/7] retry write on error
  2017-11-30 20:22           ` Liu Bo
  2017-12-03 21:00             ` Peter Grandi
  2017-12-04  9:14             ` Anand Jain
@ 2017-12-04 20:49             ` Edmund Nadolski
  2017-12-05 18:57             ` David Sterba
  3 siblings, 0 replies; 21+ messages in thread
From: Edmund Nadolski @ 2017-12-04 20:49 UTC (permalink / raw)
  To: bo.li.liu, dsterba, Anand Jain, Peter Grandi, Linux fs Btrfs

On 11/30/2017 01:22 PM, Liu Bo wrote:
>>>>>>> If the underlying protocal doesn't support retry and there
>>>>>>> are some transient errors happening somewhere in our IO
>>>>>>> stack, we'd like to give an extra chance for IO.

> Anyway, this is for a corner case, not for everyone. I think I need to
> make it configurable so that at least we can provide some extra
> robustness for people who really care about their data.

Not sure I follow -- wouldn't such users prefer a transport like e.g.
scsi that *does* perform retries (as well as other error recovery)?

(Possibly they would also want features like mirroring & multipath, but
in those scenarios doing additional retries from the filesystem is also
unlikely to help much.)

Ed


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/7] retry write on error
  2017-11-30 20:22           ` Liu Bo
                               ` (2 preceding siblings ...)
  2017-12-04 20:49             ` Edmund Nadolski
@ 2017-12-05 18:57             ` David Sterba
  3 siblings, 0 replies; 21+ messages in thread
From: David Sterba @ 2017-12-05 18:57 UTC (permalink / raw)
  To: Liu Bo; +Cc: dsterba, Anand Jain, Peter Grandi, Linux fs Btrfs

On Thu, Nov 30, 2017 at 12:22:53PM -0800, Liu Bo wrote:
> > >   I agree with Ed and Peter; a similar opinion was posted here [1].
> > >      https://www.spinics.net/lists/linux-btrfs/msg70240.html
> > 
> > All the points in this thread speak against retries at the filesystem
> > level, and I agree. Without an interface to query the block layer
> > whether the retries make sense, it's just guessing, and likely to be
> > wrong.
> 
> I do agree that a filesystem doesn't need to retry at its own level,
> but btrfs incorporates disk management, which is effectively its own
> version of the md layer; adding retries at that layer can gain
> robustness for us.
> 
> It's true that scsi's sd layer already does SD_MAX_RETRIES (5) retries,
> but can we really depend on other layers for all robustness?

I'm afraid that we have to. We don't really want to dive too deep into
the storage layer that's beneath the filesystem. If there were an
interface like "blk_queue_retries_make_sense", then letting the upper
layer retry could improve robustness, but otherwise I see it as false
hope and just wasting time retrying.
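
Purely as an illustration of what such an interface could look like
(none of this exists; the flag name and value are invented), a single
request_queue flag that a driver sets when it does not already retry
internally would be enough for the filesystem to make an informed
decision:

#include <linux/blkdev.h>

/* hypothetical: set by a driver that does NOT retry internally */
#define QUEUE_FLAG_FS_RETRY_OK  29

static inline bool blk_queue_retries_make_sense(struct request_queue *q)
{
        return test_bit(QUEUE_FLAG_FS_RETRY_OK, &q->queue_flags);
}

/* the filesystem's write endio could then gate its own retry on it */
static bool example_should_retry(struct bio *bio)
{
        return bio->bi_status &&
               blk_queue_retries_make_sense(bio->bi_disk->queue);
}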

> In the raid56 scenario, a typical filesystem doesn't fail the
> raid/disk, but btrfs does; doing a retry _may_ save us a lot of time
> rebuilding the raid.
> 
> Anyway, this is for a corner case, not for everyone. I think I need to
> make it configurable so that at least we can provide some extra
> robustness for people who really care about their data.

This should IMHO be solved by redundancy. If the storage layer is not
reliable, it's not wise to keep using the underlying devices and rely
on retries.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2017-12-05 18:59 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-22  0:35 [PATCH 0/7] retry write on error Liu Bo
2017-11-22  0:35 ` [PATCH 1/7] Btrfs: keep a copy of bi_iter in btrfs_io_bio Liu Bo
2017-11-22  0:35 ` [PATCH 2/7] Btrfs: add helper __btrfs_end_bio Liu Bo
2017-11-22  0:35 ` [PATCH 3/7] Btrfs: retry write for non-raid56 Liu Bo
2017-11-22 14:41   ` Nikolay Borisov
2017-11-28 23:01     ` Liu Bo
2017-11-22  0:35 ` [PATCH 4/7] Btrfs: get rbio inside fail_bio_stripe Liu Bo
2017-11-22  0:35 ` [PATCH 5/7] Btrfs: add helper __raid_write_end_io Liu Bo
2017-11-22  0:35 ` [PATCH 6/7] Btrfs: retry write for raid56 Liu Bo
2017-11-22  0:35 ` [PATCH 7/7] Btrfs: retry the whole bio on write error Liu Bo
2017-11-28 19:22 ` [PATCH 0/7] retry write on error David Sterba
2017-11-28 22:07   ` Edmund Nadolski
2017-11-28 23:41     ` Peter Grandi
2017-11-29  4:09       ` Anand Jain
2017-11-29 16:47         ` David Sterba
2017-11-30 20:22           ` Liu Bo
2017-12-03 21:00             ` Peter Grandi
2017-12-04  9:14             ` Anand Jain
2017-12-04 20:49             ` Edmund Nadolski
2017-12-05 18:57             ` David Sterba
2017-11-29 18:09   ` Liu Bo
