* [PATCH v4 1/2] md/r5cache: in r5l_do_submit_io(), submit io->split_bio first.
From: Song Liu @ 2017-05-09  0:39 UTC
  To: linux-raid
  Cc: shli, neilb, kernel-team, dan.j.williams, hch, jes.sorensen, Song Liu

In r5l_do_submit_io(), it is necessary to check io->split_bio before
submitting io->current_bio. This is because the endio of current_bio may
free the whole io_unit, making any later access to io->split_bio a
use-after-free.
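
To make the hazard concrete, here is a minimal userspace C sketch (an
illustration only, not kernel code; the names merely mirror the driver's)
of why split_bio must go first when the completion of current_bio can
free the io_unit:

#include <stdio.h>
#include <stdlib.h>

struct bio { const char *name; };

struct io_unit {
	struct bio *current_bio;
	struct bio *split_bio;
};

/* Models r5l_log_endio() dropping the last reference: on a journal
 * device error the bio can complete synchronously inside submit. */
static void endio_frees_io(struct io_unit *io)
{
	free(io);
}

static void submit(struct io_unit *io, struct bio *bio)
{
	printf("submit %s\n", bio->name);
	if (bio == io->current_bio)
		endio_frees_io(io);	/* completion may run right here */
}

int main(void)
{
	struct bio cur = { "current_bio" }, split = { "split_bio" };
	struct io_unit *io = malloc(sizeof(*io));

	io->current_bio = &cur;
	io->split_bio = &split;

	/* Safe order (what this patch does): split_bio first. */
	submit(io, io->split_bio);
	submit(io, io->current_bio);	/* may free io; io not touched again */

	/*
	 * The old order would read freed memory:
	 *   submit(io, io->current_bio);   // endio frees io
	 *   if (io->split_bio)             // use-after-free
	 *           submit(io, io->split_bio);
	 */
	return 0;
}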

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 26ba092..dc1dba6 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -622,20 +622,29 @@ static void r5l_do_submit_io(struct r5l_log *log, struct r5l_io_unit *io)
 	__r5l_set_io_unit_state(io, IO_UNIT_IO_START);
 	spin_unlock_irqrestore(&log->io_list_lock, flags);
 
+	/*
+	 * In case of journal device failures, submit_bio will get an
+	 * error and call endio, and then active stripes will continue
+	 * the write process. Therefore, it is not necessary to check
+	 * the Faulty bit of the journal device here.
+	 *
+	 * However, calling r5l_log_endio for current_bio may free the
+	 * io_unit. Therefore, it is necessary to check io->split_bio
+	 * before submitting io->current_bio.
+	 */
+	if (io->split_bio) {
+		if (io->has_flush)
+			io->split_bio->bi_opf |= REQ_PREFLUSH;
+		if (io->has_fua)
+			io->split_bio->bi_opf |= REQ_FUA;
+		submit_bio(io->split_bio);
+	}
+
 	if (io->has_flush)
 		io->current_bio->bi_opf |= REQ_PREFLUSH;
 	if (io->has_fua)
 		io->current_bio->bi_opf |= REQ_FUA;
 	submit_bio(io->current_bio);
-
-	if (!io->split_bio)
-		return;
-
-	if (io->has_flush)
-		io->split_bio->bi_opf |= REQ_PREFLUSH;
-	if (io->has_fua)
-		io->split_bio->bi_opf |= REQ_FUA;
-	submit_bio(io->split_bio);
 }
 
 /* deferred io_unit will be dispatched here */
-- 
2.9.3



* [PATCH v4 2/2] md/r5cache: gracefully handle journal device errors for writeback mode
From: Song Liu @ 2017-05-09  0:39 UTC
  To: linux-raid
  Cc: shli, neilb, kernel-team, dan.j.williams, hch, jes.sorensen, Song Liu

For raid456 with writeback cache, when the journal device fails during
normal operation, it is still possible to persist all data, as all
pending data is still in the stripe cache. However, it is necessary to
handle the journal failure gracefully.

During journal failure, this patch makes the following changes to land
the data in the cache onto the raid disks gracefully:

1. In handle_stripe(), allow stripes with data in the journal (s.injournal > 0)
   to make progress;
2. In delay_towrite(), only process data in the cache (skip dev->towrite);
3. In __get_priority_stripe(), set try_loprio to true, so that no stripe
   gets stuck in loprio_list.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/md/raid5-cache.c | 13 ++++++++++---
 drivers/md/raid5-log.h   |  3 ++-
 drivers/md/raid5.c       | 29 +++++++++++++++++++++++------
 3 files changed, 35 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index dc1dba6..e6032f6 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -24,6 +24,7 @@
 #include "md.h"
 #include "raid5.h"
 #include "bitmap.h"
+#include "raid5-log.h"
 
 /*
  * metadata/data stored in disk with 4k size unit (a block) regardless
@@ -679,6 +680,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
 		return;
 	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
 		mdname(mddev));
+	md_update_sb(mddev, 1);
 	mddev_suspend(mddev);
 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
 	mddev_resume(mddev);
@@ -1557,6 +1559,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 void r5l_quiesce(struct r5l_log *log, int state)
 {
 	struct mddev *mddev;
+	struct r5conf *conf;
+
 	if (!log || state == 2)
 		return;
 	if (state == 0)
@@ -1564,10 +1568,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	else if (state == 1) {
 		/* make sure r5l_write_super_and_discard_space exits */
 		mddev = log->rdev->mddev;
+		conf = mddev->private;
 		wake_up(&mddev->sb_wait);
 		kthread_park(log->reclaim_thread->tsk);
 		r5l_wake_reclaim(log, MaxSector);
-		r5l_do_reclaim(log);
+		if (!r5l_log_disk_error(conf))
+			r5l_do_reclaim(log);
 	}
 }
 
@@ -2982,7 +2988,7 @@ static int r5l_load_log(struct r5l_log *log)
 	return ret;
 }
 
-void r5c_update_on_rdev_error(struct mddev *mddev)
+void r5c_update_on_rdev_error(struct mddev *mddev, struct md_rdev *rdev)
 {
 	struct r5conf *conf = mddev->private;
 	struct r5l_log *log = conf->log;
@@ -2990,7 +2996,8 @@ void r5c_update_on_rdev_error(struct mddev *mddev)
 	if (!log)
 		return;
 
-	if (raid5_calc_degraded(conf) > 0 &&
+	if ((raid5_calc_degraded(conf) > 0 ||
+	     test_bit(Journal, &rdev->flags)) &&
 	    conf->log->r5c_journal_mode == R5C_JOURNAL_MODE_WRITE_BACK)
 		schedule_work(&log->disable_writeback_work);
 }
diff --git a/drivers/md/raid5-log.h b/drivers/md/raid5-log.h
index 2709710..328d67a 100644
--- a/drivers/md/raid5-log.h
+++ b/drivers/md/raid5-log.h
@@ -28,7 +28,8 @@ extern void r5c_flush_cache(struct r5conf *conf, int num);
 extern void r5c_check_stripe_cache_usage(struct r5conf *conf);
 extern void r5c_check_cached_full_stripe(struct r5conf *conf);
 extern struct md_sysfs_entry r5c_journal_mode;
-extern void r5c_update_on_rdev_error(struct mddev *mddev);
+extern void r5c_update_on_rdev_error(struct mddev *mddev,
+				     struct md_rdev *rdev);
 extern bool r5c_big_stripe_cached(struct r5conf *conf, sector_t sect);
 
 extern struct dma_async_tx_descriptor *
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f8055a7..0ac57a9 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -2689,7 +2689,7 @@ static void raid5_error(struct mddev *mddev, struct md_rdev *rdev)
 		bdevname(rdev->bdev, b),
 		mdname(mddev),
 		conf->raid_disks - mddev->degraded);
-	r5c_update_on_rdev_error(mddev);
+	r5c_update_on_rdev_error(mddev, rdev);
 }
 
 /*
@@ -3050,6 +3050,11 @@ sector_t raid5_compute_blocknr(struct stripe_head *sh, int i, int previous)
  *      When LOG_CRITICAL, stripes with injournal == 0 will be sent to
  *      no_space_stripes list.
  *
+ *   3. during journal failure
+ *      During journal failure, we try to flush all cached data to the
+ *      raid disks based on the data in the stripe cache. The array is
+ *      read-only to upper layers, so we skip all pending writes.
+ *
  */
 static inline bool delay_towrite(struct r5conf *conf,
 				 struct r5dev *dev,
@@ -3063,6 +3068,9 @@ static inline bool delay_towrite(struct r5conf *conf,
 	if (test_bit(R5C_LOG_CRITICAL, &conf->cache_state) &&
 	    s->injournal > 0)
 		return true;
+	/* case 3 above */
+	if (s->log_failed && s->injournal)
+		return true;
 	return false;
 }
 
@@ -4696,10 +4704,15 @@ static void handle_stripe(struct stripe_head *sh)
 	       " to_write=%d failed=%d failed_num=%d,%d\n",
 	       s.locked, s.uptodate, s.to_read, s.to_write, s.failed,
 	       s.failed_num[0], s.failed_num[1]);
-	/* check if the array has lost more than max_degraded devices and,
+	/*
+	 * check if the array has lost more than max_degraded devices and,
 	 * if so, some requests might need to be failed.
+	 *
+	 * When the journal device has failed (log_failed), we only process
+	 * the stripe if there is data that needs writing to the raid disks
 	 */
-	if (s.failed > conf->max_degraded || s.log_failed) {
+	if (s.failed > conf->max_degraded ||
+	    (s.log_failed && s.injournal == 0)) {
 		sh->check_state = 0;
 		sh->reconstruct_state = 0;
 		break_stripe_batch_list(sh, 0);
@@ -5272,8 +5285,10 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf, int group)
 	struct stripe_head *sh, *tmp;
 	struct list_head *handle_list = NULL;
 	struct r5worker_group *wg;
-	bool second_try = !r5c_is_writeback(conf->log);
-	bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state);
+	bool second_try = !r5c_is_writeback(conf->log) &&
+		!r5l_log_disk_error(conf);
+	bool try_loprio = test_bit(R5C_LOG_TIGHT, &conf->cache_state) ||
+		r5l_log_disk_error(conf);
 
 again:
 	wg = NULL;
@@ -7521,7 +7536,9 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 		 * neilb: there is no locking about new writes here,
 		 * so this cannot be safe.
 		 */
-		if (atomic_read(&conf->active_stripes)) {
+		if (atomic_read(&conf->active_stripes) ||
+		    atomic_read(&conf->r5c_cached_full_stripes) ||
+		    atomic_read(&conf->r5c_cached_partial_stripes)) {
 			return -EBUSY;
 		}
 		log_exit(conf);
-- 
2.9.3



* Re: [PATCH v4 2/2] md/r5cache: gracefully handle journal device errors for writeback mode
From: Shaohua Li @ 2017-05-10 17:01 UTC
  To: Song Liu
  Cc: linux-raid, shli, neilb, kernel-team, dan.j.williams, hch, jes.sorensen

On Mon, May 08, 2017 at 05:39:25PM -0700, Song Liu wrote:
> For the raid456 with writeback cache, when journal device failed during
> normal operation, it is still possible to persist all data, as all
> pending data is still in stripe cache. However, it is necessary to handle
> journal failure gracefully.
> 
> During journal failures, this patch makes the follow changes to land data
> in cache to raid disks gracefully:
> 
> 1. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
>    to make progress;
> 2. In delay_towrite(), only process data in the cache (skip dev->towrite);
> 3. In __get_priority_stripe(), set try_loprio to true, so no stripe stuck
>    in loprio_list

Applied the first patch. For this patch, I don't have a clear picture of
what you are trying to do. Please describe the steps we are going to take
after a journal failure.
 
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/md/raid5-cache.c | 13 ++++++++++---
>  drivers/md/raid5-log.h   |  3 ++-
>  drivers/md/raid5.c       | 29 +++++++++++++++++++++++------
>  3 files changed, 35 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index dc1dba6..e6032f6 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -24,6 +24,7 @@
>  #include "md.h"
>  #include "raid5.h"
>  #include "bitmap.h"
> +#include "raid5-log.h"
>  
>  /*
>   * metadata/data stored in disk with 4k size unit (a block) regardless
> @@ -679,6 +680,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>  		return;
>  	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
>  		mdname(mddev));
> +	md_update_sb(mddev, 1);

Why this? And md_update_sb must be called within mddev->reconfig_mutex locked.
>  	mddev_suspend(mddev);
>  	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>  	mddev_resume(mddev);
> @@ -1557,6 +1559,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>  void r5l_quiesce(struct r5l_log *log, int state)
>  {
>  	struct mddev *mddev;
> +	struct r5conf *conf;
> +
>  	if (!log || state == 2)
>  		return;
>  	if (state == 0)
> @@ -1564,10 +1568,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
>  	else if (state == 1) {
>  		/* make sure r5l_write_super_and_discard_space exits */
>  		mddev = log->rdev->mddev;
> +		conf = mddev->private;
>  		wake_up(&mddev->sb_wait);
>  		kthread_park(log->reclaim_thread->tsk);
>  		r5l_wake_reclaim(log, MaxSector);
> -		r5l_do_reclaim(log);
> +		if (!r5l_log_disk_error(conf))
> +			r5l_do_reclaim(log);

I think r5c_disable_writeback_async() will call into this, so we flush all
stripe cache out to raid disks, why skip the reclaim?

Thanks,
Shaohua


* Re: [PATCH v4 2/2] md/r5cache: gracefully handle journal device errors for writeback mode
From: Song Liu @ 2017-05-10 21:45 UTC
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, neilb, Kernel Team, dan.j.williams, hch,
	jes.sorensen


> On May 10, 2017, at 10:01 AM, Shaohua Li <shli@kernel.org> wrote:
> 
> On Mon, May 08, 2017 at 05:39:25PM -0700, Song Liu wrote:
>> For the raid456 with writeback cache, when journal device failed during
>> normal operation, it is still possible to persist all data, as all
>> pending data is still in stripe cache. However, it is necessary to handle
>> journal failure gracefully.
>> 
>> During journal failures, this patch makes the follow changes to land data
>> in cache to raid disks gracefully:
>> 
>> 1. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
>>   to make progress;
>> 2. In delay_towrite(), only process data in the cache (skip dev->towrite);
>> 3. In __get_priority_stripe(), set try_loprio to true, so no stripe stuck
>>   in loprio_list
> 
> Applied the first patch. For this patch, I don't have a clear picture about
> what you are trying to do. Please describe the steps we are doing to do after
> journal failure.

I will add more description to the next version. 

> 
>> Signed-off-by: Song Liu <songliubraving@fb.com>
>> ---
>> drivers/md/raid5-cache.c | 13 ++++++++++---
>> drivers/md/raid5-log.h   |  3 ++-
>> drivers/md/raid5.c       | 29 +++++++++++++++++++++++------
>> 3 files changed, 35 insertions(+), 10 deletions(-)
>> 
>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>> index dc1dba6..e6032f6 100644
>> --- a/drivers/md/raid5-cache.c
>> +++ b/drivers/md/raid5-cache.c
>> @@ -24,6 +24,7 @@
>> #include "md.h"
>> #include "raid5.h"
>> #include "bitmap.h"
>> +#include "raid5-log.h"
>> 
>> /*
>>  * metadata/data stored in disk with 4k size unit (a block) regardless
>> @@ -679,6 +680,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>> 		return;
>> 	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
>> 		mdname(mddev));
>> +	md_update_sb(mddev, 1);
> 
> Why this? And md_update_sb must be called within mddev->reconfig_mutex locked.

This is to avoid the following skip in handle_stripe():

        if (s.handle_bad_blocks ||
            test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
                set_bit(STRIPE_HANDLE, &sh->state);
                goto finish;
        }

I haven't got a better idea than calling md_update_sb() somewhere. It is
also tricky to lock mddev->reconfig_mutex here, due to a potential
deadlock with mddev->open_mutex.

Do you have suggestions on this?


>> 	mddev_suspend(mddev);
>> 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>> 	mddev_resume(mddev);
>> @@ -1557,6 +1559,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>> void r5l_quiesce(struct r5l_log *log, int state)
>> {
>> 	struct mddev *mddev;
>> +	struct r5conf *conf;
>> +
>> 	if (!log || state == 2)
>> 		return;
>> 	if (state == 0)
>> @@ -1564,10 +1568,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
>> 	else if (state == 1) {
>> 		/* make sure r5l_write_super_and_discard_space exits */
>> 		mddev = log->rdev->mddev;
>> +		conf = mddev->private;
>> 		wake_up(&mddev->sb_wait);
>> 		kthread_park(log->reclaim_thread->tsk);
>> 		r5l_wake_reclaim(log, MaxSector);
>> -		r5l_do_reclaim(log);
>> +		if (!r5l_log_disk_error(conf))
>> +			r5l_do_reclaim(log);
> 
> I think r5c_disable_writeback_async() will call into this, so we flush all
> stripe cache out to raid disks, why skip the reclaim?
> 

r5l_do_reclaim() reclaims log space in two steps:
1. clear all io_unit lists (flushing_ios, etc.) by waking up mddev->thread;
2. update log_tail on the journal device, and issue a discard to the journal device.

When we are handling a log failure in r5c_disable_writeback_async(), we are
already flushing the cache, so it is not necessary to wake up mddev->thread.
Also, with a log device error, it is not necessary to update log_tail or
issue discards.

Therefore, r5l_do_reclaim() is not necessary on log disk errors.
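
For reference, a condensed sketch of that reasoning (hypothetical stub
names invented for illustration; not the actual raid5-cache.c internals):

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the two reclaim steps described above. */
static void drain_io_unit_lists(void)
{
	printf("step 1: wake mddev->thread, clear io_unit lists\n");
}

static void advance_log_tail(void)
{
	printf("step 2: update log_tail, discard reclaimed journal space\n");
}

static void do_reclaim_sketch(bool log_disk_error)
{
	if (log_disk_error) {
		/* The cache is already being flushed to the raid disks,
		 * and the failed journal device cannot take the log_tail
		 * update or the discard anyway. */
		return;
	}
	drain_io_unit_lists();
	advance_log_tail();
}

int main(void)
{
	do_reclaim_sketch(false);	/* normal quiesce: both steps run */
	do_reclaim_sketch(true);	/* journal failed: skip, as above */
	return 0;
}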

Thanks,
Song



* Re: [PATCH v4 2/2] md/r5cache: gracefully handle journal device errors for writeback mode
From: Shaohua Li @ 2017-05-10 22:51 UTC
  To: Song Liu
  Cc: linux-raid, Shaohua Li, neilb, Kernel Team, dan.j.williams, hch,
	jes.sorensen

On Wed, May 10, 2017 at 09:45:05PM +0000, Song Liu wrote:
> 
> > On May 10, 2017, at 10:01 AM, Shaohua Li <shli@kernel.org> wrote:
> > 
> > On Mon, May 08, 2017 at 05:39:25PM -0700, Song Liu wrote:
> >> For the raid456 with writeback cache, when journal device failed during
> >> normal operation, it is still possible to persist all data, as all
> >> pending data is still in stripe cache. However, it is necessary to handle
> >> journal failure gracefully.
> >> 
> >> During journal failures, this patch makes the follow changes to land data
> >> in cache to raid disks gracefully:
> >> 
> >> 1. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
> >>   to make progress;
> >> 2. In delay_towrite(), only process data in the cache (skip dev->towrite);
> >> 3. In __get_priority_stripe(), set try_loprio to true, so no stripe stuck
> >>   in loprio_list
> > 
> > Applied the first patch. For this patch, I don't have a clear picture about
> > what you are trying to do. Please describe the steps we are doing to do after
> > journal failure.
> 
> I will add more description to the next version. 
> 
> > 
> >> Signed-off-by: Song Liu <songliubraving@fb.com>
> >> ---
> >> drivers/md/raid5-cache.c | 13 ++++++++++---
> >> drivers/md/raid5-log.h   |  3 ++-
> >> drivers/md/raid5.c       | 29 +++++++++++++++++++++++------
> >> 3 files changed, 35 insertions(+), 10 deletions(-)
> >> 
> >> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> >> index dc1dba6..e6032f6 100644
> >> --- a/drivers/md/raid5-cache.c
> >> +++ b/drivers/md/raid5-cache.c
> >> @@ -24,6 +24,7 @@
> >> #include "md.h"
> >> #include "raid5.h"
> >> #include "bitmap.h"
> >> +#include "raid5-log.h"
> >> 
> >> /*
> >>  * metadata/data stored in disk with 4k size unit (a block) regardless
> >> @@ -679,6 +680,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
> >> 		return;
> >> 	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
> >> 		mdname(mddev));
> >> +	md_update_sb(mddev, 1);
> > 
> > Why this? And md_update_sb must be called within mddev->reconfig_mutex locked.
> 
> This is to avoid skipping in handle_stripe():
> 
>         if (s.handle_bad_blocks ||
>             test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>                 set_bit(STRIPE_HANDLE, &sh->state);
>                 goto finish;
>         }
> 
> I haven't got a better idea than calling md_update_sb() somewhere. It is also tricky
> to lock mddev->reconfigured_mutex here, due to potential deadlocking with 
> mddev->open_mutex. 
> 
> Do you have suggestions on this?

This doesn't sound necessary then. The mddev->thread will call md_update_sb
and clear the MD_SB_CHANGE_PENDING bit; after that, the stripes will be
handled. Since this runs in a workqueue, I'm wondering what kind of deadlock
issue there is.

> 
> >> 	mddev_suspend(mddev);
> >> 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
> >> 	mddev_resume(mddev);
> >> @@ -1557,6 +1559,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
> >> void r5l_quiesce(struct r5l_log *log, int state)
> >> {
> >> 	struct mddev *mddev;
> >> +	struct r5conf *conf;
> >> +
> >> 	if (!log || state == 2)
> >> 		return;
> >> 	if (state == 0)
> >> @@ -1564,10 +1568,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
> >> 	else if (state == 1) {
> >> 		/* make sure r5l_write_super_and_discard_space exits */
> >> 		mddev = log->rdev->mddev;
> >> +		conf = mddev->private;
> >> 		wake_up(&mddev->sb_wait);
> >> 		kthread_park(log->reclaim_thread->tsk);
> >> 		r5l_wake_reclaim(log, MaxSector);
> >> -		r5l_do_reclaim(log);
> >> +		if (!r5l_log_disk_error(conf))
> >> +			r5l_do_reclaim(log);
> > 
> > I think r5c_disable_writeback_async() will call into this, so we flush all
> > stripe cache out to raid disks, why skip the reclaim?
> > 
> 
> r5l_do_reclaim() reclaims log space with 2 steps:
> 1. clear all io_unit lists (flushing_ios, etc.) by waking up mddev->thread. 
> 2. update log_tail in the journal device, and issue discard to journal device. 
> 
> When we are handling log failures in r5c_disable_writeback_async(), we are 
> flushing the cache, so it is not necessary to wake up mddev->thread. Also, 
> with log device error, it is not necessary update log_tail or issue discard. 
> 
> Therefore, r5l_do_reclaim is not necessary in log disk errors. 

OK, there is no side effect, right? Let's call it anyway.

Thanks,
Shaohua


* Re: [PATCH v4 2/2] md/r5cache: gracefully handle journal device errors for writeback mode
From: Song Liu @ 2017-05-10 23:58 UTC
  To: Shaohua Li
  Cc: linux-raid, Shaohua Li, neilb, Kernel Team, dan.j.williams, hch,
	jes.sorensen


> On May 10, 2017, at 3:51 PM, Shaohua Li <shli@kernel.org> wrote:
> 
> On Wed, May 10, 2017 at 09:45:05PM +0000, Song Liu wrote:
>> 
>>> On May 10, 2017, at 10:01 AM, Shaohua Li <shli@kernel.org> wrote:
>>> 
>>> On Mon, May 08, 2017 at 05:39:25PM -0700, Song Liu wrote:
>>>> For the raid456 with writeback cache, when journal device failed during
>>>> normal operation, it is still possible to persist all data, as all
>>>> pending data is still in stripe cache. However, it is necessary to handle
>>>> journal failure gracefully.
>>>> 
>>>> During journal failures, this patch makes the follow changes to land data
>>>> in cache to raid disks gracefully:
>>>> 
>>>> 1. In handle_stripe(), allow stripes with data in journal (s.injournal > 0)
>>>>  to make progress;
>>>> 2. In delay_towrite(), only process data in the cache (skip dev->towrite);
>>>> 3. In __get_priority_stripe(), set try_loprio to true, so no stripe stuck
>>>>  in loprio_list
>>> 
>>> Applied the first patch. For this patch, I don't have a clear picture about
>>> what you are trying to do. Please describe the steps we are doing to do after
>>> journal failure.
>> 
>> I will add more description to the next version. 
>> 
>>> 
>>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>>> ---
>>>> drivers/md/raid5-cache.c | 13 ++++++++++---
>>>> drivers/md/raid5-log.h   |  3 ++-
>>>> drivers/md/raid5.c       | 29 +++++++++++++++++++++++------
>>>> 3 files changed, 35 insertions(+), 10 deletions(-)
>>>> 
>>>> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
>>>> index dc1dba6..e6032f6 100644
>>>> --- a/drivers/md/raid5-cache.c
>>>> +++ b/drivers/md/raid5-cache.c
>>>> @@ -24,6 +24,7 @@
>>>> #include "md.h"
>>>> #include "raid5.h"
>>>> #include "bitmap.h"
>>>> +#include "raid5-log.h"
>>>> 
>>>> /*
>>>> * metadata/data stored in disk with 4k size unit (a block) regardless
>>>> @@ -679,6 +680,7 @@ static void r5c_disable_writeback_async(struct work_struct *work)
>>>> 		return;
>>>> 	pr_info("md/raid:%s: Disabling writeback cache for degraded array.\n",
>>>> 		mdname(mddev));
>>>> +	md_update_sb(mddev, 1);
>>> 
>>> Why this? And md_update_sb must be called within mddev->reconfig_mutex locked.
>> 
>> This is to avoid skipping in handle_stripe():
>> 
>>        if (s.handle_bad_blocks ||
>>            test_bit(MD_SB_CHANGE_PENDING, &conf->mddev->sb_flags)) {
>>                set_bit(STRIPE_HANDLE, &sh->state);
>>                goto finish;
>>        }
>> 
>> I haven't got a better idea than calling md_update_sb() somewhere. It is also tricky
>> to lock mddev->reconfigured_mutex here, due to potential deadlocking with 
>> mddev->open_mutex. 
>> 
>> Do you have suggestions on this?
> 
> This sounds not necessary then. The mddev->thread will call md_update_sb and
> clear the MD_SB_CHANGE_PENDING bit. After that, the stripes will be handled.
> This is running in a workqueue, I'm wondering what kind of deadlock issue there
> is.

mddev->thread calls md_check_recovery() in raid5d(), and md_check_recovery()
calls md_update_sb(). However, we need to call md_update_sb() before
mddev_suspend() here. Otherwise, mddev_suspend() increments mddev->suspended,
and md_check_recovery() then exits before calling md_update_sb().
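
A minimal userspace model of that ordering constraint (simplified,
hypothetical names; the real gate is the mddev->suspended check in
md_check_recovery() and the MD_SB_CHANGE_PENDING check in handle_stripe()):

#include <stdio.h>

struct mddev_model {
	int suspended;
	int sb_change_pending;
};

static void md_update_sb_model(struct mddev_model *m)
{
	m->sb_change_pending = 0;
	printf("superblock written, CHANGE_PENDING cleared\n");
}

static void md_check_recovery_model(struct mddev_model *m)
{
	if (m->suspended) {
		printf("md_check_recovery: suspended, bail out\n");
		return;			/* sb_change_pending stays set */
	}
	if (m->sb_change_pending)
		md_update_sb_model(m);
}

int main(void)
{
	struct mddev_model m = { .suspended = 0, .sb_change_pending = 1 };

	/* Wrong order: suspend first, then hope mddev->thread writes it. */
	m.suspended = 1;
	md_check_recovery_model(&m);	/* bails; stripes stay blocked */

	/* Order used here: write the superblock before suspending. */
	m.suspended = 0;
	md_update_sb_model(&m);
	m.suspended = 1;		/* mddev_suspend() is now safe */
	return 0;
}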

The deadlock between reconfig_mutex and open_mutex is as follows:

[  211.753343] ======================================================
[  211.753823] [ INFO: possible circular locking dependency detected ]
[  211.754310] 4.11.0+ #88 Not tainted
[  211.754582] -------------------------------------------------------
[  211.755079] mdadm/1582 is trying to acquire lock:
[  211.755444]  ((&log->disable_writeback_work)){+.+...}, at: [<ffffffff8107d842>] flush_work+0x12/0x290
[  211.756160]
[  211.756160] but task is already holding lock:
[  211.756609]  (&mddev->reconfig_mutex){+.+.+.}, at: [<ffffffffa0052654>] rdev_attr_store+0x64/0xd0 [md_mod]
[  211.757354]
[  211.757354] which lock already depends on the new lock.
[  211.757354]
[  211.757984]
[  211.757984] the existing dependency chain (in reverse order) is:
[  211.758560]
[  211.758560] -> #1 (&mddev->reconfig_mutex){+.+.+.}:
[  211.759062]        lock_acquire+0xc2/0x230
[  211.759391]        __mutex_lock+0x7b/0x9d0
[  211.759716]        mutex_lock_interruptible_nested+0x1b/0x20
[  211.760158]        r5c_disable_writeback_async+0x6b/0xa0 [raid456]
[  211.760634]        process_one_work+0x1d6/0x650
[  211.760986]        worker_thread+0x4d/0x380
[  211.761311]        kthread+0x117/0x150
[  211.761602]        ret_from_fork+0x2e/0x40
[  211.761924]
[  211.761924] -> #0 ((&log->disable_writeback_work)){+.+...}:
[  211.762470]        __lock_acquire+0x1458/0x1770
[  211.762825]        lock_acquire+0xc2/0x230
[  211.763149]        flush_work+0x4a/0x290
[  211.763457]        r5l_exit_log+0x31/0x80 [raid456]
[  211.763840]        raid5_remove_disk+0x8a/0x240 [raid456]
[  211.764265]        remove_and_add_spares+0x175/0x390 [md_mod]
[  211.764714]        state_store+0xa3/0x4f0 [md_mod]
[  211.765090]        rdev_attr_store+0x89/0xd0 [md_mod]
[  211.765486]        sysfs_kf_write+0x4f/0x70
[  211.765815]        kernfs_fop_write+0x147/0x1d0
[  211.766170]        __vfs_write+0x28/0x120
[  211.766485]        vfs_write+0xd3/0x1d0
[  211.766789]        SyS_write+0x52/0xa0
[  211.767083]        do_syscall_64+0x6a/0x210
[  211.767408]        return_from_SYSCALL_64+0x0/0x7a
[  211.767782]
[  211.767782] other info that might help us debug this:
[  211.767782]
[  211.768399]  Possible unsafe locking scenario:
[  211.768399]
[  211.768861]        CPU0                    CPU1
[  211.769214]        ----                    ----
[  211.769567]   lock(&mddev->reconfig_mutex);
[  211.769897]                                lock((&log->disable_writeback_work));
[  211.770471]                                lock(&mddev->reconfig_mutex);
[  211.770993]   lock((&log->disable_writeback_work));

> 
>> 
>>>> 	mddev_suspend(mddev);
>>>> 	log->r5c_journal_mode = R5C_JOURNAL_MODE_WRITE_THROUGH;
>>>> 	mddev_resume(mddev);
>>>> @@ -1557,6 +1559,8 @@ void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>>>> void r5l_quiesce(struct r5l_log *log, int state)
>>>> {
>>>> 	struct mddev *mddev;
>>>> +	struct r5conf *conf;
>>>> +
>>>> 	if (!log || state == 2)
>>>> 		return;
>>>> 	if (state == 0)
>>>> @@ -1564,10 +1568,12 @@ void r5l_quiesce(struct r5l_log *log, int state)
>>>> 	else if (state == 1) {
>>>> 		/* make sure r5l_write_super_and_discard_space exits */
>>>> 		mddev = log->rdev->mddev;
>>>> +		conf = mddev->private;
>>>> 		wake_up(&mddev->sb_wait);
>>>> 		kthread_park(log->reclaim_thread->tsk);
>>>> 		r5l_wake_reclaim(log, MaxSector);
>>>> -		r5l_do_reclaim(log);
>>>> +		if (!r5l_log_disk_error(conf))
>>>> +			r5l_do_reclaim(log);
>>> 
>>> I think r5c_disable_writeback_async() will call into this, so we flush all
>>> stripe cache out to raid disks, why skip the reclaim?
>>> 
>> 
>> r5l_do_reclaim() reclaims log space with 2 steps:
>> 1. clear all io_unit lists (flushing_ios, etc.) by waking up mddev->thread. 
>> 2. update log_tail in the journal device, and issue discard to journal device. 
>> 
>> When we are handling log failures in r5c_disable_writeback_async(), we are 
>> flushing the cache, so it is not necessary to wake up mddev->thread. Also, 
>> with log device error, it is not necessary update log_tail or issue discard. 
>> 
>> Therefore, r5l_do_reclaim is not necessary in log disk errors. 
> 
> Ok, there is no side effect, right? let's call it anyway.

Let me double-check for potential side effects. If it is safe, I will remove it.

Thanks,
Song

