All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/9] raid5-cache fixes
@ 2015-10-09  4:54 Shaohua Li
  2015-10-09  4:54 ` [PATCH 1/9] MD: fix info output for journal disk Shaohua Li
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Hi,

some fixes for raid5-cache/md. Mainly supports trim, handle IO error and
handling array assemble with missing/faulty disks.
patch 1: fix a bug
patch 2: support trim
patch 3-4, IO error handling
patch 5-9, handle assemble issue when journal disk is missing/faulty

If there is IO error to journal disk, write IO will fail immediately. If we
assemble an array with journal feature enabled but with journal disk
missing/faulty, the raid array will start readonly.

Thanks,
Shaohua

Shaohua Li (6):
  MD: fix info output for journal disk
  raid5-cache: add trim support for log
  raid5: journal disk can't be removed
  raid5-cache: IO error handling
  raid5-cache: start raid5 readonly if journal is missing
  MD: set journal disk ->raid_disk

Song Liu (3):
  MD: add new bit to indicate raid array with journal
  MD: kick out journal disk if it's not fresh
  MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW

 drivers/md/md.c          | 77 +++++++++++++++++++++++++++++++++++++---------
 drivers/md/md.h          |  1 +
 drivers/md/raid5-cache.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++--
 drivers/md/raid5.c       | 29 ++++++++++++++++--
 drivers/md/raid5.h       |  2 ++
 5 files changed, 169 insertions(+), 19 deletions(-)

-- 
2.4.6


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/9] MD: fix info output for journal disk
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-12  5:37   ` Neil Brown
  2015-10-09  4:54 ` [PATCH 2/9] raid5-cache: add trim support for log Shaohua Li
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

journal disk can be faulty. The Journal and Faulty aren't exclusive with
each other.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0729cc7..daf42bb 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -5857,7 +5857,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
 		else if (test_bit(In_sync, &rdev->flags)) {
 			info.state |= (1<<MD_DISK_ACTIVE);
 			info.state |= (1<<MD_DISK_SYNC);
-		} else if (test_bit(Journal, &rdev->flags))
+		}
+		if (test_bit(Journal, &rdev->flags))
 			info.state |= (1<<MD_DISK_JOURNAL);
 		if (test_bit(WriteMostly, &rdev->flags))
 			info.state |= (1<<MD_DISK_WRITEMOSTLY);
@@ -7335,18 +7336,21 @@ static int md_seq_show(struct seq_file *seq, void *v)
 		rcu_read_lock();
 		rdev_for_each_rcu(rdev, mddev) {
 			char b[BDEVNAME_SIZE];
+			bool skip = false;
 			seq_printf(seq, " %s[%d]",
 				bdevname(rdev->bdev,b), rdev->desc_nr);
 			if (test_bit(WriteMostly, &rdev->flags))
 				seq_printf(seq, "(W)");
 			if (test_bit(Faulty, &rdev->flags)) {
 				seq_printf(seq, "(F)");
-				continue;
+				skip = true;
 			}
 			if (test_bit(Journal, &rdev->flags)) {
 				seq_printf(seq, "(J)");
-				continue;
+				skip = true;
 			}
+			if (skip)
+				continue;
 			if (rdev->raid_disk < 0)
 				seq_printf(seq, "(S)"); /* spare */
 			if (test_bit(Replacement, &rdev->flags))
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/9] raid5-cache: add trim support for log
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
  2015-10-09  4:54 ` [PATCH 1/9] MD: fix info output for journal disk Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-12  6:00   ` Neil Brown
  2015-10-09  4:54 ` [PATCH 3/9] raid5: journal disk can't be removed Shaohua Li
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Since superblock is updated infrequently, we do a simple trim of log
disk (a synchronous trim)

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 62 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index caeec10..5dbe084 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -85,6 +85,7 @@ struct r5l_log {
 	spinlock_t no_space_stripes_lock;
 
 	bool need_cache_flush;
+	bool in_teardown;
 };
 
 /*
@@ -644,6 +645,60 @@ void r5l_flush_stripe_to_raid(struct r5l_log *log)
 }
 
 static void r5l_write_super(struct r5l_log *log, sector_t cp);
+static void r5l_write_super_and_discard_space(struct r5l_log *log,
+	sector_t end)
+{
+	struct block_device *bdev = log->rdev->bdev;
+	struct mddev *mddev;
+
+	r5l_write_super(log, end);
+
+	if (!blk_queue_discard(bdev_get_queue(bdev)))
+		return;
+
+	mddev = log->rdev->mddev;
+	/*
+	 * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
+	 * wait for this thread to finish. This thread waits for
+	 * MD_CHANGE_PENDING clear, which is supposed to be done in
+	 * md_check_recovery(). md_check_recovery() tries to get
+	 * reconfig_mutex. Since r5l_quiesce already holds the mutex,
+	 * md_check_recovery() fails, so the PENDING never get cleared. The
+	 * in_teardown check workaround this issue.
+	 * */
+	if (!log->in_teardown) {
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait,
+			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
+			log->in_teardown);
+		/*
+		 * r5l_quiesce could run after in_teardown check and hold
+		 * mutex first. Superblock might get updated twice.
+		 * */
+		if (log->in_teardown)
+			md_update_sb(mddev, 1);
+	} else {
+		BUG_ON(!mddev_is_locked(mddev));
+		md_update_sb(mddev, 1);
+	}
+
+	if (log->last_checkpoint < end) {
+		blkdev_issue_discard(bdev,
+				log->last_checkpoint + log->rdev->data_offset,
+				end - log->last_checkpoint, GFP_NOIO, 0);
+	} else {
+		blkdev_issue_discard(bdev,
+				log->last_checkpoint + log->rdev->data_offset,
+				log->device_size - log->last_checkpoint,
+				GFP_NOIO, 0);
+		blkdev_issue_discard(bdev, log->rdev->data_offset, end,
+				GFP_NOIO, 0);
+	}
+}
+
+
 static void r5l_do_reclaim(struct r5l_log *log)
 {
 	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
@@ -685,7 +740,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
 	 * here, because the log area might be reused soon and we don't want to
 	 * confuse recovery
 	 */
-	r5l_write_super(log, next_checkpoint);
+	r5l_write_super_and_discard_space(log, next_checkpoint);
 
 	mutex_lock(&log->io_mutex);
 	log->last_checkpoint = next_checkpoint;
@@ -721,9 +776,11 @@ static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 
 void r5l_quiesce(struct r5l_log *log, int state)
 {
+	struct mddev *mddev;
 	if (!log || state == 2)
 		return;
 	if (state == 0) {
+		log->in_teardown = 0;
 		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
 					log->rdev->mddev, "reclaim");
 	} else if (state == 1) {
@@ -731,6 +788,10 @@ void r5l_quiesce(struct r5l_log *log, int state)
 		 * at this point all stripes are finished, so io_unit is at
 		 * least in STRIPE_END state
 		 */
+		log->in_teardown = 1;
+		/* make sure r5l_write_super_and_discard_space exits */
+		mddev = log->rdev->mddev;
+		wake_up(&mddev->sb_wait);
 		r5l_wake_reclaim(log, -1L);
 		md_unregister_thread(&log->reclaim_thread);
 		r5l_do_reclaim(log);
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/9] raid5: journal disk can't be removed
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
  2015-10-09  4:54 ` [PATCH 1/9] MD: fix info output for journal disk Shaohua Li
  2015-10-09  4:54 ` [PATCH 2/9] raid5-cache: add trim support for log Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-09  4:54 ` [PATCH 4/9] raid5-cache: IO error handling Shaohua Li
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

raid5-cache uses journal disk rdev->bdev, rdev->mddev in several places.
Don't allow journal disk disappear magically. On the other hand, we do
need to update superblock for other disks to bump up ->events, so next
time journal disk will be identified as stale.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 216fa3c..6430098 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7128,6 +7128,15 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 	struct disk_info *p = conf->disks + number;
 
 	print_raid5_conf(conf);
+	if (test_bit(Journal, &rdev->flags)) {
+		/*
+		 * journal disk is not removable, but we need give a chance to
+		 * update superblock of other disks. Otherwise journal disk
+		 * will be considered as 'fresh'
+		 * */
+		set_bit(MD_CHANGE_DEVS, &mddev->flags);
+		return -EINVAL;
+	}
 	if (rdev == p->rdev)
 		rdevp = &p->rdev;
 	else if (rdev == p->replacement)
@@ -7190,6 +7199,8 @@ static int raid5_add_disk(struct mddev *mddev, struct md_rdev *rdev)
 	int first = 0;
 	int last = conf->raid_disks - 1;
 
+	if (test_bit(Journal, &rdev->flags))
+		return -EINVAL;
 	if (mddev->recovery_disabled == conf->recovery_disabled)
 		return -EBUSY;
 
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/9] raid5-cache: IO error handling
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
                   ` (2 preceding siblings ...)
  2015-10-09  4:54 ` [PATCH 3/9] raid5: journal disk can't be removed Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-09  4:54 ` [PATCH 5/9] MD: add new bit to indicate raid array with journal Shaohua Li
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

There are 3 places the raid5-cache dispatches IO. The discard IO error
doesn't matter, so we ignore it. The superblock write IO error can be
handled in MD core. The remaining are log write and flush. When the IO
error happens, we mark log disk faulty and fail all write IO. Read IO is
still allowed to run. Userspace will get a notification too and
corresponding daemon can choose setting raid array readonly for example.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 15 ++++++++++++++-
 drivers/md/raid5.c       |  4 +++-
 drivers/md/raid5.h       |  2 ++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 5dbe084..2cf6b72 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -190,7 +190,6 @@ static void r5l_io_run_stripes(struct r5l_io_unit *io)
 	}
 }
 
-/* XXX: totally ignores I/O errors */
 static void r5l_log_run_stripes(struct r5l_log *log)
 {
 	struct r5l_io_unit *io, *next;
@@ -213,6 +212,9 @@ static void r5l_log_endio(struct bio *bio)
 	struct r5l_log *log = io->log;
 	unsigned long flags;
 
+	if (bio->bi_error)
+		md_error(log->rdev->mddev, log->rdev);
+
 	bio_put(bio);
 
 	spin_lock_irqsave(&log->io_list_lock, flags);
@@ -598,6 +600,9 @@ static void r5l_log_flush_endio(struct bio *bio)
 	unsigned long flags;
 	struct r5l_io_unit *io;
 
+	if (bio->bi_error)
+		md_error(log->rdev->mddev, log->rdev);
+
 	spin_lock_irqsave(&log->io_list_lock, flags);
 	list_for_each_entry(io, &log->flushing_ios, log_sibling)
 		r5l_io_run_stripes(io);
@@ -684,6 +689,7 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 		md_update_sb(mddev, 1);
 	}
 
+	/* discard IO error really doesn't matter, ignore it */
 	if (log->last_checkpoint < end) {
 		blkdev_issue_discard(bdev,
 				log->last_checkpoint + log->rdev->data_offset,
@@ -798,6 +804,13 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	}
 }
 
+bool r5l_log_disk_error(struct r5conf *conf)
+{
+	if (!conf->log)
+		return false;
+	return test_bit(Faulty, &conf->log->rdev->flags);
+}
+
 struct r5l_recovery_ctx {
 	struct page *meta_page;		/* current meta */
 	sector_t meta_total_blocks;	/* total size of current meta and data */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6430098..0d1c6c4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3147,6 +3147,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		 * the data has not reached the cache yet.
 		 */
 		if (!test_bit(R5_Wantfill, &sh->dev[i].flags) &&
+		    s->failed > conf->max_degraded &&
 		    (!test_bit(R5_Insync, &sh->dev[i].flags) ||
 		      test_bit(R5_ReadError, &sh->dev[i].flags))) {
 			spin_lock_irq(&sh->stripe_lock);
@@ -4015,6 +4016,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 	s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state) && !sh->batch_head;
 	s->failed_num[0] = -1;
 	s->failed_num[1] = -1;
+	s->log_failed = r5l_log_disk_error(conf);
 
 	/* Now to look around and see what can be done */
 	rcu_read_lock();
@@ -4358,7 +4360,7 @@ static void handle_stripe(struct stripe_head *sh)
 	/* check if the array has lost more than max_degraded devices and,
 	 * if so, some requests might need to be failed.
 	 */
-	if (s.failed > conf->max_degraded) {
+	if (s.failed > conf->max_degraded || s.log_failed) {
 		sh->check_state = 0;
 		sh->reconstruct_state = 0;
 		break_stripe_batch_list(sh, 0);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 1ab534c..a415e1c 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -272,6 +272,7 @@ struct stripe_head_state {
 	struct bio_list return_bi;
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
+	int log_failed;
 };
 
 /* Flags for struct r5dev.flags */
@@ -631,4 +632,5 @@ extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
 extern void r5l_stripe_write_finished(struct stripe_head *sh);
 extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
 extern void r5l_quiesce(struct r5l_log *log, int state);
+extern bool r5l_log_disk_error(struct r5conf *conf);
 #endif
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/9] MD: add new bit to indicate raid array with journal
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
                   ` (3 preceding siblings ...)
  2015-10-09  4:54 ` [PATCH 4/9] raid5-cache: IO error handling Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-09  4:54 ` [PATCH 6/9] raid5-cache: start raid5 readonly if journal is missing Shaohua Li
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

From: Song Liu <songliubraving@fb.com>

If a raid array has journal feature bit set, add a new bit to indicate
this. If the array is started without journal disk existing, we know
there is something wrong.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 10 +++++++---
 drivers/md/md.h |  1 +
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index daf42bb..a5b0bbd 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1667,6 +1667,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 			set_bit(WriteMostly, &rdev->flags);
 		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_REPLACEMENT)
 			set_bit(Replacement, &rdev->flags);
+		if (le32_to_cpu(sb->feature_map) & MD_FEATURE_JOURNAL)
+			set_bit(MD_HAS_JOURNAL, &mddev->flags);
 	} else /* MULTIPATH are always insync */
 		set_bit(In_sync, &rdev->flags);
 
@@ -1807,16 +1809,18 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 	for (i=0; i<max_dev;i++)
 		sb->dev_roles[i] = cpu_to_le16(MD_DISK_ROLE_FAULTY);
 
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags))
+		sb->feature_map |= MD_FEATURE_JOURNAL;
+
 	rdev_for_each(rdev2, mddev) {
 		i = rdev2->desc_nr;
 		if (test_bit(Faulty, &rdev2->flags))
 			sb->dev_roles[i] = cpu_to_le16(MD_DISK_ROLE_FAULTY);
 		else if (test_bit(In_sync, &rdev2->flags))
 			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
-		else if (test_bit(Journal, &rdev2->flags)) {
+		else if (test_bit(Journal, &rdev2->flags))
 			sb->dev_roles[i] = cpu_to_le16(MD_DISK_ROLE_JOURNAL);
-			sb->feature_map |= cpu_to_le32(MD_FEATURE_JOURNAL);
-		} else if (rdev2->raid_disk >= 0)
+		else if (rdev2->raid_disk >= 0)
 			sb->dev_roles[i] = cpu_to_le16(rdev2->raid_disk);
 		else
 			sb->dev_roles[i] = cpu_to_le16(MD_DISK_ROLE_SPARE);
diff --git a/drivers/md/md.h b/drivers/md/md.h
index e14e667..2bea51e 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -233,6 +233,7 @@ struct mddev {
 				 * md_ioctl checked on it.
 				 */
 #define MD_JOURNAL_CLEAN 5	/* A raid with journal is already clean */
+#define MD_HAS_JOURNAL	6	/* The raid array has journal feature set */
 
 	int				suspended;
 	atomic_t			active_io;
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/9] raid5-cache: start raid5 readonly if journal is missing
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
                   ` (4 preceding siblings ...)
  2015-10-09  4:54 ` [PATCH 5/9] MD: add new bit to indicate raid array with journal Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-09  4:54 ` [PATCH 7/9] MD: kick out journal disk if it's not fresh Shaohua Li
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

If raid array is expected to have journal (eg, journal is set in MD
superblock feature map) and the array is started without journal disk,
start the array readonly.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 3 ++-
 drivers/md/raid5.c       | 8 ++++++++
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 2cf6b72..e64684b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -806,8 +806,9 @@ void r5l_quiesce(struct r5l_log *log, int state)
 
 bool r5l_log_disk_error(struct r5conf *conf)
 {
+	/* don't allow write if journal disk is missing */
 	if (!conf->log)
-		return false;
+		return test_bit(MD_HAS_JOURNAL, &conf->mddev->flags);
 	return test_bit(Faulty, &conf->log->rdev->flags);
 }
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0d1c6c4..8024cad 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6810,6 +6810,14 @@ static int run(struct mddev *mddev)
 	if (IS_ERR(conf))
 		return PTR_ERR(conf);
 
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags) && !journal_dev) {
+		printk(KERN_ERR"md/raid:%s: journal disk is missing, "
+			"force array readonly\n",
+			mdname(mddev));
+		mddev->ro = 1;
+		set_disk_ro(mddev->gendisk, 1);
+	}
+
 	conf->min_offset_diff = min_offset_diff;
 	mddev->thread = conf->thread;
 	conf->thread = NULL;
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 7/9] MD: kick out journal disk if it's not fresh
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
                   ` (5 preceding siblings ...)
  2015-10-09  4:54 ` [PATCH 6/9] raid5-cache: start raid5 readonly if journal is missing Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-09  4:54 ` [PATCH 8/9] MD: set journal disk ->raid_disk Shaohua Li
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

From: Song Liu <songliubraving@fb.com>

When journal disk is faulty and we are reassemabling the raid array, the
journal disk is old. We don't allow the journal disk added to the raid
array. Since journal disk is missing in the array, the raid5 will mark
the array readonly.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a5b0bbd..a8db2e2 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1608,7 +1608,8 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 		++ev1;
 		if (rdev->desc_nr >= 0 &&
 		    rdev->desc_nr < le32_to_cpu(sb->max_dev) &&
-		    le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX)
+		    (le16_to_cpu(sb->dev_roles[rdev->desc_nr]) < MD_DISK_ROLE_MAX ||
+		     le16_to_cpu(sb->dev_roles[rdev->desc_nr]) == MD_DISK_ROLE_JOURNAL))
 			if (ev1 < mddev->events)
 				return -EINVAL;
 	} else if (mddev->bitmap) {
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 8/9] MD: set journal disk ->raid_disk
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
                   ` (6 preceding siblings ...)
  2015-10-09  4:54 ` [PATCH 7/9] MD: kick out journal disk if it's not fresh Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-09  4:54 ` [PATCH 9/9] MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW Shaohua Li
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Set journal disk ->raid_disk to >=0, I choose raid_disks + 1 instead of
0, because we already have a disk with ->raid_disk 0 and this causes
sysfs entry creation conflict. A lot of places assumes disk with
->raid_disk >=0 is normal raid disk, so we add check for journal disk.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c    | 27 ++++++++++++++++++++++-----
 drivers/md/raid5.c |  6 ++++--
 2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index a8db2e2..59b38ef 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1650,6 +1650,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 			rdev->journal_tail = le64_to_cpu(sb->journal_tail);
 			if (mddev->recovery_cp == MaxSector)
 				set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
+			rdev->raid_disk = mddev->raid_disks;
 			break;
 		default:
 			rdev->saved_raid_disk = role;
@@ -1719,7 +1720,7 @@ static void super_1_sync(struct mddev *mddev, struct md_rdev *rdev)
 		sb->feature_map = cpu_to_le32(MD_FEATURE_BITMAP_OFFSET);
 	}
 
-	if (rdev->raid_disk >= 0 &&
+	if (rdev->raid_disk >= 0 && !test_bit(Journal, &rdev->flags) &&
 	    !test_bit(In_sync, &rdev->flags)) {
 		sb->feature_map |=
 			cpu_to_le32(MD_FEATURE_RECOVERY_OFFSET);
@@ -2303,6 +2304,7 @@ void md_update_sb(struct mddev *mddev, int force_change)
 	rdev_for_each(rdev, mddev) {
 		if (rdev->raid_disk >= 0 &&
 		    mddev->delta_disks >= 0 &&
+		    !test_bit(Journal, &rdev->flags) &&
 		    !test_bit(In_sync, &rdev->flags) &&
 		    mddev->curr_resync_completed > rdev->recovery_offset)
 				rdev->recovery_offset = mddev->curr_resync_completed;
@@ -2539,6 +2541,7 @@ state_show(struct md_rdev *rdev, char *page)
 		sep = ",";
 	}
 	if (!test_bit(Faulty, &flags) &&
+	    !test_bit(Journal, &flags) &&
 	    !test_bit(In_sync, &flags)) {
 		len += sprintf(page+len, "%sspare", sep);
 		sep = ",";
@@ -2622,7 +2625,8 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 	} else if (cmd_match(buf, "insync") && rdev->raid_disk == -1) {
 		set_bit(In_sync, &rdev->flags);
 		err = 0;
-	} else if (cmd_match(buf, "-insync") && rdev->raid_disk >= 0) {
+	} else if (cmd_match(buf, "-insync") && rdev->raid_disk >= 0 &&
+		   !test_bit(Journal, &rdev->flags)) {
 		if (rdev->mddev->pers == NULL) {
 			clear_bit(In_sync, &rdev->flags);
 			rdev->saved_raid_disk = rdev->raid_disk;
@@ -2641,6 +2645,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 		 * check if recovery is needed.
 		 */
 		if (rdev->raid_disk >= 0 &&
+		    !test_bit(Journal, &rdev->flags) &&
 		    !test_bit(Replacement, &rdev->flags))
 			set_bit(WantReplacement, &rdev->flags);
 		set_bit(MD_RECOVERY_NEEDED, &rdev->mddev->recovery);
@@ -2718,7 +2723,9 @@ __ATTR(errors, S_IRUGO|S_IWUSR, errors_show, errors_store);
 static ssize_t
 slot_show(struct md_rdev *rdev, char *page)
 {
-	if (rdev->raid_disk < 0)
+	if (test_bit(Journal, &rdev->flags))
+		return sprintf(page, "journal\n");
+	else if (rdev->raid_disk < 0)
 		return sprintf(page, "none\n");
 	else
 		return sprintf(page, "%d\n", rdev->raid_disk);
@@ -2730,6 +2737,8 @@ slot_store(struct md_rdev *rdev, const char *buf, size_t len)
 	int slot;
 	int err;
 
+	if (test_bit(Journal, &rdev->flags))
+		return -EBUSY;
 	if (strncmp(buf, "none", 4)==0)
 		slot = -1;
 	else {
@@ -2928,6 +2937,8 @@ rdev_size_store(struct md_rdev *rdev, const char *buf, size_t len)
 	sector_t oldsectors = rdev->sectors;
 	sector_t sectors;
 
+	if (test_bit(Journal, &rdev->flags))
+		return -EBUSY;
 	if (strict_blocks_to_sectors(buf, &sectors) < 0)
 		return -EINVAL;
 	if (rdev->data_offset != rdev->new_data_offset)
@@ -3290,7 +3301,9 @@ static void analyze_sbs(struct mddev *mddev)
 			rdev->desc_nr = i++;
 			rdev->raid_disk = rdev->desc_nr;
 			set_bit(In_sync, &rdev->flags);
-		} else if (rdev->raid_disk >= (mddev->raid_disks - min(0, mddev->delta_disks))) {
+		} else if (rdev->raid_disk >=
+			    (mddev->raid_disks - min(0, mddev->delta_disks)) &&
+			   !test_bit(Journal, &rdev->flags)) {
 			rdev->raid_disk = -1;
 			clear_bit(In_sync, &rdev->flags);
 		}
@@ -7805,6 +7818,7 @@ void md_do_sync(struct md_thread *thread)
 		rcu_read_lock();
 		rdev_for_each_rcu(rdev, mddev)
 			if (rdev->raid_disk >= 0 &&
+			    !test_bit(Journal, &rdev->flags) &&
 			    !test_bit(Faulty, &rdev->flags) &&
 			    !test_bit(In_sync, &rdev->flags) &&
 			    rdev->recovery_offset < j)
@@ -8026,6 +8040,7 @@ void md_do_sync(struct md_thread *thread)
 			rdev_for_each_rcu(rdev, mddev)
 				if (rdev->raid_disk >= 0 &&
 				    mddev->delta_disks >= 0 &&
+				    !test_bit(Journal, &rdev->flags) &&
 				    !test_bit(Faulty, &rdev->flags) &&
 				    !test_bit(In_sync, &rdev->flags) &&
 				    rdev->recovery_offset < mddev->curr_resync)
@@ -8066,7 +8081,8 @@ static int remove_and_add_spares(struct mddev *mddev,
 		    rdev->raid_disk >= 0 &&
 		    !test_bit(Blocked, &rdev->flags) &&
 		    (test_bit(Faulty, &rdev->flags) ||
-		     ! test_bit(In_sync, &rdev->flags)) &&
+		     (!test_bit(In_sync, &rdev->flags) &&
+		      !test_bit(Journal, &rdev->flags))) &&
 		    atomic_read(&rdev->nr_pending)==0) {
 			if (mddev->pers->hot_remove_disk(
 				    mddev, rdev) == 0) {
@@ -8088,6 +8104,7 @@ static int remove_and_add_spares(struct mddev *mddev,
 			continue;
 		if (rdev->raid_disk >= 0 &&
 		    !test_bit(In_sync, &rdev->flags) &&
+		    !test_bit(Journal, &rdev->flags) &&
 		    !test_bit(Faulty, &rdev->flags))
 			spares++;
 		if (rdev->raid_disk >= 0)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8024cad..efbb5da 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -6560,7 +6560,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	rdev_for_each(rdev, mddev) {
 		raid_disk = rdev->raid_disk;
 		if (raid_disk >= max_disks
-		    || raid_disk < 0)
+		    || raid_disk < 0 || test_bit(Journal, &rdev->flags))
 			continue;
 		disk = conf->disks + raid_disk;
 
@@ -6694,8 +6694,10 @@ static int run(struct mddev *mddev)
 	rdev_for_each(rdev, mddev) {
 		long long diff;
 
-		if (test_bit(Journal, &rdev->flags))
+		if (test_bit(Journal, &rdev->flags)) {
 			journal_dev = rdev;
+			continue;
+		}
 		if (rdev->raid_disk < 0)
 			continue;
 		diff = (rdev->new_data_offset - rdev->data_offset);
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 9/9] MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
                   ` (7 preceding siblings ...)
  2015-10-09  4:54 ` [PATCH 8/9] MD: set journal disk ->raid_disk Shaohua Li
@ 2015-10-09  4:54 ` Shaohua Li
  2015-10-12  6:17 ` [PATCH 0/9] raid5-cache fixes Neil Brown
  2015-10-13 21:03 ` Neil Brown
  10 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-09  4:54 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

From: Song Liu <songliubraving@fb.com>

When RAID-4/5/6 array suffers from missing journal device, we put
the array in read only state. We should not allow trasition to
read-write states (clean and active) before replacing journal device.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 59b38ef..c23d21b 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -3961,7 +3961,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 		break;
 	case clean:
 		if (mddev->pers) {
-			restart_array(mddev);
+			err = restart_array(mddev);
+			if (err)
+				break;
 			spin_lock(&mddev->lock);
 			if (atomic_read(&mddev->writes_pending) == 0) {
 				if (mddev->in_sync == 0) {
@@ -3979,7 +3981,9 @@ array_state_store(struct mddev *mddev, const char *buf, size_t len)
 		break;
 	case active:
 		if (mddev->pers) {
-			restart_array(mddev);
+			err = restart_array(mddev);
+			if (err)
+				break;
 			clear_bit(MD_CHANGE_PENDING, &mddev->flags);
 			wake_up(&mddev->sb_wait);
 			err = 0;
@@ -5336,6 +5340,25 @@ static int restart_array(struct mddev *mddev)
 		return -EINVAL;
 	if (!mddev->ro)
 		return -EBUSY;
+	if (test_bit(MD_HAS_JOURNAL, &mddev->flags)) {
+		struct md_rdev *rdev;
+		bool has_journal = false;
+
+		rcu_read_lock();
+		rdev_for_each_rcu(rdev, mddev) {
+			if (test_bit(Journal, &rdev->flags) &&
+			    !test_bit(Faulty, &rdev->flags)) {
+				has_journal = true;
+				break;
+			}
+		}
+		rcu_read_unlock();
+
+		/* Don't restart rw with journal missing/faulty */
+		if (!has_journal)
+			return -EINVAL;
+	}
+
 	mddev->safemode = 0;
 	mddev->ro = 0;
 	set_disk_ro(disk, 0);
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/9] MD: fix info output for journal disk
  2015-10-09  4:54 ` [PATCH 1/9] MD: fix info output for journal disk Shaohua Li
@ 2015-10-12  5:37   ` Neil Brown
  2015-10-12 17:01     ` Shaohua Li
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2015-10-12  5:37 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 2177 bytes --]

Shaohua Li <shli@fb.com> writes:

> journal disk can be faulty. The Journal and Faulty aren't exclusive with
> each other.
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/md.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 0729cc7..daf42bb 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5857,7 +5857,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
>  		else if (test_bit(In_sync, &rdev->flags)) {
>  			info.state |= (1<<MD_DISK_ACTIVE);
>  			info.state |= (1<<MD_DISK_SYNC);
> -		} else if (test_bit(Journal, &rdev->flags))
> +		}
> +		if (test_bit(Journal, &rdev->flags))
>  			info.state |= (1<<MD_DISK_JOURNAL);
>  		if (test_bit(WriteMostly, &rdev->flags))
>  			info.state |= (1<<MD_DISK_WRITEMOSTLY);
> @@ -7335,18 +7336,21 @@ static int md_seq_show(struct seq_file *seq, void *v)
>  		rcu_read_lock();
>  		rdev_for_each_rcu(rdev, mddev) {
>  			char b[BDEVNAME_SIZE];
> +			bool skip = false;
>  			seq_printf(seq, " %s[%d]",
>  				bdevname(rdev->bdev,b), rdev->desc_nr);
>  			if (test_bit(WriteMostly, &rdev->flags))
>  				seq_printf(seq, "(W)");
>  			if (test_bit(Faulty, &rdev->flags)) {
>  				seq_printf(seq, "(F)");
> -				continue;
> +				skip = true;
>  			}
>  			if (test_bit(Journal, &rdev->flags)) {
>  				seq_printf(seq, "(J)");
> -				continue;
> +				skip = true;
>  			}
> +			if (skip)
> +				continue;

Is this 'skip' stuff really needed?  What would go wrong if we just left
it out?  An active journal will have raid_disk>=0 now won't it?  And
if we don't support replacement of journals (which I suspect we
shouldn't) then (R) will never get reported.

So can we just drop the 'skip'??

Thanks,
NeilBrown

>  			if (rdev->raid_disk < 0)
>  				seq_printf(seq, "(S)"); /* spare */
>  			if (test_bit(Replacement, &rdev->flags))
> -- 
> 2.4.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/9] raid5-cache: add trim support for log
  2015-10-09  4:54 ` [PATCH 2/9] raid5-cache: add trim support for log Shaohua Li
@ 2015-10-12  6:00   ` Neil Brown
  2015-10-12 16:57     ` Shaohua Li
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2015-10-12  6:00 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 5087 bytes --]

Shaohua Li <shli@fb.com> writes:

> Since superblock is updated infrequently, we do a simple trim of log
> disk (a synchronous trim)
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5-cache.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 62 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index caeec10..5dbe084 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -85,6 +85,7 @@ struct r5l_log {
>  	spinlock_t no_space_stripes_lock;
>  
>  	bool need_cache_flush;
> +	bool in_teardown;
>  };
>  
>  /*
> @@ -644,6 +645,60 @@ void r5l_flush_stripe_to_raid(struct r5l_log *log)
>  }
>  
>  static void r5l_write_super(struct r5l_log *log, sector_t cp);
> +static void r5l_write_super_and_discard_space(struct r5l_log *log,
> +	sector_t end)
> +{
> +	struct block_device *bdev = log->rdev->bdev;
> +	struct mddev *mddev;
> +
> +	r5l_write_super(log, end);
> +
> +	if (!blk_queue_discard(bdev_get_queue(bdev)))
> +		return;
> +
> +	mddev = log->rdev->mddev;
> +	/*
> +	 * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
> +	 * wait for this thread to finish. This thread waits for
> +	 * MD_CHANGE_PENDING clear, which is supposed to be done in
> +	 * md_check_recovery(). md_check_recovery() tries to get
> +	 * reconfig_mutex. Since r5l_quiesce already holds the mutex,
> +	 * md_check_recovery() fails, so the PENDING never get cleared. The
> +	 * in_teardown check workaround this issue.
> +	 * */

Thanks for this comment (but can we just end the comment with "*/"
instead of  "* */" - that's what everyone else does).
It helps a lot.  If feels a bit clumsy, but maybe that is just me.  At
least I understand now and it make sense.

> +	if (!log->in_teardown) {
> +		set_bit(MD_CHANGE_DEVS, &mddev->flags);
> +		set_bit(MD_CHANGE_PENDING, &mddev->flags);
> +		md_wakeup_thread(mddev->thread);
> +		wait_event(mddev->sb_wait,
> +			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
> +			log->in_teardown);
> +		/*
> +		 * r5l_quiesce could run after in_teardown check and hold
> +		 * mutex first. Superblock might get updated twice.
> +		 * */
> +		if (log->in_teardown)
> +			md_update_sb(mddev, 1);
> +	} else {
> +		BUG_ON(!mddev_is_locked(mddev));
> +		md_update_sb(mddev, 1);
> +	}
> +
> +	if (log->last_checkpoint < end) {
> +		blkdev_issue_discard(bdev,
> +				log->last_checkpoint + log->rdev->data_offset,
> +				end - log->last_checkpoint, GFP_NOIO, 0);
> +	} else {
> +		blkdev_issue_discard(bdev,
> +				log->last_checkpoint + log->rdev->data_offset,
> +				log->device_size - log->last_checkpoint,
> +				GFP_NOIO, 0);
> +		blkdev_issue_discard(bdev, log->rdev->data_offset, end,
> +				GFP_NOIO, 0);
> +	}
> +}
> +
> +
>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
>  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
> @@ -685,7 +740,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  	 * here, because the log area might be reused soon and we don't want to
>  	 * confuse recovery
>  	 */
> -	r5l_write_super(log, next_checkpoint);
> +	r5l_write_super_and_discard_space(log, next_checkpoint);
>  
>  	mutex_lock(&log->io_mutex);
>  	log->last_checkpoint = next_checkpoint;
> @@ -721,9 +776,11 @@ static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
>  
>  void r5l_quiesce(struct r5l_log *log, int state)
>  {
> +	struct mddev *mddev;
>  	if (!log || state == 2)
>  		return;
>  	if (state == 0) {
> +		log->in_teardown = 0;
>  		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
>  					log->rdev->mddev, "reclaim");
>  	} else if (state == 1) {
> @@ -731,6 +788,10 @@ void r5l_quiesce(struct r5l_log *log, int state)
>  		 * at this point all stripes are finished, so io_unit is at
>  		 * least in STRIPE_END state
>  		 */
> +		log->in_teardown = 1;
> +		/* make sure r5l_write_super_and_discard_space exits */
> +		mddev = log->rdev->mddev;
> +		wake_up(&mddev->sb_wait);
>  		r5l_wake_reclaim(log, -1L);
>  		md_unregister_thread(&log->reclaim_thread);

I think this is racey (though you haven't changed the code, so it must
have been racy before).
There is no guarantee that the thread will actually wake up to handle
the reclaim before md_unregister_thread marks the thread as
kthread_should_stop().
Maybe the thing to do would be
   md_unregister_thread()
   log->reclaim_target = (unsigned long) -1;
   r5l_do_reclaim(log);

or something like that.  i.e. just do it synchronously.

Then.... maybe you don't need in_teardown.
When r5l_do_reclaim() is called from the thread, you wait.
When called from r5l_quiesce(), you md_update_sb().

Would that work?

Thanks,
NeilBrown



>  		r5l_do_reclaim(log);
> -- 
> 2.4.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/9] raid5-cache fixes
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
                   ` (8 preceding siblings ...)
  2015-10-09  4:54 ` [PATCH 9/9] MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW Shaohua Li
@ 2015-10-12  6:17 ` Neil Brown
  2015-10-13 21:03 ` Neil Brown
  10 siblings, 0 replies; 18+ messages in thread
From: Neil Brown @ 2015-10-12  6:17 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 533 bytes --]

Shaohua Li <shli@fb.com> writes:

> Hi,
>
> some fixes for raid5-cache/md. Mainly supports trim, handle IO error and
> handling array assemble with missing/faulty disks.
> patch 1: fix a bug
> patch 2: support trim

These two I've replied to separately.

> patch 3-4, IO error handling
> patch 5-9, handle assemble issue when journal disk is missing/faulty

These I am happy with (I think, it is getting late, my brain slows
down...)
I was going to apply them but they depend on the trim support to apply
cleanly.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/9] raid5-cache: add trim support for log
  2015-10-12  6:00   ` Neil Brown
@ 2015-10-12 16:57     ` Shaohua Li
  0 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-12 16:57 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Mon, Oct 12, 2015 at 05:00:14PM +1100, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > Since superblock is updated infrequently, we do a simple trim of log
> > disk (a synchronous trim)
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/raid5-cache.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 62 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> > index caeec10..5dbe084 100644
> > --- a/drivers/md/raid5-cache.c
> > +++ b/drivers/md/raid5-cache.c
> > @@ -85,6 +85,7 @@ struct r5l_log {
> >  	spinlock_t no_space_stripes_lock;
> >  
> >  	bool need_cache_flush;
> > +	bool in_teardown;
> >  };
> >  
> >  /*
> > @@ -644,6 +645,60 @@ void r5l_flush_stripe_to_raid(struct r5l_log *log)
> >  }
> >  
> >  static void r5l_write_super(struct r5l_log *log, sector_t cp);
> > +static void r5l_write_super_and_discard_space(struct r5l_log *log,
> > +	sector_t end)
> > +{
> > +	struct block_device *bdev = log->rdev->bdev;
> > +	struct mddev *mddev;
> > +
> > +	r5l_write_super(log, end);
> > +
> > +	if (!blk_queue_discard(bdev_get_queue(bdev)))
> > +		return;
> > +
> > +	mddev = log->rdev->mddev;
> > +	/*
> > +	 * This is to avoid a deadlock. r5l_quiesce holds reconfig_mutex and
> > +	 * wait for this thread to finish. This thread waits for
> > +	 * MD_CHANGE_PENDING clear, which is supposed to be done in
> > +	 * md_check_recovery(). md_check_recovery() tries to get
> > +	 * reconfig_mutex. Since r5l_quiesce already holds the mutex,
> > +	 * md_check_recovery() fails, so the PENDING never get cleared. The
> > +	 * in_teardown check workaround this issue.
> > +	 * */
> 
> Thanks for this comment (but can we just end the comment with "*/"
> instead of  "* */" - that's what everyone else does).

sorry, I will set vim correctly.
> It helps a lot.  If feels a bit clumsy, but maybe that is just me.  At
> least I understand now and it make sense. 
> > +	if (!log->in_teardown) {
> > +		set_bit(MD_CHANGE_DEVS, &mddev->flags);
> > +		set_bit(MD_CHANGE_PENDING, &mddev->flags);
> > +		md_wakeup_thread(mddev->thread);
> > +		wait_event(mddev->sb_wait,
> > +			!test_bit(MD_CHANGE_PENDING, &mddev->flags) ||
> > +			log->in_teardown);
> > +		/*
> > +		 * r5l_quiesce could run after in_teardown check and hold
> > +		 * mutex first. Superblock might get updated twice.
> > +		 * */
> > +		if (log->in_teardown)
> > +			md_update_sb(mddev, 1);
> > +	} else {
> > +		BUG_ON(!mddev_is_locked(mddev));
> > +		md_update_sb(mddev, 1);
> > +	}
> > +
> > +	if (log->last_checkpoint < end) {
> > +		blkdev_issue_discard(bdev,
> > +				log->last_checkpoint + log->rdev->data_offset,
> > +				end - log->last_checkpoint, GFP_NOIO, 0);
> > +	} else {
> > +		blkdev_issue_discard(bdev,
> > +				log->last_checkpoint + log->rdev->data_offset,
> > +				log->device_size - log->last_checkpoint,
> > +				GFP_NOIO, 0);
> > +		blkdev_issue_discard(bdev, log->rdev->data_offset, end,
> > +				GFP_NOIO, 0);
> > +	}
> > +}
> > +
> > +
> >  static void r5l_do_reclaim(struct r5l_log *log)
> >  {
> >  	sector_t reclaim_target = xchg(&log->reclaim_target, 0);
> > @@ -685,7 +740,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
> >  	 * here, because the log area might be reused soon and we don't want to
> >  	 * confuse recovery
> >  	 */
> > -	r5l_write_super(log, next_checkpoint);
> > +	r5l_write_super_and_discard_space(log, next_checkpoint);
> >  
> >  	mutex_lock(&log->io_mutex);
> >  	log->last_checkpoint = next_checkpoint;
> > @@ -721,9 +776,11 @@ static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
> >  
> >  void r5l_quiesce(struct r5l_log *log, int state)
> >  {
> > +	struct mddev *mddev;
> >  	if (!log || state == 2)
> >  		return;
> >  	if (state == 0) {
> > +		log->in_teardown = 0;
> >  		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
> >  					log->rdev->mddev, "reclaim");
> >  	} else if (state == 1) {
> > @@ -731,6 +788,10 @@ void r5l_quiesce(struct r5l_log *log, int state)
> >  		 * at this point all stripes are finished, so io_unit is at
> >  		 * least in STRIPE_END state
> >  		 */
> > +		log->in_teardown = 1;
> > +		/* make sure r5l_write_super_and_discard_space exits */
> > +		mddev = log->rdev->mddev;
> > +		wake_up(&mddev->sb_wait);
> >  		r5l_wake_reclaim(log, -1L);
> >  		md_unregister_thread(&log->reclaim_thread);
> 
> I think this is racey (though you haven't changed the code, so it must
> have been racy before).
> There is no guarantee that the thread will actually wake up to handle
> the reclaim before md_unregister_thread marks the thread as
> kthread_should_stop().
> Maybe the thing to do would be
>    md_unregister_thread()
>    log->reclaim_target = (unsigned long) -1;
>    r5l_do_reclaim(log);
> 
> or something like that.  i.e. just do it synchronously.

This doesn't sound like a race. If the thread doesn't run because it
exits after kthread_should_stop check, the reclaim_target should remain
to be -1 after md_unregister_thread

> Then.... maybe you don't need in_teardown.
> When r5l_do_reclaim() is called from the thread, you wait.
> When called from r5l_quiesce(), you md_update_sb().
> 
> Would that work?

It wouldn't. This only fixes the deadlock issue r5l_quiesce calls
r5l_do_reclaim() directly. The problem is r5l_wake_reclaim() will wake
the thread, md_unregister_thread will wait till the thread exits. At
that time the thread might try to lock the reconfig_mutex. Since
r5l_quesce already holds the mutex, the md_unregister_thread never
returns. In summary, r5l_do_reclaim() is called from the thread can't be
used to determine whether we need the deadlock avoidness.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/9] MD: fix info output for journal disk
  2015-10-12  5:37   ` Neil Brown
@ 2015-10-12 17:01     ` Shaohua Li
  2015-10-12 23:26       ` Neil Brown
  0 siblings, 1 reply; 18+ messages in thread
From: Shaohua Li @ 2015-10-12 17:01 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Mon, Oct 12, 2015 at 04:37:55PM +1100, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > journal disk can be faulty. The Journal and Faulty aren't exclusive with
> > each other.
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/md.c | 10 +++++++---
> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 0729cc7..daf42bb 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -5857,7 +5857,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
> >  		else if (test_bit(In_sync, &rdev->flags)) {
> >  			info.state |= (1<<MD_DISK_ACTIVE);
> >  			info.state |= (1<<MD_DISK_SYNC);
> > -		} else if (test_bit(Journal, &rdev->flags))
> > +		}
> > +		if (test_bit(Journal, &rdev->flags))
> >  			info.state |= (1<<MD_DISK_JOURNAL);
> >  		if (test_bit(WriteMostly, &rdev->flags))
> >  			info.state |= (1<<MD_DISK_WRITEMOSTLY);
> > @@ -7335,18 +7336,21 @@ static int md_seq_show(struct seq_file *seq, void *v)
> >  		rcu_read_lock();
> >  		rdev_for_each_rcu(rdev, mddev) {
> >  			char b[BDEVNAME_SIZE];
> > +			bool skip = false;
> >  			seq_printf(seq, " %s[%d]",
> >  				bdevname(rdev->bdev,b), rdev->desc_nr);
> >  			if (test_bit(WriteMostly, &rdev->flags))
> >  				seq_printf(seq, "(W)");
> >  			if (test_bit(Faulty, &rdev->flags)) {
> >  				seq_printf(seq, "(F)");
> > -				continue;
> > +				skip = true;
> >  			}
> >  			if (test_bit(Journal, &rdev->flags)) {
> >  				seq_printf(seq, "(J)");
> > -				continue;
> > +				skip = true;
> >  			}
> > +			if (skip)
> > +				continue;
> 
> Is this 'skip' stuff really needed?  What would go wrong if we just left
> it out?  An active journal will have raid_disk>=0 now won't it?  And
> if we don't support replacement of journals (which I suspect we
> shouldn't) then (R) will never get reported.
> 
> So can we just drop the 'skip'??

Sounds good. Let me know if I should resend this one alone or the
series.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/9] MD: fix info output for journal disk
  2015-10-12 17:01     ` Shaohua Li
@ 2015-10-12 23:26       ` Neil Brown
  2015-10-12 23:38         ` Shaohua Li
  0 siblings, 1 reply; 18+ messages in thread
From: Neil Brown @ 2015-10-12 23:26 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 2489 bytes --]

Shaohua Li <shli@fb.com> writes:

> On Mon, Oct 12, 2015 at 04:37:55PM +1100, Neil Brown wrote:
>> Shaohua Li <shli@fb.com> writes:
>> 
>> > journal disk can be faulty. The Journal and Faulty aren't exclusive with
>> > each other.
>> >
>> > Signed-off-by: Shaohua Li <shli@fb.com>
>> > ---
>> >  drivers/md/md.c | 10 +++++++---
>> >  1 file changed, 7 insertions(+), 3 deletions(-)
>> >
>> > diff --git a/drivers/md/md.c b/drivers/md/md.c
>> > index 0729cc7..daf42bb 100644
>> > --- a/drivers/md/md.c
>> > +++ b/drivers/md/md.c
>> > @@ -5857,7 +5857,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
>> >  		else if (test_bit(In_sync, &rdev->flags)) {
>> >  			info.state |= (1<<MD_DISK_ACTIVE);
>> >  			info.state |= (1<<MD_DISK_SYNC);
>> > -		} else if (test_bit(Journal, &rdev->flags))
>> > +		}
>> > +		if (test_bit(Journal, &rdev->flags))
>> >  			info.state |= (1<<MD_DISK_JOURNAL);
>> >  		if (test_bit(WriteMostly, &rdev->flags))
>> >  			info.state |= (1<<MD_DISK_WRITEMOSTLY);
>> > @@ -7335,18 +7336,21 @@ static int md_seq_show(struct seq_file *seq, void *v)
>> >  		rcu_read_lock();
>> >  		rdev_for_each_rcu(rdev, mddev) {
>> >  			char b[BDEVNAME_SIZE];
>> > +			bool skip = false;
>> >  			seq_printf(seq, " %s[%d]",
>> >  				bdevname(rdev->bdev,b), rdev->desc_nr);
>> >  			if (test_bit(WriteMostly, &rdev->flags))
>> >  				seq_printf(seq, "(W)");
>> >  			if (test_bit(Faulty, &rdev->flags)) {
>> >  				seq_printf(seq, "(F)");
>> > -				continue;
>> > +				skip = true;
>> >  			}
>> >  			if (test_bit(Journal, &rdev->flags)) {
>> >  				seq_printf(seq, "(J)");
>> > -				continue;
>> > +				skip = true;
>> >  			}
>> > +			if (skip)
>> > +				continue;
>> 
>> Is this 'skip' stuff really needed?  What would go wrong if we just left
>> it out?  An active journal will have raid_disk>=0 now won't it?  And
>> if we don't support replacement of journals (which I suspect we
>> shouldn't) then (R) will never get reported.
>> 
>> So can we just drop the 'skip'??
>
> Sounds good. Let me know if I should resend this one alone or the
> series.

Just send this one alone.  I'll apply the rest from the previous
posting.

I'm not entirely convinced about the race issue, but I don't really
want to think about it today and as you seem certain we will go with the
current code for now.
If I think of something that does convince me, it can always be fixed
later.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/9] MD: fix info output for journal disk
  2015-10-12 23:26       ` Neil Brown
@ 2015-10-12 23:38         ` Shaohua Li
  0 siblings, 0 replies; 18+ messages in thread
From: Shaohua Li @ 2015-10-12 23:38 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Tue, Oct 13, 2015 at 10:26:08AM +1100, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > On Mon, Oct 12, 2015 at 04:37:55PM +1100, Neil Brown wrote:
> >> Shaohua Li <shli@fb.com> writes:
> >> 
> >> > journal disk can be faulty. The Journal and Faulty aren't exclusive with
> >> > each other.
> >> >
> >> > Signed-off-by: Shaohua Li <shli@fb.com>
> >> > ---
> >> >  drivers/md/md.c | 10 +++++++---
> >> >  1 file changed, 7 insertions(+), 3 deletions(-)
> >> >
> >> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> >> > index 0729cc7..daf42bb 100644
> >> > --- a/drivers/md/md.c
> >> > +++ b/drivers/md/md.c
> >> > @@ -5857,7 +5857,8 @@ static int get_disk_info(struct mddev *mddev, void __user * arg)
> >> >  		else if (test_bit(In_sync, &rdev->flags)) {
> >> >  			info.state |= (1<<MD_DISK_ACTIVE);
> >> >  			info.state |= (1<<MD_DISK_SYNC);
> >> > -		} else if (test_bit(Journal, &rdev->flags))
> >> > +		}
> >> > +		if (test_bit(Journal, &rdev->flags))
> >> >  			info.state |= (1<<MD_DISK_JOURNAL);
> >> >  		if (test_bit(WriteMostly, &rdev->flags))
> >> >  			info.state |= (1<<MD_DISK_WRITEMOSTLY);
> >> > @@ -7335,18 +7336,21 @@ static int md_seq_show(struct seq_file *seq, void *v)
> >> >  		rcu_read_lock();
> >> >  		rdev_for_each_rcu(rdev, mddev) {
> >> >  			char b[BDEVNAME_SIZE];
> >> > +			bool skip = false;
> >> >  			seq_printf(seq, " %s[%d]",
> >> >  				bdevname(rdev->bdev,b), rdev->desc_nr);
> >> >  			if (test_bit(WriteMostly, &rdev->flags))
> >> >  				seq_printf(seq, "(W)");
> >> >  			if (test_bit(Faulty, &rdev->flags)) {
> >> >  				seq_printf(seq, "(F)");
> >> > -				continue;
> >> > +				skip = true;
> >> >  			}
> >> >  			if (test_bit(Journal, &rdev->flags)) {
> >> >  				seq_printf(seq, "(J)");
> >> > -				continue;
> >> > +				skip = true;
> >> >  			}
> >> > +			if (skip)
> >> > +				continue;
> >> 
> >> Is this 'skip' stuff really needed?  What would go wrong if we just left
> >> it out?  An active journal will have raid_disk>=0 now won't it?  And
> >> if we don't support replacement of journals (which I suspect we
> >> shouldn't) then (R) will never get reported.
> >> 
> >> So can we just drop the 'skip'??
> >
> > Sounds good. Let me know if I should resend this one alone or the
> > series.
> 
> Just send this one alone.  I'll apply the rest from the previous
> posting.
> 
> I'm not entirely convinced about the race issue, but I don't really
> want to think about it today and as you seem certain we will go with the
> current code for now.
> If I think of something that does convince me, it can always be fixed
> later.

Cool, thanks!

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/9] raid5-cache fixes
  2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
                   ` (9 preceding siblings ...)
  2015-10-12  6:17 ` [PATCH 0/9] raid5-cache fixes Neil Brown
@ 2015-10-13 21:03 ` Neil Brown
  10 siblings, 0 replies; 18+ messages in thread
From: Neil Brown @ 2015-10-13 21:03 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 1059 bytes --]

Shaohua Li <shli@fb.com> writes:

> Hi,
>
> some fixes for raid5-cache/md. Mainly supports trim, handle IO error and
> handling array assemble with missing/faulty disks.
> patch 1: fix a bug
> patch 2: support trim
> patch 3-4, IO error handling
> patch 5-9, handle assemble issue when journal disk is missing/faulty
>
> If there is IO error to journal disk, write IO will fail immediately. If we
> assemble an array with journal feature enabled but with journal disk
> missing/faulty, the raid array will start readonly.
>
> Thanks,
> Shaohua
>
> Shaohua Li (6):
>   MD: fix info output for journal disk
>   raid5-cache: add trim support for log
>   raid5: journal disk can't be removed
>   raid5-cache: IO error handling
>   raid5-cache: start raid5 readonly if journal is missing
>   MD: set journal disk ->raid_disk
>
> Song Liu (3):
>   MD: add new bit to indicate raid array with journal
>   MD: kick out journal disk if it's not fresh
>   MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW
>

Thanks,
I've applied all of these.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2015-10-13 21:03 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-09  4:54 [PATCH 0/9] raid5-cache fixes Shaohua Li
2015-10-09  4:54 ` [PATCH 1/9] MD: fix info output for journal disk Shaohua Li
2015-10-12  5:37   ` Neil Brown
2015-10-12 17:01     ` Shaohua Li
2015-10-12 23:26       ` Neil Brown
2015-10-12 23:38         ` Shaohua Li
2015-10-09  4:54 ` [PATCH 2/9] raid5-cache: add trim support for log Shaohua Li
2015-10-12  6:00   ` Neil Brown
2015-10-12 16:57     ` Shaohua Li
2015-10-09  4:54 ` [PATCH 3/9] raid5: journal disk can't be removed Shaohua Li
2015-10-09  4:54 ` [PATCH 4/9] raid5-cache: IO error handling Shaohua Li
2015-10-09  4:54 ` [PATCH 5/9] MD: add new bit to indicate raid array with journal Shaohua Li
2015-10-09  4:54 ` [PATCH 6/9] raid5-cache: start raid5 readonly if journal is missing Shaohua Li
2015-10-09  4:54 ` [PATCH 7/9] MD: kick out journal disk if it's not fresh Shaohua Li
2015-10-09  4:54 ` [PATCH 8/9] MD: set journal disk ->raid_disk Shaohua Li
2015-10-09  4:54 ` [PATCH 9/9] MD: when RAID journal is missing/faulty, block RESTART_ARRAY_RW Shaohua Li
2015-10-12  6:17 ` [PATCH 0/9] raid5-cache fixes Neil Brown
2015-10-13 21:03 ` Neil Brown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.