All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] raid5-cache fixes
@ 2015-10-04 16:20 Shaohua Li
  2015-10-04 16:20 ` [PATCH 1/6] md: show journal for journal disk in disk state sysfs Shaohua Li
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Shaohua Li @ 2015-10-04 16:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Hi,

some fixes for raid5-cache.
patch 1, a small fix
patch 2-3, move reclaim teardown to quiesce handling and add trim support. I
still need md_update_sb there and play the mddev_is_locked trick as the
reconfig_mutex is already hold at md stop. The md_update_sb probably can move
to md core later.
patch 4-6, add error handling. For patch 4, I still need the journal bit check
in in md core, otherwise there is no way to prevent 'echo remove > rdev/state'
to delete journal disk. For patch 6, I didn't change has_failed() yet. Handling
assemble with miss/failed log disk is still on going.

Next step is to make assemble correct with miss/failed log disk. This will need
kernel/utilities cooperation. Song and I are working on it.

Thanks,
Shaohua

Shaohua Li (6):
  md: show journal for journal disk in disk state sysfs
  raid5-cache: move reclaim stop to quiesce
  raid5-cache: add trim support for log
  md: don't export log device
  md: set In_Sync for log disk
  raid5-cache: IO error handling

 drivers/md/md.c          |  9 ++++--
 drivers/md/raid5-cache.c | 83 ++++++++++++++++++++++++++++++++++++++++--------
 drivers/md/raid5.c       |  7 +++-
 drivers/md/raid5.h       |  3 ++
 4 files changed, 85 insertions(+), 17 deletions(-)

-- 
2.4.6


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/6] md: show journal for journal disk in disk state sysfs
  2015-10-04 16:20 [PATCH 0/6] raid5-cache fixes Shaohua Li
@ 2015-10-04 16:20 ` Shaohua Li
  2015-10-04 16:20 ` [PATCH 2/6] raid5-cache: move reclaim stop to quiesce Shaohua Li
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Shaohua Li @ 2015-10-04 16:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Journal disk state sysfs entry should indicate it's journal

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f1cbb08..c643c9a 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2464,6 +2464,10 @@ state_show(struct md_rdev *rdev, char *page)
 		len += sprintf(page+len, "%sin_sync",sep);
 		sep = ",";
 	}
+	if (test_bit(Journal, &flags)) {
+		len += sprintf(page+len, "%sjournal",sep);
+		sep = ",";
+	}
 	if (test_bit(WriteMostly, &flags)) {
 		len += sprintf(page+len, "%swrite_mostly",sep);
 		sep = ",";
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/6] raid5-cache: move reclaim stop to quiesce
  2015-10-04 16:20 [PATCH 0/6] raid5-cache fixes Shaohua Li
  2015-10-04 16:20 ` [PATCH 1/6] md: show journal for journal disk in disk state sysfs Shaohua Li
@ 2015-10-04 16:20 ` Shaohua Li
  2015-10-04 16:20 ` [PATCH 3/6] raid5-cache: add trim support for log Shaohua Li
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Shaohua Li @ 2015-10-04 16:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Move reclaim stop to quiesce handling, where is safer for this stuff.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 30 ++++++++++++++++++------------
 drivers/md/raid5.c       |  1 +
 drivers/md/raid5.h       |  1 +
 3 files changed, 20 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index a02f9ce..93097a8 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -748,6 +748,24 @@ static void r5l_wake_reclaim(struct r5l_log *log, sector_t space)
 	md_wakeup_thread(log->reclaim_thread);
 }
 
+void r5l_quiesce(struct r5l_log *log, int state)
+{
+	if (!log || state == 2)
+		return;
+	if (state == 0) {
+		log->reclaim_thread = md_register_thread(r5l_reclaim_thread,
+					log->rdev->mddev, "reclaim");
+	} else if (state == 1) {
+		/*
+		 * at this point all stripes are finished, so io_unit is at
+		 * least in STRIPE_END state
+		 */
+		r5l_wake_reclaim(log, -1L);
+		md_unregister_thread(&log->reclaim_thread);
+		r5l_do_reclaim(log);
+	}
+}
+
 struct r5l_recovery_ctx {
 	struct page *meta_page;		/* current meta */
 	sector_t meta_total_blocks;	/* total size of current meta and data */
@@ -1120,19 +1138,7 @@ int r5l_init_log(struct r5conf *conf, struct md_rdev *rdev)
 
 void r5l_exit_log(struct r5l_log *log)
 {
-	/*
-	 * at this point all stripes are finished, so io_unit is at least in
-	 * STRIPE_END state
-	 */
-	r5l_wake_reclaim(log, -1L);
 	md_unregister_thread(&log->reclaim_thread);
-	r5l_do_reclaim(log);
-	/*
-	 * force a super update, r5l_do_reclaim might updated the super.
-	 * mddev->thread is already stopped
-	 */
-	md_update_sb(log->rdev->mddev, 1);
-
 	kmem_cache_destroy(log->io_kc);
 	kfree(log);
 }
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a622ccb..216fa3c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7582,6 +7582,7 @@ static void raid5_quiesce(struct mddev *mddev, int state)
 		unlock_all_device_hash_locks_irq(conf);
 		break;
 	}
+	r5l_quiesce(conf->log, state);
 }
 
 static void *raid45_takeover_raid0(struct mddev *mddev, int level)
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 32c8ce8..1ab534c 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -630,4 +630,5 @@ extern void r5l_write_stripe_run(struct r5l_log *log);
 extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
 extern void r5l_stripe_write_finished(struct stripe_head *sh);
 extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
+extern void r5l_quiesce(struct r5l_log *log, int state);
 #endif
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/6] raid5-cache: add trim support for log
  2015-10-04 16:20 [PATCH 0/6] raid5-cache fixes Shaohua Li
  2015-10-04 16:20 ` [PATCH 1/6] md: show journal for journal disk in disk state sysfs Shaohua Li
  2015-10-04 16:20 ` [PATCH 2/6] raid5-cache: move reclaim stop to quiesce Shaohua Li
@ 2015-10-04 16:20 ` Shaohua Li
  2015-10-08  1:53   ` Neil Brown
  2015-10-04 16:20 ` [PATCH 4/6] md: don't export log device Shaohua Li
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Shaohua Li @ 2015-10-04 16:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Since superblock is updated infrequently, we do a simple trim of log
disk (a synchronous trim)

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 38 +++++++++++++++++++++++++++++++++++++-
 1 file changed, 37 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 93097a8..6c52168 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -654,6 +654,42 @@ static void r5l_kick_io_unit(struct r5l_log *log)
 }
 
 static void r5l_write_super(struct r5l_log *log, sector_t cp);
+static void r5l_write_super_and_discard_space(struct r5l_log *log,
+	sector_t end)
+{
+	struct block_device *bdev = log->rdev->bdev;
+	struct mddev *mddev;
+
+	r5l_write_super(log, end);
+
+	if (!blk_queue_discard(bdev_get_queue(bdev)))
+		return;
+
+	mddev = log->rdev->mddev;
+	if (!mddev_is_locked(mddev)) {
+		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait,
+			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+	} else { /* we are stopping the array, already take reconfig_mutex */
+		md_update_sb(mddev, 1);
+	}
+
+	if (log->last_checkpoint < end) {
+		blkdev_issue_discard(bdev,
+				log->last_checkpoint + log->rdev->data_offset,
+				end - log->last_checkpoint, GFP_NOIO, 0);
+	} else {
+		blkdev_issue_discard(bdev,
+				log->last_checkpoint + log->rdev->data_offset,
+				log->device_size - log->last_checkpoint,
+				GFP_NOIO, 0);
+		blkdev_issue_discard(bdev, log->rdev->data_offset, end,
+				GFP_NOIO, 0);
+	}
+}
+
+
 static void r5l_do_reclaim(struct r5l_log *log)
 {
 	struct r5l_io_unit *io, *last;
@@ -709,7 +745,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
 	 * here, because the log area might be reused soon and we don't want to
 	 * confuse recovery
 	 */
-	r5l_write_super(log, last->log_start);
+	r5l_write_super_and_discard_space(log, last->log_start);
 
 	mutex_lock(&log->io_mutex);
 	log->last_checkpoint = last->log_start;
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/6] md: don't export log device
  2015-10-04 16:20 [PATCH 0/6] raid5-cache fixes Shaohua Li
                   ` (2 preceding siblings ...)
  2015-10-04 16:20 ` [PATCH 3/6] raid5-cache: add trim support for log Shaohua Li
@ 2015-10-04 16:20 ` Shaohua Li
  2015-10-08  1:57   ` Neil Brown
  2015-10-04 16:20 ` [PATCH 5/6] md: set In_Sync for log disk Shaohua Li
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Shaohua Li @ 2015-10-04 16:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

If there is IO error in log device, don't export it like other raid
disks. Otherwise we get kernel crash in different places since
rdev->bdev, rdev->mddev becomes NULL

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c    | 4 ++--
 drivers/md/raid5.c | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index c643c9a..ec6574d 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2523,7 +2523,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 		else
 			err = -EBUSY;
 	} else if (cmd_match(buf, "remove")) {
-		if (rdev->raid_disk >= 0)
+		if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
 			err = -EBUSY;
 		else {
 			struct mddev *mddev = rdev->mddev;
@@ -6044,7 +6044,7 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
 	clear_bit(Blocked, &rdev->flags);
 	remove_and_add_spares(mddev, rdev);
 
-	if (rdev->raid_disk >= 0)
+	if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
 		goto busy;
 
 	if (mddev_is_clustered(mddev))
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 216fa3c..c164501 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7128,6 +7128,8 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
 	struct disk_info *p = conf->disks + number;
 
 	print_raid5_conf(conf);
+	if (test_bit(Journal, &rdev->flags))
+		return -EBUSY;
 	if (rdev == p->rdev)
 		rdevp = &p->rdev;
 	else if (rdev == p->replacement)
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 5/6] md: set In_Sync for log disk
  2015-10-04 16:20 [PATCH 0/6] raid5-cache fixes Shaohua Li
                   ` (3 preceding siblings ...)
  2015-10-04 16:20 ` [PATCH 4/6] md: don't export log device Shaohua Li
@ 2015-10-04 16:20 ` Shaohua Li
  2015-10-04 16:20 ` [PATCH 6/6] raid5-cache: IO error handling Shaohua Li
  2015-10-08  2:10 ` [PATCH 0/6] raid5-cache fixes Neil Brown
  6 siblings, 0 replies; 19+ messages in thread
From: Shaohua Li @ 2015-10-04 16:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

We forgot set the bit for log disk

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index ec6574d..2fbecd7 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1646,6 +1646,7 @@ static int super_1_validate(struct mddev *mddev, struct md_rdev *rdev)
 				return -EINVAL;
 			}
 			set_bit(Journal, &rdev->flags);
+			set_bit(In_sync, &rdev->flags);
 			rdev->journal_tail = le64_to_cpu(sb->journal_tail);
 			if (mddev->recovery_cp == MaxSector)
 				set_bit(MD_JOURNAL_CLEAN, &mddev->flags);
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 6/6] raid5-cache: IO error handling
  2015-10-04 16:20 [PATCH 0/6] raid5-cache fixes Shaohua Li
                   ` (4 preceding siblings ...)
  2015-10-04 16:20 ` [PATCH 5/6] md: set In_Sync for log disk Shaohua Li
@ 2015-10-04 16:20 ` Shaohua Li
  2015-10-08  2:10 ` [PATCH 0/6] raid5-cache fixes Neil Brown
  6 siblings, 0 replies; 19+ messages in thread
From: Shaohua Li @ 2015-10-04 16:20 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

There are 3 places the raid5-cache dispatches IO. The discard IO error
doesn't matter, so we ignore it. The superblock write IO error can be
handled in MD core. The remaining are log write and flush. When the IO
error happens, we mark log disk faulty and fail all write IO. Read IO is
still allowed to run. Userspace will get a notification too and
corresponding daemon can choose setting raid array readonly for example.

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 15 ++++++++++++++-
 drivers/md/raid5.c       |  4 +++-
 drivers/md/raid5.h       |  2 ++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 6c52168..e917e53 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -223,13 +223,15 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 	io->state = state;
 }
 
-/* XXX: totally ignores I/O errors */
 static void r5l_log_endio(struct bio *bio)
 {
 	struct r5l_io_unit *io = bio->bi_private;
 	struct r5l_log *log = io->log;
 	unsigned long flags;
 
+	if (bio->bi_error)
+		md_error(log->rdev->mddev, log->rdev);
+
 	bio_put(bio);
 
 	if (!atomic_dec_and_test(&io->pending_io))
@@ -594,6 +596,9 @@ static void r5l_log_flush_endio(struct bio *bio)
 	struct r5l_io_unit *io;
 	struct stripe_head *sh;
 
+	if (bio->bi_error)
+		md_error(log->rdev->mddev, log->rdev);
+
 	spin_lock_irqsave(&log->io_list_lock, flags);
 	list_for_each_entry(io, &log->flushing_ios, log_sibling) {
 		while (!list_empty(&io->stripe_list)) {
@@ -675,6 +680,7 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 		md_update_sb(mddev, 1);
 	}
 
+	/* discard IO error really doesn't matter, ignore it */
 	if (log->last_checkpoint < end) {
 		blkdev_issue_discard(bdev,
 				log->last_checkpoint + log->rdev->data_offset,
@@ -802,6 +808,13 @@ void r5l_quiesce(struct r5l_log *log, int state)
 	}
 }
 
+bool r5l_log_disk_error(struct r5l_log *log)
+{
+	if (!log)
+		return false;
+	return !test_bit(In_sync, &log->rdev->flags);
+}
+
 struct r5l_recovery_ctx {
 	struct page *meta_page;		/* current meta */
 	sector_t meta_total_blocks;	/* total size of current meta and data */
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c164501..53bb7f6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3147,6 +3147,7 @@ handle_failed_stripe(struct r5conf *conf, struct stripe_head *sh,
 		 * the data has not reached the cache yet.
 		 */
 		if (!test_bit(R5_Wantfill, &sh->dev[i].flags) &&
+		    s->failed > conf->max_degraded &&
 		    (!test_bit(R5_Insync, &sh->dev[i].flags) ||
 		      test_bit(R5_ReadError, &sh->dev[i].flags))) {
 			spin_lock_irq(&sh->stripe_lock);
@@ -4015,6 +4016,7 @@ static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
 	s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state) && !sh->batch_head;
 	s->failed_num[0] = -1;
 	s->failed_num[1] = -1;
+	s->log_failed = r5l_log_disk_error(conf->log);
 
 	/* Now to look around and see what can be done */
 	rcu_read_lock();
@@ -4358,7 +4360,7 @@ static void handle_stripe(struct stripe_head *sh)
 	/* check if the array has lost more than max_degraded devices and,
 	 * if so, some requests might need to be failed.
 	 */
-	if (s.failed > conf->max_degraded) {
+	if (s.failed > conf->max_degraded || s.log_failed) {
 		sh->check_state = 0;
 		sh->reconstruct_state = 0;
 		break_stripe_batch_list(sh, 0);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 1ab534c..d114389 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -272,6 +272,7 @@ struct stripe_head_state {
 	struct bio_list return_bi;
 	struct md_rdev *blocked_rdev;
 	int handle_bad_blocks;
+	int log_failed;
 };
 
 /* Flags for struct r5dev.flags */
@@ -631,4 +632,5 @@ extern void r5l_flush_stripe_to_raid(struct r5l_log *log);
 extern void r5l_stripe_write_finished(struct stripe_head *sh);
 extern int r5l_handle_flush_request(struct r5l_log *log, struct bio *bio);
 extern void r5l_quiesce(struct r5l_log *log, int state);
+extern bool r5l_log_disk_error(struct r5l_log *log);
 #endif
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/6] raid5-cache: add trim support for log
  2015-10-04 16:20 ` [PATCH 3/6] raid5-cache: add trim support for log Shaohua Li
@ 2015-10-08  1:53   ` Neil Brown
  0 siblings, 0 replies; 19+ messages in thread
From: Neil Brown @ 2015-10-08  1:53 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 2842 bytes --]

Shaohua Li <shli@fb.com> writes:

> Since superblock is updated infrequently, we do a simple trim of log
> disk (a synchronous trim)
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5-cache.c | 38 +++++++++++++++++++++++++++++++++++++-
>  1 file changed, 37 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index 93097a8..6c52168 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -654,6 +654,42 @@ static void r5l_kick_io_unit(struct r5l_log *log)
>  }
>  
>  static void r5l_write_super(struct r5l_log *log, sector_t cp);
> +static void r5l_write_super_and_discard_space(struct r5l_log *log,
> +	sector_t end)
> +{
> +	struct block_device *bdev = log->rdev->bdev;
> +	struct mddev *mddev;
> +
> +	r5l_write_super(log, end);
> +
> +	if (!blk_queue_discard(bdev_get_queue(bdev)))
> +		return;
> +
> +	mddev = log->rdev->mddev;
> +	if (!mddev_is_locked(mddev)) {
> +		set_bit(MD_CHANGE_PENDING, &mddev->flags);
> +		md_wakeup_thread(mddev->thread);
> +		wait_event(mddev->sb_wait,
> +			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
> +	} else { /* we are stopping the array, already take reconfig_mutex */
> +		md_update_sb(mddev, 1);
> +	}

No.

Just because mddev is locked, that doesn't mean that this thread owns
the lock.  Some other thread might just happen to have it locked for a
moment, and may unlock it long before we get to md_update_sb().

If you cannot block here, then you need to find a way to schedule the
discard after the md_update_sb() has happened.
Maybe set some flag, and in raid5d(), after md_check_recovery(), see if
the superblock has been written and if the discard is still pending, and
then do the discard. Maybe.

Or pass a flag into r5l_do_reclaim() to say whether this thread holds
the lock or not.

NeilBrown

> +
> +	if (log->last_checkpoint < end) {
> +		blkdev_issue_discard(bdev,
> +				log->last_checkpoint + log->rdev->data_offset,
> +				end - log->last_checkpoint, GFP_NOIO, 0);
> +	} else {
> +		blkdev_issue_discard(bdev,
> +				log->last_checkpoint + log->rdev->data_offset,
> +				log->device_size - log->last_checkpoint,
> +				GFP_NOIO, 0);
> +		blkdev_issue_discard(bdev, log->rdev->data_offset, end,
> +				GFP_NOIO, 0);
> +	}
> +}
> +
> +
>  static void r5l_do_reclaim(struct r5l_log *log)
>  {
>  	struct r5l_io_unit *io, *last;
> @@ -709,7 +745,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
>  	 * here, because the log area might be reused soon and we don't want to
>  	 * confuse recovery
>  	 */
> -	r5l_write_super(log, last->log_start);
> +	r5l_write_super_and_discard_space(log, last->log_start);
>  
>  	mutex_lock(&log->io_mutex);
>  	log->last_checkpoint = last->log_start;
> -- 
> 2.4.6

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] md: don't export log device
  2015-10-04 16:20 ` [PATCH 4/6] md: don't export log device Shaohua Li
@ 2015-10-08  1:57   ` Neil Brown
  2015-10-08  3:16     ` Shaohua Li
  0 siblings, 1 reply; 19+ messages in thread
From: Neil Brown @ 2015-10-08  1:57 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 1905 bytes --]

Shaohua Li <shli@fb.com> writes:

> If there is IO error in log device, don't export it like other raid
> disks. Otherwise we get kernel crash in different places since
> rdev->bdev, rdev->mddev becomes NULL
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/md.c    | 4 ++--
>  drivers/md/raid5.c | 2 ++
>  2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index c643c9a..ec6574d 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2523,7 +2523,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>  		else
>  			err = -EBUSY;
>  	} else if (cmd_match(buf, "remove")) {
> -		if (rdev->raid_disk >= 0)
> +		if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
>  			err = -EBUSY;
>  		else {
>  			struct mddev *mddev = rdev->mddev;
> @@ -6044,7 +6044,7 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
>  	clear_bit(Blocked, &rdev->flags);
>  	remove_and_add_spares(mddev, rdev);
>  
> -	if (rdev->raid_disk >= 0)
> +	if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
>  		goto busy;
>  
>  	if (mddev_is_clustered(mddev))

Neither of these chunks should be needed.
->raid_disk of an active devices is only set to -1 if ->hot_remove_disk
succeeds.
You have make ->hot_remove_disk fail for Journal devices, so ->raid_disk
will be >= 0.

NeilBrown

> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 216fa3c..c164501 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -7128,6 +7128,8 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
>  	struct disk_info *p = conf->disks + number;
>  
>  	print_raid5_conf(conf);
> +	if (test_bit(Journal, &rdev->flags))
> +		return -EBUSY;
>  	if (rdev == p->rdev)
>  		rdevp = &p->rdev;
>  	else if (rdev == p->replacement)
> -- 
> 2.4.6

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] raid5-cache fixes
  2015-10-04 16:20 [PATCH 0/6] raid5-cache fixes Shaohua Li
                   ` (5 preceding siblings ...)
  2015-10-04 16:20 ` [PATCH 6/6] raid5-cache: IO error handling Shaohua Li
@ 2015-10-08  2:10 ` Neil Brown
  2015-10-08  2:56   ` Shaohua Li
  6 siblings, 1 reply; 19+ messages in thread
From: Neil Brown @ 2015-10-08  2:10 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 1536 bytes --]

Shaohua Li <shli@fb.com> writes:

> Hi,
>
> some fixes for raid5-cache.
> patch 1, a small fix
> patch 2-3, move reclaim teardown to quiesce handling and add trim support. I
> still need md_update_sb there and play the mddev_is_locked trick as the
> reconfig_mutex is already hold at md stop. The md_update_sb probably can move
> to md core later.
> patch 4-6, add error handling. For patch 4, I still need the journal bit check
> in in md core, otherwise there is no way to prevent 'echo remove > rdev/state'
> to delete journal disk. For patch 6, I didn't change has_failed() yet. Handling
> assemble with miss/failed log disk is still on going.
>
> Next step is to make assemble correct with miss/failed log disk. This will need
> kernel/utilities cooperation. Song and I are working on it.
>
> Thanks,
> Shaohua
>
> Shaohua Li (6):
>   md: show journal for journal disk in disk state sysfs
>   raid5-cache: move reclaim stop to quiesce
>   raid5-cache: add trim support for log
>   md: don't export log device
>   md: set In_Sync for log disk
>   raid5-cache: IO error handling
>
>  drivers/md/md.c          |  9 ++++--
>  drivers/md/raid5-cache.c | 83 ++++++++++++++++++++++++++++++++++++++++--------
>  drivers/md/raid5.c       |  7 +++-
>  drivers/md/raid5.h       |  3 ++
>  4 files changed, 85 insertions(+), 17 deletions(-)
>
> -- 
> 2.4.6

Thanks.

I've applied 1, 2, and 5. Should appear in 'devel' shortly.
3 and 4 I've replies to separately, 6 depends on 3.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] raid5-cache fixes
  2015-10-08  2:10 ` [PATCH 0/6] raid5-cache fixes Neil Brown
@ 2015-10-08  2:56   ` Shaohua Li
  2015-10-08  3:18     ` Neil Brown
  0 siblings, 1 reply; 19+ messages in thread
From: Shaohua Li @ 2015-10-08  2:56 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Thu, Oct 08, 2015 at 01:10:32PM +1100, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > Hi,
> >
> > some fixes for raid5-cache.
> > patch 1, a small fix
> > patch 2-3, move reclaim teardown to quiesce handling and add trim support. I
> > still need md_update_sb there and play the mddev_is_locked trick as the
> > reconfig_mutex is already hold at md stop. The md_update_sb probably can move
> > to md core later.
> > patch 4-6, add error handling. For patch 4, I still need the journal bit check
> > in in md core, otherwise there is no way to prevent 'echo remove > rdev/state'
> > to delete journal disk. For patch 6, I didn't change has_failed() yet. Handling
> > assemble with miss/failed log disk is still on going.
> >
> > Next step is to make assemble correct with miss/failed log disk. This will need
> > kernel/utilities cooperation. Song and I are working on it.
> >
> > Thanks,
> > Shaohua
> >
> > Shaohua Li (6):
> >   md: show journal for journal disk in disk state sysfs
> >   raid5-cache: move reclaim stop to quiesce
> >   raid5-cache: add trim support for log
> >   md: don't export log device
> >   md: set In_Sync for log disk
> >   raid5-cache: IO error handling
> >
> >  drivers/md/md.c          |  9 ++++--
> >  drivers/md/raid5-cache.c | 83 ++++++++++++++++++++++++++++++++++++++++--------
> >  drivers/md/raid5.c       |  7 +++-
> >  drivers/md/raid5.h       |  3 ++
> >  4 files changed, 85 insertions(+), 17 deletions(-)
> >
> > -- 
> > 2.4.6
> 
> Thanks.
> 
> I've applied 1, 2, and 5. Should appear in 'devel' shortly.
> 3 and 4 I've replies to separately, 6 depends on 3.

please hold 5 currently. I'm a little confused about the In_sync bit.
It's quite tricky to handle this bit. For example, if both In_sync and
Journal bits are set, I'll need move checking the 'Journal' bit ahead of
checking the 'In_sync' in super_1_sync (current patch haven't done it
yet, it's a bug). There are similar cases in mdadm too. This makes me
thinking about What's the exact role for the In_sync bit for journal
disk. The comments in the bit definition doesn't give an answer. We can
use the Faulty bit for error handling. Any thoughts?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] md: don't export log device
  2015-10-08  1:57   ` Neil Brown
@ 2015-10-08  3:16     ` Shaohua Li
  2015-10-08  4:16       ` Neil Brown
  0 siblings, 1 reply; 19+ messages in thread
From: Shaohua Li @ 2015-10-08  3:16 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Thu, Oct 08, 2015 at 12:57:29PM +1100, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> 
> > If there is IO error in log device, don't export it like other raid
> > disks. Otherwise we get kernel crash in different places since
> > rdev->bdev, rdev->mddev becomes NULL
> >
> > Signed-off-by: Shaohua Li <shli@fb.com>
> > ---
> >  drivers/md/md.c    | 4 ++--
> >  drivers/md/raid5.c | 2 ++
> >  2 files changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index c643c9a..ec6574d 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -2523,7 +2523,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
> >  		else
> >  			err = -EBUSY;
> >  	} else if (cmd_match(buf, "remove")) {
> > -		if (rdev->raid_disk >= 0)
> > +		if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
> >  			err = -EBUSY;
> >  		else {
> >  			struct mddev *mddev = rdev->mddev;
> > @@ -6044,7 +6044,7 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
> >  	clear_bit(Blocked, &rdev->flags);
> >  	remove_and_add_spares(mddev, rdev);
> >  
> > -	if (rdev->raid_disk >= 0)
> > +	if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
> >  		goto busy;
> >  
> >  	if (mddev_is_clustered(mddev))
> 
> Neither of these chunks should be needed.
> ->raid_disk of an active devices is only set to -1 if ->hot_remove_disk
> succeeds.
> You have make ->hot_remove_disk fail for Journal devices, so ->raid_disk
> will be >= 0.

I agree the raid5_remove_disk part is superficial, I fixed in an updated
patch. I still didn't get the point what can prevent a journal disk is
removed. Currently the raid_disk is always -1 for journal disk. If it
should be >=0, what value it should be? We give journal disk a special
role '0xfffd' currently.

Thanks,
shaohua
> 
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 216fa3c..c164501 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -7128,6 +7128,8 @@ static int raid5_remove_disk(struct mddev *mddev, struct md_rdev *rdev)
> >  	struct disk_info *p = conf->disks + number;
> >  
> >  	print_raid5_conf(conf);
> > +	if (test_bit(Journal, &rdev->flags))
> > +		return -EBUSY;
> >  	if (rdev == p->rdev)
> >  		rdevp = &p->rdev;
> >  	else if (rdev == p->replacement)
> > -- 
> > 2.4.6



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] raid5-cache fixes
  2015-10-08  2:56   ` Shaohua Li
@ 2015-10-08  3:18     ` Neil Brown
  2015-10-08  3:24       ` Shaohua Li
  0 siblings, 1 reply; 19+ messages in thread
From: Neil Brown @ 2015-10-08  3:18 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 1296 bytes --]

Shaohua Li <shli@fb.com> writes:
>
> please hold 5 currently. I'm a little confused about the In_sync bit.
> It's quite tricky to handle this bit. For example, if both In_sync and
> Journal bits are set, I'll need move checking the 'Journal' bit ahead of
> checking the 'In_sync' in super_1_sync (current patch haven't done it
> yet, it's a bug). There are similar cases in mdadm too. This makes me
> thinking about What's the exact role for the In_sync bit for journal
> disk. The comments in the bit definition doesn't give an answer. We can
> use the Faulty bit for error handling. Any thoughts?

"In_sync" effectively means that recovery_offset == MaxSector.  It means
all data which should be on the device is on the device (except as
described in the bad-block log).
It is set at array-creation time or when recovery completes, and is
cleared when an error is detected.  It is useful for differentiating
between a spare being added (without In_sync) or a recently failed
device being re-added (with In_sync).

None of this really relates to the Journal.  So as you say, it doesn't
mean much to set that flag for the log.

Maybe r5l_log_disk_error() should check Error rather than In_sync.  Was
there a reason you didn't do that in the first place?

I'll drop that patch.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/6] raid5-cache fixes
  2015-10-08  3:18     ` Neil Brown
@ 2015-10-08  3:24       ` Shaohua Li
  0 siblings, 0 replies; 19+ messages in thread
From: Shaohua Li @ 2015-10-08  3:24 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Thu, Oct 08, 2015 at 02:18:19PM +1100, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> >
> > please hold 5 currently. I'm a little confused about the In_sync bit.
> > It's quite tricky to handle this bit. For example, if both In_sync and
> > Journal bits are set, I'll need move checking the 'Journal' bit ahead of
> > checking the 'In_sync' in super_1_sync (current patch haven't done it
> > yet, it's a bug). There are similar cases in mdadm too. This makes me
> > thinking about What's the exact role for the In_sync bit for journal
> > disk. The comments in the bit definition doesn't give an answer. We can
> > use the Faulty bit for error handling. Any thoughts?
> 
> "In_sync" effectively means that recovery_offset == MaxSector.  It means
> all data which should be on the device is on the device (except as
> described in the bad-block log).
> It is set at array-creation time or when recovery completes, and is
> cleared when an error is detected.  It is useful for differentiating
> between a spare being added (without In_sync) or a recently failed
> device being re-added (with In_sync).
> 
> None of this really relates to the Journal.  So as you say, it doesn't
> mean much to set that flag for the log.
> 
> Maybe r5l_log_disk_error() should check Error rather than In_sync.  Was
> there a reason you didn't do that in the first place?

Thanks. No, likely a misuse.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] md: don't export log device
  2015-10-08  3:16     ` Shaohua Li
@ 2015-10-08  4:16       ` Neil Brown
  2015-10-08  4:31         ` Shaohua Li
  0 siblings, 1 reply; 19+ messages in thread
From: Neil Brown @ 2015-10-08  4:16 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 1045 bytes --]

Shaohua Li <shli@fb.com> writes:
>> 
>> Neither of these chunks should be needed.
>> ->raid_disk of an active devices is only set to -1 if ->hot_remove_disk
>> succeeds.
>> You have make ->hot_remove_disk fail for Journal devices, so ->raid_disk
>> will be >= 0.
>
> I agree the raid5_remove_disk part is superficial, I fixed in an updated
> patch. I still didn't get the point what can prevent a journal disk is
> removed. Currently the raid_disk is always -1 for journal disk. If it
> should be >=0, what value it should be? We give journal disk a special
> role '0xfffd' currently.

Oh, are we leaving the ->raid_disk at -1 for the journal?  I hadn't
noticed that.  I don't feel comfortable it.  Too much code assumes that
<0 means "not in use".

Probably set it to 0, and add a check to setup_conf(), and adjust the
check in run().  md_update_sb() probably need to be careful of journals
too (to not change ->recovery_offset).
I wonder what 'slot_show' should report for the journal.... maybe
"journal"??

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] md: don't export log device
  2015-10-08  4:16       ` Neil Brown
@ 2015-10-08  4:31         ` Shaohua Li
  2015-10-08  6:04           ` Neil Brown
  0 siblings, 1 reply; 19+ messages in thread
From: Shaohua Li @ 2015-10-08  4:31 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Thu, Oct 08, 2015 at 03:16:41PM +1100, Neil Brown wrote:
> Shaohua Li <shli@fb.com> writes:
> >> 
> >> Neither of these chunks should be needed.
> >> ->raid_disk of an active devices is only set to -1 if ->hot_remove_disk
> >> succeeds.
> >> You have make ->hot_remove_disk fail for Journal devices, so ->raid_disk
> >> will be >= 0.
> >
> > I agree the raid5_remove_disk part is superficial, I fixed in an updated
> > patch. I still didn't get the point what can prevent a journal disk is
> > removed. Currently the raid_disk is always -1 for journal disk. If it
> > should be >=0, what value it should be? We give journal disk a special
> > role '0xfffd' currently.
> 
> Oh, are we leaving the ->raid_disk at -1 for the journal?  I hadn't
> noticed that.  I don't feel comfortable it.  Too much code assumes that
> <0 means "not in use".
> 
> Probably set it to 0, and add a check to setup_conf(), and adjust the
> check in run().  md_update_sb() probably need to be careful of journals
> too (to not change ->recovery_offset).
> I wonder what 'slot_show' should report for the journal.... maybe
> "journal"??

->raid_disk >= 0 is for normal raid disks. If we use it, we will have
two disks with ->raid_disk 0, it sounds weird. Currently we add the
'test(Journal, rdev->flags)' check in different places to destinguish
journal disk. We will need to audit the code which assumes ' < 0 means
not in use'. We will probably need to audit the same code if we set
->raid_disk 0 for journal. Neither is perfect.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] md: don't export log device
  2015-10-08  4:31         ` Shaohua Li
@ 2015-10-08  6:04           ` Neil Brown
  2015-10-13 12:07             ` Christoph Hellwig
  0 siblings, 1 reply; 19+ messages in thread
From: Neil Brown @ 2015-10-08  6:04 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 2246 bytes --]

Shaohua Li <shli@fb.com> writes:

> On Thu, Oct 08, 2015 at 03:16:41PM +1100, Neil Brown wrote:
>> Shaohua Li <shli@fb.com> writes:
>> >> 
>> >> Neither of these chunks should be needed.
>> >> ->raid_disk of an active devices is only set to -1 if ->hot_remove_disk
>> >> succeeds.
>> >> You have make ->hot_remove_disk fail for Journal devices, so ->raid_disk
>> >> will be >= 0.
>> >
>> > I agree the raid5_remove_disk part is superficial, I fixed in an updated
>> > patch. I still didn't get the point what can prevent a journal disk is
>> > removed. Currently the raid_disk is always -1 for journal disk. If it
>> > should be >=0, what value it should be? We give journal disk a special
>> > role '0xfffd' currently.
>> 
>> Oh, are we leaving the ->raid_disk at -1 for the journal?  I hadn't
>> noticed that.  I don't feel comfortable it.  Too much code assumes that
>> <0 means "not in use".
>> 
>> Probably set it to 0, and add a check to setup_conf(), and adjust the
>> check in run().  md_update_sb() probably need to be careful of journals
>> too (to not change ->recovery_offset).
>> I wonder what 'slot_show' should report for the journal.... maybe
>> "journal"??
>
> ->raid_disk >= 0 is for normal raid disks. If we use it, we will have
> two disks with ->raid_disk 0, it sounds weird. Currently we add the
> 'test(Journal, rdev->flags)' check in different places to destinguish
> journal disk. We will need to audit the code which assumes ' < 0 means
> not in use'. We will probably need to audit the same code if we set
> ->raid_disk 0 for journal. Neither is perfect.

Having two disks with ->raid_disk==0 does seem a little weird, but we do
already have that in some cases.
When you have a hot-replace going, both the original and the replacement
have the same ->raid_disk numbers.  They can be distinguished by the
Replacement flag.
I'm suggesting the same (sort of) for journals, and distinguish by the
Journal flag.

I did quick audit and just found setup_conf, run() and md_update_sb().
If you could do an audit to that would be good.  I'd be surprised if you
find many more places where Journal needs to be tested with ->raid_disk.

Thanks,
NeilBrown

>
> Thanks,
> Shaohua

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] md: don't export log device
  2015-10-08  6:04           ` Neil Brown
@ 2015-10-13 12:07             ` Christoph Hellwig
  2015-10-13 20:41               ` Neil Brown
  0 siblings, 1 reply; 19+ messages in thread
From: Christoph Hellwig @ 2015-10-13 12:07 UTC (permalink / raw)
  To: Neil Brown
  Cc: Shaohua Li, linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

On Thu, Oct 08, 2015 at 05:04:54PM +1100, Neil Brown wrote:
> Having two disks with ->raid_disk==0 does seem a little weird, but we do
> already have that in some cases.
> When you have a hot-replace going, both the original and the replacement
> have the same ->raid_disk numbers.  They can be distinguished by the
> Replacement flag.
> I'm suggesting the same (sort of) for journals, and distinguish by the
> Journal flag.
> 
> I did quick audit and just found setup_conf, run() and md_update_sb().
> If you could do an audit to that would be good.  I'd be surprised if you
> find many more places where Journal needs to be tested with ->raid_disk.

Overloading positive numbers for the journal disk sounds like a bad idea
to me as it will cause a lot of confusion.  I'd rather assign specific
negative values to special roles outside the actual rate.  This will
require an initial audit, but give us nicely understandable rules later
on.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/6] md: don't export log device
  2015-10-13 12:07             ` Christoph Hellwig
@ 2015-10-13 20:41               ` Neil Brown
  0 siblings, 0 replies; 19+ messages in thread
From: Neil Brown @ 2015-10-13 20:41 UTC (permalink / raw)
  Cc: Shaohua Li, linux-raid, Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 1830 bytes --]

Christoph Hellwig <hch@infradead.org> writes:

> On Thu, Oct 08, 2015 at 05:04:54PM +1100, Neil Brown wrote:
>> Having two disks with ->raid_disk==0 does seem a little weird, but we do
>> already have that in some cases.
>> When you have a hot-replace going, both the original and the replacement
>> have the same ->raid_disk numbers.  They can be distinguished by the
>> Replacement flag.
>> I'm suggesting the same (sort of) for journals, and distinguish by the
>> Journal flag.
>> 
>> I did quick audit and just found setup_conf, run() and md_update_sb().
>> If you could do an audit to that would be good.  I'd be surprised if you
>> find many more places where Journal needs to be tested with ->raid_disk.
>
> Overloading positive numbers for the journal disk sounds like a bad idea
> to me as it will cause a lot of confusion.  I'd rather assign specific
> negative values to special roles outside the actual rate.  This will
> require an initial audit, but give us nicely understandable rules later
> on.

The positive numbers are in different name-spaces:
 primary raid disks
 replacement raid disks
 journal disks

While confusion is always possible, I think keeping these separate is
not that hart.  For example ->raid_disk is primarily used to place the
rdev in an array after which it is the position in the array which is
primarily used.  The value of ->raid_disk is mostly tested only for
whether it is <0 or not.

I'm certainly happy to do our best to remove sources of confusion from
the user-space view of these numbers (e.g. put 'journal' or 'none' in
the 'slot' file, not '0' (and definitely not "-5")).  But internally to
the kernel it is important to remember that it isn't a primary key - and
once you do that there is not much chance of confusion.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2015-10-13 20:41 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-10-04 16:20 [PATCH 0/6] raid5-cache fixes Shaohua Li
2015-10-04 16:20 ` [PATCH 1/6] md: show journal for journal disk in disk state sysfs Shaohua Li
2015-10-04 16:20 ` [PATCH 2/6] raid5-cache: move reclaim stop to quiesce Shaohua Li
2015-10-04 16:20 ` [PATCH 3/6] raid5-cache: add trim support for log Shaohua Li
2015-10-08  1:53   ` Neil Brown
2015-10-04 16:20 ` [PATCH 4/6] md: don't export log device Shaohua Li
2015-10-08  1:57   ` Neil Brown
2015-10-08  3:16     ` Shaohua Li
2015-10-08  4:16       ` Neil Brown
2015-10-08  4:31         ` Shaohua Li
2015-10-08  6:04           ` Neil Brown
2015-10-13 12:07             ` Christoph Hellwig
2015-10-13 20:41               ` Neil Brown
2015-10-04 16:20 ` [PATCH 5/6] md: set In_Sync for log disk Shaohua Li
2015-10-04 16:20 ` [PATCH 6/6] raid5-cache: IO error handling Shaohua Li
2015-10-08  2:10 ` [PATCH 0/6] raid5-cache fixes Neil Brown
2015-10-08  2:56   ` Shaohua Li
2015-10-08  3:18     ` Neil Brown
2015-10-08  3:24       ` Shaohua Li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.