All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] raid5-cache fixes
@ 2015-09-30 23:15 Shaohua Li
  2015-09-30 23:15 ` [PATCH 1/3] raid5-cache: add trim support for log Shaohua Li
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Shaohua Li @ 2015-09-30 23:15 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Neil,

These are some fixes for raid5-cache. The first one fixes the issue related to
trim. The last two make raid5-cache support log device IO error.

Thanks,
Shaohua

Shaohua Li (3):
  raid5-cache: add trim support for log
  md: don't export log device
  raid5-cache: IO error handling

 drivers/md/md.c          |  4 ++--
 drivers/md/raid5-cache.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 62 insertions(+), 4 deletions(-)

-- 
2.4.6


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] raid5-cache: add trim support for log
  2015-09-30 23:15 [PATCH 0/3] raid5-cache fixes Shaohua Li
@ 2015-09-30 23:15 ` Shaohua Li
  2015-10-01  4:41   ` Neil Brown
  2015-09-30 23:15 ` [PATCH 2/3] md: don't export log device Shaohua Li
  2015-09-30 23:15 ` [PATCH 3/3] raid5-cache: IO error handling Shaohua Li
  2 siblings, 1 reply; 8+ messages in thread
From: Shaohua Li @ 2015-09-30 23:15 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

Since superblock is updated infrequently, we do a simple trim of log
disk (a synchronous trim)

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index a02f9ce..afc3b6b 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -654,6 +654,48 @@ static void r5l_kick_io_unit(struct r5l_log *log)
 }
 
 static void r5l_write_super(struct r5l_log *log, sector_t cp);
+static void r5l_write_super_and_discard_space(struct r5l_log *log,
+	sector_t end)
+{
+	struct block_device *bdev = log->rdev->bdev;
+	struct mddev *mddev;
+
+	r5l_write_super(log, end);
+
+	if (!blk_queue_discard(bdev_get_queue(bdev)))
+		return;
+
+	/* discard destroy old data in log, so force a super update */
+	mddev = log->rdev->mddev;
+	/*
+	 * mddev->thread could be shut down already in raid array stop. At that
+	 * time, we should already lock reconfig_mutex
+	 * */
+	if (!mddev->thread) {
+		WARN_ON(!mddev_is_locked(mddev));
+		md_update_sb(mddev, 1);
+	} else {
+		set_bit(MD_CHANGE_PENDING, &mddev->flags);
+		md_wakeup_thread(mddev->thread);
+		wait_event(mddev->sb_wait,
+			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
+	}
+
+	if (log->last_checkpoint < end) {
+		blkdev_issue_discard(bdev,
+				log->last_checkpoint + log->rdev->data_offset,
+				end - log->last_checkpoint, GFP_NOIO, 0);
+	} else {
+		blkdev_issue_discard(bdev,
+				log->last_checkpoint + log->rdev->data_offset,
+				log->device_size - log->last_checkpoint,
+				GFP_NOIO, 0);
+		blkdev_issue_discard(bdev, log->rdev->data_offset, end,
+				GFP_NOIO, 0);
+	}
+}
+
+
 static void r5l_do_reclaim(struct r5l_log *log)
 {
 	struct r5l_io_unit *io, *last;
@@ -709,7 +751,7 @@ static void r5l_do_reclaim(struct r5l_log *log)
 	 * here, because the log area might be reused soon and we don't want to
 	 * confuse recovery
 	 */
-	r5l_write_super(log, last->log_start);
+	r5l_write_super_and_discard_space(log, last->log_start);
 
 	mutex_lock(&log->io_mutex);
 	log->last_checkpoint = last->log_start;
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] md: don't export log device
  2015-09-30 23:15 [PATCH 0/3] raid5-cache fixes Shaohua Li
  2015-09-30 23:15 ` [PATCH 1/3] raid5-cache: add trim support for log Shaohua Li
@ 2015-09-30 23:15 ` Shaohua Li
  2015-10-01  4:45   ` Neil Brown
  2015-09-30 23:15 ` [PATCH 3/3] raid5-cache: IO error handling Shaohua Li
  2 siblings, 1 reply; 8+ messages in thread
From: Shaohua Li @ 2015-09-30 23:15 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

If there is IO error in log device, don't export it like other raid
disks. Otherwise we get kernel crash in different places since
rdev->bdev, rdev->mddev becomes NULL

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/md.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index f1cbb08..0b1d7ef 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -2519,7 +2519,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
 		else
 			err = -EBUSY;
 	} else if (cmd_match(buf, "remove")) {
-		if (rdev->raid_disk >= 0)
+		if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
 			err = -EBUSY;
 		else {
 			struct mddev *mddev = rdev->mddev;
@@ -6040,7 +6040,7 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
 	clear_bit(Blocked, &rdev->flags);
 	remove_and_add_spares(mddev, rdev);
 
-	if (rdev->raid_disk >= 0)
+	if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
 		goto busy;
 
 	if (mddev_is_clustered(mddev))
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] raid5-cache: IO error handling
  2015-09-30 23:15 [PATCH 0/3] raid5-cache fixes Shaohua Li
  2015-09-30 23:15 ` [PATCH 1/3] raid5-cache: add trim support for log Shaohua Li
  2015-09-30 23:15 ` [PATCH 2/3] md: don't export log device Shaohua Li
@ 2015-09-30 23:15 ` Shaohua Li
  2015-10-01  4:50   ` Neil Brown
  2015-10-04 14:44   ` Christoph Hellwig
  2 siblings, 2 replies; 8+ messages in thread
From: Shaohua Li @ 2015-09-30 23:15 UTC (permalink / raw)
  To: linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams, neilb

There are 3 places the raid5-cache dispatches IO. The discard IO error
doesn't matter, so we ignore it. The superblock write IO error can be
handled in MD core. The remaining are log write and flush. When the IO
error happens, we simply fail all raid disks and continue the stripe
state machine. The MD/raid5 core can handle it (for example, mark all
disks faulty, report bio error and so on).

Signed-off-by: Shaohua Li <shli@fb.com>
---
 drivers/md/raid5-cache.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index afc3b6b..430ce5c 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -223,7 +223,16 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
 	io->state = state;
 }
 
-/* XXX: totally ignores I/O errors */
+static void r5l_log_io_error(struct r5l_log *log)
+{
+	struct md_rdev *rdev;
+
+	rcu_read_lock();
+	rdev_for_each_rcu(rdev, log->rdev->mddev)
+		md_error(log->rdev->mddev, rdev);
+	rcu_read_unlock();
+}
+
 static void r5l_log_endio(struct bio *bio)
 {
 	struct r5l_io_unit *io = bio->bi_private;
@@ -232,6 +241,9 @@ static void r5l_log_endio(struct bio *bio)
 
 	bio_put(bio);
 
+	if (bio->bi_error)
+		r5l_log_io_error(log);
+
 	if (!atomic_dec_and_test(&io->pending_io))
 		return;
 
@@ -594,6 +606,9 @@ static void r5l_log_flush_endio(struct bio *bio)
 	struct r5l_io_unit *io;
 	struct stripe_head *sh;
 
+	if (bio->bi_error)
+		r5l_log_io_error(log);
+
 	spin_lock_irqsave(&log->io_list_lock, flags);
 	list_for_each_entry(io, &log->flushing_ios, log_sibling) {
 		while (!list_empty(&io->stripe_list)) {
@@ -681,6 +696,7 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
 			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
 	}
 
+	/* discard IO error really doesn't matter, ignore it */
 	if (log->last_checkpoint < end) {
 		blkdev_issue_discard(bdev,
 				log->last_checkpoint + log->rdev->data_offset,
-- 
2.4.6


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/3] raid5-cache: add trim support for log
  2015-09-30 23:15 ` [PATCH 1/3] raid5-cache: add trim support for log Shaohua Li
@ 2015-10-01  4:41   ` Neil Brown
  0 siblings, 0 replies; 8+ messages in thread
From: Neil Brown @ 2015-10-01  4:41 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 2643 bytes --]

Shaohua Li <shli@fb.com> writes:

> Since superblock is updated infrequently, we do a simple trim of log
> disk (a synchronous trim)
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5-cache.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index a02f9ce..afc3b6b 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -654,6 +654,48 @@ static void r5l_kick_io_unit(struct r5l_log *log)
>  }
>  
>  static void r5l_write_super(struct r5l_log *log, sector_t cp);
> +static void r5l_write_super_and_discard_space(struct r5l_log *log,
> +	sector_t end)
> +{
> +	struct block_device *bdev = log->rdev->bdev;
> +	struct mddev *mddev;
> +
> +	r5l_write_super(log, end);
> +
> +	if (!blk_queue_discard(bdev_get_queue(bdev)))
> +		return;
> +
> +	/* discard destroy old data in log, so force a super update */
> +	mddev = log->rdev->mddev;
> +	/*
> +	 * mddev->thread could be shut down already in raid array stop. At that
> +	 * time, we should already lock reconfig_mutex
> +	 * */
> +	if (!mddev->thread) {
> +		WARN_ON(!mddev_is_locked(mddev));
> +		md_update_sb(mddev, 1);
> +	} else {
> +		set_bit(MD_CHANGE_PENDING, &mddev->flags);
> +		md_wakeup_thread(mddev->thread);
> +		wait_event(mddev->sb_wait,
> +			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
> +	}

I didn't like this.  It looks clumsy to me.

So I went to have a look at the code to understand why it was needed.
I found that r5l_exit_log() was being called from free_conf().
I didn't notice that before.

->free() is only supposed to free, not write anything.  It could be too
late to write anything.

You need to get the raid5_quiesce(1) call to stablise the array.  It can
do the final r5l_do_reclaim().

So it is OK for free_conf to call r5l_exit_log as long as it only
deregisters the thread and frees the data structures.
The "r5l_do_reclaim" needs to be moved out.

I wonder where the md_update_sb() should go...

We currently calls "stop_writes" and then "mddev_detach".
So we shouldn't be writing anything by the time we get to mddev_detach,
but in there we wait for writes to complete.  That looks wrong.

I might move some of that stuff from mddev_detach to __md_stop_writes.
Then md_stop_writes can call ->quiesce and then md_update_sb().  That is
enough for raid5-cache to just call r5l_do_reclaim in raid5_quiesce, and
have the md_update_sb() called at the right time.

I'll see if i can make that work.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] md: don't export log device
  2015-09-30 23:15 ` [PATCH 2/3] md: don't export log device Shaohua Li
@ 2015-10-01  4:45   ` Neil Brown
  0 siblings, 0 replies; 8+ messages in thread
From: Neil Brown @ 2015-10-01  4:45 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 1351 bytes --]

Shaohua Li <shli@fb.com> writes:

> If there is IO error in log device, don't export it like other raid
> disks. Otherwise we get kernel crash in different places since
> rdev->bdev, rdev->mddev becomes NULL
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/md.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index f1cbb08..0b1d7ef 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -2519,7 +2519,7 @@ state_store(struct md_rdev *rdev, const char *buf, size_t len)
>  		else
>  			err = -EBUSY;
>  	} else if (cmd_match(buf, "remove")) {
> -		if (rdev->raid_disk >= 0)
> +		if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
>  			err = -EBUSY;
>  		else {
>  			struct mddev *mddev = rdev->mddev;
> @@ -6040,7 +6040,7 @@ static int hot_remove_disk(struct mddev *mddev, dev_t dev)
>  	clear_bit(Blocked, &rdev->flags);
>  	remove_and_add_spares(mddev, rdev);
>  
> -	if (rdev->raid_disk >= 0)
> +	if (rdev->raid_disk >= 0 || test_bit(Journal, &rdev->flags))
>  		goto busy;
>  
>  	if (mddev_is_clustered(mddev))
> -- 
> 2.4.6

I'd rather keep these details local to the RAID5 code...

Can we just have raid5_remove_disk return -EBUSY when an attempt is made
to remove the Journal device?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] raid5-cache: IO error handling
  2015-09-30 23:15 ` [PATCH 3/3] raid5-cache: IO error handling Shaohua Li
@ 2015-10-01  4:50   ` Neil Brown
  2015-10-04 14:44   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Neil Brown @ 2015-10-01  4:50 UTC (permalink / raw)
  To: Shaohua Li, linux-raid; +Cc: Kernel-team, songliubraving, hch, dan.j.williams

[-- Attachment #1: Type: text/plain, Size: 2529 bytes --]

Shaohua Li <shli@fb.com> writes:

> There are 3 places the raid5-cache dispatches IO. The discard IO error
> doesn't matter, so we ignore it. The superblock write IO error can be
> handled in MD core. The remaining are log write and flush. When the IO
> error happens, we simply fail all raid disks and continue the stripe
> state machine. The MD/raid5 core can handle it (for example, mark all
> disks faulty, report bio error and so on).
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>  drivers/md/raid5-cache.c | 18 +++++++++++++++++-
>  1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
> index afc3b6b..430ce5c 100644
> --- a/drivers/md/raid5-cache.c
> +++ b/drivers/md/raid5-cache.c
> @@ -223,7 +223,16 @@ static void __r5l_set_io_unit_state(struct r5l_io_unit *io,
>  	io->state = state;
>  }
>  
> -/* XXX: totally ignores I/O errors */
> +static void r5l_log_io_error(struct r5l_log *log)
> +{
> +	struct md_rdev *rdev;
> +
> +	rcu_read_lock();
> +	rdev_for_each_rcu(rdev, log->rdev->mddev)
> +		md_error(log->rdev->mddev, rdev);
> +	rcu_read_unlock();
> +}

This fails spare devices too... seems a bit heavy handed.

If the journal device fails we should still be able to read from the
array, just not write.

So can we just enhance the
	if (s.failed > conf->max_degraded) {
test in handle_stripe(), and probably improve has_failed() too??

Thanks,
NeilBrown


> +
>  static void r5l_log_endio(struct bio *bio)
>  {
>  	struct r5l_io_unit *io = bio->bi_private;
> @@ -232,6 +241,9 @@ static void r5l_log_endio(struct bio *bio)
>  
>  	bio_put(bio);
>  
> +	if (bio->bi_error)
> +		r5l_log_io_error(log);
> +
>  	if (!atomic_dec_and_test(&io->pending_io))
>  		return;
>  
> @@ -594,6 +606,9 @@ static void r5l_log_flush_endio(struct bio *bio)
>  	struct r5l_io_unit *io;
>  	struct stripe_head *sh;
>  
> +	if (bio->bi_error)
> +		r5l_log_io_error(log);
> +
>  	spin_lock_irqsave(&log->io_list_lock, flags);
>  	list_for_each_entry(io, &log->flushing_ios, log_sibling) {
>  		while (!list_empty(&io->stripe_list)) {
> @@ -681,6 +696,7 @@ static void r5l_write_super_and_discard_space(struct r5l_log *log,
>  			   !test_bit(MD_CHANGE_PENDING, &mddev->flags));
>  	}
>  
> +	/* discard IO error really doesn't matter, ignore it */
>  	if (log->last_checkpoint < end) {
>  		blkdev_issue_discard(bdev,
>  				log->last_checkpoint + log->rdev->data_offset,
> -- 
> 2.4.6

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 818 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 3/3] raid5-cache: IO error handling
  2015-09-30 23:15 ` [PATCH 3/3] raid5-cache: IO error handling Shaohua Li
  2015-10-01  4:50   ` Neil Brown
@ 2015-10-04 14:44   ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2015-10-04 14:44 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, Kernel-team, songliubraving, dan.j.williams, neilb

On Wed, Sep 30, 2015 at 04:15:40PM -0700, Shaohua Li wrote:
> There are 3 places the raid5-cache dispatches IO. The discard IO error
> doesn't matter, so we ignore it. The superblock write IO error can be
> handled in MD core. The remaining are log write and flush. When the IO
> error happens, we simply fail all raid disks and continue the stripe
> state machine. The MD/raid5 core can handle it (for example, mark all
> disks faulty, report bio error and so on).

This introduces a use after free, which will always report an I/O error
when SLAB poisoning is enabled.  The following patch needs to be folded
into it to fix that:

diff --git a/drivers/md/raid5-cache.c b/drivers/md/raid5-cache.c
index 430ce5c..8d93af1 100644
--- a/drivers/md/raid5-cache.c
+++ b/drivers/md/raid5-cache.c
@@ -239,11 +239,11 @@ static void r5l_log_endio(struct bio *bio)
 	struct r5l_log *log = io->log;
 	unsigned long flags;
 
-	bio_put(bio);
-
 	if (bio->bi_error)
 		r5l_log_io_error(log);
 
+	bio_put(bio);
+
 	if (!atomic_dec_and_test(&io->pending_io))
 		return;
 

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2015-10-04 14:44 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-30 23:15 [PATCH 0/3] raid5-cache fixes Shaohua Li
2015-09-30 23:15 ` [PATCH 1/3] raid5-cache: add trim support for log Shaohua Li
2015-10-01  4:41   ` Neil Brown
2015-09-30 23:15 ` [PATCH 2/3] md: don't export log device Shaohua Li
2015-10-01  4:45   ` Neil Brown
2015-09-30 23:15 ` [PATCH 3/3] raid5-cache: IO error handling Shaohua Li
2015-10-01  4:50   ` Neil Brown
2015-10-04 14:44   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.