linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] dm: track io errors per mapped device
@ 2020-05-07 23:05 Kjetil Orbekk
  2020-05-08  0:18 ` Alasdair G Kergon
  2020-05-08  1:12 ` Chaitanya Kulkarni
  0 siblings, 2 replies; 7+ messages in thread
From: Kjetil Orbekk @ 2020-05-07 23:05 UTC (permalink / raw)
  To: Alasdair Kergon, Mike Snitzer, dm-devel, linux-block
  Cc: harshads, Khazhismel Kumykov, Kjetil Orbekk

From: Khazhismel Kumykov <khazhy@google.com>

This will track ioerr_cnt on all dm targets and expose it as
<device>/dm/ioerr_cnt.

Signed-off-by: Khazhismel Kumykov <khazhy@google.com>
[kj@orbekk.com: added support for bio based devices in dm.c]
Signed-off-by: Kjetil Orbekk <kj@orbekk.com>
---
 drivers/md/dm-core.h  |  2 ++
 drivers/md/dm-rq.c    |  4 ++++
 drivers/md/dm-sysfs.c |  8 ++++++++
 drivers/md/dm.c       | 10 ++++++++++
 drivers/md/dm.h       |  1 +
 5 files changed, 25 insertions(+)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index c4ef1fceead6..c6cc0f9e4d9a 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -108,6 +108,8 @@ struct mapped_device {
 
 	struct dm_stats stats;
 
+	atomic_t ioerr_cnt;
+
 	/* for blk-mq request-based DM support */
 	struct blk_mq_tag_set *tag_set;
 	bool init_tio_pdu:1;
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 3f8577e2c13b..1c1fe96ca7a4 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -171,6 +171,8 @@ static void dm_end_request(struct request *clone, blk_status_t error)
 	tio->ti->type->release_clone_rq(clone, NULL);
 
 	rq_end_stats(md, rq);
+	if (error)
+		atomic_inc(&md->ioerr_cnt);
 	blk_mq_end_request(rq, error);
 	rq_completed(md);
 }
@@ -268,6 +270,8 @@ static void dm_softirq_done(struct request *rq)
 		struct mapped_device *md = tio->md;
 
 		rq_end_stats(md, rq);
+		if (tio->error)
+			atomic_inc(&md->ioerr_cnt);
 		blk_mq_end_request(rq, tio->error);
 		rq_completed(md);
 		return;
diff --git a/drivers/md/dm-sysfs.c b/drivers/md/dm-sysfs.c
index a05fcd50e1b9..5d59790b4b17 100644
--- a/drivers/md/dm-sysfs.c
+++ b/drivers/md/dm-sysfs.c
@@ -74,6 +74,12 @@ static ssize_t dm_attr_name_show(struct mapped_device *md, char *buf)
 	return strlen(buf);
 }
 
+static ssize_t dm_attr_ioerr_cnt_show(struct mapped_device *md, char *buf)
+{
+	sprintf(buf, "%d\n", dm_ioerr_cnt(md));
+	return strlen(buf);
+}
+
 static ssize_t dm_attr_uuid_show(struct mapped_device *md, char *buf)
 {
 	if (dm_copy_name_and_uuid(md, NULL, buf))
@@ -99,6 +105,7 @@ static ssize_t dm_attr_use_blk_mq_show(struct mapped_device *md, char *buf)
 }
 
 static DM_ATTR_RO(name);
+static DM_ATTR_RO(ioerr_cnt);
 static DM_ATTR_RO(uuid);
 static DM_ATTR_RO(suspended);
 static DM_ATTR_RO(use_blk_mq);
@@ -106,6 +113,7 @@ static DM_ATTR_RW(rq_based_seq_io_merge_deadline);
 
 static struct attribute *dm_attrs[] = {
 	&dm_attr_name.attr,
+	&dm_attr_ioerr_cnt.attr,
 	&dm_attr_uuid.attr,
 	&dm_attr_suspended.attr,
 	&dm_attr_use_blk_mq.attr,
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index db9e46114653..7a264379473e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1017,6 +1017,9 @@ static void clone_endio(struct bio *bio)
 			disable_write_zeroes(md);
 	}
 
+	if (error)
+		atomic_inc(&md->ioerr_cnt);
+
 	if (endio) {
 		int r = endio(tio->ti, bio, &error);
 		switch (r) {
@@ -1304,6 +1307,7 @@ static blk_qc_t __map_bio(struct dm_target_io *tio)
 		break;
 	case DM_MAPIO_KILL:
 		free_tio(tio);
+		atomic_inc(&md->ioerr_cnt);
 		dec_pending(io, BLK_STS_IOERR);
 		break;
 	case DM_MAPIO_REQUEUE:
@@ -2014,6 +2018,7 @@ static struct mapped_device *alloc_dev(int minor)
 		goto bad;
 
 	dm_stats_init(&md->stats);
+	atomic_set(&md->ioerr_cnt, 0);
 
 	/* Populate the mapping, nobody knows we exist yet */
 	spin_lock(&_minor_lock);
@@ -2992,6 +2997,11 @@ int dm_noflush_suspending(struct dm_target *ti)
 }
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
+int dm_ioerr_cnt(struct mapped_device *md)
+{
+	return atomic_read(&md->ioerr_cnt);
+}
+
 struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type,
 					    unsigned integrity, unsigned per_io_data_size,
 					    unsigned min_pool_size)
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index d7c4f6606b5f..fafb9d379ce9 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -82,6 +82,7 @@ void dm_unlock_md_type(struct mapped_device *md);
 void dm_set_md_type(struct mapped_device *md, enum dm_queue_mode type);
 enum dm_queue_mode dm_get_md_type(struct mapped_device *md);
 struct target_type *dm_get_immutable_target_type(struct mapped_device *md);
+int dm_ioerr_cnt(struct mapped_device *md);
 
 int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
 
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] dm: track io errors per mapped device
  2020-05-07 23:05 [PATCH] dm: track io errors per mapped device Kjetil Orbekk
@ 2020-05-08  0:18 ` Alasdair G Kergon
  2020-05-08 19:18   ` kj
  2020-05-08  1:12 ` Chaitanya Kulkarni
  1 sibling, 1 reply; 7+ messages in thread
From: Alasdair G Kergon @ 2020-05-08  0:18 UTC (permalink / raw)
  To: Kjetil Orbekk
  Cc: Alasdair Kergon, Mike Snitzer, dm-devel, linux-block, harshads,
	Khazhismel Kumykov

On Thu, May 07, 2020 at 07:05:33PM -0400, Kjetil Orbekk wrote:
> This will track ioerr_cnt on all dm targets and expose it as
> <device>/dm/ioerr_cnt.

How do you propose to use this?
What are you trying to measure and why?
- How exact must the number be to meet your requirements?

Or to put it another way, do you need to know the exact number you are
exposing, or do you derive something else from this which could also be
derived from an alternative number?

In particular, given the way we split and clone and stack devices (so
there may be an element of multiplication), and reload tables (so
existing values might become irrelevant), did you consider alternative
semantics before selecting this approach?
(Or to put it another way, is there a need to reset it or track
the value since the last resume?)

(Documentation is also needed - which ought to explain the semantics
and how the observed values interact with the use of device-mapper
features.)

Alasdair


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] dm: track io errors per mapped device
  2020-05-07 23:05 [PATCH] dm: track io errors per mapped device Kjetil Orbekk
  2020-05-08  0:18 ` Alasdair G Kergon
@ 2020-05-08  1:12 ` Chaitanya Kulkarni
  2020-05-08 19:22   ` kj
  1 sibling, 1 reply; 7+ messages in thread
From: Chaitanya Kulkarni @ 2020-05-08  1:12 UTC (permalink / raw)
  To: Kjetil Orbekk, Alasdair Kergon, Mike Snitzer, dm-devel, linux-block
  Cc: harshads, Khazhismel Kumykov

On 05/07/2020 04:06 PM, Kjetil Orbekk wrote:
> +		if (tio->error)
> +			atomic_inc(&md->ioerr_cnt);

Given that there are so many errors how would user know what
kind of error is generated and how many times?


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] dm: track io errors per mapped device
  2020-05-08  0:18 ` Alasdair G Kergon
@ 2020-05-08 19:18   ` kj
  0 siblings, 0 replies; 7+ messages in thread
From: kj @ 2020-05-08 19:18 UTC (permalink / raw)
  To: Alasdair G Kergon
  Cc: Mike Snitzer, dm-devel, linux-block, harshads,
	Khazhismel Kumykov, orbekk

Hi Alasdair,

Thank you for your time and comments.

On Thu, May 7, 2020, at 20:18, Alasdair G Kergon wrote:
> On Thu, May 07, 2020 at 07:05:33PM -0400, Kjetil Orbekk wrote:
> > This will track ioerr_cnt on all dm targets and expose it as
> > <device>/dm/ioerr_cnt.
> 
> How do you propose to use this?
> What are you trying to measure and why?
> - How exact must the number be to meet your requirements?

This is proposed in order to detect if I/O errors have occurred on the dm device. Deriving this from the ioerr_cnt from the underlying device was considered, but it's not reliable for dm devices that tolerate some underlying errors (raid setups and similar).

> Or to put it another way, do you need to know the exact number you are
> exposing, or do you derive something else from this which could also be
> derived from an alternative number?

Our use case is to detect if I/O errors have happened at all. We expect the ioerr_cnt to increase when there are errors, but the precise number is not important in our environment.

> In particular, given the way we split and clone and stack devices (so
> there may be an element of multiplication), and reload tables (so
> existing values might become irrelevant), did you consider alternative
> semantics before selecting this approach?
>
> (Or to put it another way, is there a need to reset it or track
> the value since the last resume?)

I'm not very familiar with dm and I don't follow how the cloning and stacking will lead to multiplication. Do you have any suggestions for how I might deal with that?

Resetting the value would not be desirable for our use case, because the probing process can miss I/O errors that happen right before a device is suspended and then resumed, though I can imagine that there might be cases where one would want that. Users could look at increases in ioerr_cnt instead of the absolute numbers, or I could provide a way to reset the counter if desired.

> (Documentation is also needed - which ought to explain the semantics
> and how the observed values interact with the use of device-mapper
> features.)

I will be happy to provide an updated patch with inline documentation once I have addressed your comments. Are there any other places where I need to update documentation?

-- 
Stay safe!
KJ

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] dm: track io errors per mapped device
  2020-05-08  1:12 ` Chaitanya Kulkarni
@ 2020-05-08 19:22   ` kj
  2020-05-08 21:09     ` Mike Snitzer
  0 siblings, 1 reply; 7+ messages in thread
From: kj @ 2020-05-08 19:22 UTC (permalink / raw)
  To: Chaitanya Kulkarni, Alasdair G Kergon, Mike Snitzer, dm-devel,
	linux-block
  Cc: harshads, Khazhismel Kumykov

On Thu, May 7, 2020, at 21:12, Chaitanya Kulkarni wrote:
> On 05/07/2020 04:06 PM, Kjetil Orbekk wrote:
> > +		if (tio->error)
> > +			atomic_inc(&md->ioerr_cnt);
> 
> Given that there are so many errors how would user know what
> kind of error is generated and how many times?

The intended use case is to provide an easy way to check if errors have occurred at all, and then the user needs to investigate using other means. I replied with more detail to Alasdair's email.

-- 
KJ

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: dm: track io errors per mapped device
  2020-05-08 19:22   ` kj
@ 2020-05-08 21:09     ` Mike Snitzer
  2020-05-09 16:01       ` kj
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Snitzer @ 2020-05-08 21:09 UTC (permalink / raw)
  To: kj
  Cc: Chaitanya Kulkarni, Alasdair G Kergon, dm-devel, linux-block,
	harshads, Khazhismel Kumykov

On Fri, May 08 2020 at  3:22pm -0400,
kj@orbekk.com <kj@orbekk.com> wrote:

> On Thu, May 7, 2020, at 21:12, Chaitanya Kulkarni wrote:
> > On 05/07/2020 04:06 PM, Kjetil Orbekk wrote:
> > > +		if (tio->error)
> > > +			atomic_inc(&md->ioerr_cnt);
> > 
> > Given that there are so many errors how would user know what
> > kind of error is generated and how many times?
> 
> The intended use case is to provide an easy way to check if errors
> have occurred at all, and then the user needs to investigate using
> other means. I replied with more detail to Alasdair's email.

But most operations initiated by the user that fail will be felt by the
upper layer that the user is interfacing with.

Only exception that springs to mind is dm-writecache's writeback that
occurs after writes have already been acknowledged back to the upper
layers -- in that case dm-writecache provides an error flag that is
exposed via writecache_status.

Anyway, just not seeing why you need a upper-layer use-case agnostic
flag to know an error occurred in DM.  That layers that interface with
the DM device will have been notified of all errors.

And why just for DM devices?  Why not all block devices (in which case
DM would get this feature "for free")?

Mike


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: dm: track io errors per mapped device
  2020-05-08 21:09     ` Mike Snitzer
@ 2020-05-09 16:01       ` kj
  0 siblings, 0 replies; 7+ messages in thread
From: kj @ 2020-05-09 16:01 UTC (permalink / raw)
  To: Mike Snitzer
  Cc: Chaitanya Kulkarni, Alasdair G Kergon, dm-devel, linux-block,
	harshads, Khazhismel Kumykov

On Fri, May 8, 2020, at 17:09, Mike Snitzer wrote:
> On Fri, May 08 2020 at  3:22pm -0400,
> kj@orbekk.com <kj@orbekk.com> wrote:
> 
> > On Thu, May 7, 2020, at 21:12, Chaitanya Kulkarni wrote:
> > > On 05/07/2020 04:06 PM, Kjetil Orbekk wrote:
> > > > +		if (tio->error)
> > > > +			atomic_inc(&md->ioerr_cnt);
> > > 
> > > Given that there are so many errors how would user know what
> > > kind of error is generated and how many times?
> > 
> > The intended use case is to provide an easy way to check if errors
> > have occurred at all, and then the user needs to investigate using
> > other means. I replied with more detail to Alasdair's email.
> 
> But most operations initiated by the user that fail will be felt by the
> upper layer that the user is interfacing with.
> 
> Only exception that springs to mind is dm-writecache's writeback that
> occurs after writes have already been acknowledged back to the upper
> layers -- in that case dm-writecache provides an error flag that is
> exposed via writecache_status.
> 
> Anyway, just not seeing why you need a upper-layer use-case agnostic
> flag to know an error occurred in DM.  That layers that interface with
> the DM device will have been notified of all errors.

It's used as a signal by a probing process which is not in the IO path itself.

> And why just for DM devices?  Why not all block devices (in which case
> DM would get this feature "for free")?

This sounds like a good idea to me. Looks like I could add this to disk_stats and expose it through the block/<device>/stats file. I'll try to see if this is feasible.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-05-09 16:01 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-07 23:05 [PATCH] dm: track io errors per mapped device Kjetil Orbekk
2020-05-08  0:18 ` Alasdair G Kergon
2020-05-08 19:18   ` kj
2020-05-08  1:12 ` Chaitanya Kulkarni
2020-05-08 19:22   ` kj
2020-05-08 21:09     ` Mike Snitzer
2020-05-09 16:01       ` kj

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).