* SMART disk monitoring
@ 2017-11-10 17:58 Sage Weil
  2017-11-10 23:45 ` Ali Maredia
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2017-11-10 17:58 UTC (permalink / raw)
  To: ceph-devel, yaarit

Hi everyone,

I'm delighted to share that Yaarit Hatuka has been selected for an 
Outreachy internship adding support for monitoring SMART information for 
OSD devices!  I'm looking forward to working with her over the next few 
months to add support to the osd, mgr, and some smartmontools 
infrastructure to make this all work.

Congratulations!
sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-10 17:58 SMART disk monitoring Sage Weil
@ 2017-11-10 23:45 ` Ali Maredia
  2017-11-11  3:36   ` Yaarit Hatuka
  0 siblings, 1 reply; 18+ messages in thread
From: Ali Maredia @ 2017-11-10 23:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel, yaarit

Congrats Yaarit! Welcome!

-Ali

----- Original Message -----
> From: "Sage Weil" <sweil@redhat.com>
> To: ceph-devel@vger.kernel.org, yaarit@gmail.com
> Sent: Friday, November 10, 2017 12:58:16 PM
> Subject: SMART disk monitoring
> 
> Hi everyone,
> 
> I'm delighted to share that Yaarit Hatuka has been selected for an
> Outreachy internship adding support for monitoring SMART information for
> OSD devices!  I'm looking forward to working with her over the next few
> months to add support to the osd, mgr, and some smartmontools
> infrastructure to make this all work.
> 
> Congratulations!
> sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-10 23:45 ` Ali Maredia
@ 2017-11-11  3:36   ` Yaarit Hatuka
  2017-11-12 17:16     ` Lars Marowsky-Bree
  0 siblings, 1 reply; 18+ messages in thread
From: Yaarit Hatuka @ 2017-11-11  3:36 UTC (permalink / raw)
  To: ceph-devel

Many thanks! I'm very excited to join Ceph's outstanding community!
I'm looking forward to working on this challenging project, and I'm
very grateful for the opportunity to be guided by Sage.

Best,
Yaarit

On Fri, Nov 10, 2017 at 6:45 PM, Ali Maredia <amaredia@redhat.com> wrote:
> Congrats Yaarit! Welcome!
>
> -Ali
>
> ----- Original Message -----
>> From: "Sage Weil" <sweil@redhat.com>
>> To: ceph-devel@vger.kernel.org, yaarit@gmail.com
>> Sent: Friday, November 10, 2017 12:58:16 PM
>> Subject: SMART disk monitoring
>>
>> Hi everyone,
>>
>> I'm delighted to share that Yaarit Hatuka has been selected for an
>> Outreachy internship adding support for monitoring SMART information for
>> OSD devices!  I'm looking forward to working with her over the next few
>> months to add support to the osd, mgr, and some smartmontools
>> infrastructure to make this all work.
>>
>> Congratulations!
>> sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-11  3:36   ` Yaarit Hatuka
@ 2017-11-12 17:16     ` Lars Marowsky-Bree
  2017-11-12 20:16       ` Sage Weil
  2018-01-03 16:37       ` Sage Weil
  0 siblings, 2 replies; 18+ messages in thread
From: Lars Marowsky-Bree @ 2017-11-12 17:16 UTC (permalink / raw)
  To: ceph-devel

On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:

> Many thanks! I'm very excited to join Ceph's outstanding community!
> I'm looking forward to working on this challenging project, and I'm
> very grateful for the opportunity to be guided by Sage.

That's all excellent news!

Can we discuss, though, if/how this belongs in ceph-osd? Given that this
can be (and is) already collected via smartmon, either via prometheus or, I
assume, collectd as well, does this really need to be added to the OSD
code?

Would the goal be for them to report this to ceph-mgr, or expose
directly as something to be queried via, say, a prometheus exporter
binding? Or are the OSDs supposed to directly act on this information?


Regards,
    Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-12 17:16     ` Lars Marowsky-Bree
@ 2017-11-12 20:16       ` Sage Weil
  2017-11-13 10:46         ` John Spray
  2017-11-13 11:53         ` Piotr Dałek
  2018-01-03 16:37       ` Sage Weil
  1 sibling, 2 replies; 18+ messages in thread
From: Sage Weil @ 2017-11-12 20:16 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: ceph-devel

On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
> On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:
> 
> > Many thanks! I'm very excited to join Ceph's outstanding community!
> > I'm looking forward to working on this challenging project, and I'm
> > very grateful for the opportunity to be guided by Sage.
> 
> That's all excellent news!
> 
> Can we discuss though if/how this belongs into ceph-osd? Given that this
> can (and is) already collected via smartmon, either via prometheus or, I
> assume, collectd as well? Does this really need to be added to the OSD
> code?
> 
> Would the goal be for them to report this to ceph-mgr, or expose
> directly as something to be queried via, say, a prometheus exporter
> binding? Or are the OSDs supposed to directly act on this information?

The OSD is just a convenient channel, but it needn't be the only one or 
the only option.

Part 1 of the project is to get JSON output out of smartctl so we avoid 
one of the many crufty projects floating around to parse its weird output; 
that'll be helpful to all consumers, presumably.
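
To give a feel for where that's headed, here's a minimal sketch of a 
consumer; the '--json' flag name and the field names are guesses at this 
point, since the smartmontools side is still in progress:

import json
import subprocess

def get_smart_json(device):
    # assumes smartctl grows a '--json' flag; it isn't there yet upstream
    out = subprocess.check_output(['smartctl', '--json', '--all', device])
    return json.loads(out)

data = get_smart_json('/dev/sda')
# field names are illustrative only; the real schema may differ
print(data.get('model_name'), data.get('smart_status'))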

Part 2 is to map OSDs to host:device pairs; that merged already.
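(For illustration, consuming that mapping might look something like the 
sketch below; the 'hostname' and 'devices' field names are from memory and 
may not match exactly what merged.)

import json
import subprocess

def osd_host_devices():
    meta = json.loads(subprocess.check_output(
        ['ceph', 'osd', 'metadata', '--format=json']))
    mapping = {}
    for m in meta:
        host = m.get('hostname', 'unknown')
        devs = [d for d in m.get('devices', '').split(',') if d]
        mapping[m['id']] = [(host, d) for d in devs]
    return mapping

print(osd_host_devices())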

Part 3 is to gather the actual data.  The prototype has the OSD polling 
this because it (1) knows which devices it consumes and (2) is present on 
every node.  We're contemplating a per-host ceph-volume-agent for 
assisting with OSD (de)provisioning (i.e., running ceph-volume); that 
could be an option.  Or if some other tool is already scraping it and can 
be queried, that would work too.

I think the OSD will end up being a necessary path (perhaps among many), 
though, because when we are using SPDK I don't think we'll be able to get 
the SMART data via smartctl (or any other tool) at all because the OSD 
process will be running the NVMe driver.

Part 4 is to archive the results.  The original thought was to dump it 
into RADOS.  I hadn't considered prometheus, but that might be a better 
fit!  I'm generally pretty cautious about introducing dependencies like 
this but we're already expecting prometheus to be used for other metrics 
for the dashboard.  I'm not sure whether prometheus' query interface lends 
itself to the failure models, though...

Part 5 is to do some basic failure prediction!

sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-12 20:16       ` Sage Weil
@ 2017-11-13 10:46         ` John Spray
  2017-11-13 11:00           ` Lars Marowsky-Bree
  2017-11-13 11:53         ` Piotr Dałek
  1 sibling, 1 reply; 18+ messages in thread
From: John Spray @ 2017-11-13 10:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: Lars Marowsky-Bree, Ceph Development

On Sun, Nov 12, 2017 at 8:16 PM, Sage Weil <sage@newdream.net> wrote:
> On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
>> On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:
>>
>> > Many thanks! I'm very excited to join Ceph's outstanding community!
>> > I'm looking forward to working on this challenging project, and I'm
>> > very grateful for the opportunity to be guided by Sage.
>>
>> That's all excellent news!
>>
>> Can we discuss though if/how this belongs into ceph-osd? Given that this
>> can (and is) already collected via smartmon, either via prometheus or, I
>> assume, collectd as well? Does this really need to be added to the OSD
>> code?
>>
>> Would the goal be for them to report this to ceph-mgr, or expose
>> directly as something to be queried via, say, a prometheus exporter
>> binding? Or are the OSDs supposed to directly act on this information?
>
> The OSD is just a convenient channel, but needn't be the only
> one or only option.
>
> Part 1 of the project is to get JSON output out of smartctl so we avoid
> one of the many crufty projects floating around to parse its weird output;
> that'll be helpful to all consumers, presumably.
>
> Part 2 is to map OSDs to host:device pairs; that merged already.
>
> Part 3 is to gather the actual data.  The prototype has the OSD polling
> this because it (1) knows which devices it consumes and (2) is present on
> every node.  We're contemplating a per-host ceph-volume-agent for
> assisting with OSD (de)provisioning (i.e., running ceph-volume); that
> could be an option.  Or if some other tool is already scraping it and can
> be queried, that would work too.
>
> I think the OSD will end up being a necessary path (perhaps among many),
> though, because when we are using SPDK I don't think we'll be able to get
> the SMART data via smartctl (or any other tool) at all because the OSD
> process will be running the NVMe driver.
>
> Part 4 is to archive the results.  The original thought was to dump it
> into RADOS.  I hadn't considered prometheus, but that might be a better
> fit!  I'm generally pretty cautious about introducing dependencies like
> this but we're already expecting prometheus to be used for other metrics
> for the dashboard.  I'm not sure whether prometheus' query interface lends
> itself to the failure models, though...

At the risk of stretching the analogy to breaking point, when we build
something "batteries included", it doesn't mean someone can't also
plug it into a mains power supply :-)

My attitude to prometheus is that we should use it (a lot! I'm a total
fan boy) but that it isn't an exclusive relationship: plug prometheus
into Ceph and you get the histories of things, but without prometheus
you should still be able to see all the latest values.

In that context, I would wonder if it would be better to initially do
the SMART work with just latest values (for just latest vals we could
persist these in config keys), and any history-based failure
prediction would perhaps depend on the user having a prometheus server
to store the history?
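
Something as simple as the sketch below would do for a first pass (the key 
naming is made up, and I'm assuming that shelling out to the CLI is 
acceptable; a mgr module could equally use its own key/value store):

import json
import subprocess

def store_latest_smart(osd_id, device, smart_blob):
    # key layout is made up; one key per OSD device, latest value only
    key = 'device_health/osd.%d/%s/latest' % (osd_id, device)
    subprocess.check_call(
        ['ceph', 'config-key', 'set', key, json.dumps(smart_blob)])

store_latest_smart(0, 'sda',
                   {'reallocated_sector_ct': 0, 'power_on_hours': 1234})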

John

> Part 5 is to do some basic failure prediction!
>
> sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-13 10:46         ` John Spray
@ 2017-11-13 11:00           ` Lars Marowsky-Bree
  2017-11-13 13:28             ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Lars Marowsky-Bree @ 2017-11-13 11:00 UTC (permalink / raw)
  To: Ceph Development

On 2017-11-13T10:46:25, John Spray <jspray@redhat.com> wrote:

> At the risk of stretching the analogy to breaking point, when we build
> something "batteries included", it doesn't mean someone can't also
> plug it into a mains power supply :-)

Plugging something designed to take 2x AAA cells into a mains power
supply is usually considered a bad idea, though ;-)

> My attitude to prometheus is that we should use it (a lot! I'm a total
> fan boy) but that it isn't an exclusive relationship: plug prometheus
> into Ceph and you get the histories of things, but without prometheus
> you should still be able to see all the latest values.

That makes sense, of course. Prometheus scrapes values from various
sources, and if it could scrape data directly off the ceph-osd
processes, why not.

> In that context, I would wonder if it would be better to initially do
> the SMART work with just latest values (for just latest vals we could
> persist these in config keys), and any history-based failure
> prediction would perhaps depend on the user having a prometheus server
> to store the history?

That isn't a bad idea, but would you really want to persist this in a
(potentially rather large) map? That'd involve relaying them to the MONs
or mgr.

Wouldn't it make more sense for something that wants to look at this
data to contact the relevant daemon? Having the daemon also expose the data
in the Prometheus exporter format would be useful (so it can be ingested
directly), of course.
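
(The exporter text format is trivial to emit - a quick sketch, with the 
metric and label names made up for illustration:)

def smart_to_prom_text(host, device, attrs):
    lines = ['# TYPE ceph_disk_smart_attr gauge']
    for name, value in attrs.items():
        lines.append(
            'ceph_disk_smart_attr{host="%s",device="%s",attr="%s"} %s'
            % (host, device, name, value))
    return '\n'.join(lines) + '\n'

print(smart_to_prom_text('node1', 'sda',
                         {'reallocated_sector_ct': 0,
                          'temperature_celsius': 34}))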

This would allow a platform like kubernetes and Prometheus combined to
take care of the automatic service discovery eventually too, and would
even work well for micro-server deployments.

If going down this route, though, we might want to link SMART libraries
directly into ceph-osd so the daemon doesn't need to call out to
smartctl.


Regards,
    Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-12 20:16       ` Sage Weil
  2017-11-13 10:46         ` John Spray
@ 2017-11-13 11:53         ` Piotr Dałek
  2017-11-14  4:09           ` Ric Wheeler
  1 sibling, 1 reply; 18+ messages in thread
From: Piotr Dałek @ 2017-11-13 11:53 UTC (permalink / raw)
  To: Sage Weil, Lars Marowsky-Bree; +Cc: ceph-devel

On 17-11-12 09:16 PM, Sage Weil wrote:
> On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
>> On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:
>>
>>> Many thanks! I'm very excited to join Ceph's outstanding community!
>>> I'm looking forward to working on this challenging project, and I'm
>>> very grateful for the opportunity to be guided by Sage.
>>
>> That's all excellent news!
>>
>> Can we discuss though if/how this belongs into ceph-osd? Given that this
>> can (and is) already collected via smartmon, either via prometheus or, I
>> assume, collectd as well? Does this really need to be added to the OSD
>> code?
>>
>> Would the goal be for them to report this to ceph-mgr, or expose
>> directly as something to be queried via, say, a prometheus exporter
>> binding? Or are the OSDs supposed to directly act on this information?
> 
> The OSD is just a convenient channel, but needn't be the only
> one or only option.
> 
> Part 1 of the project is to get JSON output out of smartctl so we avoid
> one of the many crufty projects floating around to parse its weird output;
> that'll be helpful to all consumers, presumably.

That means a new patch to smartctl itself, right?

> Part 2 is to map OSDs to host:device pairs; that merged already.
> 
> Part 3 is to gather the actual data.  The prototype has the OSD polling
> this because it (1) knows which devices it consumes and (2) is present on
> every node.  We're contemplating a per-host ceph-volume-agent for
> assisting with OSD (de)provisioning (i.e., running ceph-volume); that
> could be an option.  Or if some other tool is already scraping it and can
> be queried, that would work too.
> 
> I think the OSD will end up being a necessary path (perhaps among many),
> though, because when we are using SPDK I don't think we'll be able to get
> the SMART data via smartctl (or any other tool) at all because the OSD
> process will be running the NVMe driver.

This may not work anyway, because many controllers (including JBOD 
controllers) don't pass through SMART data, or the data doesn't make sense.

> Part 4 is to archive the results.  The original thought was to dump it
> into RADOS.  I hadn't considered prometheus, but that might be a better
> fit!  I'm generally pretty cautious about introducing dependencies like
> this but we're already expecting prometheus to be used for other metrics
> for the dashboard.  I'm not sure whether prometheus' query interface lends
> itself to the failure models, though...
> Part 5 is to do some basic failure prediction!

SMART is unreliable on spinning disks, and on SSDs it's only as reliable as 
the firmware (and that is often questionable).
Also, many vendors give different meanings to the same SMART attributes, 
making some of the obvious choices (like power-on hours or power-cycle count) 
useless (see https://www.backblaze.com/blog/hard-drive-smart-stats/ for an 
example).

Anyway, we'd love to see this feature be something that can be completely 
disabled by a config change and that doesn't incur any backwards 
incompatibility by itself.

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-13 11:00           ` Lars Marowsky-Bree
@ 2017-11-13 13:28             ` Sage Weil
  2017-11-13 13:31               ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2017-11-13 13:28 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Ceph Development

On Mon, 13 Nov 2017, Lars Marowsky-Bree wrote:
> On 2017-11-13T10:46:25, John Spray <jspray@redhat.com> wrote:
> 
> > At the risk of stretching the analogy to breaking point, when we build
> > something "batteries included", it doesn't mean someone can't also
> > plug it into a mains power supply :-)
> 
> Plugging something designed to take 2x AAA cells into a mains power
> supply is usually considered a bad idea, though ;-)
> 
> > My attitude to prometheus is that we should use it (a lot! I'm a total
> > fan boy) but that it isn't an exclusive relationship: plug prometheus
> > into Ceph and you get the histories of things, but without prometheus
> > you should still be able to see all the latest values.
> 
> That makes sense, of course. Prometheus scrapes values from various
> sources, and if it could scrape data directly off the ceph-osd
> processes, why not.
> 
> > In that context, I would wonder if it would be better to initially do
> > the SMART work with just latest values (for just latest vals we could
> > persist these in config keys), and any history-based failure
> > prediction would perhaps depend on the user having a prometheus server
> > to store the history?
> 
> That isn't a bad idea, but would you really want to persist this in a
> (potentially rather large) map? That'd involve relaying them to the MONs
> or mgr.
> 
> Wouldn't it make more sense for something that wants to look at this
> data to contact the relevant daemon? It exposing the data also in the
> Prometheus exporter format would be useful (so they can directly be
> ingested), of course.

The decision about preemptive failure should be made by the mgr 
module regardless (so it can consider other factors, like cluster 
health and fullness), so if it's not getting the raw data to apply the 
model, it needs to get a sufficiently meaningful metric (e.g., a 
precision/recall curve or the area under the precision-recall curve [1]).
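
(Computing that kind of metric is cheap once you have labelled data - e.g., 
a sketch using scikit-learn, with made-up labels and scores:)

from sklearn.metrics import auc, precision_recall_curve

y_true = [0, 0, 1, 0, 1, 1, 0, 1]    # 1 = device actually failed
y_score = [0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.4, 0.8]  # predicted failure risk

precision, recall, _ = precision_recall_curve(y_true, y_score)
print('area under PR curve: %.3f' % auc(recall, precision))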

> This would allow a platform like kubernetes and Prometheus combined to
> take care of the automatic service discovery eventually too, and would
> even work well for micro-server deployments.
> 
> If going down this route though, might want to directly link SMART
> libraries into ceph-osd so the daemon doesn't need to call out to
> smartctl.

There is libatasmart... it's simple and clean but it only works with ATA.  
smartmontools is much more robust and gathers health metrics for NVMe as 
well, which is why we were going down that path.

This is a pretty infrequent event, so I don't think it matters if we exec 
out to smartctl to get a JSON blob.

sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-13 13:28             ` Sage Weil
@ 2017-11-13 13:31               ` Sage Weil
  0 siblings, 0 replies; 18+ messages in thread
From: Sage Weil @ 2017-11-13 13:31 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: Ceph Development

On Mon, 13 Nov 2017, Sage Weil wrote:
> On Mon, 13 Nov 2017, Lars Marowsky-Bree wrote:
> > On 2017-11-13T10:46:25, John Spray <jspray@redhat.com> wrote:
> > 
> > > At the risk of stretching the analogy to breaking point, when we build
> > > something "batteries included", it doesn't mean someone can't also
> > > plug it into a mains power supply :-)
> > 
> > Plugging something designed to take 2x AAA cells into a mains power
> > supply is usually considered a bad idea, though ;-)
> > 
> > > My attitude to prometheus is that we should use it (a lot! I'm a total
> > > fan boy) but that it isn't an exclusive relationship: plug prometheus
> > > into Ceph and you get the histories of things, but without prometheus
> > > you should still be able to see all the latest values.
> > 
> > That makes sense, of course. Prometheus scrapes values from various
> > sources, and if it could scrape data directly off the ceph-osd
> > processes, why not.
> > 
> > > In that context, I would wonder if it would be better to initially do
> > > the SMART work with just latest values (for just latest vals we could
> > > persist these in config keys), and any history-based failure
> > > prediction would perhaps depend on the user having a prometheus server
> > > to store the history?
> > 
> > That isn't a bad idea, but would you really want to persist this in a
> > (potentially rather large) map? That'd involve relaying them to the MONs
> > or mgr.
> > 
> > Wouldn't it make more sense for something that wants to look at this
> > data to contact the relevant daemon? It exposing the data also in the
> > Prometheus exporter format would be useful (so they can directly be
> > ingested), of course.
> 
> The decision about preemptive failure should be made by the mgr 
> module regardless (so it can consider other factors, like cluster 
> health and fullness), so if it's not getting the raw data to apply the 
> model, it needs to get a sufficiently meaningful metric (e.g., a 
> precision/recall curve or the area under the precision-recall curve [1]).

[1] http://events.linuxfoundation.org/sites/events/files/slides/LF-Vault-2017-aelshimi.pdf



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-13 11:53         ` Piotr Dałek
@ 2017-11-14  4:09           ` Ric Wheeler
  2017-11-14  8:28             ` Piotr Dałek
  0 siblings, 1 reply; 18+ messages in thread
From: Ric Wheeler @ 2017-11-14  4:09 UTC (permalink / raw)
  To: Piotr Dałek, Sage Weil, Lars Marowsky-Bree; +Cc: ceph-devel

On 11/13/2017 05:23 PM, Piotr Dałek wrote:
> On 17-11-12 09:16 PM, Sage Weil wrote:
>> On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
>>> On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:
>>>
>>>> Many thanks! I'm very excited to join Ceph's outstanding community!
>>>> I'm looking forward to working on this challenging project, and I'm
>>>> very grateful for the opportunity to be guided by Sage.
>>>
>>> That's all excellent news!
>>>
>>> Can we discuss though if/how this belongs into ceph-osd? Given that this
>>> can (and is) already collected via smartmon, either via prometheus or, I
>>> assume, collectd as well? Does this really need to be added to the OSD
>>> code?
>>>
>>> Would the goal be for them to report this to ceph-mgr, or expose
>>> directly as something to be queried via, say, a prometheus exporter
>>> binding? Or are the OSDs supposed to directly act on this information?
>>
>> The OSD is just a convenient channel, but needn't be the only
>> one or only option.
>>
>> Part 1 of the project is to get JSON output out of smartctl so we avoid
>> one of the many crufty projects floating around to parse its weird output;
>> that'll be helpful to all consumers, presumably.
>
> That means a new patch to smartctl itself, right?
>
>> Part 2 is to map OSDs to host:device pairs; that merged already.
>>
>> Part 3 is to gather the actual data.  The prototype has the OSD polling
>> this because it (1) knows which devices it consumes and (2) is present on
>> every node.  We're contemplating a per-host ceph-volume-agent for
>> assisting with OSD (de)provisioning (i.e., running ceph-volume); that
>> could be an option.  Or if some other tool is already scraping it and can
>> be queried, that would work too.
>>
>> I think the OSD will end up being a necessary path (perhaps among many),
>> though, because when we are using SPDK I don't think we'll be able to get
>> the SMART data via smartctl (or any other tool) at all because the OSD
>> process will be running the NVMe driver.
>
> This may not work anyway, because  many controllers (including JBOD 
> controllers) don't pass-through SMART data, or the data don't make sense.

You are right that many controllers don't pass this information without going 
through their non-open-source tools. The libstoragemgmt project - 
https://github.com/libstorage/libstoragemgmt - has added support for some 
types of access to the physical back-end drives. I think it is worth syncing 
up with them to see how we might be able to extract the interesting bits.

>
>> Part 4 is to archive the results.  The original thought was to dump it
>> into RADOS.  I hadn't considered prometheus, but that might be a better
>> fit!  I'm generally pretty cautious about introducing dependencies like
>> this but we're already expecting prometheus to be used for other metrics
>> for the dashboard.  I'm not sure whether prometheus' query interface lends
>> itself to the failure models, though...
>> Part 5 is to do some basic failure prediction!
>
> SMART is unreliable on spinning disks, and on SSDs it's only as reliable as 
> firmware goes (and that is often questionable).
> Also, many vendors give different meaning to different SMART attributes, 
> making some of obvious choices (like power-on hours or power-cycle count) 
> useless (see https://www.backblaze.com/blog/hard-drive-smart-stats/ for example).

SMART data has been used selectively by major storage vendors for years to help 
flag errors. For spinning drives, one traditional red flag was the number of 
reallocated sectors (normalized by the number a spinning drive has). When you 
start chewing through those, that is a pretty good flag. Seagate and others did 
a lot of work (and models) that turned SMART data into a good predictor for 
failure on spinning drives, but it is not entirely trivial to do.
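
(As a toy illustration of the kind of rule people started from - the 
threshold here is arbitrary, and real models are far more involved than a 
single cutoff:)

def looks_suspect(smart_attrs, realloc_threshold=10):
    # classic red flags: reallocated and pending sectors
    realloc = smart_attrs.get('Reallocated_Sector_Ct', 0)
    pending = smart_attrs.get('Current_Pending_Sector', 0)
    return realloc >= realloc_threshold or pending > 0

print(looks_suspect({'Reallocated_Sector_Ct': 24,
                     'Current_Pending_Sector': 0}))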

For example, at the USENIX Vault conference, there was this presentation which 
showed some interesting recent work:

http://sched.co/9WQT

There is also a lot of information about drive failures (SSD and spinning) at 
USENIX FAST over many years. Things have improved a lot over the years, 
especially with modern SSDs and NVMe, where a lot of hard work has gone into 
adding improved metrics to the data.

Regards,

Ric

>
> Anyway, we'd love to see that this feature can be completely disabled by 
> config change and don't incur any backwards incompatibility by itself.
>


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-14  4:09           ` Ric Wheeler
@ 2017-11-14  8:28             ` Piotr Dałek
  2017-11-14 14:19               ` Sage Weil
  0 siblings, 1 reply; 18+ messages in thread
From: Piotr Dałek @ 2017-11-14  8:28 UTC (permalink / raw)
  To: Ric Wheeler, Sage Weil, Lars Marowsky-Bree; +Cc: ceph-devel

On 17-11-14 05:09 AM, Ric Wheeler wrote:
> On 11/13/2017 05:23 PM, Piotr Dałek wrote:
>> On 17-11-12 09:16 PM, Sage Weil wrote:
>>> On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
>>>> On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:
>>>>
>>>>> Many thanks! I'm very excited to join Ceph's outstanding community!
>>>>> I'm looking forward to working on this challenging project, and I'm
>>>>> very grateful for the opportunity to be guided by Sage.
>>>>
>>>> That's all excellent news!
>>>>
>>>> Can we discuss though if/how this belongs into ceph-osd? Given that this
>>>> can (and is) already collected via smartmon, either via prometheus or, I
>>>> assume, collectd as well? Does this really need to be added to the OSD
>>>> code?
>>>>
>>>> Would the goal be for them to report this to ceph-mgr, or expose
>>>> directly as something to be queried via, say, a prometheus exporter
>>>> binding? Or are the OSDs supposed to directly act on this information?
>>>
>>> The OSD is just a convenient channel, but needn't be the only
>>> one or only option.
>>>
>>> Part 1 of the project is to get JSON output out of smartctl so we avoid
>>> one of the many crufty projects floating around to parse its weird output;
>>> that'll be helpful to all consumers, presumably.
>>
>> That means a new patch to smartctl itself, right?
>>
>>> Part 2 is to map OSDs to host:device pairs; that merged already.
>>>
>>> Part 3 is to gather the actual data.  The prototype has the OSD polling
>>> this because it (1) knows which devices it consumes and (2) is present on
>>> every node.  We're contemplating a per-host ceph-volume-agent for
>>> assisting with OSD (de)provisioning (i.e., running ceph-volume); that
>>> could be an option.  Or if some other tool is already scraping it and can
>>> be queried, that would work too.
>>>
>>> I think the OSD will end up being a necessary path (perhaps among many),
>>> though, because when we are using SPDK I don't think we'll be able to get
>>> the SMART data via smartctl (or any other tool) at all because the OSD
>>> process will be running the NVMe driver.
>>
>> This may not work anyway, because  many controllers (including JBOD 
>> controllers) don't pass-through SMART data, or the data don't make sense.
> 
> You are right that many controllers don't pass this information without 
> going through their non-open source tools. The libstoragemgmt project - 
> https://github.com/libstorage/libstoragemgmt - has added support for doing 
> some types of access for the physical back end drives. It is worth syncing 
> up with them I think to see how we might be able to extract interesting bits.

There's another problem - bcache/flashcache/<insert your favorite vendor> 
cache - OSDs often reside on top of some cache device, and accessing SMART 
values for that might not work, or might not return all the required values.

>>> Part 4 is to archive the results.  The original thought was to dump it
>>> into RADOS.  I hadn't considered prometheus, but that might be a better
>>> fit!  I'm generally pretty cautious about introducing dependencies like
>>> this but we're already expecting prometheus to be used for other metrics
>>> for the dashboard.  I'm not sure whether prometheus' query interface lends
>>> itself to the failure models, though...
>>> Part 5 is to do some basic failure prediction!
>>
>> SMART is unreliable on spinning disks, and on SSDs it's only as reliable 
>> as firmware goes (and that is often questionable).
>> Also, many vendors give different meaning to different SMART attributes, 
>> making some of obvious choices (like power-on hours or power-cycle count) 
>> useless (see https://www.backblaze.com/blog/hard-drive-smart-stats/ for 
>> example).
> 
> SMART data has been used selectively by major storage vendors for years to 
> help flag errors. For spinning drives, one traditional red flag was the 
> number of reallocated sectors (normalized by the number a spinning drive 
> has). When you start chewing through those, that is a pretty good flag. 

This value increases when platters wear out, get demagnetized somehow, or 
the disk vibrates too much. It still doesn't take motor wear into account.

> Seagate and others did a lot of work (and models) that turned smart data 
> into a good predictor for failure on spinning drives, but it is not entirely 
> trivial to do.
> 
> For example, at the USENIX Vault conference, there was this presentation 
> which showed some interesting recent work:
> 
> http://sched.co/9WQT
> 
> There is also a lot of information about drive failures (SSD and spinning) 
> at USENIX FAST over many years. Things have improved a lot over the years, 
> especially with modern SSD's and NVME where a lot of hard work has happened 
> to add improved metrics to the data.

That's my point. That's a lot of statistics to chew through, and most of it 
relies on assumptions that may already be wrong or become wrong some time 
later. All it takes is a brand-new product line with different 
characteristics.
SSDs are different - you just measure the number of erase/program cycles and 
(again) make assumptions based on that - which is easier and more reliable.
Still, I would be *very* unhappy to be woken up in the middle of the night 
only to realize that the cluster had incorrectly predicted a disk failure, 
and my company (and I'm pretty sure not only my company) wouldn't be happy 
either if the cluster forced it to throw away perfectly good disks, because 
reusing them would yield the same result.
On the other hand, this creates a back door for vendors to force device 
replacement even when a device is perfectly fine; some SSD vendors already 
do this, with their devices going into read-only mode even when there's a 
whole lot of p/e cycles left in the flash cells. I don't think we need Ceph 
to go this way.

tl;dr - I'm fine with that feature as long as there's a way to disable it 
entirely.

-- 
Piotr Dałek
piotr.dalek@corp.ovh.com
https://www.ovh.com/us/

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-14  8:28             ` Piotr Dałek
@ 2017-11-14 14:19               ` Sage Weil
  2017-11-15  0:09                 ` Huang Zhiteng
  0 siblings, 1 reply; 18+ messages in thread
From: Sage Weil @ 2017-11-14 14:19 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: Ric Wheeler, Lars Marowsky-Bree, ceph-devel


On Tue, 14 Nov 2017, Piotr Dałek wrote:
> On 17-11-14 05:09 AM, Ric Wheeler wrote:
> > On 11/13/2017 05:23 PM, Piotr Dałek wrote:
> > > This may not work anyway, because  many controllers (including JBOD
> > > controllers) don't pass-through SMART data, or the data don't make sense.
> > 
> > You are right that many controllers don't pass this information without
> > going through their non-open source tools. The libstoragemgmt project -
> > https://github.com/libstorage/libstoragemgmt - has added support for doing
> > some types of access for the physical back end drives. It is worth syncing
> > up with them I think to see how we might be able to extract interesting
> > bits.
> 
> There's another problem - bcache/flashcache/<insert your favorite vendor>
> cache - osds often reside on top of some cache device, and accessing SMART
> values for that might not work, or might not return all required values.

For devicemapper devices at least it is pretty straightforward to work out 
the underlying physical device.
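
(e.g., a rough sketch that just walks the 'slaves' links in sysfs; mapping 
partitions back to their parent disk is left out here:)

import os

def underlying_devices(dev):
    # dev is a kernel block device name like 'dm-0'
    slaves_dir = '/sys/class/block/%s/slaves' % dev
    if not os.path.isdir(slaves_dir) or not os.listdir(slaves_dir):
        return [dev]
    found = []
    for slave in os.listdir(slaves_dir):
        found.extend(underlying_devices(slave))
    return found

print(underlying_devices('dm-0'))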

I'm sure there will always be some devices and stacks that successfully 
obscure the reliability data, but most deployments will benefit.

> > There is also a lot of information about drive failures (SSD and spinning)
> > at USENIX FAST over many years. Things have improved a lot over the years,
> > especially with modern SSD's and NVME where a lot of hard work has happened
> > to add improved metrics to the data.
> 
> That's my point. That's a lot of statistics to chew through and most of it
> relies on assumptions that can be already wrong or be wrong some time after.
> All it takes is a brand-new product line with different characteristics.
> SSDs are different - you just measure number of erase/program cycles and
> (again) do assumptions based on that - that's easier and more reliable.
> Still, I would be *very* unhappy if I'd be woken up in the middle of the night
> just to realize that cluster incorrectly predicted disk failure and my company
> (and I'm pretty sure not only my company) wouldn't be happy either if cluster
> would force it to throw away perfectly good disks, because reusing them would
> yield the same result.
> On the other hand, this creates a back door for vendors to force device
> replacement even if it's perfectly fine, some SSD vendors already do this with
> their devices going into read-only mode even when there's a whole lot of p/e
> cycles left in flash cells. I don't think we need Ceph to go this way.

OT: I view building good prediction models as an orthogonal problem, and 
one that relies on collecting a large data set.  Patrick McGarry and 
several others are working on a related project to build a public data set 
of SMART etc reliability data so that such models can be built for use in 
open systems.  Current data sets from backblaze suffer from a small set of 
device models, which means only large cloud providers or system vendors 
with large deployments are able to gather enough healthy metrics and 
failure data to build good models.  The goal of the other project is to 
allow regular users (of systems like Ceph) to opt into sharing reliability 
data so that better models can be built--ones that cover a broader range 
of devices.
 
> tl;dr - I'm fine with that feature as long as there'll be a possibility to
> disable it entirely.

Of course!

sage

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-14 14:19               ` Sage Weil
@ 2017-11-15  0:09                 ` Huang Zhiteng
  0 siblings, 0 replies; 18+ messages in thread
From: Huang Zhiteng @ 2017-11-15  0:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: Piotr Dałek, Ric Wheeler, Lars Marowsky-Bree, ceph-devel

Here is another piece of work (probably done against a similar dataset from
Backblaze) by IBM, which is pretty impressive:
https://www.ibm.com/blogs/research/2016/08/predicting-disk-failures-reliable-clouds/

On Tue, Nov 14, 2017 at 10:19 PM, Sage Weil <sweil@redhat.com> wrote:
> On Tue, 14 Nov 2017, Piotr Dałek wrote:
>> On 17-11-14 05:09 AM, Ric Wheeler wrote:
>> > On 11/13/2017 05:23 PM, Piotr Dałek wrote:
>> > > This may not work anyway, because  many controllers (including JBOD
>> > > controllers) don't pass-through SMART data, or the data don't make sense.
>> >
>> > You are right that many controllers don't pass this information without
>> > going through their non-open source tools. The libstoragemgmt project -
>> > https://github.com/libstorage/libstoragemgmt - has added support for doing
>> > some types of access for the physical back end drives. It is worth syncing
>> > up with them I think to see how we might be able to extract interesting
>> > bits.
>>
>> There's another problem - bcache/flashcache/<insert your favorite vendor>
>> cache - osds often reside on top of some cache device, and accessing SMART
>> values for that might not work, or might not return all required values.
>
> For devicemapper devices at least it is pretty straightforward to work out
> the underlying physical device.
>
> I'm sure there will always be some devices and stacks that successfully
> obscure the reliability data, but most deployments will benefit.
>
>> > There is also a lot of information about drive failures (SSD and spinning)
>> > at USENIX FAST over many years. Things have improved a lot over the years,
>> > especially with modern SSD's and NVME where a lot of hard work has happened
>> > to add improved metrics to the data.
>>
>> That's my point. That's a lot of statistics to chew through and most of it
>> relies on assumptions that can be already wrong or be wrong some time after.
>> All it takes is a brand-new product line with different characteristics.
>> SSDs are different - you just measure number of erase/program cycles and
>> (again) do assumptions based on that - that's easier and more reliable.
>> Still, I would be *very* unhappy if I'd be woken up in the middle of the night
>> just to realize that cluster incorrectly predicted disk failure and my company
>> (and I'm pretty sure not only my company) wouldn't be happy either if cluster
>> would force it to throw away perfectly good disks, because reusing them would
>> yield the same result.
>> On the other hand, this creates a back door for vendors to force device
>> replacement even if it's perfectly fine, some SSD vendors already do this with
>> their devices going into read-only mode even when there's a whole lot of p/e
>> cycles left in flash cells. I don't think we need Ceph to go this way.
>
> OT: I view building good prediction models as an orthogonal problem, and
> one that relies on collecting a large data set.  Patrick McGarry and
> several others are working on a related project to build a public data set
> of SMART etc reliability data so that such models can be built for use in
> open systems.  Current data sets from backblaze suffer from a small set of
> device models, which means only large cloud providers or system vendors
> with large deployments are able to gather enough healthy metrics and
> failure data to build good models.  The goal of the other project is to
> allow regular users (of systems like Ceph) to opt into sharing reliability
> data so that better models can be built--ones that cover a broader range
> of devices.
>
>> tl;dr - I'm fine with that feature as long as there'll be a possibility to
>> disable it entirely.
>
> Of course!
>
> sage



-- 
Regards
Huang Zhiteng

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2017-11-12 17:16     ` Lars Marowsky-Bree
  2017-11-12 20:16       ` Sage Weil
@ 2018-01-03 16:37       ` Sage Weil
  2018-01-03 17:28         ` Kyle Bader
  2018-01-03 20:36         ` Jan Fajerski
  1 sibling, 2 replies; 18+ messages in thread
From: Sage Weil @ 2018-01-03 16:37 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: yaarit, ceph-devel

On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
> On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:
> 
> > Many thanks! I'm very excited to join Ceph's outstanding community!
> > I'm looking forward to working on this challenging project, and I'm
> > very grateful for the opportunity to be guided by Sage.
> 
> That's all excellent news!
> 
> Can we discuss though if/how this belongs into ceph-osd? Given that this
> can (and is) already collected via smartmon, either via prometheus or, I
> assume, collectd as well? Does this really need to be added to the OSD
> code?

Hi Lars,

Yaarit is taking a look at this now and the smartmon.sh collector for 
prometheus looks a bit janky:

1) It seems like you have to set up a cron job to write the current smart 
output to a text file in a directory somewhere, and then prometheus will 
scrape it when polled.[1]

2) smartmon.sh[2] is a shortish pile of bash that collects only a handful 
of fields by parsing smartctl output.

The second piece will hopefully improve once the JSON output mode for 
smartctl is completed (that is in progress upstream in smartmontools).  
But the first part seems awkward, and doesn't look like it would work out 
of the box.  Are you guys currently collecting SMART data?  If so, how did 
you automate/simplify the setup?

Thanks!
sage


[1] https://github.com/prometheus/node_exporter#textfile-collector
[2] https://github.com/prometheus/node_exporter/edit/master/text_collector_examples/smartmon.sh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2018-01-03 16:37       ` Sage Weil
@ 2018-01-03 17:28         ` Kyle Bader
  2018-01-03 20:36         ` Jan Fajerski
  1 sibling, 0 replies; 18+ messages in thread
From: Kyle Bader @ 2018-01-03 17:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Lars Marowsky-Bree, yaarit, ceph-devel

I seem to recall a batch of hardware in one of the early clusters that
had issues with repeated polling of SMART data: IO would pause for a
few seconds. This might explain why they write the results out from a
daily cron job and poll that file instead of repeatedly polling the
actual devices.

On Wed, Jan 3, 2018 at 8:37 AM, Sage Weil <sage@newdream.net> wrote:
> On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
>> On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:
>>
>> > Many thanks! I'm very excited to join Ceph's outstanding community!
>> > I'm looking forward to working on this challenging project, and I'm
>> > very grateful for the opportunity to be guided by Sage.
>>
>> That's all excellent news!
>>
>> Can we discuss though if/how this belongs into ceph-osd? Given that this
>> can (and is) already collected via smartmon, either via prometheus or, I
>> assume, collectd as well? Does this really need to be added to the OSD
>> code?
>
> Hi Lars,
>
> Yaarit is taking a look at this now and the smartmon.sh collector for
> prometheus looks a bit janky:
>
> 1) It seems like you have to set up a cron job to write the current smart
> output to a text file in a directory somewhere, and then prometheus will
> scrape it when polled.[1]
>
> 2) smartmon.sh[2] is a shortish pile of bash that collects only a handful
> of fields by parsing smartctl output.
>
> The second piece will hopefully improve once the JSON output mode for
> smartctl is completed (that is in progress upstream in smartmontools).
> But the first part seems awkward, and doesn't look like it would work out
> of the box.  Are you guys currently collecting SMART data?  If so, how did
> you automate/simplify the setup?
>
> Thanks!
> sage
>
>
> [1] https://github.com/prometheus/node_exporter#textfile-collector
> [2] https://github.com/prometheus/node_exporter/edit/master/text_collector_examples/smartmon.sh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2018-01-03 16:37       ` Sage Weil
  2018-01-03 17:28         ` Kyle Bader
@ 2018-01-03 20:36         ` Jan Fajerski
  2018-01-03 22:44           ` Lars Marowsky-Bree
  1 sibling, 1 reply; 18+ messages in thread
From: Jan Fajerski @ 2018-01-03 20:36 UTC (permalink / raw)
  To: Sage Weil; +Cc: Lars Marowsky-Bree, yaarit, ceph-devel

On Wed, Jan 03, 2018 at 04:37:00PM +0000, Sage Weil wrote:
>On Sun, 12 Nov 2017, Lars Marowsky-Bree wrote:
>> On 2017-11-10T22:36:46, Yaarit Hatuka <yaarit@gmail.com> wrote:
>>
>> > Many thanks! I'm very excited to join Ceph's outstanding community!
>> > I'm looking forward to working on this challenging project, and I'm
>> > very grateful for the opportunity to be guided by Sage.
>>
>> That's all excellent news!
>>
>> Can we discuss though if/how this belongs into ceph-osd? Given that this
>> can (and is) already collected via smartmon, either via prometheus or, I
>> assume, collectd as well? Does this really need to be added to the OSD
>> code?
>
>Hi Lars,
>
>Yaarit is taking a look at this now and the smartmon.sh collector for
>prometheus looks a bit janky:
>
>1) It seems like you have to set up a cron job to write the current smart
>output to a text file in a directory somewhere, and then prometheus will
>scrape it when polled.[1]
>
>2) smartmon.sh[2] is a shortish pile of bash that collects only a handful
>of fields by parsing smartctl output.
>
>The second piece will hopefully improve once the JSON output mode for
>smartctl is completed (that is in progress upstream in smartmontools).
>But the first part seems awkward, and doesn't look like it would work out
>of the box.  Are you guys currently collecting SMART data?  If so, how did
>you automate/simplify the setup?
We do collect SMART data with the mechanism described above. We use salt to 
set up a cronjob (though, say, a systemd timer would also work) that runs
smartmon.sh > node_exporter/text_collector_dir/file.
It does not work out of the box, but that is how the node_exporter is meant to 
be extended afaiu.
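
Roughly, the job boils down to something like the sketch below (the 
textfile directory path and the metric naming here are just examples, not 
exactly what smartmon.sh emits):

import glob
import os
import subprocess

TEXTFILE_DIR = '/var/lib/node_exporter/textfile_collector'  # example path

def collect():
    lines = []
    for dev in sorted(glob.glob('/dev/sd?')):
        try:
            out = subprocess.check_output(['smartctl', '-A', dev])
        except subprocess.CalledProcessError as e:
            out = e.output   # smartctl uses nonzero exit codes as status bits
        for row in out.decode().splitlines():
            cols = row.split()
            if len(cols) >= 10 and cols[0].isdigit() and cols[9].isdigit():
                lines.append(
                    'smartmon_attr_raw{device="%s",name="%s"} %s'
                    % (dev, cols[1], cols[9]))
    tmp = os.path.join(TEXTFILE_DIR, 'smartmon.prom.tmp')
    with open(tmp, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    # rename so node_exporter never sees a half-written file
    os.rename(tmp, os.path.join(TEXTFILE_DIR, 'smartmon.prom'))

collect()
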
>
>Thanks!
>sage
>
>
>[1] https://github.com/prometheus/node_exporter#textfile-collector
>[2] https://github.com/prometheus/node_exporter/edit/master/text_collector_examples/smartmon.sh

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: SMART disk monitoring
  2018-01-03 20:36         ` Jan Fajerski
@ 2018-01-03 22:44           ` Lars Marowsky-Bree
  0 siblings, 0 replies; 18+ messages in thread
From: Lars Marowsky-Bree @ 2018-01-03 22:44 UTC (permalink / raw)
  To: Sage Weil, yaarit, ceph-devel

On 2018-01-03T21:36:56, Jan Fajerski <jfajerski@suse.com> wrote:

> We do collect SMART data with the mechanism described above. We use salt to
> setup up a cronjob (though say a systemd timer would also work) that runs
> smartmon.sh > node_exporter/text_collector_dir/file.
> It does not work out of the box, but that is how the node_exporter is meant
> to be extended afaiu.

I could see this being more neatly handled by having a smarter exporter
that pulls data on-demand on scrape, and then managing the intervals
accordingly via that mechanism, but that'd be inefficient with multiple
Prometheus instances scraping (which is a common setup).
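
(Roughly what I have in mind, as a sketch using the prometheus_client 
library; it assumes the smartctl JSON mode exists, and the flag name and 
output schema here are guesses:)

import json
import subprocess
import time

from prometheus_client import REGISTRY, start_http_server
from prometheus_client.core import GaugeMetricFamily

class SmartCollector(object):
    def __init__(self, devices):
        self.devices = devices

    def collect(self):   # runs on every scrape
        g = GaugeMetricFamily('smart_attr_raw', 'Raw SMART attribute value',
                              labels=['device', 'name'])
        for dev in self.devices:
            data = json.loads(subprocess.check_output(
                ['smartctl', '--json', '--attributes', dev]))
            for attr in data.get('ata_smart_attributes', {}).get('table', []):
                g.add_metric([dev, attr['name']], attr['raw']['value'])
        yield g

REGISTRY.register(SmartCollector(['/dev/sda', '/dev/sdb']))
start_http_server(9110)   # port picked arbitrarily
while True:
    time.sleep(60)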

I'd prefer if we got JSON directly out of smartctl, of course. But even
ceph-osd would probably only do periodic scrapes for the above
reasons.

Unless you want ceph-osd to take direct action, I'm not sure there's a
need for it to handle this internally.

Also, how would it handle things like devices shared between OSDs, e.g.,
the WAL/journals? That's better handled via an outside scheduler.


Regards,
    Lars

-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread

Thread overview: 18+ messages
2017-11-10 17:58 SMART disk monitoring Sage Weil
2017-11-10 23:45 ` Ali Maredia
2017-11-11  3:36   ` Yaarit Hatuka
2017-11-12 17:16     ` Lars Marowsky-Bree
2017-11-12 20:16       ` Sage Weil
2017-11-13 10:46         ` John Spray
2017-11-13 11:00           ` Lars Marowsky-Bree
2017-11-13 13:28             ` Sage Weil
2017-11-13 13:31               ` Sage Weil
2017-11-13 11:53         ` Piotr Dałek
2017-11-14  4:09           ` Ric Wheeler
2017-11-14  8:28             ` Piotr Dałek
2017-11-14 14:19               ` Sage Weil
2017-11-15  0:09                 ` Huang Zhiteng
2018-01-03 16:37       ` Sage Weil
2018-01-03 17:28         ` Kyle Bader
2018-01-03 20:36         ` Jan Fajerski
2018-01-03 22:44           ` Lars Marowsky-Bree
