* [LSF/MM TOPIC] block level event logging for storage media management
@ 2017-01-18 23:34 Song Liu
  2017-01-19  0:11 ` Bart Van Assche
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Song Liu @ 2017-01-18 23:34 UTC (permalink / raw)
  To: lsf-pc; +Cc: Jens Axboe, Kernel Team, linux-block


Media health monitoring is very important for large scale distributed storage systems.
Traditionally, enterprise storage controllers maintain event logs for attached storage
devices. However, these controller managed logs do not scale well for large scale
distributed systems.

While designing a more flexible and scalable event logging systems, we think it is better
to build the log in block layer. Block level event logging covers all major storage media
(SCSI, SATA, NVMe), and thus minimizes redundant work for different protocols.

In this LSF/MM, we would like to discuss the following topics with the community:
    1. Mechanism for drivers report events (or errors) to block layer.
       Basically, we will need a traceable function for the drivers to report errors
       (most likely right before calling end_request or bio_endio).

    2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event logging?

    3. How should we categorize different events?
       Currently, there are existing code that translates ATA error (ata_to_sense_error)
       and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can
       leverage SCSI Key Code Qualifier for event categorizations.

    4. Detailed discussions on data structure for event logging.

We will be able to show a prototype implementation during LSF/MM.
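
As a purely illustrative sketch (not the prototype), topic 1 could for example take the
form of an ordinary kernel tracepoint; the name block_media_event and its fields below
are hypothetical, reusing the sense-code triple mentioned in topic 3:

#undef TRACE_SYSTEM
#define TRACE_SYSTEM block_media

#if !defined(_TRACE_BLOCK_MEDIA_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_BLOCK_MEDIA_H

#include <linux/tracepoint.h>

/* One record per media event, emitted by the driver right before it
 * completes the request (end_request/bio_endio). */
TRACE_EVENT(block_media_event,

	TP_PROTO(dev_t dev, sector_t sector, unsigned int nr_sectors,
		 u8 sense_key, u8 asc, u8 ascq),

	TP_ARGS(dev, sector, nr_sectors, sense_key, asc, ascq),

	TP_STRUCT__entry(
		__field(dev_t,        dev)
		__field(sector_t,     sector)
		__field(unsigned int, nr_sectors)
		__field(u8,           sense_key)  /* SCSI Key Code Qualifier,  */
		__field(u8,           asc)        /* shared with ATA/NVMe via  */
		__field(u8,           ascq)       /* the existing translations */
	),

	TP_fast_assign(
		__entry->dev        = dev;
		__entry->sector     = sector;
		__entry->nr_sectors = nr_sectors;
		__entry->sense_key  = sense_key;
		__entry->asc        = asc;
		__entry->ascq       = ascq;
	),

	TP_printk("dev %d:%d sector %llu+%u sense %02x/%02x/%02x",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  (unsigned long long)__entry->sector, __entry->nr_sectors,
		  __entry->sense_key, __entry->asc, __entry->ascq)
);

#endif /* _TRACE_BLOCK_MEDIA_H */

#include <trace/define_trace.h>

A driver would then call the generated trace_block_media_event() in its completion path,
and both ftrace and BPF can attach to that single tracepoint regardless of transport.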

Thanks,
Song

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-18 23:34 [LSF/MM TOPIC] block level event logging for storage media management Song Liu
@ 2017-01-19  0:11 ` Bart Van Assche
  2017-01-19  6:32 ` Coly Li
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Bart Van Assche @ 2017-01-19  0:11 UTC (permalink / raw)
  To: lsf-pc, songliubraving; +Cc: Kernel-team, linux-block, axboe

On Wed, 2017-01-18 at 23:34 +0000, Song Liu wrote:
> Media health monitoring is very important for large scale distributed storage systems.
> Traditionally, enterprise storage controllers maintain event logs for attached storage
> devices. However, these controller managed logs do not scale well for large scale
> distributed systems.
>
> While designing a more flexible and scalable event logging systems, we think it is better
> to build the log in block layer. Block level event logging covers all major storage media
> (SCSI, SATA, NVMe), and thus minimizes redundant work for different protocols.
>
> In this LSF/MM, we would like to discuss the following topics with the community:
>     1. Mechanism for drivers report events (or errors) to block layer.
>        Basically, we will need a traceable function for the drivers to report errors
>        (most likely right before calling end_request or bio_endio).
>
>     2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event logging?
>
>     3. How should we categorize different events?
>        Currently, there are existing code that translates ATA error (ata_to_sense_error)
>        and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can
>        leverage SCSI Key Code Qualifier for event categorizations.
>
>     4. Detailed discussions on data structure for event logging.
>
> We will be able to show a prototype implementation during LSF/MM.

I'd like to participate in this discussion.

Bart.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-18 23:34 [LSF/MM TOPIC] block level event logging for storage media management Song Liu
  2017-01-19  0:11 ` Bart Van Assche
@ 2017-01-19  6:32 ` Coly Li
  2017-01-19  6:48 ` Hannes Reinecke
  2017-01-21  5:46 ` Dan Williams
  3 siblings, 0 replies; 11+ messages in thread
From: Coly Li @ 2017-01-19  6:32 UTC (permalink / raw)
  To: Song Liu, lsf-pc; +Cc: Jens Axboe, Kernel Team, linux-block

On 2017/1/19 7:34 AM, Song Liu wrote:
> 
> Media health monitoring is very important for large scale distributed storage systems. 
> Traditionally, enterprise storage controllers maintain event logs for attached storage
> devices. However, these controller managed logs do not scale well for large scale 
> distributed systems. 
> 
> While designing a more flexible and scalable event logging systems, we think it is better
> to build the log in block layer. Block level event logging covers all major storage media
> (SCSI, SATA, NVMe), and thus minimizes redundant work for different protocols. 
> 
> In this LSF/MM, we would like to discuss the following topics with the community:
>     1. Mechanism for drivers report events (or errors) to block layer. 
>        Basically, we will need a traceable function for the drivers to report errors 
>        (most likely right before calling end_request or bio_endio).  
>   
>     2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event logging?
> 
>     3. How should we categorize different events?
>        Currently, there are existing code that translates ATA error (ata_to_sense_error) 
>        and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can 
>        leverage SCSI Key Code Qualifier for event categorizations. 
> 
>     4. Detailed discussions on data structure for event logging. 
> 
> We will be able to show a prototype implementation during LSF/MM. 

This is an interesting topic. For stacked block devices, all layers
higher than the faulty layer will observe the media error; reporting the
underlying failure in every layer may introduce quite a lot of noise.

Yes, I am willing to attend this discussion.

Thanks.

Coly Li

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-18 23:34 [LSF/MM TOPIC] block level event logging for storage media management Song Liu
  2017-01-19  0:11 ` Bart Van Assche
  2017-01-19  6:32 ` Coly Li
@ 2017-01-19  6:48 ` Hannes Reinecke
  2017-01-21  5:46 ` Dan Williams
  3 siblings, 0 replies; 11+ messages in thread
From: Hannes Reinecke @ 2017-01-19  6:48 UTC (permalink / raw)
  To: Song Liu, lsf-pc; +Cc: Jens Axboe, Kernel Team, linux-block

On 01/19/2017 12:34 AM, Song Liu wrote:
> 
> Media health monitoring is very important for large scale distributed storage systems. 
> Traditionally, enterprise storage controllers maintain event logs for attached storage
> devices. However, these controller managed logs do not scale well for large scale 
> distributed systems. 
> 
> While designing a more flexible and scalable event logging systems, we think it is better
> to build the log in block layer. Block level event logging covers all major storage media
> (SCSI, SATA, NVMe), and thus minimizes redundant work for different protocols. 
> 
> In this LSF/MM, we would like to discuss the following topics with the community:
>     1. Mechanism for drivers report events (or errors) to block layer. 
>        Basically, we will need a traceable function for the drivers to report errors 
>        (most likely right before calling end_request or bio_endio).  
>   
>     2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event logging?
> 
>     3. How should we categorize different events?
>        Currently, there are existing code that translates ATA error (ata_to_sense_error) 
>        and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can 
>        leverage SCSI Key Code Qualifier for event categorizations. 
> 
>     4. Detailed discussions on data structure for event logging. 
> 
> We will be able to show a prototype implementation during LSF/MM. 
> 
Very good topic; I'm very much in favour of it.

That ties in rather nicely with my multipath redesign, where I've added
a notifier chain for block events.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		   Teamlead Storage & Networking
hare@suse.de			               +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-18 23:34 [LSF/MM TOPIC] block level event logging for storage media management Song Liu
                   ` (2 preceding siblings ...)
  2017-01-19  6:48 ` Hannes Reinecke
@ 2017-01-21  5:46 ` Dan Williams
  2017-01-23  6:00   ` Song Liu
  3 siblings, 1 reply; 11+ messages in thread
From: Dan Williams @ 2017-01-21  5:46 UTC (permalink / raw)
  To: Song Liu; +Cc: lsf-pc, Jens Axboe, Kernel Team, linux-block, Verma, Vishal L

On Wed, Jan 18, 2017 at 3:34 PM, Song Liu <songliubraving@fb.com> wrote:
>
> Media health monitoring is very important for large scale distributed storage systems.
> Traditionally, enterprise storage controllers maintain event logs for attached storage
> devices. However, these controller managed logs do not scale well for large scale
> distributed systems.
>
> While designing a more flexible and scalable event logging systems, we think it is better
> to build the log in block layer. Block level event logging covers all major storage media
> (SCSI, SATA, NVMe), and thus minimizes redundant work for different protocols.
>
> In this LSF/MM, we would like to discuss the following topics with the community:
>     1. Mechanism for drivers report events (or errors) to block layer.
>        Basically, we will need a traceable function for the drivers to report errors
>        (most likely right before calling end_request or bio_endio).
>
>     2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event logging?
>
>     3. How should we categorize different events?
>        Currently, there are existing code that translates ATA error (ata_to_sense_error)
>        and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can
>        leverage SCSI Key Code Qualifier for event categorizations.
>
>     4. Detailed discussions on data structure for event logging.
>
> We will be able to show a prototype implementation during LSF/MM.

Hi Song,

How is this distinct from tracking a badblocks list?

I'm interested in this topic since we have both media error reporting
/ scrubbing for nvdimms as well "SMART" media health retrieval
commands.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-21  5:46 ` Dan Williams
@ 2017-01-23  6:00   ` Song Liu
  2017-01-23  7:27     ` Dan Williams
  0 siblings, 1 reply; 11+ messages in thread
From: Song Liu @ 2017-01-23  6:00 UTC (permalink / raw)
  To: Dan Williams
  Cc: lsf-pc, Jens Axboe, Kernel Team, linux-block, Verma, Vishal L

Hi Dan,

I think the the block level event log is more like log only system. When en event
happens,  it is not necessary to take immediate action. (I guess this is different
to bad block list?).

I would hope the event log to track more information. Some of these individual
event may not be very interesting, for example, soft error or latency outliers.
However, when we gather event log for a fleet of devices, these "soft event"
may become valuable for health monitoring.

Thanks,
Song


> On Jan 20, 2017, at 9:46 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Jan 18, 2017 at 3:34 PM, Song Liu <songliubraving@fb.com> wrote:
>>
>> Media health monitoring is very important for large scale distributed storage systems.
>> Traditionally, enterprise storage controllers maintain event logs for attached storage
>> devices. However, these controller managed logs do not scale well for large scale
>> distributed systems.
>>
>> While designing a more flexible and scalable event logging systems, we think it is better
>> to build the log in block layer. Block level event logging covers all major storage media
>> (SCSI, SATA, NVMe), and thus minimizes redundant work for different protocols.
>>
>> In this LSF/MM, we would like to discuss the following topics with the community:
>>    1. Mechanism for drivers report events (or errors) to block layer.
>>       Basically, we will need a traceable function for the drivers to report errors
>>       (most likely right before calling end_request or bio_endio).
>>
>>    2. What mechanism (ftrace, BPF, etc.) is mostly preferred for the event logging?
>>
>>    3. How should we categorize different events?
>>       Currently, there are existing code that translates ATA error (ata_to_sense_error)
>>       and NVMe error (nvme_trans_status_code) to SCSI sense code. So we can
>>       leverage SCSI Key Code Qualifier for event categorizations.
>>
>>    4. Detailed discussions on data structure for event logging.
>>
>> We will be able to show a prototype implementation during LSF/MM.
>
> Hi Song,
>
> How is this distinct from tracking a badblocks list?
>
> I'm interested in this topic since we have both media error reporting
> / scrubbing for nvdimms as well "SMART" media health retrieval
> commands.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-23  6:00   ` Song Liu
@ 2017-01-23  7:27     ` Dan Williams
  2017-01-24 20:18       ` Oleg Drokin
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Williams @ 2017-01-23  7:27 UTC (permalink / raw)
  To: Song Liu
  Cc: lsf-pc, Jens Axboe, Kernel Team, linux-block, Verma, Vishal L, green

[ adding Oleg ]

On Sun, Jan 22, 2017 at 10:00 PM, Song Liu <songliubraving@fb.com> wrote:
> Hi Dan,
>
> I think the the block level event log is more like log only system. When en event
> happens,  it is not necessary to take immediate action. (I guess this is different
> to bad block list?).
>
> I would hope the event log to track more information. Some of these individual
> event may not be very interesting, for example, soft error or latency outliers.
> However, when we gather event log for a fleet of devices, these "soft event"
> may become valuable for health monitoring.

I'd be interested in this. It sounds like you're trying to fill a gap
between tracing and console log messages which I believe others have
encountered as well.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-23  7:27     ` Dan Williams
@ 2017-01-24 20:18       ` Oleg Drokin
  2017-01-24 23:17         ` Song Liu
  2017-01-25  9:56         ` [Lsf-pc] " Jan Kara
  0 siblings, 2 replies; 11+ messages in thread
From: Oleg Drokin @ 2017-01-24 20:18 UTC (permalink / raw)
  To: Dan Williams
  Cc: Song Liu, lsf-pc, Jens Axboe, Kernel Team, linux-block, Verma,
	Vishal L, Andreas Dilger, Greg Kroah-Hartman


On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:

> [ adding Oleg ]
> 
> On Sun, Jan 22, 2017 at 10:00 PM, Song Liu <songliubraving@fb.com> wrote:
>> Hi Dan,
>> 
>> I think the the block level event log is more like log only system. When en event
>> happens,  it is not necessary to take immediate action. (I guess this is different
>> to bad block list?).
>> 
>> I would hope the event log to track more information. Some of these individual
>> event may not be very interesting, for example, soft error or latency outliers.
>> However, when we gather event log for a fleet of devices, these "soft event"
>> may become valuable for health monitoring.
> 
> I'd be interested in this. It sounds like you're trying to fill a gap
> between tracing and console log messages which I believe others have
> encountered as well.

We have a somewhat similar problem problem in Lustre and I guess it's not just Lustre.
Currently there are all sorts of conditional debug code all over the place that goes
to the console and when you enable it for anything verbose, you quickly overflow
your dmesg buffer no matter the size, that might be mostly ok for local
"block level" stuff, but once you become distributed, it start to be a mess
and once you get to be super large it worsens even more since you need to
somehow coordinate data from multiple nodes, ensure all of it is not lost and still
you don't end up using a lot of it since only a few nodes end up being useful.
(I don't know how NFS people manage to debug complicated issues using just this,
could not be super easy).

Having some sort of a buffer of a (potentially very) large size that could be
storing the data until it's needed, or eagerly polled by some daemon for storage
(helpful when you expect a lot of data that definitely won't fit in RAM).

Tracepoints have the buffer and the daemon, but creating new messages is
very cumbersome, so converting every debug message into one does not look very feasible.
Also it's convenient to have "event masks" one want logged that I don't think you could
do with tracepoints.

I know you were talking about reporting events to the block layer, but other than plain
errors what would block layer do with them? just a convenient way to map messages
to a particular device? You don't plan to store it on some block device as part
of the block layer, right?

Implementing such a buffer all sorts of additional generic data might be
collected automatically for all events as part of the buffer format like
what cpu did emit it, time, stack usage information, current pid,
backtrace (tracepoint-alike could be optional), actual source code location of
the message, …

Having something like that being standard part of {dev,pr}_{dbg,warn,...} and friends
would be super awesome too, I imagine (adding Greg to CC for that).
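
Purely to make that list concrete, one hypothetical layout for a record in such a
buffer (every name below is made up for illustration, not a proposal):

#include <linux/types.h>

struct blk_event_rec {
	u64           ts_ns;        /* timestamp of the event */
	u32           cpu;          /* cpu that emitted it */
	u32           pid;          /* current pid */
	u64           mask;         /* "event mask" bits used for filtering */
	dev_t         dev;          /* device the message maps to, if any */
	const char   *file;         /* source code location of the message */
	unsigned int  line;
	unsigned int  stack_depth;  /* optional, tracepoint-alike backtrace */
	unsigned long stack[8];
	char          msg[104];     /* formatted text */
};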


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-24 20:18       ` Oleg Drokin
@ 2017-01-24 23:17         ` Song Liu
  2017-01-25  9:56         ` [Lsf-pc] " Jan Kara
  1 sibling, 0 replies; 11+ messages in thread
From: Song Liu @ 2017-01-24 23:17 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: Dan Williams, lsf-pc, Jens Axboe, Kernel Team, linux-block,
	Verma, Vishal L, Andreas Dilger, Greg Kroah-Hartman

> On Jan 24, 2017, at 12:18 PM, Oleg Drokin <green@linuxhacker.ru> wrote:
>
>
> On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
>
>> [ adding Oleg ]
>>
>> On Sun, Jan 22, 2017 at 10:00 PM, Song Liu <songliubraving@fb.com> wrote:
>>> Hi Dan,
>>>
>>> I think the the block level event log is more like log only system. When en event
>>> happens,  it is not necessary to take immediate action. (I guess this is different
>>> to bad block list?).
>>>
>>> I would hope the event log to track more information. Some of these individual
>>> event may not be very interesting, for example, soft error or latency outliers.
>>> However, when we gather event log for a fleet of devices, these "soft event"
>>> may become valuable for health monitoring.
>>
>> I'd be interested in this. It sounds like you're trying to fill a gap
>> between tracing and console log messages which I believe others have
>> encountered as well.
>
> We have a somewhat similar problem problem in Lustre and I guess it's not just Lustre.
> Currently there are all sorts of conditional debug code all over the place that goes
> to the console and when you enable it for anything verbose, you quickly overflow
> your dmesg buffer no matter the size, that might be mostly ok for local
> "block level" stuff, but once you become distributed, it start to be a mess
> and once you get to be super large it worsens even more since you need to
> somehow coordinate data from multiple nodes, ensure all of it is not lost and still
> you don't end up using a lot of it since only a few nodes end up being useful.
> (I don't know how NFS people manage to debug complicated issues using just this,
> could not be super easy).
>
> Having some sort of a buffer of a (potentially very) large size that could be
> storing the data until it's needed, or eagerly polled by some daemon for storage
> (helpful when you expect a lot of data that definitely won't fit in RAM).
>
> Tracepoints have the buffer and the daemon, but creating new messages is
> very cumbersome, so converting every debug message into one does not look very feasible.
> Also it's convenient to have "event masks" one want logged that I don't think you could
> do with tracepoints.
>
> I know you were talking about reporting events to the block layer, but other than plain
> errors what would block layer do with them? just a convenient way to map messages
> to a particular device? You don't plan to store it on some block device as part
> of the block layer, right?
>
> Implementing such a buffer all sorts of additional generic data might be
> collected automatically for all events as part of the buffer format like
> what cpu did emit it, time, stack usage information, current pid,
> backtrace (tracepoint-alike could be optional), actual source code location of
> the message, …
>
> Having something like that being standard part of {dev,pr}_{dbg,warn,...} and friends
> would be super awesome too, I imagine (adding Greg to CC for that).
>

Hi Oleg,

Thanks for sharing these insights.

We built an event logger that parses dmesg to get events. For similar reasons as you described
above, it doesn't work well. And one of the biggest issue is poor "event mask" support. I am
hoping get better event mask in newer implementation, for example, with kernel tracing filter, or
implement customized logic in BPF.

With a relatively mature infrastructure, we don't have much problem storing logs from the event
logger. Specifically, we use a daemon that collects events and send them to distributed storage
(HDFS+HIVE). It might be an overkill for smaller deployment.

We do use information from similar (not exactly the one above) logs to make decision about
device handling. For example, if a drive throws too much medium error in short period of time,
we will kick it out of production. I think it is not necessary to include this in the block layer.

Overall, I am hoping the kernel can generate accurate events, with flexible filter/mask support.
There are different ways to store and consume these data. I guess most of these will be
implemented in user space. Let's discuss potential use cases and requirements. These
discussions should help us build the kernel part of the event log.

Thanks,
Song

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-24 20:18       ` Oleg Drokin
  2017-01-24 23:17         ` Song Liu
@ 2017-01-25  9:56         ` Jan Kara
  2017-01-25 18:30           ` Oleg Drokin
  1 sibling, 1 reply; 11+ messages in thread
From: Jan Kara @ 2017-01-25  9:56 UTC (permalink / raw)
  To: Oleg Drokin
  Cc: Dan Williams, linux-block, Song Liu, Andreas Dilger, Verma,
	Vishal L, Jens Axboe, Greg Kroah-Hartman, Kernel Team, lsf-pc

On Tue 24-01-17 15:18:57, Oleg Drokin wrote:
> 
> On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
> 
> > [ adding Oleg ]
> > 
> > On Sun, Jan 22, 2017 at 10:00 PM, Song Liu <songliubraving@fb.com> wrote:
> >> Hi Dan,
> >> 
> >> I think the the block level event log is more like log only system. When en event
> >> happens,  it is not necessary to take immediate action. (I guess this is different
> >> to bad block list?).
> >> 
> >> I would hope the event log to track more information. Some of these individual
> >> event may not be very interesting, for example, soft error or latency outliers.
> >> However, when we gather event log for a fleet of devices, these "soft event"
> >> may become valuable for health monitoring.
> > 
> > I'd be interested in this. It sounds like you're trying to fill a gap
> > between tracing and console log messages which I believe others have
> > encountered as well.
> 
> We have a somewhat similar problem problem in Lustre and I guess it's not
> just Lustre.  Currently there are all sorts of conditional debug code all
> over the place that goes to the console and when you enable it for
> anything verbose, you quickly overflow your dmesg buffer no matter the
> size, that might be mostly ok for local "block level" stuff, but once you
> become distributed, it start to be a mess and once you get to be super
> large it worsens even more since you need to somehow coordinate data from
> multiple nodes, ensure all of it is not lost and still you don't end up
> using a lot of it since only a few nodes end up being useful.  (I don't
> know how NFS people manage to debug complicated issues using just this,
> could not be super easy).
> 
> Having some sort of a buffer of a (potentially very) large size that
> could be storing the data until it's needed, or eagerly polled by some
> daemon for storage (helpful when you expect a lot of data that definitely
> won't fit in RAM).
> 
> Tracepoints have the buffer and the daemon, but creating new messages is
> very cumbersome, so converting every debug message into one does not look
> very feasible.  Also it's convenient to have "event masks" one want
> logged that I don't think you could do with tracepoints.

So creating trace points IMO isn't that cumbersome. I agree that converting
hundreds or thousands debug printks into tracepoints is a pain in the
ass but still it is doable. WRT filtering, you can enable each tracepoint
individually. Granted that is not exactly the 'event mask' feature you ask
about but that can be easily scripted in userspace if you give some
structure to tracepoint names. Finally tracepoints provide a fine grained
control you never get with printk - e.g. you can make a tracepoint trigger
only if specific inode is involved with trace filters which greatly reduces
the amount of output.
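
For example, a small userspace sketch of that kind of per-field filtering, using the
existing block_rq_complete tracepoint (the usable field names come from that event's
format file in tracefs, and the device number below is only an example):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *s)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, s, strlen(s)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	const char *ev = "/sys/kernel/debug/tracing/events/block/block_rq_complete";
	char path[256];

	/* Only record completions for one device, MKDEV(8, 16) == 8388624. */
	snprintf(path, sizeof(path), "%s/filter", ev);
	write_str(path, "dev == 8388624");

	snprintf(path, sizeof(path), "%s/enable", ev);
	write_str(path, "1");

	return 0;
}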

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Lsf-pc] [LSF/MM TOPIC] block level event logging for storage media management
  2017-01-25  9:56         ` [Lsf-pc] " Jan Kara
@ 2017-01-25 18:30           ` Oleg Drokin
  0 siblings, 0 replies; 11+ messages in thread
From: Oleg Drokin @ 2017-01-25 18:30 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dan Williams, linux-block, Song Liu, Andreas Dilger, Verma,
	Vishal L, Jens Axboe, Greg Kroah-Hartman, Kernel Team, lsf-pc


On Jan 25, 2017, at 4:56 AM, Jan Kara wrote:

> On Tue 24-01-17 15:18:57, Oleg Drokin wrote:
>> 
>> On Jan 23, 2017, at 2:27 AM, Dan Williams wrote:
>> 
>>> [ adding Oleg ]
>>> 
>>> On Sun, Jan 22, 2017 at 10:00 PM, Song Liu <songliubraving@fb.com> wrote:
>>>> Hi Dan,
>>>> 
>>>> I think the the block level event log is more like log only system. When en event
>>>> happens,  it is not necessary to take immediate action. (I guess this is different
>>>> to bad block list?).
>>>> 
>>>> I would hope the event log to track more information. Some of these individual
>>>> event may not be very interesting, for example, soft error or latency outliers.
>>>> However, when we gather event log for a fleet of devices, these "soft event"
>>>> may become valuable for health monitoring.
>>> 
>>> I'd be interested in this. It sounds like you're trying to fill a gap
>>> between tracing and console log messages which I believe others have
>>> encountered as well.
>> 
>> We have a somewhat similar problem problem in Lustre and I guess it's not
>> just Lustre.  Currently there are all sorts of conditional debug code all
>> over the place that goes to the console and when you enable it for
>> anything verbose, you quickly overflow your dmesg buffer no matter the
>> size, that might be mostly ok for local "block level" stuff, but once you
>> become distributed, it start to be a mess and once you get to be super
>> large it worsens even more since you need to somehow coordinate data from
>> multiple nodes, ensure all of it is not lost and still you don't end up
>> using a lot of it since only a few nodes end up being useful.  (I don't
>> know how NFS people manage to debug complicated issues using just this,
>> could not be super easy).
>> 
>> Having some sort of a buffer of a (potentially very) large size that
>> could be storing the data until it's needed, or eagerly polled by some
>> daemon for storage (helpful when you expect a lot of data that definitely
>> won't fit in RAM).
>> 
>> Tracepoints have the buffer and the daemon, but creating new messages is
>> very cumbersome, so converting every debug message into one does not look
>> very feasible.  Also it's convenient to have "event masks" one want
>> logged that I don't think you could do with tracepoints.
> 
> So creating trace points IMO isn't that cumbersome. I agree that converting
> hundreds or thousands debug printks into tracepoints is a pain in the
> ass but still it is doable. WRT filtering, you can enable each tracepoint
> individually. Granted that is not exactly the 'event mask' feature you ask
> about but that can be easily scripted in userspace if you give some
> structure to tracepoint names. Finally tracepoints provide a fine grained
> control you never get with printk - e.g. you can make a tracepoint trigger
> only if specific inode is involved with trace filters which greatly reduces
> the amount of output.

Oh, I am not dissing tracepoints, don't get me wrong, they add valuable things
at a fine-grained level when you have necessary details.
The problem is sometimes there are bugs where you don't have enough of knowledge
beforehand so you cannot do some fine-grained debug.
Think of a 10.000 nodes cluster (heck make it even 100 or probably even 10)
with a report of "when running a moderately sized job, there's a hang/something weird/
some unexpected data corruption" that does not occur when run on a single node,
so often what you resort to is the shotgun approach where you enable all debug (or
selective like "Everything in ldlm and everything rpc related) you
could everywhere, then run the job for however long it takes to reproduce and then
once reproduced, you sift through those logs reconstructing the picture back together
only to discover there was this weird race on one of the clients only when
some lock was contended but then the grant RPC and some userspace action
coincided or some such.
the dev_dbg() and nfs's /proc/sys/sunrpc/*debug are somewhat similar, only they dump
to dmesg which is quite limited in buffer size, adds huge delays if it goes out to some
slow console, wipes other potentially useful messages from the buffer in process
and such.

I guess you could print script tracepoints with a pattern in their name too,
but then there's this pain in the ass of converting:
$ git grep CERROR  drivers/staging/lustre/ | wc -l
1069
$ git grep CDEBUG  drivers/staging/lustre/ | wc -l
1140

messages AND there's also this thing that I do want many of those output to console
(because they are important enough) and to the buffer (so I can see them relative to
other debug messages I do not want to go to the console).

if tracepoints could be extended to enable that much - I'd be a super happy camper,
of course.
Sure, you cannot just make a macro that wraps the whole print into a tracepoint, but
that would be a stupid tracepoint with no finegrained control whatsoever,
but perhaps we can do name arguments or some such so that when you do

TRACEPOINT(someid, priority, "format string", some_value, some_other_value, …);

then if priority includes TPOINT_CONSOLE - it would also always go to console and to
the tracepoint buffer, and I can use the some_value and some_other_value
as actual matches for things (sure, that would limit you to just variables with
no logic done on them, but that's ok, I guess, could always be precalculated if
really necessary).
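
For what it's worth, a minimal sketch of that kind of wrapper, with TPOINT_CONSOLE as a
hypothetical flag and trace_printk() standing in for a real per-message tracepoint:

#include <linux/kernel.h>
#include <linux/printk.h>

#define TPOINT_CONSOLE	0x1	/* hypothetical: also mirror to the console */

/*
 * Sketch only: every message always goes to the trace ring buffer;
 * messages flagged TPOINT_CONSOLE additionally go to dmesg/console.
 * A real version would emit a proper tracepoint with typed fields
 * (usable as filter matches) instead of trace_printk().
 */
#define CDEBUG_TRACE(prio, fmt, ...)					\
do {									\
	if ((prio) & TPOINT_CONSOLE)					\
		printk(KERN_WARNING fmt, ##__VA_ARGS__);		\
	trace_printk(fmt, ##__VA_ARGS__);				\
} while (0)

/* usage: CDEBUG_TRACE(TPOINT_CONSOLE, "lock %p contended\n", lock); */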

Hm, trying to research if you can extract the tracepoint buffer from a kernel crashdump
(and if anybody already happened to write a crash module for it yet), I also
stumbled upon LKST - http://elinux.org/Linux_Kernel_State_Tracer
(no idea how stale that is, but the page is from 2011 and last patch is from 2 years
ago) - this also implements a buffer and all sorts of extra event tracing,
so it appears to underscore the demand for such things is there and existing
mechanisms don't deliver for one reason or another.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-01-25 18:30 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-18 23:34 [LSF/MM TOPIC] block level event logging for storage media management Song Liu
2017-01-19  0:11 ` Bart Van Assche
2017-01-19  6:32 ` Coly Li
2017-01-19  6:48 ` Hannes Reinecke
2017-01-21  5:46 ` Dan Williams
2017-01-23  6:00   ` Song Liu
2017-01-23  7:27     ` Dan Williams
2017-01-24 20:18       ` Oleg Drokin
2017-01-24 23:17         ` Song Liu
2017-01-25  9:56         ` [Lsf-pc] " Jan Kara
2017-01-25 18:30           ` Oleg Drokin
