linux-kernel.vger.kernel.org archive mirror
From: Borislav Petkov <bp@alien8.de>
To: Mauro Carvalho Chehab <mchehab@redhat.com>
Cc: Linux Edac Mailing List <linux-edac@vger.kernel.org>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH RFC 2/2] events/hw_event: Create a Hardware Anomaly Report Mechanism (HARM)
Date: Thu, 24 Mar 2011 23:39:07 +0100
Message-ID: <20110324223907.GA10498@liondog.tnic>
In-Reply-To: <20110324173257.36680b90@pedra>

On Thu, Mar 24, 2011 at 05:32:57PM -0300, Mauro Carvalho Chehab wrote:
> Adds a trace class for handling hardware events
> 
> Part of the description below is shamelessly copied from Tony
> Luck's notes about the Hardware Error BoF during LPC 2010 [1].
> Tony, thanks for your notes and discussions to generate the
> h/w error reporting requirements.
> 
> [1] http://lwn.net/Articles/416669/
> 
>     We have several subsystems & methods for reporting hardware errors:
> 
>     1) EDAC ("Error Detection and Correction").  In its original form
>     this consisted of a platform specific driver that read topology
>     information and error counts from chipset registers and reported
>     the results via a sysfs interface.
> 
>     2) mcelog - x86 specific decoding of machine check bank registers
>     reporting in binary form via /dev/mcelog. Recent additions make use
>     of the APEI extensions that were documented in version 4.0a of the
>     ACPI specification to acquire more information about errors without
>     having to rely on reading chipset registers directly. A user-level
>     program decodes them into a somewhat human-readable format.
> 
>     3) drivers/edac/mce_amd.c  A recent addition - this driver hooks into
>     the mcelog path and decodes errors reported via machine check bank
>     registers in AMD processors to the console log using printk() [despite
>     being in the drivers/edac directory, this seems completely different
>     from classic EDAC to me].

Well, maybe it is time to rename drivers/edac/ to drivers/ras/ where all
RAS stuff should go.

[...]

> diff --git a/include/trace/events/hw_event.h b/include/trace/events/hw_event.h
> new file mode 100644
> index 0000000..a46ac61
> --- /dev/null
> +++ b/include/trace/events/hw_event.h
> @@ -0,0 +1,322 @@
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM hw_event
> +
> +#if !defined(_TRACE_HW_EVENT_MC_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_HW_EVENT_MC_H
> +
> +#include <linux/tracepoint.h>
> +#include <linux/edac.h>
> +
> +/*
> + * Hardware Anomaly Report Mechanism (HARM) events
> + *
> + * Those events are generated when hardware detected a corrected or
> + * uncorrected event, and are meant to replace the current API to report
> + * errors defined on both EDAC and MCE subsystems.
> + */
> +
> +DECLARE_EVENT_CLASS(hw_event_class,
> +	TP_PROTO(const char *type, unsigned int instance),
> +	TP_ARGS(type, instance),
> +
> +	TP_STRUCT__entry(
> +		__field(	const char *,	type			)
> +		__field(	unsigned int,	instance		)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->type	= type;
> +		__entry->instance = instance;
> +	),
> +
> +	TP_printk("Initialized %s#%d\n",
> +		__entry->type,
> +		__entry->instance)
> +);
> +
> +/*
> + * This event indicates that a hardware collection mechanism is started
> + */
> +DEFINE_EVENT(hw_event_class, hw_event_init,
> +
> +	TP_PROTO(const char *type, unsigned int instance),
> +
> +	TP_ARGS(type, instance)
> +);
> +
> +
> +/*
> + * Memory Controller specific events
> + */

I think this is too fine-grained. All those error records are of type
MCE, so there's no need for a separate trace event for each error type
(corrected, uncorrected, out of range, etc.). Instead, add a flags
argument to the trace_mce_record() tracepoint so that you can
differentiate between the different error records in the trace buffer.
Then add additional fields like the ones above for the MCEs which
report a DRAM ECC error.
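
To make the idea concrete, here is a rough userspace sketch, not kernel
code and not the actual trace_mce_record() signature. The flag names,
the struct layout and the helper functions are all hypothetical and
only illustrate how one record type plus a flags field could replace a
separate event per error type:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical flag bits; illustrative names, not an existing kernel API. */
enum hw_err_flags {
	HW_ERR_CORRECTED    = 1u << 0,
	HW_ERR_UNCORRECTED  = 1u << 1,
	HW_ERR_OUT_OF_RANGE = 1u << 2,
	HW_ERR_DRAM_ECC     = 1u << 3,	/* DRAM-specific fields are valid */
};

/* One record type for every MCE-derived error. */
struct hw_err_record {
	uint32_t flags;		/* combination of enum hw_err_flags */
	uint8_t  bank;		/* machine check bank number */
	uint64_t status;	/* raw bank status value */
	/* The fields below are meaningful only if HW_ERR_DRAM_ECC is set. */
	uint8_t  mc_index;
	uint16_t csrow;
	uint8_t  channel;
};

/* Map the severity bits to a short name; uncorrected takes precedence. */
static const char *hw_err_severity(uint32_t flags)
{
	if (flags & HW_ERR_UNCORRECTED)
		return "uncorrected";
	if (flags & HW_ERR_CORRECTED)
		return "corrected";
	return "unknown";
}

/*
 * Format one record into buf. The common part is always emitted; the
 * DRAM location is appended only when the flags say it is present.
 * Returns the number of characters the full line needs, like snprintf().
 */
static int hw_err_format(const struct hw_err_record *rec, char *buf, size_t len)
{
	int n;

	assert(rec && buf && len > 0);

	n = snprintf(buf, len, "mce: %s, bank %u, status 0x%016llx",
		     hw_err_severity(rec->flags), (unsigned int)rec->bank,
		     (unsigned long long)rec->status);
	if (n < 0 || (size_t)n >= len)
		return n;	/* error or truncated: nothing more to add */

	if (rec->flags & HW_ERR_DRAM_ECC)
		n += snprintf(buf + n, len - (size_t)n,
			      ", mc%u csrow %u channel %u",
			      (unsigned int)rec->mc_index,
			      (unsigned int)rec->csrow,
			      (unsigned int)rec->channel);
	return n;
}
```

A consumer reading the trace buffer would then look at flags alone to
tell a corrected DRAM ECC error from, say, an uncorrected error in
another bank, without needing a distinct event for each case.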

IOW, what we need are two basic error records (tracepoints, etc.): MCEs
and PCI(e) errors which are derived from the hw_event_class.

Btw, I've played with the MCE tracepoint extension a bit and it looks
doable: http://lkml.org/lkml/2010/5/15/40.

-- 
Regards/Gruss,
    Boris.
