linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [ANNOUNCE] rasdaemon userspace tool v.0.1
@ 2013-03-14 18:38 Mauro Carvalho Chehab
  2013-03-15  7:18 ` Borislav Petkov
  0 siblings, 1 reply; 2+ messages in thread
From: Mauro Carvalho Chehab @ 2013-03-14 18:38 UTC (permalink / raw)
  To: linux-edac
  Cc: LKML, Steven Rostedt, Tony Luck, Borislav Petkov, Mark A. Grondona

In Kernel 3.5 we've started to address the long-discussed need of having a
better way to handle platform Reliability, Availability and Serviceability
(RAS).

Basically, a tracepoint event that handles memory errors called ras:mc_event
was added there, together with HERM/EDAC version 3.0 patches.

In Kernel 3.8, a new event was added, to handle PCIe AER events (ras:aer_event)
[1].

On kernel 3.9, a new driver was added to report hardware memory errors that
comes from the BIOS via ras:mc_event (the new ghes_edac driver).

It is still on my TODO list to add a RAS trace event for non-memory related
errors that come via the MCA machine check handler (mcelog).

While progress made was made at Kernel infrastructure, the needed userspace
tools were still lacking. So, I decided to start materializing the userspace
counterpart for what it was informally named as rasdaemon on some discussions.

The rasdaemon tool is available	at: 
	http://git.infradead.org/users/mchehab/rasdaemon.git

The current version is on very early stages, and it has a copy on it of
the library that Steven Rostedt's is writing for trace-cmd tool. The plan
is to use the trace-cmd library, when it starts to packaged as a separate
library. I'd like to thanks Steven for the help he gave me to write this 
initial version.

The current version of the tool enables the ras:mc_event log, and reads
it via the raw trace debugfs node:
	/sys/kernel/debug/tracing/per_cpu/cpu*/trace_pipe_raw

It also has a code that allows recording the errors via an sqlite3 database.

The long term plan is to provide a tool that will catch and handle all
ras:* error events that comes from the Kernel tracing infrastructure,
logging them and providing tools to report it, being able to detect burst
errors (like the ones caused by a solar storm at memories) or sparsed errors,
in a way that would provide a glue to the users about the root cause of the
error.

Of course, there are much to do there.

It is a natural evolution of the tool to add support there for the 
ras:aer_event traces that can come from PCIe AER.

While it currently works with current Kernels since kernel 3.5, there are a
number of interesting changes at tracing that are planned to be merged for
Kernel 3.10:
	- poll() support for per_cpu trace_pipe_raw;
	- a timestamp that could more easily associated with machine's
	  uptime;
	- support for a separate ringbuffer for RAS.

So, it is planned the minimal requirement for the final version (v1.0)
would be kernel 3.10.

This is currently on very early staging. Help is needed ;)

So, please send us suggestions, patches etc to the EDAC mailing list:
	linux-edac@vger.kernel.org

Thanks!
Mauro

-

[1] Currently, ras:mc_event is at include/ras/. It is on my todo list
to move it to be together with ras:aer_event, at include/trace/events/ras.h.

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [ANNOUNCE] rasdaemon userspace tool v.0.1
  2013-03-14 18:38 [ANNOUNCE] rasdaemon userspace tool v.0.1 Mauro Carvalho Chehab
@ 2013-03-15  7:18 ` Borislav Petkov
  0 siblings, 0 replies; 2+ messages in thread
From: Borislav Petkov @ 2013-03-15  7:18 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: linux-edac, LKML, Steven Rostedt, Tony Luck, Mark A. Grondona,
	Robert Richter

On Thu, Mar 14, 2013 at 03:38:44PM -0300, Mauro Carvalho Chehab wrote:
> It is still on my TODO list to add a RAS trace event for non-memory
> related errors that come via the MCA machine check handler (mcelog).

That's already there: trace_mce_record.

> While progress made was made at Kernel infrastructure, the needed
> userspace tools were still lacking. So, I decided to start
> materializing the userspace counterpart for what it was informally
> named as rasdaemon on some discussions.

Well, I had RAS daemon long before you did:
http://marc.info/?l=linux-kernel&m=130357637429355&w=2

And Robert and I have recently restarted work on it so having two
different RAS daemons will be confusing to people. You probably could
call it "hermd" or "errord" or something better you can come up with.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2013-03-15  7:18 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-03-14 18:38 [ANNOUNCE] rasdaemon userspace tool v.0.1 Mauro Carvalho Chehab
2013-03-15  7:18 ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).