[RFC] persistent store

* [RFC] persistent store
@ 2010-11-20 23:48 Luck, Tony
  2010-11-21  9:07 ` Borislav Petkov
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Luck, Tony @ 2010-11-20 23:48 UTC (permalink / raw)
  To: linux-kernel, linux-arch; +Cc: tglx, mingo, greg, akpm, ying.huang

Here's a patch based on some discussions I had with Thomas
Gleixner at plumbers conference that implements a generic
layer for persistent storage usable to pass tens or hundreds
of kilobytes of data from the dying breath of a crashing
kernel to its successor.

The usage model I'm envisioning is that a platform driver
will register with this code to provide the actual storage.
I've tried to make this interface general, but I'm working
from a sample of one (the ACPI ERST code), so if anyone else
has some persistent store that can't be handled by this code,
speak up and we can put in the necessary tweaks.

My assumptions are that the data that Linux cares about will
be wrapped in some error record structure with a header, and
possibly a footer that the device code needs. So the driver
specifies how much padding to put around a buffer to make
life easy for it.  It also specifies the maximum number of
bytes that can be saved in one record.

There are three callback functions from the generic code to
the driver:

"reader" which iterates over all records currently in the
store - returning type, size and a record identifier as
well as the actual data.

"writer" which writes a record with a type to the persistent store

"eraser" which takes a record identifier, and clears that item
from the store.


The Linux user visible interface is via /sys (similar to
the "efivars" interface)

# ls -l /sys/firmware/pstore
total 0
-r--r--r-- 1 root root 0 2010-11-20 11:03 dmesg-0
--w------- 1 root root 0 2010-11-20 11:03 erase

The "type" of error record I mentioned earlier is used to
name the files ... saved console logs from kmsg_dmp() are
named with a "dmesg" prefix as shown above.

Once an error record has been viewed, analysed, saved. The
user can request it to be cleared by writing its name to the
"erase" file:

# echo "dmesg-0" > erase

Answers to a few questions that I think you might ask:

1) "Why do you only allow one platform driver to register?"
  I only have one such driver.  Adding more is easy from the "read" side
  (just collect all the records from all devices and remember where they
  came from so you can call the correct "eraser" function).  But the "write"
  side opens up questions that I don't have good answers for:
  - Which device(s) should error records be written to?
    All of them? Start with one and move on when it is
    full?  Write some types of records to one device?
  If someone has a machine with multiple persistent storage devices -
  then we can talk about how to answer these questions.

2) "Why do you read in all the data from the device when it
registers and save it in memory?  Couldn't you just get the
list of records and pick up the data from the device when
the user reads the file?"
  I don't think this is going to be very much data, just a few hundred
  kilobytes (i.e. less that $0.01 worth of memory, even expensive server
  memory).  The memory is freed when the record is erased ... which is
  likely to be soon after boot.

3) "/sys/firmware/pstore is the wrong pathname for this".
  You are probably right. I put it under "firmware" because that's where
  the "efivars" driver put its top level directory. In my case the ERST
  back end is firmware, so there is some vague logic to it ... but better
  suggestions are welcome. Perhaps /sys/devices/platform/pstore?

4) "/sys is the wrong place for this."
  Perhaps.  I definitely want to use some sort of filesystem interface (so
  each record shows up as a file to the user). This seems a lot cleaner
  than trying to map the semantics of actual persistent storage devices
  onto a character device.  The "sysfs_create_bin_file()" API seems very
  well designed for this usage.  If not /sys, then where?  "debugfs"
  would work - but not everyone mounts debugfs. Creating a whole new
  filesystem for this seems like overkill.

5) "Why is the record identifier type 'u64'?"
  This is one place where I knowingly let the ERST implementation bleed
  all the way up to the top - it uses 64-bit record numbers. It would be
  possible to map these to something smaller like "int" ... but the code
  to do so would be far larger than the memory saved.  The most common
  usage case is likely to be a software crash with just one "dmesg" record.

6) "Is this widely useful? How many systems have persistent storage?"
  Although ERST was only added to the ACPI spec earlier this year, it
  merely documents existing functionality required for WHEA (Windows
  Hardware Error Architecture).  So most modern server systems should
  have it (my test system has it, and it has a BIOS written in mid 2008).
  Sorry desktops & laptops - no love for you here.

No-sign-off-yet-this-is-just-RFC

-Tony

---

diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig
index e8b6a13..06afe40 100644
--- a/drivers/firmware/Kconfig
+++ b/drivers/firmware/Kconfig
@@ -134,4 +134,16 @@ config ISCSI_IBFT
 	  detect iSCSI boot parameters dynamically during system boot, say Y.
 	  Otherwise, say N.
 
+config PSTORE
+	tristate "Persistant store support via /sys"
+	default n
+	help
+	  This option enables generic access to platform level persistent
+	  storage via /sys/firmware/pstore.  Only useful if you have a
+	  platform level driver that registers with pstore to provide
+	  the data, so you probably should just go say "Y" (or "M") to
+	  a platform specific persistent store driver (e.g. ACPI_APEI on
+	  X86) which will select this for you.  If you don't have a platform
+	  persistent store driver, say N.
+
 endmenu
diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile
index 1c3c173..ba19784 100644
--- a/drivers/firmware/Makefile
+++ b/drivers/firmware/Makefile
@@ -11,3 +11,4 @@ obj-$(CONFIG_DMIID)		+= dmi-id.o
 obj-$(CONFIG_ISCSI_IBFT_FIND)	+= iscsi_ibft_find.o
 obj-$(CONFIG_ISCSI_IBFT)	+= iscsi_ibft.o
 obj-$(CONFIG_FIRMWARE_MEMMAP)	+= memmap.o
+obj-$(CONFIG_PSTORE)		+= pstore.o
diff --git a/drivers/firmware/pstore.c b/drivers/firmware/pstore.c
new file mode 100644
index 0000000..e11b454
--- /dev/null
+++ b/drivers/firmware/pstore.c
@@ -0,0 +1,313 @@
+/*
+ * Persistent Storage - pstore.c
+ *
+ * Copyright (C) 2010 Intel Corporation <tony.luck@intel.com>
+ *
+ * This code is the generic layer to export data records from platform
+ * level persistent storage via sysfs.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License version 2 as
+ *  published by the Free Software Foundation.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+#include <linux/atomic.h>
+#include <linux/types.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/kmsg_dump.h>
+#include <linux/module.h>
+#include <linux/pstore.h>
+#include <linux/string.h>
+#include <linux/sysfs.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+MODULE_AUTHOR("Tony Luck <tony.luck@intel.com>");
+MODULE_DESCRIPTION("sysfs interface to persistent storage");
+MODULE_LICENSE("GPL");
+
+static DEFINE_SPINLOCK(pstore_lock);
+static LIST_HEAD(pstore_list);
+static struct kset *pstore_kset;
+
+#define	PSTORE_NAMELEN	16
+
+struct pstore_entry {
+	struct	bin_attribute attr;
+	char	name[PSTORE_NAMELEN];
+	u64	id;
+	int	type;
+	int	size;
+	struct	list_head list;
+	char	data[];
+};
+
+static int pstore_create_sysfs_entry(struct pstore_entry *new_pstore);
+
+static struct pstore_info *psinfo;
+
+static char *pstore_buf;
+
+/*
+ * callback from kmsg_dump. (s2,l2) has the most recently
+ * written bytes, older bytes are in (s1,l1). Save as much
+ * as we can from the end of the buffer.
+ */
+static void
+pstore_dump(struct kmsg_dumper *dumper, enum kmsg_dump_reason reason,
+	    const char *s1, unsigned long l1,
+	    const char *s2, unsigned long l2)
+{
+	unsigned long s1_start, s2_start;
+	unsigned long l1_cpy, l2_cpy;
+	char *dst = pstore_buf + psinfo->header_size;
+
+	/* Don't dump oopses to persistent store */
+	if (reason == KMSG_DUMP_OOPS)
+		return;
+
+	l2_cpy = min(l2, psinfo->data_size);
+	l1_cpy = min(l1, psinfo->data_size - l2_cpy);
+
+	s2_start = l2 - l2_cpy;
+	s1_start = l1 - l1_cpy;
+
+	memcpy(dst, s1 + s1_start, l1_cpy);
+	memcpy(dst + l1_cpy, s2 + s2_start, l2_cpy);
+
+	psinfo->writer(PSTORE_DMESG, pstore_buf, l1_cpy + l2_cpy);
+}
+
+static struct kmsg_dumper pstore_dumper = {
+	.dump = pstore_dump,
+};
+
+/*
+ * platform specific persistent storage driver registers with
+ * us here. Read out all the records right away and install
+ * them in /sys.  Register with kmsg_dump to save last part
+ * of console log on panic.
+ */
+int
+pstore_register(struct pstore_info *psi)
+{
+	struct	pstore_entry	*new_pstore;
+	int	rc = 0, type;
+	unsigned long size;
+	u64	id;
+	unsigned long ps_maxsize;
+
+	spin_lock(&pstore_lock);
+	if (psinfo) {
+		spin_unlock(&pstore_lock);
+		return -EBUSY;
+	}
+	psinfo = psi;
+	spin_unlock(&pstore_lock);
+
+	ps_maxsize = psi->header_size + psi->data_size + psi->footer_size;
+	pstore_buf = kzalloc(ps_maxsize, GFP_KERNEL);
+	if (!pstore_buf)
+		return -ENOMEM;
+	for (;;) {
+		if (psi->reader(&id, &type, pstore_buf, &size) <= 0)
+			break;
+		new_pstore = kzalloc(sizeof(struct pstore_entry) + size,
+				     GFP_KERNEL);
+		if (!new_pstore) {
+			rc = -ENOMEM;
+			break;
+		}
+		new_pstore->id = id;
+		new_pstore->type = type;
+		new_pstore->size = size;
+		memcpy(new_pstore->data, pstore_buf + psi->header_size, size);
+		if (pstore_create_sysfs_entry(new_pstore)) {
+			kfree(new_pstore);
+			rc =  -EINVAL;
+			break;
+		}
+	}
+
+	kobject_uevent(&pstore_kset->kobj, KOBJ_ADD);
+
+	kmsg_dump_register(&pstore_dumper);
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(pstore_register);
+
+int
+pstore_write(int type, char *buf, unsigned long size)
+{
+	if (!psinfo->writer)
+		return -ENODEV;
+	if (size > psinfo->data_size)
+		return -EFBIG;
+
+	memcpy(pstore_buf + psinfo->header_size, buf, size);
+	return psinfo->writer(type, pstore_buf, size);
+}
+EXPORT_SYMBOL_GPL(pstore_write);
+
+#define to_pstore_entry(obj)  container_of(obj, struct pstore_entry, attr)
+
+/*
+ * "read" function for files containing persistent store records
+ */
+static ssize_t pstore_show(struct file *filp, struct kobject *kobj,
+			   struct bin_attribute *bin_attr, char *buf,
+			   loff_t offset, size_t count)
+{
+	struct pstore_entry *ps = to_pstore_entry(bin_attr);
+
+	return memory_read_from_buffer(buf, count, &offset,
+			ps->data, ps->size);
+}
+
+/*
+ * Erase records by writing their filename to the "erase" file. E.g.
+ *	# echo "dmesg-0" > erase
+ */
+static ssize_t pstore_erase(struct file *filp, struct kobject *kobj,
+			    struct bin_attribute *bin_attr,
+			    char *buf, loff_t pos, size_t count)
+{
+	struct pstore_entry *search_pstore, *n;
+	int len1, len2, found = 0;
+
+	len1 = count;
+	if (buf[len1 - 1] == '\n')
+		len1--;
+
+	spin_lock(&pstore_lock);
+
+	/*
+	 * Find this record
+	 */
+	list_for_each_entry_safe(search_pstore, n, &pstore_list, list) {
+		len2 = strlen(search_pstore->name);
+		if (len1 == len2 && memcmp(buf, search_pstore->name,
+					   len1) == 0) {
+			found = 1;
+			break;
+		}
+	}
+	if (!found) {
+		spin_unlock(&pstore_lock);
+		return -EINVAL;
+	}
+
+	if (psinfo->eraser)
+		if (psinfo->eraser(search_pstore->id)) {
+			spin_unlock(&pstore_lock);
+			return -EIO;
+		}
+
+	list_del(&search_pstore->list);
+
+	spin_unlock(&pstore_lock);
+
+	sysfs_remove_bin_file(&pstore_kset->kobj, &search_pstore->attr);
+
+	return count;
+}
+
+static struct bin_attribute attr_erase = {
+	.attr = {.name = "erase", .mode = 0200},
+	.write = pstore_erase,
+};
+
+static int
+pstore_create_sysfs_entry(struct pstore_entry *new_pstore)
+{
+	static	atomic_t	next;
+	int	error, seq;
+
+	seq = atomic_add_return(1, &next);
+
+	switch (new_pstore->type) {
+	case PSTORE_DMESG:
+		sprintf(new_pstore->name, "dmesg-%d", seq);
+		break;
+	case PSTORE_MCE:
+		sprintf(new_pstore->name, "mce-%d", seq);
+		break;
+	default:
+		sprintf(new_pstore->name, "type%d-%d", new_pstore->type, seq);
+		break;
+	}
+
+	sysfs_attr_init(&new_pstore->attr.attr);
+	new_pstore->attr.size = 0;
+	new_pstore->attr.read = pstore_show;
+	new_pstore->attr.attr.name = new_pstore->name;
+	new_pstore->attr.attr.mode = 0444;
+	error = sysfs_create_bin_file(&pstore_kset->kobj, &new_pstore->attr);
+	if (!error) {
+		spin_lock(&pstore_lock);
+		list_add(&new_pstore->list, &pstore_list);
+		spin_unlock(&pstore_lock);
+	}
+	return error;
+}
+
+static int __init
+pstore_init(void)
+{
+	int error = 0;
+
+	/* Register the pstore directory at /sys/firmware/pstore */
+	pstore_kset = kset_create_and_add("pstore", NULL, firmware_kobj);
+	if (!pstore_kset) {
+		printk(KERN_ERR "pstore: Subsystem registration failed.\n");
+		return -ENOMEM;
+	}
+
+	/*
+	 * Add attribute to allow records to be erased from persistent store
+	 */
+	error = sysfs_create_bin_file(&pstore_kset->kobj,
+				      &attr_erase);
+	if (error) {
+		printk(KERN_ERR "pstore: unable to create 'erase' sysfs file"
+				" due to error %d\n", error);
+		kset_unregister(pstore_kset);
+	}
+
+	return error;
+}
+
+static void __exit
+pstore_exit(void)
+{
+	struct pstore_entry *entry, *n;
+
+	if (psinfo)
+		kmsg_dump_unregister(&pstore_dumper);
+
+	list_for_each_entry_safe(entry, n, &pstore_list, list) {
+		spin_lock(&pstore_lock);
+		list_del(&entry->list);
+		spin_unlock(&pstore_lock);
+		sysfs_remove_bin_file(&pstore_kset->kobj, &entry->attr);
+	}
+	sysfs_remove_bin_file(&pstore_kset->kobj, &attr_erase);
+
+	kset_unregister(pstore_kset);
+
+	kfree(pstore_buf);
+}
+
+module_init(pstore_init);
+module_exit(pstore_exit);
diff --git a/include/linux/pstore.h b/include/linux/pstore.h
new file mode 100644
index 0000000..785ad86
--- /dev/null
+++ b/include/linux/pstore.h
@@ -0,0 +1,54 @@
+/*
+ * Persistent Storage - pstore.h
+ *
+ * Copyright (C) 2010 Intel Corporation <tony.luck@intel.com>
+ *
+ * This code is the generic layer to export data records from platform
+ * level persistent storage via sysfs.
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License version 2 as
+ *  published by the Free Software Foundation.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+#ifndef _LINUX_PSTORE_H
+#define _LINUX_PSTORE_H
+
+/* types */
+#define	PSTORE_DMESG	0
+#define	PSTORE_MCE	1
+
+struct pstore_info {
+	unsigned long	header_size;
+	unsigned long	data_size;
+	unsigned long	footer_size;
+	int	(*reader)(u64 *id, int *type, char *buf, unsigned long *size);
+	int	(*writer)(int type, char *buf, unsigned long size);
+	int	(*eraser)(u64 id);
+};
+
+#if defined(CONFIG_PSTORE) || defined(CONFIG_PSTORE_MODULE)
+extern int pstore_register(struct pstore_info *);
+extern int pstore_write(int type, char *buf, unsigned long size);
+#else
+static inline int
+pstore_register(struct pstore_info *psi)
+{
+	return -ENODEV;
+}
+static inline int
+pstore_write(int type, char *buf, unsigned long size)
+{
+	return -ENODEV;
+}
+#endif
+
+#endif /*_LINUX_PSTORE_H*/

^ permalink raw reply related	[flat|nested] 18+ messages in thread