linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
@ 2018-04-17  7:44 Rahul Lakkireddy
  2018-04-17  7:44 ` [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel Rahul Lakkireddy
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-17  7:44 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy

On production servers running variety of workloads over time, kernel
panic can happen sporadically after days or even months. It is
important to collect as much debug logs as possible to root cause
and fix the problem, that may not be easy to reproduce. Snapshot of
underlying hardware/firmware state (like register dump, firmware
logs, adapter memory, etc.), at the time of kernel panic will be very
helpful while debugging the culprit device driver.

This series of patches add new generic framework that enable device
drivers to collect device specific snapshot of the hardware/firmware
state of the underlying device in the crash recovery kernel. In crash
recovery kernel, the collected logs are added as elf notes to
/proc/vmcore, which is copied by user space scripts for post-analysis.

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

The device specific hardware/firmware logs can be seen as elf notes:

# readelf -n /proc/vmcore

Displaying notes found at file offset 0x00001000 with length 0x04003288:
  Owner                 Data size	Description
  VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8	Unknown note type: (0x00000700)
  VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8	Unknown note type: (0x00000700)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
  VMCOREINFO           0x0000074f	Unknown note type: (0x00000000)

Patch 1 adds API to vmcore module to allow drivers to register callback
to collect the device specific hardware/firmware logs.  The logs will
be added to /proc/vmcore as elf notes.

Patch 2 updates read and mmap logic to append device specific hardware/
firmware logs as elf notes.

Patch 3 shows a cxgb4 driver example using the API to collect
hardware/firmware logs in crash recovery kernel, before hardware is
initialized.

Thanks,
Rahul

RFC v1: https://lkml.org/lkml/2018/3/2/542
RFC v2: https://lkml.org/lkml/2018/3/16/326

---
v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Exported dumps as elf notes. Suggested by Eric Biederman
  <ebiederm@xmission.com>.  Added as patch 2 in this version.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.
- Updated comments.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs. Suggested by
  Stephen Hemminger <stephen@networkplumber.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm@xmission.com>.
- Added new crashdd module that exports /proc/crashdd/ containing
  driver's registered hardware/firmware logs in patch 1.
- Replaced the API to allow drivers to register their hardware/firmware
  log collect routine in crash recovery kernel in patch 1.
- Updated patch 2 to use the new API in patch 1.

Rahul Lakkireddy (3):
  vmcore: add API to collect hardware dump in second kernel
  vmcore: append device dumps to vmcore as elf notes
  cxgb4: collect hardware dump in second kernel

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |   4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c |  25 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |   3 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  10 +
 fs/proc/Kconfig                                  |  10 +
 fs/proc/vmcore.c                                 | 399 ++++++++++++++++++++++-
 include/linux/crash_core.h                       |   4 +
 include/linux/crash_dump.h                       |  17 +
 include/linux/kcore.h                            |   6 +
 include/uapi/linux/elf.h                         |   1 +
 10 files changed, 466 insertions(+), 13 deletions(-)

-- 
2.14.1

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel
  2018-04-17  7:44 [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Rahul Lakkireddy
@ 2018-04-17  7:44 ` Rahul Lakkireddy
  2018-04-19  8:24   ` Greg KH
  2018-04-17  7:44 ` [PATCH net-next v4 2/3] vmcore: append device dumps to vmcore as elf notes Rahul Lakkireddy
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-17  7:44 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy

The sequence of actions done by device drivers to append their device
specific hardware/firmware logs to /proc/vmcore are as follows:

1. During probe (before hardware is initialized), device drivers
register to the vmcore module (via vmcore_add_device_dump()), with
callback function, along with buffer size and log name needed for
firmware/hardware log collection.

2. vmcore module allocates the buffer with requested size. It adds
an Elf note and invokes the device driver's registered callback
function.

3. Device driver collects all hardware/firmware logs into the buffer
and returns control back to vmcore module.

Ensure that the device dump buffer size is always aligned to page size
so that it can be mmaped.

Also, rename alloc_elfnotes_buf() to vmcore_alloc_buf() to make it more
generic and reserve NT_VMCOREDD note type to indicate vmcore device
dump.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v4:
- Made __vmcore_add_device_dump() static.
- Moved compile check to define vmcore_add_device_dump() to
  crash_dump.h to fix compilation when vmcore.c is not compiled in.
- Convert ---help--- to help in Kconfig as indicated by checkpatch.
- Rebased to tip.

v3:
- Dropped sysfs crashdd module.
- Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
  dump support.
- Moved logic related to adding dumps from crashdd to vmcore module.
- Rename all crashdd* to vmcoredd*.

v2:
- Added ABI Documentation for crashdd.
- Directly use octal permission instead of macro.

Changes since rfc v2:
- Moved exporting crashdd from procfs to sysfs.  Suggested by
  Stephen Hemminger <stephen@networkplumber.org>
- Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
- Replaced all proc API with sysfs API and updated comments.
- Calling driver callback before creating the binary file under
  crashdd sysfs.
- Changed binary dump file permission from S_IRUSR to S_IRUGO.
- Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.

rfc v2:
- Collecting logs in 2nd kernel instead of during kernel panic.
  Suggested by Eric Biederman <ebiederm@xmission.com>.
- Patch added in this series.

 fs/proc/Kconfig            |  10 +++
 fs/proc/vmcore.c           | 152 ++++++++++++++++++++++++++++++++++++++++++---
 include/linux/crash_core.h |   4 ++
 include/linux/crash_dump.h |  17 +++++
 include/linux/kcore.h      |   6 ++
 include/uapi/linux/elf.h   |   1 +
 6 files changed, 181 insertions(+), 9 deletions(-)

diff --git a/fs/proc/Kconfig b/fs/proc/Kconfig
index 1ade1206bb89..504c77a444bd 100644
--- a/fs/proc/Kconfig
+++ b/fs/proc/Kconfig
@@ -43,6 +43,16 @@ config PROC_VMCORE
         help
         Exports the dump image of crashed kernel in ELF format.
 
+config PROC_VMCORE_DEVICE_DUMP
+	bool "Device Hardware/Firmware Log Collection"
+	depends on PROC_VMCORE
+	default y
+	help
+	  Device drivers can collect the device specific snapshot of
+	  their hardware or firmware before they are initialized in
+	  crash recovery kernel. If you say Y here, the device dumps
+	  will be added as ELF notes to /proc/vmcore
+
 config PROC_SYSCTL
 	bool "Sysctl support (/proc/sys)" if EXPERT
 	depends on PROC_FS
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a45f0af22a60..7395462d2f86 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -20,6 +20,7 @@
 #include <linux/init.h>
 #include <linux/crash_dump.h>
 #include <linux/list.h>
+#include <linux/mutex.h>
 #include <linux/vmalloc.h>
 #include <linux/pagemap.h>
 #include <linux/uaccess.h>
@@ -44,6 +45,12 @@ static u64 vmcore_size;
 
 static struct proc_dir_entry *proc_vmcore;
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/* Device Dump list and mutex to synchronize access to list */
+static LIST_HEAD(vmcoredd_list);
+static DEFINE_MUTEX(vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -302,10 +309,8 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
 };
 
 /**
- * alloc_elfnotes_buf - allocate buffer for ELF note segment in
- *                      vmalloc memory
- *
- * @notes_sz: size of buffer
+ * vmcore_alloc_buf - allocate buffer in vmalloc memory
+ * @sizez: size of buffer
  *
  * If CONFIG_MMU is defined, use vmalloc_user() to allow users to mmap
  * the buffer to user-space by means of remap_vmalloc_range().
@@ -313,12 +318,12 @@ static const struct vm_operations_struct vmcore_mmap_ops = {
  * If CONFIG_MMU is not defined, use vzalloc() since mmap_vmcore() is
  * disabled and there's no need to allow users to mmap the buffer.
  */
-static inline char *alloc_elfnotes_buf(size_t notes_sz)
+static inline char *vmcore_alloc_buf(size_t size)
 {
 #ifdef CONFIG_MMU
-	return vmalloc_user(notes_sz);
+	return vmalloc_user(size);
 #else
-	return vzalloc(notes_sz);
+	return vzalloc(size);
 #endif
 }
 
@@ -665,7 +670,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -851,7 +856,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 		return rc;
 
 	*notes_sz = roundup(phdr_sz, PAGE_SIZE);
-	*notes_buf = alloc_elfnotes_buf(*notes_sz);
+	*notes_buf = vmcore_alloc_buf(*notes_sz);
 	if (!*notes_buf)
 		return -ENOMEM;
 
@@ -1145,6 +1150,132 @@ static int __init parse_crash_elf_headers(void)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+/**
+ * vmcoredd_get_note_size - Get size of the note that will be inserted at
+ * beginning of the dump's buffer.
+ * @name: Note's name
+ *
+ * Gets the overall size of the note that will be inserted at the beginning
+ * of the dump's buffer.  It also adds padding, if necessary to meet
+ * alignment requirements.
+ */
+static inline size_t vmcoredd_get_note_size(const char *name)
+{
+	return CRASH_CORE_NOTE_HEAD_BYTES +
+	       ALIGN(VMCOREDD_NOTE_NAME_BYTES + strlen(name), sizeof(Elf_Word));
+}
+
+/**
+ * vmcoredd_write_note - Write note at the beginning of the dump's buffer
+ * @name: Dump's name
+ * @buf: Output buffer where the note is written
+ * @size: Size of the dump
+ *
+ * Fills beginning of the dump's data with elf note.
+ */
+static void vmcoredd_write_note(const char *name, void *buf, size_t size)
+{
+	struct elf_note *note = (struct elf_note *)buf;
+	Elf_Word *word = (Elf_Word *)note;
+
+	note->n_namesz = ALIGN(VMCOREDD_NOTE_NAME_BYTES + strlen(name),
+			       sizeof(Elf_Word));
+	note->n_descsz = size;
+	note->n_type = NT_VMCOREDD;
+	word += DIV_ROUND_UP(sizeof(*note), sizeof(Elf_Word));
+	snprintf((char *)word, note->n_namesz, "%s_%s", VMCOREDD_NOTE_NAME,
+		 name);
+}
+
+/**
+ * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
+ * @data: dump info.
+ *
+ * Allocate a buffer and invoke the calling driver's dump collect routine.
+ * Write Elf note at the beginning of the buffer to indicate vmcore device
+ * dump and add the dump to global list.
+ */
+static int __vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	size_t note_size, data_size;
+	struct vmcoredd_node *dump;
+	void *buf = NULL;
+	int ret;
+
+	if (!data || !strlen(data->name) ||
+	    !data->vmcoredd_callback || !data->size)
+		return -EINVAL;
+
+	dump = vzalloc(sizeof(*dump));
+	if (!dump) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	note_size = vmcoredd_get_note_size(data->name);
+	/* Keep size of the buffer page aligned so that it can be mmaped */
+	data_size = roundup(note_size + data->size, PAGE_SIZE);
+
+	/* Allocate buffer for driver's to write their dumps */
+	buf = vmcore_alloc_buf(data_size);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out_err;
+	}
+
+	vmcoredd_write_note(data->name, buf, data_size - note_size);
+
+	/* Invoke the driver's dump collection routing */
+	ret = data->vmcoredd_callback(data, buf + note_size);
+	if (ret)
+		goto out_err;
+
+	dump->buf = buf;
+	dump->size = data_size;
+
+	/* Add the dump to driver sysfs list */
+	mutex_lock(&vmcoredd_mutex);
+	list_add_tail(&dump->list, &vmcoredd_list);
+	mutex_unlock(&vmcoredd_mutex);
+
+	return 0;
+
+out_err:
+	if (buf)
+		vfree(buf);
+
+	if (dump)
+		vfree(dump);
+
+	return ret;
+}
+
+int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return __vmcore_add_device_dump(data);
+}
+EXPORT_SYMBOL(vmcore_add_device_dump);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+/* Free all dumps in vmcore device dump list */
+static void vmcore_free_device_dumps(void)
+{
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+	mutex_lock(&vmcoredd_mutex);
+	while (!list_empty(&vmcoredd_list)) {
+		struct vmcoredd_node *dump;
+
+		dump = list_first_entry(&vmcoredd_list, struct vmcoredd_node,
+					list);
+		list_del(&dump->list);
+		vfree(dump->buf);
+		vfree(dump);
+	}
+	mutex_unlock(&vmcoredd_mutex);
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+}
+
 /* Init function for vmcore module. */
 static int __init vmcore_init(void)
 {
@@ -1192,4 +1323,7 @@ void vmcore_cleanup(void)
 		kfree(m);
 	}
 	free_elfcorebuf();
+
+	/* clear vmcore device dump list */
+	vmcore_free_device_dumps();
 }
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index b511f6d24b42..3b6b041d84c8 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -27,6 +27,10 @@
 				     VMCOREINFO_NOTE_NAME_BYTES +	\
 				     VMCOREINFO_BYTES)
 
+#define VMCOREDD_MAX_NAME_BYTES  32
+#define VMCOREDD_NOTE_NAME	 "VMCOREDD"
+#define VMCOREDD_NOTE_NAME_BYTES sizeof(VMCOREDD_NOTE_NAME)
+
 typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
 
 void crash_update_vmcoreinfo_safecopy(void *ptr);
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index f7ac2aa93269..658d508d1ec5 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -93,4 +93,21 @@ static inline bool is_kdump_kernel(void) { return 0; }
 #endif /* CONFIG_CRASH_DUMP */
 
 extern unsigned long saved_max_pfn;
+
+/* Device Dump information to be filled by drivers */
+struct vmcoredd_data {
+	char name[VMCOREDD_MAX_NAME_BYTES]; /* Unique name of the dump */
+	unsigned long size;                 /* Size of the dump */
+	/* Driver's registered callback to be invoked to collect dump */
+	int (*vmcoredd_callback)(struct vmcoredd_data *data, void *buf);
+};
+
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+int vmcore_add_device_dump(struct vmcoredd_data *data);
+#else
+static inline int vmcore_add_device_dump(struct vmcoredd_data *data)
+{
+	return -EOPNOTSUPP;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 #endif /* LINUX_CRASHDUMP_H */
diff --git a/include/linux/kcore.h b/include/linux/kcore.h
index 80db19d3a505..aa26e7199060 100644
--- a/include/linux/kcore.h
+++ b/include/linux/kcore.h
@@ -28,6 +28,12 @@ struct vmcore {
 	loff_t offset;
 };
 
+struct vmcoredd_node {
+	struct list_head list;	/* List of dumps */
+	void *buf;		/* Buffer containing device's dump */
+	unsigned long size;	/* Size of the buffer */
+};
+
 #ifdef CONFIG_PROC_KCORE
 extern void kclist_add(struct kcore_list *, void *, size_t, int type);
 #else
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index e2535d6dcec7..4e12c423b9fe 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -421,6 +421,7 @@ typedef struct elf64_shdr {
 #define NT_ARM_SYSTEM_CALL	0x404	/* ARM system call number */
 #define NT_ARM_SVE	0x405		/* ARM Scalable Vector Extension registers */
 #define NT_ARC_V2	0x600		/* ARCv2 accumulator/extra registers */
+#define NT_VMCOREDD	0x700		/* Vmcore Device Dump Note */
 
 /* Note header in a PT_NOTE section */
 typedef struct elf32_note {
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next v4 2/3] vmcore: append device dumps to vmcore as elf notes
  2018-04-17  7:44 [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Rahul Lakkireddy
  2018-04-17  7:44 ` [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel Rahul Lakkireddy
@ 2018-04-17  7:44 ` Rahul Lakkireddy
  2018-04-17  7:44 ` [PATCH net-next v4 3/3] cxgb4: collect hardware dump in second kernel Rahul Lakkireddy
  2018-04-18  6:15 ` [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Dave Young
  3 siblings, 0 replies; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-17  7:44 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy

Update read and mmap logic to append device dumps as additional notes
before the other elf notes. We add device dumps before other elf notes
because the other elf notes may not fill the elf notes buffer
completely and we will end up with zero-filled data between the elf
notes and the device dumps. Tools will then try to decode this
zero-filled data as valid notes and we don't want that. Hence, adding
device dumps before the other elf notes ensure that zero-filled data
can be avoided. This also ensures that the device dumps and the
other elf notes can be properly mmaped at page aligned address.

Incorporate device dump size into the total vmcore size. Also update
offsets for other program headers after the device dumps are added.

Suggested-by: Eric Biederman <ebiederm@xmission.com>.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v4:
- No changes.

v3:
- Patch added in this version.
- Exported dumps as elf notes. Suggested by Eric Biederman
  <ebiederm@xmission.com>.

 fs/proc/vmcore.c | 247 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 243 insertions(+), 4 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7395462d2f86..ed1ebd85e14e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -39,6 +39,8 @@ static size_t elfcorebuf_sz_orig;
 
 static char *elfnotes_buf;
 static size_t elfnotes_sz;
+/* Size of all notes minus the device dump notes */
+static size_t elfnotes_orig_sz;
 
 /* Total size of vmcore file. */
 static u64 vmcore_size;
@@ -51,6 +53,9 @@ static LIST_HEAD(vmcoredd_list);
 static DEFINE_MUTEX(vmcoredd_mutex);
 #endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
 
+/* Device Dump Size */
+static size_t vmcoredd_orig_sz;
+
 /*
  * Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
  * The called function has to take care of module refcounting.
@@ -185,6 +190,77 @@ static int copy_to(void *target, void *src, size_t size, int userbuf)
 	return 0;
 }
 
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+static int vmcoredd_copy_dumps(void *dst, u64 start, size_t size, int userbuf)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (copy_to(dst, buf, tsz, userbuf)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+
+static int vmcoredd_mmap_dumps(struct vm_area_struct *vma, unsigned long dst,
+			       u64 start, size_t size)
+{
+	struct vmcoredd_node *dump;
+	u64 offset = 0;
+	int ret = 0;
+	size_t tsz;
+	char *buf;
+
+	mutex_lock(&vmcoredd_mutex);
+	list_for_each_entry(dump, &vmcoredd_list, list) {
+		if (start < offset + dump->size) {
+			tsz = min(offset + (u64)dump->size - start, (u64)size);
+			buf = dump->buf + start - offset;
+			if (remap_vmalloc_range_partial(vma, dst, buf, tsz)) {
+				ret = -EFAULT;
+				goto out_unlock;
+			}
+
+			size -= tsz;
+			start += tsz;
+			dst += tsz;
+
+			/* Leave now if buffer filled already */
+			if (!size)
+				goto out_unlock;
+		}
+		offset += dump->size;
+	}
+
+out_unlock:
+	mutex_unlock(&vmcoredd_mutex);
+	return ret;
+}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
@@ -222,10 +298,41 @@ static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
 	if (*fpos < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we don't want that. Hence, adding device dumps before
+		 * the other elf notes ensure that zero-filled data can be
+		 * avoided.
+		 */
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+		/* Read device dumps */
+		if (*fpos < elfcorebuf_sz + vmcoredd_orig_sz) {
+			tsz = min(elfcorebuf_sz + vmcoredd_orig_sz -
+				  (size_t)*fpos, buflen);
+			start = *fpos - elfcorebuf_sz;
+			if (vmcoredd_copy_dumps(buffer, start, tsz, userbuf))
+				return -EFAULT;
+
+			buflen -= tsz;
+			*fpos += tsz;
+			buffer += tsz;
+			acc += tsz;
+
+			/* leave now if filled buffer already */
+			if (!buflen)
+				return acc;
+		}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+		/* Read remaining elf notes */
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
-		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
+		kaddr = elfnotes_buf + *fpos - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (copy_to(buffer, kaddr, tsz, userbuf))
 			return -EFAULT;
+
 		buflen -= tsz;
 		*fpos += tsz;
 		buffer += tsz;
@@ -451,11 +558,46 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 	if (start < elfcorebuf_sz + elfnotes_sz) {
 		void *kaddr;
 
+		/* We add device dumps before other elf notes because the
+		 * other elf notes may not fill the elf notes buffer
+		 * completely and we will end up with zero-filled data
+		 * between the elf notes and the device dumps. Tools will
+		 * then try to decode this zero-filled data as valid notes
+		 * and we don't want that. Hence, adding device dumps before
+		 * the other elf notes ensure that zero-filled data can be
+		 * avoided. This also ensures that the device dumps and
+		 * other elf notes can be properly mmaped at page aligned
+		 * address.
+		 */
+#ifdef CONFIG_PROC_VMCORE_DEVICE_DUMP
+		/* Read device dumps */
+		if (start < elfcorebuf_sz + vmcoredd_orig_sz) {
+			u64 start_off;
+
+			tsz = min(elfcorebuf_sz + vmcoredd_orig_sz -
+				  (size_t)start, size);
+			start_off = start - elfcorebuf_sz;
+			if (vmcoredd_mmap_dumps(vma, vma->vm_start + len,
+						start_off, tsz))
+				goto fail;
+
+			size -= tsz;
+			start += tsz;
+			len += tsz;
+
+			/* leave now if filled buffer already */
+			if (!size)
+				return 0;
+		}
+#endif /* CONFIG_PROC_VMCORE_DEVICE_DUMP */
+
+		/* Read remaining elf notes */
 		tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)start, size);
-		kaddr = elfnotes_buf + start - elfcorebuf_sz;
+		kaddr = elfnotes_buf + start - elfcorebuf_sz - vmcoredd_orig_sz;
 		if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
 						kaddr, tsz))
 			goto fail;
+
 		size -= tsz;
 		start += tsz;
 		len += tsz;
@@ -703,6 +845,11 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
 	/* Modify e_phnum to reflect merged headers. */
 	ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
 
+	/* Store the size of all notes.  We need this to update the note
+	 * header when the device dumps will be added.
+	 */
+	elfnotes_orig_sz = phdr.p_memsz;
+
 	return 0;
 }
 
@@ -889,6 +1036,11 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
 	/* Modify e_phnum to reflect merged headers. */
 	ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
 
+	/* Store the size of all notes.  We need this to update the note
+	 * header when the device dumps will be added.
+	 */
+	elfnotes_orig_sz = phdr.p_memsz;
+
 	return 0;
 }
 
@@ -981,8 +1133,8 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
 }
 
 /* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets(size_t elfsz, size_t elfnotes_sz,
-					   struct list_head *vc_list)
+static void set_vmcore_list_offsets(size_t elfsz, size_t elfnotes_sz,
+				    struct list_head *vc_list)
 {
 	loff_t vmcore_off;
 	struct vmcore *m;
@@ -1188,6 +1340,92 @@ static void vmcoredd_write_note(const char *name, void *buf, size_t size)
 		 name);
 }
 
+/**
+ * vmcoredd_update_program_headers - Update all Elf program headers
+ * @elfptr: Pointer to elf header
+ * @elfnotesz: Size of elf notes aligned to page size
+ * @vmcoreddsz: Size of device dumps to be added to elf note header
+ *
+ * Determine type of Elf header (Elf64 or Elf32) and update the elf note size.
+ * Also update the offsets of all the program headers after the elf note header.
+ */
+static void vmcoredd_update_program_headers(char *elfptr, size_t elfnotesz,
+					    size_t vmcoreddsz)
+{
+	unsigned char *e_ident = (unsigned char *)elfptr;
+	u64 start, end, size;
+	loff_t vmcore_off;
+	u32 i;
+
+	vmcore_off = elfcorebuf_sz + elfnotesz;
+
+	if (e_ident[EI_CLASS] == ELFCLASS64) {
+		Elf64_Ehdr *ehdr = (Elf64_Ehdr *)elfptr;
+		Elf64_Phdr *phdr = (Elf64_Phdr *)(elfptr + sizeof(Elf64_Ehdr));
+
+		/* Update all program headers */
+		for (i = 0; i < ehdr->e_phnum; i++, phdr++) {
+			if (phdr->p_type == PT_NOTE) {
+				/* Update note size */
+				phdr->p_memsz = elfnotes_orig_sz + vmcoreddsz;
+				phdr->p_filesz = phdr->p_memsz;
+				continue;
+			}
+
+			start = rounddown(phdr->p_offset, PAGE_SIZE);
+			end = roundup(phdr->p_offset + phdr->p_memsz,
+				      PAGE_SIZE);
+			size = end - start;
+			phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+			vmcore_off += size;
+		}
+	} else {
+		Elf32_Ehdr *ehdr = (Elf32_Ehdr *)elfptr;
+		Elf32_Phdr *phdr = (Elf32_Phdr *)(elfptr + sizeof(Elf32_Ehdr));
+
+		/* Update all program headers */
+		for (i = 0; i < ehdr->e_phnum; i++, phdr++) {
+			if (phdr->p_type == PT_NOTE) {
+				/* Update note size */
+				phdr->p_memsz = elfnotes_orig_sz + vmcoreddsz;
+				phdr->p_filesz = phdr->p_memsz;
+				continue;
+			}
+
+			start = rounddown(phdr->p_offset, PAGE_SIZE);
+			end = roundup(phdr->p_offset + phdr->p_memsz,
+				      PAGE_SIZE);
+			size = end - start;
+			phdr->p_offset = vmcore_off + (phdr->p_offset - start);
+			vmcore_off += size;
+		}
+	}
+}
+
+/**
+ * vmcoredd_update_size - Update the total size of the device dumps and update
+ * Elf header
+ * @dump_size: Size of the current device dump to be added to total size
+ *
+ * Update the total size of all the device dumps and update the Elf program
+ * headers. Calculate the new offsets for the vmcore list and update the
+ * total vmcore size.
+ */
+static void vmcoredd_update_size(size_t dump_size)
+{
+	vmcoredd_orig_sz += dump_size;
+	elfnotes_sz = roundup(elfnotes_orig_sz, PAGE_SIZE) + vmcoredd_orig_sz;
+	vmcoredd_update_program_headers(elfcorebuf, elfnotes_sz,
+					vmcoredd_orig_sz);
+
+	/* Update vmcore list offsets */
+	set_vmcore_list_offsets(elfcorebuf_sz, elfnotes_sz, &vmcore_list);
+
+	vmcore_size = get_vmcore_size(elfcorebuf_sz, elfnotes_sz,
+				      &vmcore_list);
+	proc_vmcore->size = vmcore_size;
+}
+
 /**
  * vmcore_add_device_dump - Add a buffer containing device dump to vmcore
  * @data: dump info.
@@ -1239,6 +1477,7 @@ static int __vmcore_add_device_dump(struct vmcoredd_data *data)
 	list_add_tail(&dump->list, &vmcoredd_list);
 	mutex_unlock(&vmcoredd_mutex);
 
+	vmcoredd_update_size(data_size);
 	return 0;
 
 out_err:
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH net-next v4 3/3] cxgb4: collect hardware dump in second kernel
  2018-04-17  7:44 [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Rahul Lakkireddy
  2018-04-17  7:44 ` [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel Rahul Lakkireddy
  2018-04-17  7:44 ` [PATCH net-next v4 2/3] vmcore: append device dumps to vmcore as elf notes Rahul Lakkireddy
@ 2018-04-17  7:44 ` Rahul Lakkireddy
  2018-04-18  6:15 ` [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Dave Young
  3 siblings, 0 replies; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-17  7:44 UTC (permalink / raw)
  To: netdev, kexec, linux-fsdevel, linux-kernel
  Cc: davem, viro, ebiederm, stephen, akpm, torvalds, ganeshgr,
	nirranjan, indranil, Rahul Lakkireddy

Register callback to collect hardware/firmware dumps in second kernel
before hardware/firmware is initialized. The dumps for each device
will be available as elf notes in /proc/vmcore in second kernel.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
v4:
- No changes.

v3:
- Replaced all crashdd* with vmcoredd*.
- Replaced crashdd_add_dump() with vmcore_add_device_dump().
- Updated comments and commit message.

v2:
- No Changes.

Changes since rfc v2:
- Update comments and commit message for sysfs change.

rfc v2:
- Updated dump registration to the new API in patch 1.

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |  4 ++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 25 ++++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |  3 +++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  | 10 ++++++++++
 4 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 688f95440af2..01e7aad4ce5b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -50,6 +50,7 @@
 #include <linux/net_tstamp.h>
 #include <linux/ptp_clock_kernel.h>
 #include <linux/ptp_classify.h>
+#include <linux/crash_dump.h>
 #include <asm/io.h>
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -964,6 +965,9 @@ struct adapter {
 	struct hma_data hma;
 
 	struct srq_data *srq;
+
+	/* Dump buffer for collecting logs in kdump kernel */
+	struct vmcoredd_data vmcoredd;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 143686c60234..76433d4fe483 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -488,3 +488,28 @@ void cxgb4_init_ethtool_dump(struct adapter *adapter)
 	adapter->eth_dump.version = adapter->params.fw_vers;
 	adapter->eth_dump.len = 0;
 }
+
+static int cxgb4_cudbg_vmcoredd_collect(struct vmcoredd_data *data, void *buf)
+{
+	struct adapter *adap = container_of(data, struct adapter, vmcoredd);
+	u32 len = data->size;
+
+	return cxgb4_cudbg_collect(adap, buf, &len, CXGB4_ETH_DUMP_ALL);
+}
+
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap)
+{
+	struct vmcoredd_data *data = &adap->vmcoredd;
+	u32 len;
+
+	len = sizeof(struct cudbg_hdr) +
+	      sizeof(struct cudbg_entity_hdr) * CUDBG_MAX_ENTITY;
+	len += CUDBG_DUMP_BUFF_SIZE;
+
+	data->size = len;
+	snprintf(data->name, sizeof(data->name), "%s_%s", cxgb4_driver_name,
+		 adap->name);
+	data->vmcoredd_callback = cxgb4_cudbg_vmcoredd_collect;
+
+	return vmcore_add_device_dump(data);
+}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
index ce1ac9a1c878..ef59ba1ed968 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
@@ -41,8 +41,11 @@ enum CXGB4_ETHTOOL_DUMP_FLAGS {
 	CXGB4_ETH_DUMP_HW = (1 << 1), /* various FW and HW dumps */
 };
 
+#define CXGB4_ETH_DUMP_ALL (CXGB4_ETH_DUMP_MEM | CXGB4_ETH_DUMP_HW)
+
 u32 cxgb4_get_dump_length(struct adapter *adap, u32 flag);
 int cxgb4_cudbg_collect(struct adapter *adap, void *buf, u32 *buf_size,
 			u32 flag);
 void cxgb4_init_ethtool_dump(struct adapter *adapter);
+int cxgb4_cudbg_vmcore_add_dump(struct adapter *adap);
 #endif /* __CXGB4_CUDBG_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 24d2865b8806..32cad0acf76c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5544,6 +5544,16 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (err)
 		goto out_free_adapter;
 
+	if (is_kdump_kernel()) {
+		/* Collect hardware state and append to /proc/vmcore */
+		err = cxgb4_cudbg_vmcore_add_dump(adapter);
+		if (err) {
+			dev_warn(adapter->pdev_dev,
+				 "Fail collecting vmcore device dump, err: %d. Continuing\n",
+				 err);
+			err = 0;
+		}
+	}
 
 	if (!is_t4(adapter->params.chip)) {
 		s_qpp = (QUEUESPERPAGEPF0_S +
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-17  7:44 [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Rahul Lakkireddy
                   ` (2 preceding siblings ...)
  2018-04-17  7:44 ` [PATCH net-next v4 3/3] cxgb4: collect hardware dump in second kernel Rahul Lakkireddy
@ 2018-04-18  6:15 ` Dave Young
  2018-04-18 12:31   ` Rahul Lakkireddy
  3 siblings, 1 reply; 16+ messages in thread
From: Dave Young @ 2018-04-18  6:15 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, indranil, nirranjan,
	stephen, ganeshgr, ebiederm, akpm, torvalds, davem, viro

Hi Rahul,
On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> On production servers running variety of workloads over time, kernel
> panic can happen sporadically after days or even months. It is
> important to collect as much debug logs as possible to root cause
> and fix the problem, that may not be easy to reproduce. Snapshot of
> underlying hardware/firmware state (like register dump, firmware
> logs, adapter memory, etc.), at the time of kernel panic will be very
> helpful while debugging the culprit device driver.
> 
> This series of patches add new generic framework that enable device
> drivers to collect device specific snapshot of the hardware/firmware
> state of the underlying device in the crash recovery kernel. In crash
> recovery kernel, the collected logs are added as elf notes to
> /proc/vmcore, which is copied by user space scripts for post-analysis.
> 
> The sequence of actions done by device drivers to append their device
> specific hardware/firmware logs to /proc/vmcore are as follows:
> 
> 1. During probe (before hardware is initialized), device drivers
> register to the vmcore module (via vmcore_add_device_dump()), with
> callback function, along with buffer size and log name needed for
> firmware/hardware log collection.

I assumed the elf notes info should be prepared while kexec_[file_]load
phase. But I did not read the old comment, not sure if it has been discussed
or not.

If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
How to guarantee the function works if vmcore reading happens before
the driver is loaded?

Also it is possible that kdump initramfs does not contains the driver
module.

Am I missing something?

> 
> 2. vmcore module allocates the buffer with requested size. It adds
> an elf note and invokes the device driver's registered callback
> function.
> 
> 3. Device driver collects all hardware/firmware logs into the buffer
> and returns control back to vmcore module.
> 
> The device specific hardware/firmware logs can be seen as elf notes:
> 
> # readelf -n /proc/vmcore
> 
> Displaying notes found at file offset 0x00001000 with length 0x04003288:
>   Owner                 Data size	Description
>   VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8	Unknown note type: (0x00000700)
>   VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8	Unknown note type: (0x00000700)
>   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
>   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
>   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
>   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
>   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
>   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
>   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
>   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
>   VMCOREINFO           0x0000074f	Unknown note type: (0x00000000)
> 
> Patch 1 adds API to vmcore module to allow drivers to register callback
> to collect the device specific hardware/firmware logs.  The logs will
> be added to /proc/vmcore as elf notes.
> 
> Patch 2 updates read and mmap logic to append device specific hardware/
> firmware logs as elf notes.
> 
> Patch 3 shows a cxgb4 driver example using the API to collect
> hardware/firmware logs in crash recovery kernel, before hardware is
> initialized.
> 
> Thanks,
> Rahul
> 
> RFC v1: https://lkml.org/lkml/2018/3/2/542
> RFC v2: https://lkml.org/lkml/2018/3/16/326
> 
> ---
> v4:
> - Made __vmcore_add_device_dump() static.
> - Moved compile check to define vmcore_add_device_dump() to
>   crash_dump.h to fix compilation when vmcore.c is not compiled in.
> - Convert ---help--- to help in Kconfig as indicated by checkpatch.
> - Rebased to tip.
> 
> v3:
> - Dropped sysfs crashdd module.
> - Exported dumps as elf notes. Suggested by Eric Biederman
>   <ebiederm@xmission.com>.  Added as patch 2 in this version.
> - Added CONFIG_PROC_VMCORE_DEVICE_DUMP to allow configuring device
>   dump support.
> - Moved logic related to adding dumps from crashdd to vmcore module.
> - Rename all crashdd* to vmcoredd*.
> - Updated comments.
> 
> v2:
> - Added ABI Documentation for crashdd.
> - Directly use octal permission instead of macro.
> 
> Changes since rfc v2:
> - Moved exporting crashdd from procfs to sysfs. Suggested by
>   Stephen Hemminger <stephen@networkplumber.org>
> - Moved code from fs/proc/crashdd.c to fs/crashdd/ directory.
> - Replaced all proc API with sysfs API and updated comments.
> - Calling driver callback before creating the binary file under
>   crashdd sysfs.
> - Changed binary dump file permission from S_IRUSR to S_IRUGO.
> - Changed module name from CRASH_DRIVER_DUMP to CRASH_DEVICE_DUMP.
> 
> rfc v2:
> - Collecting logs in 2nd kernel instead of during kernel panic.
>   Suggested by Eric Biederman <ebiederm@xmission.com>.
> - Added new crashdd module that exports /proc/crashdd/ containing
>   driver's registered hardware/firmware logs in patch 1.
> - Replaced the API to allow drivers to register their hardware/firmware
>   log collect routine in crash recovery kernel in patch 1.
> - Updated patch 2 to use the new API in patch 1.
> 
> Rahul Lakkireddy (3):
>   vmcore: add API to collect hardware dump in second kernel
>   vmcore: append device dumps to vmcore as elf notes
>   cxgb4: collect hardware dump in second kernel
> 
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |   4 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c |  25 ++
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |   3 +
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  |  10 +
>  fs/proc/Kconfig                                  |  10 +
>  fs/proc/vmcore.c                                 | 399 ++++++++++++++++++++++-
>  include/linux/crash_core.h                       |   4 +
>  include/linux/crash_dump.h                       |  17 +
>  include/linux/kcore.h                            |   6 +
>  include/uapi/linux/elf.h                         |   1 +
>  10 files changed, 466 insertions(+), 13 deletions(-)
> 
> -- 
> 2.14.1
> 
> 
> _______________________________________________
> kexec mailing list
> kexec@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Thanks
Dave

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-18  6:15 ` [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Dave Young
@ 2018-04-18 12:31   ` Rahul Lakkireddy
  2018-04-18 14:28     ` Eric W. Biederman
  2018-04-19  1:40     ` Dave Young
  0 siblings, 2 replies; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-18 12:31 UTC (permalink / raw)
  To: Dave Young
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, Indranil Choudhury,
	Nirranjan Kirubaharan, stephen, Ganesh GR, ebiederm, akpm,
	torvalds, davem, viro

On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> Hi Rahul,
> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> > On production servers running variety of workloads over time, kernel
> > panic can happen sporadically after days or even months. It is
> > important to collect as much debug logs as possible to root cause
> > and fix the problem, that may not be easy to reproduce. Snapshot of
> > underlying hardware/firmware state (like register dump, firmware
> > logs, adapter memory, etc.), at the time of kernel panic will be very
> > helpful while debugging the culprit device driver.
> > 
> > This series of patches add new generic framework that enable device
> > drivers to collect device specific snapshot of the hardware/firmware
> > state of the underlying device in the crash recovery kernel. In crash
> > recovery kernel, the collected logs are added as elf notes to
> > /proc/vmcore, which is copied by user space scripts for post-analysis.
> > 
> > The sequence of actions done by device drivers to append their device
> > specific hardware/firmware logs to /proc/vmcore are as follows:
> > 
> > 1. During probe (before hardware is initialized), device drivers
> > register to the vmcore module (via vmcore_add_device_dump()), with
> > callback function, along with buffer size and log name needed for
> > firmware/hardware log collection.
> 
> I assumed the elf notes info should be prepared while kexec_[file_]load
> phase. But I did not read the old comment, not sure if it has been discussed
> or not.
> 

We must not collect dumps in crashing kernel. Adding more things in
crash dump path risks not collecting vmcore at all. Eric had
discussed this in more detail at:

https://lkml.org/lkml/2018/3/24/319

We are safe to collect dumps in the second kernel. Each device dump
will be exported as an elf note in /proc/vmcore.

> If do this in 2nd kernel a question is driver can be loaded later than vmcore init.

Yes, drivers will add their device dumps after vmcore init.

> How to guarantee the function works if vmcore reading happens before
> the driver is loaded?
> 
> Also it is possible that kdump initramfs does not contains the driver
> module.
> 
> Am I missing something?
> 

Yes, driver must be in initramfs if it wants to collect and add device
dump to /proc/vmcore in second kernel.

> > 
> > 2. vmcore module allocates the buffer with requested size. It adds
> > an elf note and invokes the device driver's registered callback
> > function.
> > 
> > 3. Device driver collects all hardware/firmware logs into the buffer
> > and returns control back to vmcore module.
> > 
> > The device specific hardware/firmware logs can be seen as elf notes:
> > 
> > # readelf -n /proc/vmcore
> > 
> > Displaying notes found at file offset 0x00001000 with length 0x04003288:
> >   Owner                 Data size	Description
> >   VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8	Unknown note type: (0x00000700)
> >   VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8	Unknown note type: (0x00000700)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> >   VMCOREINFO           0x0000074f	Unknown note type: (0x00000000)
> > 
> > Patch 1 adds API to vmcore module to allow drivers to register callback
> > to collect the device specific hardware/firmware logs.  The logs will
> > be added to /proc/vmcore as elf notes.
> > 
> > Patch 2 updates read and mmap logic to append device specific hardware/
> > firmware logs as elf notes.
> > 
> > Patch 3 shows a cxgb4 driver example using the API to collect
> > hardware/firmware logs in crash recovery kernel, before hardware is
> > initialized.
> > 
> > Thanks,
> > Rahul
> > 
> > RFC v1: https://lkml.org/lkml/2018/3/2/542
> > RFC v2: https://lkml.org/lkml/2018/3/16/326
> > 
[...]

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-18 12:31   ` Rahul Lakkireddy
@ 2018-04-18 14:28     ` Eric W. Biederman
  2018-04-18 15:07       ` Rahul Lakkireddy
  2018-04-19  1:40     ` Dave Young
  1 sibling, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2018-04-18 14:28 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: Dave Young, netdev, kexec, linux-fsdevel, linux-kernel,
	Indranil Choudhury, Nirranjan Kirubaharan, stephen, Ganesh GR,
	akpm, torvalds, davem, viro

Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:

> On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
>> Hi Rahul,
>> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
>> > On production servers running variety of workloads over time, kernel
>> > panic can happen sporadically after days or even months. It is
>> > important to collect as much debug logs as possible to root cause
>> > and fix the problem, that may not be easy to reproduce. Snapshot of
>> > underlying hardware/firmware state (like register dump, firmware
>> > logs, adapter memory, etc.), at the time of kernel panic will be very
>> > helpful while debugging the culprit device driver.
>> > 
>> > This series of patches add new generic framework that enable device
>> > drivers to collect device specific snapshot of the hardware/firmware
>> > state of the underlying device in the crash recovery kernel. In crash
>> > recovery kernel, the collected logs are added as elf notes to
>> > /proc/vmcore, which is copied by user space scripts for post-analysis.
>> > 
>> > The sequence of actions done by device drivers to append their device
>> > specific hardware/firmware logs to /proc/vmcore are as follows:
>> > 
>> > 1. During probe (before hardware is initialized), device drivers
>> > register to the vmcore module (via vmcore_add_device_dump()), with
>> > callback function, along with buffer size and log name needed for
>> > firmware/hardware log collection.
>> 
>> I assumed the elf notes info should be prepared while kexec_[file_]load
>> phase. But I did not read the old comment, not sure if it has been discussed
>> or not.
>> 
>
> We must not collect dumps in crashing kernel. Adding more things in
> crash dump path risks not collecting vmcore at all. Eric had
> discussed this in more detail at:
>
> https://lkml.org/lkml/2018/3/24/319
>
> We are safe to collect dumps in the second kernel. Each device dump
> will be exported as an elf note in /proc/vmcore.

It just occurred to me there is one variation that is worth
considering.

Is the area you are looking at dumping part of a huge mmio area?
I think someone said 2GB?

If that is the case it could be worth it to simply add the needed
addresses to the range of memory we need to dump, and simply having a
elf note saying that is what happened.

>> If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
>
> Yes, drivers will add their device dumps after vmcore init.
>
>> How to guarantee the function works if vmcore reading happens before
>> the driver is loaded?
>> 
>> Also it is possible that kdump initramfs does not contains the driver
>> module.
>> 
>> Am I missing something?
>> 
>
> Yes, driver must be in initramfs if it wants to collect and add device
> dump to /proc/vmcore in second kernel.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-18 14:28     ` Eric W. Biederman
@ 2018-04-18 15:07       ` Rahul Lakkireddy
  0 siblings, 0 replies; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-18 15:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Young, netdev, kexec, linux-fsdevel, linux-kernel,
	Indranil Choudhury, Nirranjan Kirubaharan, stephen, Ganesh GR,
	akpm, torvalds, davem, viro

On Wednesday, April 04/18/18, 2018 at 19:58:01 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> 
> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> Hi Rahul,
> >> On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> > On production servers running variety of workloads over time, kernel
> >> > panic can happen sporadically after days or even months. It is
> >> > important to collect as much debug logs as possible to root cause
> >> > and fix the problem, that may not be easy to reproduce. Snapshot of
> >> > underlying hardware/firmware state (like register dump, firmware
> >> > logs, adapter memory, etc.), at the time of kernel panic will be very
> >> > helpful while debugging the culprit device driver.
> >> > 
> >> > This series of patches add new generic framework that enable device
> >> > drivers to collect device specific snapshot of the hardware/firmware
> >> > state of the underlying device in the crash recovery kernel. In crash
> >> > recovery kernel, the collected logs are added as elf notes to
> >> > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >> > 
> >> > The sequence of actions done by device drivers to append their device
> >> > specific hardware/firmware logs to /proc/vmcore are as follows:
> >> > 
> >> > 1. During probe (before hardware is initialized), device drivers
> >> > register to the vmcore module (via vmcore_add_device_dump()), with
> >> > callback function, along with buffer size and log name needed for
> >> > firmware/hardware log collection.
> >> 
> >> I assumed the elf notes info should be prepared while kexec_[file_]load
> >> phase. But I did not read the old comment, not sure if it has been discussed
> >> or not.
> >> 
> >
> > We must not collect dumps in crashing kernel. Adding more things in
> > crash dump path risks not collecting vmcore at all. Eric had
> > discussed this in more detail at:
> >
> > https://lkml.org/lkml/2018/3/24/319
> >
> > We are safe to collect dumps in the second kernel. Each device dump
> > will be exported as an elf note in /proc/vmcore.
> 
> It just occurred to me there is one variation that is worth
> considering.
> 
> Is the area you are looking at dumping part of a huge mmio area?
> I think someone said 2GB?
> 
> If that is the case it could be worth it to simply add the needed
> addresses to the range of memory we need to dump, and simply having a
> elf note saying that is what happened.
> 

We are _not_ dumping mmio area. However, one part of the dump
collection involves reading 2 GB on-chip memory via PIO access,
which is compressed and stored.

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-18 12:31   ` Rahul Lakkireddy
  2018-04-18 14:28     ` Eric W. Biederman
@ 2018-04-19  1:40     ` Dave Young
  2018-04-19 14:27       ` Rahul Lakkireddy
  1 sibling, 1 reply; 16+ messages in thread
From: Dave Young @ 2018-04-19  1:40 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, Indranil Choudhury,
	Nirranjan Kirubaharan, stephen, Ganesh GR, ebiederm, akpm,
	torvalds, davem, viro

On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
> On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> > Hi Rahul,
> > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> > > On production servers running variety of workloads over time, kernel
> > > panic can happen sporadically after days or even months. It is
> > > important to collect as much debug logs as possible to root cause
> > > and fix the problem, that may not be easy to reproduce. Snapshot of
> > > underlying hardware/firmware state (like register dump, firmware
> > > logs, adapter memory, etc.), at the time of kernel panic will be very
> > > helpful while debugging the culprit device driver.
> > > 
> > > This series of patches add new generic framework that enable device
> > > drivers to collect device specific snapshot of the hardware/firmware
> > > state of the underlying device in the crash recovery kernel. In crash
> > > recovery kernel, the collected logs are added as elf notes to
> > > /proc/vmcore, which is copied by user space scripts for post-analysis.
> > > 
> > > The sequence of actions done by device drivers to append their device
> > > specific hardware/firmware logs to /proc/vmcore are as follows:
> > > 
> > > 1. During probe (before hardware is initialized), device drivers
> > > register to the vmcore module (via vmcore_add_device_dump()), with
> > > callback function, along with buffer size and log name needed for
> > > firmware/hardware log collection.
> > 
> > I assumed the elf notes info should be prepared while kexec_[file_]load
> > phase. But I did not read the old comment, not sure if it has been discussed
> > or not.
> > 
> 
> We must not collect dumps in crashing kernel. Adding more things in
> crash dump path risks not collecting vmcore at all. Eric had
> discussed this in more detail at:
> 
> https://lkml.org/lkml/2018/3/24/319
> 
> We are safe to collect dumps in the second kernel. Each device dump
> will be exported as an elf note in /proc/vmcore.

I understand that we should avoid adding anything in crash path.  And I also
agree to collect device dump in second kernel.  I just assumed device
dump use some memory area to store the debug info and the memory
is persistent so that this can be done in 2 steps, first register the
address in elf header in kexec_load, then collect the dump in 2nd
kernel.  But it seems the driver is doing some other logic to collect
the info instead of just that simple like I thought. 

> 
> > If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
> 
> Yes, drivers will add their device dumps after vmcore init.
> 
> > How to guarantee the function works if vmcore reading happens before
> > the driver is loaded?
> > 
> > Also it is possible that kdump initramfs does not contains the driver
> > module.
> > 
> > Am I missing something?
> > 
> 
> Yes, driver must be in initramfs if it wants to collect and add device
> dump to /proc/vmcore in second kernel.

In RH/Fedora kdump scripts we only add the things are required to
bring up the dump target, so that we can use as less memory as we can.

For example, if a net driver panicked, and the dump target is rootfs
which is a scsi disk, then no network related stuff will be added in
initramfs.

In this case the device dump info will be not collected..
> 
> > > 
> > > 2. vmcore module allocates the buffer with requested size. It adds
> > > an elf note and invokes the device driver's registered callback
> > > function.
> > > 
> > > 3. Device driver collects all hardware/firmware logs into the buffer
> > > and returns control back to vmcore module.
> > > 
> > > The device specific hardware/firmware logs can be seen as elf notes:
> > > 
> > > # readelf -n /proc/vmcore
> > > 
> > > Displaying notes found at file offset 0x00001000 with length 0x04003288:
> > >   Owner                 Data size	Description
> > >   VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8	Unknown note type: (0x00000700)
> > >   VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8	Unknown note type: (0x00000700)
> > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > >   VMCOREINFO           0x0000074f	Unknown note type: (0x00000000)
> > > 
> > > Patch 1 adds API to vmcore module to allow drivers to register callback
> > > to collect the device specific hardware/firmware logs.  The logs will
> > > be added to /proc/vmcore as elf notes.
> > > 
> > > Patch 2 updates read and mmap logic to append device specific hardware/
> > > firmware logs as elf notes.
> > > 
> > > Patch 3 shows a cxgb4 driver example using the API to collect
> > > hardware/firmware logs in crash recovery kernel, before hardware is
> > > initialized.
> > > 
> > > Thanks,
> > > Rahul
> > > 
> > > RFC v1: https://lkml.org/lkml/2018/3/2/542
> > > RFC v2: https://lkml.org/lkml/2018/3/16/326
> > > 
> [...]
> 
> Thanks,
> Rahul

Thanks
Dave

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel
  2018-04-17  7:44 ` [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel Rahul Lakkireddy
@ 2018-04-19  8:24   ` Greg KH
  2018-04-19 14:56     ` Rahul Lakkireddy
  0 siblings, 1 reply; 16+ messages in thread
From: Greg KH @ 2018-04-19  8:24 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, davem, viro,
	ebiederm, stephen, akpm, torvalds, ganeshgr, nirranjan, indranil

On Tue, Apr 17, 2018 at 01:14:17PM +0530, Rahul Lakkireddy wrote:
> +config PROC_VMCORE_DEVICE_DUMP
> +	bool "Device Hardware/Firmware Log Collection"
> +	depends on PROC_VMCORE
> +	default y

Only things that require the machine to keep working should be 'default
y', please remove this, it's an option.

> +	help
> +	  Device drivers can collect the device specific snapshot of
> +	  their hardware or firmware before they are initialized in
> +	  crash recovery kernel. If you say Y here, the device dumps
> +	  will be added as ELF notes to /proc/vmcore

Which exact "device drivers" are you referring to here?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-19  1:40     ` Dave Young
@ 2018-04-19 14:27       ` Rahul Lakkireddy
  2018-04-19 14:53         ` Eric W. Biederman
  0 siblings, 1 reply; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-19 14:27 UTC (permalink / raw)
  To: Dave Young
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, Indranil Choudhury,
	Nirranjan Kirubaharan, stephen, Ganesh GR, ebiederm, akpm,
	torvalds, davem, viro

On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> > > Hi Rahul,
> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> > > > On production servers running variety of workloads over time, kernel
> > > > panic can happen sporadically after days or even months. It is
> > > > important to collect as much debug logs as possible to root cause
> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
> > > > underlying hardware/firmware state (like register dump, firmware
> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
> > > > helpful while debugging the culprit device driver.
> > > > 
> > > > This series of patches add new generic framework that enable device
> > > > drivers to collect device specific snapshot of the hardware/firmware
> > > > state of the underlying device in the crash recovery kernel. In crash
> > > > recovery kernel, the collected logs are added as elf notes to
> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
> > > > 
> > > > The sequence of actions done by device drivers to append their device
> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
> > > > 
> > > > 1. During probe (before hardware is initialized), device drivers
> > > > register to the vmcore module (via vmcore_add_device_dump()), with
> > > > callback function, along with buffer size and log name needed for
> > > > firmware/hardware log collection.
> > > 
> > > I assumed the elf notes info should be prepared while kexec_[file_]load
> > > phase. But I did not read the old comment, not sure if it has been discussed
> > > or not.
> > > 
> > 
> > We must not collect dumps in crashing kernel. Adding more things in
> > crash dump path risks not collecting vmcore at all. Eric had
> > discussed this in more detail at:
> > 
> > https://lkml.org/lkml/2018/3/24/319
> > 
> > We are safe to collect dumps in the second kernel. Each device dump
> > will be exported as an elf note in /proc/vmcore.
> 
> I understand that we should avoid adding anything in crash path.  And I also
> agree to collect device dump in second kernel.  I just assumed device
> dump use some memory area to store the debug info and the memory
> is persistent so that this can be done in 2 steps, first register the
> address in elf header in kexec_load, then collect the dump in 2nd
> kernel.  But it seems the driver is doing some other logic to collect
> the info instead of just that simple like I thought. 
> 

It seems simpler, but I'm concerned with waste of memory area, if
there are no device dumps being collected in second kernel. In
approach proposed in these series, we dynamically allocate memory
for the device dumps from second kernel's available memory.

> > 
> > > If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
> > 
> > Yes, drivers will add their device dumps after vmcore init.
> > 
> > > How to guarantee the function works if vmcore reading happens before
> > > the driver is loaded?
> > > 
> > > Also it is possible that kdump initramfs does not contains the driver
> > > module.
> > > 
> > > Am I missing something?
> > > 
> > 
> > Yes, driver must be in initramfs if it wants to collect and add device
> > dump to /proc/vmcore in second kernel.
> 
> In RH/Fedora kdump scripts we only add the things are required to
> bring up the dump target, so that we can use as less memory as we can.
> 
> For example, if a net driver panicked, and the dump target is rootfs
> which is a scsi disk, then no network related stuff will be added in
> initramfs.
> 
> In this case the device dump info will be not collected..

Correct. If the driver is not present in initramfs, it can't collect
its underlying device's dump. Administrator is expected to add the 
driver to initramfs, if device dump needs to be collected.

> > 
> > > > 
> > > > 2. vmcore module allocates the buffer with requested size. It adds
> > > > an elf note and invokes the device driver's registered callback
> > > > function.
> > > > 
> > > > 3. Device driver collects all hardware/firmware logs into the buffer
> > > > and returns control back to vmcore module.
> > > > 
> > > > The device specific hardware/firmware logs can be seen as elf notes:
> > > > 
> > > > # readelf -n /proc/vmcore
> > > > 
> > > > Displaying notes found at file offset 0x00001000 with length 0x04003288:
> > > >   Owner                 Data size	Description
> > > >   VMCOREDD_cxgb4_0000:02:00.4 0x02000fd8	Unknown note type: (0x00000700)
> > > >   VMCOREDD_cxgb4_0000:04:00.4 0x02000fd8	Unknown note type: (0x00000700)
> > > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > > >   CORE                 0x00000150	NT_PRSTATUS (prstatus structure)
> > > >   VMCOREINFO           0x0000074f	Unknown note type: (0x00000000)
> > > > 
> > > > Patch 1 adds API to vmcore module to allow drivers to register callback
> > > > to collect the device specific hardware/firmware logs.  The logs will
> > > > be added to /proc/vmcore as elf notes.
> > > > 
> > > > Patch 2 updates read and mmap logic to append device specific hardware/
> > > > firmware logs as elf notes.
> > > > 
> > > > Patch 3 shows a cxgb4 driver example using the API to collect
> > > > hardware/firmware logs in crash recovery kernel, before hardware is
> > > > initialized.
> > > > 
> > > > Thanks,
> > > > Rahul
> > > > 
> > > > RFC v1: https://lkml.org/lkml/2018/3/2/542
> > > > RFC v2: https://lkml.org/lkml/2018/3/16/326
> > > > 

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-19 14:27       ` Rahul Lakkireddy
@ 2018-04-19 14:53         ` Eric W. Biederman
  2018-04-20 13:06           ` Rahul Lakkireddy
  0 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2018-04-19 14:53 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: Dave Young, netdev, kexec, linux-fsdevel, linux-kernel,
	Indranil Choudhury, Nirranjan Kirubaharan, stephen, Ganesh GR,
	akpm, torvalds, davem, viro

Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:

> On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
>> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
>> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
>> > > Hi Rahul,
>> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
>> > > > On production servers running variety of workloads over time, kernel
>> > > > panic can happen sporadically after days or even months. It is
>> > > > important to collect as much debug logs as possible to root cause
>> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
>> > > > underlying hardware/firmware state (like register dump, firmware
>> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
>> > > > helpful while debugging the culprit device driver.
>> > > > 
>> > > > This series of patches add new generic framework that enable device
>> > > > drivers to collect device specific snapshot of the hardware/firmware
>> > > > state of the underlying device in the crash recovery kernel. In crash
>> > > > recovery kernel, the collected logs are added as elf notes to
>> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
>> > > > 
>> > > > The sequence of actions done by device drivers to append their device
>> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
>> > > > 
>> > > > 1. During probe (before hardware is initialized), device drivers
>> > > > register to the vmcore module (via vmcore_add_device_dump()), with
>> > > > callback function, along with buffer size and log name needed for
>> > > > firmware/hardware log collection.
>> > > 
>> > > I assumed the elf notes info should be prepared while kexec_[file_]load
>> > > phase. But I did not read the old comment, not sure if it has been discussed
>> > > or not.
>> > > 
>> > 
>> > We must not collect dumps in crashing kernel. Adding more things in
>> > crash dump path risks not collecting vmcore at all. Eric had
>> > discussed this in more detail at:
>> > 
>> > https://lkml.org/lkml/2018/3/24/319
>> > 
>> > We are safe to collect dumps in the second kernel. Each device dump
>> > will be exported as an elf note in /proc/vmcore.
>> 
>> I understand that we should avoid adding anything in crash path.  And I also
>> agree to collect device dump in second kernel.  I just assumed device
>> dump use some memory area to store the debug info and the memory
>> is persistent so that this can be done in 2 steps, first register the
>> address in elf header in kexec_load, then collect the dump in 2nd
>> kernel.  But it seems the driver is doing some other logic to collect
>> the info instead of just that simple like I thought. 
>> 
>
> It seems simpler, but I'm concerned with waste of memory area, if
> there are no device dumps being collected in second kernel. In
> approach proposed in these series, we dynamically allocate memory
> for the device dumps from second kernel's available memory.

Don't count that kernel having more than about 128MiB.

For that reason if for no other it would be nice if it was possible to
have the driver to not initialize the device and just stand there
handing out the data a piece at a time as it is read from /proc/vmcore.

The 2GiB number I read earlier concerns me for working in a limited
environment.

It might even make sense to separate this into a completely separate
module (depended upon the main driver if it makes sense to share
the functionality) so that people performing crash dumps would not
hesitate to include the code in their initramfs images.

I can see splitting a device up into a portion only to be used in case
of a crash dump and a normal portion like we do for main memory but I
doubt that makes sense in practice.

>> > > If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
>> > 
>> > Yes, drivers will add their device dumps after vmcore init.
>> > 
>> > > How to guarantee the function works if vmcore reading happens before
>> > > the driver is loaded?
>> > > 
>> > > Also it is possible that kdump initramfs does not contains the driver
>> > > module.
>> > > 
>> > > Am I missing something?
>> > > 
>> > 
>> > Yes, driver must be in initramfs if it wants to collect and add device
>> > dump to /proc/vmcore in second kernel.
>> 
>> In RH/Fedora kdump scripts we only add the things are required to
>> bring up the dump target, so that we can use as less memory as we can.
>> 
>> For example, if a net driver panicked, and the dump target is rootfs
>> which is a scsi disk, then no network related stuff will be added in
>> initramfs.
>> 
>> In this case the device dump info will be not collected..
>
> Correct. If the driver is not present in initramfs, it can't collect
> its underlying device's dump. Administrator is expected to add the 
> driver to initramfs, if device dump needs to be collected.

That makes sense, as most people won't have that need.  Still if we can
find something that can work automatically and safely without the need
for manual configuration people are more likely to use it.

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel
  2018-04-19  8:24   ` Greg KH
@ 2018-04-19 14:56     ` Rahul Lakkireddy
  0 siblings, 0 replies; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-19 14:56 UTC (permalink / raw)
  To: Greg KH
  Cc: netdev, kexec, linux-fsdevel, linux-kernel, davem, viro,
	ebiederm, stephen, akpm, torvalds, Ganesh GR,
	Nirranjan Kirubaharan, Indranil Choudhury

On Thursday, April 04/19/18, 2018 at 13:54:56 +0530, Greg KH wrote:
> On Tue, Apr 17, 2018 at 01:14:17PM +0530, Rahul Lakkireddy wrote:
> > +config PROC_VMCORE_DEVICE_DUMP
> > +	bool "Device Hardware/Firmware Log Collection"
> > +	depends on PROC_VMCORE
> > +	default y
> 
> Only things that require the machine to keep working should be 'default
> y', please remove this, it's an option.
> 

Ok. Will fix this.

> > +	help
> > +	  Device drivers can collect the device specific snapshot of
> > +	  their hardware or firmware before they are initialized in
> > +	  crash recovery kernel. If you say Y here, the device dumps
> > +	  will be added as ELF notes to /proc/vmcore
> 
> Which exact "device drivers" are you referring to here?
> 

The API is generic enough to collect any type of device's dump. Any
driver that wants to collect its underlying hardware/firmware dump
can use the API. In our case, cxgb4 driver collects dumps of the
underlying Chelsio network devices.

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-19 14:53         ` Eric W. Biederman
@ 2018-04-20 13:06           ` Rahul Lakkireddy
  2018-04-20 13:36             ` Eric W. Biederman
  0 siblings, 1 reply; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-20 13:06 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Young, netdev, kexec, linux-fsdevel, linux-kernel,
	Indranil Choudhury, Nirranjan Kirubaharan, stephen, Ganesh GR,
	akpm, torvalds, davem, viro

On Thursday, April 04/19/18, 2018 at 20:23:37 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> 
> > On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
> >> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
> >> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> > > Hi Rahul,
> >> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> > > > On production servers running variety of workloads over time, kernel
> >> > > > panic can happen sporadically after days or even months. It is
> >> > > > important to collect as much debug logs as possible to root cause
> >> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
> >> > > > underlying hardware/firmware state (like register dump, firmware
> >> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
> >> > > > helpful while debugging the culprit device driver.
> >> > > > 
> >> > > > This series of patches add new generic framework that enable device
> >> > > > drivers to collect device specific snapshot of the hardware/firmware
> >> > > > state of the underlying device in the crash recovery kernel. In crash
> >> > > > recovery kernel, the collected logs are added as elf notes to
> >> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >> > > > 
> >> > > > The sequence of actions done by device drivers to append their device
> >> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
> >> > > > 
> >> > > > 1. During probe (before hardware is initialized), device drivers
> >> > > > register to the vmcore module (via vmcore_add_device_dump()), with
> >> > > > callback function, along with buffer size and log name needed for
> >> > > > firmware/hardware log collection.
> >> > > 
> >> > > I assumed the elf notes info should be prepared while kexec_[file_]load
> >> > > phase. But I did not read the old comment, not sure if it has been discussed
> >> > > or not.
> >> > > 
> >> > 
> >> > We must not collect dumps in crashing kernel. Adding more things in
> >> > crash dump path risks not collecting vmcore at all. Eric had
> >> > discussed this in more detail at:
> >> > 
> >> > https://lkml.org/lkml/2018/3/24/319
> >> > 
> >> > We are safe to collect dumps in the second kernel. Each device dump
> >> > will be exported as an elf note in /proc/vmcore.
> >> 
> >> I understand that we should avoid adding anything in crash path.  And I also
> >> agree to collect device dump in second kernel.  I just assumed device
> >> dump use some memory area to store the debug info and the memory
> >> is persistent so that this can be done in 2 steps, first register the
> >> address in elf header in kexec_load, then collect the dump in 2nd
> >> kernel.  But it seems the driver is doing some other logic to collect
> >> the info instead of just that simple like I thought. 
> >> 
> >
> > It seems simpler, but I'm concerned with waste of memory area, if
> > there are no device dumps being collected in second kernel. In
> > approach proposed in these series, we dynamically allocate memory
> > for the device dumps from second kernel's available memory.
> 
> Don't count that kernel having more than about 128MiB.
> 

If large dump is expected, Administrator can increase the memory
allocated to the second kernel (using crashkernel boot param), to
ensure device dumps get collected.

> For that reason if for no other it would be nice if it was possible to
> have the driver to not initialize the device and just stand there
> handing out the data a piece at a time as it is read from /proc/vmcore.
> 

Since cxgb4 is a network driver, it can be used to transfer the dumps
over the network. So we must ensure the dumps get collected and
stored, before device gets initialized to transfer dumps over
the network.

> The 2GiB number I read earlier concerns me for working in a limited
> environment.
> 

All dumps, including the 2GB on-chip memory dump, is compressed by
the cxgb4 driver as they are collected. The overall compressed dump
comes out at max 32 MB.

> It might even make sense to separate this into a completely separate
> module (depended upon the main driver if it makes sense to share
> the functionality) so that people performing crash dumps would not
> hesitate to include the code in their initramfs images.
> 
> I can see splitting a device up into a portion only to be used in case
> of a crash dump and a normal portion like we do for main memory but I
> doubt that makes sense in practice.
> 

This is not required, especially in case of network drivers, which
must collect underlying device dump and initialize the device to
transfer dumps over the network.

> >> > > If do this in 2nd kernel a question is driver can be loaded later than vmcore init.
> >> > 
> >> > Yes, drivers will add their device dumps after vmcore init.
> >> > 
> >> > > How to guarantee the function works if vmcore reading happens before
> >> > > the driver is loaded?
> >> > > 
> >> > > Also it is possible that kdump initramfs does not contains the driver
> >> > > module.
> >> > > 
> >> > > Am I missing something?
> >> > > 
> >> > 
> >> > Yes, driver must be in initramfs if it wants to collect and add device
> >> > dump to /proc/vmcore in second kernel.
> >> 
> >> In RH/Fedora kdump scripts we only add the things are required to
> >> bring up the dump target, so that we can use as less memory as we can.
> >> 
> >> For example, if a net driver panicked, and the dump target is rootfs
> >> which is a scsi disk, then no network related stuff will be added in
> >> initramfs.
> >> 
> >> In this case the device dump info will be not collected..
> >
> > Correct. If the driver is not present in initramfs, it can't collect
> > its underlying device's dump. Administrator is expected to add the 
> > driver to initramfs, if device dump needs to be collected.
> 
> That makes sense, as most people won't have that need.  Still if we can
> find something that can work automatically and safely without the need
> for manual configuration people are more likely to use it.
> 

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-20 13:06           ` Rahul Lakkireddy
@ 2018-04-20 13:36             ` Eric W. Biederman
  2018-04-20 14:51               ` Rahul Lakkireddy
  0 siblings, 1 reply; 16+ messages in thread
From: Eric W. Biederman @ 2018-04-20 13:36 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: Dave Young, netdev, kexec, linux-fsdevel, linux-kernel,
	Indranil Choudhury, Nirranjan Kirubaharan, stephen, Ganesh GR,
	akpm, torvalds, davem, viro

Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:

> On Thursday, April 04/19/18, 2018 at 20:23:37 +0530, Eric W. Biederman wrote:
>> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
>> 
>> > On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
>> >> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
>> >> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
>> >> > > Hi Rahul,
>> >> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
>> >> > > > On production servers running variety of workloads over time, kernel
>> >> > > > panic can happen sporadically after days or even months. It is
>> >> > > > important to collect as much debug logs as possible to root cause
>> >> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
>> >> > > > underlying hardware/firmware state (like register dump, firmware
>> >> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
>> >> > > > helpful while debugging the culprit device driver.
>> >> > > > 
>> >> > > > This series of patches add new generic framework that enable device
>> >> > > > drivers to collect device specific snapshot of the hardware/firmware
>> >> > > > state of the underlying device in the crash recovery kernel. In crash
>> >> > > > recovery kernel, the collected logs are added as elf notes to
>> >> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
>> >> > > > 
>> >> > > > The sequence of actions done by device drivers to append their device
>> >> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
>> >> > > > 
>> >> > > > 1. During probe (before hardware is initialized), device drivers
>> >> > > > register to the vmcore module (via vmcore_add_device_dump()), with
>> >> > > > callback function, along with buffer size and log name needed for
>> >> > > > firmware/hardware log collection.
>> >> > > 
>> >> > > I assumed the elf notes info should be prepared while kexec_[file_]load
>> >> > > phase. But I did not read the old comment, not sure if it has been discussed
>> >> > > or not.
>> >> > > 
>> >> > 
>> >> > We must not collect dumps in crashing kernel. Adding more things in
>> >> > crash dump path risks not collecting vmcore at all. Eric had
>> >> > discussed this in more detail at:
>> >> > 
>> >> > https://lkml.org/lkml/2018/3/24/319
>> >> > 
>> >> > We are safe to collect dumps in the second kernel. Each device dump
>> >> > will be exported as an elf note in /proc/vmcore.
>> >> 
>> >> I understand that we should avoid adding anything in crash path.  And I also
>> >> agree to collect device dump in second kernel.  I just assumed device
>> >> dump use some memory area to store the debug info and the memory
>> >> is persistent so that this can be done in 2 steps, first register the
>> >> address in elf header in kexec_load, then collect the dump in 2nd
>> >> kernel.  But it seems the driver is doing some other logic to collect
>> >> the info instead of just that simple like I thought. 
>> >> 
>> >
>> > It seems simpler, but I'm concerned with waste of memory area, if
>> > there are no device dumps being collected in second kernel. In
>> > approach proposed in these series, we dynamically allocate memory
>> > for the device dumps from second kernel's available memory.
>> 
>> Don't count that kernel having more than about 128MiB.
>> 
>
> If large dump is expected, Administrator can increase the memory
> allocated to the second kernel (using crashkernel boot param), to
> ensure device dumps get collected.

Except 128MiB is already a already a huge amount to reserve.  I
typically have run crash dumps with 16MiB of memory and thought it was
overkill.  Looking below 32MiB seems a bit high but it is small enough
that it is still doable.  I am baffled at how 2GiB can be guaranteed to fit
in 32MiB (sparse register space?) but if it works reliably.

>> For that reason if for no other it would be nice if it was possible to
>> have the driver to not initialize the device and just stand there
>> handing out the data a piece at a time as it is read from /proc/vmcore.
>> 
>
> Since cxgb4 is a network driver, it can be used to transfer the dumps
> over the network. So we must ensure the dumps get collected and
> stored, before device gets initialized to transfer dumps over
> the network.

Good point.  For some reason I was thinking it was an infiniband and not
an 10GiB ethernet device.

>> The 2GiB number I read earlier concerns me for working in a limited
>> environment.
>> 
>
> All dumps, including the 2GB on-chip memory dump, is compressed by
> the cxgb4 driver as they are collected. The overall compressed dump
> comes out at max 32 MB.
>
>> It might even make sense to separate this into a completely separate
>> module (depended upon the main driver if it makes sense to share
>> the functionality) so that people performing crash dumps would not
>> hesitate to include the code in their initramfs images.
>> 
>> I can see splitting a device up into a portion only to be used in case
>> of a crash dump and a normal portion like we do for main memory but I
>> doubt that makes sense in practice.
>> 
>
> This is not required, especially in case of network drivers, which
> must collect underlying device dump and initialize the device to
> transfer dumps over the network.

I have a practical concern.  What happens if the previous kernel left
the device in such a bad stat the driver can not successfully initialize
it.

Does failure to initialize cxgb4 after a crash now mean that you can not
capture the crash dump to see the crazy state the device was in?

Typically the initramfs for a crash dump does not include unnecessary
drivers so that hardware in states the drivers can't handle won't
prevent taking a crash dump.

I understand the issue if you are taking a dump over your 10GiB ethernet
it is a moot point.  But if you are writing your dump to disk, or
writing it over a management gigabit ethernet then it is still an issue.

Is there a decoupling so that a totally b0rked device can't prevent
taking it's own dump?

Eric

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel
  2018-04-20 13:36             ` Eric W. Biederman
@ 2018-04-20 14:51               ` Rahul Lakkireddy
  0 siblings, 0 replies; 16+ messages in thread
From: Rahul Lakkireddy @ 2018-04-20 14:51 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Young, netdev, kexec, linux-fsdevel, linux-kernel,
	Indranil Choudhury, Nirranjan Kirubaharan, stephen, Ganesh GR,
	akpm, torvalds, davem, viro

On Friday, April 04/20/18, 2018 at 19:06:09 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> 
> > On Thursday, April 04/19/18, 2018 at 20:23:37 +0530, Eric W. Biederman wrote:
> >> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> >> 
> >> > On Thursday, April 04/19/18, 2018 at 07:10:30 +0530, Dave Young wrote:
> >> >> On 04/18/18 at 06:01pm, Rahul Lakkireddy wrote:
> >> >> > On Wednesday, April 04/18/18, 2018 at 11:45:46 +0530, Dave Young wrote:
> >> >> > > Hi Rahul,
> >> >> > > On 04/17/18 at 01:14pm, Rahul Lakkireddy wrote:
> >> >> > > > On production servers running variety of workloads over time, kernel
> >> >> > > > panic can happen sporadically after days or even months. It is
> >> >> > > > important to collect as much debug logs as possible to root cause
> >> >> > > > and fix the problem, that may not be easy to reproduce. Snapshot of
> >> >> > > > underlying hardware/firmware state (like register dump, firmware
> >> >> > > > logs, adapter memory, etc.), at the time of kernel panic will be very
> >> >> > > > helpful while debugging the culprit device driver.
> >> >> > > > 
> >> >> > > > This series of patches add new generic framework that enable device
> >> >> > > > drivers to collect device specific snapshot of the hardware/firmware
> >> >> > > > state of the underlying device in the crash recovery kernel. In crash
> >> >> > > > recovery kernel, the collected logs are added as elf notes to
> >> >> > > > /proc/vmcore, which is copied by user space scripts for post-analysis.
> >> >> > > > 
> >> >> > > > The sequence of actions done by device drivers to append their device
> >> >> > > > specific hardware/firmware logs to /proc/vmcore are as follows:
> >> >> > > > 
> >> >> > > > 1. During probe (before hardware is initialized), device drivers
> >> >> > > > register to the vmcore module (via vmcore_add_device_dump()), with
> >> >> > > > callback function, along with buffer size and log name needed for
> >> >> > > > firmware/hardware log collection.
> >> >> > > 
> >> >> > > I assumed the elf notes info should be prepared while kexec_[file_]load
> >> >> > > phase. But I did not read the old comment, not sure if it has been discussed
> >> >> > > or not.
> >> >> > > 
> >> >> > 
> >> >> > We must not collect dumps in crashing kernel. Adding more things in
> >> >> > crash dump path risks not collecting vmcore at all. Eric had
> >> >> > discussed this in more detail at:
> >> >> > 
> >> >> > https://lkml.org/lkml/2018/3/24/319
> >> >> > 
> >> >> > We are safe to collect dumps in the second kernel. Each device dump
> >> >> > will be exported as an elf note in /proc/vmcore.
> >> >> 
> >> >> I understand that we should avoid adding anything in crash path.  And I also
> >> >> agree to collect device dump in second kernel.  I just assumed device
> >> >> dump use some memory area to store the debug info and the memory
> >> >> is persistent so that this can be done in 2 steps, first register the
> >> >> address in elf header in kexec_load, then collect the dump in 2nd
> >> >> kernel.  But it seems the driver is doing some other logic to collect
> >> >> the info instead of just that simple like I thought. 
> >> >> 
> >> >
> >> > It seems simpler, but I'm concerned with waste of memory area, if
> >> > there are no device dumps being collected in second kernel. In
> >> > approach proposed in these series, we dynamically allocate memory
> >> > for the device dumps from second kernel's available memory.
> >> 
> >> Don't count that kernel having more than about 128MiB.
> >> 
> >
> > If large dump is expected, Administrator can increase the memory
> > allocated to the second kernel (using crashkernel boot param), to
> > ensure device dumps get collected.
> 
> Except 128MiB is already a already a huge amount to reserve.  I
> typically have run crash dumps with 16MiB of memory and thought it was
> overkill.  Looking below 32MiB seems a bit high but it is small enough
> that it is still doable.  I am baffled at how 2GiB can be guaranteed to fit
> in 32MiB (sparse register space?) but if it works reliably.
> 

Yes, we skip portions in on-chip memory dump that do not add to debug
value (such as the large regions reserved for holding Payload data
going through the device) and hence the overall dump size reduces
significantly.

> >> For that reason if for no other it would be nice if it was possible to
> >> have the driver to not initialize the device and just stand there
> >> handing out the data a piece at a time as it is read from /proc/vmcore.
> >> 
> >
> > Since cxgb4 is a network driver, it can be used to transfer the dumps
> > over the network. So we must ensure the dumps get collected and
> > stored, before device gets initialized to transfer dumps over
> > the network.
> 
> Good point.  For some reason I was thinking it was an infiniband and not
> an 10GiB ethernet device.
> 
> >> The 2GiB number I read earlier concerns me for working in a limited
> >> environment.
> >> 
> >
> > All dumps, including the 2GB on-chip memory dump, is compressed by
> > the cxgb4 driver as they are collected. The overall compressed dump
> > comes out at max 32 MB.
> >
> >> It might even make sense to separate this into a completely separate
> >> module (depended upon the main driver if it makes sense to share
> >> the functionality) so that people performing crash dumps would not
> >> hesitate to include the code in their initramfs images.
> >> 
> >> I can see splitting a device up into a portion only to be used in case
> >> of a crash dump and a normal portion like we do for main memory but I
> >> doubt that makes sense in practice.
> >> 
> >
> > This is not required, especially in case of network drivers, which
> > must collect underlying device dump and initialize the device to
> > transfer dumps over the network.
> 
> I have a practical concern.  What happens if the previous kernel left
> the device in such a bad stat the driver can not successfully initialize
> it.
> 
> Does failure to initialize cxgb4 after a crash now mean that you can not
> capture the crash dump to see the crazy state the device was in?
> 
> Typically the initramfs for a crash dump does not include unnecessary
> drivers so that hardware in states the drivers can't handle won't
> prevent taking a crash dump.
> 
> I understand the issue if you are taking a dump over your 10GiB ethernet
> it is a moot point.  But if you are writing your dump to disk, or
> writing it over a management gigabit ethernet then it is still an issue.
> 
> Is there a decoupling so that a totally b0rked device can't prevent
> taking it's own dump?
> 

As long as we can safely map and access the BAR registers, we
can collect the dumps regardless of whatever state the device and
firmware were left in and store it as part of /proc/vmcore. After
that, we attempt to re-initialize the device and restart the
firmware. So, even if driver initialization fails at this point,
we still have the dumps as part of vmcore.

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-04-20 14:52 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-17  7:44 [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Rahul Lakkireddy
2018-04-17  7:44 ` [PATCH net-next v4 1/3] vmcore: add API to collect hardware dump in second kernel Rahul Lakkireddy
2018-04-19  8:24   ` Greg KH
2018-04-19 14:56     ` Rahul Lakkireddy
2018-04-17  7:44 ` [PATCH net-next v4 2/3] vmcore: append device dumps to vmcore as elf notes Rahul Lakkireddy
2018-04-17  7:44 ` [PATCH net-next v4 3/3] cxgb4: collect hardware dump in second kernel Rahul Lakkireddy
2018-04-18  6:15 ` [PATCH net-next v4 0/3] kernel: add support to collect hardware logs in crash recovery kernel Dave Young
2018-04-18 12:31   ` Rahul Lakkireddy
2018-04-18 14:28     ` Eric W. Biederman
2018-04-18 15:07       ` Rahul Lakkireddy
2018-04-19  1:40     ` Dave Young
2018-04-19 14:27       ` Rahul Lakkireddy
2018-04-19 14:53         ` Eric W. Biederman
2018-04-20 13:06           ` Rahul Lakkireddy
2018-04-20 13:36             ` Eric W. Biederman
2018-04-20 14:51               ` Rahul Lakkireddy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).