[RFC 0/2] kernel: add support to collect hardware logs in panic

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC 0/2] kernel: add support to collect hardware logs in panic
@ 2018-03-02 12:19 Rahul Lakkireddy
  2018-03-02 12:19 ` [RFC 1/2] kernel/crash_core: add API to collect hardware dump in kernel panic Rahul Lakkireddy
                   ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Rahul Lakkireddy @ 2018-03-02 12:19 UTC (permalink / raw)
  To: linux-kernel, netdev, kexec
  Cc: davem, ebiederm, akpm, torvalds, ganeshgr, nirranjan, indranil,
	Rahul Lakkireddy

On production servers running variety of workloads over time, kernel
panic can happen sporadically after days or even months. It is
important to collect as much debug logs as possible to root cause
and fix the problem, that may not be easy to reproduce. Snapshot of
underlying hardware/firmware state (like register dump, firmware
logs, adapter memory, etc.), at the time of kernel panic will be very
helpful while debugging the culprit device driver.

This series of patches add new generic framework that enable device
drivers to collect device specific snapshot of the hardware/firmware
state of the underlying device at the time of kernel panic. The
collected logs are appended to vmcore along with details, such as
start address and length of the logs, which are required for
extraction during post-analysis.

Device drivers can use crash_driver_dump_register() to register their
callback that collects underlying device specific hardware/firmware
logs during kernel panic (i.e. before booting into the second kernel).
Drivers can unregister with crash_driver_dump_unregister().

To extract the device specific hardware/firmware logs using crash:

crash> help -D | grep DRIVERDUMP
DRIVERDUMP=(cxgb4_0000:02:00.4, ffffb131090bd000, 37782968)

crash> rd ffffb131090bd000 37782968 -r hardware.log
37782968 bytes copied from 0xffffb131090bd000 to hardware.log

Patch 1 adds API to allow drivers to register callback to
collect the device specific hardware/firmware logs.

Patch 2 shows a cxgb4 driver example using the API to collect
hardware/firmware logs during kernel panic.

Suggestions and feedback will be much appreciated.

Thanks,
Rahul

Rahul Lakkireddy (2):
  kernel/crash_core: add API to collect hardware dump in kernel panic
  cxgb4: collect hardware dump in kernel panic

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |  6 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 95 +++++++++++++++++++++++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |  4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  | 12 +++
 include/linux/crash_core.h                       | 33 ++++++++
 kernel/crash_core.c                              | 83 ++++++++++++++++++++-
 kernel/kexec_core.c                              |  1 +
 7 files changed, 229 insertions(+), 5 deletions(-)

-- 
2.14.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [RFC 1/2] kernel/crash_core: add API to collect hardware dump in kernel panic
  2018-03-02 12:19 [RFC 0/2] kernel: add support to collect hardware logs in panic Rahul Lakkireddy
@ 2018-03-02 12:19 ` Rahul Lakkireddy
  2018-03-02 12:19 ` [RFC 2/2] cxgb4: " Rahul Lakkireddy
  2018-03-02 13:22 ` [RFC 0/2] kernel: add support to collect hardware logs in panic Eric W. Biederman
  2 siblings, 0 replies; 5+ messages in thread
From: Rahul Lakkireddy @ 2018-03-02 12:19 UTC (permalink / raw)
  To: linux-kernel, netdev, kexec
  Cc: davem, ebiederm, akpm, torvalds, ganeshgr, nirranjan, indranil,
	Rahul Lakkireddy

Add API to allow drivers to register callback to collect device
specific hardware/firmware dump during kernel panic. The registered
callbacks will be invoked during kernel panic before moving to second
kernel. Also save the dump location and size in vmcore info.

The dump can be extracted using crash:

crash> help -D | grep DRIVERDUMP
DRIVERDUMP=(cxgb4_0000:02:00.4, ffffb131090bd000, 37782968)

crash> rd ffffb131090bd000 37782968 -r hardware.log
37782968 bytes copied from 0xffffb131090bd000 to hardware.log

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
 include/linux/crash_core.h | 33 ++++++++++++++++++
 kernel/crash_core.c        | 83 +++++++++++++++++++++++++++++++++++++++++++++-
 kernel/kexec_core.c        |  1 +
 3 files changed, 116 insertions(+), 1 deletion(-)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index b511f6d24b42..d9cc6b3346c7 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -59,6 +59,9 @@ phys_addr_t paddr_vmcoreinfo_note(void);
 	vmcoreinfo_append_str("NUMBER(%s)=%ld\n", #name, (long)name)
 #define VMCOREINFO_CONFIG(name) \
 	vmcoreinfo_append_str("CONFIG_%s=y\n", #name)
+#define VMCOREINFO_DRIVER_DUMP(name, buf, size) \
+	vmcoreinfo_append_str("DRIVERDUMP=(%s, %lx, %lu)\n", \
+			      name, (unsigned long)buf, (unsigned long)size)
 
 extern u32 *vmcoreinfo_note;
 
@@ -73,4 +76,34 @@ int parse_crashkernel_high(char *cmdline, unsigned long long system_ram,
 int parse_crashkernel_low(char *cmdline, unsigned long long system_ram,
 		unsigned long long *crash_size, unsigned long long *crash_base);
 
+/* Driver dump collection */
+#define CRASH_DRIVER_DUMP_NAME_LENGTH 32
+
+struct crash_driver_dump_node {
+	struct crash_driver_dump_node __rcu *next;
+	struct crash_driver_dump *data;
+	struct notifier_block *nb;
+};
+
+struct crash_driver_dump_head {
+	struct crash_driver_dump_node __rcu *head;
+};
+
+#define CRASH_DRIVER_DUMP_INIT(name) { \
+	.head = NULL \
+}
+
+#define CRASH_DRIVER_DUMP_HEAD(name) \
+	struct crash_driver_dump_head name = CRASH_DRIVER_DUMP_INIT(name) \
+
+struct crash_driver_dump {
+	char name[CRASH_DRIVER_DUMP_NAME_LENGTH];
+	void *buf;
+	unsigned long size;
+};
+
+void crash_driver_dump_call_handlers(void);
+int crash_driver_dump_register(struct crash_driver_dump *data,
+			       struct notifier_block *nb);
+int crash_driver_dump_unregister(struct crash_driver_dump *data);
 #endif /* LINUX_CRASH_CORE_H */
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 4f63597c824d..ff62713f597e 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -21,6 +21,64 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
+/* List of driver registered callbacks to collect dumps */
+CRASH_DRIVER_DUMP_HEAD(crash_driver_dump_list);
+ATOMIC_NOTIFIER_HEAD(crash_driver_dump_notifier_list);
+
+static int driver_dump_callback_register(struct crash_driver_dump_head *dh,
+					 struct crash_driver_dump *data,
+					 struct notifier_block *nb)
+{
+	struct crash_driver_dump_node *d, **dl;
+
+	d = vzalloc(sizeof(*d));
+	if (!d)
+		return -ENOMEM;
+
+	atomic_notifier_chain_register(&crash_driver_dump_notifier_list, nb);
+	d->nb = nb;
+	d->data = data;
+
+	dl = &dh->head;
+	while (*dl)
+		dl = &((*dl)->next);
+	d->next = *dl;
+	rcu_assign_pointer(*dl, d);
+	return 0;
+}
+
+static int driver_dump_callback_unregister(struct crash_driver_dump_head *dh,
+					   struct crash_driver_dump *data)
+{
+	struct atomic_notifier_head *ah = &crash_driver_dump_notifier_list;
+	struct crash_driver_dump_node **dl;
+
+	dl = &dh->head;
+	while (*dl) {
+		if ((*dl)->data == data) {
+			atomic_notifier_chain_unregister(ah, (*dl)->nb);
+			rcu_assign_pointer((*dl), (*dl)->next);
+			vfree(*dl);
+			return 0;
+		}
+		dl = &((*dl)->next);
+	}
+	return -ENOENT;
+}
+
+int crash_driver_dump_register(struct crash_driver_dump *data,
+			       struct notifier_block *nb)
+{
+	return driver_dump_callback_register(&crash_driver_dump_list, data, nb);
+}
+EXPORT_SYMBOL(crash_driver_dump_register);
+
+int crash_driver_dump_unregister(struct crash_driver_dump *data)
+{
+	return driver_dump_callback_unregister(&crash_driver_dump_list, data);
+}
+EXPORT_SYMBOL(crash_driver_dump_unregister);
+
 /*
  * parsing the "crashkernel" commandline
  *
@@ -365,6 +423,30 @@ void vmcoreinfo_append_str(const char *fmt, ...)
 	vmcoreinfo_size += r;
 }
 
+void crash_driver_dump_call_handlers(void)
+{
+	static char driver_dump_buf[1024];
+	struct crash_driver_dump_node *dl;
+
+	dl = crash_driver_dump_list.head;
+	if (dl) {
+		atomic_notifier_call_chain(&crash_driver_dump_notifier_list, 0,
+					   driver_dump_buf);
+
+		/* Append to safe copy if available */
+		if (vmcoreinfo_data_safecopy)
+			vmcoreinfo_data = vmcoreinfo_data_safecopy;
+
+		while (dl) {
+			VMCOREINFO_DRIVER_DUMP(dl->data->name, dl->data->buf,
+					       dl->data->size);
+			dl = dl->next;
+		}
+
+		update_vmcoreinfo_note();
+	}
+}
+
 /*
  * provide an empty default implementation here -- architecture
  * code may override this
@@ -462,7 +544,6 @@ static int __init crash_save_vmcoreinfo_init(void)
 #ifdef CONFIG_HUGETLB_PAGE
 	VMCOREINFO_NUMBER(HUGETLB_PAGE_DTOR);
 #endif
-
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();
 
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 20fef1a38602..b86ffe24fe09 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -938,6 +938,7 @@ void __noclone __crash_kexec(struct pt_regs *regs)
 		if (kexec_crash_image) {
 			struct pt_regs fixed_regs;
 
+			crash_driver_dump_call_handlers();
 			crash_setup_regs(&fixed_regs, regs);
 			crash_save_vmcoreinfo();
 			machine_crash_shutdown(&fixed_regs);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [RFC 2/2] cxgb4: collect hardware dump in kernel panic
  2018-03-02 12:19 [RFC 0/2] kernel: add support to collect hardware logs in panic Rahul Lakkireddy
  2018-03-02 12:19 ` [RFC 1/2] kernel/crash_core: add API to collect hardware dump in kernel panic Rahul Lakkireddy
@ 2018-03-02 12:19 ` Rahul Lakkireddy
  2018-03-02 13:22 ` [RFC 0/2] kernel: add support to collect hardware logs in panic Eric W. Biederman
  2 siblings, 0 replies; 5+ messages in thread
From: Rahul Lakkireddy @ 2018-03-02 12:19 UTC (permalink / raw)
  To: linux-kernel, netdev, kexec
  Cc: davem, ebiederm, akpm, torvalds, ganeshgr, nirranjan, indranil,
	Rahul Lakkireddy

Pre-allocate dump buffer and register callback to collect hardware/
firmware logs in kernel panic. Free dump buffer on driver unload.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h       |  6 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c | 95 +++++++++++++++++++++++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h |  4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c  | 12 +++
 4 files changed, 113 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index d3fa53db61ee..21d095668374 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -50,6 +50,7 @@
 #include <linux/net_tstamp.h>
 #include <linux/ptp_clock_kernel.h>
 #include <linux/ptp_classify.h>
+#include <linux/crash_core.h>
 #include <asm/io.h>
 #include "t4_chip_type.h"
 #include "cxgb4_uld.h"
@@ -568,6 +569,7 @@ enum {                                 /* adapter flags */
 	FW_OFLD_CONN       = (1 << 9),
 	ROOT_NO_RELAXED_ORDERING = (1 << 10),
 	SHUTTING_DOWN	   = (1 << 11),
+	K_CRASH            = (1 << 12),
 };
 
 enum {
@@ -946,6 +948,10 @@ struct adapter {
 
 	/* Ethtool Dump */
 	struct ethtool_dump eth_dump;
+
+	/* Dump buffer for collecting logs in panic */
+	struct crash_driver_dump dump_buf;
+	struct notifier_block panic_nb;
 };
 
 /* Support for "sched-class" command to allow a TX Scheduling Class to be
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
index 143686c60234..c10d5e88321f 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.c
@@ -383,13 +383,25 @@ static void cxgb4_cudbg_collect_entity(struct cudbg_init *pdbg_init,
 
 static int cudbg_alloc_compress_buff(struct cudbg_init *pdbg_init)
 {
+	struct adapter *adap = pdbg_init->adap;
 	u32 workspace_size;
 
 	workspace_size = cudbg_get_workspace_size();
-	pdbg_init->compress_buff = vzalloc(CUDBG_COMPRESS_BUFF_SIZE +
-					   workspace_size);
-	if (!pdbg_init->compress_buff)
-		return -ENOMEM;
+
+	if (adap->flags & K_CRASH) {
+		/* In panic scenario, the compression buffer is already
+		 * allocated. So, just update accordingly.
+		 */
+		pdbg_init->compress_buff = (u8 *)adap->dump_buf.buf +
+					   adap->dump_buf.size -
+					   workspace_size -
+					   CUDBG_COMPRESS_BUFF_SIZE;
+	} else {
+		pdbg_init->compress_buff = vzalloc(CUDBG_COMPRESS_BUFF_SIZE +
+						   workspace_size);
+		if (!pdbg_init->compress_buff)
+			return -ENOMEM;
+	}
 
 	pdbg_init->compress_buff_size = CUDBG_COMPRESS_BUFF_SIZE;
 	pdbg_init->workspace = (u8 *)pdbg_init->compress_buff +
@@ -399,6 +411,14 @@ static int cudbg_alloc_compress_buff(struct cudbg_init *pdbg_init)
 
 static void cudbg_free_compress_buff(struct cudbg_init *pdbg_init)
 {
+	struct adapter *adap = pdbg_init->adap;
+
+	/* Don't free in panic scenario.  We need the buffer to be present
+	 * in vmcore so that we can extract the dump.
+	 */
+	if (adap->flags & K_CRASH)
+		return;
+
 	if (pdbg_init->compress_buff)
 		vfree(pdbg_init->compress_buff);
 }
@@ -488,3 +508,70 @@ void cxgb4_init_ethtool_dump(struct adapter *adapter)
 	adapter->eth_dump.version = adapter->params.fw_vers;
 	adapter->eth_dump.len = 0;
 }
+
+static int cxgb4_panic_notify(struct notifier_block *this, unsigned long event,
+			      void *ptr)
+{
+	struct adapter *adap = container_of(this, struct adapter, panic_nb);
+	bool use_bd;
+	u32 len;
+
+	/* Save original value and restore after collection */
+	use_bd = adap->use_bd;
+
+	dev_info(adap->pdev_dev, "Initialized cxgb4 debug collection...");
+	adap->flags |= K_CRASH;
+
+	/* Don't contact firmware.  Directly access registers */
+	adap->use_bd = true;
+
+	len = adap->dump_buf.size;
+	cxgb4_cudbg_collect(adap, adap->dump_buf.buf, &len, CXGB4_ETH_DUMP_ALL);
+	dev_info(adap->pdev_dev, "cxgb4 debug collection done...");
+
+	/* Restore original value */
+	adap->use_bd = use_bd;
+	return NOTIFY_DONE;
+}
+
+int cxgb4_cudbg_register_crash_dump(struct adapter *adap)
+{
+	u32 wsize, len;
+	int ret;
+
+	len = sizeof(struct cudbg_hdr) +
+	      sizeof(struct cudbg_entity_hdr) * CUDBG_MAX_ENTITY;
+	len += CUDBG_DUMP_BUFF_SIZE;
+
+	/* If compression is enabled, allocate extra memory needed for
+	 * compression too.
+	 */
+	wsize = cudbg_get_workspace_size();
+	if (wsize)
+		wsize += CUDBG_COMPRESS_BUFF_SIZE;
+
+	adap->dump_buf.size = len + wsize;
+	adap->dump_buf.buf = vzalloc(adap->dump_buf.size);
+	if (!adap->dump_buf.buf)
+		return -ENOMEM;
+
+	sprintf(adap->dump_buf.name, "cxgb4_%s", adap->name);
+	adap->panic_nb.notifier_call = cxgb4_panic_notify;
+	adap->panic_nb.priority = INT_MAX;
+
+	ret = crash_driver_dump_register(&adap->dump_buf, &adap->panic_nb);
+	if (ret) {
+		vfree(adap->dump_buf.buf);
+		return ret;
+	}
+
+	return 0;
+}
+
+void cxgb4_cudbg_unregister_crash_dump(struct adapter *adap)
+{
+	if (adap->dump_buf.buf) {
+		crash_driver_dump_unregister(&adap->dump_buf);
+		vfree(adap->dump_buf.buf);
+	}
+}
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
index ce1ac9a1c878..79261313a350 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_cudbg.h
@@ -41,8 +41,12 @@ enum CXGB4_ETHTOOL_DUMP_FLAGS {
 	CXGB4_ETH_DUMP_HW = (1 << 1), /* various FW and HW dumps */
 };
 
+#define CXGB4_ETH_DUMP_ALL (CXGB4_ETH_DUMP_MEM | CXGB4_ETH_DUMP_HW)
+
 u32 cxgb4_get_dump_length(struct adapter *adap, u32 flag);
 int cxgb4_cudbg_collect(struct adapter *adap, void *buf, u32 *buf_size,
 			u32 flag);
 void cxgb4_init_ethtool_dump(struct adapter *adapter);
+int cxgb4_cudbg_register_crash_dump(struct adapter *adap);
+void cxgb4_cudbg_unregister_crash_dump(struct adapter *adap);
 #endif /* __CXGB4_CUDBG_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7b452e85de2a..64eeffe0ba45 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -5291,6 +5291,16 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 	}
 
 	setup_memwin(adapter);
+
+	/* Register panic notifier */
+	err = cxgb4_cudbg_register_crash_dump(adapter);
+	if (err) {
+		dev_warn(adapter->pdev_dev,
+			 "Fail registering panic notifier, err: %d. Continuing\n",
+			 err);
+		err = 0;
+	}
+
 	err = adap_init0(adapter);
 #ifdef CONFIG_DEBUG_FS
 	bitmap_zero(adapter->sge.blocked_fl, adapter->sge.egr_sz);
@@ -5538,6 +5548,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 		destroy_workqueue(adapter->workq);
 
 	kfree(adapter->mbox_log);
+	cxgb4_cudbg_unregister_crash_dump(adapter);
 	kfree(adapter);
  out_unmap_bar0:
 	iounmap(regs);
@@ -5617,6 +5628,7 @@ static void remove_one(struct pci_dev *pdev)
 	pci_release_regions(pdev);
 	kfree(adapter->mbox_log);
 	synchronize_rcu();
+	cxgb4_cudbg_unregister_crash_dump(adapter);
 	kfree(adapter);
 }
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC 0/2] kernel: add support to collect hardware logs in panic
  2018-03-02 12:19 [RFC 0/2] kernel: add support to collect hardware logs in panic Rahul Lakkireddy
  2018-03-02 12:19 ` [RFC 1/2] kernel/crash_core: add API to collect hardware dump in kernel panic Rahul Lakkireddy
  2018-03-02 12:19 ` [RFC 2/2] cxgb4: " Rahul Lakkireddy
@ 2018-03-02 13:22 ` Eric W. Biederman
  2018-03-03 10:43   ` Rahul Lakkireddy
  2 siblings, 1 reply; 5+ messages in thread
From: Eric W. Biederman @ 2018-03-02 13:22 UTC (permalink / raw)
  To: Rahul Lakkireddy
  Cc: linux-kernel, netdev, kexec, davem, akpm, torvalds, ganeshgr,
	nirranjan, indranil

Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:

> On production servers running variety of workloads over time, kernel
> panic can happen sporadically after days or even months. It is
> important to collect as much debug logs as possible to root cause
> and fix the problem, that may not be easy to reproduce. Snapshot of
> underlying hardware/firmware state (like register dump, firmware
> logs, adapter memory, etc.), at the time of kernel panic will be very
> helpful while debugging the culprit device driver.
>
> This series of patches add new generic framework that enable device
> drivers to collect device specific snapshot of the hardware/firmware
> state of the underlying device at the time of kernel panic. The
> collected logs are appended to vmcore along with details, such as
> start address and length of the logs, which are required for
> extraction during post-analysis.
>
> Device drivers can use crash_driver_dump_register() to register their
> callback that collects underlying device specific hardware/firmware
> logs during kernel panic (i.e. before booting into the second kernel).
> Drivers can unregister with crash_driver_dump_unregister().
>
> To extract the device specific hardware/firmware logs using crash:
>
> crash> help -D | grep DRIVERDUMP
> DRIVERDUMP=(cxgb4_0000:02:00.4, ffffb131090bd000, 37782968)
>
> crash> rd ffffb131090bd000 37782968 -r hardware.log
> 37782968 bytes copied from 0xffffb131090bd000 to hardware.log
>
> Patch 1 adds API to allow drivers to register callback to
> collect the device specific hardware/firmware logs.
>
> Patch 2 shows a cxgb4 driver example using the API to collect
> hardware/firmware logs during kernel panic.
>
> Suggestions and feedback will be much appreciated.

I strongly suggest you figure out how to run this code in the
crash recovery kernel before your hardware is initialized.
That will give you a known good kernel to perform your collection from.

Every line of code we add to the kexec on panic code path tends to add
to it's fragility and increase the chance you won't get any information
at all.

When the assumption is it is something wrong with your driver/hardware
that caused the crash, calling into your driver is a very bad idea.
Especially running code that does callbacks and all kinds of other cute
things.

Doing this as the crash recover kernel boots up before much if any
hardware is initialized seems like a fine thing to do, and just
needs a little coordination with userspace to ensure the information
gets saved when a vmcore is computed.

Eric

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC 0/2] kernel: add support to collect hardware logs in panic
  2018-03-02 13:22 ` [RFC 0/2] kernel: add support to collect hardware logs in panic Eric W. Biederman
@ 2018-03-03 10:43   ` Rahul Lakkireddy
  0 siblings, 0 replies; 5+ messages in thread
From: Rahul Lakkireddy @ 2018-03-03 10:43 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: linux-kernel, netdev, kexec, davem, akpm, torvalds, Ganesh GR,
	Nirranjan Kirubaharan, Indranil Choudhury

On Friday, March 03/02/18, 2018 at 18:52:45 +0530, Eric W. Biederman wrote:
> Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> writes:
> 
> > On production servers running variety of workloads over time, kernel
> > panic can happen sporadically after days or even months. It is
> > important to collect as much debug logs as possible to root cause
> > and fix the problem, that may not be easy to reproduce. Snapshot of
> > underlying hardware/firmware state (like register dump, firmware
> > logs, adapter memory, etc.), at the time of kernel panic will be very
> > helpful while debugging the culprit device driver.
> >
> > This series of patches add new generic framework that enable device
> > drivers to collect device specific snapshot of the hardware/firmware
> > state of the underlying device at the time of kernel panic. The
> > collected logs are appended to vmcore along with details, such as
> > start address and length of the logs, which are required for
> > extraction during post-analysis.
> >
> > Device drivers can use crash_driver_dump_register() to register their
> > callback that collects underlying device specific hardware/firmware
> > logs during kernel panic (i.e. before booting into the second kernel).
> > Drivers can unregister with crash_driver_dump_unregister().
> >
> > To extract the device specific hardware/firmware logs using crash:
> >
> > crash> help -D | grep DRIVERDUMP
> > DRIVERDUMP=(cxgb4_0000:02:00.4, ffffb131090bd000, 37782968)
> >
> > crash> rd ffffb131090bd000 37782968 -r hardware.log
> > 37782968 bytes copied from 0xffffb131090bd000 to hardware.log
> >
> > Patch 1 adds API to allow drivers to register callback to
> > collect the device specific hardware/firmware logs.
> >
> > Patch 2 shows a cxgb4 driver example using the API to collect
> > hardware/firmware logs during kernel panic.
> >
> > Suggestions and feedback will be much appreciated.
> 
> I strongly suggest you figure out how to run this code in the
> crash recovery kernel before your hardware is initialized.
> That will give you a known good kernel to perform your collection from.
> 
> Every line of code we add to the kexec on panic code path tends to add
> to it's fragility and increase the chance you won't get any information
> at all.
> 
> When the assumption is it is something wrong with your driver/hardware
> that caused the crash, calling into your driver is a very bad idea.
> Especially running code that does callbacks and all kinds of other cute
> things.
> 
> Doing this as the crash recover kernel boots up before much if any
> hardware is initialized seems like a fine thing to do, and just
> needs a little coordination with userspace to ensure the information
> gets saved when a vmcore is computed.
> 

Thanks for the feedback and suggestions. I will work on achieving
this from the crash recover kernel.

Thanks,
Rahul

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2018-03-03 10:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-02 12:19 [RFC 0/2] kernel: add support to collect hardware logs in panic Rahul Lakkireddy
2018-03-02 12:19 ` [RFC 1/2] kernel/crash_core: add API to collect hardware dump in kernel panic Rahul Lakkireddy
2018-03-02 12:19 ` [RFC 2/2] cxgb4: " Rahul Lakkireddy
2018-03-02 13:22 ` [RFC 0/2] kernel: add support to collect hardware logs in panic Eric W. Biederman
2018-03-03 10:43   ` Rahul Lakkireddy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).