* [PATCH V5] Eliminate task stack trace duplication.
@ 2011-07-29  1:25 Andrew Bresticker
  2011-08-26 23:19 ` Andrew Morton
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Bresticker @ 2011-07-29  1:25 UTC (permalink / raw)
  To: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Andrew Morton,
	Li Zefan, Mel Gorman, Christoph Lameter, Johannes Weiner,
	Rik van Riel, Hugh Dickins, Michal Hocko, Dave Hansen,
	Zhu Yanhai
  Cc: linux-mm, Andrew Bresticker, Ying Han

The problem with a small dmesg ring buffer (e.g. 512k) is that only a
limited number of task stack traces can be logged. Sometimes we lose
important information simply because of too many duplicated stack traces.
This problem occurs when dumping lots of stacks in a single operation,
such as sysrq-T.

This patch reduces the duplication of task stack traces in the dump by
hashing each task's stack. The hash table is a 32k buffer pre-allocated
at boot. Each task stack is then hashed entry by entry, up to a depth of
32 frames. Whenever an identical trace is found in the hash table, we dump
only the pid of the task whose trace was already printed, so it is easy
to track back to the full stack via that pid.

[   58.469730] kworker/0:0     S 0000000000000000     0     4      2 0x00000000
[   58.469735]  ffff88082fcfde80 0000000000000046 ffff88082e9d8000 ffff88082fcfc010
[   58.469739]  ffff88082fce9860 0000000000011440 ffff88082fcfdfd8 ffff88082fcfdfd8
[   58.469743]  0000000000011440 0000000000000000 ffff88082fcee180 ffff88082fce9860
[   58.469747] Call Trace:
[   58.469751]  [<ffffffff8108525a>] worker_thread+0x24b/0x250
[   58.469754]  [<ffffffff8108500f>] ? manage_workers+0x192/0x192
[   58.469757]  [<ffffffff810885bd>] kthread+0x82/0x8a
[   58.469760]  [<ffffffff8141aed4>] kernel_thread_helper+0x4/0x10
[   58.469763]  [<ffffffff8108853b>] ? kthread_worker_fn+0x112/0x112
[   58.469765]  [<ffffffff8141aed0>] ? gs_change+0xb/0xb
[   58.469768] kworker/u:0     S 0000000000000004     0     5      2 0x00000000
[   58.469773]  ffff88082fcffe80 0000000000000046 ffff880800000000 ffff88082fcfe010
[   58.469777]  ffff88082fcea080 0000000000011440 ffff88082fcfffd8 ffff88082fcfffd8
[   58.469781]  0000000000011440 0000000000000000 ffff88082fd4e9a0 ffff88082fcea080
[   58.469785] Call Trace:
[   58.469786] <Same stack as pid 4>
[   58.470235] kworker/0:1     S 0000000000000000     0    13      2 0x00000000
[   58.470255]  ffff88082fd3fe80 0000000000000046 ffff880800000000 ffff88082fd3e010
[   58.470279]  ffff88082fcee180 0000000000011440 ffff88082fd3ffd8 ffff88082fd3ffd8
[   58.470301]  0000000000011440 0000000000000000 ffffffff8180b020 ffff88082fcee180
[   58.470325] Call Trace:
[   58.470332] <Same stack as pid 4>

Note: Non-x86 architectures will need to be updated since show_stack()
now takes an additional argument.
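
For illustration only (this hunk is not part of the patch), the conversion
for another architecture is mechanical; a sketch using ARM's existing
unwinder as an example:

	void show_stack(struct task_struct *tsk, unsigned long *sp,
			pid_t dup_stack_pid)
	{
		if (dup_stack_pid) {
			/* dedup already saw this trace; just point at it */
			printk("Call Trace:\n<Same stack as pid %d>\n\n",
			       dup_stack_pid);
			return;
		}
		dump_backtrace(NULL, tsk);	/* existing arch unwinder */
		barrier();
	}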

changelog v5..v4:
1. removed changes to Kconfig file
2. changed hashtable to keep only hash value and length of stack
3. simplified hashtable lookup

changelog v4..v3:
1. improve de-duplication by eliminating garbage entries from stack traces.
With this change, 793/825 stack traces were recognized as duplicates; in v3
only 482/839 were duplicates.

changelog v3..v2:
1. further improved the patch description
2. allocate stack_hash_table at compile time
3. use a better name for the index variable
4. move save_dup_stack_trace() into kernel/stacktrace.c

changelog v2..v1:
1. improved the patch description
2. move the spinlock inside the hash lookup, reducing the lock hold time

Note:
1. With pid namespaces, the same pid number can belong to different processes.
It is not clear how the existing stack trace code (without dedup) handles this
case; as far as I checked, it uses tsk->pid as well.

Change-Id: I57f5913002192367f71ab3c02d3fdc5e74e4caf1
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Bresticker <abrestic@google.com>
---
 arch/x86/include/asm/stacktrace.h |    6 ++-
 arch/x86/kernel/dumpstack.c       |   24 +++++++--
 arch/x86/kernel/dumpstack_32.c    |    7 ++-
 arch/x86/kernel/dumpstack_64.c    |   11 +++-
 arch/x86/kernel/stacktrace.c      |   97 +++++++++++++++++++++++++++++++++++++
 drivers/tty/sysrq.c               |    2 +-
 include/linux/sched.h             |    3 +-
 include/linux/stacktrace.h        |    2 +
 kernel/debug/kdb/kdb_bt.c         |    8 ++--
 kernel/rtmutex-debug.c            |    2 +-
 kernel/sched.c                    |   20 +++++++-
 kernel/stacktrace.c               |   10 ++++
 12 files changed, 168 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 70bbe39..ebaec8b 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -83,11 +83,13 @@ stack_frame(struct task_struct *task, struct pt_regs *regs)
 
 extern void
 show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		   unsigned long *stack, unsigned long bp, char *log_lvl);
+		   unsigned long *stack, unsigned long bp, char *log_lvl,
+		   pid_t dup_stack_pid);
 
 extern void
 show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		   unsigned long *sp, unsigned long bp, char *log_lvl);
+		   unsigned long *sp, unsigned long bp, char *log_lvl,
+		   pid_t dup_stack_pid);
 
 extern unsigned int code_bytes;
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 1aae78f..15d981b 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -157,23 +157,35 @@ static const struct stacktrace_ops print_trace_ops = {
 	.walk_stack		= print_context_stack,
 };
 
+/*
+ * The dup_stack_pid indicates whether or not a duplication of stack trace
+ * has been recorded. The non-zero value indicates the pid of the task with
+ * the same stack trace.
+ */
 void
 show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		unsigned long *stack, unsigned long bp, char *log_lvl)
+		unsigned long *stack, unsigned long bp, char *log_lvl,
+		pid_t dup_stack_pid)
 {
-	printk("%sCall Trace:\n", log_lvl);
-	dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
+	if (dup_stack_pid) {
+		printk("%sCall Trace:\n", log_lvl);
+		printk("<Same stack as pid %d>\n\n", dup_stack_pid);
+	} else {
+		printk("%sCall Trace:\n", log_lvl);
+		dump_trace(task, regs, stack, bp, &print_trace_ops, log_lvl);
+	}
 }
 
 void show_trace(struct task_struct *task, struct pt_regs *regs,
 		unsigned long *stack, unsigned long bp)
 {
-	show_trace_log_lvl(task, regs, stack, bp, "");
+	show_trace_log_lvl(task, regs, stack, bp, "", 0);
 }
 
-void show_stack(struct task_struct *task, unsigned long *sp)
+void show_stack(struct task_struct *task, unsigned long *sp,
+		pid_t dup_stack_pid)
 {
-	show_stack_log_lvl(task, NULL, sp, 0, "");
+	show_stack_log_lvl(task, NULL, sp, 0, "", dup_stack_pid);
 }
 
 /*
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index 3b97a80..bd06106 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -56,7 +56,8 @@ EXPORT_SYMBOL(dump_trace);
 
 void
 show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		   unsigned long *sp, unsigned long bp, char *log_lvl)
+		   unsigned long *sp, unsigned long bp, char *log_lvl,
+		   pid_t dup_stack_pid)
 {
 	unsigned long *stack;
 	int i;
@@ -78,7 +79,7 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 		touch_nmi_watchdog();
 	}
 	printk(KERN_CONT "\n");
-	show_trace_log_lvl(task, regs, sp, bp, log_lvl);
+	show_trace_log_lvl(task, regs, sp, bp, log_lvl, dup_stack_pid);
 }
 
 
@@ -103,7 +104,7 @@ void show_registers(struct pt_regs *regs)
 		u8 *ip;
 
 		printk(KERN_EMERG "Stack:\n");
-		show_stack_log_lvl(NULL, regs, &regs->sp, 0, KERN_EMERG);
+		show_stack_log_lvl(NULL, regs, &regs->sp, 0, KERN_EMERG, 0);
 
 		printk(KERN_EMERG "Code: ");
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 19853ad..1c26a36 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -198,7 +198,8 @@ EXPORT_SYMBOL(dump_trace);
 
 void
 show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
-		   unsigned long *sp, unsigned long bp, char *log_lvl)
+		   unsigned long *sp, unsigned long bp, char *log_lvl,
+		   pid_t dup_stack_pid)
 {
 	unsigned long *irq_stack_end;
 	unsigned long *irq_stack;
@@ -242,7 +243,11 @@ show_stack_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	preempt_enable();
 
 	printk(KERN_CONT "\n");
-	show_trace_log_lvl(task, regs, sp, bp, log_lvl);
+	if (dup_stack_pid) {
+		printk(KERN_CONT "%sCall Trace:\n", log_lvl);
+		printk(KERN_CONT "<Same stack as pid %d>\n\n", dup_stack_pid);
+	} else
+		show_trace_log_lvl(task, regs, sp, bp, log_lvl, dup_stack_pid);
 }
 
 void show_registers(struct pt_regs *regs)
@@ -271,7 +276,7 @@ void show_registers(struct pt_regs *regs)
 
 		printk(KERN_EMERG "Stack:\n");
 		show_stack_log_lvl(NULL, regs, (unsigned long *)sp,
-				   0, KERN_EMERG);
+				   0, KERN_EMERG, 0);
 
 		printk(KERN_EMERG "Code: ");
 
diff --git a/arch/x86/kernel/stacktrace.c b/arch/x86/kernel/stacktrace.c
index fdd0c64..175ba1e 100644
--- a/arch/x86/kernel/stacktrace.c
+++ b/arch/x86/kernel/stacktrace.c
@@ -7,6 +7,7 @@
 #include <linux/stacktrace.h>
 #include <linux/module.h>
 #include <linux/uaccess.h>
+#include <linux/jhash.h>
 #include <asm/stacktrace.h>
 
 static int save_stack_stack(void *data, char *name)
@@ -81,6 +82,102 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct stack_trace *trace)
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
+/*
+ * The implementation of stack trace dedup. It tries to reduce the duplication
+ * of task stack trace in the dump by hashing the stack trace. The hashtable is
+ * 32k pre-allocated buffer. Then we hash the task stack with stack_depth
+ * DEDUP_MAX_STACK_DEPTH for each stack entry. Each time if an identical trace
+ * is found in the stack, we dump only the pid of previous task. So it is easy
+ * to back track to the full stack with the pid.
+ */
+#define DEDUP_MAX_STACK_DEPTH 32
+#define DEDUP_STACK_HASH 32768
+#define DEDUP_STACK_ENTRIES (DEDUP_STACK_HASH/sizeof(struct task_stack))
+#define DEDUP_HASH_MAX_ITERATIONS 10
+
+struct task_stack {
+	pid_t pid;
+	int len;
+	unsigned long hash;
+};
+
+static struct task_stack stack_hash_table[DEDUP_STACK_ENTRIES];
+static struct task_stack cur_stack;
+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(stack_hash_lock);
+
+/*
+ * The stack hashtable uses linear probing to resolve collisions.
+ * We consider two stacks to be the same if their hash values and lengths
+ * are equal.
+ */
+static unsigned int stack_trace_lookup(void)
+{
+	int j;
+	int index;
+	unsigned int ret = 0;
+	struct task_stack *stack;
+
+	index = cur_stack.hash % DEDUP_STACK_ENTRIES;
+
+	for (j = 0; j < DEDUP_HASH_MAX_ITERATIONS; j++) {
+		stack = stack_hash_table + (index + j) % DEDUP_STACK_ENTRIES;
+		if (stack->hash == 0) {
+			*stack = cur_stack;
+			ret = 0;
+			break;
+		} else {
+			if (stack->hash == cur_stack.hash &&
+			    stack->len == cur_stack.len) {
+				ret = stack->pid;
+				break;
+			}
+		}
+	}
+	if (j == DEDUP_HASH_MAX_ITERATIONS)
+		stack_hash_table[index] = cur_stack;
+
+	memset(&cur_stack, 0, sizeof(cur_stack));
+
+	return ret;
+}
+
+static int save_dup_stack_stack(void *data, char *name)
+{
+	return 0;
+}
+
+static void save_dup_stack_address(void *data, unsigned long addr, int reliable)
+{
+	/*
+	 * To improve de-duplication, we'll only record reliable entries
+	 * in the stack trace.
+	 */
+	if (!reliable)
+		return;
+	cur_stack.hash = jhash(&addr, sizeof(addr), cur_stack.hash);
+	cur_stack.len++;
+}
+
+static const struct stacktrace_ops save_dup_stack_ops = {
+	.stack = save_dup_stack_stack,
+	.address = save_dup_stack_address,
+	.walk_stack = print_context_stack,
+};
+
+unsigned int save_dup_stack_trace(struct task_struct *tsk)
+{
+	unsigned int ret = 0;
+	unsigned int dummy = 0;
+
+	spin_lock(&stack_hash_lock);
+	dump_trace(tsk, NULL, NULL, 0, &save_dup_stack_ops, &dummy);
+	cur_stack.pid = tsk->pid;
+	ret = stack_trace_lookup();
+	spin_unlock(&stack_hash_lock);
+
+	return ret;
+}
+
 /* Userspace stacktrace - based on kernel/trace/trace_sysprof.c */
 
 struct stack_frame_user {
diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 43db715..1165464 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -214,7 +214,7 @@ static void showacpu(void *dummy)
 
 	spin_lock_irqsave(&show_lock, flags);
 	printk(KERN_INFO "CPU%d:\n", smp_processor_id());
-	show_stack(NULL, NULL);
+	show_stack(NULL, NULL, 0);
 	spin_unlock_irqrestore(&show_lock, flags);
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3e00903..ae7ee34 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -295,7 +295,8 @@ extern void show_regs(struct pt_regs *);
  * task), SP is the stack pointer of the first frame that should be shown in the back
  * trace (or NULL if the entire call-chain of the task should be shown).
  */
-extern void show_stack(struct task_struct *task, unsigned long *sp);
+extern void show_stack(struct task_struct *task, unsigned long *sp,
+			pid_t dup_stack_pid);
 
 void io_schedule(void);
 long io_schedule_timeout(long timeout);
diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
index 115b570..c8782d1 100644
--- a/include/linux/stacktrace.h
+++ b/include/linux/stacktrace.h
@@ -21,6 +21,7 @@ extern void save_stack_trace_tsk(struct task_struct *tsk,
 
 extern void print_stack_trace(struct stack_trace *trace, int spaces);
 
+extern unsigned int save_dup_stack_trace(struct task_struct *tsk);
 #ifdef CONFIG_USER_STACKTRACE_SUPPORT
 extern void save_stack_trace_user(struct stack_trace *trace);
 #else
@@ -32,6 +33,7 @@ extern void save_stack_trace_user(struct stack_trace *trace);
 # define save_stack_trace_tsk(tsk, trace)		do { } while (0)
 # define save_stack_trace_user(trace)			do { } while (0)
 # define print_stack_trace(trace, spaces)		do { } while (0)
+# define save_dup_stack_trace(tsk)			do { } while (0)
 #endif
 
 #endif
diff --git a/kernel/debug/kdb/kdb_bt.c b/kernel/debug/kdb/kdb_bt.c
index 2f62fe8..ff8c6ad 100644
--- a/kernel/debug/kdb/kdb_bt.c
+++ b/kernel/debug/kdb/kdb_bt.c
@@ -26,15 +26,15 @@ static void kdb_show_stack(struct task_struct *p, void *addr)
 	kdb_trap_printk++;
 	kdb_set_current_task(p);
 	if (addr) {
-		show_stack((struct task_struct *)p, addr);
+		show_stack((struct task_struct *)p, addr, 0);
 	} else if (kdb_current_regs) {
 #ifdef CONFIG_X86
-		show_stack(p, &kdb_current_regs->sp);
+		show_stack(p, &kdb_current_regs->sp, 0);
 #else
-		show_stack(p, NULL);
+		show_stack(p, NULL, 0);
 #endif
 	} else {
-		show_stack(p, NULL);
+		show_stack(p, NULL, 0);
 	}
 	console_loglevel = old_lvl;
 	kdb_trap_printk--;
diff --git a/kernel/rtmutex-debug.c b/kernel/rtmutex-debug.c
index 3c7cbc2..e636067 100644
--- a/kernel/rtmutex-debug.c
+++ b/kernel/rtmutex-debug.c
@@ -171,7 +171,7 @@ void debug_rt_mutex_print_deadlock(struct rt_mutex_waiter *waiter)
 
 	printk("\n%s/%d's [blocked] stackdump:\n\n",
 		task->comm, task_pid_nr(task));
-	show_stack(task, NULL);
+	show_stack(task, NULL, 0);
 	printk("\n%s/%d's [current] stackdump:\n\n",
 		current->comm, task_pid_nr(current));
 	dump_stack();
diff --git a/kernel/sched.c b/kernel/sched.c
index 45af137..6129700 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -71,6 +71,7 @@
 #include <linux/ctype.h>
 #include <linux/ftrace.h>
 #include <linux/slab.h>
+#include <linux/stacktrace.h>
 
 #include <asm/tlb.h>
 #include <asm/irq_regs.h>
@@ -5781,10 +5782,11 @@ out_unlock:
 
 static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
-void sched_show_task(struct task_struct *p)
+void _sched_show_task(struct task_struct *p, int dedup)
 {
 	unsigned long free = 0;
 	unsigned state;
+	pid_t dup_stack_pid = 0;
 
 	state = p->state ? __ffs(p->state) + 1 : 0;
 	printk(KERN_INFO "%-15.15s %c", p->comm,
@@ -5807,7 +5809,19 @@ void sched_show_task(struct task_struct *p)
 		task_pid_nr(p), task_pid_nr(p->real_parent),
 		(unsigned long)task_thread_info(p)->flags);
 
-	show_stack(p, NULL);
+	if (dedup)
+		dup_stack_pid = save_dup_stack_trace(p);
+	show_stack(p, NULL, dup_stack_pid);
+}
+
+void sched_show_task(struct task_struct *p)
+{
+	_sched_show_task(p, 0);
+}
+
+void sched_show_task_dedup(struct task_struct *p)
+{
+	_sched_show_task(p, 1);
 }
 
 void show_state_filter(unsigned long state_filter)
@@ -5829,7 +5843,7 @@ void show_state_filter(unsigned long state_filter)
 		 */
 		touch_nmi_watchdog();
 		if (!state_filter || (p->state & state_filter))
-			sched_show_task(p);
+			sched_show_task_dedup(p);
 	} while_each_thread(g, p);
 
 	touch_all_softlockup_watchdogs();
diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index d20c698..e246526 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -41,3 +41,13 @@ save_stack_trace_regs(struct pt_regs *regs, struct stack_trace *trace)
 {
 	WARN_ONCE(1, KERN_INFO "save_stack_trace_regs() not implemented yet.\n");
 }
+
+/*
+ * Architectures that do not implement the task stack dedup will fallback to
+ * the default functionality.
+ */
+__weak unsigned int
+save_dup_stack_trace(struct task_struct *tsk)
+{
+	return 0;
+}
-- 
1.7.3.1


* Re: [PATCH V5] Eliminate task stack trace duplication.
  2011-07-29  1:25 [PATCH V5] Eliminate task stack trace duplication Andrew Bresticker
@ 2011-08-26 23:19 ` Andrew Morton
  2011-08-29  7:08   ` Ying Han
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2011-08-26 23:19 UTC (permalink / raw)
  To: Andrew Bresticker
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm,
	Ying Han

(I'm back!)

On Thu, 28 Jul 2011 18:25:59 -0700
Andrew Bresticker <abrestic@google.com> wrote:

> The problem with a small dmesg ring buffer (e.g. 512k) is that only a
> limited number of task stack traces can be logged. Sometimes we lose
> important information simply because of too many duplicated stack traces.
> This problem occurs when dumping lots of stacks in a single operation,
> such as sysrq-T.
> 
> This patch reduces the duplication of task stack traces in the dump by
> hashing each task's stack. The hash table is a 32k buffer pre-allocated
> at boot. Each task stack is then hashed entry by entry, up to a depth of
> 32 frames. Whenever an identical trace is found in the hash table, we dump
> only the pid of the task whose trace was already printed, so it is easy
> to track back to the full stack via that pid.
> 
> [   58.469730] kworker/0:0     S 0000000000000000     0     4      2 0x00000000
> [   58.469735]  ffff88082fcfde80 0000000000000046 ffff88082e9d8000 ffff88082fcfc010
> [   58.469739]  ffff88082fce9860 0000000000011440 ffff88082fcfdfd8 ffff88082fcfdfd8
> [   58.469743]  0000000000011440 0000000000000000 ffff88082fcee180 ffff88082fce9860
> [   58.469747] Call Trace:
> [   58.469751]  [<ffffffff8108525a>] worker_thread+0x24b/0x250
> [   58.469754]  [<ffffffff8108500f>] ? manage_workers+0x192/0x192
> [   58.469757]  [<ffffffff810885bd>] kthread+0x82/0x8a
> [   58.469760]  [<ffffffff8141aed4>] kernel_thread_helper+0x4/0x10
> [   58.469763]  [<ffffffff8108853b>] ? kthread_worker_fn+0x112/0x112
> [   58.469765]  [<ffffffff8141aed0>] ? gs_change+0xb/0xb
> [   58.469768] kworker/u:0     S 0000000000000004     0     5      2 0x00000000
> [   58.469773]  ffff88082fcffe80 0000000000000046 ffff880800000000 ffff88082fcfe010
> [   58.469777]  ffff88082fcea080 0000000000011440 ffff88082fcfffd8 ffff88082fcfffd8
> [   58.469781]  0000000000011440 0000000000000000 ffff88082fd4e9a0 ffff88082fcea080
> [   58.469785] Call Trace:
> [   58.469786] <Same stack as pid 4>
> [   58.470235] kworker/0:1     S 0000000000000000     0    13      2 0x00000000
> [   58.470255]  ffff88082fd3fe80 0000000000000046 ffff880800000000 ffff88082fd3e010
> [   58.470279]  ffff88082fcee180 0000000000011440 ffff88082fd3ffd8 ffff88082fd3ffd8
> [   58.470301]  0000000000011440 0000000000000000 ffffffff8180b020 ffff88082fcee180
> [   58.470325] Call Trace:
> [   58.470332] <Same stack as pid 4>

The code looks OK(ish) to me.  I'm still concerned that the implementation
will miss lots of de-duplications because it is hashing random crud in
the stack frame.

> Note: Non-x86 architectures will need to be updated since show_stack()
> now takes an additional argument.

Well, we can't break all architectures.

I can't think of a way to make the preprocessor convert show_stack(a,
b) into show_stack(a, b, N) (this can be done in the other direction). 
So all I can think of is to rename x86 show_stack() to something else and do

#define show_stack_something_else(a, b, c) show_stack(a, b)

for other architectures.

But on the other hand, why did the show_stack() interface get changed? 
show_stack() dumps a single task's stack, so top-level callers have no
earthly reason to be passing the dup_stack_pid into show_stack(). 
dup_stack_pid is purely for many-task stackdumps.
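
Concretely, one shape that would keep the existing interface intact
(hypothetical names, not what the patch does):

	/* show_stack() keeps its current two-argument form everywhere ... */
	void show_stack(struct task_struct *task, unsigned long *sp);

	/*
	 * ... and only the many-task path (show_state_filter() and friends)
	 * calls a dedup-aware variant, with a weak default that simply does
	 * a full dump on architectures that don't implement dedup:
	 */
	void __weak show_stack_dedup(struct task_struct *task, unsigned long *sp,
				     pid_t dup_stack_pid)
	{
		show_stack(task, sp);
	}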

Also, the code as-is is pretty much useless for other architectures. 
The core changes in arch/x86/kernel/stacktrace.c look pretty generic -
can we design and place this code so that all architectures can use it?
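
For instance (a sketch only; it glosses over the reliable-entry filtering
the x86 version does), the whole thing could sit in kernel/stacktrace.c on
top of the existing save_stack_trace_tsk() interface:

	unsigned int save_dup_stack_trace(struct task_struct *tsk)
	{
		unsigned long entries[DEDUP_MAX_STACK_DEPTH];
		struct stack_trace trace = {
			.max_entries	= DEDUP_MAX_STACK_DEPTH,
			.entries	= entries,
		};
		unsigned int ret;

		spin_lock(&stack_hash_lock);
		save_stack_trace_tsk(tsk, &trace);
		/* hash the whole (arch-neutral) entry array in one go */
		cur_stack.hash = jhash(entries,
				       trace.nr_entries * sizeof(entries[0]), 0);
		cur_stack.len = trace.nr_entries;
		cur_stack.pid = tsk->pid;
		ret = stack_trace_lookup();
		spin_unlock(&stack_hash_lock);

		return ret;
	}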


> The problem with a small dmesg ring buffer (e.g. 512k) is that only a
> limited number of task stack traces can be logged. Sometimes we lose
> important information simply because of too many duplicated stack traces.
> This problem occurs when dumping lots of stacks in a single operation,
> such as sysrq-T.
> 
> This patch reduces the duplication of task stack traces in the dump by
> hashing each task's stack. The hash table is a 32k buffer pre-allocated
> at boot. Each task stack is then hashed entry by entry, up to a depth of
> 32 frames. Whenever an identical trace is found in the hash table, we dump
> only the pid of the task whose trace was already printed, so it is easy
> to track back to the full stack via that pid.
> 
>
> ...
>
> +/*
> + * The implementation of stack trace dedup. It tries to reduce the duplication
> + * of task stack trace in the dump by hashing the stack trace. The hashtable is
> + * 32k pre-allocated buffer. Then we hash the task stack with stack_depth
> + * DEDUP_MAX_STACK_DEPTH for each stack entry. Each time if an identical trace
> + * is found in the stack, we dump only the pid of previous task. So it is easy
> + * to back track to the full stack with the pid.
> + */
> +#define DEDUP_MAX_STACK_DEPTH 32
> +#define DEDUP_STACK_HASH 32768
> +#define DEDUP_STACK_ENTRIES (DEDUP_STACK_HASH/sizeof(struct task_stack))
> +#define DEDUP_HASH_MAX_ITERATIONS 10

It wouldn't hurt to document DEDUP_HASH_MAX_ITERATIONS (at least).

But then, why does DEDUP_HASH_MAX_ITERATIONS exist? (below)

> +struct task_stack {
> +	pid_t pid;
> +	int len;
> +	unsigned long hash;
> +};
> +
> +static struct task_stack stack_hash_table[DEDUP_STACK_ENTRIES];
> +static struct task_stack cur_stack;
> +static __cacheline_aligned_in_smp DEFINE_SPINLOCK(stack_hash_lock);
> +
> +/*
> + * The stack hashtable uses linear probing to resolve collisions.
> + * We consider two stacks to be the same if their hash values and lengths
> + * are equal.
> + */
> +static unsigned int stack_trace_lookup(void)
> +{
> +	int j;
> +	int index;
> +	unsigned int ret = 0;
> +	struct task_stack *stack;
> +
> +	index = cur_stack.hash % DEDUP_STACK_ENTRIES;
> +
> +	for (j = 0; j < DEDUP_HASH_MAX_ITERATIONS; j++) {
> +		stack = stack_hash_table + (index + j) % DEDUP_STACK_ENTRIES;

(this would be more efficient if DEDUP_STACK_ENTRIES was a power of 2)
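
For example (illustrative only):

	index = cur_stack.hash & (DEDUP_STACK_ENTRIES - 1);

(On 64-bit, sizeof(struct task_stack) is 16, so DEDUP_STACK_ENTRIES already
works out to 2048; it would just need to be enforced at build time.)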

> +		if (stack->hash == 0) {
> +			*stack = cur_stack;
> +			ret = 0;
> +			break;
> +		} else {
> +			if (stack->hash == cur_stack.hash &&
> +			    stack->len == cur_stack.len) {
> +				ret = stack->pid;
> +				break;
> +			}
> +		}
> +	}
> +	if (j == DEDUP_HASH_MAX_ITERATIONS)
> +		stack_hash_table[index] = cur_stack;

Why stop there?  Why not just append to stack_hash_table[]?  When we
first decide to do a multi-task stackdump, zero the index into the
array.  Each time a task is processed, look to see if it is unique and
if so, add its task_stack to the end of the array.

This may require adding a stacktrace_ops.start().  This could be done
while moving stacktrace_ops (which advertises itself as a "Generic
stack tracer"!) out of x86-specific code.

> +	memset(&cur_stack, 0, sizeof(cur_stack));

Sane, but I'm not sure it's necessary.

> +	return ret;
> +}
> +
>
> ...
>

Making this all arch-neutral is quite a bit of work, which you may not
feel like undertaking, ho hum.  Also, the lack of any documentation in
that x86 code makes it unready for prime time.


* Re: [PATCH V5] Eliminate task stack trace duplication.
  2011-08-26 23:19 ` Andrew Morton
@ 2011-08-29  7:08   ` Ying Han
  0 siblings, 0 replies; 3+ messages in thread
From: Ying Han @ 2011-08-29  7:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: KOSAKI Motohiro, Minchan Kim, Daisuke Nishimura, Balbir Singh,
	Tejun Heo, Pavel Emelyanov, KAMEZAWA Hiroyuki, Li Zefan,
	Mel Gorman, Christoph Lameter, Johannes Weiner, Rik van Riel,
	Hugh Dickins, Michal Hocko, Dave Hansen, Zhu Yanhai, linux-mm

On Fri, Aug 26, 2011 at 4:19 PM, Andrew Morton <akpm@linux-foundation.org> wrote:

> (I'm back!)
>

Thank you Andrew for the comments.

Hmm, looks like we still need some changes for this patch to get it merged
into -mm; I might be able to jump into it sometime next week. :)

--Ying

