linux-kernel.vger.kernel.org archive mirror
* [PATCH 1/2] ftrace: print stack usage right before Oops
@ 2014-05-28  6:53 Minchan Kim
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
                   ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-28  6:53 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton
  Cc: linux-mm, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Johannes Weiner, Hugh Dickins, rusty,
	mst, Dave Hansen, Steven Rostedt, Minchan Kim

While I was testing a feature of my own (e.g., something in the reclaim
path), the kernel oopsed easily. I suspected the reason was stack
overflow and wanted to prove it.

I found the stack tracer, which would be very useful for this, but the
kernel oopsed before my user program could gather the information via
"watch cat /sys/kernel/debug/tracing/stack_trace", so I couldn't get
the stack usage of each function.

What I wanted was to emit the kernel stack usage when the kernel oopses.

This patch records the call stack at maximum stack usage into the ftrace
buffer right before the Oops and prints that information via
ftrace_dump_on_oops. At last, I can find the culprit. :)
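As a side note for anyone trying to reproduce the setup, the knobs involved can be armed roughly like this (a sketch assuming the standard procfs sysctl and debugfs tracing mount points):

```shell
# Record the deepest kernel call stack observed (the stack tracer).
echo 1 > /proc/sys/kernel/stack_tracer_enabled

# Dump the ftrace ring buffer to the console when an Oops happens,
# so the recorded max-stack trace survives the crash.
echo 1 > /proc/sys/kernel/ftrace_dump_on_oops

# Poll the current maximum while the workload runs (may miss the
# final state if the machine oopses first, as described above).
watch cat /sys/kernel/debug/tracing/stack_trace
```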

The result is as follows.

[  111.402376] ------------[ cut here ]------------
[  111.403077] kernel BUG at kernel/trace/trace_stack.c:177!
[  111.403831] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[  111.404635] Dumping ftrace buffer:
[  111.404781] ---------------------------------
[  111.404781]    <...>-15987   5d..2 111689526us : stack_trace_call:         Depth    Size   Location    (49 entries)
[  111.404781]         -----    ----   --------
[  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   0)     7216      24   __change_page_attr_set_clr+0xe0/0xb50
[  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   1)     7192     392   kernel_map_pages+0x6c/0x120
[  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   2)     6800     256   get_page_from_freelist+0x489/0x920
[  111.404781]    <...>-15987   5d..2 111689536us : stack_trace_call:   3)     6544     352   __alloc_pages_nodemask+0x5e1/0xb20
[  111.404781]    <...>-15987   5d..2 111689536us : stack_trace_call:   4)     6192       8   alloc_pages_current+0x10f/0x1f0
[  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   5)     6184     168   new_slab+0x2c5/0x370
[  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   6)     6016       8   __slab_alloc+0x3a9/0x501
[  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   7)     6008      80   __kmalloc+0x1cb/0x200
[  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:   8)     5928     376   vring_add_indirect+0x36/0x200
[  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:   9)     5552     144   virtqueue_add_sgs+0x2e2/0x320
[  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:  10)     5408     288   __virtblk_add_req+0xda/0x1b0
[  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:  11)     5120      96   virtio_queue_rq+0xd3/0x1d0
[  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  12)     5024     128   __blk_mq_run_hw_queue+0x1ef/0x440
[  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  13)     4896      16   blk_mq_run_hw_queue+0x35/0x40
[  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  14)     4880      96   blk_mq_insert_requests+0xdb/0x160
[  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  15)     4784     112   blk_mq_flush_plug_list+0x12b/0x140
[  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  16)     4672     112   blk_flush_plug_list+0xc7/0x220
[  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  17)     4560      64   io_schedule_timeout+0x88/0x100
[  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  18)     4496     128   mempool_alloc+0x145/0x170
[  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  19)     4368      96   bio_alloc_bioset+0x10b/0x1d0
[  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  20)     4272      48   get_swap_bio+0x30/0x90
[  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  21)     4224     160   __swap_writepage+0x150/0x230
[  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  22)     4064      32   swap_writepage+0x42/0x90
[  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  23)     4032     320   shrink_page_list+0x676/0xa80
[  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  24)     3712     208   shrink_inactive_list+0x262/0x4e0
[  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  25)     3504     304   shrink_lruvec+0x3e1/0x6a0
[  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  26)     3200      80   shrink_zone+0x3f/0x110
[  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  27)     3120     128   do_try_to_free_pages+0x156/0x4c0
[  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  28)     2992     208   try_to_free_pages+0xf7/0x1e0
[  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  29)     2784     352   __alloc_pages_nodemask+0x783/0xb20
[  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  30)     2432       8   alloc_pages_current+0x10f/0x1f0
[  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  31)     2424     168   new_slab+0x2c5/0x370
[  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  32)     2256       8   __slab_alloc+0x3a9/0x501
[  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  33)     2248      80   kmem_cache_alloc+0x1ac/0x1c0
[  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  34)     2168     296   mempool_alloc_slab+0x15/0x20
[  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  35)     1872     128   mempool_alloc+0x5e/0x170
[  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  36)     1744      96   bio_alloc_bioset+0x10b/0x1d0
[  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  37)     1648      48   mpage_alloc+0x38/0xa0
[  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  38)     1600     208   do_mpage_readpage+0x49b/0x5d0
[  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  39)     1392     224   mpage_readpages+0xcf/0x120
[  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  40)     1168      48   ext4_readpages+0x45/0x60
[  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  41)     1120     224   __do_page_cache_readahead+0x222/0x2d0
[  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  42)      896      16   ra_submit+0x21/0x30
[  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  43)      880     112   filemap_fault+0x2d7/0x4f0
[  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  44)      768     144   __do_fault+0x6d/0x4c0
[  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  45)      624     160   handle_mm_fault+0x1a6/0xaf0
[  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  46)      464     272   __do_page_fault+0x18a/0x590
[  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  47)      192      16   do_page_fault+0xc/0x10
[  111.404781]    <...>-15987   5d..2 111689551us : stack_trace_call:  48)      176     176   page_fault+0x22/0x30
[  111.404781] ---------------------------------
[  111.404781] Modules linked in:
[  111.404781] CPU: 5 PID: 15987 Comm: cc1 Not tainted 3.14.0+ #162
[  111.404781] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  111.404781] task: ffff880008a4a0e0 ti: ffff88000002c000 task.ti: ffff88000002c000
[  111.404781] RIP: 0010:[<ffffffff8112340f>]  [<ffffffff8112340f>] stack_trace_call+0x37f/0x390
[  111.404781] RSP: 0000:ffff88000002c2b0  EFLAGS: 00010092
[  111.404781] RAX: ffff88000002c000 RBX: 0000000000000005 RCX: 0000000000000002
[  111.404781] RDX: 0000000000000006 RSI: 0000000000000002 RDI: ffff88002780be00
[  111.404781] RBP: ffff88000002c310 R08: 00000000000009e8 R09: ffffffffffffffff
[  111.404781] R10: ffff88000002dfd8 R11: 0000000000000001 R12: 000000000000f2e8
[  111.404781] R13: 0000000000000005 R14: ffffffff82768dfc R15: 00000000000000f8
[  111.404781] FS:  00002ae66a6e4640(0000) GS:ffff880027ca0000(0000) knlGS:0000000000000000
[  111.404781] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  111.404781] CR2: 00002ba016c8e004 CR3: 00000000045b7000 CR4: 00000000000006e0
[  111.404781] Stack:
[  111.404781]  0000000000000005 ffffffff81042410 0000000000000087 0000000000001c30
[  111.404781]  ffff88000002c000 00002ae66a6f3000 ffffffffffffe000 0000000000000002
[  111.404781]  ffff88000002c510 ffff880000d04000 ffff88000002c4b8 0000000000000002
[  111.404781] Call Trace:
[  111.404781]  [<ffffffff81042410>] ? __change_page_attr_set_clr+0xe0/0xb50
[  111.404781]  [<ffffffff816efdff>] ftrace_call+0x5/0x2f
[  111.404781]  [<ffffffff81004ba7>] ? dump_trace+0x177/0x2b0
[  111.404781]  [<ffffffff81041a65>] ? _lookup_address_cpa.isra.3+0x5/0x40
[  111.404781]  [<ffffffff81041a65>] ? _lookup_address_cpa.isra.3+0x5/0x40
[  111.404781]  [<ffffffff81042410>] ? __change_page_attr_set_clr+0xe0/0xb50
[  111.404781]  [<ffffffff811231a9>] ? stack_trace_call+0x119/0x390
[  111.404781]  [<ffffffff81043eac>] ? kernel_map_pages+0x6c/0x120
[  111.404781]  [<ffffffff810a22dd>] ? trace_hardirqs_off+0xd/0x10
[  111.404781]  [<ffffffff81150131>] ? get_page_from_freelist+0x3d1/0x920
[  111.404781]  [<ffffffff81043eac>] kernel_map_pages+0x6c/0x120
[  111.404781]  [<ffffffff811501e9>] get_page_from_freelist+0x489/0x920
[  111.404781]  [<ffffffff81150c61>] __alloc_pages_nodemask+0x5e1/0xb20
[  111.404781]  [<ffffffff8119188f>] alloc_pages_current+0x10f/0x1f0
[  111.404781]  [<ffffffff8119ac35>] ? new_slab+0x2c5/0x370
[  111.404781]  [<ffffffff8119ac35>] new_slab+0x2c5/0x370
[  111.404781]  [<ffffffff816dbfc9>] __slab_alloc+0x3a9/0x501
[  111.404781]  [<ffffffff8119beeb>] ? __kmalloc+0x1cb/0x200
[  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
[  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
[  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
[  111.404781]  [<ffffffff8119beeb>] __kmalloc+0x1cb/0x200
[  111.404781]  [<ffffffff8141ed70>] ? vring_add_indirect+0x200/0x200
[  111.404781]  [<ffffffff8141eba6>] vring_add_indirect+0x36/0x200
[  111.404781]  [<ffffffff8141f362>] virtqueue_add_sgs+0x2e2/0x320
[  111.404781]  [<ffffffff8148f2ba>] __virtblk_add_req+0xda/0x1b0
[  111.404781]  [<ffffffff813780c5>] ? __delay+0x5/0x20
[  111.404781]  [<ffffffff8148f463>] virtio_queue_rq+0xd3/0x1d0
[  111.404781]  [<ffffffff8134b96f>] __blk_mq_run_hw_queue+0x1ef/0x440
[  111.404781]  [<ffffffff8134c035>] blk_mq_run_hw_queue+0x35/0x40
[  111.404781]  [<ffffffff8134c71b>] blk_mq_insert_requests+0xdb/0x160
[  111.404781]  [<ffffffff8134cdbb>] blk_mq_flush_plug_list+0x12b/0x140
[  111.404781]  [<ffffffff810c5ab5>] ? ktime_get_ts+0x125/0x150
[  111.404781]  [<ffffffff81343197>] blk_flush_plug_list+0xc7/0x220
[  111.404781]  [<ffffffff816e70bf>] ? _raw_spin_unlock_irqrestore+0x3f/0x70
[  111.404781]  [<ffffffff816e26b8>] io_schedule_timeout+0x88/0x100
[  111.404781]  [<ffffffff816e2635>] ? io_schedule_timeout+0x5/0x100
[  111.404781]  [<ffffffff81149465>] mempool_alloc+0x145/0x170
[  111.404781]  [<ffffffff8109baf0>] ? __init_waitqueue_head+0x60/0x60
[  111.404781]  [<ffffffff811e33cb>] bio_alloc_bioset+0x10b/0x1d0
[  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
[  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
[  111.404781]  [<ffffffff81184160>] get_swap_bio+0x30/0x90
[  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
[  111.404781]  [<ffffffff811846b0>] __swap_writepage+0x150/0x230
[  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
[  111.404781]  [<ffffffff81184565>] ? __swap_writepage+0x5/0x230
[  111.404781]  [<ffffffff811847d2>] swap_writepage+0x42/0x90
[  111.404781]  [<ffffffff8115aee6>] shrink_page_list+0x676/0xa80
[  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
[  111.404781]  [<ffffffff8115b8c2>] shrink_inactive_list+0x262/0x4e0
[  111.404781]  [<ffffffff8115c211>] shrink_lruvec+0x3e1/0x6a0
[  111.404781]  [<ffffffff8115c50f>] shrink_zone+0x3f/0x110
[  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
[  111.404781]  [<ffffffff8115ca36>] do_try_to_free_pages+0x156/0x4c0
[  111.404781]  [<ffffffff8115cf97>] try_to_free_pages+0xf7/0x1e0
[  111.404781]  [<ffffffff81150e03>] __alloc_pages_nodemask+0x783/0xb20
[  111.404781]  [<ffffffff8119188f>] alloc_pages_current+0x10f/0x1f0
[  111.404781]  [<ffffffff8119ac35>] ? new_slab+0x2c5/0x370
[  111.404781]  [<ffffffff8119ac35>] new_slab+0x2c5/0x370
[  111.404781]  [<ffffffff816dbfc9>] __slab_alloc+0x3a9/0x501
[  111.404781]  [<ffffffff8119d95c>] ? kmem_cache_alloc+0x1ac/0x1c0
[  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
[  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
[  111.404781]  [<ffffffff8119d95c>] kmem_cache_alloc+0x1ac/0x1c0
[  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
[  111.404781]  [<ffffffff81149025>] mempool_alloc_slab+0x15/0x20
[  111.404781]  [<ffffffff8114937e>] mempool_alloc+0x5e/0x170
[  111.404781]  [<ffffffff811e33cb>] bio_alloc_bioset+0x10b/0x1d0
[  111.404781]  [<ffffffff811ea618>] mpage_alloc+0x38/0xa0
[  111.404781]  [<ffffffff811eb2eb>] do_mpage_readpage+0x49b/0x5d0
[  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
[  111.404781]  [<ffffffff811eb55f>] mpage_readpages+0xcf/0x120
[  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
[  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
[  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
[  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
[  111.404781]  [<ffffffff81153e21>] ? __do_page_cache_readahead+0xc1/0x2d0
[  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
[  111.404781]  [<ffffffff8124d045>] ext4_readpages+0x45/0x60
[  111.404781]  [<ffffffff81153f82>] __do_page_cache_readahead+0x222/0x2d0
[  111.404781]  [<ffffffff81153e21>] ? __do_page_cache_readahead+0xc1/0x2d0
[  111.404781]  [<ffffffff811541c1>] ra_submit+0x21/0x30
[  111.404781]  [<ffffffff811482f7>] filemap_fault+0x2d7/0x4f0
[  111.404781]  [<ffffffff8116f3ad>] __do_fault+0x6d/0x4c0
[  111.404781]  [<ffffffff81172596>] handle_mm_fault+0x1a6/0xaf0
[  111.404781]  [<ffffffff816eb1aa>] __do_page_fault+0x18a/0x590
[  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
[  111.404781]  [<ffffffff81081e9c>] ? finish_task_switch+0x7c/0x120
[  111.404781]  [<ffffffff81081e5f>] ? finish_task_switch+0x3f/0x120
[  111.404781]  [<ffffffff816eb5bc>] do_page_fault+0xc/0x10
[  111.404781]  [<ffffffff816e7a52>] page_fault+0x22/0x30

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 kernel/trace/trace_stack.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
index 5aa9a5b9b6e2..5eb88e60bc5e 100644
--- a/kernel/trace/trace_stack.c
+++ b/kernel/trace/trace_stack.c
@@ -51,6 +51,30 @@ static DEFINE_MUTEX(stack_sysctl_mutex);
 int stack_tracer_enabled;
 static int last_stack_tracer_enabled;
 
+static inline void print_max_stack(void)
+{
+	long i;
+	int size;
+
+	trace_printk("        Depth    Size   Location"
+			   "    (%d entries)\n"
+			   "        -----    ----   --------\n",
+			   max_stack_trace.nr_entries - 1);
+
+	for (i = 0; i < max_stack_trace.nr_entries; i++) {
+		if (stack_dump_trace[i] == ULONG_MAX)
+			break;
+		if (i+1 == max_stack_trace.nr_entries ||
+				stack_dump_trace[i+1] == ULONG_MAX)
+			size = stack_dump_index[i];
+		else
+			size = stack_dump_index[i] - stack_dump_index[i+1];
+
+		trace_printk("%3ld) %8d   %5d   %pS\n", i, stack_dump_index[i],
+				size, (void *)stack_dump_trace[i]);
+	}
+}
+
 static inline void
 check_stack(unsigned long ip, unsigned long *stack)
 {
@@ -149,8 +173,12 @@ check_stack(unsigned long ip, unsigned long *stack)
 			i++;
 	}
 
-	BUG_ON(current != &init_task &&
-		*(end_of_stack(current)) != STACK_END_MAGIC);
+	if ((current != &init_task &&
+		*(end_of_stack(current)) != STACK_END_MAGIC)) {
+		print_max_stack();
+		BUG();
+	}
+
  out:
 	arch_spin_unlock(&max_stack_lock);
 	local_irq_restore(flags);
-- 
1.9.2
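As a sanity check on print_max_stack()'s arithmetic above — each frame's size is the difference between its cumulative depth and the next frame's, and the final entry keeps its whole depth — the same computation can be sketched in shell using the first few Depth values from the dump (a toy illustration, not kernel code; the truncated list makes the last entry's size equal its depth):

```shell
# Cumulative depths from the dump above, deepest frame first.
depths='7216 7192 6800 6544'

echo "$depths" | tr ' ' '\n' | awk '
    { d[NR] = $1 }
    END {
        for (i = 1; i <= NR; i++) {
            # Last frame: size is its whole depth; otherwise the delta
            # to the next (shallower) frame, as in print_max_stack().
            size = (i == NR) ? d[i] : d[i] - d[i + 1]
            printf "%2d) %6d %6d\n", i - 1, d[i], size
        }
    }'
```

The first three computed sizes (24, 392, 256) match the Size column of the dump.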



* [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  6:53 [PATCH 1/2] ftrace: print stack usage right before Oops Minchan Kim
@ 2014-05-28  6:53 ` Minchan Kim
  2014-05-28  8:37   ` Dave Chinner
                     ` (5 more replies)
  2014-05-28 16:18 ` [PATCH 1/2] ftrace: print stack usage right before Oops Steven Rostedt
  2014-05-29  3:01 ` Steven Rostedt
  2 siblings, 6 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-28  6:53 UTC (permalink / raw)
  To: linux-kernel, Andrew Morton
  Cc: linux-mm, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Johannes Weiner, Hugh Dickins, rusty,
	mst, Dave Hansen, Steven Rostedt, Minchan Kim

While I was testing in-house patches under heavy memory pressure on
qemu-kvm, the 3.14 kernel crashed randomly. The reason was kernel stack
overflow.

When I investigated the problem, the call stack was a little deeper
because reclaim functions were involved, though not via the direct
reclaim path.

I tried to trim the stack usage of some functions related to
alloc/reclaim and saved a few hundred bytes, but the overflow didn't
disappear; I just hit another overflow through a different, deeper call
stack on the reclaim/allocator path.

Of course, we could sweep every site we have found to reduce stack
usage, but I'm not sure how long that would save the world (surely,
lots of developers will keep adding nice features that use the stack
again), and if we consider more complex features in the I/O layer
and/or the reclaim path, it might be better to increase the stack size.
(Meanwhile, stack usage on 64-bit machines has doubled compared to
32-bit while the stack size has stayed at 8K. Hmm, that doesn't seem
fair to me, and arm64 has already expanded to 16K.)

So, my stupid idea is just to expand the stack size and keep an eye on
the stack consumption of each kernel function via the stack tracer of
ftrace. For example, we could set a bar such that each function
shouldn't exceed, say, 200 bytes, and emit a warning when some function
consumes more at runtime. Of course, this could produce false
positives, but at least it would give us a chance to think it over.

I guess this topic has been discussed several times, so there might be
a strong reason not to increase the kernel stack size on x86_64 that
I'm not aware of; hence I'm Ccing the x86_64 maintainers, other MM
folks, and the virtio maintainers.

[ 1065.604404] kworker/-5766    0d..2 1071625990us : stack_trace_call:         Depth    Size   Location    (51 entries)
[ 1065.604404]         -----    ----   --------
[ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
[ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
[ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
[ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
[ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
[ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
[ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
[ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
[ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
[ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
[ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
[ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_requests+0xdb/0x160
[ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
[ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  18)     5120     112   blk_flush_plug_list+0xc7/0x220
[ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  19)     5008      64   io_schedule_timeout+0x88/0x100
[ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  20)     4944     128   mempool_alloc+0x145/0x170
[ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
[ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  22)     4720      48   get_swap_bio+0x30/0x90
[ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
[ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
[ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676/0xa80
[ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  26)     4160     208   shrink_inactive_list+0x262/0x4e0
[ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  27)     3952     304   shrink_lruvec+0x3e1/0x6a0
[ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  28)     3648      80   shrink_zone+0x3f/0x110
[ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  29)     3568     128   do_try_to_free_pages+0x156/0x4c0
[ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  30)     3440     208   try_to_free_pages+0xf7/0x1e0
[ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
[ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  32)     2880       8   alloc_pages_current+0x10f/0x1f0
[ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  33)     2872     200   __page_cache_alloc+0x13f/0x160
[ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  34)     2672      80   find_or_create_page+0x4c/0xb0
[ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  35)     2592      80   ext4_mb_load_buddy+0x1e9/0x370
[ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  36)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
[ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  37)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
[ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  38)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
[ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  39)     1952     160   ext4_map_blocks+0x325/0x530
[ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  40)     1792     384   ext4_writepages+0x6d1/0xce0
[ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  41)     1408      16   do_writepages+0x23/0x40
[ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  42)     1392      96   __writeback_single_inode+0x45/0x2e0
[ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  43)     1296     176   writeback_sb_inodes+0x2ad/0x500
[ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  44)     1120      80   __writeback_inodes_wb+0x9e/0xd0
[ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  45)     1040     160   wb_writeback+0x29b/0x350
[ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  46)      880     208   bdi_writeback_workfn+0x11c/0x480
[ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  47)      672     144   process_one_work+0x1d2/0x570
[ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  48)      528     112   worker_thread+0x116/0x370
[ 1065.604404] kworker/-5766    0d..2 1071626001us : stack_trace_call:  49)      416     240   kthread+0xf3/0x110
[ 1065.604404] kworker/-5766    0d..2 1071626001us : stack_trace_call:  50)      176     176   ret_from_fork+0x7c/0xb0

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 arch/x86/include/asm/page_64_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 8de6d9cf3b95..678205195ae1 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -1,7 +1,7 @@
 #ifndef _ASM_X86_PAGE_64_DEFS_H
 #define _ASM_X86_PAGE_64_DEFS_H
 
-#define THREAD_SIZE_ORDER	1
+#define THREAD_SIZE_ORDER	2
 #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
 #define CURRENT_MASK (~(THREAD_SIZE - 1))
 
-- 
1.9.2
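The arithmetic behind the one-line change: THREAD_SIZE is PAGE_SIZE << THREAD_SIZE_ORDER, so with the usual 4096-byte x86_64 base page size, bumping the order from 1 to 2 doubles the stack:

```shell
PAGE_SIZE=4096  # the standard x86_64 base page size

# THREAD_SIZE = PAGE_SIZE << THREAD_SIZE_ORDER
echo "order 1: $(( PAGE_SIZE << 1 )) bytes"  # 8192  -> the old 8K stack
echo "order 2: $(( PAGE_SIZE << 2 )) bytes"  # 16384 -> the new 16K stack
```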



* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
@ 2014-05-28  8:37   ` Dave Chinner
  2014-05-28  9:13     ` Dave Chinner
  2014-05-28  9:04   ` Michael S. Tsirkin
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-28  8:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, mst, Dave Hansen,
	Steven Rostedt, xfs

[ cc XFS list ]

On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> While I was testing in-house patches under heavy memory pressure on
> qemu-kvm, the 3.14 kernel crashed randomly. The reason was kernel stack
> overflow.
> 
> When I investigated the problem, the call stack was a little deeper
> because reclaim functions were involved, though not via the direct
> reclaim path.
> 
> I tried to trim the stack usage of some functions related to
> alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> disappear; I just hit another overflow through a different, deeper call
> stack on the reclaim/allocator path.
>
> Of course, we could sweep every site we have found to reduce stack
> usage, but I'm not sure how long that would save the world (surely,
> lots of developers will keep adding nice features that use the stack
> again), and if we consider more complex features in the I/O layer
> and/or the reclaim path, it might be better to increase the stack size.
> (Meanwhile, stack usage on 64-bit machines has doubled compared to
> 32-bit while the stack size has stayed at 8K. Hmm, that doesn't seem
> fair to me, and arm64 has already expanded to 16K.)
>
> So, my stupid idea is just to expand the stack size and keep an eye on
> the stack consumption of each kernel function via the stack tracer of
> ftrace. For example, we could set a bar such that each function
> shouldn't exceed, say, 200 bytes, and emit a warning when some function
> consumes more at runtime. Of course, this could produce false
> positives, but at least it would give us a chance to think it over.
>
> I guess this topic has been discussed several times, so there might be
> a strong reason not to increase the kernel stack size on x86_64 that
> I'm not aware of; hence I'm Ccing the x86_64 maintainers, other MM
> folks, and the virtio maintainers.
>
> [ 1065.604404] kworker/-5766    0d..2 1071625990us : stack_trace_call:         Depth    Size   Location    (51 entries)
> [ 1065.604404]         -----    ----   --------
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  18)     5120     112   blk_flush_plug_list+0xc7/0x220
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  19)     5008      64   io_schedule_timeout+0x88/0x100
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  20)     4944     128   mempool_alloc+0x145/0x170
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  22)     4720      48   get_swap_bio+0x30/0x90
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676/0xa80
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  26)     4160     208   shrink_inactive_list+0x262/0x4e0
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  28)     3648      80   shrink_zone+0x3f/0x110
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  30)     3440     208   try_to_free_pages+0xf7/0x1e0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  32)     2880       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  33)     2872     200   __page_cache_alloc+0x13f/0x160
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  34)     2672      80   find_or_create_page+0x4c/0xb0
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  35)     2592      80   ext4_mb_load_buddy+0x1e9/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  36)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  37)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  38)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  39)     1952     160   ext4_map_blocks+0x325/0x530
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  40)     1792     384   ext4_writepages+0x6d1/0xce0
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  41)     1408      16   do_writepages+0x23/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  42)     1392      96   __writeback_single_inode+0x45/0x2e0
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  43)     1296     176   writeback_sb_inodes+0x2ad/0x500
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  44)     1120      80   __writeback_inodes_wb+0x9e/0xd0
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  45)     1040     160   wb_writeback+0x29b/0x350
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  46)      880     208   bdi_writeback_workfn+0x11c/0x480
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  47)      672     144   process_one_work+0x1d2/0x570
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  48)      528     112   worker_thread+0x116/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071626001us : stack_trace_call:  49)      416     240   kthread+0xf3/0x110
> [ 1065.604404] kworker/-5766    0d..2 1071626001us : stack_trace_call:  50)      176     176   ret_from_fork+0x7c/0xb0
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/include/asm/page_64_types.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> index 8de6d9cf3b95..678205195ae1 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -1,7 +1,7 @@
>  #ifndef _ASM_X86_PAGE_64_DEFS_H
>  #define _ASM_X86_PAGE_64_DEFS_H
>  
> -#define THREAD_SIZE_ORDER	1
> +#define THREAD_SIZE_ORDER	2
>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>  #define CURRENT_MASK (~(THREAD_SIZE - 1))

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
  2014-05-28  8:37   ` Dave Chinner
@ 2014-05-28  9:04   ` Michael S. Tsirkin
  2014-05-29  1:09     ` Minchan Kim
  2014-05-29  4:10     ` virtio_ring stack usage Rusty Russell
  2014-05-28  9:27   ` [RFC 2/2] x86_64: expand kernel stack to 16K Borislav Petkov
                     ` (3 subsequent siblings)
  5 siblings, 2 replies; 107+ messages in thread
From: Michael S. Tsirkin @ 2014-05-28  9:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, Dave Hansen,
	Steven Rostedt

On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> While I was testing in-house patches under heavy memory pressure on
> qemu-kvm, the 3.14 kernel crashed randomly. The cause was a kernel
> stack overflow.
> 
> When I investigated the problem, the callstack was a bit deeper than
> usual because it involved reclaim functions, though not the direct
> reclaim path.
> 
> I tried to trim the stack usage of some functions related to
> alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> disappear: I just hit it again via another, deeper callstack on the
> reclaim/allocator path.
> 
> Of course, we could sweep every site we find to reduce stack usage,
> but I'm not sure how long that would save the world (surely, lots of
> developers will keep adding nice features that use stack again), and
> if we consider more complex features in the I/O layer and/or reclaim
> path, it might be better to increase the stack size. (Meanwhile,
> stack usage on 64-bit machines roughly doubled compared to 32-bit
> while THREAD_SIZE stuck at 8K. That doesn't seem fair to me, and
> arm64 has already expanded to 16K.)
> 
> So, my simple idea is: let's expand the stack size and keep an eye on
> the stack consumption of each kernel function via ftrace's stack
> tracer. For example, we could set a bar such that no function should
> exceed, say, 200 bytes, and emit a warning at runtime when a function
> consumes more. Of course, that could produce false positives, but at
> least it would give us a chance to think it over.
> 
> I guess this topic has been discussed several times, so there may be
> a strong reason not to increase the kernel stack size on x86_64 that
> I don't know about; hence I'm Ccing the x86_64 maintainers, other MM
> folks, and the virtio maintainers.
> 
> [ 1065.604404] kworker/-5766    0d..2 1071625990us : stack_trace_call:         Depth    Size   Location    (51 entries)
> [ 1065.604404]         -----    ----   --------
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0

Virtio stack usage seems very high.
Here is virtio_ring.su, generated using the -fstack-usage flag with gcc 4.8.2.

virtio_ring.c:107:35:sg_next_arr        16      static


<--- this is a surprise, I really expected it to be inlined
     same for sg_next_chained.
<--- Rusty: should we force compiler to inline it?


virtio_ring.c:584:6:virtqueue_disable_cb        16      static
virtio_ring.c:604:10:virtqueue_enable_cb_prepare        16      static
virtio_ring.c:632:6:virtqueue_poll      16      static
virtio_ring.c:652:6:virtqueue_enable_cb 16      static
virtio_ring.c:845:14:virtqueue_get_vring_size   16      static
virtio_ring.c:854:6:virtqueue_is_broken 16      static
virtio_ring.c:101:35:sg_next_chained    16      static
virtio_ring.c:436:6:virtqueue_notify    24      static
virtio_ring.c:672:6:virtqueue_enable_cb_delayed 16      static
virtio_ring.c:820:6:vring_transport_features    16      static
virtio_ring.c:472:13:detach_buf 40      static
virtio_ring.c:518:7:virtqueue_get_buf   32      static
virtio_ring.c:812:6:vring_del_virtqueue 16      static
virtio_ring.c:394:6:virtqueue_kick_prepare      16      static
virtio_ring.c:464:6:virtqueue_kick      32      static
virtio_ring.c:186:19:4  16      static
virtio_ring.c:733:13:vring_interrupt    24      static
virtio_ring.c:707:7:virtqueue_detach_unused_buf 32      static
virtio_config.h:84:20:7 16      static
virtio_ring.c:753:19:vring_new_virtqueue        80      static  
virtio_ring.c:374:5:virtqueue_add_inbuf 56      static
virtio_ring.c:352:5:virtqueue_add_outbuf        56      static
virtio_ring.c:314:5:virtqueue_add_sgs   112     static  


As you can see, vring_add_indirect was inlined into virtqueue_add_sgs by my gcc.
Taken together, they add up to only 112 bytes, not the ~1/2K they do for you.
Which compiler version and flags did you use?


> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  18)     5120     112   blk_flush_plug_list+0xc7/0x220
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  19)     5008      64   io_schedule_timeout+0x88/0x100
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  20)     4944     128   mempool_alloc+0x145/0x170
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  22)     4720      48   get_swap_bio+0x30/0x90
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676/0xa80
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  26)     4160     208   shrink_inactive_list+0x262/0x4e0
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  28)     3648      80   shrink_zone+0x3f/0x110
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  30)     3440     208   try_to_free_pages+0xf7/0x1e0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  32)     2880       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  33)     2872     200   __page_cache_alloc+0x13f/0x160
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  34)     2672      80   find_or_create_page+0x4c/0xb0
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  35)     2592      80   ext4_mb_load_buddy+0x1e9/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  36)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  37)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  38)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  39)     1952     160   ext4_map_blocks+0x325/0x530
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  40)     1792     384   ext4_writepages+0x6d1/0xce0
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  41)     1408      16   do_writepages+0x23/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  42)     1392      96   __writeback_single_inode+0x45/0x2e0
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  43)     1296     176   writeback_sb_inodes+0x2ad/0x500
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  44)     1120      80   __writeback_inodes_wb+0x9e/0xd0
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  45)     1040     160   wb_writeback+0x29b/0x350
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  46)      880     208   bdi_writeback_workfn+0x11c/0x480
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  47)      672     144   process_one_work+0x1d2/0x570
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  48)      528     112   worker_thread+0x116/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071626001us : stack_trace_call:  49)      416     240   kthread+0xf3/0x110
> [ 1065.604404] kworker/-5766    0d..2 1071626001us : stack_trace_call:  50)      176     176   ret_from_fork+0x7c/0xb0
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/include/asm/page_64_types.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> index 8de6d9cf3b95..678205195ae1 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -1,7 +1,7 @@
>  #ifndef _ASM_X86_PAGE_64_DEFS_H
>  #define _ASM_X86_PAGE_64_DEFS_H
>  
> -#define THREAD_SIZE_ORDER	1
> +#define THREAD_SIZE_ORDER	2
>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>  #define CURRENT_MASK (~(THREAD_SIZE - 1))
>  
> -- 
> 1.9.2

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  8:37   ` Dave Chinner
@ 2014-05-28  9:13     ` Dave Chinner
  2014-05-28 16:06       ` Johannes Weiner
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-28  9:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, mst, Dave Hansen,
	Steven Rostedt, xfs

On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> [ cc XFS list ]

[and now there is a complete copy on the XFs list, I'll add my 2c]

> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > While I was testing in-house patches under heavy memory pressure on
> > qemu-kvm, the 3.14 kernel crashed randomly. The cause was a kernel
> > stack overflow.
> > 
> > When I investigated the problem, the callstack was a bit deeper than
> > usual because it involved reclaim functions, though not the direct
> > reclaim path.
> > 
> > I tried to trim the stack usage of some functions related to
> > alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> > disappear: I just hit it again via another, deeper callstack on the
> > reclaim/allocator path.

That's a no-win situation. The stack overruns through ->writepage
we've been seeing with XFS over the past *4 years* are much larger
than a few bytes. The worst case we saw on a virtio block device
was about 10.5KB of stack usage.

And, like this one, it came from the flusher thread as well. The
difference was that the allocation that triggered the reclaim path
you've reported occurred when 5k of the stack had already been
used...

> > Of course, we could sweep every site we find to reduce stack usage,
> > but I'm not sure how long that would save the world (surely, lots of
> > developers will keep adding nice features that use stack again), and
> > if we consider more complex features in the I/O layer and/or reclaim
> > path, it might be better to increase the stack size. (Meanwhile,
> > stack usage on 64-bit machines roughly doubled compared to 32-bit
> > while THREAD_SIZE stuck at 8K. That doesn't seem fair to me, and
> > arm64 has already expanded to 16K.)

Yup, that's all been pointed out previously. 8k stacks were never
large enough to fit the Linux IO architecture on x86-64, but nobody
outside filesystem and IO developers has been willing to accept that
argument as valid, despite regular stack overruns and filesystems
having to add workaround after workaround to prevent them.

That's why stuff like this appears in various filesystem's
->writepage:

        /*
         * Refuse to write the page out if we are called from reclaim context.
         *
         * This avoids stack overflows when called from deeply used stacks in
         * random callers for direct reclaim or memcg reclaim.  We explicitly
         * allow reclaim from kswapd as the stack usage there is relatively low.
         *
         * This should never happen except in the case of a VM regression so
         * warn about it.
         */
        if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
                        PF_MEMALLOC))
                goto redirty;

That still doesn't guarantee us enough stack space to do writeback,
though, because memory allocation can occur when reading in metadata
needed to do delayed allocation, and so we could trigger GFP_NOFS
memory allocation from the flusher thread with 4-5k of stack already
consumed, so that would still overrun the stack.

So, a couple of years ago we started defering half the writeback
stack usage to a worker thread (commit c999a22 "xfs: introduce an
allocation workqueue"), under the assumption that the worst stack
usage when we call memory allocation is around 3-3.5k of stack used.
We thought that would be safe, but the stack trace you've posted
shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
which means we're still screwed despite all the workarounds we have
in place.

We've also had recent reports of allocation from direct IO blowing
the stack, as well as block allocation adding an entry to a
directory.  We're basically at the point where we have to push every
XFS operation that requires block allocation off to another thread
to get enough stack space for normal operation.....

> > So, my simple idea is: let's expand the stack size and keep an eye

Not stupid: it's been what I've been advocating we need to do for
the past 3-4 years. XFS has always been the stack usage canary and
this issue is basically a repeat of the 4k stack on i386 kernel
debacle.

> > on the stack consumption of each kernel function via ftrace's stack
> > tracer. For example, we could set a bar such that no function should
> > exceed, say, 200 bytes, and emit a warning at runtime when a function
> > consumes more. Of course, that could produce false positives, but at
> > least it would give us a chance to think it over.

I don't think that's a good idea. There are reasons for putting a
150-200 byte structure on the stack (e.g. it is used in a context
where allocation cannot be guaranteed to succeed because forward
progress cannot be guaranteed). Hence having these users warn all
the time will quickly get very annoying, and that functionality will
be switched off or removed....

> > I guess this topic has been discussed several times, so there may be
> > a strong reason not to increase the kernel stack size on x86_64 that
> > I don't know about; hence I'm Ccing the x86_64 maintainers, other MM
> > folks, and the virtio maintainers.
> >
> >          Depth    Size   Location    (51 entries)
> > 
> >    0)     7696      16   lookup_address+0x28/0x30
> >    1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> >    2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> >    3)     7640     392   kernel_map_pages+0x6c/0x120
> >    4)     7248     256   get_page_from_freelist+0x489/0x920
> >    5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> >    6)     6640       8   alloc_pages_current+0x10f/0x1f0
> >    7)     6632     168   new_slab+0x2c5/0x370
> >    8)     6464       8   __slab_alloc+0x3a9/0x501
> >    9)     6456      80   __kmalloc+0x1cb/0x200
> >   10)     6376     376   vring_add_indirect+0x36/0x200
> >   11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> >   12)     5856     288   __virtblk_add_req+0xda/0x1b0
> >   13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> >   14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> >   15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> >   16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> >   17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> >   18)     5120     112   blk_flush_plug_list+0xc7/0x220
> >   19)     5008      64   io_schedule_timeout+0x88/0x100
> >   20)     4944     128   mempool_alloc+0x145/0x170
> >   21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> >   22)     4720      48   get_swap_bio+0x30/0x90
> >   23)     4672     160   __swap_writepage+0x150/0x230
> >   24)     4512      32   swap_writepage+0x42/0x90
> >   25)     4480     320   shrink_page_list+0x676/0xa80
> >   26)     4160     208   shrink_inactive_list+0x262/0x4e0
> >   27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> >   28)     3648      80   shrink_zone+0x3f/0x110
> >   29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> >   30)     3440     208   try_to_free_pages+0xf7/0x1e0
> >   31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> >   32)     2880       8   alloc_pages_current+0x10f/0x1f0
> >   33)     2872     200   __page_cache_alloc+0x13f/0x160
> >   34)     2672      80   find_or_create_page+0x4c/0xb0
> >   35)     2592      80   ext4_mb_load_buddy+0x1e9/0x370
> >   36)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
> >   37)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
> >   38)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
> >   39)     1952     160   ext4_map_blocks+0x325/0x530
> >   40)     1792     384   ext4_writepages+0x6d1/0xce0
> >   41)     1408      16   do_writepages+0x23/0x40
> >   42)     1392      96   __writeback_single_inode+0x45/0x2e0
> >   43)     1296     176   writeback_sb_inodes+0x2ad/0x500
> >   44)     1120      80   __writeback_inodes_wb+0x9e/0xd0
> >   45)     1040     160   wb_writeback+0x29b/0x350
> >   46)      880     208   bdi_writeback_workfn+0x11c/0x480
> >   47)      672     144   process_one_work+0x1d2/0x570
> >   48)      528     112   worker_thread+0x116/0x370
> >   49)      416     240   kthread+0xf3/0x110
> >   50)      176     176   ret_from_fork+0x7c/0xb0

Impressive: 3 nested allocations - GFP_NOFS, GFP_NOIO and then
GFP_ATOMIC before the stack goes boom. XFS usually only needs 2...

However, add another 1000 bytes of stack for each IO by going
through the FC/scsi layers and hitting command allocation at the
bottom of the IO stack rather than bio allocation at the top and
maybe stack usage for 2-3 layers of MD and LVM as well, and you
start to see how that stack pushes >10k of usage rather than just
overflowing 8k....

> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  arch/x86/include/asm/page_64_types.h | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> > index 8de6d9cf3b95..678205195ae1 100644
> > --- a/arch/x86/include/asm/page_64_types.h
> > +++ b/arch/x86/include/asm/page_64_types.h
> > @@ -1,7 +1,7 @@
> >  #ifndef _ASM_X86_PAGE_64_DEFS_H
> >  #define _ASM_X86_PAGE_64_DEFS_H
> >  
> > -#define THREAD_SIZE_ORDER	1
> > +#define THREAD_SIZE_ORDER	2
> >  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> >  #define CURRENT_MASK (~(THREAD_SIZE - 1))

Got my vote. Can we get this into 3.16, please?

Acked-by: Dave Chinner <david@fromorbit.com>

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
  2014-05-28  8:37   ` Dave Chinner
  2014-05-28  9:04   ` Michael S. Tsirkin
@ 2014-05-28  9:27   ` Borislav Petkov
  2014-05-29 13:23     ` One Thousand Gnomes
  2014-05-28 14:14   ` Steven Rostedt
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 107+ messages in thread
From: Borislav Petkov @ 2014-05-28  9:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, mst, Dave Hansen,
	Steven Rostedt

On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> While I was testing in-house patches under heavy memory pressure on
> qemu-kvm, the 3.14 kernel crashed randomly. The cause was a kernel
> stack overflow.
> 
> When I investigated the problem, the callstack was a bit deeper than
> usual because it involved reclaim functions, though not the direct
> reclaim path.
> 
> I tried to trim the stack usage of some functions related to
> alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> disappear: I just hit it again via another, deeper callstack on the
> reclaim/allocator path.
> 
> Of course, we could sweep every site we find to reduce stack usage,
> but I'm not sure how long that would save the world (surely, lots of
> developers will keep adding nice features that use stack again), and
> if we consider more complex features in the I/O layer and/or reclaim
> path, it might be better to increase the stack size. (Meanwhile,
> stack usage on 64-bit machines roughly doubled compared to 32-bit
> while THREAD_SIZE stuck at 8K. That doesn't seem fair to me, and
> arm64 has already expanded to 16K.)

Hmm, stupid question: what happens when 16K is not enough either, do
we increase again? When do we stop increasing? 1M, 2M... ?

Sounds like we want to make it a config option with a couple of sizes
for everyone to be happy. :-)

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
                     ` (2 preceding siblings ...)
  2014-05-28  9:27   ` [RFC 2/2] x86_64: expand kernel stack to 16K Borislav Petkov
@ 2014-05-28 14:14   ` Steven Rostedt
  2014-05-28 14:23     ` H. Peter Anvin
  2014-05-28 15:43   ` Richard Weinberger
  2014-05-28 16:09   ` Linus Torvalds
  5 siblings, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2014-05-28 14:14 UTC (permalink / raw)
  To: Minchan Kim, Linus Torvalds
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, mst, Dave Hansen


This looks like something that Linus should be involved in too. He's
been critical in the past about stack usage.

On Wed, 28 May 2014 15:53:59 +0900
Minchan Kim <minchan@kernel.org> wrote:

> While I was testing in-house patches under heavy memory pressure on
> qemu-kvm, the 3.14 kernel crashed randomly. The cause was a kernel
> stack overflow.
> 
> When I investigated the problem, the callstack was a bit deeper than
> usual because it involved reclaim functions, though not the direct
> reclaim path.
> 
> I tried to trim the stack usage of some functions related to
> alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> disappear: I just hit it again via another, deeper callstack on the
> reclaim/allocator path.
> 
> Of course, we could sweep every site we find to reduce stack usage,
> but I'm not sure how long that would save the world (surely, lots of
> developers will keep adding nice features that use stack again), and
> if we consider more complex features in the I/O layer and/or reclaim
> path, it might be better to increase the stack size. (Meanwhile,
> stack usage on 64-bit machines roughly doubled compared to 32-bit
> while THREAD_SIZE stuck at 8K. That doesn't seem fair to me, and
> arm64 has already expanded to 16K.)
> 
> So, my simple idea is: let's expand the stack size and keep an eye on
> the stack consumption of each kernel function via ftrace's stack
> tracer. For example, we could set a bar such that no function should
> exceed, say, 200 bytes, and emit a warning at runtime when a function
> consumes more. Of course, that could produce false positives, but at
> least it would give us a chance to think it over.
> 
> I guess this topic has been discussed several times, so there may be
> a strong reason not to increase the kernel stack size on x86_64 that
> I don't know about; hence I'm Ccing the x86_64 maintainers, other MM
> folks, and the virtio maintainers.

I agree with Boris that if this goes in, it should be a config option,
or perhaps selected by the filesystems that need it. I'd hate to have
16K stacks on a box that doesn't have much memory and only uses ext2.

-- Steve

> 
> [ 1065.604404] kworker/-5766    0d..2 1071625990us : stack_trace_call:         Depth    Size   Location    (51 entries)
> [ 1065.604404]         -----    ----   --------
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> [ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  18)     5120     112   blk_flush_plug_list+0xc7/0x220
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  19)     5008      64   io_schedule_timeout+0x88/0x100
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  20)     4944     128   mempool_alloc+0x145/0x170
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  22)     4720      48   get_swap_bio+0x30/0x90
> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676/0xa80
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  26)     4160     208   shrink_inactive_list+0x262/0x4e0
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  28)     3648      80   shrink_zone+0x3f/0x110
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  30)     3440     208   try_to_free_pages+0xf7/0x1e0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  32)     2880       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  33)     2872     200   __page_cache_alloc+0x13f/0x160
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  34)     2672      80   find_or_create_page+0x4c/0xb0
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  35)     2592      80   ext4_mb_load_buddy+0x1e9/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  36)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  37)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
> [ 1065.604404] kworker/-5766    0d..2 1071625998us : stack_trace_call:  38)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  39)     1952     160   ext4_map_blocks+0x325/0x530
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  40)     1792     384   ext4_writepages+0x6d1/0xce0
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  41)     1408      16   do_writepages+0x23/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  42)     1392      96   __writeback_single_inode+0x45/0x2e0
> [ 1065.604404] kworker/-5766    0d..2 1071625999us : stack_trace_call:  43)     1296     176   writeback_sb_inodes+0x2ad/0x500
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  44)     1120      80   __writeback_inodes_wb+0x9e/0xd0
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  45)     1040     160   wb_writeback+0x29b/0x350
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  46)      880     208   bdi_writeback_workfn+0x11c/0x480
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  47)      672     144   process_one_work+0x1d2/0x570
> [ 1065.604404] kworker/-5766    0d..2 1071626000us : stack_trace_call:  48)      528     112   worker_thread+0x116/0x370
> [ 1065.604404] kworker/-5766    0d..2 1071626001us : stack_trace_call:  49)      416     240   kthread+0xf3/0x110
> [ 1065.604404] kworker/-5766    0d..2 1071626001us : stack_trace_call:  50)      176     176   ret_from_fork+0x7c/0xb0
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/include/asm/page_64_types.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> index 8de6d9cf3b95..678205195ae1 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -1,7 +1,7 @@
>  #ifndef _ASM_X86_PAGE_64_DEFS_H
>  #define _ASM_X86_PAGE_64_DEFS_H
>  
> -#define THREAD_SIZE_ORDER	1
> +#define THREAD_SIZE_ORDER	2
>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>  #define CURRENT_MASK (~(THREAD_SIZE - 1))
>  


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 14:14   ` Steven Rostedt
@ 2014-05-28 14:23     ` H. Peter Anvin
  2014-05-28 22:11       ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: H. Peter Anvin @ 2014-05-28 14:23 UTC (permalink / raw)
  To: Steven Rostedt, Minchan Kim, Linus Torvalds
  Cc: linux-kernel, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, rusty, mst, Dave Hansen

We tried 4K stacks on x86-64, too, for quite a while as I recall.  The kernel stack is one of the main costs of a thread.  I would like to decouple struct thread_info from the kernel stack (PJ Waskewicz was working on that before he left Intel), but that doesn't buy us all that much.

An additional 8K per thread is a huge hit.  XFS has indeed always been a canary, or trouble spot; I suspect that is because it originally came from another kernel where this was not an optimization target.



On May 28, 2014 7:14:01 AM PDT, Steven Rostedt <rostedt@goodmis.org> wrote:
>
>This looks like something that Linus should be involved in too. He's
>been critical in the past about stack usage.
>
>On Wed, 28 May 2014 15:53:59 +0900
>Minchan Kim <minchan@kernel.org> wrote:
>
>> While I was testing in-house patches under heavy memory pressure on
>> qemu-kvm, the 3.14 kernel crashed randomly. The reason was kernel
>> stack overflow.
>>
>> When I investigated the problem, the callstack was a little deeper
>> because reclaim functions were involved, though not via the direct
>> reclaim path.
>>
>> I tried to trim the stack usage of some functions related to
>> alloc/reclaim and saved a few hundred bytes, but the overflow didn't
>> disappear: I just hit it again via another, deeper callstack on the
>> reclaim/allocator path.
>>
>> Of course, we could sweep every site we find to reduce stack usage,
>> but I'm not sure how long that saves the world (surely lots of
>> developers will keep adding nice features that use stack again), and
>> if we consider more complex features in the I/O layer and/or reclaim
>> path, it might be better to increase the stack size. (Meanwhile,
>> stack usage on 64-bit machines has doubled compared to 32-bit while
>> the stack has stayed at 8K. That doesn't seem fair to me, and arm64
>> has already expanded to 16K.)
>>
>> So, my simple idea is just to expand the stack size and keep an eye
>> on the stack consumption of each kernel function via the stacktrace
>> feature of ftrace. For example, we could set a bar such that each
>> function shouldn't exceed 200K and emit a warning when some function
>> consumes more at runtime. Of course, that could produce false
>> positives, but at least it would prompt us to think it over.
>>
>> I guess this topic has been discussed several times, so there might
>> be a strong reason not to increase the kernel stack size on x86_64
>> that I don't know about, hence Ccing the x86_64 maintainers, other
>> MM folks, and the virtio maintainers.
>
>I agree with Boris that if this goes in, it should be a config option.
>Or perhaps selected by those file systems that need it. I hate to have
>16K stacks on a box that doesn't have that much memory, but also just
>uses ext2.
>
>-- Steve
>
>> 
>> [ 51-entry stack trace snipped; it is identical to the one quoted in
>> full earlier in the thread ]
>> 
>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>> ---
>>  arch/x86/include/asm/page_64_types.h | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
>> index 8de6d9cf3b95..678205195ae1 100644
>> --- a/arch/x86/include/asm/page_64_types.h
>> +++ b/arch/x86/include/asm/page_64_types.h
>> @@ -1,7 +1,7 @@
>>  #ifndef _ASM_X86_PAGE_64_DEFS_H
>>  #define _ASM_X86_PAGE_64_DEFS_H
>>  
>> -#define THREAD_SIZE_ORDER	1
>> +#define THREAD_SIZE_ORDER	2
>>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>>  #define CURRENT_MASK (~(THREAD_SIZE - 1))
>>  

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
                     ` (3 preceding siblings ...)
  2014-05-28 14:14   ` Steven Rostedt
@ 2014-05-28 15:43   ` Richard Weinberger
  2014-05-28 16:08     ` Steven Rostedt
  2014-05-28 16:09   ` Linus Torvalds
  5 siblings, 1 reply; 107+ messages in thread
From: Richard Weinberger @ 2014-05-28 15:43 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, mst, Dave Hansen, Steven Rostedt

On Wed, May 28, 2014 at 8:53 AM, Minchan Kim <minchan@kernel.org> wrote:
> While I was testing in-house patches under heavy memory pressure on
> qemu-kvm, the 3.14 kernel crashed randomly. The reason was kernel
> stack overflow.
>
> When I investigated the problem, the callstack was a little deeper
> because reclaim functions were involved, though not via the direct
> reclaim path.
>
> I tried to trim the stack usage of some functions related to
> alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> disappear: I just hit it again via another, deeper callstack on the
> reclaim/allocator path.
>
> Of course, we could sweep every site we find to reduce stack usage,
> but I'm not sure how long that saves the world (surely lots of
> developers will keep adding nice features that use stack again), and
> if we consider more complex features in the I/O layer and/or reclaim
> path, it might be better to increase the stack size. (Meanwhile,
> stack usage on 64-bit machines has doubled compared to 32-bit while
> the stack has stayed at 8K. That doesn't seem fair to me, and arm64
> has already expanded to 16K.)
>
> So, my simple idea is just to expand the stack size and keep an eye
> on the stack consumption of each kernel function via the stacktrace
> feature of ftrace. For example, we could set a bar such that each
> function shouldn't exceed 200K and emit a warning when some function
> consumes more at runtime. Of course, that could produce false
> positives, but at least it would prompt us to think it over.
>
> I guess this topic has been discussed several times, so there might
> be a strong reason not to increase the kernel stack size on x86_64
> that I don't know about, hence Ccing the x86_64 maintainers, other
> MM folks, and the virtio maintainers.
>
> [ 51-entry stack trace snipped; it is identical to the one quoted in
> full earlier in the thread ]
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  arch/x86/include/asm/page_64_types.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> index 8de6d9cf3b95..678205195ae1 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -1,7 +1,7 @@
>  #ifndef _ASM_X86_PAGE_64_DEFS_H
>  #define _ASM_X86_PAGE_64_DEFS_H
>
> -#define THREAD_SIZE_ORDER      1
> +#define THREAD_SIZE_ORDER      2
>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>  #define CURRENT_MASK (~(THREAD_SIZE - 1))

Do you have any numbers on the performance impact of this?

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  9:13     ` Dave Chinner
@ 2014-05-28 16:06       ` Johannes Weiner
  2014-05-28 21:55         ` Dave Chinner
  2014-05-29  6:06         ` Minchan Kim
  0 siblings, 2 replies; 107+ messages in thread
From: Johannes Weiner @ 2014-05-28 16:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Minchan Kim, linux-kernel, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Hugh Dickins, rusty, mst, Dave Hansen,
	Steven Rostedt, xfs

On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > [ cc XFS list ]
> 
> [and now there is a complete copy on the XFs list, I'll add my 2c]
> 
> > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > While I was testing in-house patches under heavy memory pressure
> > > on qemu-kvm, the 3.14 kernel crashed randomly. The reason was
> > > kernel stack overflow.
> > > 
> > > When I investigated the problem, the callstack was a little
> > > deeper because reclaim functions were involved, though not via
> > > the direct reclaim path.
> > > 
> > > I tried to trim the stack usage of some functions related to
> > > alloc/reclaim and saved a few hundred bytes, but the overflow
> > > didn't disappear: I just hit it again via another, deeper
> > > callstack on the reclaim/allocator path.
> 
> That's a no-win situation. The stack overruns through ->writepage
> we've been seeing with XFS over the past *4 years* are much larger
> than a few bytes. The worst case stack usage on a virtio block
> device was about 10.5KB of stack usage.
> 
> And, like this one, it came from the flusher thread as well. The
> difference was that the allocation that triggered the reclaim path
> you've reported occurred when 5k of the stack had already been
> used...
> 
> > > Of course, we could sweep every site we find to reduce stack
> > > usage, but I'm not sure how long that saves the world (surely
> > > lots of developers will keep adding nice features that use stack
> > > again), and if we consider more complex features in the I/O
> > > layer and/or reclaim path, it might be better to increase the
> > > stack size. (Meanwhile, stack usage on 64-bit machines has
> > > doubled compared to 32-bit while the stack has stayed at 8K.
> > > That doesn't seem fair to me, and arm64 has already expanded
> > > to 16K.)
> 
> Yup, that's all been pointed out previously. 8k stacks were never
> large enough to fit the Linux IO architecture on x86-64, but nobody
> outside filesystem and IO developers has been willing to accept that
> argument as valid, despite regular stack overruns and filesystems
> having to add workaround after workaround to prevent stack overruns.
> 
> That's why stuff like this appears in various filesystem's
> ->writepage:
> 
>         /*
>          * Refuse to write the page out if we are called from reclaim context.
>          *
>          * This avoids stack overflows when called from deeply used stacks in
>          * random callers for direct reclaim or memcg reclaim.  We explicitly
>          * allow reclaim from kswapd as the stack usage there is relatively low.
>          *
>          * This should never happen except in the case of a VM regression so
>          * warn about it.
>          */
>         if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
>                         PF_MEMALLOC))
>                 goto redirty;
> 
> That still doesn't guarantee us enough stack space to do writeback,
> though, because memory allocation can occur when reading in metadata
> needed to do delayed allocation, and so we could trigger GFP_NOFS
> memory allocation from the flusher thread with 4-5k of stack already
> consumed, so that would still overrun the stack.
> 
> So, a couple of years ago we started deferring half the writeback
> stack usage to a worker thread (commit c999a22 "xfs: introduce an
> allocation workqueue"), under the assumption that the worst stack
> usage when we call memory allocation is around 3-3.5k of stack used.
> We thought that would be safe, but the stack trace you've posted
> shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> which means we're still screwed despite all the workarounds we have
> in place.

The allocation and reclaim stack itself is only 2k per the stacktrace
below.  What got us in this particular case is that we engaged a
complicated block layer setup from within the allocation context in
order to swap out a page.

In the past we disabled filesystem ->writepage from within the
allocation context and deferred it to kswapd for stack reasons (see
the WARN_ON_ONCE and the comment in your above quote), but I think we
have to go further and do the same for even swap_writepage():

> > > I guess this topic has been discussed several times, so there
> > > might be a strong reason not to increase the kernel stack size
> > > on x86_64 that I don't know about, hence Ccing the x86_64
> > > maintainers, other MM folks, and the virtio maintainers.
> > >
> > >          Depth    Size   Location    (51 entries)
> > > 
> > >    0)     7696      16   lookup_address+0x28/0x30
> > >    1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> > >    2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> > >    3)     7640     392   kernel_map_pages+0x6c/0x120
> > >    4)     7248     256   get_page_from_freelist+0x489/0x920
> > >    5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> > >    6)     6640       8   alloc_pages_current+0x10f/0x1f0
> > >    7)     6632     168   new_slab+0x2c5/0x370
> > >    8)     6464       8   __slab_alloc+0x3a9/0x501
> > >    9)     6456      80   __kmalloc+0x1cb/0x200
> > >   10)     6376     376   vring_add_indirect+0x36/0x200
> > >   11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> > >   12)     5856     288   __virtblk_add_req+0xda/0x1b0
> > >   13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> > >   14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> > >   15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> > >   16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> > >   17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> > >   18)     5120     112   blk_flush_plug_list+0xc7/0x220
> > >   19)     5008      64   io_schedule_timeout+0x88/0x100
> > >   20)     4944     128   mempool_alloc+0x145/0x170
> > >   21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> > >   22)     4720      48   get_swap_bio+0x30/0x90
> > >   23)     4672     160   __swap_writepage+0x150/0x230
> > >   24)     4512      32   swap_writepage+0x42/0x90

Without swap IO from the allocation context, the stack would have
ended here, which would have been easily survivable, and the writeout
work would have been left to kswapd, which has a much shallower stack
than this:

> > >   25)     4480     320   shrink_page_list+0x676/0xa80
> > >   26)     4160     208   shrink_inactive_list+0x262/0x4e0
> > >   27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> > >   28)     3648      80   shrink_zone+0x3f/0x110
> > >   29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> > >   30)     3440     208   try_to_free_pages+0xf7/0x1e0
> > >   31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> > >   32)     2880       8   alloc_pages_current+0x10f/0x1f0
> > >   33)     2872     200   __page_cache_alloc+0x13f/0x160
> > >   34)     2672      80   find_or_create_page+0x4c/0xb0
> > >   35)     2592      80   ext4_mb_load_buddy+0x1e9/0x370
> > >   36)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
> > >   37)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
> > >   38)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
> > >   39)     1952     160   ext4_map_blocks+0x325/0x530
> > >   40)     1792     384   ext4_writepages+0x6d1/0xce0
> > >   41)     1408      16   do_writepages+0x23/0x40
> > >   42)     1392      96   __writeback_single_inode+0x45/0x2e0
> > >   43)     1296     176   writeback_sb_inodes+0x2ad/0x500
> > >   44)     1120      80   __writeback_inodes_wb+0x9e/0xd0
> > >   45)     1040     160   wb_writeback+0x29b/0x350
> > >   46)      880     208   bdi_writeback_workfn+0x11c/0x480
> > >   47)      672     144   process_one_work+0x1d2/0x570
> > >   48)      528     112   worker_thread+0x116/0x370
> > >   49)      416     240   kthread+0xf3/0x110
> > >   50)      176     176   ret_from_fork+0x7c/0xb0
> 
> Impressive: 3 nested allocations - GFP_NOFS, GFP_NOIO and then
> GFP_ATOMIC before the stack goes boom. XFS usually only needs 2...

Do they also usually involve swap_writepage()?

---

diff --git a/mm/page_io.c b/mm/page_io.c
index 7c59ef681381..02e7e3c168cf 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -233,6 +233,22 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 {
 	int ret = 0;
 
+	/*
+	 * Refuse to write the page out if we are called from reclaim context.
+	 *
+	 * This avoids stack overflows when called from deeply used stacks in
+	 * random callers for direct reclaim or memcg reclaim.  We explicitly
+	 * allow reclaim from kswapd as the stack usage there is relatively low.
+	 *
+	 * This should never happen except in the case of a VM regression so
+	 * warn about it.
+	 */
+	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
+			PF_MEMALLOC)) {
+		SetPageDirty(page);
+		goto out;
+	}
+
 	if (try_to_free_swap(page)) {
 		unlock_page(page);
 		goto out;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 61c576083c07..99cca6633e0d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -985,13 +985,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 
 		if (PageDirty(page)) {
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but only writeback
+			 * Only kswapd can writeback pages to avoid
+			 * risk of stack overflow but only writeback
 			 * if many dirty pages have been encountered.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() ||
-					 !zone_is_reclaim_dirty(zone))) {
+			if (!current_is_kswapd() ||
+			    !zone_is_reclaim_dirty(zone)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 15:43   ` Richard Weinberger
@ 2014-05-28 16:08     ` Steven Rostedt
  2014-05-28 16:11       ` Richard Weinberger
  2014-05-28 16:13       ` Linus Torvalds
  0 siblings, 2 replies; 107+ messages in thread
From: Steven Rostedt @ 2014-05-28 16:08 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Minchan Kim, LKML, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Rusty Russell, mst, Dave Hansen

On Wed, 28 May 2014 17:43:50 +0200
Richard Weinberger <richard.weinberger@gmail.com> wrote:


> > diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> > index 8de6d9cf3b95..678205195ae1 100644
> > --- a/arch/x86/include/asm/page_64_types.h
> > +++ b/arch/x86/include/asm/page_64_types.h
> > @@ -1,7 +1,7 @@
> >  #ifndef _ASM_X86_PAGE_64_DEFS_H
> >  #define _ASM_X86_PAGE_64_DEFS_H
> >
> > -#define THREAD_SIZE_ORDER      1
> > +#define THREAD_SIZE_ORDER      2
> >  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
> >  #define CURRENT_MASK (~(THREAD_SIZE - 1))
> 
> Do you have any numbers of the performance impact of this?
> 

What performance impact are you looking for? Now if the system is short
on memory, it would probably cause issues in creating tasks. But other
than that, I'm not sure what you are looking for.

-- Steve

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
                     ` (4 preceding siblings ...)
  2014-05-28 15:43   ` Richard Weinberger
@ 2014-05-28 16:09   ` Linus Torvalds
  2014-05-28 22:31     ` Dave Chinner
                       ` (2 more replies)
  5 siblings, 3 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-28 16:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Tue, May 27, 2014 at 11:53 PM, Minchan Kim <minchan@kernel.org> wrote:
>
> So, my stupid idea is just let's expand stack size and keep an eye
> toward stack consumption on each kernel functions via stacktrace of ftrace.

We probably have to do this at some point, but that point is not -rc7.

And quite frankly, from the backtrace, I can only say: there is some
bad shit there. The current VM stands out as a bloated pig:

> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20

> [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676/0xa80
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  26)     4160     208   shrink_inactive_list+0x262/0x4e0
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  28)     3648      80   shrink_zone+0x3f/0x110
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  30)     3440     208   try_to_free_pages+0xf7/0x1e0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  32)     2880       8   alloc_pages_current+0x10f/0x1f0
> [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  33)     2872     200   __page_cache_alloc+0x13f/0x160

That __alloc_pages_nodemask() thing in particular looks bad. It
actually seems not to be the usual "let's just allocate some
structures on the stack" disease, it looks more like "lots of
inlining, horrible calling conventions, and lots of random stupid
variables".

From a quick glance at the frame usage, some of it seems to be gcc
being rather bad at stack allocation, but lots of it is just nasty
spilling around the disgusting call-sites with tons of arguments. A
_lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
"spill frame pointer", afaik).

Avoiding some inlining, and using a single flag value rather than the
collection of "bool"s would probably help. But nothing really
trivially obvious stands out.

But what *does* stand out (once again) is that we probably shouldn't
do swap-out in direct reclaim. This came up the last time we had stack
issues (XFS) too. I really do suspect that direct reclaim should only
do the kind of reclaim that does not need any IO at all.

I think we _do_ generally avoid IO in direct reclaim, but swap is
special. And not for a good reason, afaik. DaveC, remind me, I think
you said something about the swap case the last time this came up..

                  Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 16:08     ` Steven Rostedt
@ 2014-05-28 16:11       ` Richard Weinberger
  2014-05-28 16:13       ` Linus Torvalds
  1 sibling, 0 replies; 107+ messages in thread
From: Richard Weinberger @ 2014-05-28 16:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Minchan Kim, LKML, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Rusty Russell, mst, Dave Hansen

Am 28.05.2014 18:08, schrieb Steven Rostedt:
> On Wed, 28 May 2014 17:43:50 +0200
> Richard Weinberger <richard.weinberger@gmail.com> wrote:
> 
> 
>>> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
>>> index 8de6d9cf3b95..678205195ae1 100644
>>> --- a/arch/x86/include/asm/page_64_types.h
>>> +++ b/arch/x86/include/asm/page_64_types.h
>>> @@ -1,7 +1,7 @@
>>>  #ifndef _ASM_X86_PAGE_64_DEFS_H
>>>  #define _ASM_X86_PAGE_64_DEFS_H
>>>
>>> -#define THREAD_SIZE_ORDER      1
>>> +#define THREAD_SIZE_ORDER      2
>>>  #define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)
>>>  #define CURRENT_MASK (~(THREAD_SIZE - 1))
>>
>> Do you have any numbers of the performance impact of this?
>>
> 
> What performance impact are you looking for? Now if the system is short
> on memory, it would probably cause issues in creating tasks. But other
> than that, I'm not sure what you are looking for.

Allocating more contiguous memory for every thread is not cheap.
I'd assume that such a change will cause more pressure on the allocator.

Thanks,
//richard

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 16:08     ` Steven Rostedt
  2014-05-28 16:11       ` Richard Weinberger
@ 2014-05-28 16:13       ` Linus Torvalds
  1 sibling, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-28 16:13 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Richard Weinberger, Minchan Kim, LKML, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen

On Wed, May 28, 2014 at 9:08 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> What performance impact are you looking for? Now if the system is short
> on memory, it would probably cause issues in creating tasks.

It doesn't necessarily need to be short on memory, it could just be
fragmented. But a page order of 2 should still be ok'ish.

That said, this is definitely not a rc7 issue. I'd *much* rather
disable swap from direct reclaim, although that kind of patch too
would be a "can Minchan test it, we can put it in the next merge
window and then backport it if we don't have issues".

I see that Johannes already did a patch for that (and this really
_has_ come up before), although I'd skip the WARN_ON_ONCE() part
except for perhaps Minchan testing it.

               Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/2] ftrace: print stack usage right before Oops
  2014-05-28  6:53 [PATCH 1/2] ftrace: print stack usage right before Oops Minchan Kim
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
@ 2014-05-28 16:18 ` Steven Rostedt
  2014-05-29  3:52   ` Minchan Kim
  2014-05-29  3:01 ` Steven Rostedt
  2 siblings, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2014-05-28 16:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, mst, Dave Hansen

On Wed, 28 May 2014 15:53:58 +0900
Minchan Kim <minchan@kernel.org> wrote:

> While I played with a feature of my own (something on the reclaim
> path), the kernel oopsed easily. I guessed the reason was stack
> overflow and wanted to prove it.
> 
> I found the stack tracer, which would be very useful for me, but the
> kernel oopsed before my user program could gather the information via
> "watch cat /sys/kernel/debug/tracing/stack_trace", so I couldn't get
> the stack usage of each function.
> 
> What I wanted was to emit the kernel stack usage when the kernel oopses.
> 
> This patch records the callstack of maximum stack usage into the ftrace
> buffer right before an Oops and prints that information with
> ftrace_dump_on_oops. At last, I can find a culprit. :)
> 

This is not dependent on patch 2/2, nor is 2/2 dependent on this patch,
so I'll review this as if 2/2 does not exist.


> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  kernel/trace/trace_stack.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> index 5aa9a5b9b6e2..5eb88e60bc5e 100644
> --- a/kernel/trace/trace_stack.c
> +++ b/kernel/trace/trace_stack.c
> @@ -51,6 +51,30 @@ static DEFINE_MUTEX(stack_sysctl_mutex);
>  int stack_tracer_enabled;
>  static int last_stack_tracer_enabled;
>  
> +static inline void print_max_stack(void)
> +{
> +	long i;
> +	int size;
> +
> +	trace_printk("        Depth    Size   Location"
> +			   "    (%d entries)\n"

Please do not break strings just to satisfy that silly 80 character
limit. Even Linus Torvalds said that's pretty stupid.

Also, do not use trace_printk(). It is not made to be included in a
production kernel. It reserves special buffers to make it as fast as
possible, and those buffers should not be created in production
systems. In fact, I will probably add for 3.16 a big warning message
when trace_printk() is used.

Since this is a bug, why not just use printk() instead?

BTW, wouldn't this function crash as well if the stack is already
bad?

-- Steve

> +			   "        -----    ----   --------\n",
> +			   max_stack_trace.nr_entries - 1);
> +
> +	for (i = 0; i < max_stack_trace.nr_entries; i++) {
> +		if (stack_dump_trace[i] == ULONG_MAX)
> +			break;
> +		if (i+1 == max_stack_trace.nr_entries ||
> +				stack_dump_trace[i+1] == ULONG_MAX)
> +			size = stack_dump_index[i];
> +		else
> +			size = stack_dump_index[i] - stack_dump_index[i+1];
> +
> +		trace_printk("%3ld) %8d   %5d   %pS\n", i, stack_dump_index[i],
> +				size, (void *)stack_dump_trace[i]);
> +	}
> +}
> +
>  static inline void
>  check_stack(unsigned long ip, unsigned long *stack)
>  {
> @@ -149,8 +173,12 @@ check_stack(unsigned long ip, unsigned long *stack)
>  			i++;
>  	}
>  
> -	BUG_ON(current != &init_task &&
> -		*(end_of_stack(current)) != STACK_END_MAGIC);
> +	if ((current != &init_task &&
> +		*(end_of_stack(current)) != STACK_END_MAGIC)) {
> +		print_max_stack();
> +		BUG();
> +	}
> +
>   out:
>  	arch_spin_unlock(&max_stack_lock);
>  	local_irq_restore(flags);


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 16:06       ` Johannes Weiner
@ 2014-05-28 21:55         ` Dave Chinner
  2014-05-29  6:06         ` Minchan Kim
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Chinner @ 2014-05-28 21:55 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Minchan Kim, linux-kernel, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Hugh Dickins, rusty, mst, Dave Hansen,
	Steven Rostedt, xfs

On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > > [ cc XFS list ]
> > 
> > [and now there is a complete copy on the XFs list, I'll add my 2c]
> > 
> > > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > > While I was playing with in-house patches under heavy memory pressure
> > > > on qemu-kvm, the 3.14 kernel crashed randomly. The reason was kernel
> > > > stack overflow.
> > > > 
> > > > When I investigated the problem, the callstack was a little deeper
> > > > due to the involvement of reclaim functions, but not via the direct
> > > > reclaim path.
> > > > 
> > > > I tried to trim the stack usage of some functions related to
> > > > alloc/reclaim and saved a hundred bytes or so, but the overflow
> > > > didn't disappear; I just hit the overflow via another, deeper
> > > > callstack on the reclaim/allocator path.
> > 
> > That's a no win situation. The stack overruns through ->writepage
> > we've been seeing with XFS over the past *4 years* are much larger
> > than a few bytes. The worst case stack usage on a virtio block
> > device was about 10.5KB of stack usage.
> > 
> > And, like this one, it came from the flusher thread as well. The
> > difference was that the allocation that triggered the reclaim path
> > you've reported occurred when 5k of the stack had already been
> > used...
> > 
> > > > Of course, we could sweep every site we have found to reduce stack
> > > > usage, but I'm not sure how long that would save the world (surely
> > > > lots of developers will start adding nice features which use stack
> > > > again), and if we consider more complex features in the I/O layer
> > > > and/or reclaim path, it might be better to increase the stack size
> > > > (meanwhile, stack usage on 64-bit machines has doubled compared to
> > > > 32-bit while the stack size has stayed at 8K. Hmm, that doesn't seem
> > > > fair to me, and arm64 has already expanded to 16K.)
> > 
> > Yup, that's all been pointed out previously. 8k stacks were never
> > large enough to fit the linux IO architecture on x86-64, but nobody
> > outside filesystem and IO developers has been willing to accept that
> > argument as valid, despite regular stack overruns and filesystems
> > having to add workaround after workaround to prevent stack overruns.
> > 
> > That's why stuff like this appears in various filesystem's
> > ->writepage:
> > 
> >         /*
> >          * Refuse to write the page out if we are called from reclaim context.
> >          *
> >          * This avoids stack overflows when called from deeply used stacks in
> >          * random callers for direct reclaim or memcg reclaim.  We explicitly
> >          * allow reclaim from kswapd as the stack usage there is relatively low.
> >          *
> >          * This should never happen except in the case of a VM regression so
> >          * warn about it.
> >          */
> >         if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> >                         PF_MEMALLOC))
> >                 goto redirty;
> > 
> > That still doesn't guarantee us enough stack space to do writeback,
> > though, because memory allocation can occur when reading in metadata
> > needed to do delayed allocation, and so we could trigger GFP_NOFS
> > memory allocation from the flusher thread with 4-5k of stack already
> > consumed, so that would still overrun the stack.
> > 
> > So, a couple of years ago we started defering half the writeback
> > stack usage to a worker thread (commit c999a22 "xfs: introduce an
> > allocation workqueue"), under the assumption that the worst stack
> > usage when we call memory allocation is around 3-3.5k of stack used.
> > We thought that would be safe, but the stack trace you've posted
> > shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> > which means we're still screwed despite all the workarounds we have
> > in place.
> 
> The allocation and reclaim stack itself is only 2k per the stacktrace
> below.  What got us in this particular case is that we engaged a
> complicated block layer setup from within the allocation context in
> order to swap out a page.

The report does not have a complicated block layer setup - it's just
a swap device on a virtio device. There's no MD, no raid, no complex
transport and protocol layer, etc. It's about as simple as it gets.

> In the past we disabled filesystem ->writepage from within the
> allocation context and deferred it to kswapd for stack reasons (see
> the WARN_ON_ONCE and the comment in your above quote), but I think we
> have to go further and do the same for even swap_writepage():

I don't think that solves the problem. I've seen plenty of near
stack overflows that were caused by >3k of stack used by memory
allocation/reclaim overhead and then another 1k of stack used while
scheduling and waiting.

If we have a subsystem that can put >3k on the stack at arbitrary
locations, then we really only have <5k of stack available for
callers. And when the generic code typically consumes 1-2k of stack
before we get to filesystem specific methods, we only have 3-4k of
stack left for the worst case storage path stack usage. With the
block layer and driver layers requiring 2.5-3k because they can do
memory allocation and schedule, that leaves very little for the
layers in the middle, which is arguably the most algorithmically
complex layer of the storage stack.....

> > > > I guess this topic was discussed several times, so there might be a
> > > > strong reason not to increase the kernel stack size on x86_64; not
> > > > knowing it myself, I'm Ccing the x86_64 maintainers, other MM guys and
> > > > virtio maintainers.
> > > >
> > > >          Depth    Size   Location    (51 entries)
> > > > 
> > > >    0)     7696      16   lookup_address+0x28/0x30
> > > >    1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> > > >    2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> > > >    3)     7640     392   kernel_map_pages+0x6c/0x120
> > > >    4)     7248     256   get_page_from_freelist+0x489/0x920
> > > >    5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> > > >    6)     6640       8   alloc_pages_current+0x10f/0x1f0
> > > >    7)     6632     168   new_slab+0x2c5/0x370
> > > >    8)     6464       8   __slab_alloc+0x3a9/0x501
> > > >    9)     6456      80   __kmalloc+0x1cb/0x200
> > > >   10)     6376     376   vring_add_indirect+0x36/0x200
> > > >   11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> > > >   12)     5856     288   __virtblk_add_req+0xda/0x1b0
> > > >   13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> > > >   14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> > > >   15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> > > >   16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> > > >   17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> > > >   18)     5120     112   blk_flush_plug_list+0xc7/0x220
> > > >   19)     5008      64   io_schedule_timeout+0x88/0x100
> > > >   20)     4944     128   mempool_alloc+0x145/0x170
> > > >   21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> > > >   22)     4720      48   get_swap_bio+0x30/0x90
> > > >   23)     4672     160   __swap_writepage+0x150/0x230
> > > >   24)     4512      32   swap_writepage+0x42/0x90
> 
> Without swap IO from the allocation context, the stack would have
> ended here, which would have been easily survivable.  And left the
> writeout work to kswapd, which has a much shallower stack than this:

Sure, but this is just playing whack-a-stack. We can keep slapping
band-aids and restrictions on code and make the code more complex,
constrained, convoluted and slower, or we can just increase the
stack size....

> > > >   25)     4480     320   shrink_page_list+0x676/0xa80
> > > >   26)     4160     208   shrink_inactive_list+0x262/0x4e0
> > > >   27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> > > >   28)     3648      80   shrink_zone+0x3f/0x110
> > > >   29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> > > >   30)     3440     208   try_to_free_pages+0xf7/0x1e0
> > > >   31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> > > >   32)     2880       8   alloc_pages_current+0x10f/0x1f0
> > > >   33)     2872     200   __page_cache_alloc+0x13f/0x160
> > > >   34)     2672      80   find_or_create_page+0x4c/0xb0
> > > >   35)     2592      80   ext4_mb_load_buddy+0x1e9/0x370
> > > >   36)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
> > > >   37)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
> > > >   38)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
> > > >   39)     1952     160   ext4_map_blocks+0x325/0x530
> > > >   40)     1792     384   ext4_writepages+0x6d1/0xce0
> > > >   41)     1408      16   do_writepages+0x23/0x40
> > > >   42)     1392      96   __writeback_single_inode+0x45/0x2e0
> > > >   43)     1296     176   writeback_sb_inodes+0x2ad/0x500
> > > >   44)     1120      80   __writeback_inodes_wb+0x9e/0xd0
> > > >   45)     1040     160   wb_writeback+0x29b/0x350
> > > >   46)      880     208   bdi_writeback_workfn+0x11c/0x480
> > > >   47)      672     144   process_one_work+0x1d2/0x570
> > > >   48)      528     112   worker_thread+0x116/0x370
> > > >   49)      416     240   kthread+0xf3/0x110
> > > >   50)      176     176   ret_from_fork+0x7c/0xb0
> > 
> > Impressive: 3 nested allocations - GFP_NOFS, GFP_NOIO and then
> > GFP_ATOMIC before the stack goes boom. XFS usually only needs 2...
> 
> Do they also usually involve swap_writepage()?

No.  Have a look at this recent thread when Dave Jones reported
trinity was busting the stack.

http://oss.sgi.com/archives/xfs/2014-02/msg00325.html

What happens when a shrinker issues IO:

http://oss.sgi.com/archives/xfs/2014-02/msg00361.html

Yes, there was an XFS problem in there that was fixed (by moving
work to a workqueue!) but the point is that swap is not the only
path through memory allocation that can consume huge amounts of
stack. That above trace also points out a path through the scheduler
of close to 1k of stack usage. That gets worse -
wait_for_completion() typically requires 1.5k of stack....

Also contributing is the new blk-mq layer, which, judging from the
above stack trace, still hasn't been fixed:

http://oss.sgi.com/archives/xfs/2014-02/msg00355.html

and a lot of the stack usage is because of saved registers on each
function call:

http://oss.sgi.com/archives/xfs/2014-02/msg00470.html

And here's a good set of examples of the amount of stack certain
functions can require:

http://oss.sgi.com/archives/xfs/2014-02/msg00365.html

Am I the only person who sees a widespread problem here?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 14:23     ` H. Peter Anvin
@ 2014-05-28 22:11       ` Dave Chinner
  2014-05-28 22:42         ` H. Peter Anvin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-28 22:11 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Steven Rostedt, Minchan Kim, Linus Torvalds, linux-kernel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, rusty, mst,
	Dave Hansen

On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
> We tried for 4K on x86-64, too, for quite a while as I recall.
> The kernel stack is one of the main costs for a thread.  I would
> like to decouple struct thread_info from the kernel stack (PJ
> Waskewicz was working on that before he left Intel) but that
> doesn't buy us all that much.
> 
> 8K additional per thread is a huge hit.  XFS has indeed always
> been a canary, or troublespot, I suspect because it originally
> came from another kernel where this was not an optimization
> target.

<sigh>

Always blame XFS for stack usage problems.

Even when the reported problem is from IO to an ext4 filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 16:09   ` Linus Torvalds
@ 2014-05-28 22:31     ` Dave Chinner
  2014-05-28 22:41       ` Linus Torvalds
  2014-05-29  3:46     ` Minchan Kim
  2014-05-30 21:23     ` Andi Kleen
  2 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-28 22:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Wed, May 28, 2014 at 09:09:23AM -0700, Linus Torvalds wrote:
> On Tue, May 27, 2014 at 11:53 PM, Minchan Kim <minchan@kernel.org> wrote:
> >
> > So, my stupid idea is just let's expand stack size and keep an eye
> > toward stack consumption on each kernel functions via stacktrace of ftrace.
.....
> But what *does* stand out (once again) is that we probably shouldn't
> do swap-out in direct reclaim. This came up the last time we had stack
> issues (XFS) too. I really do suspect that direct reclaim should only
> do the kind of reclaim that does not need any IO at all.
> 
> I think we _do_ generally avoid IO in direct reclaim, but swap is
> special. And not for a good reason, afaik. DaveC, remind me, I think
> you said something about the swap case the last time this came up..

Right, we do generally avoid IO through filesystems via direct
reclaim because delayed allocation requires significant amounts
of additional memory, stack space and IO.

However, swap doesn't have that overhead - it's just the IO stack
that it drives through submit_bio(), and the worst case I'd seen
through that path was much less than other reclaim stack path usage.
I haven't seen swap in any of the stack overflows from production
machines, and I only rarely see it in worst case stack usage
profiles on my test machines.

Indeed, the call chain reported here is not caused by swap issuing
IO.  We scheduled in the swap code (throttling waiting for
congestion, I think) with a plugged block device (from the ext4
writeback layer) with pending bios queued on it and the scheduler
has triggered a flush of the device.  submit_bio in the swap path
has much less stack usage than io_schedule() because it doesn't have
any of the scheduler or plug list flushing overhead in the stack.

So, realistically, the swap path is not worst case stack usage here
and disabling it won't prevent this stack overflow from happening.
Direct reclaim will simply throttle elsewhere and that will still
cause the plug to be flushed, the IO to be issued and the stack to
overflow.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 22:31     ` Dave Chinner
@ 2014-05-28 22:41       ` Linus Torvalds
  2014-05-29  1:30         ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Linus Torvalds @ 2014-05-28 22:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Wed, May 28, 2014 at 3:31 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> Indeed, the call chain reported here is not caused by swap issuing
> IO.

Well, that's one way of reading that callchain.

I think it's the *wrong* way of reading it, though. Almost dishonestly
so. Because very clearly, the swapout _is_ what causes the unplugging
of the IO queue, and does so because it is allocating the BIO for its
own IO. The fact that that then fails (because of other IO's in
flight), and causes *other* IO to be flushed, doesn't really change
anything fundamental. It's still very much swap that causes that
"let's start IO".

IOW, swap-out directly caused that extra 3kB of stack use in what was
a deep call chain (due to memory allocation). I really don't
understand why you are arguing anything else on a pure technicality.

I thought you had some other argument for why swap was different, and
against removing that "page_is_file_cache()" special case in
shrink_page_list().

                         Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 22:11       ` Dave Chinner
@ 2014-05-28 22:42         ` H. Peter Anvin
  2014-05-28 23:17           ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: H. Peter Anvin @ 2014-05-28 22:42 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Steven Rostedt, Minchan Kim, Linus Torvalds, linux-kernel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, rusty, mst,
	Dave Hansen

On 05/28/2014 03:11 PM, Dave Chinner wrote:
> On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
>> We tried for 4K on x86-64, too, for quite a while as I recall.
>> The kernel stack is one of the main costs for a thread.  I would
>> like to decouple struct thread_info from the kernel stack (PJ
>> Waskewicz was working on that before he left Intel) but that
>> doesn't buy us all that much.
>>
>> 8K additional per thread is a huge hit.  XFS has indeed always
>> been a canary, or troublespot, I suspect because it originally
>> came from another kernel where this was not an optimization
>> target.
> 
> <sigh>
> 
> Always blame XFS for stack usage problems.
> 
> Even when the reported problem is from IO to an ext4 filesystem.
> 

You were the one calling it a canary.

	-hpa



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 22:42         ` H. Peter Anvin
@ 2014-05-28 23:17           ` Dave Chinner
  2014-05-28 23:21             ` H. Peter Anvin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-28 23:17 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Steven Rostedt, Minchan Kim, Linus Torvalds, linux-kernel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, rusty, mst,
	Dave Hansen

On Wed, May 28, 2014 at 03:42:18PM -0700, H. Peter Anvin wrote:
> On 05/28/2014 03:11 PM, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 07:23:23AM -0700, H. Peter Anvin wrote:
> >> We tried for 4K on x86-64, too, for quite a while as I recall.
> >> The kernel stack is one of the main costs for a thread.  I would
> >> like to decouple struct thread_info from the kernel stack (PJ
> >> Waskewicz was working on that before he left Intel) but that
> >> doesn't buy us all that much.
> >>
> >> 8K additional per thread is a huge hit.  XFS has indeed always
> >> been a canary, or troublespot, I suspect because it originally
> >> came from another kernel where this was not an optimization
> >> target.
> > 
> > <sigh>
> > 
> > Always blame XFS for stack usage problems.
> > 
> > Even when the reported problem is from IO to an ext4 filesystem.
> > 
> 
> You were the one calling it a canary.

That doesn't mean it's to blame. Don't shoot the messenger...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 23:17           ` Dave Chinner
@ 2014-05-28 23:21             ` H. Peter Anvin
  0 siblings, 0 replies; 107+ messages in thread
From: H. Peter Anvin @ 2014-05-28 23:21 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Steven Rostedt, Minchan Kim, Linus Torvalds, linux-kernel,
	Andrew Morton, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, rusty, mst,
	Dave Hansen

On 05/28/2014 04:17 PM, Dave Chinner wrote:
>>
>> You were the one calling it a canary.
> 
> That doesn't mean it's to blame. Don't shoot the messenger...
> 

Fair enough.

	-hpa



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  9:04   ` Michael S. Tsirkin
@ 2014-05-29  1:09     ` Minchan Kim
  2014-05-29  2:44       ` Steven Rostedt
  2014-05-29  2:47       ` Rusty Russell
  2014-05-29  4:10     ` virtio_ring stack usage Rusty Russell
  1 sibling, 2 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29  1:09 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, Dave Hansen,
	Steven Rostedt

[-- Attachment #1: Type: text/plain, Size: 118673 bytes --]

On Wed, May 28, 2014 at 12:04:09PM +0300, Michael S. Tsirkin wrote:
> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > While I was testing in-house patches under heavy memory pressure on
> > qemu-kvm, the 3.14 kernel crashed randomly. The cause was a kernel
> > stack overflow.
> > 
> > When I investigated the problem, the callstack was a little deeper
> > because it involved reclaim functions, but not the direct reclaim path.
> > 
> > I tried to trim the stack usage of some functions related to
> > alloc/reclaim and saved a few hundred bytes, but the overflow didn't
> > disappear; I just hit it again via another, deeper callstack on the
> > reclaim/allocator path.
> > 
> > Of course, we could sweep every site we find to reduce stack usage,
> > but I'm not sure how long that would save us (surely, lots of
> > developers will start adding nice features which use stack again),
> > and if we consider more complex features in the I/O layer and/or the
> > reclaim path, it might be better to increase the stack size.
> > (Meanwhile, per-frame stack usage on 64-bit machines roughly doubled
> > compared to 32-bit while the stack has stayed at 8K. Hmm, that
> > doesn't seem fair to me, and arm64 has already expanded to 16K.)
> > 
> > So, my stupid idea is just to expand the stack size and keep an eye
> > on the stack consumption of each kernel function via ftrace's stack
> > tracer. For example, we could set a bar such that each function
> > shouldn't exceed 200K, and emit a warning when some function consumes
> > more at runtime. Of course, that could produce false positives, but
> > at least it would give us a chance to think it over.
> > 
> > I guess this topic has been discussed several times, so there might
> > be strong reasons not to increase the kernel stack size on x86_64
> > that I'm not aware of, so I'm CCing the x86_64 maintainers, other MM
> > folks, and the virtio maintainers.
> > 
> > [ 1065.604404] kworker/-5766    0d..2 1071625990us : stack_trace_call:         Depth    Size   Location    (51 entries)
> > [ 1065.604404]         -----    ----   --------
> > [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> > [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> > [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> > [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> > [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> > [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> > [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
> > [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
> > [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
> > [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> > [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> > [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> > [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> > [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> 
> virtio stack usage seems very high.
> Here is virtio_ring.su generated using -fstack-usage flag for gcc 4.8.2.
> 
> virtio_ring.c:107:35:sg_next_arr        16      static
> 
> 
> <--- this is a surprise, I really expected it to be inlined
>      same for sg_next_chained.
> <--- Rusty: should we force compiler to inline it?
> 
> 
> virtio_ring.c:584:6:virtqueue_disable_cb        16      static
> virtio_ring.c:604:10:virtqueue_enable_cb_prepare        16      static
> virtio_ring.c:632:6:virtqueue_poll      16      static
> virtio_ring.c:652:6:virtqueue_enable_cb 16      static
> virtio_ring.c:845:14:virtqueue_get_vring_size   16      static
> virtio_ring.c:854:6:virtqueue_is_broken 16      static
> virtio_ring.c:101:35:sg_next_chained    16      static
> virtio_ring.c:436:6:virtqueue_notify    24      static
> virtio_ring.c:672:6:virtqueue_enable_cb_delayed 16      static
> virtio_ring.c:820:6:vring_transport_features    16      static
> virtio_ring.c:472:13:detach_buf 40      static
> virtio_ring.c:518:7:virtqueue_get_buf   32      static
> virtio_ring.c:812:6:vring_del_virtqueue 16      static
> virtio_ring.c:394:6:virtqueue_kick_prepare      16      static
> virtio_ring.c:464:6:virtqueue_kick      32      static
> virtio_ring.c:186:19:4  16      static
> virtio_ring.c:733:13:vring_interrupt    24      static
> virtio_ring.c:707:7:virtqueue_detach_unused_buf 32      static
> virtio_config.h:84:20:7 16      static
> virtio_ring.c:753:19:vring_new_virtqueue        80      static  
> virtio_ring.c:374:5:virtqueue_add_inbuf 56      static
> virtio_ring.c:352:5:virtqueue_add_outbuf        56      static
> virtio_ring.c:314:5:virtqueue_add_sgs   112     static  
> 
> 
> as you see, vring_add_indirect was inlined within virtqueue_add_sgs by my gcc.
> Taken together, they add up to only 112 bytes: not 1/2K as they do for you.
> Which compiler version and flags did you use?
> 

barrios@bbox:~/linux-2.6$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3


The stack tracer reported that virtio_queue_rq used 96 bytes, and objdump says:
Disassembly of section .text:

ffffffff8148e480 <virtio_queue_rq>:
ffffffff8148e480:       e8 7b 09 26 00          callq  ffffffff816eee00 <__entry_text_start>
ffffffff8148e485:       55                      push   %rbp
ffffffff8148e486:       48 89 e5                mov    %rsp,%rbp
ffffffff8148e489:       48 83 ec 50             sub    $0x50,%rsp
ffffffff8148e48d:       4c 89 6d e8             mov    %r13,-0x18(%rbp)
ffffffff8148e491:       48 89 5d d8             mov    %rbx,-0x28(%rbp)
ffffffff8148e495:       49 89 fd                mov    %rdi,%r13
ffffffff8148e498:       4c 89 65 e0             mov    %r12,-0x20(%rbp)
ffffffff8148e49c:       4c 89 75 f0             mov    %r14,-0x10(%rbp)
ffffffff8148e4a0:       4c 89 7d f8             mov    %r15,-0x8(%rbp)
ffffffff8148e4a4:       48 8b 87 50 01 00 00    mov    0x150(%rdi),%rax
ffffffff8148e4ab:       48 8b 9e 18 01 00 00    mov    0x118(%rsi),%rbx
ffffffff8148e4b2:       4c 8b a0 c8 06 00 00    mov    0x6c8(%rax),%r12

So, it's not strange.
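As a quick cross-check (my own sketch, not part of the original report): the reported figure matches what you get by adding the `sub` immediate, the pushed registers, and the return address, assuming the stack tracer counts the caller's 8-byte return address as part of the frame.

```python
# Hypothetical helper: reconstruct a frame size from the prologue as
# sub-immediate + 8 bytes per pushed register + 8-byte return address.
RET_ADDR = 8  # return address pushed by the caller
PUSH = 8      # each 64-bit "push" in the prologue

def frame_size(sub_imm, n_pushes):
    return sub_imm + n_pushes * PUSH + RET_ADDR

# virtio_queue_rq: one "push %rbp" then "sub $0x50,%rsp"
print(frame_size(0x50, 1))  # -> 96, matching the stack tracer
```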

The stack tracer reported that __virtblk_add_req used 288 bytes, and objdump says:

ffffffff8148e2d0 <__virtblk_add_req>:
ffffffff8148e2d0:       e8 2b 0b 26 00          callq  ffffffff816eee00 <__entry_text_start>
ffffffff8148e2d5:       55                      push   %rbp
ffffffff8148e2d6:       48 89 e5                mov    %rsp,%rbp
ffffffff8148e2d9:       48 81 ec 10 01 00 00    sub    $0x110,%rsp
ffffffff8148e2e0:       48 89 5d d8             mov    %rbx,-0x28(%rbp)
ffffffff8148e2e4:       4c 89 65 e0             mov    %r12,-0x20(%rbp)
ffffffff8148e2e8:       48 8d 9d 30 ff ff ff    lea    -0xd0(%rbp),%rbx
ffffffff8148e2ef:       4c 89 6d e8             mov    %r13,-0x18(%rbp)
ffffffff8148e2f3:       4c 89 75 f0             mov    %r14,-0x10(%rbp)
ffffffff8148e2f7:       49 89 f5                mov    %rsi,%r13
ffffffff8148e2fa:       4c 89 7d f8             mov    %r15,-0x8(%rbp)
ffffffff8148e2fe:       44 8b 7e 08             mov    0x8(%rsi),%r15d
ffffffff8148e302:       48 8d 76 08             lea    0x8(%rsi),%rsi

So, it's not strange.

The stack tracer reported that virtqueue_add_sgs used 144 bytes, and objdump says:

ffffffff8141e170 <virtqueue_add_sgs>:
ffffffff8141e170:       e8 8b 0c 2d 00          callq  ffffffff816eee00 <__entry_text_start>
ffffffff8141e175:       55                      push   %rbp
ffffffff8141e176:       48 89 e5                mov    %rsp,%rbp
ffffffff8141e179:       41 57                   push   %r15
ffffffff8141e17b:       41 56                   push   %r14
ffffffff8141e17d:       41 89 d6                mov    %edx,%r14d
ffffffff8141e180:       41 55                   push   %r13
ffffffff8141e182:       49 89 f5                mov    %rsi,%r13
ffffffff8141e185:       41 54                   push   %r12
ffffffff8141e187:       53                      push   %rbx
ffffffff8141e188:       48 89 fb                mov    %rdi,%rbx
ffffffff8141e18b:       48 83 ec 58             sub    $0x58,%rsp
ffffffff8141e18f:       85 d2                   test   %edx,%edx

So, it's not strange.
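The same arithmetic works here (again my sketch, not from the original mail): six pushes (%rbp, %r15, %r14, %r13, %r12, %rbx), `sub $0x58,%rsp`, plus the return address reproduce the 144 bytes.

```python
RET_ADDR = 8  # return address pushed by the caller
PUSH = 8      # each 64-bit "push" in the prologue

def frame_size(sub_imm, n_pushes):
    # Frame as the stack tracer would report it, assuming it
    # includes the return address.
    return sub_imm + n_pushes * PUSH + RET_ADDR

# virtqueue_add_sgs pushes rbp, r15, r14, r13, r12, rbx (6 pushes)
print(frame_size(0x58, 6))  # -> 144
```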

The stack tracer reported that vring_add_indirect used 376 bytes, and objdump says:

ffffffff8141dc60 <vring_add_indirect>:
ffffffff8141dc60:       55                      push   %rbp
ffffffff8141dc61:       48 89 e5                mov    %rsp,%rbp
ffffffff8141dc64:       41 57                   push   %r15
ffffffff8141dc66:       41 56                   push   %r14
ffffffff8141dc68:       41 55                   push   %r13
ffffffff8141dc6a:       49 89 fd                mov    %rdi,%r13
ffffffff8141dc6d:       89 cf                   mov    %ecx,%edi
ffffffff8141dc6f:       48 c1 e7 04             shl    $0x4,%rdi
ffffffff8141dc73:       41 54                   push   %r12
ffffffff8141dc75:       49 89 d4                mov    %rdx,%r12
ffffffff8141dc78:       53                      push   %rbx
ffffffff8141dc79:       48 89 f3                mov    %rsi,%rbx
ffffffff8141dc7c:       48 83 ec 28             sub    $0x28,%rsp
ffffffff8141dc80:       8b 75 20                mov    0x20(%rbp),%esi
ffffffff8141dc83:       89 4d bc                mov    %ecx,-0x44(%rbp)
ffffffff8141dc86:       44 89 45 cc             mov    %r8d,-0x34(%rbp)
ffffffff8141dc8a:       44 89 4d c8             mov    %r9d,-0x38(%rbp)
ffffffff8141dc8e:       83 e6 dd                and    $0xffffffdd,%esi
ffffffff8141dc91:       e8 7a d1 d7 ff          callq  ffffffff8119ae10 <__kmalloc>
ffffffff8141dc96:       48 85 c0                test   %rax,%rax

So, it's *strange*.
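And this is exactly why it looks strange (a sketch under the same assumption that the tracer counts the 8-byte return address): the prologue only accounts for 96 bytes, nowhere near the reported 376.

```python
RET_ADDR = 8  # return address pushed by the caller
PUSH = 8      # each 64-bit "push" in the prologue

def frame_size(sub_imm, n_pushes):
    return sub_imm + n_pushes * PUSH + RET_ADDR

# vring_add_indirect pushes rbp, r15, r14, r13, r12, rbx (6 pushes)
# and does "sub $0x28,%rsp".
expected = frame_size(0x28, 6)
print(expected)        # -> 96
print(376 - expected)  # -> 280 bytes unaccounted for
```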

I'll attach the .config and the .o file.
Maybe someone can figure out what's happening.


#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.14.0 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_ARCH_HAS_CPU_AUTOPROBE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_FHANDLE=y
CONFIG_AUDIT=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_RCU_STALL_COMMON=y
# CONFIG_RCU_USER_QS is not set
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_RCU_FAST_NO_HZ=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_RCU_NOCB_CPU is not set
CONFIG_IKCONFIG=m
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=18
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_ARCH_WANTS_PROT_NUMA_PROT_NONE=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_FREEZER=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
# CONFIG_CGROUP_CPUACCT is not set
CONFIG_RESOURCE_COUNTERS=y
# CONFIG_MEMCG is not set
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CGROUP_PERF=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
CONFIG_USER_NS=y
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
# CONFIG_SYSFS_DEPRECATED is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
# CONFIG_RD_LZ4 is not set
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_EXPERT=y
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_PCI_QUIRKS=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_SLUB_CPU_PARTIAL=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_ATTRS=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
CONFIG_CC_STACKPROTECTOR_REGULAR=y
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
# CONFIG_SYSTEM_TRUSTED_KEYRING is not set
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
# CONFIG_MODULE_SIG is not set
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
CONFIG_BLK_DEV_THROTTLING=y
# CONFIG_BLK_CMDLINE_PARSER is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
CONFIG_ACORN_PARTITION=y
# CONFIG_ACORN_PARTITION_CUMANA is not set
# CONFIG_ACORN_PARTITION_EESOX is not set
CONFIG_ACORN_PARTITION_ICS=y
# CONFIG_ACORN_PARTITION_ADFS is not set
# CONFIG_ACORN_PARTITION_POWERTEC is not set
CONFIG_ACORN_PARTITION_RISCIX=y
# CONFIG_AIX_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
CONFIG_ATARI_PARTITION=y
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
CONFIG_LDM_PARTITION=y
# CONFIG_LDM_DEBUG is not set
CONFIG_SGI_PARTITION=y
CONFIG_ULTRIX_PARTITION=y
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
CONFIG_SYSV68_PARTITION=y
# CONFIG_CMDLINE_PARTITION is not set
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
CONFIG_X86_EXTENDED_PLATFORM=y
CONFIG_X86_NUMACHIP=y
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
# CONFIG_X86_INTEL_LPSS is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_HYPERVISOR_GUEST is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MEMTEST is not set
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_PROCESSOR_SELECT=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=256
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y
# CONFIG_I8K is not set
# CONFIG_MICROCODE is not set
# CONFIG_MICROCODE_INTEL_EARLY is not set
# CONFIG_MICROCODE_AMD_EARLY is not set
# CONFIG_X86_MSR is not set
# CONFIG_X86_CPUID is not set
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_DIRECT_GBPAGES=y
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=6
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_MEMORY_PROBE=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
# CONFIG_MOVABLE_NODE is not set
CONFIG_HAVE_BOOTMEM_INFO_NODE=y
CONFIG_MEMORY_HOTPLUG=y
CONFIG_MEMORY_HOTPLUG_SPARSE=y
CONFIG_MEMORY_HOTREMOVE=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_NEED_BOUNCE_POOL=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=65536
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
# CONFIG_HWPOISON_INJECT is not set
# CONFIG_TRANSPARENT_HUGEPAGE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_CLEANCACHE=y
CONFIG_FRONTSWAP=y
# CONFIG_CMA is not set
# CONFIG_ZBUD is not set
# CONFIG_ZSWAP is not set
# CONFIG_GCMA is not set
CONFIG_ZSMALLOC=y
# CONFIG_PGTABLE_MAPPING is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK=y
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_EFI=y
# CONFIG_EFI_STUB is not set
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
# CONFIG_COMPAT_VDSO is not set
# CONFIG_CMDLINE_BOOL is not set
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_ARCH_ENABLE_MEMORY_HOTREMOVE=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM_RUNTIME=y
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_ADVANCED_DEBUG=y
CONFIG_PM_TEST_SUSPEND=y
CONFIG_PM_SLEEP_DEBUG=y
# CONFIG_DPM_WATCHDOG is not set
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
CONFIG_ACPI=y
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_PROCFS is not set
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
CONFIG_ACPI_VIDEO=m
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
CONFIG_ACPI_CUSTOM_DSDT_FILE=""
# CONFIG_ACPI_CUSTOM_DSDT is not set
# CONFIG_ACPI_INITRD_TABLE_OVERRIDE is not set
# CONFIG_ACPI_DEBUG is not set
# CONFIG_ACPI_PCI_SLOT is not set
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
# CONFIG_ACPI_HOTPLUG_MEMORY is not set
# CONFIG_ACPI_SBS is not set
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
# CONFIG_ACPI_EXTLOG is not set
CONFIG_SFI=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_COMMON=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_STAT_DETAILS=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y

#
# x86 CPU frequency scaling drivers
#
# CONFIG_X86_INTEL_PSTATE is not set
# CONFIG_X86_PCC_CPUFREQ is not set
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=y
# CONFIG_X86_AMD_FREQ_SENSITIVITY is not set
CONFIG_X86_SPEEDSTEP_CENTRINO=y
# CONFIG_X86_P4_CLOCKMOD is not set

#
# shared options
#
# CONFIG_X86_SPEEDSTEP_LIB is not set

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
# CONFIG_CPU_IDLE_MULTIPLE_DRIVERS is not set
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
CONFIG_INTEL_IDLE=y

#
# Memory power savings
#
# CONFIG_I7300_IDLE is not set

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
# CONFIG_PCI_CNB20LE_QUIRK is not set
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
# CONFIG_PCIE_ECRC is not set
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
CONFIG_PCI_MSI=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
# CONFIG_PCI_STUB is not set
CONFIG_HT_IRQ=y
CONFIG_PCI_ATS=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_IOAPIC=y
CONFIG_PCI_LABEL=y

#
# PCI host controller drivers
#
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
# CONFIG_PCCARD is not set
CONFIG_HOTPLUG_PCI=y
# CONFIG_HOTPLUG_PCI_ACPI is not set
CONFIG_HOTPLUG_PCI_CPCI=y
# CONFIG_HOTPLUG_PCI_CPCI_ZT5550 is not set
# CONFIG_HOTPLUG_PCI_CPCI_GENERIC is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set
CONFIG_RAPIDIO=y
CONFIG_RAPIDIO_TSI721=y
CONFIG_RAPIDIO_DISC_TIMEOUT=30
# CONFIG_RAPIDIO_ENABLE_RX_TX_PORTS is not set
# CONFIG_RAPIDIO_DMA_ENGINE is not set
# CONFIG_RAPIDIO_DEBUG is not set
# CONFIG_RAPIDIO_ENUM_BASIC is not set

#
# RapidIO Switch drivers
#
CONFIG_RAPIDIO_TSI57X=y
CONFIG_RAPIDIO_CPS_XX=y
CONFIG_RAPIDIO_TSI568=y
CONFIG_RAPIDIO_CPS_GEN2=y
# CONFIG_X86_SYSFB is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE=y
# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set
CONFIG_BINFMT_SCRIPT=y
# CONFIG_HAVE_AOUT is not set
CONFIG_BINFMT_MISC=m
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_KEYS_COMPAT=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_XFRM_USER is not set
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
# CONFIG_NET_IP_TUNNEL is not set
CONFIG_IP_MROUTE=y
# CONFIG_IP_MROUTE_MULTIPLE_TABLES is not set
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
CONFIG_INET_LRO=y
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_GRE is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_PIMSM_V2=y
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y
CONFIG_BRIDGE_NETFILTER=y

#
# Core Netfilter Configuration
#
# CONFIG_NETFILTER_NETLINK_ACCT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NF_CONNTRACK is not set
# CONFIG_NF_TABLES is not set
# CONFIG_NETFILTER_XTABLES is not set
# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV4 is not set
# CONFIG_IP_NF_IPTABLES is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV6 is not set
# CONFIG_IP6_NF_IPTABLES is not set
# CONFIG_BRIDGE_NF_EBTABLES is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
CONFIG_STP=m
CONFIG_BRIDGE=m
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_HAVE_NET_DSA=y
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
CONFIG_LLC=m
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_IEEE802154 is not set
CONFIG_6LOWPAN_IPHC=m
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
# CONFIG_NET_SCH_FQ is not set
# CONFIG_NET_SCH_HHF is not set
# CONFIG_NET_SCH_PIE is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
# CONFIG_NET_CLS_CGROUP is not set
# CONFIG_NET_CLS_BPF is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_MMAP is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_NET_MPLS_GSO is not set
# CONFIG_HSR is not set
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
CONFIG_CGROUP_NET_CLASSID=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_JIT=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
# CONFIG_NET_DROP_MONITOR is not set
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
# CONFIG_CAN is not set
# CONFIG_IRDA is not set
CONFIG_BT=m
CONFIG_BT_RFCOMM=m
CONFIG_BT_RFCOMM_TTY=y
CONFIG_BT_BNEP=m
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
# CONFIG_BT_HIDP is not set

#
# Bluetooth device drivers
#
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIBTSDIO is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBPA10X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_BT_MRVL is not set
# CONFIG_AF_RXRPC is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
# CONFIG_LIB80211 is not set

#
# CFG80211 needs to be enabled for MAC80211
#
# CONFIG_WIMAX is not set
CONFIG_RFKILL=y
CONFIG_RFKILL_LEDS=y
CONFIG_RFKILL_INPUT=y
# CONFIG_RFKILL_REGULATOR is not set
# CONFIG_RFKILL_GPIO is not set
# CONFIG_NET_9P is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
CONFIG_HAVE_BPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
# CONFIG_STANDALONE is not set
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE=""
CONFIG_FW_LOADER_USER_HELPER=y
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=y
CONFIG_REGMAP_SPI=y
CONFIG_REGMAP_IRQ=y
CONFIG_DMA_SHARED_BUFFER=y

#
# Bus devices
#
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
CONFIG_PARPORT=m
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
CONFIG_PARPORT_PC=m
# CONFIG_PARPORT_SERIAL is not set
CONFIG_PARPORT_PC_FIFO=y
# CONFIG_PARPORT_PC_SUPERIO is not set
# CONFIG_PARPORT_GSC is not set
# CONFIG_PARPORT_AX88796 is not set
CONFIG_PARPORT_1284=y
CONFIG_PNP=y
CONFIG_PNP_DEBUG_MESSAGES=y

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
# CONFIG_PARIDE is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
CONFIG_ZRAM=y
# CONFIG_ZRAM_DEBUG is not set
# CONFIG_BLK_CPQ_CISS_DA is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_BLK_DEV_SKD is not set
# CONFIG_BLK_DEV_SX8 is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=65536
# CONFIG_BLK_DEV_XIP is not set
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
CONFIG_VIRTIO_BLK=y
# CONFIG_BLK_DEV_HD is not set
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_RSXX is not set

#
# Misc devices
#
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ATMEL_SSC is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1780 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_TI_DAC7512 is not set
# CONFIG_BMP085_I2C is not set
# CONFIG_BMP085_SPI is not set
# CONFIG_PCH_PHUB is not set
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_LATTICE_ECP3_CONFIG is not set
# CONFIG_SRAM is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_AT25 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_93XX46 is not set
# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_TI_ST is not set
# CONFIG_SENSORS_LIS3_I2C is not set

#
# Altera FPGA firmware download module
#
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_VMWARE_VMCI is not set

#
# Intel MIC Host Driver
#
# CONFIG_INTEL_MIC_HOST is not set

#
# Intel MIC Card Driver
#
# CONFIG_INTEL_MIC_CARD is not set
# CONFIG_GENWQE is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_TGT is not set
# CONFIG_SCSI_NETLINK is not set
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
# CONFIG_BLK_DEV_SR_VENDOR is not set
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_SCSI_BNX2X_FCOE is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
# CONFIG_SCSI_AIC7XXX is not set
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_SCSI_ESAS2R is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
# CONFIG_SCSI_MPT2SAS is not set
# CONFIG_SCSI_MPT3SAS is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_LIBFC is not set
# CONFIG_LIBFCOE is not set
# CONFIG_FCOE is not set
# CONFIG_FCOE_FNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_PPA is not set
# CONFIG_SCSI_IMM is not set
# CONFIG_SCSI_STEX is not set
CONFIG_SCSI_SYM53C8XX_2=y
CONFIG_SCSI_SYM53C8XX_DMA_ADDRESSING_MODE=1
CONFIG_SCSI_SYM53C8XX_DEFAULT_TAGS=16
CONFIG_SCSI_SYM53C8XX_MAX_TAGS=64
CONFIG_SCSI_SYM53C8XX_MMIO=y
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_FC is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_LPFC is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_DC390T is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
# CONFIG_SCSI_SRP is not set
# CONFIG_SCSI_BFA_FC is not set
# CONFIG_SCSI_VIRTIO is not set
# CONFIG_SCSI_CHELSIO_FCOE is not set
CONFIG_SCSI_DH=y
# CONFIG_SCSI_DH_RDAC is not set
# CONFIG_SCSI_DH_HP_SW is not set
# CONFIG_SCSI_DH_EMC is not set
# CONFIG_SCSI_DH_ALUA is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
CONFIG_PDC_ADMA=y
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_HIGHBANK is not set
# CONFIG_SATA_MV is not set
# CONFIG_SATA_NV is not set
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_RCAR is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
# CONFIG_PATA_AMD is not set
# CONFIG_PATA_ARASAN_CF is not set
# CONFIG_PATA_ARTOP is not set
# CONFIG_PATA_ATIIXP is not set
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CS5520 is not set
# CONFIG_PATA_CS5530 is not set
# CONFIG_PATA_CS5536 is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
# CONFIG_PATA_JMICRON is not set
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
# CONFIG_PATA_OLDPIIX is not set
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SC1200 is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
CONFIG_PATA_SIS=y
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
# CONFIG_PATA_VIA is not set
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PLATFORM is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=y
CONFIG_ATA_GENERIC=y
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
# CONFIG_MD_LINEAR is not set
# CONFIG_MD_RAID0 is not set
# CONFIG_MD_RAID1 is not set
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
# CONFIG_MD_MULTIPATH is not set
# CONFIG_MD_FAULTY is not set
# CONFIG_BCACHE is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
# CONFIG_DM_CRYPT is not set
# CONFIG_DM_SNAPSHOT is not set
# CONFIG_DM_THIN_PROVISIONING is not set
# CONFIG_DM_CACHE is not set
# CONFIG_DM_MIRROR is not set
# CONFIG_DM_RAID is not set
# CONFIG_DM_ZERO is not set
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_TARGET_CORE is not set
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_FC is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=128
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
# CONFIG_I2O is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=m
CONFIG_NETDEVICES=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_IFB is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
# CONFIG_RIONET is not set
CONFIG_TUN=y
# CONFIG_VETH is not set
CONFIG_VIRTIO_NET=y
# CONFIG_NLMON is not set
# CONFIG_ARCNET is not set

#
# CAIF transport drivers
#
# CONFIG_VHOST_NET is not set

#
# Distributed Switch Architecture drivers
#
# CONFIG_NET_DSA_MV88E6XXX is not set
# CONFIG_NET_DSA_MV88E6060 is not set
# CONFIG_NET_DSA_MV88E6XXX_NEED_PPU is not set
# CONFIG_NET_DSA_MV88E6131 is not set
# CONFIG_NET_DSA_MV88E6123_61_65 is not set
CONFIG_ETHERNET=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_VORTEX is not set
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
CONFIG_NET_CADENCE=y
# CONFIG_ARM_AT91_ETHER is not set
# CONFIG_MACB is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
# CONFIG_TIGON3 is not set
# CONFIG_BNX2X is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
# CONFIG_NET_CALXEDA_XGMAC is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_EXAR=y
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
CONFIG_NET_VENDOR_HP=y
# CONFIG_HP100 is not set
CONFIG_NET_VENDOR_INTEL=y
# CONFIG_E100 is not set
# CONFIG_E1000 is not set
CONFIG_E1000E=m
# CONFIG_IGB is not set
# CONFIG_IGBVF is not set
# CONFIG_IXGB is not set
# CONFIG_IXGBE is not set
# CONFIG_IXGBEVF is not set
# CONFIG_I40E is not set
# CONFIG_I40EVF is not set
CONFIG_NET_VENDOR_I825XX=y
# CONFIG_IP1000 is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
# CONFIG_SKGE is not set
# CONFIG_SKY2 is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_MLX5_CORE is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8842 is not set
# CONFIG_KS8851 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MICROCHIP=y
# CONFIG_ENC28J60 is not set
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_NE2K_PCI is not set
CONFIG_NET_VENDOR_NVIDIA=y
# CONFIG_FORCEDETH is not set
CONFIG_NET_VENDOR_OKI=y
# CONFIG_PCH_GBE is not set
# CONFIG_ETHOC is not set
CONFIG_NET_PACKET_ENGINE=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_QLGE is not set
# CONFIG_NETXEN_NIC is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_ATP is not set
# CONFIG_8139CP is not set
# CONFIG_8139TOO is not set
# CONFIG_R8169 is not set
# CONFIG_SH_ETH is not set
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_SEEQ=y
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
# CONFIG_SFC is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_EPIC100 is not set
# CONFIG_SMSC911X is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_FDDI=y
# CONFIG_DEFXX is not set
# CONFIG_SKFP is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_PHYLIB=y

#
# MII PHY device drivers
#
# CONFIG_AT803X_PHY is not set
# CONFIG_AMD_PHY is not set
CONFIG_MARVELL_PHY=y
CONFIG_DAVICOM_PHY=y
CONFIG_QSEMI_PHY=y
CONFIG_LXT_PHY=y
CONFIG_CICADA_PHY=y
CONFIG_VITESSE_PHY=y
CONFIG_SMSC_PHY=y
CONFIG_BROADCOM_PHY=y
# CONFIG_BCM87XX_PHY is not set
CONFIG_ICPLUS_PHY=y
CONFIG_REALTEK_PHY=y
CONFIG_NATIONAL_PHY=y
CONFIG_STE10XP=y
CONFIG_LSI_ET1011C_PHY=y
# CONFIG_MICREL_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MICREL_KS8995MA is not set
# CONFIG_PLIP is not set
CONFIG_PPP=y
# CONFIG_PPP_BSDCOMP is not set
# CONFIG_PPP_DEFLATE is not set
CONFIG_PPP_FILTER=y
# CONFIG_PPP_MPPE is not set
CONFIG_PPP_MULTILINK=y
# CONFIG_PPPOE is not set
# CONFIG_PPP_ASYNC is not set
# CONFIG_PPP_SYNC_TTY is not set
# CONFIG_SLIP is not set
CONFIG_SLHC=y

#
# USB Network Adapters
#
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_RTL8152 is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
# CONFIG_USB_IPHETH is not set
CONFIG_WLAN=y
# CONFIG_AIRO is not set
# CONFIG_ATMEL is not set
# CONFIG_PRISM54 is not set
# CONFIG_USB_ZD1201 is not set
# CONFIG_HOSTAP is not set
# CONFIG_WL_TI is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
CONFIG_WAN=y
# CONFIG_HDLC is not set
# CONFIG_DLCI is not set
# CONFIG_SBNI is not set
# CONFIG_VMXNET3 is not set
CONFIG_ISDN=y
# CONFIG_ISDN_I4L is not set
# CONFIG_ISDN_CAPI is not set
# CONFIG_ISDN_DRV_GIGASET is not set
# CONFIG_HYSDN is not set
# CONFIG_MISDN is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_FF_MEMLESS=m
# CONFIG_INPUT_POLLDEV is not set
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
CONFIG_INPUT_MOUSEDEV_PSAUX=y
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5520 is not set
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_GPIO is not set
# CONFIG_KEYBOARD_GPIO_POLLED is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_MATRIX is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_TC3589X is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=m
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_SENTELIC=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_GPIO is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_DB9 is not set
# CONFIG_JOYSTICK_GAMECON is not set
# CONFIG_JOYSTICK_TURBOGRAFX is not set
# CONFIG_JOYSTICK_AS5011 is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
# CONFIG_JOYSTICK_WALKERA0701 is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_WACOM is not set
CONFIG_INPUT_TOUCHSCREEN=y
# CONFIG_TOUCHSCREEN_88PM860X is not set
# CONFIG_TOUCHSCREEN_ADS7846 is not set
# CONFIG_TOUCHSCREEN_AD7877 is not set
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_AUO_PIXCIR is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_CY8CTMG110 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_CYTTSP4_CORE is not set
# CONFIG_TOUCHSCREEN_DA9034 is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MMS114 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_WM831X is not set
# CONFIG_TOUCHSCREEN_WM97XX is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2005 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_SUR40 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
# CONFIG_TOUCHSCREEN_ZFORCE is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_88PM860X_ONKEY is not set
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MAX8925_ONKEY is not set
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_MPU3050 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_GP2A is not set
# CONFIG_INPUT_GPIO_TILT_POLLED is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=y
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_GPIO_ROTARY_ENCODER is not set
# CONFIG_INPUT_WM831X_ON is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
# CONFIG_SERIO_SERPORT is not set
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PARKBD is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
CONFIG_SERIO_RAW=m
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_DEVPTS_MULTIPLE_INSTANCES=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=0
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_NOZOMI is not set
# CONFIG_ISI is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
# CONFIG_DEVKMEM is not set

#
# Serial drivers
#
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_NR_UARTS=48
CONFIG_SERIAL_8250_RUNTIME_UARTS=32
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
# CONFIG_SERIAL_8250_DW is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_MAX3100 is not set
# CONFIG_SERIAL_MAX310X is not set
# CONFIG_SERIAL_MFD_HSU is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_TIMBERDALE is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_IFX6X60 is not set
# CONFIG_SERIAL_PCH_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
CONFIG_TTY_PRINTK=y
CONFIG_PRINTER=m
# CONFIG_LP_CONSOLE is not set
CONFIG_PPDEV=m
# CONFIG_VIRTIO_CONSOLE is not set
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_VIA is not set
# CONFIG_HW_RANDOM_VIRTIO is not set
CONFIG_HW_RANDOM_TPM=y
# CONFIG_NVRAM is not set
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set
# CONFIG_MWAVE is not set
# CONFIG_RAW_DRIVER is not set
CONFIG_HPET=y
CONFIG_HPET_MMAP=y
CONFIG_HPET_MMAP_DEFAULT=y
# CONFIG_HANGCHECK_TIMER is not set
CONFIG_TCG_TPM=y
CONFIG_TCG_TIS=m
# CONFIG_TCG_TIS_I2C_ATMEL is not set
# CONFIG_TCG_TIS_I2C_INFINEON is not set
# CONFIG_TCG_TIS_I2C_NUVOTON is not set
# CONFIG_TCG_NSC is not set
# CONFIG_TCG_ATMEL is not set
CONFIG_TCG_INFINEON=m
# CONFIG_TCG_ST33_I2C is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
CONFIG_I2C=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
# CONFIG_I2C_HELPER_AUTO is not set
# CONFIG_I2C_SMBUS is not set

#
# I2C Algorithms
#
CONFIG_I2C_ALGOBIT=m
# CONFIG_I2C_ALGOPCF is not set
# CONFIG_I2C_ALGOPCA is not set

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
# CONFIG_I2C_PIIX4 is not set
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_CBUS_GPIO is not set
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EG20T is not set
# CONFIG_I2C_GPIO is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_PXA_PCI is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_PARPORT is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
CONFIG_SPI=y
# CONFIG_SPI_DEBUG is not set
CONFIG_SPI_MASTER=y

#
# SPI Master Controller Drivers
#
# CONFIG_SPI_ALTERA is not set
# CONFIG_SPI_BITBANG is not set
# CONFIG_SPI_BUTTERFLY is not set
# CONFIG_SPI_GPIO is not set
# CONFIG_SPI_LM70_LLP is not set
# CONFIG_SPI_OC_TINY is not set
# CONFIG_SPI_PXA2XX is not set
# CONFIG_SPI_PXA2XX_PCI is not set
# CONFIG_SPI_SC18IS602 is not set
# CONFIG_SPI_TOPCLIFF_PCH is not set
# CONFIG_SPI_XCOMM is not set
# CONFIG_SPI_XILINX is not set
CONFIG_SPI_DESIGNWARE=y
# CONFIG_SPI_DW_PCI is not set

#
# SPI Protocol Masters
#
# CONFIG_SPI_SPIDEV is not set
# CONFIG_SPI_TLE62X0 is not set
# CONFIG_HSI is not set

#
# PPS support
#
CONFIG_PPS=m
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_PARPORT is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=m

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
# CONFIG_PTP_1588_CLOCK_PCH is not set
CONFIG_ARCH_WANT_OPTIONAL_GPIOLIB=y
CONFIG_GPIOLIB=y
CONFIG_GPIO_DEVRES=y
CONFIG_GPIO_ACPI=y
# CONFIG_DEBUG_GPIO is not set
CONFIG_GPIO_SYSFS=y

#
# Memory mapped GPIO drivers:
#
# CONFIG_GPIO_GENERIC_PLATFORM is not set
# CONFIG_GPIO_IT8761E is not set
# CONFIG_GPIO_F7188X is not set
# CONFIG_GPIO_SCH311X is not set
# CONFIG_GPIO_TS5500 is not set
# CONFIG_GPIO_SCH is not set
# CONFIG_GPIO_ICH is not set
# CONFIG_GPIO_VX855 is not set
# CONFIG_GPIO_LYNXPOINT is not set

#
# I2C GPIO expanders:
#
# CONFIG_GPIO_MAX7300 is not set
# CONFIG_GPIO_MAX732X is not set
# CONFIG_GPIO_PCA953X is not set
# CONFIG_GPIO_PCF857X is not set
CONFIG_GPIO_SX150X=y
CONFIG_GPIO_TC3589X=y
# CONFIG_GPIO_TPS65912 is not set
# CONFIG_GPIO_WM831X is not set
# CONFIG_GPIO_WM8350 is not set
# CONFIG_GPIO_WM8994 is not set
# CONFIG_GPIO_ADP5520 is not set
# CONFIG_GPIO_ADP5588 is not set

#
# PCI GPIO expanders:
#
# CONFIG_GPIO_BT8XX is not set
# CONFIG_GPIO_AMD8111 is not set
# CONFIG_GPIO_INTEL_MID is not set
# CONFIG_GPIO_PCH is not set
# CONFIG_GPIO_ML_IOH is not set
# CONFIG_GPIO_RDC321X is not set

#
# SPI GPIO expanders:
#
# CONFIG_GPIO_MAX7301 is not set
# CONFIG_GPIO_MC33880 is not set

#
# AC97 GPIO expanders:
#

#
# LPC GPIO expanders:
#

#
# MODULbus GPIO expanders:
#
# CONFIG_GPIO_TPS6586X is not set
CONFIG_GPIO_TPS65910=y

#
# USB GPIO expanders:
#
# CONFIG_W1 is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_MAX8925_POWER is not set
# CONFIG_WM831X_BACKUP is not set
# CONFIG_WM831X_POWER is not set
# CONFIG_WM8350_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_88PM860X is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_BATTERY_BQ27x00 is not set
# CONFIG_BATTERY_DA9030 is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_GPIO is not set
# CONFIG_CHARGER_MANAGER is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_CHARGER_BQ24190 is not set
# CONFIG_CHARGER_BQ24735 is not set
# CONFIG_CHARGER_SMB347 is not set
# CONFIG_POWER_RESET is not set
# CONFIG_POWER_AVS is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7314 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADCXX is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7310 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
# CONFIG_SENSORS_K10TEMP is not set
# CONFIG_SENSORS_FAM15H_POWER is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_GPIO_FAN is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_HTU21 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM70 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_MAX1111 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SHT15 is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH56XX_COMMON is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_ADS1015 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_ADS7871 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_WM831X is not set
# CONFIG_SENSORS_WM8350 is not set
# CONFIG_SENSORS_APPLESMC is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_THERMAL_EMULATION is not set
# CONFIG_INTEL_POWERCLAMP is not set
# CONFIG_X86_PKG_TEMP_THERMAL is not set
# CONFIG_ACPI_INT3403_THERMAL is not set

#
# Texas Instruments thermal drivers
#
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_WM831X_WATCHDOG is not set
# CONFIG_WM8350_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_F71808E_WDT is not set
# CONFIG_SP5100_TCO is not set
# CONFIG_SC520_WDT is not set
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_SBC8360_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83697HF_WDT is not set
# CONFIG_W83697UG_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_MEN_A21_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y

#
# Broadcom specific AMBA
#
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
CONFIG_MFD_CORE=y
# CONFIG_MFD_CS5535 is not set
# CONFIG_MFD_AS3711 is not set
CONFIG_PMIC_ADP5520=y
CONFIG_MFD_AAT2870_CORE=y
# CONFIG_MFD_CROS_EC is not set
CONFIG_PMIC_DA903X=y
# CONFIG_MFD_DA9052_SPI is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_MC13XXX_SPI is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_HTC_PASIC3 is not set
CONFIG_HTC_I2CPLD=y
CONFIG_LPC_ICH=m
# CONFIG_LPC_SCH is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
CONFIG_MFD_88PM860X=y
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77686 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX8907 is not set
CONFIG_MFD_MAX8925=y
CONFIG_MFD_MAX8997=y
CONFIG_MFD_MAX8998=y
# CONFIG_EZX_PCAP is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_UCB1400_CORE is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RTSX_PCI is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SEC_CORE is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SMSC is not set
CONFIG_ABX500_CORE=y
CONFIG_AB3100_CORE=y
# CONFIG_AB3100_OTP is not set
# CONFIG_MFD_STMPE is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS65010 is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TPS65217 is not set
CONFIG_MFD_TPS6586X=y
CONFIG_MFD_TPS65910=y
CONFIG_MFD_TPS65912=y
CONFIG_MFD_TPS65912_I2C=y
CONFIG_MFD_TPS65912_SPI=y
# CONFIG_MFD_TPS80031 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TIMBERDALE is not set
CONFIG_MFD_TC3589X=y
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_ARIZONA_SPI is not set
# CONFIG_MFD_WM8400 is not set
CONFIG_MFD_WM831X=y
CONFIG_MFD_WM831X_I2C=y
CONFIG_MFD_WM831X_SPI=y
CONFIG_MFD_WM8350=y
CONFIG_MFD_WM8350_I2C=y
CONFIG_MFD_WM8994=y
CONFIG_REGULATOR=y
# CONFIG_REGULATOR_DEBUG is not set
# CONFIG_REGULATOR_FIXED_VOLTAGE is not set
# CONFIG_REGULATOR_VIRTUAL_CONSUMER is not set
# CONFIG_REGULATOR_USERSPACE_CONSUMER is not set
CONFIG_REGULATOR_88PM8607=y
# CONFIG_REGULATOR_ACT8865 is not set
# CONFIG_REGULATOR_AD5398 is not set
# CONFIG_REGULATOR_AAT2870 is not set
# CONFIG_REGULATOR_AB3100 is not set
# CONFIG_REGULATOR_DA903X is not set
# CONFIG_REGULATOR_DA9210 is not set
# CONFIG_REGULATOR_FAN53555 is not set
# CONFIG_REGULATOR_GPIO is not set
# CONFIG_REGULATOR_ISL6271A is not set
# CONFIG_REGULATOR_LP3971 is not set
# CONFIG_REGULATOR_LP3972 is not set
# CONFIG_REGULATOR_LP872X is not set
# CONFIG_REGULATOR_LP8755 is not set
# CONFIG_REGULATOR_MAX1586 is not set
# CONFIG_REGULATOR_MAX8649 is not set
# CONFIG_REGULATOR_MAX8660 is not set
# CONFIG_REGULATOR_MAX8925 is not set
# CONFIG_REGULATOR_MAX8952 is not set
# CONFIG_REGULATOR_MAX8973 is not set
# CONFIG_REGULATOR_MAX8997 is not set
# CONFIG_REGULATOR_MAX8998 is not set
# CONFIG_REGULATOR_PFUZE100 is not set
# CONFIG_REGULATOR_TPS51632 is not set
# CONFIG_REGULATOR_TPS62360 is not set
# CONFIG_REGULATOR_TPS65023 is not set
# CONFIG_REGULATOR_TPS6507X is not set
# CONFIG_REGULATOR_TPS6524X is not set
# CONFIG_REGULATOR_TPS6586X is not set
# CONFIG_REGULATOR_TPS65910 is not set
# CONFIG_REGULATOR_TPS65912 is not set
# CONFIG_REGULATOR_WM831X is not set
# CONFIG_REGULATOR_WM8350 is not set
# CONFIG_REGULATOR_WM8994 is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
# CONFIG_AGP_SIS is not set
CONFIG_AGP_VIA=y
CONFIG_INTEL_GTT=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_VGA_SWITCHEROO=y
CONFIG_DRM=m
CONFIG_DRM_KMS_HELPER=m
CONFIG_DRM_KMS_FB_HELPER=y
CONFIG_DRM_LOAD_EDID_FIRMWARE=y
CONFIG_DRM_TTM=m

#
# I2C encoder or helper chips
#
# CONFIG_DRM_I2C_CH7006 is not set
# CONFIG_DRM_I2C_SIL164 is not set
# CONFIG_DRM_I2C_NXP_TDA998X is not set
# CONFIG_DRM_TDFX is not set
# CONFIG_DRM_R128 is not set
# CONFIG_DRM_RADEON is not set
CONFIG_DRM_NOUVEAU=m
CONFIG_NOUVEAU_DEBUG=5
CONFIG_NOUVEAU_DEBUG_DEFAULT=3
CONFIG_DRM_NOUVEAU_BACKLIGHT=y
# CONFIG_DRM_I810 is not set
# CONFIG_DRM_I915 is not set
# CONFIG_DRM_MGA is not set
# CONFIG_DRM_SIS is not set
# CONFIG_DRM_VIA is not set
# CONFIG_DRM_SAVAGE is not set
# CONFIG_DRM_VMWGFX is not set
# CONFIG_DRM_GMA500 is not set
# CONFIG_DRM_UDL is not set
# CONFIG_DRM_AST is not set
# CONFIG_DRM_MGAG200 is not set
# CONFIG_DRM_CIRRUS_QEMU is not set
# CONFIG_DRM_QXL is not set
# CONFIG_DRM_BOCHS is not set
# CONFIG_VGASTATE is not set
CONFIG_VIDEO_OUTPUT_CONTROL=m
CONFIG_HDMI=y
CONFIG_FB=y
CONFIG_FIRMWARE_EDID=y
# CONFIG_FB_DDC is not set
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
CONFIG_FB_BACKLIGHT=y
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
CONFIG_FB_ASILIANT=y
CONFIG_FB_IMSTT=y
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_INTEL is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_VIA is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_TMIO is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_GOLDFISH is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_EXYNOS_VIDEO is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
# CONFIG_BACKLIGHT_DA903X is not set
# CONFIG_BACKLIGHT_MAX8925 is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_WM831X is not set
# CONFIG_BACKLIGHT_ADP5520 is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_88PM860X is not set
# CONFIG_BACKLIGHT_AAT2870 is not set
# CONFIG_BACKLIGHT_LM3630A is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LP855X is not set
# CONFIG_BACKLIGHT_GPIO is not set
# CONFIG_BACKLIGHT_LV5207LP is not set
# CONFIG_BACKLIGHT_BD6107 is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
# CONFIG_VGACON_SOFT_SCROLLBACK is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE=y
CONFIG_FRAMEBUFFER_CONSOLE_DETECT_PRIMARY=y
# CONFIG_FRAMEBUFFER_CONSOLE_ROTATION is not set
# CONFIG_LOGO is not set
CONFIG_SOUND=m
# CONFIG_SOUND_OSS_CORE is not set
CONFIG_SND=m
CONFIG_SND_TIMER=m
CONFIG_SND_PCM=m
CONFIG_SND_HWDEP=m
CONFIG_SND_RAWMIDI=m
CONFIG_SND_JACK=y
CONFIG_SND_SEQUENCER=m
# CONFIG_SND_SEQ_DUMMY is not set
# CONFIG_SND_MIXER_OSS is not set
# CONFIG_SND_PCM_OSS is not set
# CONFIG_SND_SEQUENCER_OSS is not set
# CONFIG_SND_HRTIMER is not set
CONFIG_SND_DYNAMIC_MINORS=y
CONFIG_SND_MAX_CARDS=32
CONFIG_SND_SUPPORT_OLD_API=y
CONFIG_SND_VERBOSE_PROCFS=y
# CONFIG_SND_VERBOSE_PRINTK is not set
# CONFIG_SND_DEBUG is not set
CONFIG_SND_VMASTER=y
CONFIG_SND_KCTL_JACK=y
CONFIG_SND_DMA_SGBUF=y
CONFIG_SND_RAWMIDI_SEQ=m
CONFIG_SND_OPL3_LIB_SEQ=m
# CONFIG_SND_OPL4_LIB_SEQ is not set
# CONFIG_SND_SBAWE_SEQ is not set
# CONFIG_SND_EMU10K1_SEQ is not set
CONFIG_SND_OPL3_LIB=m
CONFIG_SND_AC97_CODEC=m
CONFIG_SND_DRIVERS=y
# CONFIG_SND_PCSP is not set
# CONFIG_SND_DUMMY is not set
# CONFIG_SND_ALOOP is not set
CONFIG_SND_VIRMIDI=m
# CONFIG_SND_MTPAV is not set
# CONFIG_SND_MTS64 is not set
# CONFIG_SND_SERIAL_U16550 is not set
# CONFIG_SND_MPU401 is not set
# CONFIG_SND_PORTMAN2X4 is not set
CONFIG_SND_AC97_POWER_SAVE=y
CONFIG_SND_AC97_POWER_SAVE_DEFAULT=0
CONFIG_SND_PCI=y
# CONFIG_SND_AD1889 is not set
CONFIG_SND_ALS300=m
# CONFIG_SND_ALS4000 is not set
# CONFIG_SND_ALI5451 is not set
# CONFIG_SND_ASIHPI is not set
# CONFIG_SND_ATIIXP is not set
# CONFIG_SND_ATIIXP_MODEM is not set
# CONFIG_SND_AU8810 is not set
# CONFIG_SND_AU8820 is not set
# CONFIG_SND_AU8830 is not set
# CONFIG_SND_AW2 is not set
# CONFIG_SND_AZT3328 is not set
# CONFIG_SND_BT87X is not set
# CONFIG_SND_CA0106 is not set
# CONFIG_SND_CMIPCI is not set
# CONFIG_SND_OXYGEN is not set
# CONFIG_SND_CS4281 is not set
# CONFIG_SND_CS46XX is not set
# CONFIG_SND_CS5530 is not set
# CONFIG_SND_CS5535AUDIO is not set
# CONFIG_SND_CTXFI is not set
# CONFIG_SND_DARLA20 is not set
# CONFIG_SND_GINA20 is not set
# CONFIG_SND_LAYLA20 is not set
# CONFIG_SND_DARLA24 is not set
# CONFIG_SND_GINA24 is not set
# CONFIG_SND_LAYLA24 is not set
# CONFIG_SND_MONA is not set
# CONFIG_SND_MIA is not set
# CONFIG_SND_ECHO3G is not set
# CONFIG_SND_INDIGO is not set
# CONFIG_SND_INDIGOIO is not set
# CONFIG_SND_INDIGODJ is not set
# CONFIG_SND_INDIGOIOX is not set
# CONFIG_SND_INDIGODJX is not set
# CONFIG_SND_EMU10K1 is not set
# CONFIG_SND_EMU10K1X is not set
# CONFIG_SND_ENS1370 is not set
# CONFIG_SND_ENS1371 is not set
# CONFIG_SND_ES1938 is not set
# CONFIG_SND_ES1968 is not set
# CONFIG_SND_FM801 is not set
CONFIG_SND_HDA_INTEL=m
CONFIG_SND_HDA_PREALLOC_SIZE=64
CONFIG_SND_HDA_HWDEP=y
CONFIG_SND_HDA_RECONFIG=y
CONFIG_SND_HDA_INPUT_BEEP=y
CONFIG_SND_HDA_INPUT_BEEP_MODE=0
CONFIG_SND_HDA_INPUT_JACK=y
CONFIG_SND_HDA_PATCH_LOADER=y
CONFIG_SND_HDA_CODEC_REALTEK=m
# CONFIG_SND_HDA_CODEC_ANALOG is not set
# CONFIG_SND_HDA_CODEC_SIGMATEL is not set
# CONFIG_SND_HDA_CODEC_VIA is not set
CONFIG_SND_HDA_CODEC_HDMI=m
# CONFIG_SND_HDA_CODEC_CIRRUS is not set
# CONFIG_SND_HDA_CODEC_CONEXANT is not set
# CONFIG_SND_HDA_CODEC_CA0110 is not set
# CONFIG_SND_HDA_CODEC_CA0132 is not set
# CONFIG_SND_HDA_CODEC_CMEDIA is not set
# CONFIG_SND_HDA_CODEC_SI3054 is not set
CONFIG_SND_HDA_GENERIC=m
CONFIG_SND_HDA_POWER_SAVE_DEFAULT=0
# CONFIG_SND_HDSP is not set
# CONFIG_SND_HDSPM is not set
# CONFIG_SND_ICE1712 is not set
# CONFIG_SND_ICE1724 is not set
# CONFIG_SND_INTEL8X0 is not set
# CONFIG_SND_INTEL8X0M is not set
# CONFIG_SND_KORG1212 is not set
# CONFIG_SND_LOLA is not set
# CONFIG_SND_LX6464ES is not set
# CONFIG_SND_MAESTRO3 is not set
# CONFIG_SND_MIXART is not set
# CONFIG_SND_NM256 is not set
# CONFIG_SND_PCXHR is not set
# CONFIG_SND_RIPTIDE is not set
# CONFIG_SND_RME32 is not set
# CONFIG_SND_RME96 is not set
# CONFIG_SND_RME9652 is not set
# CONFIG_SND_SONICVIBES is not set
# CONFIG_SND_TRIDENT is not set
# CONFIG_SND_VIA82XX is not set
# CONFIG_SND_VIA82XX_MODEM is not set
# CONFIG_SND_VIRTUOSO is not set
# CONFIG_SND_VX222 is not set
# CONFIG_SND_YMFPCI is not set
CONFIG_SND_SPI=y
CONFIG_SND_USB=y
# CONFIG_SND_USB_AUDIO is not set
# CONFIG_SND_USB_UA101 is not set
# CONFIG_SND_USB_USX2Y is not set
# CONFIG_SND_USB_CAIAQ is not set
# CONFIG_SND_USB_US122L is not set
# CONFIG_SND_USB_6FIRE is not set
# CONFIG_SND_USB_HIFACE is not set
# CONFIG_SND_SOC is not set
# CONFIG_SOUND_PRIME is not set
CONFIG_AC97_BUS=m

#
# HID support
#
CONFIG_HID=m
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=m

#
# Special HID drivers
#
# CONFIG_HID_A4TECH is not set
# CONFIG_HID_ACRUX is not set
# CONFIG_HID_APPLE is not set
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_AUREAL is not set
# CONFIG_HID_BELKIN is not set
# CONFIG_HID_CHERRY is not set
# CONFIG_HID_CHICONY is not set
# CONFIG_HID_PRODIKEYS is not set
# CONFIG_HID_CYPRESS is not set
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
# CONFIG_HID_EZKEY is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_HUION is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
# CONFIG_HID_TWINHAN is not set
# CONFIG_HID_KENSINGTON is not set
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LENOVO_TPKBD is not set
CONFIG_HID_LOGITECH=m
CONFIG_HID_LOGITECH_DJ=m
CONFIG_LOGITECH_FF=y
CONFIG_LOGIRUMBLEPAD2_FF=y
CONFIG_LOGIG940_FF=y
CONFIG_LOGIWHEELS_FF=y
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MICROSOFT is not set
# CONFIG_HID_MONTEREY is not set
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTRIG is not set
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SONY is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_THINGM is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_WIIMOTE is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set

#
# USB HID support
#
CONFIG_USB_HID=m
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# USB HID Boot Protocol drivers
#
# CONFIG_USB_KBD is not set
# CONFIG_USB_MOUSE is not set

#
# I2C HID support
#
# CONFIG_I2C_HID is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
# CONFIG_USB_DEBUG is not set
# CONFIG_USB_ANNOUNCE_NEW_DEVICES is not set

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_WHITELIST is not set
# CONFIG_USB_OTG_BLACKLIST_HUB is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1760_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
# CONFIG_USB_FUSBH200_HCD is not set
# CONFIG_USB_FOTG210_HCD is not set
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_PCI=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
# CONFIG_USB_STORAGE is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set

#
# USB port drivers
#
# CONFIG_USB_USS720 is not set
CONFIG_USB_SERIAL=m
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_SIMPLE is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
CONFIG_USB_SERIAL_FTDI_SIO=m
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_F81232 is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_METRO is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MXUPORT is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_XSENS_MT is not set
# CONFIG_USB_SERIAL_WISHBONE is not set
# CONFIG_USB_SERIAL_ZTE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
# CONFIG_USB_SERIAL_QT2 is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_LED is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HSIC_USB3503 is not set

#
# USB Physical Layer drivers
#
# CONFIG_USB_PHY is not set
# CONFIG_USB_OTG_FSM is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_SAMSUNG_USB2PHY is not set
# CONFIG_SAMSUNG_USB3PHY is not set
# CONFIG_USB_GPIO_VBUS is not set
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_RCAR_PHY is not set
# CONFIG_USB_GADGET is not set
# CONFIG_UWB is not set
CONFIG_MMC=y
# CONFIG_MMC_DEBUG is not set
# CONFIG_MMC_UNSAFE_RESUME is not set
# CONFIG_MMC_CLKGATE is not set

#
# MMC/SD/SDIO Card Drivers
#
# CONFIG_MMC_BLOCK is not set
# CONFIG_SDIO_UART is not set
# CONFIG_MMC_TEST is not set

#
# MMC/SD/SDIO Host Controller Drivers
#
# CONFIG_MMC_SDHCI is not set
# CONFIG_MMC_WBSD is not set
# CONFIG_MMC_TIFM_SD is not set
# CONFIG_MMC_SPI is not set
# CONFIG_MMC_CB710 is not set
# CONFIG_MMC_VIA_SDMMC is not set
# CONFIG_MMC_VUB300 is not set
# CONFIG_MMC_USHC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y

#
# LED drivers
#
# CONFIG_LEDS_88PM860X is not set
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_GPIO is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_LP5523 is not set
# CONFIG_LEDS_LP5562 is not set
# CONFIG_LEDS_LP8501 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA963X is not set
# CONFIG_LEDS_PCA9685 is not set
# CONFIG_LEDS_WM831X_STATUS is not set
# CONFIG_LEDS_WM8350 is not set
# CONFIG_LEDS_DA903X is not set
# CONFIG_LEDS_DAC124S085 is not set
# CONFIG_LEDS_REGULATOR is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_LT3593 is not set
# CONFIG_LEDS_ADP5520 is not set
# CONFIG_LEDS_DELL_NETBOOKS is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_MAX8997 is not set
# CONFIG_LEDS_LM355x is not set
# CONFIG_LEDS_OT200 is not set
# CONFIG_LEDS_BLINKM is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_GPIO is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_LEDS_TRIGGER_CAMERA is not set
# CONFIG_ACCESSIBILITY is not set
# CONFIG_INFINIBAND is not set
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
# CONFIG_EDAC_DECODE_MCE is not set
CONFIG_EDAC_MM_EDAC=m
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I7CORE is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
CONFIG_EDAC_SBRIDGE=m
CONFIG_RTC_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_88PM860X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_MAX8925 is not set
# CONFIG_RTC_DRV_MAX8998 is not set
# CONFIG_RTC_DRV_MAX8997 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_ISL12057 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_TPS6586X is not set
# CONFIG_RTC_DRV_TPS65910 is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# SPI RTC drivers
#
# CONFIG_RTC_DRV_M41T93 is not set
# CONFIG_RTC_DRV_M41T94 is not set
# CONFIG_RTC_DRV_DS1305 is not set
# CONFIG_RTC_DRV_DS1390 is not set
# CONFIG_RTC_DRV_MAX6902 is not set
# CONFIG_RTC_DRV_R9701 is not set
# CONFIG_RTC_DRV_RS5C348 is not set
# CONFIG_RTC_DRV_DS3234 is not set
# CONFIG_RTC_DRV_PCF2123 is not set
# CONFIG_RTC_DRV_RX4581 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_WM831X is not set
# CONFIG_RTC_DRV_WM8350 is not set
# CONFIG_RTC_DRV_AB3100 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_MOXART is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_HID_SENSOR_TIME is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
# CONFIG_INTEL_MID_DMAC is not set
CONFIG_INTEL_IOATDMA=m
# CONFIG_DW_DMAC_CORE is not set
# CONFIG_DW_DMAC is not set
# CONFIG_DW_DMAC_PCI is not set
# CONFIG_TIMB_DMA is not set
# CONFIG_PCH_DMA is not set
CONFIG_DMA_ENGINE=y
CONFIG_DMA_ACPI=y

#
# DMA Clients
#
# CONFIG_ASYNC_TX_DMA is not set
# CONFIG_DMATEST is not set
CONFIG_DMA_ENGINE_RAID=y
CONFIG_DCA=m
CONFIG_AUXDISPLAY=y
# CONFIG_KS0108 is not set
# CONFIG_UIO is not set
# CONFIG_VFIO is not set
CONFIG_VIRT_DRIVERS=y
CONFIG_VIRTIO=y

#
# Virtio drivers
#
CONFIG_VIRTIO_PCI=y
# CONFIG_VIRTIO_BALLOON is not set
# CONFIG_VIRTIO_MMIO is not set

#
# Microsoft Hyper-V guest support
#
CONFIG_STAGING=y
# CONFIG_ET131X is not set
# CONFIG_SLICOSS is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_ECHO is not set
# CONFIG_COMEDI is not set
# CONFIG_PANEL is not set
# CONFIG_R8187SE is not set
# CONFIG_RTL8192U is not set
# CONFIG_RTLLIB is not set
# CONFIG_R8712U is not set
# CONFIG_R8188EU is not set
# CONFIG_RTS5139 is not set
# CONFIG_RTS5208 is not set
# CONFIG_TRANZPORT is not set
# CONFIG_IDE_PHISON is not set
# CONFIG_LINE6_USB is not set
# CONFIG_USB_SERIAL_QUATECH2 is not set
# CONFIG_VT6655 is not set
# CONFIG_VT6656 is not set
# CONFIG_DX_SEP is not set
# CONFIG_FB_SM7XX is not set
# CONFIG_CRYSTALHD is not set
# CONFIG_FB_XGI is not set
# CONFIG_ACPI_QUICKSTART is not set
# CONFIG_USB_ENESTORAGE is not set
# CONFIG_BCM_WIMAX is not set
# CONFIG_FT1000 is not set

#
# Speakup console speech
#
# CONFIG_SPEAKUP is not set
# CONFIG_TOUCHSCREEN_CLEARPAD_TM1217 is not set
# CONFIG_TOUCHSCREEN_SYNAPTICS_I2C_RMI4 is not set
CONFIG_STAGING_MEDIA=y

#
# Android
#
# CONFIG_ANDROID is not set
# CONFIG_USB_WPAN_HCD is not set
# CONFIG_WIMAX_GDM72XX is not set
# CONFIG_LTE_GDM724X is not set
CONFIG_NET_VENDOR_SILICOM=y
# CONFIG_SBYPASS is not set
# CONFIG_BPCTL is not set
# CONFIG_CED1401 is not set
# CONFIG_DGRP is not set
# CONFIG_LUSTRE_FS is not set
# CONFIG_XILLYBUS is not set
# CONFIG_DGNC is not set
# CONFIG_DGAP is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACER_WMI is not set
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_DELL_WMI is not set
# CONFIG_DELL_WMI_AIO is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_AMILO_RFKILL is not set
# CONFIG_HP_ACCEL is not set
# CONFIG_HP_WIRELESS is not set
# CONFIG_HP_WMI is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
# CONFIG_IDEAPAD_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ASUS_WMI is not set
CONFIG_ACPI_WMI=m
# CONFIG_MSI_WMI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_ACPI_TOSHIBA is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_IBM_RTL is not set
# CONFIG_XO15_EBOOK is not set
# CONFIG_SAMSUNG_LAPTOP is not set
CONFIG_MXM_WMI=m
# CONFIG_INTEL_OAKTRAIL is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_APPLE_GMUX is not set
# CONFIG_INTEL_RST is not set
# CONFIG_INTEL_SMARTCONNECT is not set
# CONFIG_PVPANIC is not set
# CONFIG_CHROME_PLATFORMS is not set

#
# Hardware Spinlock drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# CONFIG_MAILBOX is not set
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_STATS=y
# CONFIG_AMD_IOMMU_V2 is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_IRQ_REMAP=y

#
# Remoteproc drivers
#
# CONFIG_STE_MODEM_RPROC is not set

#
# Rpmsg drivers
#
CONFIG_PM_DEVFREQ=y

#
# DEVFREQ Governors
#
CONFIG_DEVFREQ_GOV_SIMPLE_ONDEMAND=y
CONFIG_DEVFREQ_GOV_PERFORMANCE=y
CONFIG_DEVFREQ_GOV_POWERSAVE=y
CONFIG_DEVFREQ_GOV_USERSPACE=y

#
# DEVFREQ Drivers
#
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set
# CONFIG_IPACK_BUS is not set
CONFIG_RESET_CONTROLLER=y
# CONFIG_FMC is not set

#
# PHY Subsystem
#
CONFIG_GENERIC_PHY=y
# CONFIG_PHY_EXYNOS_MIPI_VIDEO is not set
# CONFIG_BCM_KONA_USB2_PHY is not set
# CONFIG_POWERCAP is not set

#
# Firmware Drivers
#
CONFIG_EDD=y
CONFIG_EDD_OFF=y
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
# CONFIG_DMI_SYSFS is not set
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
CONFIG_ISCSI_IBFT_FIND=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_VARS=y
# CONFIG_EFI_VARS_PSTORE is not set
CONFIG_EFI_RUNTIME_MAP=y
CONFIG_UEFI_CPER=y

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
CONFIG_EXT3_FS_XATTR=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT23=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
CONFIG_FILE_LOCKING=y
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_PRINT_QUOTA_WARNING is not set
# CONFIG_QUOTA_DEBUG is not set
# CONFIG_QFMT_V1 is not set
# CONFIG_QFMT_V2 is not set
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
# CONFIG_AUTOFS4_FS is not set
CONFIG_FUSE_FS=y
# CONFIG_CUSE is not set

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
CONFIG_FAT_FS=m
# CONFIG_MSDOS_FS is not set
CONFIG_VFAT_FS=m
CONFIG_FAT_DEFAULT_CODEPAGE=437
CONFIG_FAT_DEFAULT_IOCHARSET="iso8859-1"
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
# CONFIG_CONFIGFS_FS is not set
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
CONFIG_ECRYPT_FS=y
# CONFIG_ECRYPT_FS_MESSAGING is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_LOGFS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_EFIVAR_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
# CONFIG_NFS_FS is not set
# CONFIG_NFSD is not set
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=m
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_ISO8859_1=m
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
# CONFIG_NLS_UTF8 is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
CONFIG_DEFAULT_MESSAGE_LOGLEVEL=4
CONFIG_BOOT_PRINTK_DELAY=y
# CONFIG_DYNAMIC_DEBUG is not set

#
# Compile-time checks and compiler options
#
CONFIG_DEBUG_INFO=y
# CONFIG_DEBUG_INFO_REDUCED is not set
# CONFIG_ENABLE_WARN_DEPRECATED is not set
# CONFIG_ENABLE_MUST_CHECK is not set
CONFIG_FRAME_WARN=1024
# CONFIG_STRIP_ASM_SYMS is not set
# CONFIG_READABLE_ASM is not set
CONFIG_UNUSED_SYMBOLS=y
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_ARCH_WANT_FRAME_POINTERS=y
CONFIG_FRAME_POINTER=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_DEBUG_KERNEL=y

#
# Memory Debugging
#
CONFIG_DEBUG_PAGEALLOC=y
CONFIG_WANT_PAGE_DEBUG_FLAGS=y
CONFIG_PAGE_GUARD=y
CONFIG_DEBUG_OBJECTS=y
# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
# CONFIG_DEBUG_OBJECTS_FREE is not set
# CONFIG_DEBUG_OBJECTS_TIMERS is not set
# CONFIG_DEBUG_OBJECTS_WORK is not set
# CONFIG_DEBUG_OBJECTS_RCU_HEAD is not set
# CONFIG_DEBUG_OBJECTS_PERCPU_COUNTER is not set
CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_RB=y
CONFIG_DEBUG_VIRTUAL=y
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
# CONFIG_DEBUG_SHIRQ is not set

#
# Debug Lockups and Hangs
#
# CONFIG_LOCKUP_DETECTOR is not set
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120
# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHEDSTATS=y
CONFIG_TIMER_STATS=y

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_PI_LIST=y
CONFIG_RT_MUTEX_TESTER=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_LOCKDEP is not set
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
CONFIG_TRACE_IRQFLAGS=y
CONFIG_STACKTRACE=y
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_WRITECOUNT is not set
CONFIG_DEBUG_LIST=y
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_PROVE_RCU is not set
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_CPU_STALL_INFO is not set
# CONFIG_RCU_TRACE is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS=y
# CONFIG_DEBUG_STRICT_USER_COPY_CHECKS is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_FP_TEST=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SCHED_TRACER=y
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
# CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENT=y
# CONFIG_UPROBE_EVENT is not set
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
CONFIG_MMIOTRACE=y
# CONFIG_MMIOTRACE_TEST is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set

#
# Runtime Testing
#
# CONFIG_LKDTM is not set
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
# CONFIG_ATOMIC64_SELFTEST is not set
# CONFIG_TEST_STRING_HELPERS is not set
# CONFIG_TEST_KSTRTOX is not set
# CONFIG_PROVIDE_OHCI1394_DMA_INIT is not set
# CONFIG_DMA_API_DEBUG is not set
# CONFIG_TEST_MODULE is not set
# CONFIG_TEST_USER_COPY is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
# CONFIG_KGDB_TESTS is not set
CONFIG_KGDB_LOW_LEVEL_TRAP=y
CONFIG_KGDB_KDB=y
CONFIG_KDB_KEYBOARD=y
CONFIG_KDB_CONTINUE_CATASTROPHIC=0
CONFIG_STRICT_DEVMEM=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
# CONFIG_EARLY_PRINTK_DBGP is not set
# CONFIG_EARLY_PRINTK_EFI is not set
# CONFIG_X86_PTDUMP is not set
CONFIG_DEBUG_RODATA=y
# CONFIG_DEBUG_RODATA_TEST is not set
CONFIG_DEBUG_SET_MODULE_RONX=y
# CONFIG_DEBUG_NX_TEST is not set
CONFIG_DOUBLEFAULT=y
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
# CONFIG_IO_DELAY_0X80 is not set
CONFIG_IO_DELAY_0XED=y
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=1
# CONFIG_DEBUG_BOOT_PARAMS is not set
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
# CONFIG_DEBUG_NMI_SELFTEST is not set
# CONFIG_X86_DEBUG_STATIC_CPU_HAS is not set

#
# Security options
#
CONFIG_KEYS=y
# CONFIG_PERSISTENT_KEYRINGS is not set
# CONFIG_BIG_KEYS is not set
CONFIG_TRUSTED_KEYS=y
CONFIG_ENCRYPTED_KEYS=y
# CONFIG_KEYS_DEBUG_PROC_KEYS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_SECURITY_PATH=y
CONFIG_INTEL_TXT=y
CONFIG_LSM_MMAP_MIN_ADDR=0
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=0
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SELINUX_POLICYDB_VERSION_MAX is not set
CONFIG_SECURITY_SMACK=y
CONFIG_SECURITY_TOMOYO=y
CONFIG_SECURITY_TOMOYO_MAX_ACCEPT_ENTRY=2048
CONFIG_SECURITY_TOMOYO_MAX_AUDIT_LOG=1024
# CONFIG_SECURITY_TOMOYO_OMIT_USERSPACE_LOADER is not set
CONFIG_SECURITY_TOMOYO_POLICY_LOADER="/sbin/tomoyo-init"
CONFIG_SECURITY_TOMOYO_ACTIVATION_TRIGGER="/sbin/init"
CONFIG_SECURITY_APPARMOR=y
CONFIG_SECURITY_APPARMOR_BOOTPARAM_VALUE=1
CONFIG_SECURITY_APPARMOR_HASH=y
CONFIG_SECURITY_YAMA=y
# CONFIG_SECURITY_YAMA_STACKED is not set
CONFIG_INTEGRITY=y
# CONFIG_INTEGRITY_SIGNATURE is not set
CONFIG_INTEGRITY_AUDIT=y
# CONFIG_IMA is not set
CONFIG_EVM=y
CONFIG_EVM_HMAC_VERSION=2
# CONFIG_DEFAULT_SECURITY_SELINUX is not set
# CONFIG_DEFAULT_SECURITY_SMACK is not set
# CONFIG_DEFAULT_SECURITY_TOMOYO is not set
CONFIG_DEFAULT_SECURITY_APPARMOR=y
# CONFIG_DEFAULT_SECURITY_YAMA is not set
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="apparmor"
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_PCOMP2=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=y
# CONFIG_CRYPTO_GF128MUL is not set
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
# CONFIG_CRYPTO_CRYPTD is not set
# CONFIG_CRYPTO_AUTHENC is not set
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_SEQIV is not set

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
# CONFIG_CRYPTO_CTR is not set
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
# CONFIG_CRYPTO_LRW is not set
# CONFIG_CRYPTO_PCBC is not set
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
# CONFIG_CRYPTO_CMAC is not set
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=y
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
CONFIG_CRYPTO_CRCT10DIF=y
# CONFIG_CRYPTO_CRCT10DIF_PCLMUL is not set
# CONFIG_CRYPTO_GHASH is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
CONFIG_CRYPTO_SHA256=y
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
# CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL is not set

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_X86_64 is not set
# CONFIG_CRYPTO_AES_NI_INTEL is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_ZLIB is not set
CONFIG_CRYPTO_LZO=y
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
# CONFIG_CRYPTO_USER_API_HASH is not set
# CONFIG_CRYPTO_USER_API_SKCIPHER is not set
CONFIG_CRYPTO_HW=y
CONFIG_CRYPTO_DEV_PADLOCK=y
# CONFIG_CRYPTO_DEV_PADLOCK_AES is not set
# CONFIG_CRYPTO_DEV_PADLOCK_SHA is not set
# CONFIG_CRYPTO_DEV_CCP is not set
# CONFIG_ASYMMETRIC_KEY_TYPE is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_APIC_ARCHITECTURE=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_KVM_INTEL=m
# CONFIG_KVM_AMD is not set
# CONFIG_KVM_MMU_AUDIT is not set
CONFIG_KVM_DEVICE_ASSIGNMENT=y
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_BITREVERSE=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
# CONFIG_CRC_CCITT is not set
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=m
# CONFIG_CRC8 is not set
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT=y
CONFIG_HAS_DMA=y
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_NLATTR=y
CONFIG_ARCH_HAS_ATOMIC64_DEC_IF_POSITIVE=y
CONFIG_AVERAGE=y
# CONFIG_CORDIC is not set
# CONFIG_DDR is not set
CONFIG_UCS2_STRING=y
CONFIG_FONT_SUPPORT=y
# CONFIG_FONTS is not set
CONFIG_FONT_8x8=y
CONFIG_FONT_8x16=y

-- 
Kind regards,
Minchan Kim

[-- Attachment #2: virtio_ring.o --]
[-- Type: application/x-object, Size: 152192 bytes --]


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 22:41       ` Linus Torvalds
@ 2014-05-29  1:30         ` Dave Chinner
  2014-05-29  1:58           ` Dave Chinner
  2014-05-29  2:42           ` [RFC 2/2] x86_64: expand kernel stack to 16K Linus Torvalds
  0 siblings, 2 replies; 107+ messages in thread
From: Dave Chinner @ 2014-05-29  1:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 3:31 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > Indeed, the call chain reported here is not caused by swap issuing
> > IO.
> 
> Well, that's one way of reading that callchain.
> 
> I think it's the *wrong* way of reading it, though. Almost dishonestly
> so.

I guess you haven't met your insult quota for the day, Linus. :/

> Because very clearly, the swapout _is_ what causes the unplugging
> of the IO queue, and does so because it is allocating the BIO for its
> own IO.  The fact that that then fails (because of other IO's in
> flight), and causes *other* IO to be flushed, doesn't really change
> anything fundamental. It's still very much swap that causes that
> "let's start IO".

It is not rocket science to see how plugging outside memory
allocation context can lead to flushing that plug within direct
reclaim without having dispatched any IO at all from direct
reclaim....

You're focussing on the specific symptoms, not the bigger picture.
i.e. you're ignoring all the other "let's start IO" triggers in
direct reclaim. e.g. there are two separate plug-flush triggers in
shrink_inactive_list(), one of which is:

        /*
         * Stall direct reclaim for IO completions if underlying BDIs or zone
         * is congested. Allow kswapd to continue until it starts encountering
         * unqueued dirty pages or cycling through the LRU too quickly.
         */
        if (!sc->hibernation_mode && !current_is_kswapd())
                wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);

I'm not saying we shouldn't turn off swap from direct reclaim, just
that all we'd be doing by turning off swap is playing whack-a-stack
- the next report will simply be from one of the other direct
reclaim IO schedule points.

Regardless of whether it is swap or something external that queues
the bio on the plug, perhaps we should look at why it's done inline
rather than by kblockd, where it was moved because it was blowing
the stack from schedule():

commit f4af3c3d077a004762aaad052049c809fd8c6f0c
Author: Jens Axboe <jaxboe@fusionio.com>
Date:   Tue Apr 12 14:58:51 2011 +0200

    block: move queue run on unplug to kblockd
    
    There are worries that we are now consuming a lot more stack in
    some cases, since we potentially call into IO dispatch from
    schedule() or io_schedule(). We can reduce this problem by moving
    the running of the queue to kblockd, like the old plugging scheme
    did as well.
    
    This may or may not be a good idea from a performance perspective,
    depending on how many tasks have queue plugs running at the same
    time. For even the slightly contended case, doing just a single
    queue run from kblockd instead of multiple runs directly from the
    unpluggers will be faster.
    
    Signed-off-by: Jens Axboe <jaxboe@fusionio.com>


commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
Author: Jens Axboe <jaxboe@fusionio.com>
Date:   Sat Apr 16 13:27:55 2011 +0200

    block: let io_schedule() flush the plug inline
    
    Linus correctly observes that the most important dispatch cases
    are now done from kblockd, this isn't ideal for latency reasons.
    The original reason for switching dispatches out-of-line was to
    avoid too deep a stack, so by _only_ letting the "accidental"
    flush directly in schedule() be guarded by offload to kblockd,
    we should be able to get the best of both worlds.
    
    So add a blk_schedule_flush_plug() that offloads to kblockd,
    and only use that from the schedule() path.
    
    Signed-off-by: Jens Axboe <jaxboe@fusionio.com>

And now we have too deep a stack due to unplugging from io_schedule()...

> IOW, swap-out directly caused that extra 3kB of stack use in what was
> a deep call chain (due to memory allocation). I really don't
> understand why you are arguing anything else on a pure technicality.
>
> I thought you had some other argument for why swap was different, and
> against removing that "page_is_file_cache()" special case in
> shrink_page_list().

I've said in the past that swap is different to filesystem
->writepage implementations because it doesn't require significant
stack to do block allocation and doesn't trigger IO deep in that
allocation stack. Hence it has much lower stack overhead than the
filesystem ->writepage implementations and so is much less likely to
have stack issues.

This stack overflow shows us that just the memory reclaim + IO
layers are sufficient to cause a stack overflow, which is something
I've never seen before. That implies no IO in direct reclaim context
is safe - either from swap or io_schedule() unplugging. It also
lends a lot of weight to my assertion that the majority of the stack
growth over the past couple of years has been occurring outside the
filesystems....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  1:30         ` Dave Chinner
@ 2014-05-29  1:58           ` Dave Chinner
  2014-05-29  2:51             ` Linus Torvalds
  2014-05-29 23:36             ` Minchan Kim
  2014-05-29  2:42           ` [RFC 2/2] x86_64: expand kernel stack to 16K Linus Torvalds
  1 sibling, 2 replies; 107+ messages in thread
From: Dave Chinner @ 2014-05-29  1:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> Author: Jens Axboe <jaxboe@fusionio.com>
> Date:   Sat Apr 16 13:27:55 2011 +0200
> 
>     block: let io_schedule() flush the plug inline
>     
>     Linus correctly observes that the most important dispatch cases
>     are now done from kblockd, this isn't ideal for latency reasons.
>     The original reason for switching dispatches out-of-line was to
>     avoid too deep a stack, so by _only_ letting the "accidental"
>     flush directly in schedule() be guarded by offload to kblockd,
>     we should be able to get the best of both worlds.
>     
>     So add a blk_schedule_flush_plug() that offloads to kblockd,
>     and only use that from the schedule() path.
>     
>     Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> 
> And now we have too deep a stack due to unplugging from io_schedule()...

So, if we make io_schedule() push the plug list off to the kblockd
like is done for schedule()....

> > IOW, swap-out directly caused that extra 3kB of stack use in what was
> > a deep call chain (due to memory allocation). I really don't
> > understand why you are arguing anything else on a pure technicality.
> >
> > I thought you had some other argument for why swap was different, and
> > against removing that "page_is_file_cache()" special case in
> > shrink_page_list().
> 
> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.
> 
> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow,

.... we solve this problem directly by being able to remove the IO
stack usage from the direct reclaim swap path.

IOWs, we don't need to turn swap off at all in direct reclaim
because all the swap IO can be captured in a plug list and
dispatched via kblockd. This could be done either by io_schedule()
or a new blk_flush_plug_list() wrapper that pushes the work to
kblockd...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  1:30         ` Dave Chinner
  2014-05-29  1:58           ` Dave Chinner
@ 2014-05-29  2:42           ` Linus Torvalds
  2014-05-29  5:14             ` H. Peter Anvin
                               ` (3 more replies)
  1 sibling, 4 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-29  2:42 UTC (permalink / raw)
  To: Dave Chinner, Jens Axboe
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Wed, May 28, 2014 at 6:30 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> You're focussing on the specific symptoms, not the bigger picture.
> i.e. you're ignoring all the other "let's start IO" triggers in
> direct reclaim. e.g. there are two separate plug flush triggers in
> shrink_inactive_list(), one of which is:

Fair enough. I certainly agree that we should look at the other cases here too.

In fact, I also find it distasteful just how much stack space some of
those VM routines are just using up on their own, never mind any
actual IO paths at all. The fact that __alloc_pages_nodemask() uses
350 bytes of stack space on its own is actually quite disturbing. The
fact that kernel_map_pages() apparently has almost 400 bytes of stack
is just crazy. Obviously that case only happens with
CONFIG_DEBUG_PAGEALLOC, but still..

> I'm not saying we shouldn't turn off swap from direct reclaim, just
> that all we'd be doing by turning off swap is playing whack-a-stack
> - the next report will simply be from one of the other direct
> reclaim IO schedule points.

Playing whack-a-mole with this for a while might not be a bad idea,
though. It's not like we will ever really improve unless we start
whacking the worst cases. And it should still be a fairly limited
number.

After all, historically, some of the cases we've played whack-a-mole
on have been in XFS, so I'd think you'd be thrilled to see some other
code get blamed this time around ;)

> Regardless of whether it is swap or something external queues the
> bio on the plug, perhaps we should look at why it's done inline
> rather than by kblockd, where it was moved because it was blowing
> the stack from schedule():

So it sounds like we need to do this for io_schedule() too.

In fact, we've generally found it to be a mistake every time we
"automatically" unblock some IO queue. And I'm not saying that because
of stack space, but because we've _often_ had the situation that eager
unblocking results in IO that could have been done as bigger requests.

Of course, we do need to worry about latency for starting IO, but any
of these kinds of memory-pressure writeback patterns are pretty much
by definition not about the latency of one _particular_ IO, so they
don't tend to be latency-sensitive. Quite the reverse: we start
writeback and then end up waiting on something else altogether
(possibly a writeback that got started much earlier).

swapout certainly is _not_ IO-latency-sensitive, especially these
days. And while we _do_ want to throttle in direct reclaim, if it's
about throttling I'd certainly think that it sounds quite reasonable
to push any unplugging to kblockd than try to do that synchronously.
If we are throttling in direct-reclaim, we need to slow things _down_
for the writer, not worry about latency.

> I've said in the past that swap is different to filesystem
> ->writepage implementations because it doesn't require significant
> stack to do block allocation and doesn't trigger IO deep in that
> allocation stack. Hence it has much lower stack overhead than the
> filesystem ->writepage implementations and so is much less likely to
> have stack issues.

Clearly it is true that it lacks the actual filesystem part needed for
the writeback. At the same time, Minchan's example is certainly a good
one of a filesystem (ext4) already being reasonably deep in its own
stack space when it then wants memory.

Looking at that callchain, I have to say that ext4 doesn't look
horrible compared to the whole block layer and virtio.. Yes,
"ext4_writepages()" is using almost 400 bytes of stack, and most of
that seems to be due to:

        struct mpage_da_data mpd;
        struct blk_plug plug;

which looks at least understandable (nothing like the mess in the VM
code where the stack usage is because gcc creates horrible spills)

> This stack overflow shows us that just the memory reclaim + IO
> layers are sufficient to cause a stack overflow, which is something
> I've never seen before.

Well, we've definitely have had some issues with deeper callchains
with md, but I suspect virtio might be worse, and the new blk-mq code
is likely worse in this respect too.

And Minchan running out of stack is at least _partly_ due to his debug
options (that DEBUG_PAGEALLOC thing as an extreme example, but I
suspect there are a few other options there that generate more bloated
data structures too).

>                That implies no IO in direct reclaim context
> is safe - either from swap or io_schedule() unplugging. It also
> lends a lot of weight to my assertion that the majority of the stack
> growth over the past couple of years has been occurring outside the
> filesystems....

I think Minchan's stack trace definitely backs you up on that. The
filesystem part - despite that one ext4_writepages() function - is a
very small part of the whole. It sits at about ~1kB of stack. Just the
VM "top-level" writeback code is about as much, and then the VM page
alloc/shrinking code when the filesystem needs memory is *twice* that,
and then the block layer and the virtio code are another 1kB each.

The rest is just kthread overhead and that DEBUG_PAGEALLOC thing.
Other debug options might be bloating Minchan's stack use numbers in
general, but probably not by massive amounts. Locks will generally be
_hugely_ bigger due to lock debugging, but that's seldom on the stack.

So no, this is not a filesystem problem. This is definitely core VM
and block layer, no arguments what-so-ever.

I note that Jens wasn't cc'd. Added him in.

               Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  1:09     ` Minchan Kim
@ 2014-05-29  2:44       ` Steven Rostedt
  2014-05-29  4:11         ` Minchan Kim
  2014-05-29  2:47       ` Rusty Russell
  1 sibling, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2014-05-29  2:44 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michael S. Tsirkin, linux-kernel, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, rusty, Dave Hansen

On Thu, 29 May 2014 10:09:40 +0900
Minchan Kim <minchan@kernel.org> wrote:

> stacktrace reported that vring_add_indirect used 376 bytes, and objdump says
> 
> ffffffff8141dc60 <vring_add_indirect>:
> ffffffff8141dc60:       55                      push   %rbp
> ffffffff8141dc61:       48 89 e5                mov    %rsp,%rbp
> ffffffff8141dc64:       41 57                   push   %r15
> ffffffff8141dc66:       41 56                   push   %r14
> ffffffff8141dc68:       41 55                   push   %r13
> ffffffff8141dc6a:       49 89 fd                mov    %rdi,%r13
> ffffffff8141dc6d:       89 cf                   mov    %ecx,%edi
> ffffffff8141dc6f:       48 c1 e7 04             shl    $0x4,%rdi
> ffffffff8141dc73:       41 54                   push   %r12
> ffffffff8141dc75:       49 89 d4                mov    %rdx,%r12
> ffffffff8141dc78:       53                      push   %rbx
> ffffffff8141dc79:       48 89 f3                mov    %rsi,%rbx
> ffffffff8141dc7c:       48 83 ec 28             sub    $0x28,%rsp
> ffffffff8141dc80:       8b 75 20                mov    0x20(%rbp),%esi
> ffffffff8141dc83:       89 4d bc                mov    %ecx,-0x44(%rbp)
> ffffffff8141dc86:       44 89 45 cc             mov    %r8d,-0x34(%rbp)
> ffffffff8141dc8a:       44 89 4d c8             mov    %r9d,-0x38(%rbp)
> ffffffff8141dc8e:       83 e6 dd                and    $0xffffffdd,%esi
> ffffffff8141dc91:       e8 7a d1 d7 ff          callq  ffffffff8119ae10 <__kmalloc>
> ffffffff8141dc96:       48 85 c0                test   %rax,%rax
> 
> So, it's *strange*.
> 
> I will add .config and .o.
> Maybe someone might find what happens.
> 

This is really bothering me. I'm trying to figure it out. We have from
the stack trace:

[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320

The way the stack tracer works is that when it detects a new max stack
it calls save_stack_trace() to get the complete call chain from the
stack. This should be rather accurate as it seems that your kernel was
compiled with frame pointers (confirmed by the objdump as well as the
config file). It then uses that stack trace that it got to examine the
stack to find the locations of the saved return addresses and records
them in an array (in your case, an array of 50 entries).

From your .o file:

vring_add_indirect + 0x36: (0x370 + 0x36 = 0x3a6)

0000000000000370 <vring_add_indirect>:

 39e:   83 e6 dd                and    $0xffffffdd,%esi
 3a1:   e8 00 00 00 00          callq  3a6 <vring_add_indirect+0x36>
                        3a2: R_X86_64_PC32      __kmalloc-0x4
 3a6:   48 85 c0                test   %rax,%rax

Definitely the return address to the call to __kmalloc. Then to
determine the size of the stack frame, it is subtracted from the next
one down. In this case, the location of virtqueue_add_sgs+0x2e2.

virtqueue_add_sgs + 0x2e2: (0x880 + 0x2e2 = 0xb62)

0000000000000880 <virtqueue_add_sgs>:

 b4f:   89 4c 24 08             mov    %ecx,0x8(%rsp)
 b53:   48 c7 c2 00 00 00 00    mov    $0x0,%rdx
                        b56: R_X86_64_32S       .text+0x570
 b5a:   44 89 d1                mov    %r10d,%ecx
 b5d:   e8 0e f8 ff ff          callq  370 <vring_add_indirect>
 b62:   85 c0                   test   %eax,%eax


Which is the return address of where vring_add_indirect was called.

The return address back to virtqueue_add_sgs was found at 6000 bytes of
the stack. The return address back to vring_add_indirect was found at
6376 bytes from the top of the stack.

My question is, why were they so far apart? I see 6 words pushed
(8 bytes each, for a total of 48 bytes), and a subtraction of the stack
pointer of 0x28 (40 bytes) giving us a total of 88 bytes. Plus we need
to add the push of the return address itself which would just give us
96 bytes for the stack frame. What is making this show 376 bytes??

Looking more into this, I'm not sure I trust the top numbers anymore.
kmalloc reports a stack frame of 80, and I'm coming up with 104
(perhaps even 112). And slab_alloc only has 8. Something's messed up there.

-- Steve

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  1:09     ` Minchan Kim
  2014-05-29  2:44       ` Steven Rostedt
@ 2014-05-29  2:47       ` Rusty Russell
  1 sibling, 0 replies; 107+ messages in thread
From: Rusty Russell @ 2014-05-29  2:47 UTC (permalink / raw)
  To: Minchan Kim, Michael S. Tsirkin
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Dave Hansen, Steven Rostedt,
	Linus Torvalds

Minchan Kim <minchan@kernel.org> writes:
> On Wed, May 28, 2014 at 12:04:09PM +0300, Michael S. Tsirkin wrote:
>> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
>> > [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
>> > [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
>> > [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320

Hmm, we can actually skip the vring_add_indirect if we're hurting for
stack.  It just means the request will try to fit linearly in the ring,
rather than using indirect.

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e443629f76d..496e727cefc8 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -184,6 +184,13 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	return head;
 }
 
+/* The Morton Technique */
+static noinline bool stack_trouble(void)
+{
+	unsigned long sp = (unsigned long)&sp;
+	return sp - (sp & ~(THREAD_SIZE - 1)) < 3000;
+}
+
 static inline int virtqueue_add(struct virtqueue *_vq,
 				struct scatterlist *sgs[],
 				struct scatterlist *(*next)
@@ -226,7 +233,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
+	if (vq->indirect && total_sg > 1 && vq->vq.num_free && !stack_trouble()) {
 		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
 					  total_in,
 					  out_sgs, in_sgs, gfp);

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  1:58           ` Dave Chinner
@ 2014-05-29  2:51             ` Linus Torvalds
  2014-05-29 23:36             ` Minchan Kim
  1 sibling, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-29  2:51 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

[ Crossed emails ]

On Wed, May 28, 2014 at 6:58 PM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
>>
>> And now we have too deep a stack due to unplugging from io_schedule()...
>
> So, if we make io_schedule() push the plug list off to the kblockd
> like is done for schedule()....

We might have a few different cases.

The cases where we *do* care about latency is when we are waiting for
the IO ourselves (ie in wait_on_page() and friends), and those end up
using io_schedule() too.

So in *that* case we definitely have a latency argument for doing it
directly, and we shouldn't kick it off to kblockd. That's very much a
"get this started as soon as humanly possible".

But the "wait_iff_congested()" code that also uses io_schedule()
should push it out to kblockd, methinks.

>> This stack overflow shows us that just the memory reclaim + IO
>> layers are sufficient to cause a stack overflow,
>
> .... we solve this problem directly by being able to remove the IO
> stack usage from the direct reclaim swap path.
>
> IOWs, we don't need to turn swap off at all in direct reclaim
> because all the swap IO can be captured in a plug list and
> dispatched via kblockd. This could be done either by io_schedule()
> or a new blk_flush_plug_list() wrapper that pushes the work to
> kblockd...

That would work. That said, I personally would not mind seeing that
"swap is special" go away, if possible. Because it can be behind a
filesystem too. Christ, even NFS (and people used to fight that tooth
and nail!) is back as a swap target..

                Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/2] ftrace: print stack usage right before Oops
  2014-05-28  6:53 [PATCH 1/2] ftrace: print stack usage right before Oops Minchan Kim
  2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
  2014-05-28 16:18 ` [PATCH 1/2] ftrace: print stack usage right before Oops Steven Rostedt
@ 2014-05-29  3:01 ` Steven Rostedt
  2014-05-29  3:49   ` Minchan Kim
  2 siblings, 1 reply; 107+ messages in thread
From: Steven Rostedt @ 2014-05-29  3:01 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, mst, Dave Hansen

On Wed, 28 May 2014 15:53:58 +0900
Minchan Kim <minchan@kernel.org> wrote:

> While I played with my own feature(ex, something on the way to reclaim),
> kernel went to oops easily. I guessed reason would be stack overflow
> and wanted to prove it.
> 
> I found stack tracer which would be very useful for me but kernel went
> oops before my user program gather the information via
> "watch cat /sys/kernel/debug/tracing/stack_trace" so I couldn't get an
> stack usage of each functions.
> 
> What I want was that emit the kernel stack usage when kernel goes oops.
> 
> This patch records callstack of max stack usage into ftrace buffer
> right before Oops and print that information with ftrace_dump_on_oops.
> At last, I can find a culprit. :)
> 
> The result is as follows.
> 
>   111.402376] ------------[ cut here ]------------
> [  111.403077] kernel BUG at kernel/trace/trace_stack.c:177!
> [  111.403831] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [  111.404635] Dumping ftrace buffer:
> [  111.404781] ---------------------------------
> [  111.404781]    <...>-15987   5d..2 111689526us : stack_trace_call:         Depth    Size   Location    (49 entries)
> [  111.404781]         -----    ----   --------
> [  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   0)     7216      24   __change_page_attr_set_clr+0xe0/0xb50
> [  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   1)     7192     392   kernel_map_pages+0x6c/0x120
> [  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   2)     6800     256   get_page_from_freelist+0x489/0x920
> [  111.404781]    <...>-15987   5d..2 111689536us : stack_trace_call:   3)     6544     352   __alloc_pages_nodemask+0x5e1/0xb20
> [  111.404781]    <...>-15987   5d..2 111689536us : stack_trace_call:   4)     6192       8   alloc_pages_current+0x10f/0x1f0
> [  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   5)     6184     168   new_slab+0x2c5/0x370
> [  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   6)     6016       8   __slab_alloc+0x3a9/0x501
> [  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   7)     6008      80   __kmalloc+0x1cb/0x200
> [  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:   8)     5928     376   vring_add_indirect+0x36/0x200

This is a different report than patch 2/2 has, but the numbers are the
same. Are you sure that you used the posted config to get these
crashes? I'm still having a hard time figuring out where these numbers
are coming from. :-/

-- Steve


> [  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:   9)     5552     144   virtqueue_add_sgs+0x2e2/0x320
> [  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:  10)     5408     288   __virtblk_add_req+0xda/0x1b0
> [  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:  11)     5120      96   virtio_queue_rq+0xd3/0x1d0
> [  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  12)     5024     128   __blk_mq_run_hw_queue+0x1ef/0x440
> [  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  13)     4896      16   blk_mq_run_hw_queue+0x35/0x40
> [  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  14)     4880      96   blk_mq_insert_requests+0xdb/0x160
> [  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  15)     4784     112   blk_mq_flush_plug_list+0x12b/0x140
> [  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  16)     4672     112   blk_flush_plug_list+0xc7/0x220
> [  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  17)     4560      64   io_schedule_timeout+0x88/0x100
> [  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  18)     4496     128   mempool_alloc+0x145/0x170
> [  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  19)     4368      96   bio_alloc_bioset+0x10b/0x1d0
> [  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  20)     4272      48   get_swap_bio+0x30/0x90
> [  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  21)     4224     160   __swap_writepage+0x150/0x230
> [  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  22)     4064      32   swap_writepage+0x42/0x90
> [  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  23)     4032     320   shrink_page_list+0x676/0xa80
> [  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  24)     3712     208   shrink_inactive_list+0x262/0x4e0
> [  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  25)     3504     304   shrink_lruvec+0x3e1/0x6a0
> [  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  26)     3200      80   shrink_zone+0x3f/0x110
> [  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  27)     3120     128   do_try_to_free_pages+0x156/0x4c0
> [  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  28)     2992     208   try_to_free_pages+0xf7/0x1e0
> [  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  29)     2784     352   __alloc_pages_nodemask+0x783/0xb20
> [  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  30)     2432       8   alloc_pages_current+0x10f/0x1f0
> [  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  31)     2424     168   new_slab+0x2c5/0x370
> [  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  32)     2256       8   __slab_alloc+0x3a9/0x501
> [  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  33)     2248      80   kmem_cache_alloc+0x1ac/0x1c0
> [  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  34)     2168     296   mempool_alloc_slab+0x15/0x20
> [  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  35)     1872     128   mempool_alloc+0x5e/0x170
> [  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  36)     1744      96   bio_alloc_bioset+0x10b/0x1d0
> [  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  37)     1648      48   mpage_alloc+0x38/0xa0
> [  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  38)     1600     208   do_mpage_readpage+0x49b/0x5d0
> [  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  39)     1392     224   mpage_readpages+0xcf/0x120
> [  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  40)     1168      48   ext4_readpages+0x45/0x60
> [  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  41)     1120     224   __do_page_cache_readahead+0x222/0x2d0
> [  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  42)      896      16   ra_submit+0x21/0x30
> [  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  43)      880     112   filemap_fault+0x2d7/0x4f0
> [  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  44)      768     144   __do_fault+0x6d/0x4c0
> [  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  45)      624     160   handle_mm_fault+0x1a6/0xaf0
> [  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  46)      464     272   __do_page_fault+0x18a/0x590
> [  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  47)      192      16   do_page_fault+0xc/0x10
> [  111.404781]    <...>-15987   5d..2 111689551us : stack_trace_call:  48)      176     176   page_fault+0x22/0x30
> [  111.404781] ---------------------------------
> [  111.404781] Modules linked in:
> [  111.404781] CPU: 5 PID: 15987 Comm: cc1 Not tainted 3.14.0+ #162
> [  111.404781] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> [  111.404781] task: ffff880008a4a0e0 ti: ffff88000002c000 task.ti: ffff88000002c000
> [  111.404781] RIP: 0010:[<ffffffff8112340f>]  [<ffffffff8112340f>] stack_trace_call+0x37f/0x390
> [  111.404781] RSP: 0000:ffff88000002c2b0  EFLAGS: 00010092
> [  111.404781] RAX: ffff88000002c000 RBX: 0000000000000005 RCX: 0000000000000002
> [  111.404781] RDX: 0000000000000006 RSI: 0000000000000002 RDI: ffff88002780be00
> [  111.404781] RBP: ffff88000002c310 R08: 00000000000009e8 R09: ffffffffffffffff
> [  111.404781] R10: ffff88000002dfd8 R11: 0000000000000001 R12: 000000000000f2e8
> [  111.404781] R13: 0000000000000005 R14: ffffffff82768dfc R15: 00000000000000f8
> [  111.404781] FS:  00002ae66a6e4640(0000) GS:ffff880027ca0000(0000) knlGS:0000000000000000
> [  111.404781] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  111.404781] CR2: 00002ba016c8e004 CR3: 00000000045b7000 CR4: 00000000000006e0
> [  111.404781] Stack:
> [  111.404781]  0000000000000005 ffffffff81042410 0000000000000087 0000000000001c30
> [  111.404781]  ffff88000002c000 00002ae66a6f3000 ffffffffffffe000 0000000000000002
> [  111.404781]  ffff88000002c510 ffff880000d04000 ffff88000002c4b8 0000000000000002
> [  111.404781] Call Trace:
> [  111.404781]  [<ffffffff81042410>] ? __change_page_attr_set_clr+0xe0/0xb50
> [  111.404781]  [<ffffffff816efdff>] ftrace_call+0x5/0x2f
> [  111.404781]  [<ffffffff81004ba7>] ? dump_trace+0x177/0x2b0
> [  111.404781]  [<ffffffff81041a65>] ? _lookup_address_cpa.isra.3+0x5/0x40
> [  111.404781]  [<ffffffff81041a65>] ? _lookup_address_cpa.isra.3+0x5/0x40
> [  111.404781]  [<ffffffff81042410>] ? __change_page_attr_set_clr+0xe0/0xb50
> [  111.404781]  [<ffffffff811231a9>] ? stack_trace_call+0x119/0x390
> [  111.404781]  [<ffffffff81043eac>] ? kernel_map_pages+0x6c/0x120
> [  111.404781]  [<ffffffff810a22dd>] ? trace_hardirqs_off+0xd/0x10
> [  111.404781]  [<ffffffff81150131>] ? get_page_from_freelist+0x3d1/0x920
> [  111.404781]  [<ffffffff81043eac>] kernel_map_pages+0x6c/0x120
> [  111.404781]  [<ffffffff811501e9>] get_page_from_freelist+0x489/0x920
> [  111.404781]  [<ffffffff81150c61>] __alloc_pages_nodemask+0x5e1/0xb20
> [  111.404781]  [<ffffffff8119188f>] alloc_pages_current+0x10f/0x1f0
> [  111.404781]  [<ffffffff8119ac35>] ? new_slab+0x2c5/0x370
> [  111.404781]  [<ffffffff8119ac35>] new_slab+0x2c5/0x370
> [  111.404781]  [<ffffffff816dbfc9>] __slab_alloc+0x3a9/0x501
> [  111.404781]  [<ffffffff8119beeb>] ? __kmalloc+0x1cb/0x200
> [  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
> [  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
> [  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
> [  111.404781]  [<ffffffff8119beeb>] __kmalloc+0x1cb/0x200
> [  111.404781]  [<ffffffff8141ed70>] ? vring_add_indirect+0x200/0x200
> [  111.404781]  [<ffffffff8141eba6>] vring_add_indirect+0x36/0x200
> [  111.404781]  [<ffffffff8141f362>] virtqueue_add_sgs+0x2e2/0x320
> [  111.404781]  [<ffffffff8148f2ba>] __virtblk_add_req+0xda/0x1b0
> [  111.404781]  [<ffffffff813780c5>] ? __delay+0x5/0x20
> [  111.404781]  [<ffffffff8148f463>] virtio_queue_rq+0xd3/0x1d0
> [  111.404781]  [<ffffffff8134b96f>] __blk_mq_run_hw_queue+0x1ef/0x440
> [  111.404781]  [<ffffffff8134c035>] blk_mq_run_hw_queue+0x35/0x40
> [  111.404781]  [<ffffffff8134c71b>] blk_mq_insert_requests+0xdb/0x160
> [  111.404781]  [<ffffffff8134cdbb>] blk_mq_flush_plug_list+0x12b/0x140
> [  111.404781]  [<ffffffff810c5ab5>] ? ktime_get_ts+0x125/0x150
> [  111.404781]  [<ffffffff81343197>] blk_flush_plug_list+0xc7/0x220
> [  111.404781]  [<ffffffff816e70bf>] ? _raw_spin_unlock_irqrestore+0x3f/0x70
> [  111.404781]  [<ffffffff816e26b8>] io_schedule_timeout+0x88/0x100
> [  111.404781]  [<ffffffff816e2635>] ? io_schedule_timeout+0x5/0x100
> [  111.404781]  [<ffffffff81149465>] mempool_alloc+0x145/0x170
> [  111.404781]  [<ffffffff8109baf0>] ? __init_waitqueue_head+0x60/0x60
> [  111.404781]  [<ffffffff811e33cb>] bio_alloc_bioset+0x10b/0x1d0
> [  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
> [  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
> [  111.404781]  [<ffffffff81184160>] get_swap_bio+0x30/0x90
> [  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
> [  111.404781]  [<ffffffff811846b0>] __swap_writepage+0x150/0x230
> [  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
> [  111.404781]  [<ffffffff81184565>] ? __swap_writepage+0x5/0x230
> [  111.404781]  [<ffffffff811847d2>] swap_writepage+0x42/0x90
> [  111.404781]  [<ffffffff8115aee6>] shrink_page_list+0x676/0xa80
> [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> [  111.404781]  [<ffffffff8115b8c2>] shrink_inactive_list+0x262/0x4e0
> [  111.404781]  [<ffffffff8115c211>] shrink_lruvec+0x3e1/0x6a0
> [  111.404781]  [<ffffffff8115c50f>] shrink_zone+0x3f/0x110
> [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> [  111.404781]  [<ffffffff8115ca36>] do_try_to_free_pages+0x156/0x4c0
> [  111.404781]  [<ffffffff8115cf97>] try_to_free_pages+0xf7/0x1e0
> [  111.404781]  [<ffffffff81150e03>] __alloc_pages_nodemask+0x783/0xb20
> [  111.404781]  [<ffffffff8119188f>] alloc_pages_current+0x10f/0x1f0
> [  111.404781]  [<ffffffff8119ac35>] ? new_slab+0x2c5/0x370
> [  111.404781]  [<ffffffff8119ac35>] new_slab+0x2c5/0x370
> [  111.404781]  [<ffffffff816dbfc9>] __slab_alloc+0x3a9/0x501
> [  111.404781]  [<ffffffff8119d95c>] ? kmem_cache_alloc+0x1ac/0x1c0
> [  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
> [  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
> [  111.404781]  [<ffffffff8119d95c>] kmem_cache_alloc+0x1ac/0x1c0
> [  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
> [  111.404781]  [<ffffffff81149025>] mempool_alloc_slab+0x15/0x20
> [  111.404781]  [<ffffffff8114937e>] mempool_alloc+0x5e/0x170
> [  111.404781]  [<ffffffff811e33cb>] bio_alloc_bioset+0x10b/0x1d0
> [  111.404781]  [<ffffffff811ea618>] mpage_alloc+0x38/0xa0
> [  111.404781]  [<ffffffff811eb2eb>] do_mpage_readpage+0x49b/0x5d0
> [  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
> [  111.404781]  [<ffffffff811eb55f>] mpage_readpages+0xcf/0x120
> [  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
> [  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
> [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> [  111.404781]  [<ffffffff81153e21>] ? __do_page_cache_readahead+0xc1/0x2d0
> [  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
> [  111.404781]  [<ffffffff8124d045>] ext4_readpages+0x45/0x60
> [  111.404781]  [<ffffffff81153f82>] __do_page_cache_readahead+0x222/0x2d0
> [  111.404781]  [<ffffffff81153e21>] ? __do_page_cache_readahead+0xc1/0x2d0
> [  111.404781]  [<ffffffff811541c1>] ra_submit+0x21/0x30
> [  111.404781]  [<ffffffff811482f7>] filemap_fault+0x2d7/0x4f0
> [  111.404781]  [<ffffffff8116f3ad>] __do_fault+0x6d/0x4c0
> [  111.404781]  [<ffffffff81172596>] handle_mm_fault+0x1a6/0xaf0
> [  111.404781]  [<ffffffff816eb1aa>] __do_page_fault+0x18a/0x590
> [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> [  111.404781]  [<ffffffff81081e9c>] ? finish_task_switch+0x7c/0x120
> [  111.404781]  [<ffffffff81081e5f>] ? finish_task_switch+0x3f/0x120
> [  111.404781]  [<ffffffff816eb5bc>] do_page_fault+0xc/0x10
> [  111.404781]  [<ffffffff816e7a52>] page_fault+0x22/0x30
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  kernel/trace/trace_stack.c | 32 ++++++++++++++++++++++++++++++--
>  1 file changed, 30 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> index 5aa9a5b9b6e2..5eb88e60bc5e 100644
> --- a/kernel/trace/trace_stack.c
> +++ b/kernel/trace/trace_stack.c
> @@ -51,6 +51,30 @@ static DEFINE_MUTEX(stack_sysctl_mutex);
>  int stack_tracer_enabled;
>  static int last_stack_tracer_enabled;
>  
> +static inline void print_max_stack(void)
> +{
> +	long i;
> +	int size;
> +
> +	trace_printk("        Depth    Size   Location"
> +			   "    (%d entries)\n"
> +			   "        -----    ----   --------\n",
> +			   max_stack_trace.nr_entries - 1);
> +
> +	for (i = 0; i < max_stack_trace.nr_entries; i++) {
> +		if (stack_dump_trace[i] == ULONG_MAX)
> +			break;
> +		if (i+1 == max_stack_trace.nr_entries ||
> +				stack_dump_trace[i+1] == ULONG_MAX)
> +			size = stack_dump_index[i];
> +		else
> +			size = stack_dump_index[i] - stack_dump_index[i+1];
> +
> +		trace_printk("%3ld) %8d   %5d   %pS\n", i, stack_dump_index[i],
> +				size, (void *)stack_dump_trace[i]);
> +	}
> +}
> +
>  static inline void
>  check_stack(unsigned long ip, unsigned long *stack)
>  {
> @@ -149,8 +173,12 @@ check_stack(unsigned long ip, unsigned long *stack)
>  			i++;
>  	}
>  
> -	BUG_ON(current != &init_task &&
> -		*(end_of_stack(current)) != STACK_END_MAGIC);
> +	if ((current != &init_task &&
> +		*(end_of_stack(current)) != STACK_END_MAGIC)) {
> +		print_max_stack();
> +		BUG();
> +	}
> +
>   out:
>  	arch_spin_unlock(&max_stack_lock);
>  	local_irq_restore(flags);


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 16:09   ` Linus Torvalds
  2014-05-28 22:31     ` Dave Chinner
@ 2014-05-29  3:46     ` Minchan Kim
  2014-05-29  4:13       ` Linus Torvalds
  2014-05-30 21:23     ` Andi Kleen
  2 siblings, 1 reply; 107+ messages in thread
From: Minchan Kim @ 2014-05-29  3:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Wed, May 28, 2014 at 09:09:23AM -0700, Linus Torvalds wrote:
> On Tue, May 27, 2014 at 11:53 PM, Minchan Kim <minchan@kernel.org> wrote:
> >
> > So, my stupid idea is just let's expand stack size and keep an eye
> > toward stack consumption on each kernel functions via stacktrace of ftrace.
> 
> We probably have to do this at some point, but that point is not -rc7.
> 
> And quite frankly, from the backtrace, I can only say: there is some
> bad shit there. The current VM stands out as a bloated pig:
> 
> > [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   0)     7696      16   lookup_address+0x28/0x30
> > [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> > [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> > [ 1065.604404] kworker/-5766    0d..2 1071625991us : stack_trace_call:   3)     7640     392   kernel_map_pages+0x6c/0x120
> > [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   4)     7248     256   get_page_from_freelist+0x489/0x920
> > [ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> 
> > [ 1065.604404] kworker/-5766    0d..2 1071625995us : stack_trace_call:  23)     4672     160   __swap_writepage+0x150/0x230
> > [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  24)     4512      32   swap_writepage+0x42/0x90
> > [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  25)     4480     320   shrink_page_list+0x676/0xa80
> > [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  26)     4160     208   shrink_inactive_list+0x262/0x4e0
> > [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> > [ 1065.604404] kworker/-5766    0d..2 1071625996us : stack_trace_call:  28)     3648      80   shrink_zone+0x3f/0x110
> > [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> > [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  30)     3440     208   try_to_free_pages+0xf7/0x1e0
> > [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> > [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  32)     2880       8   alloc_pages_current+0x10f/0x1f0
> > [ 1065.604404] kworker/-5766    0d..2 1071625997us : stack_trace_call:  33)     2872     200   __page_cache_alloc+0x13f/0x160
> 
> That __alloc_pages_nodemask() thing in particular looks bad. It
> actually seems not to be the usual "let's just allocate some
> structures on the stack" disease, it looks more like "lots of
> inlining, horrible calling conventions, and lots of random stupid
> variables".

Yes. For example, marking __alloc_pages_slowpath noinline_for_stack
reduces stack usage by 176 bytes. There are more places where we could
reduce stack consumption, but I thought that was a band-aid, although
reducing stack usage itself is desirable.

    before
    
    ffffffff81150600 <__alloc_pages_nodemask>:
    ffffffff81150600:	e8 fb f6 59 00       	callq  ffffffff816efd00 <__entry_text_start>
    ffffffff81150605:	55                   	push   %rbp
    ffffffff81150606:	b8 e8 e8 00 00       	mov    $0xe8e8,%eax
    ffffffff8115060b:	48 89 e5             	mov    %rsp,%rbp
    ffffffff8115060e:	41 57                	push   %r15
    ffffffff81150610:	41 56                	push   %r14
    ffffffff81150612:	41 be 22 01 32 01    	mov    $0x1320122,%r14d
    ffffffff81150618:	41 55                	push   %r13
    ffffffff8115061a:	41 54                	push   %r12
    ffffffff8115061c:	41 89 fc             	mov    %edi,%r12d
    ffffffff8115061f:	53                   	push   %rbx
    ffffffff81150620:	48 81 ec 28 01 00 00 	sub    $0x128,%rsp
    ffffffff81150627:	48 89 55 88          	mov    %rdx,-0x78(%rbp)
    ffffffff8115062b:	89 fa                	mov    %edi,%edx
    ffffffff8115062d:	83 e2 0f             	and    $0xf,%edx
    ffffffff81150630:	48 89 4d 90          	mov    %rcx,-0x70(%rbp)
    
    after:
    
    ffffffff81150600 <__alloc_pages_nodemask>:
    ffffffff81150600:	e8 7b f6 59 00       	callq  ffffffff816efc80 <__entry_text_start>
    ffffffff81150605:	55                   	push   %rbp
    ffffffff81150606:	b8 e8 e8 00 00       	mov    $0xe8e8,%eax
    ffffffff8115060b:	48 89 e5             	mov    %rsp,%rbp
    ffffffff8115060e:	41 57                	push   %r15
    ffffffff81150610:	41 bf 22 01 32 01    	mov    $0x1320122,%r15d
    ffffffff81150616:	41 56                	push   %r14
    ffffffff81150618:	41 55                	push   %r13
    ffffffff8115061a:	41 54                	push   %r12
    ffffffff8115061c:	41 89 fc             	mov    %edi,%r12d
    ffffffff8115061f:	53                   	push   %rbx
    ffffffff81150620:	48 83 ec 78          	sub    $0x78,%rsp
    ffffffff81150624:	48 89 55 a8          	mov    %rdx,-0x58(%rbp)
    ffffffff81150628:	89 fa                	mov    %edi,%edx
    ffffffff8115062a:	83 e2 0f             	and    $0xf,%edx
    ffffffff8115062d:	48 89 4d b0          	mov    %rcx,-0x50(%rbp)
    
> 
> From a quick glance at the frame usage, some of it seems to be gcc
> being rather bad at stack allocation, but lots of it is just nasty
> spilling around the disgusting call-sites with tons or arguments. A
> _lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
> "spill frame pointer", afaik).
> 
> Avoiding some inlining, and using a single flag value rather than the
> collection of "bool"s would probably help. But nothing really
> trivially obvious stands out.
> 
> But what *does* stand out (once again) is that we probably shouldn't
> do swap-out in direct reclaim. This came up the last time we had stack
> issues (XFS) too. I really do suspect that direct reclaim should only
> do the kind of reclaim that does not need any IO at all.
> 
> I think we _do_ generally avoid IO in direct reclaim, but swap is
> special. And not for a good reason, afaik. DaveC, remind me, I think
> you said something about the swap case the last time this came up..
> 
>                   Linus
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/2] ftrace: print stack usage right before Oops
  2014-05-29  3:01 ` Steven Rostedt
@ 2014-05-29  3:49   ` Minchan Kim
  0 siblings, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29  3:49 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, mst, Dave Hansen

On Wed, May 28, 2014 at 11:01:25PM -0400, Steven Rostedt wrote:
> On Wed, 28 May 2014 15:53:58 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > While I played with my own feature(ex, something on the way to reclaim),
> > kernel went to oops easily. I guessed reason would be stack overflow
> > and wanted to prove it.
> > 
> > I found stack tracer which would be very useful for me but kernel went
> > oops before my user program gather the information via
> > "watch cat /sys/kernel/debug/tracing/stack_trace" so I couldn't get an
> > stack usage of each functions.
> > 
> > What I want was that emit the kernel stack usage when kernel goes oops.
> > 
> > This patch records callstack of max stack usage into ftrace buffer
> > right before Oops and print that information with ftrace_dump_on_oops.
> > At last, I can find a culprit. :)
> > 
> > The result is as follows.
> > 
> >   111.402376] ------------[ cut here ]------------
> > [  111.403077] kernel BUG at kernel/trace/trace_stack.c:177!
> > [  111.403831] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> > [  111.404635] Dumping ftrace buffer:
> > [  111.404781] ---------------------------------
> > [  111.404781]    <...>-15987   5d..2 111689526us : stack_trace_call:         Depth    Size   Location    (49 entries)
> > [  111.404781]         -----    ----   --------
> > [  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   0)     7216      24   __change_page_attr_set_clr+0xe0/0xb50
> > [  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   1)     7192     392   kernel_map_pages+0x6c/0x120
> > [  111.404781]    <...>-15987   5d..2 111689535us : stack_trace_call:   2)     6800     256   get_page_from_freelist+0x489/0x920
> > [  111.404781]    <...>-15987   5d..2 111689536us : stack_trace_call:   3)     6544     352   __alloc_pages_nodemask+0x5e1/0xb20
> > [  111.404781]    <...>-15987   5d..2 111689536us : stack_trace_call:   4)     6192       8   alloc_pages_current+0x10f/0x1f0
> > [  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   5)     6184     168   new_slab+0x2c5/0x370
> > [  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   6)     6016       8   __slab_alloc+0x3a9/0x501
> > [  111.404781]    <...>-15987   5d..2 111689537us : stack_trace_call:   7)     6008      80   __kmalloc+0x1cb/0x200
> > [  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:   8)     5928     376   vring_add_indirect+0x36/0x200
> 
> This is a different report than patch 2/2 has, but the numbers are the
> same. Are you sure that you used the posted config to get these
> crashes? I'm still having a hard time figuring out where these numbers
> are coming from. :-/

It was a different report, but the same path that blew up.
I just wanted to show one sample.

> 
> -- Steve
> 
> 
> > [  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:   9)     5552     144   virtqueue_add_sgs+0x2e2/0x320
> > [  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:  10)     5408     288   __virtblk_add_req+0xda/0x1b0
> > [  111.404781]    <...>-15987   5d..2 111689538us : stack_trace_call:  11)     5120      96   virtio_queue_rq+0xd3/0x1d0
> > [  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  12)     5024     128   __blk_mq_run_hw_queue+0x1ef/0x440
> > [  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  13)     4896      16   blk_mq_run_hw_queue+0x35/0x40
> > [  111.404781]    <...>-15987   5d..2 111689539us : stack_trace_call:  14)     4880      96   blk_mq_insert_requests+0xdb/0x160
> > [  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  15)     4784     112   blk_mq_flush_plug_list+0x12b/0x140
> > [  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  16)     4672     112   blk_flush_plug_list+0xc7/0x220
> > [  111.404781]    <...>-15987   5d..2 111689540us : stack_trace_call:  17)     4560      64   io_schedule_timeout+0x88/0x100
> > [  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  18)     4496     128   mempool_alloc+0x145/0x170
> > [  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  19)     4368      96   bio_alloc_bioset+0x10b/0x1d0
> > [  111.404781]    <...>-15987   5d..2 111689541us : stack_trace_call:  20)     4272      48   get_swap_bio+0x30/0x90
> > [  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  21)     4224     160   __swap_writepage+0x150/0x230
> > [  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  22)     4064      32   swap_writepage+0x42/0x90
> > [  111.404781]    <...>-15987   5d..2 111689542us : stack_trace_call:  23)     4032     320   shrink_page_list+0x676/0xa80
> > [  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  24)     3712     208   shrink_inactive_list+0x262/0x4e0
> > [  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  25)     3504     304   shrink_lruvec+0x3e1/0x6a0
> > [  111.404781]    <...>-15987   5d..2 111689543us : stack_trace_call:  26)     3200      80   shrink_zone+0x3f/0x110
> > [  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  27)     3120     128   do_try_to_free_pages+0x156/0x4c0
> > [  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  28)     2992     208   try_to_free_pages+0xf7/0x1e0
> > [  111.404781]    <...>-15987   5d..2 111689544us : stack_trace_call:  29)     2784     352   __alloc_pages_nodemask+0x783/0xb20
> > [  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  30)     2432       8   alloc_pages_current+0x10f/0x1f0
> > [  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  31)     2424     168   new_slab+0x2c5/0x370
> > [  111.404781]    <...>-15987   5d..2 111689545us : stack_trace_call:  32)     2256       8   __slab_alloc+0x3a9/0x501
> > [  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  33)     2248      80   kmem_cache_alloc+0x1ac/0x1c0
> > [  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  34)     2168     296   mempool_alloc_slab+0x15/0x20
> > [  111.404781]    <...>-15987   5d..2 111689546us : stack_trace_call:  35)     1872     128   mempool_alloc+0x5e/0x170
> > [  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  36)     1744      96   bio_alloc_bioset+0x10b/0x1d0
> > [  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  37)     1648      48   mpage_alloc+0x38/0xa0
> > [  111.404781]    <...>-15987   5d..2 111689547us : stack_trace_call:  38)     1600     208   do_mpage_readpage+0x49b/0x5d0
> > [  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  39)     1392     224   mpage_readpages+0xcf/0x120
> > [  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  40)     1168      48   ext4_readpages+0x45/0x60
> > [  111.404781]    <...>-15987   5d..2 111689548us : stack_trace_call:  41)     1120     224   __do_page_cache_readahead+0x222/0x2d0
> > [  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  42)      896      16   ra_submit+0x21/0x30
> > [  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  43)      880     112   filemap_fault+0x2d7/0x4f0
> > [  111.404781]    <...>-15987   5d..2 111689549us : stack_trace_call:  44)      768     144   __do_fault+0x6d/0x4c0
> > [  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  45)      624     160   handle_mm_fault+0x1a6/0xaf0
> > [  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  46)      464     272   __do_page_fault+0x18a/0x590
> > [  111.404781]    <...>-15987   5d..2 111689550us : stack_trace_call:  47)      192      16   do_page_fault+0xc/0x10
> > [  111.404781]    <...>-15987   5d..2 111689551us : stack_trace_call:  48)      176     176   page_fault+0x22/0x30
> > [  111.404781] ---------------------------------
> > [  111.404781] Modules linked in:
> > [  111.404781] CPU: 5 PID: 15987 Comm: cc1 Not tainted 3.14.0+ #162
> > [  111.404781] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
> > [  111.404781] task: ffff880008a4a0e0 ti: ffff88000002c000 task.ti: ffff88000002c000
> > [  111.404781] RIP: 0010:[<ffffffff8112340f>]  [<ffffffff8112340f>] stack_trace_call+0x37f/0x390
> > [  111.404781] RSP: 0000:ffff88000002c2b0  EFLAGS: 00010092
> > [  111.404781] RAX: ffff88000002c000 RBX: 0000000000000005 RCX: 0000000000000002
> > [  111.404781] RDX: 0000000000000006 RSI: 0000000000000002 RDI: ffff88002780be00
> > [  111.404781] RBP: ffff88000002c310 R08: 00000000000009e8 R09: ffffffffffffffff
> > [  111.404781] R10: ffff88000002dfd8 R11: 0000000000000001 R12: 000000000000f2e8
> > [  111.404781] R13: 0000000000000005 R14: ffffffff82768dfc R15: 00000000000000f8
> > [  111.404781] FS:  00002ae66a6e4640(0000) GS:ffff880027ca0000(0000) knlGS:0000000000000000
> > [  111.404781] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [  111.404781] CR2: 00002ba016c8e004 CR3: 00000000045b7000 CR4: 00000000000006e0
> > [  111.404781] Stack:
> > [  111.404781]  0000000000000005 ffffffff81042410 0000000000000087 0000000000001c30
> > [  111.404781]  ffff88000002c000 00002ae66a6f3000 ffffffffffffe000 0000000000000002
> > [  111.404781]  ffff88000002c510 ffff880000d04000 ffff88000002c4b8 0000000000000002
> > [  111.404781] Call Trace:
> > [  111.404781]  [<ffffffff81042410>] ? __change_page_attr_set_clr+0xe0/0xb50
> > [  111.404781]  [<ffffffff816efdff>] ftrace_call+0x5/0x2f
> > [  111.404781]  [<ffffffff81004ba7>] ? dump_trace+0x177/0x2b0
> > [  111.404781]  [<ffffffff81041a65>] ? _lookup_address_cpa.isra.3+0x5/0x40
> > [  111.404781]  [<ffffffff81041a65>] ? _lookup_address_cpa.isra.3+0x5/0x40
> > [  111.404781]  [<ffffffff81042410>] ? __change_page_attr_set_clr+0xe0/0xb50
> > [  111.404781]  [<ffffffff811231a9>] ? stack_trace_call+0x119/0x390
> > [  111.404781]  [<ffffffff81043eac>] ? kernel_map_pages+0x6c/0x120
> > [  111.404781]  [<ffffffff810a22dd>] ? trace_hardirqs_off+0xd/0x10
> > [  111.404781]  [<ffffffff81150131>] ? get_page_from_freelist+0x3d1/0x920
> > [  111.404781]  [<ffffffff81043eac>] kernel_map_pages+0x6c/0x120
> > [  111.404781]  [<ffffffff811501e9>] get_page_from_freelist+0x489/0x920
> > [  111.404781]  [<ffffffff81150c61>] __alloc_pages_nodemask+0x5e1/0xb20
> > [  111.404781]  [<ffffffff8119188f>] alloc_pages_current+0x10f/0x1f0
> > [  111.404781]  [<ffffffff8119ac35>] ? new_slab+0x2c5/0x370
> > [  111.404781]  [<ffffffff8119ac35>] new_slab+0x2c5/0x370
> > [  111.404781]  [<ffffffff816dbfc9>] __slab_alloc+0x3a9/0x501
> > [  111.404781]  [<ffffffff8119beeb>] ? __kmalloc+0x1cb/0x200
> > [  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
> > [  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
> > [  111.404781]  [<ffffffff8141eba6>] ? vring_add_indirect+0x36/0x200
> > [  111.404781]  [<ffffffff8119beeb>] __kmalloc+0x1cb/0x200
> > [  111.404781]  [<ffffffff8141ed70>] ? vring_add_indirect+0x200/0x200
> > [  111.404781]  [<ffffffff8141eba6>] vring_add_indirect+0x36/0x200
> > [  111.404781]  [<ffffffff8141f362>] virtqueue_add_sgs+0x2e2/0x320
> > [  111.404781]  [<ffffffff8148f2ba>] __virtblk_add_req+0xda/0x1b0
> > [  111.404781]  [<ffffffff813780c5>] ? __delay+0x5/0x20
> > [  111.404781]  [<ffffffff8148f463>] virtio_queue_rq+0xd3/0x1d0
> > [  111.404781]  [<ffffffff8134b96f>] __blk_mq_run_hw_queue+0x1ef/0x440
> > [  111.404781]  [<ffffffff8134c035>] blk_mq_run_hw_queue+0x35/0x40
> > [  111.404781]  [<ffffffff8134c71b>] blk_mq_insert_requests+0xdb/0x160
> > [  111.404781]  [<ffffffff8134cdbb>] blk_mq_flush_plug_list+0x12b/0x140
> > [  111.404781]  [<ffffffff810c5ab5>] ? ktime_get_ts+0x125/0x150
> > [  111.404781]  [<ffffffff81343197>] blk_flush_plug_list+0xc7/0x220
> > [  111.404781]  [<ffffffff816e70bf>] ? _raw_spin_unlock_irqrestore+0x3f/0x70
> > [  111.404781]  [<ffffffff816e26b8>] io_schedule_timeout+0x88/0x100
> > [  111.404781]  [<ffffffff816e2635>] ? io_schedule_timeout+0x5/0x100
> > [  111.404781]  [<ffffffff81149465>] mempool_alloc+0x145/0x170
> > [  111.404781]  [<ffffffff8109baf0>] ? __init_waitqueue_head+0x60/0x60
> > [  111.404781]  [<ffffffff811e33cb>] bio_alloc_bioset+0x10b/0x1d0
> > [  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
> > [  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
> > [  111.404781]  [<ffffffff81184160>] get_swap_bio+0x30/0x90
> > [  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
> > [  111.404781]  [<ffffffff811846b0>] __swap_writepage+0x150/0x230
> > [  111.404781]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
> > [  111.404781]  [<ffffffff81184565>] ? __swap_writepage+0x5/0x230
> > [  111.404781]  [<ffffffff811847d2>] swap_writepage+0x42/0x90
> > [  111.404781]  [<ffffffff8115aee6>] shrink_page_list+0x676/0xa80
> > [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> > [  111.404781]  [<ffffffff8115b8c2>] shrink_inactive_list+0x262/0x4e0
> > [  111.404781]  [<ffffffff8115c211>] shrink_lruvec+0x3e1/0x6a0
> > [  111.404781]  [<ffffffff8115c50f>] shrink_zone+0x3f/0x110
> > [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> > [  111.404781]  [<ffffffff8115ca36>] do_try_to_free_pages+0x156/0x4c0
> > [  111.404781]  [<ffffffff8115cf97>] try_to_free_pages+0xf7/0x1e0
> > [  111.404781]  [<ffffffff81150e03>] __alloc_pages_nodemask+0x783/0xb20
> > [  111.404781]  [<ffffffff8119188f>] alloc_pages_current+0x10f/0x1f0
> > [  111.404781]  [<ffffffff8119ac35>] ? new_slab+0x2c5/0x370
> > [  111.404781]  [<ffffffff8119ac35>] new_slab+0x2c5/0x370
> > [  111.404781]  [<ffffffff816dbfc9>] __slab_alloc+0x3a9/0x501
> > [  111.404781]  [<ffffffff8119d95c>] ? kmem_cache_alloc+0x1ac/0x1c0
> > [  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
> > [  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
> > [  111.404781]  [<ffffffff8119d95c>] kmem_cache_alloc+0x1ac/0x1c0
> > [  111.404781]  [<ffffffff81149025>] ? mempool_alloc_slab+0x15/0x20
> > [  111.404781]  [<ffffffff81149025>] mempool_alloc_slab+0x15/0x20
> > [  111.404781]  [<ffffffff8114937e>] mempool_alloc+0x5e/0x170
> > [  111.404781]  [<ffffffff811e33cb>] bio_alloc_bioset+0x10b/0x1d0
> > [  111.404781]  [<ffffffff811ea618>] mpage_alloc+0x38/0xa0
> > [  111.404781]  [<ffffffff811eb2eb>] do_mpage_readpage+0x49b/0x5d0
> > [  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
> > [  111.404781]  [<ffffffff811eb55f>] mpage_readpages+0xcf/0x120
> > [  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
> > [  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
> > [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> > [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> > [  111.404781]  [<ffffffff81153e21>] ? __do_page_cache_readahead+0xc1/0x2d0
> > [  111.404781]  [<ffffffff812512f0>] ? ext4_get_block_write+0x20/0x20
> > [  111.404781]  [<ffffffff8124d045>] ext4_readpages+0x45/0x60
> > [  111.404781]  [<ffffffff81153f82>] __do_page_cache_readahead+0x222/0x2d0
> > [  111.404781]  [<ffffffff81153e21>] ? __do_page_cache_readahead+0xc1/0x2d0
> > [  111.404781]  [<ffffffff811541c1>] ra_submit+0x21/0x30
> > [  111.404781]  [<ffffffff811482f7>] filemap_fault+0x2d7/0x4f0
> > [  111.404781]  [<ffffffff8116f3ad>] __do_fault+0x6d/0x4c0
> > [  111.404781]  [<ffffffff81172596>] handle_mm_fault+0x1a6/0xaf0
> > [  111.404781]  [<ffffffff816eb1aa>] __do_page_fault+0x18a/0x590
> > [  111.404781]  [<ffffffff816efdff>] ? ftrace_call+0x5/0x2f
> > [  111.404781]  [<ffffffff81081e9c>] ? finish_task_switch+0x7c/0x120
> > [  111.404781]  [<ffffffff81081e5f>] ? finish_task_switch+0x3f/0x120
> > [  111.404781]  [<ffffffff816eb5bc>] do_page_fault+0xc/0x10
> > [  111.404781]  [<ffffffff816e7a52>] page_fault+0x22/0x30
> > 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  kernel/trace/trace_stack.c | 32 ++++++++++++++++++++++++++++++--
> >  1 file changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> > index 5aa9a5b9b6e2..5eb88e60bc5e 100644
> > --- a/kernel/trace/trace_stack.c
> > +++ b/kernel/trace/trace_stack.c
> > @@ -51,6 +51,30 @@ static DEFINE_MUTEX(stack_sysctl_mutex);
> >  int stack_tracer_enabled;
> >  static int last_stack_tracer_enabled;
> >  
> > +static inline void print_max_stack(void)
> > +{
> > +	long i;
> > +	int size;
> > +
> > +	trace_printk("        Depth    Size   Location"
> > +			   "    (%d entries)\n"
> > +			   "        -----    ----   --------\n",
> > +			   max_stack_trace.nr_entries - 1);
> > +
> > +	for (i = 0; i < max_stack_trace.nr_entries; i++) {
> > +		if (stack_dump_trace[i] == ULONG_MAX)
> > +			break;
> > +		if (i+1 == max_stack_trace.nr_entries ||
> > +				stack_dump_trace[i+1] == ULONG_MAX)
> > +			size = stack_dump_index[i];
> > +		else
> > +			size = stack_dump_index[i] - stack_dump_index[i+1];
> > +
> > +		trace_printk("%3ld) %8d   %5d   %pS\n", i, stack_dump_index[i],
> > +				size, (void *)stack_dump_trace[i]);
> > +	}
> > +}
> > +
> >  static inline void
> >  check_stack(unsigned long ip, unsigned long *stack)
> >  {
> > @@ -149,8 +173,12 @@ check_stack(unsigned long ip, unsigned long *stack)
> >  			i++;
> >  	}
> >  
> > -	BUG_ON(current != &init_task &&
> > -		*(end_of_stack(current)) != STACK_END_MAGIC);
> > +	if ((current != &init_task &&
> > +		*(end_of_stack(current)) != STACK_END_MAGIC)) {
> > +		print_max_stack();
> > +		BUG();
> > +	}
> > +
> >   out:
> >  	arch_spin_unlock(&max_stack_lock);
> >  	local_irq_restore(flags);
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 1/2] ftrace: print stack usage right before Oops
  2014-05-28 16:18 ` [PATCH 1/2] ftrace: print stack usage right before Oops Steven Rostedt
@ 2014-05-29  3:52   ` Minchan Kim
  0 siblings, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29  3:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, Andrew Morton, linux-mm, H. Peter Anvin,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, rusty, mst, Dave Hansen

On Wed, May 28, 2014 at 12:18:32PM -0400, Steven Rostedt wrote:
> On Wed, 28 May 2014 15:53:58 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > While I played with my own feature(ex, something on the way to reclaim),
> > kernel went to oops easily. I guessed reason would be stack overflow
> > and wanted to prove it.
> > 
> > I found stack tracer which would be very useful for me but kernel went
> > oops before my user program gather the information via
> > "watch cat /sys/kernel/debug/tracing/stack_trace" so I couldn't get an
> > stack usage of each functions.
> > 
> > What I want was that emit the kernel stack usage when kernel goes oops.
> > 
> > This patch records callstack of max stack usage into ftrace buffer
> > right before Oops and print that information with ftrace_dump_on_oops.
> > At last, I can find a culprit. :)
> > 
> 
> This is not dependent on patch 2/2, nor is 2/2 dependent on this patch,
> I'll review this as if 2/2 does not exist.

Yep, thanks!

> 
> 
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> >  kernel/trace/trace_stack.c | 32 ++++++++++++++++++++++++++++++--
> >  1 file changed, 30 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_stack.c b/kernel/trace/trace_stack.c
> > index 5aa9a5b9b6e2..5eb88e60bc5e 100644
> > --- a/kernel/trace/trace_stack.c
> > +++ b/kernel/trace/trace_stack.c
> > @@ -51,6 +51,30 @@ static DEFINE_MUTEX(stack_sysctl_mutex);
> >  int stack_tracer_enabled;
> >  static int last_stack_tracer_enabled;
> >  
> > +static inline void print_max_stack(void)
> > +{
> > +	long i;
> > +	int size;
> > +
> > +	trace_printk("        Depth    Size   Location"
> > +			   "    (%d entries)\n"
> 
> Please do not break strings just to satisfy that silly 80 character
> limit. Even Linus Torvalds said that's pretty stupid.

I just copied existing code from trace_stack.c.
Okay, I will fix. :)

> 
> Also, do not use trace_printk(). It is not made to be included in a
> production kernel. It reserves special buffers to make it as fast as
> possible, and those buffers should not be created in production
> systems. In fact, I will probably add for 3.16 a big warning message
> when trace_printk() is used.
> 
> Since this is a bug, why not just use printk() instead?

Thanks for the info. I will use printk(KERN_EMERG).

> 
> BTW, wouldn't this this function crash as well if the stack is already
> bad?

It didn't crash until the code consumed data from the corrupted thread_info.

> 
> -- Steve
> 
> > +			   "        -----    ----   --------\n",
> > +			   max_stack_trace.nr_entries - 1);
> > +
> > +	for (i = 0; i < max_stack_trace.nr_entries; i++) {
> > +		if (stack_dump_trace[i] == ULONG_MAX)
> > +			break;
> > +		if (i+1 == max_stack_trace.nr_entries ||
> > +				stack_dump_trace[i+1] == ULONG_MAX)
> > +			size = stack_dump_index[i];
> > +		else
> > +			size = stack_dump_index[i] - stack_dump_index[i+1];
> > +
> > +		trace_printk("%3ld) %8d   %5d   %pS\n", i, stack_dump_index[i],
> > +				size, (void *)stack_dump_trace[i]);
> > +	}
> > +}
> > +
> >  static inline void
> >  check_stack(unsigned long ip, unsigned long *stack)
> >  {
> > @@ -149,8 +173,12 @@ check_stack(unsigned long ip, unsigned long *stack)
> >  			i++;
> >  	}
> >  
> > -	BUG_ON(current != &init_task &&
> > -		*(end_of_stack(current)) != STACK_END_MAGIC);
> > +	if ((current != &init_task &&
> > +		*(end_of_stack(current)) != STACK_END_MAGIC)) {
> > +		print_max_stack();
> > +		BUG();
> > +	}
> > +
> >   out:
> >  	arch_spin_unlock(&max_stack_lock);
> >  	local_irq_restore(flags);
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 107+ messages in thread

* virtio_ring stack usage.
  2014-05-28  9:04   ` Michael S. Tsirkin
  2014-05-29  1:09     ` Minchan Kim
@ 2014-05-29  4:10     ` Rusty Russell
  1 sibling, 0 replies; 107+ messages in thread
From: Rusty Russell @ 2014-05-29  4:10 UTC (permalink / raw)
  To: Michael S. Tsirkin, Minchan Kim; +Cc: linux-kernel

"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
>> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
>> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
>> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
>> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
>> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
>
> virtio stack usage seems very high.
> Here is virtio_ring.su generated using -fstack-usage flag for gcc 4.8.2.
>
> virtio_ring.c:107:35:sg_next_arr        16      static
...
> <--- this is a surprise, I really expected it to be inlined
>      same for sg_next_chained.
> <--- Rusty: should we force compiler to inline it?

Extra cc's dropped.

Weird, works here (gcc 4.8.2, 32 bit).  Hmm, same with 64 bit:

gcc -Wp,-MD,drivers/virtio/.virtio_ring.o.d  -nostdinc -isystem /usr/lib/gcc/x86_64-linux-gnu/4.8/include -I/home/rusty/devel/kernel/linux/arch/x86/include -Iarch/x86/include/generated  -Iinclude -I/home/rusty/devel/kernel/linux/arch/x86/include/uapi -Iarch/x86/include/generated/uapi -I/home/rusty/devel/kernel/linux/include/uapi -Iinclude/generated/uapi -include /home/rusty/devel/kernel/linux/include/linux/kconfig.h -D__KERNEL__ -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -Werror-implicit-function-declaration -Wno-format-security -fno-delete-null-pointer-checks -O2 -m64 -mno-mmx -mno-sse -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -march=core2 -mno-red-zone -mcmodel=kernel -funit-at-a-time -maccumulate-outgoing-args -DCONFIG_AS_CFI=1 -DCONFIG_AS_CFI_SIGNAL_FRAME=1 -DCONFIG_AS_CFI_SECTIONS=1 -DCONFIG_AS_FXSAVEQ=1 -DCONFIG_AS_CRC32=1 -DCONFIG_AS_AVX=1 -DCONFIG_AS_AVX2=1 -pipe -Wno-sign-compare -fno-asynchronous-unwind-tables -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -Wframe-larger-than=1024 -fno-stack-protector -Wno-unused-but-set-variable -fno-omit-frame-pointer -fno-optimize-sibling-calls -Wdeclaration-after-statement -Wno-pointer-sign -fno-strict-overflow -fconserve-stack -Werror=implicit-int -Werror=strict-prototypes -DCC_HAVE_ASM_GOTO    -D"KBUILD_STR(s)=#s" -D"KBUILD_BASENAME=KBUILD_STR(virtio_ring)"  -D"KBUILD_MODNAME=KBUILD_STR(virtio_ring)" -c -o drivers/virtio/virtio_ring.o drivers/virtio/virtio_ring.c

$ objdump -dr drivers/virtio/virtio_ring.o | grep sg_next
			988: R_X86_64_PC32	sg_next-0x4
			9d8: R_X86_64_PC32	sg_next-0x4
			ae9: R_X86_64_PC32	sg_next-0x4
			b99: R_X86_64_PC32	sg_next-0x4
			d31: R_X86_64_PC32	sg_next-0x4
			df1: R_X86_64_PC32	sg_next-0x4
$

It's worth noting that older GCCs would sometimes successfully inline
the indirect function (i.e. sg_next_chained and sg_next_arr) but still
emit an unused copy.  Is that happening for you too?

I added a hack to actually measure how much stack we're using (x86-64):

gcc 4.8.4:
[    3.261826] virtio_blk: stack used = 408

gcc 4.6:
[    3.276449] virtio_blk: stack depth = 448

Here's the hack I used:

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 6d8a87f252de..bcd6336e3561 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -151,15 +151,19 @@ static void virtblk_done(struct virtqueue *vq)
 		blk_mq_start_stopped_hw_queues(vblk->disk->queue);
 }
 
+extern struct task_struct *record_stack;
+extern unsigned long stack_top;
+
 static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
+	unsigned long stack_bottom = (unsigned long)&stack_bottom;
 	struct virtio_blk *vblk = hctx->queue->queuedata;
 	struct virtblk_req *vbr = req->special;
 	unsigned long flags;
 	unsigned int num;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
-
+	
 	BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
 
 	vbr->req = req;
@@ -199,7 +203,10 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	}
 
 	spin_lock_irqsave(&vblk->vq_lock, flags);
+	record_stack = current;
 	err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
+	record_stack = NULL;
+	printk("virtio_blk: stack used = %lu\n", stack_bottom - stack_top);
 	if (err) {
 		virtqueue_kick(vblk->vq);
 		spin_unlock_irqrestore(&vblk->vq_lock, flags);
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 1e443629f76d..39158d6079a9 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -113,6 +113,14 @@ static inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
 	return sg + 1;
 }
 
+extern struct task_struct *record_stack;
+struct task_struct *record_stack;
+EXPORT_SYMBOL(record_stack);
+
+extern unsigned long stack_top;
+unsigned long stack_top;
+EXPORT_SYMBOL(stack_top);
+
 /* Set up an indirect table of descriptors and add it to the queue. */
 static inline int vring_add_indirect(struct vring_virtqueue *vq,
 				     struct scatterlist *sgs[],
@@ -141,6 +149,9 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	if (!desc)
 		return -ENOMEM;
 
+	if (record_stack == current)
+		stack_top = (unsigned long)&desc;
+
 	/* Transfer entries from the sg lists into the indirect page */
 	i = 0;
 	for (n = 0; n < out_sgs; n++) {

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  2:44       ` Steven Rostedt
@ 2014-05-29  4:11         ` Minchan Kim
  0 siblings, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29  4:11 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Michael S. Tsirkin, linux-kernel, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, rusty, Dave Hansen

On Wed, May 28, 2014 at 10:44:48PM -0400, Steven Rostedt wrote:
> On Thu, 29 May 2014 10:09:40 +0900
> Minchan Kim <minchan@kernel.org> wrote:
> 
> > stacktrace reported that vring_add_indirect used 376byte and objdump says
> > 
> > ffffffff8141dc60 <vring_add_indirect>:
> > ffffffff8141dc60:       55                      push   %rbp
> > ffffffff8141dc61:       48 89 e5                mov    %rsp,%rbp
> > ffffffff8141dc64:       41 57                   push   %r15
> > ffffffff8141dc66:       41 56                   push   %r14
> > ffffffff8141dc68:       41 55                   push   %r13
> > ffffffff8141dc6a:       49 89 fd                mov    %rdi,%r13
> > ffffffff8141dc6d:       89 cf                   mov    %ecx,%edi
> > ffffffff8141dc6f:       48 c1 e7 04             shl    $0x4,%rdi
> > ffffffff8141dc73:       41 54                   push   %r12
> > ffffffff8141dc75:       49 89 d4                mov    %rdx,%r12
> > ffffffff8141dc78:       53                      push   %rbx
> > ffffffff8141dc79:       48 89 f3                mov    %rsi,%rbx
> > ffffffff8141dc7c:       48 83 ec 28             sub    $0x28,%rsp
> > ffffffff8141dc80:       8b 75 20                mov    0x20(%rbp),%esi
> > ffffffff8141dc83:       89 4d bc                mov    %ecx,-0x44(%rbp)
> > ffffffff8141dc86:       44 89 45 cc             mov    %r8d,-0x34(%rbp)
> > ffffffff8141dc8a:       44 89 4d c8             mov    %r9d,-0x38(%rbp)
> > ffffffff8141dc8e:       83 e6 dd                and    $0xffffffdd,%esi
> > ffffffff8141dc91:       e8 7a d1 d7 ff          callq  ffffffff8119ae10 <__kmalloc>
> > ffffffff8141dc96:       48 85 c0                test   %rax,%rax
> > 
> > So, it's *strange*.
> > 
> > I will add .config and .o.
> > Maybe someone might find what happens.
> > 
> 
> This is really bothering me. I'm trying to figure it out. We have from
> the stack trace:
> 
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> 
> The way the stack tracer works, is that when it detects a new max stack
> it calls save_stack_trace() to get the complete call chain from the
> stack. This should be rather accurate as it seems that your kernel was
> compiled with frame pointers (confirmed by the objdump as well as the
> config file). It then uses that stack trace that it got to examine the
> stack to find the locations of the saved return addresses and records
> them in an array (in your case, an array of 50 entries).
> 
> From your .o file:
> 
> vring_add_indirect + 0x36: (0x370 + 0x36 = 0x3a6)
> 
> 0000000000000370 <vring_add_indirect>:
> 
>  39e:   83 e6 dd                and    $0xffffffdd,%esi
>  3a1:   e8 00 00 00 00          callq  3a6 <vring_add_indirect+0x36>
>                         3a2: R_X86_64_PC32      __kmalloc-0x4
>  3a6:   48 85 c0                test   %rax,%rax
> 
> Definitely the return address to the call to __kmalloc. Then to
> determine the size of the stack frame, it is subtracted from the next
> one down. In this case, the location of virtqueue_add_sgs+0x2e2.
> 
> virtqueue_add_sgs + 0x2e2: (0x880 + 0x2e2 = 0xb62)
> 
> 0000000000000880 <virtqueue_add_sgs>:
> 
> b4f:   89 4c 24 08             mov    %ecx,0x8(%rsp)
>  b53:   48 c7 c2 00 00 00 00    mov    $0x0,%rdx
>                         b56: R_X86_64_32S       .text+0x570
>  b5a:   44 89 d1                mov    %r10d,%ecx
>  b5d:   e8 0e f8 ff ff          callq  370 <vring_add_indirect>
>  b62:   85 c0                   test   %eax,%eax
> 
> 
> Which is the return address of where vring_add_indirect was called.
> 
> The return address back to virtqueue_add_sgs was found at 6000 bytes of
> the stack. The return address back to vring_add_indirect was found at
> 6376 bytes from the top of the stack.
> 
> My question is, why were they so far apart? I see 6 words pushed
> (8bytes each, for a total of 48 bytes), and a subtraction of the stack
> pointer of 0x28 (40 bytes) giving us a total of 88 bytes. Plus we need
> to add the push of the return address itself which would just give us
> 96 bytes for the stack frame. What is making this show 376 bytes??

That's what I want to know. :(

> 
> Looking more into this, I'm not sure I trust the top numbers anymore.
> kmalloc reports a stack frame of 80, and I'm coming up with 104
> (perhaps even 112). And slab_alloc only has 8. Something's messed up there.

Yes, it looks weird, but some of the upper functions in the callstack
match up well, so maybe only the top functions of the callstack were
corrupted. But the alloc_pages_current (8 bytes) case looks weird too:
it reports the same 8-byte value near both the top and the bottom of
the callstack. :(
> 
> -- Steve
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  3:46     ` Minchan Kim
@ 2014-05-29  4:13       ` Linus Torvalds
  2014-05-29  5:10         ` Minchan Kim
  0 siblings, 1 reply; 107+ messages in thread
From: Linus Torvalds @ 2014-05-29  4:13 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Wed, May 28, 2014 at 8:46 PM, Minchan Kim <minchan@kernel.org> wrote:
>
> Yes. For example, with mark __alloc_pages_slowpath noinline_for_stack,
> we can reduce 176byte.

Well, but it will then call that __alloc_pages_slowpath() function,
which has a 176-byte stack frame.. Plus the call frame.

Now, that only triggers for when the initial "__GFP_HARDWALL" case
fails, but that's exactly what happens when we do need to do direct
reclaim.

That said, I *have* seen cases where the gcc spill code got really
confused, and simplifying the function (by not inlining excessively)
actually causes a truly smaller stack overall, despite the actual call
frames etc.  But I think the gcc people fixed the kinds of things that
caused *that* kind of stack slot explosion.

And avoiding inlining can end up resulting in less stack, if the
really deep parts don't happen to go through that function that got
inlined (ie any call chain that wouldn't have gone through that
"slowpath" function at all).

But in this case, __alloc_pages_slowpath() is where we end up doing
the actual direct reclaim anyway, so just uninlining doesn't actually
help. Although it would probably make the asm code more readable ;)

              Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  4:13       ` Linus Torvalds
@ 2014-05-29  5:10         ` Minchan Kim
  0 siblings, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29  5:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Wed, May 28, 2014 at 09:13:15PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 8:46 PM, Minchan Kim <minchan@kernel.org> wrote:
> >
> > Yes. For example, with mark __alloc_pages_slowpath noinline_for_stack,
> > we can reduce 176byte.
> 
> Well, but it will then call that __alloc_pages_slowpath() function,
> which has a 176-byte stack frame.. Plus the call frame.
> 
> Now, that only triggers for when the initial "__GFP_HARDWALL" case
> fails, but that's exactly what happens when we do need to do direct
> reclaim.
> 
> That said, I *have* seen cases where the gcc spill code got really
> confused, and simplifying the function (by not inlining excessively)
> actually causes a truly smaller stack overall, despite the actual call
> frames etc.  But I think the gcc people fixed the kinds of things that
> caused *that* kind of stack slot explosion.
> 
> And avoiding inlining can end up resulting in less stack, if the
> really deep parts don't happen to go through that function that got
> inlined (ie any call chain that wouldn't have gone through that
> "slowpath" function at all).
> 
> But in this case, __alloc_pages_slowpath() is where we end up doing
> the actual direct reclaim anyway, so just uninlining doesn't actually
> help. Although it would probably make the asm code more readable ;)

Indeed. :(

Actually, I found other places to optimize. For example, we can
uninline try_preserve_large_page in __change_page_attr_set_clr.
Although I'm not familiar with that part, I guess large pages would be
rare, so we could save 112 bytes.
    
    before:
    
    ffffffff81042330 <__change_page_attr_set_clr>:
    ffffffff81042330:	e8 4b da 6a 00       	callq  ffffffff816efd80 <__entry_text_start>
    ffffffff81042335:	55                   	push   %rbp
    ffffffff81042336:	48 89 e5             	mov    %rsp,%rbp
    ffffffff81042339:	41 57                	push   %r15
    ffffffff8104233b:	41 56                	push   %r14
    ffffffff8104233d:	41 55                	push   %r13
    ffffffff8104233f:	41 54                	push   %r12
    ffffffff81042341:	49 89 fc             	mov    %rdi,%r12
    ffffffff81042344:	53                   	push   %rbx
    ffffffff81042345:	48 81 ec f8 00 00 00 	sub    $0xf8,%rsp
    ffffffff8104234c:	8b 47 20             	mov    0x20(%rdi),%eax
    ffffffff8104234f:	89 b5 50 ff ff ff    	mov    %esi,-0xb0(%rbp)
    ffffffff81042355:	85 c0                	test   %eax,%eax
    ffffffff81042357:	89 85 5c ff ff ff    	mov    %eax,-0xa4(%rbp)
    ffffffff8104235d:	0f 84 8c 06 00 00    	je     ffffffff810429ef <__change_page_attr_set_clr+0x6bf>
    
    after:
    
    ffffffff81042740 <__change_page_attr_set_clr>:
    ffffffff81042740:	e8 bb d5 6a 00       	callq  ffffffff816efd00 <__entry_text_start>
    ffffffff81042745:	55                   	push   %rbp
    ffffffff81042746:	48 89 e5             	mov    %rsp,%rbp
    ffffffff81042749:	41 57                	push   %r15
    ffffffff8104274b:	41 56                	push   %r14
    ffffffff8104274d:	41 55                	push   %r13
    ffffffff8104274f:	49 89 fd             	mov    %rdi,%r13
    ffffffff81042752:	41 54                	push   %r12
    ffffffff81042754:	53                   	push   %rbx
    ffffffff81042755:	48 81 ec 88 00 00 00 	sub    $0x88,%rsp
    ffffffff8104275c:	8b 47 20             	mov    0x20(%rdi),%eax
    ffffffff8104275f:	89 b5 70 ff ff ff    	mov    %esi,-0x90(%rbp)
    ffffffff81042765:	85 c0                	test   %eax,%eax
    ffffffff81042767:	89 85 74 ff ff ff    	mov    %eax,-0x8c(%rbp)
    ffffffff8104276d:	0f 84 cb 02 00 00    	je     ffffffff81042a3e <__change_page_attr_set_clr+0x2fe>
    

And the patch below saves 96 bytes from shrink_lruvec.

That wouldn't be all, and I am not saying that optimizing every VM
function is the way to go, but I just want to note that we have room to
optimize. I will wait for more discussion and am happy to test (I can
reproduce the overflow in 1~2 hours if I'm lucky).

Thanks!
    
    ffffffff8115b560 <shrink_lruvec>:
    ffffffff8115b560:	e8 db 46 59 00       	callq  ffffffff816efc40 <__entry_text_start>
    ffffffff8115b565:	55                   	push   %rbp
    ffffffff8115b566:	65 48 8b 04 25 40 ba 	mov    %gs:0xba40,%rax
    ffffffff8115b56d:	00 00
    ffffffff8115b56f:	48 89 e5             	mov    %rsp,%rbp
    ffffffff8115b572:	41 57                	push   %r15
    ffffffff8115b574:	41 56                	push   %r14
    ffffffff8115b576:	45 31 f6             	xor    %r14d,%r14d
    ffffffff8115b579:	41 55                	push   %r13
    ffffffff8115b57b:	49 89 fd             	mov    %rdi,%r13
    ffffffff8115b57e:	41 54                	push   %r12
    ffffffff8115b580:	49 89 f4             	mov    %rsi,%r12
    ffffffff8115b583:	49 83 c4 34          	add    $0x34,%r12
    ffffffff8115b587:	53                   	push   %rbx
    ffffffff8115b588:	48 8d 9f c8 fa ff ff 	lea    -0x538(%rdi),%rbx
    ffffffff8115b58f:	48 81 ec f8 00 00 00 	sub    $0xf8,%rsp
    ffffffff8115b596:	f6 40 16 04          	testb  $0x4,0x16(%rax)
    
    after
    
    ffffffff8115b870 <shrink_lruvec>:
    ffffffff8115b870:	e8 8b 43 59 00       	callq  ffffffff816efc00 <__entry_text_start>
    ffffffff8115b875:	55                   	push   %rbp
    ffffffff8115b876:	48 8d 56 34          	lea    0x34(%rsi),%rdx
    ffffffff8115b87a:	48 89 e5             	mov    %rsp,%rbp
    ffffffff8115b87d:	41 57                	push   %r15
    ffffffff8115b87f:	41 bf 20 00 00 00    	mov    $0x20,%r15d
    ffffffff8115b885:	48 8d 4d 90          	lea    -0x70(%rbp),%rcx
    ffffffff8115b889:	41 56                	push   %r14
    ffffffff8115b88b:	49 89 f6             	mov    %rsi,%r14
    ffffffff8115b88e:	48 8d 76 2c          	lea    0x2c(%rsi),%rsi
    ffffffff8115b892:	41 55                	push   %r13
    ffffffff8115b894:	49 89 fd             	mov    %rdi,%r13
    ffffffff8115b897:	41 54                	push   %r12
    ffffffff8115b899:	45 31 e4             	xor    %r12d,%r12d
    ffffffff8115b89c:	53                   	push   %rbx
    ffffffff8115b89d:	48 81 ec 98 00 00 00 	sub    $0x98,%rsp
    ffffffff8115b8a4:	e8 47 df ff ff       	callq  ffffffff811597f0 <get_scan_count.isra.60>
    ffffffff8115b8a9:	48 8b 45 90          	mov    -0x70(%rbp),%rax

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9b61b9bf81ac..574f9ce838b3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -165,12 +165,14 @@ enum lru_list {
 	LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
 	LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
 	LRU_UNEVICTABLE,
+	NR_EVICTABLE_LRU_LISTS = LRU_UNEVICTABLE,
 	NR_LRU_LISTS
 };
 
 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
 
-#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
+#define for_each_evictable_lru(lru) for (lru = 0; \
+			lru < NR_EVICTABLE_LRU_LISTS; lru++)
 
 static inline int is_file_lru(enum lru_list lru)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 65cb7758dd09..bb330d1b76ae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1839,8 +1839,8 @@ enum scan_balance {
  * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
  * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
  */
-static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
-			   unsigned long *nr)
+static noinline_for_stack void get_scan_count(struct lruvec *lruvec,
+			struct scan_control *sc, unsigned long *nr)
 {
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 	u64 fraction[2];
@@ -2012,12 +2012,11 @@ out:
  */
 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
-	unsigned long nr[NR_LRU_LISTS];
-	unsigned long targets[NR_LRU_LISTS];
+	unsigned long nr[NR_EVICTABLE_LRU_LISTS];
+	unsigned long targets[NR_EVICTABLE_LRU_LISTS];
 	unsigned long nr_to_scan;
 	enum lru_list lru;
 	unsigned long nr_reclaimed = 0;
-	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
 	struct blk_plug plug;
 	bool scan_adjusted = false;
 
@@ -2042,7 +2041,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			}
 		}
 
-		if (nr_reclaimed < nr_to_reclaim || scan_adjusted)
+		if (nr_reclaimed < sc->nr_to_reclaim || scan_adjusted)
 			continue;
 
 		/*


-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  2:42           ` [RFC 2/2] x86_64: expand kernel stack to 16K Linus Torvalds
@ 2014-05-29  5:14             ` H. Peter Anvin
  2014-05-29  6:01             ` Rusty Russell
                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 107+ messages in thread
From: H. Peter Anvin @ 2014-05-29  5:14 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Jens Axboe
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	Ingo Molnar, Peter Zijlstra, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Rusty Russell, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

On 05/28/2014 07:42 PM, Linus Torvalds wrote:
> 
> And Minchan running out of stack is at least _partly_ due to his debug
> options (that DEBUG_PAGEALLOC thing as an extreme example, but I
> suspect there's a few other options there that generate more bloated
> data structures too too).
> 

I have wondered if a larger stack would make sense as a debug option.
I'm just worried it will be abused.

	-hpa



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  2:42           ` [RFC 2/2] x86_64: expand kernel stack to 16K Linus Torvalds
  2014-05-29  5:14             ` H. Peter Anvin
@ 2014-05-29  6:01             ` Rusty Russell
  2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
  2014-05-29  7:26             ` [RFC 2/2] x86_64: expand kernel stack to 16K Dave Chinner
  2014-05-31  2:06             ` Jens Axboe
  3 siblings, 1 reply; 107+ messages in thread
From: Rusty Russell @ 2014-05-29  6:01 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Jens Axboe
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

Linus Torvalds <torvalds@linux-foundation.org> writes:
> Well, we've definitely have had some issues with deeper callchains
> with md, but I suspect virtio might be worse, and the new blk-mq code
> is lilkely worse in this respect too.

I looked at this; I've now got a couple of virtio core cleanups, and
I'm testing with Minchan's config and gcc versions now.

MST reported that gcc 4.8 is better than 4.6, but I'll test that too.

Cheers,
Rusty.

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 16:06       ` Johannes Weiner
  2014-05-28 21:55         ` Dave Chinner
@ 2014-05-29  6:06         ` Minchan Kim
  1 sibling, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29  6:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Dave Chinner, linux-kernel, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Hugh Dickins, rusty, mst, Dave Hansen,
	Steven Rostedt, xfs

On Wed, May 28, 2014 at 12:06:58PM -0400, Johannes Weiner wrote:
> On Wed, May 28, 2014 at 07:13:45PM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 06:37:38PM +1000, Dave Chinner wrote:
> > > [ cc XFS list ]
> > 
> > [and now there is a complete copy on the XFS list, I'll add my 2c]
> > 
> > > On Wed, May 28, 2014 at 03:53:59PM +0900, Minchan Kim wrote:
> > > > While I play inhouse patches with much memory pressure on qemu-kvm,
> > > > 3.14 kernel was randomly crashed. The reason was kernel stack overflow.
> > > > 
> > > > When I investigated the problem, the callstack was a little bit deeper
> > > > by involve with reclaim functions but not direct reclaim path.
> > > > 
> > > > I tried to diet stack size of some functions related with alloc/reclaim
> > > > so did a hundred of byte but overflow was't disappeard so that I encounter
> > > > overflow by another deeper callstack on reclaim/allocator path.
> > 
> > That's a no win situation. The stack overruns through ->writepage
> > we've been seeing with XFS over the past *4 years* are much larger
> > than a few bytes. The worst-case stack usage on a virtio block
> > device was about 10.5KB.
> > 
> > And, like this one, it came from the flusher thread as well. The
> > difference was that the allocation that triggered the reclaim path
> > you've reported occurred when 5k of the stack had already been
> > used...
> > 
> > > > Of course, we could sweep every site we have found to reduce
> > > > stack usage, but I'm not sure how long that would save the world
> > > > (surely, lots of developers will keep adding nice features that use
> > > > stack again), and if we consider more complex features in the I/O
> > > > layer and/or reclaim path, it might be better to increase the stack
> > > > size. (Meanwhile, stack usage on 64-bit machines has doubled compared
> > > > to 32-bit while the stack has stayed at 8K. Hmm, that doesn't seem
> > > > fair to me, and arm64 has already expanded to 16K.)
> > 
> > Yup, that's all been pointed out previously. 8k stacks were never
> > large enough to fit the linux IO architecture on x86-64, but nobody
> > outside filesystem and IO developers has been willing to accept that
> > argument as valid, despite regular stack overruns and filesystems
> > having to add workaround after workaround to prevent stack overruns.
> > 
> > That's why stuff like this appears in various filesystem's
> > ->writepage:
> > 
> >         /*
> >          * Refuse to write the page out if we are called from reclaim context.
> >          *
> >          * This avoids stack overflows when called from deeply used stacks in
> >          * random callers for direct reclaim or memcg reclaim.  We explicitly
> >          * allow reclaim from kswapd as the stack usage there is relatively low.
> >          *
> >          * This should never happen except in the case of a VM regression so
> >          * warn about it.
> >          */
> >         if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> >                         PF_MEMALLOC))
> >                 goto redirty;
> > 
> > That still doesn't guarantee us enough stack space to do writeback,
> > though, because memory allocation can occur when reading in metadata
> > needed to do delayed allocation, and so we could trigger GFP_NOFS
> > memory allocation from the flusher thread with 4-5k of stack already
> > consumed, so that would still overrun the stack.
> > 
> > So, a couple of years ago we started deferring half the writeback
> > stack usage to a worker thread (commit c999a22 "xfs: introduce an
> > allocation workqueue"), under the assumption that the worst stack
> > usage when we call memory allocation is around 3-3.5k of stack used.
> > We thought that would be safe, but the stack trace you've posted
> > shows that alloc_page(GFP_NOFS) can consume upwards of 5k of stack,
> > which means we're still screwed despite all the workarounds we have
> > in place.
> 
> The allocation and reclaim stack itself is only 2k per the stacktrace
> below.  What got us in this particular case is that we engaged a
> complicated block layer setup from within the allocation context in
> order to swap out a page.
> 
> In the past we disabled filesystem ->writepage from within the
> allocation context and deferred it to kswapd for stack reasons (see
> the WARN_ON_ONCE and the comment in your above quote), but I think we
> have to go further and do the same for even swap_writepage():
> 
> > > > I guess this topic was discussed several time so there might be
> > > > strong reason not to increase kernel stack size on x86_64, for me not
> > > > knowing so Ccing x86_64 maintainers, other MM guys and virtio
> > > > maintainers.
> > > >
> > > >          Depth    Size   Location    (51 entries)
> > > > 
> > > >    0)     7696      16   lookup_address+0x28/0x30
> > > >    1)     7680      16   _lookup_address_cpa.isra.3+0x3b/0x40
> > > >    2)     7664      24   __change_page_attr_set_clr+0xe0/0xb50
> > > >    3)     7640     392   kernel_map_pages+0x6c/0x120
> > > >    4)     7248     256   get_page_from_freelist+0x489/0x920
> > > >    5)     6992     352   __alloc_pages_nodemask+0x5e1/0xb20
> > > >    6)     6640       8   alloc_pages_current+0x10f/0x1f0
> > > >    7)     6632     168   new_slab+0x2c5/0x370
> > > >    8)     6464       8   __slab_alloc+0x3a9/0x501
> > > >    9)     6456      80   __kmalloc+0x1cb/0x200
> > > >   10)     6376     376   vring_add_indirect+0x36/0x200
> > > >   11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> > > >   12)     5856     288   __virtblk_add_req+0xda/0x1b0
> > > >   13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> > > >   14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
> > > >   15)     5344      16   blk_mq_run_hw_queue+0x35/0x40
> > > >   16)     5328      96   blk_mq_insert_requests+0xdb/0x160
> > > >   17)     5232     112   blk_mq_flush_plug_list+0x12b/0x140
> > > >   18)     5120     112   blk_flush_plug_list+0xc7/0x220
> > > >   19)     5008      64   io_schedule_timeout+0x88/0x100
> > > >   20)     4944     128   mempool_alloc+0x145/0x170
> > > >   21)     4816      96   bio_alloc_bioset+0x10b/0x1d0
> > > >   22)     4720      48   get_swap_bio+0x30/0x90
> > > >   23)     4672     160   __swap_writepage+0x150/0x230
> > > >   24)     4512      32   swap_writepage+0x42/0x90
> 
> Without swap IO from the allocation context, the stack would have
> ended here, which would have been easily survivable, and would have
> left the writeout work to kswapd, which has a much shallower stack
> than this:
> 
> > > >   25)     4480     320   shrink_page_list+0x676/0xa80
> > > >   26)     4160     208   shrink_inactive_list+0x262/0x4e0
> > > >   27)     3952     304   shrink_lruvec+0x3e1/0x6a0
> > > >   28)     3648      80   shrink_zone+0x3f/0x110
> > > >   29)     3568     128   do_try_to_free_pages+0x156/0x4c0
> > > >   30)     3440     208   try_to_free_pages+0xf7/0x1e0
> > > >   31)     3232     352   __alloc_pages_nodemask+0x783/0xb20
> > > >   32)     2880       8   alloc_pages_current+0x10f/0x1f0
> > > >   33)     2872     200   __page_cache_alloc+0x13f/0x160
> > > >   34)     2672      80   find_or_create_page+0x4c/0xb0
> > > >   35)     2592      80   ext4_mb_load_buddy+0x1e9/0x370
> > > >   36)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
> > > >   37)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
> > > >   38)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
> > > >   39)     1952     160   ext4_map_blocks+0x325/0x530
> > > >   40)     1792     384   ext4_writepages+0x6d1/0xce0
> > > >   41)     1408      16   do_writepages+0x23/0x40
> > > >   42)     1392      96   __writeback_single_inode+0x45/0x2e0
> > > >   43)     1296     176   writeback_sb_inodes+0x2ad/0x500
> > > >   44)     1120      80   __writeback_inodes_wb+0x9e/0xd0
> > > >   45)     1040     160   wb_writeback+0x29b/0x350
> > > >   46)      880     208   bdi_writeback_workfn+0x11c/0x480
> > > >   47)      672     144   process_one_work+0x1d2/0x570
> > > >   48)      528     112   worker_thread+0x116/0x370
> > > >   49)      416     240   kthread+0xf3/0x110
> > > >   50)      176     176   ret_from_fork+0x7c/0xb0
> > 
> > Impressive: 3 nested allocations - GFP_NOFS, GFP_NOIO and then
> > GFP_ATOMIC before the stack goes boom. XFS usually only needs 2...
> 
> Do they also usually involve swap_writepage()?

Maybe it works, but one problem I can think of is LRU churn, because
anon pages scanned in direct reclaim would live another round in the
LRU, and, as Dave already pointed out, it couldn't prevent the
synchronous unplugging caused by another schedule point in the direct
reclaim path, so I buy Dave's idea of passing the plug list off to
kblockd.

> 
> ---
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 7c59ef681381..02e7e3c168cf 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -233,6 +233,22 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>  {
>  	int ret = 0;
>  
> +	/*
> +	 * Refuse to write the page out if we are called from reclaim context.
> +	 *
> +	 * This avoids stack overflows when called from deeply used stacks in
> +	 * random callers for direct reclaim or memcg reclaim.  We explicitly
> +	 * allow reclaim from kswapd as the stack usage there is relatively low.
> +	 *
> +	 * This should never happen except in the case of a VM regression so
> +	 * warn about it.
> +	 */
> +	if (WARN_ON_ONCE((current->flags & (PF_MEMALLOC|PF_KSWAPD)) ==
> +			PF_MEMALLOC)) {
> +		SetPageDirty(page);
> +		goto out;
> +	}
> +
>  	if (try_to_free_swap(page)) {
>  		unlock_page(page);
>  		goto out;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 61c576083c07..99cca6633e0d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -985,13 +985,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  
>  		if (PageDirty(page)) {
>  			/*
> -			 * Only kswapd can writeback filesystem pages to
> -			 * avoid risk of stack overflow but only writeback
> +			 * Only kswapd can writeback pages to avoid
> +			 * risk of stack overflow but only writeback
>  			 * if many dirty pages have been encountered.
>  			 */
> -			if (page_is_file_cache(page) &&
> -					(!current_is_kswapd() ||
> -					 !zone_is_reclaim_dirty(zone))) {
> +			if (!current_is_kswapd() ||
> +			    !zone_is_reclaim_dirty(zone)) {
>  				/*
>  				 * Immediately reclaim when written back.
>  				 * Similar in principal to deactivate_page()
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim
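Both the WARN_ON_ONCE that Johannes adds to swap_writepage() and the
existing one in XFS's ->writepage gate on the same two-bit test. A
standalone sketch of just that predicate (userspace C; the flag values
here are made up for illustration, the real PF_* constants live in
include/linux/sched.h):

```c
#include <assert.h>

/* Illustrative values only; the real PF_* bits differ. */
#define PF_MEMALLOC	0x0800	/* allocating memory (reclaim context) */
#define PF_KSWAPD	0x20000	/* this task is kswapd */

/*
 * Mirrors the WARN_ON_ONCE condition: true only for direct reclaim,
 * i.e. PF_MEMALLOC set while PF_KSWAPD is clear.  kswapd sets both,
 * so it is explicitly allowed through.
 */
static int writeback_refused(unsigned int task_flags)
{
	return (task_flags & (PF_MEMALLOC | PF_KSWAPD)) == PF_MEMALLOC;
}
```

Only the direct-reclaim case trips the check; kswapd, and tasks not in
reclaim at all, fall through to the normal writeout path.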

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  2:42           ` [RFC 2/2] x86_64: expand kernel stack to 16K Linus Torvalds
  2014-05-29  5:14             ` H. Peter Anvin
  2014-05-29  6:01             ` Rusty Russell
@ 2014-05-29  7:26             ` Dave Chinner
  2014-05-29 15:24               ` Linus Torvalds
  2014-05-31  2:06             ` Jens Axboe
  3 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-29  7:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Minchan Kim, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On Wed, May 28, 2014 at 07:42:40PM -0700, Linus Torvalds wrote:
> On Wed, May 28, 2014 at 6:30 PM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > You're focussing on the specific symptoms, not the bigger picture.
> > i.e. you're ignoring all the other "let's start IO" triggers in
> > direct reclaim. e.g there's two separate plug flush triggers in
> > shrink_inactive_list(), one of which is:
> 
> Fair enough. I certainly agree that we should look at the other cases here too.
> 
> In fact, I also find it distasteful just how much stack space some of
> those VM routines are just using up on their own, never mind any
> actual IO paths at all. The fact that __alloc_pages_nodemask() uses
> 350 bytes of stackspace on its own is actually quite disturbing. The
> fact that kernel_map_pages() apparently has almost 400 bytes of stack
> is just crazy. Obviously that case only happens with
> CONFIG_DEBUG_PAGEALLOC, but still..

What concerns me about both __alloc_pages_nodemask() and
kernel_map_pages is that when I look at the code I see functions
that have no obvious stack usage problem. However, the compiler is
producing functions with huge stack footprints and it's not at all
obvious when I read the code. So in this case I'm more concerned
that we have a major disconnect between the source code structure
and the code that the compiler produces...
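That disconnect can be poked at even from userspace: the source of a
function gives little hint of its frame size until you measure it. A
minimal sketch (userspace C, not kernel code; the 512-byte buffer is
an arbitrary stand-in for a frame the compiler quietly makes large):

```c
#include <assert.h>
#include <stdint.h>

/* Escape hatch so the compiler cannot elide the buffer. */
static volatile char *sink;

/*
 * A function that looks small in source but carries a large frame.
 * On a downward-growing stack (x86-64, arm64), every byte of this
 * frame lies below any local in the caller's frame.
 */
__attribute__((noinline))
static intptr_t big_frame(intptr_t caller_mark)
{
	char buf[512];

	buf[0] = 0;
	sink = buf;
	return caller_mark - (intptr_t)buf;
}

/* Estimate the callee's frame cost by comparing local addresses. */
static intptr_t measure_frame(void)
{
	char mark;

	return big_frame((intptr_t)&mark);
}
```

Whatever the source suggests, the measured distance has to cover at
least the 512-byte buffer, which is exactly the kind of surprise the
stack tracer reported for kernel_map_pages().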

> > I'm not saying we shouldn't turn off swap from direct reclaim, just
> > that all we'd be doing by turning off swap is playing whack-a-stack
> > - the next report will simply be from one of the other direct
> > reclaim IO schedule points.
> 
> Playing whack-a-mole with this for a while might not be a bad idea,
> though. It's not like we will ever really improve unless we start
> whacking the worst cases. And it should still be a fairly limited
> number.

I guess I've been playing whack-a-stack for so long now, and some of
the overruns have been so large, that I just don't see it as a viable
medium- to long-term solution.

> After all, historically, some of the cases we've played whack-a-mole
> on have been in XFS, so I'd think you'd be thrilled to see some other
> code get blamed this time around ;)

Blame shifting doesn't thrill me - I'm still at the pointy end of
stack overrun reports, and we've still got to do the hard work of
solving the problem. However, I am happy to see acknowledgement of
the problem so we can work out how to solve the issues...

> > Regardless of whether it is swap or something external queues the
> > bio on the plug, perhaps we should look at why it's done inline
> > rather than by kblockd, where it was moved because it was blowing
> > the stack from schedule():
> 
> So it sounds like we need to do this for io_schedule() too.
> 
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.
> 
> Of course, we do need to worry about latency for starting IO, but any
> of these kinds of memory-pressure writeback patterns are pretty much
> by definition not about the latency of one _particular_ IO, so they
> don't tend to be latency-sensitive. Quite the reverse: we start
> writeback and then end up waiting on something else altogether
> (possibly a writeback that got started much earlier).

*nod*

> swapout certainly is _not_ IO-latency-sensitive, especially these
> days. And while we _do_ want to throttle in direct reclaim, if it's
> about throttling I'd certainly think that it sounds quite reasonable
> to push any unplugging to kblockd than try to do that synchronously.
> If we are throttling in direct-reclaim, we need to slow things _down_
> for the writer, not worry about latency.

Right, we are adding latency to the caller by having to swap, so
a small amount of additional IO dispatch latency for IO we aren't
going to wait directly on doesn't really matter at all.

> >                That implies no IO in direct reclaim context
> > is safe - either from swap or io_schedule() unplugging. It also
> > lends a lot of weight to my assertion that the majority of the stack
> > growth over the past couple of years has been occurring outside the
> > filesystems....
> 
> I think Minchan's stack trace definitely backs you up on that. The
> filesystem part - despite that one ext4_writepages() function - is a
> very small part of the whole. It sits at about ~1kB of stack. Just the
> VM "top-level" writeback code is about as much, and then the VM page
> alloc/shrinking code when the filesystem needs memory is *twice* that,
> and then the block layer and the virtio code are another 1kB each.

*nod*

As I said earlier, look at this in the context of the bigger picture.
We can also have more stack-using layers in the IO stack and/or more
stack-expensive layers. e.g.  it could be block -> dm -> md -> SCSI
-> mempool_alloc in that stack rather than block -> virtio ->
kmalloc. Hence 1k of virtio stack could be 1.5k of SCSI stack,
md/dm could contribute a few hundred bytes each (or more depending
on how many layers of dm/md there are), and so on.

When you start adding all that up, it doesn't paint a pretty
picture. That's one of the main reasons why I don't think the
whack-a-stack approach will solve the problem in the medium to long
term...
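That accounting can be made concrete with a toy ledger (userspace C;
the per-layer byte counts below are illustrative round numbers in the
spirit of the figures quoted in this thread, not measurements):

```c
#include <assert.h>

/* One IO-path layer and its worst-case stack cost in bytes. */
struct layer {
	const char *name;
	unsigned int bytes;
};

/* Sum the worst-case costs of every layer a request can traverse. */
static unsigned int total_stack(const struct layer *layers, int n)
{
	unsigned int sum = 0;
	int i;

	for (i = 0; i < n; i++)
		sum += layers[i].bytes;
	return sum;
}
```

Swap a 1.0K virtio layer for a 1.5K SCSI layer, add a few hundred
bytes per dm/md layer, and an 8K total is gone before the leaf
allocation runs, while 16K still absorbs the same path.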

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* virtio ring cleanups, which save stack on older gcc
  2014-05-29  6:01             ` Rusty Russell
@ 2014-05-29  7:26               ` Rusty Russell
  2014-05-29  7:26                 ` [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk Rusty Russell
                                   ` (4 more replies)
  0 siblings, 5 replies; 107+ messages in thread
From: Rusty Russell @ 2014-05-29  7:26 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

They don't make much difference: the easier fix is to use gcc 4.8,
which drops the stack required across virtio_blk's virtio_queue_rq(),
down to the kmalloc() in virtio_ring, from 528 to 392 bytes.

Still, these (*lightly tested*) patches reduce that to 432 bytes,
even for gcc 4.6.4.  Posted here FYI.

Cheers,
Rusty.


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk
  2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
@ 2014-05-29  7:26                 ` Rusty Russell
  2014-05-29 15:39                   ` Linus Torvalds
  2014-05-29  7:26                 ` [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf() Rusty Russell
                                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 107+ messages in thread
From: Rusty Russell @ 2014-05-29  7:26 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt
  Cc: Rusty Russell

Results (x86-64, Minchan's .config):

gcc 4.8.2: virtio_blk: stack used = 392
gcc 4.6.4: virtio_blk: stack used = 528

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/block/virtio_blk.c   | 11 ++++++++++-
 drivers/virtio/virtio_ring.c | 11 +++++++++++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index cb9b1f8326c3..894e290b4bd2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -151,15 +151,19 @@ static void virtblk_done(struct virtqueue *vq)
 	spin_unlock_irqrestore(&vblk->vq_lock, flags);
 }
 
+extern struct task_struct *record_stack;
+extern unsigned long stack_top;
+
 static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
 	struct virtio_blk *vblk = hctx->queue->queuedata;
 	struct virtblk_req *vbr = req->special;
 	unsigned long flags;
 	unsigned int num;
+	unsigned long stack_bottom;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
-
+	
 	BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
 
 	vbr->req = req;
@@ -199,7 +203,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	}
 
 	spin_lock_irqsave(&vblk->vq_lock, flags);
+	record_stack = current;
+	__asm__ __volatile__("movq %%rsp,%0" : "=g" (stack_bottom));
 	err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
+	record_stack = NULL;
+
+	printk("virtio_blk: stack used = %lu\n", stack_bottom - stack_top);
 	if (err) {
 		virtqueue_kick(vblk->vq);
 		blk_mq_stop_hw_queue(hctx);
diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 4d08f45a9c29..f6ad99ffdc40 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -54,6 +54,14 @@
 #define END_USE(vq)
 #endif
 
+extern struct task_struct *record_stack;
+struct task_struct *record_stack;
+EXPORT_SYMBOL(record_stack);
+
+extern unsigned long stack_top;
+unsigned long stack_top;
+EXPORT_SYMBOL(stack_top);
+
 struct vring_virtqueue
 {
 	struct virtqueue vq;
@@ -137,6 +145,9 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	 */
 	gfp &= ~(__GFP_HIGHMEM | __GFP_HIGH);
 
+	if (record_stack == current)
+		__asm__ __volatile__("movq %%rsp,%0" : "=g" (stack_top));
+
 	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
 	if (!desc)
 		return -ENOMEM;
-- 
1.9.1
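The same bracketing trick can be tried in userspace without the
x86-64-specific %rsp asm, using GCC's __builtin_frame_address() for
the two snapshots (a rough sketch; the 256-byte pad is an arbitrary
stand-in for the intermediate frames, and exact numbers depend on
compiler and flags):

```c
#include <assert.h>

/* Plays the role of the patch's stack_top global. */
static char *stack_top;

/* Snapshot at the deepest point of interest, like the probe the
 * patch drops into vring_add_indirect() before the kmalloc. */
__attribute__((noinline))
static void deepest(void)
{
	stack_top = (char *)__builtin_frame_address(0);
}

/* Stand-in for the frames between the two probes. */
__attribute__((noinline))
static void middle(void)
{
	char pad[256];

	pad[0] = 0;
	__asm__ __volatile__("" : : "r" (pad) : "memory"); /* keep pad live */
	deepest();
}

/* Mirrors virtio_queue_rq(): snapshot the bottom, call down, then
 * subtract the snapshot taken at the top. */
static long stack_used(void)
{
	char *stack_bottom = (char *)__builtin_frame_address(0);

	middle();
	return stack_bottom - stack_top;
}
```

The subtraction is the whole trick; the patch above just does it with
raw %rsp reads around __virtblk_add_req() and prints the result.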


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf()
  2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
  2014-05-29  7:26                 ` [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk Rusty Russell
@ 2014-05-29  7:26                 ` Rusty Russell
  2014-05-29 10:07                   ` Michael S. Tsirkin
  2014-05-29  7:26                 ` [PATCH 3/4] virtio_ring: assume sgs are always well-formed Rusty Russell
                                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 107+ messages in thread
From: Rusty Russell @ 2014-05-29  7:26 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt
  Cc: Rusty Russell

This is the only place which doesn't hand virtqueue_add_inbuf or
virtqueue_add_outbuf a well-formed, well-terminated sg.  Fix it,
so we can make virtio_add_* simpler.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/net/virtio_net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 8a852b5f215f..63299b04cdf2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -590,6 +590,8 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
 	offset = sizeof(struct padded_vnet_hdr);
 	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
 
+	sg_mark_end(&rq->sg[MAX_SKB_FRAGS + 2 - 1]);
+
 	/* chain first in list head */
 	first->private = (unsigned long)list;
 	err = virtqueue_add_inbuf(rq->vq, rq->sg, MAX_SKB_FRAGS + 2,
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 3/4] virtio_ring: assume sgs are always well-formed.
  2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
  2014-05-29  7:26                 ` [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk Rusty Russell
  2014-05-29  7:26                 ` [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf() Rusty Russell
@ 2014-05-29  7:26                 ` Rusty Russell
  2014-05-29 11:18                   ` Michael S. Tsirkin
  2014-05-29  7:26                 ` [PATCH 4/4] virtio_ring: unify direct/indirect code paths Rusty Russell
  2014-05-29  7:41                 ` virtio ring cleanups, which save stack on older gcc Minchan Kim
  4 siblings, 1 reply; 107+ messages in thread
From: Rusty Russell @ 2014-05-29  7:26 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt
  Cc: Rusty Russell

We used to have several callers which just used arrays.  They're
gone, so we can use sg_next() everywhere, simplifying the code.

Before:
	gcc 4.8.2: virtio_blk: stack used = 392
	gcc 4.6.4: virtio_blk: stack used = 528

After:
	gcc 4.8.2: virtio_blk: stack used = 392
	gcc 4.6.4: virtio_blk: stack used = 480

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/virtio/virtio_ring.c | 68 +++++++++++++-------------------------------
 1 file changed, 19 insertions(+), 49 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index f6ad99ffdc40..5d29cd85d6cf 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -107,28 +107,10 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
-static inline struct scatterlist *sg_next_chained(struct scatterlist *sg,
-						  unsigned int *count)
-{
-	return sg_next(sg);
-}
-
-static inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
-					      unsigned int *count)
-{
-	if (--(*count) == 0)
-		return NULL;
-	return sg + 1;
-}
-
 /* Set up an indirect table of descriptors and add it to the queue. */
 static inline int vring_add_indirect(struct vring_virtqueue *vq,
 				     struct scatterlist *sgs[],
-				     struct scatterlist *(*next)
-				       (struct scatterlist *, unsigned int *),
 				     unsigned int total_sg,
-				     unsigned int total_out,
-				     unsigned int total_in,
 				     unsigned int out_sgs,
 				     unsigned int in_sgs,
 				     gfp_t gfp)
@@ -155,7 +137,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	/* Transfer entries from the sg lists into the indirect page */
 	i = 0;
 	for (n = 0; n < out_sgs; n++) {
-		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
+		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
 			desc[i].flags = VRING_DESC_F_NEXT;
 			desc[i].addr = sg_phys(sg);
 			desc[i].len = sg->length;
@@ -164,7 +146,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 		}
 	}
 	for (; n < (out_sgs + in_sgs); n++) {
-		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
+		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
 			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
 			desc[i].addr = sg_phys(sg);
 			desc[i].len = sg->length;
@@ -197,10 +179,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 
 static inline int virtqueue_add(struct virtqueue *_vq,
 				struct scatterlist *sgs[],
-				struct scatterlist *(*next)
-				  (struct scatterlist *, unsigned int *),
-				unsigned int total_out,
-				unsigned int total_in,
+				unsigned int total_sg,
 				unsigned int out_sgs,
 				unsigned int in_sgs,
 				void *data,
@@ -208,7 +187,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
-	unsigned int i, n, avail, uninitialized_var(prev), total_sg;
+	unsigned int i, n, avail, uninitialized_var(prev);
 	int head;
 
 	START_USE(vq);
@@ -233,13 +212,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 #endif
 
-	total_sg = total_in + total_out;
-
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
-		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
-					  total_in,
+		head = vring_add_indirect(vq, sgs, total_sg, 
 					  out_sgs, in_sgs, gfp);
 		if (likely(head >= 0))
 			goto add_head;
@@ -265,7 +241,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 
 	head = i = vq->free_head;
 	for (n = 0; n < out_sgs; n++) {
-		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
+		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
 			vq->vring.desc[i].addr = sg_phys(sg);
 			vq->vring.desc[i].len = sg->length;
@@ -274,7 +250,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 		}
 	}
 	for (; n < (out_sgs + in_sgs); n++) {
-		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
+		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
 			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
 			vq->vring.desc[i].addr = sg_phys(sg);
 			vq->vring.desc[i].len = sg->length;
@@ -335,29 +311,23 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
 		      void *data,
 		      gfp_t gfp)
 {
-	unsigned int i, total_out, total_in;
+	unsigned int i, total_sg = 0;
 
 	/* Count them first. */
-	for (i = total_out = total_in = 0; i < out_sgs; i++) {
-		struct scatterlist *sg;
-		for (sg = sgs[i]; sg; sg = sg_next(sg))
-			total_out++;
-	}
-	for (; i < out_sgs + in_sgs; i++) {
+	for (i = 0; i < out_sgs + in_sgs; i++) {
 		struct scatterlist *sg;
 		for (sg = sgs[i]; sg; sg = sg_next(sg))
-			total_in++;
+			total_sg++;
 	}
-	return virtqueue_add(_vq, sgs, sg_next_chained,
-			     total_out, total_in, out_sgs, in_sgs, data, gfp);
+	return virtqueue_add(_vq, sgs, total_sg, out_sgs, in_sgs, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
 
 /**
  * virtqueue_add_outbuf - expose output buffers to other end
  * @vq: the struct virtqueue we're talking about.
- * @sgs: array of scatterlists (need not be terminated!)
- * @num: the number of scatterlists readable by other side
+ * @sg: scatterlist (must be well-formed and terminated!)
+ * @num: the number of entries in @sg readable by other side
  * @data: the token identifying the buffer.
  * @gfp: how to do memory allocations (if necessary).
  *
@@ -367,19 +337,19 @@ EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
  * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
  */
 int virtqueue_add_outbuf(struct virtqueue *vq,
-			 struct scatterlist sg[], unsigned int num,
+			 struct scatterlist *sg, unsigned int num,
 			 void *data,
 			 gfp_t gfp)
 {
-	return virtqueue_add(vq, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
+	return virtqueue_add(vq, &sg, num, 1, 0, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
 
 /**
  * virtqueue_add_inbuf - expose input buffers to other end
  * @vq: the struct virtqueue we're talking about.
- * @sgs: array of scatterlists (need not be terminated!)
- * @num: the number of scatterlists writable by other side
+ * @sg: scatterlist (must be well-formed and terminated!)
+ * @num: the number of entries in @sg writable by other side
  * @data: the token identifying the buffer.
  * @gfp: how to do memory allocations (if necessary).
  *
@@ -389,11 +359,11 @@ EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
  * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
  */
 int virtqueue_add_inbuf(struct virtqueue *vq,
-			struct scatterlist sg[], unsigned int num,
+			struct scatterlist *sg, unsigned int num,
 			void *data,
 			gfp_t gfp)
 {
-	return virtqueue_add(vq, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
+	return virtqueue_add(vq, &sg, num, 0, 1, data, gfp);
 }
 EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
 
-- 
1.9.1
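The counting loop this patch leaves behind in virtqueue_add_sgs() is
easy to see in isolation with a mock scatterlist (userspace sketch;
mock_sg and its is_last flag are stand-ins for the real struct
scatterlist and the sg_mark_end() termination bit, which the kernel
encodes in page_link):

```c
#include <assert.h>
#include <stddef.h>

/* Toy scatterlist entry: every list is well-formed and terminated,
 * so one sg_next()-style walk replaces the old array/chain dual
 * code paths. */
struct mock_sg {
	unsigned int length;
	int is_last;		/* stands in for the sg_mark_end() bit */
};

static struct mock_sg *mock_sg_next(struct mock_sg *sg)
{
	return sg->is_last ? NULL : sg + 1;
}

/* Same shape as the patched virtqueue_add_sgs() prologue: count all
 * entries across the out and in lists into a single total_sg. */
static unsigned int count_total_sg(struct mock_sg *sgs[], int nlists)
{
	unsigned int total_sg = 0;
	int i;

	for (i = 0; i < nlists; i++) {
		struct mock_sg *sg;

		for (sg = sgs[i]; sg; sg = mock_sg_next(sg))
			total_sg++;
	}
	return total_sg;
}
```

With only one walk and one counter, the total_out/total_in pair and
the sg_next_chained/sg_next_arr callbacks all disappear, which is
where the stack saving on gcc 4.6 comes from.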


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 4/4] virtio_ring: unify direct/indirect code paths.
  2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
                                   ` (2 preceding siblings ...)
  2014-05-29  7:26                 ` [PATCH 3/4] virtio_ring: assume sgs are always well-formed Rusty Russell
@ 2014-05-29  7:26                 ` Rusty Russell
  2014-05-29  7:52                   ` Peter Zijlstra
  2014-05-29 11:29                   ` Michael S. Tsirkin
  2014-05-29  7:41                 ` virtio ring cleanups, which save stack on older gcc Minchan Kim
  4 siblings, 2 replies; 107+ messages in thread
From: Rusty Russell @ 2014-05-29  7:26 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt
  Cc: Rusty Russell

virtqueue_add() populates the virtqueue descriptor table from the sgs
given.  If it uses an indirect descriptor table, then it puts a single
descriptor in the descriptor table pointing to the kmalloc'ed indirect
table where the sg is populated.

Previously vring_add_indirect() did the allocation and the simple
linear layout.  We replace that with alloc_indirect() which allocates
the indirect table then chains it like the normal descriptor table so
we can reuse the core logic.

Before:
	gcc 4.8.2: virtio_blk: stack used = 392
	gcc 4.6.4: virtio_blk: stack used = 480

After:
	gcc 4.8.2: virtio_blk: stack used = 408
	gcc 4.6.4: virtio_blk: stack used = 432

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/virtio/virtio_ring.c | 120 ++++++++++++++++---------------------------
 1 file changed, 45 insertions(+), 75 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5d29cd85d6cf..3adf5978b92b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -107,18 +107,10 @@ struct vring_virtqueue
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
-/* Set up an indirect table of descriptors and add it to the queue. */
-static inline int vring_add_indirect(struct vring_virtqueue *vq,
-				     struct scatterlist *sgs[],
-				     unsigned int total_sg,
-				     unsigned int out_sgs,
-				     unsigned int in_sgs,
-				     gfp_t gfp)
+static struct vring_desc *alloc_indirect(unsigned int total_sg, gfp_t gfp)
 {
-	struct vring_desc *desc;
-	unsigned head;
-	struct scatterlist *sg;
-	int i, n;
+	struct vring_desc *desc;
+	unsigned int i;
 
 	/*
 	 * We require lowmem mappings for the descriptors because
@@ -130,51 +122,13 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
 	if (record_stack == current)
 		__asm__ __volatile__("movq %%rsp,%0" : "=g" (stack_top));
 
-	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
-	if (!desc)
-		return -ENOMEM;
-
-	/* Transfer entries from the sg lists into the indirect page */
-	i = 0;
-	for (n = 0; n < out_sgs; n++) {
-		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-			desc[i].flags = VRING_DESC_F_NEXT;
-			desc[i].addr = sg_phys(sg);
-			desc[i].len = sg->length;
-			desc[i].next = i+1;
-			i++;
-		}
-	}
-	for (; n < (out_sgs + in_sgs); n++) {
-		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
-			desc[i].addr = sg_phys(sg);
-			desc[i].len = sg->length;
-			desc[i].next = i+1;
-			i++;
-		}
-	}
-	BUG_ON(i != total_sg);
-
-	/* Last one doesn't continue. */
-	desc[i-1].flags &= ~VRING_DESC_F_NEXT;
-	desc[i-1].next = 0;
-
-	/* We're about to use a buffer */
-	vq->vq.num_free--;
-
-	/* Use a single buffer which doesn't continue */
-	head = vq->free_head;
-	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
-	vq->vring.desc[head].addr = virt_to_phys(desc);
-	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
-	kmemleak_ignore(desc);
-	vq->vring.desc[head].len = i * sizeof(struct vring_desc);
-
-	/* Update free pointer */
-	vq->free_head = vq->vring.desc[head].next;
+	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
+	if (!desc)
+		return NULL;
 
-	return head;
+	for (i = 0; i < total_sg; i++)
+		desc[i].next = i+1;
+	return desc;
 }
 
 static inline int virtqueue_add(struct virtqueue *_vq,
@@ -187,6 +141,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
+	struct vring_desc *desc = NULL;
 	unsigned int i, n, avail, uninitialized_var(prev);
 	int head;
 
@@ -212,18 +167,32 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	}
 #endif
 
+	BUG_ON(total_sg > vq->vring.num);
+	BUG_ON(total_sg == 0);
+
+	head = vq->free_head;
+
 	/* If the host supports indirect descriptor tables, and we have multiple
 	 * buffers, then go indirect. FIXME: tune this threshold */
-	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
-		head = vring_add_indirect(vq, sgs, total_sg, 
-					  out_sgs, in_sgs, gfp);
-		if (likely(head >= 0))
-			goto add_head;
+	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
+		desc = alloc_indirect(total_sg, gfp);
+
+	if (desc) {
+		/* Use a single buffer which doesn't continue */
+		vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
+		vq->vring.desc[head].addr = virt_to_phys(desc);
+		/* avoid kmemleak false positive (tis hidden by virt_to_phys) */
+		kmemleak_ignore(desc);
+		vq->vring.desc[head].len = total_sg * sizeof(struct vring_desc);
+
+		/* Set up rest to use this indirect table. */
+		i = 0;
+		total_sg = 1;
+	} else {
+		desc = vq->vring.desc;
+		i = head;
 	}
 
-	BUG_ON(total_sg > vq->vring.num);
-	BUG_ON(total_sg == 0);
-
 	if (vq->vq.num_free < total_sg) {
 		pr_debug("Can't add buf len %i - avail = %i\n",
 			 total_sg, vq->vq.num_free);
@@ -239,32 +208,33 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	/* We're about to use some buffers from the free list. */
 	vq->vq.num_free -= total_sg;
 
-	head = i = vq->free_head;
 	for (n = 0; n < out_sgs; n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
-			vq->vring.desc[i].addr = sg_phys(sg);
-			vq->vring.desc[i].len = sg->length;
+			desc[i].flags = VRING_DESC_F_NEXT;
+			desc[i].addr = sg_phys(sg);
+			desc[i].len = sg->length;
 			prev = i;
-			i = vq->vring.desc[i].next;
+			i = desc[i].next;
 		}
 	}
 	for (; n < (out_sgs + in_sgs); n++) {
 		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
-			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
-			vq->vring.desc[i].addr = sg_phys(sg);
-			vq->vring.desc[i].len = sg->length;
+			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
+			desc[i].addr = sg_phys(sg);
+			desc[i].len = sg->length;
 			prev = i;
-			i = vq->vring.desc[i].next;
+			i = desc[i].next;
 		}
 	}
 	/* Last one doesn't continue. */
-	vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
+	desc[prev].flags &= ~VRING_DESC_F_NEXT;
 
 	/* Update free pointer */
-	vq->free_head = i;
+	if (desc == vq->vring.desc)
+		vq->free_head = i;
+	else
+		vq->free_head = vq->vring.desc[head].next;
 
-add_head:
 	/* Set token. */
 	vq->data[head] = data;
 
-- 
1.9.1



* Re: virtio ring cleanups, which save stack on older gcc
  2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
                                   ` (3 preceding siblings ...)
  2014-05-29  7:26                 ` [PATCH 4/4] virtio_ring: unify direct/indirect code paths Rusty Russell
@ 2014-05-29  7:41                 ` Minchan Kim
  2014-05-29 10:39                   ` Dave Chinner
  2014-05-29 11:08                   ` Rusty Russell
  4 siblings, 2 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29  7:41 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

Hello Rusty,

On Thu, May 29, 2014 at 04:56:41PM +0930, Rusty Russell wrote:
> They don't make much difference: the easier fix is use gcc 4.8
> which drops stack required across virtio block's virtio_queue_rq
> down to that kmalloc in virtio_ring from 528 to 392 bytes.
> 
> Still, these (*lightly tested*) patches reduce to 432 bytes,
> even for gcc 4.6.4.  Posted here FYI.

I am testing with the patch below, which is a hack based on Dave's idea, so I
don't have a machine to test your patches on until tomorrow.
So, I will queue your patches on the test machine tomorrow morning.

Thanks!

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..95f169e85dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 void __sched io_schedule(void)
 {
 	struct rq *rq = raw_rq();
+	struct blk_plug *plug = current->plug;
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;

> 
> Cheers,
> Rusty.
> 

-- 
Kind regards,
Minchan Kim


* Re: [PATCH 4/4] virtio_ring: unify direct/indirect code paths.
  2014-05-29  7:26                 ` [PATCH 4/4] virtio_ring: unify direct/indirect code paths Rusty Russell
@ 2014-05-29  7:52                   ` Peter Zijlstra
  2014-05-29 11:05                     ` Rusty Russell
  2014-05-29 11:29                   ` Michael S. Tsirkin
  1 sibling, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2014-05-29  7:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt


On Thu, May 29, 2014 at 04:56:45PM +0930, Rusty Russell wrote:
> virtqueue_add() populates the virtqueue descriptor table from the sgs
> given.  If it uses an indirect descriptor table, then it puts a single
> descriptor in the descriptor table pointing to the kmalloc'ed indirect
> table where the sg is populated.
> 
> Previously vring_add_indirect() did the allocation and the simple
> linear layout.  We replace that with alloc_indirect() which allocates
> the indirect table then chains it like the normal descriptor table so
> we can reuse the core logic.
> 
> Before:
> 	gcc 4.8.2: virtio_blk: stack used = 392
> 	gcc 4.6.4: virtio_blk: stack used = 480
> 
> After:
> 	gcc 4.8.2: virtio_blk: stack used = 408
> 	gcc 4.6.4: virtio_blk: stack used = 432

Is it worth it to make the good compiler worse? People are going to use
the newer GCC more as time goes on anyhow.



* Re: [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf()
  2014-05-29  7:26                 ` [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf() Rusty Russell
@ 2014-05-29 10:07                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 107+ messages in thread
From: Michael S. Tsirkin @ 2014-05-29 10:07 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Dave Hansen,
	Steven Rostedt

On Thu, May 29, 2014 at 04:56:43PM +0930, Rusty Russell wrote:
> This is the only place which doesn't hand virtqueue_add_inbuf or
> virtqueue_add_outbuf a well-formed, well-terminated sg.  Fix it,
> so we can make virtio_add_* simpler.
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> ---
>  drivers/net/virtio_net.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 8a852b5f215f..63299b04cdf2 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -590,6 +590,8 @@ static int add_recvbuf_big(struct receive_queue *rq, gfp_t gfp)
>  	offset = sizeof(struct padded_vnet_hdr);
>  	sg_set_buf(&rq->sg[1], p + offset, PAGE_SIZE - offset);
>  
> +	sg_mark_end(&rq->sg[MAX_SKB_FRAGS + 2 - 1]);
> +
>  	/* chain first in list head */
>  	first->private = (unsigned long)list;
>  	err = virtqueue_add_inbuf(rq->vq, rq->sg, MAX_SKB_FRAGS + 2,

Not that the performance of add_recvbuf_big actually matters anymore, but
in fact this can be done in virtnet_probe if we like.


Anyway

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> -- 
> 1.9.1


* Re: virtio ring cleanups, which save stack on older gcc
  2014-05-29  7:41                 ` virtio ring cleanups, which save stack on older gcc Minchan Kim
@ 2014-05-29 10:39                   ` Dave Chinner
  2014-05-29 11:08                   ` Rusty Russell
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Chinner @ 2014-05-29 10:39 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rusty Russell, Linus Torvalds, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 04:41:17PM +0900, Minchan Kim wrote:
> Hello Rusty,
> 
> On Thu, May 29, 2014 at 04:56:41PM +0930, Rusty Russell wrote:
> > They don't make much difference: the easier fix is use gcc 4.8
> > which drops stack required across virtio block's virtio_queue_rq
> > down to that kmalloc in virtio_ring from 528 to 392 bytes.
> > 
> > Still, these (*lightly tested*) patches reduce to 432 bytes,
> > even for gcc 4.6.4.  Posted here FYI.
> 
> I am testing with the patch below, which is a hack based on Dave's idea, so I
> don't have a machine to test your patches on until tomorrow.
> So, I will queue your patches on the test machine tomorrow morning.
> 
> Thanks!
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5c6635b806c..95f169e85dbe 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
>  void __sched io_schedule(void)
>  {
>  	struct rq *rq = raw_rq();
> +	struct blk_plug *plug = current->plug;
>  
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> -	blk_flush_plug(current);
> +	if (plug)
> +		blk_flush_plug_list(plug, true);
> +

Could simply be

-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 4/4] virtio_ring: unify direct/indirect code paths.
  2014-05-29  7:52                   ` Peter Zijlstra
@ 2014-05-29 11:05                     ` Rusty Russell
  2014-05-29 11:33                       ` Michael S. Tsirkin
  0 siblings, 1 reply; 107+ messages in thread
From: Rusty Russell @ 2014-05-29 11:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

Peter Zijlstra <peterz@infradead.org> writes:
> On Thu, May 29, 2014 at 04:56:45PM +0930, Rusty Russell wrote:
>> Before:
>> 	gcc 4.8.2: virtio_blk: stack used = 392
>> 	gcc 4.6.4: virtio_blk: stack used = 480
>> 
>> After:
>> 	gcc 4.8.2: virtio_blk: stack used = 408
>> 	gcc 4.6.4: virtio_blk: stack used = 432
>
> Is it worth it to make the good compiler worse? People are going to use
> the newer GCC more as time goes on anyhow.

No, but it's only 16 bytes of stack loss for a simplicity win:

 virtio_ring.c |  120 +++++++++++++++++++++-------------------------------------
 1 file changed, 45 insertions(+), 75 deletions(-)

Cheers,
Rusty.


* Re: virtio ring cleanups, which save stack on older gcc
  2014-05-29  7:41                 ` virtio ring cleanups, which save stack on older gcc Minchan Kim
  2014-05-29 10:39                   ` Dave Chinner
@ 2014-05-29 11:08                   ` Rusty Russell
  2014-05-29 23:45                     ` Minchan Kim
  1 sibling, 1 reply; 107+ messages in thread
From: Rusty Russell @ 2014-05-29 11:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

Minchan Kim <minchan@kernel.org> writes:
> Hello Rusty,
>
> On Thu, May 29, 2014 at 04:56:41PM +0930, Rusty Russell wrote:
>> They don't make much difference: the easier fix is use gcc 4.8
>> which drops stack required across virtio block's virtio_queue_rq
>> down to that kmalloc in virtio_ring from 528 to 392 bytes.
>> 
>> Still, these (*lightly tested*) patches reduce to 432 bytes,
>> even for gcc 4.6.4.  Posted here FYI.
>
> I am testing with the patch below, which is a hack based on Dave's idea, so I
> don't have a machine to test your patches on until tomorrow.
> So, I will queue your patches on the test machine tomorrow morning.

More interesting would be updating your compiler to 4.8, I think.
Saving <100 bytes on virtio is not going to save you, right?

Cheers,
Rusty.


* Re: [PATCH 3/4] virtio_ring: assume sgs are always well-formed.
  2014-05-29  7:26                 ` [PATCH 3/4] virtio_ring: assume sgs are always well-formed Rusty Russell
@ 2014-05-29 11:18                   ` Michael S. Tsirkin
  0 siblings, 0 replies; 107+ messages in thread
From: Michael S. Tsirkin @ 2014-05-29 11:18 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Dave Hansen,
	Steven Rostedt

On Thu, May 29, 2014 at 04:56:44PM +0930, Rusty Russell wrote:
> We used to have several callers which just used arrays.  They're
> gone, so we can use sg_next() everywhere, simplifying the code.
> 
> Before:
> 	gcc 4.8.2: virtio_blk: stack used = 392
> 	gcc 4.6.4: virtio_blk: stack used = 528
> 
> After:
> 	gcc 4.8.2: virtio_blk: stack used = 392
> 	gcc 4.6.4: virtio_blk: stack used = 480
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>


Nice cleanup.

Acked-by: Michael S. Tsirkin <mst@redhat.com>


> ---
>  drivers/virtio/virtio_ring.c | 68 +++++++++++++-------------------------------
>  1 file changed, 19 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index f6ad99ffdc40..5d29cd85d6cf 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -107,28 +107,10 @@ struct vring_virtqueue
>  
>  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>  
> -static inline struct scatterlist *sg_next_chained(struct scatterlist *sg,
> -						  unsigned int *count)
> -{
> -	return sg_next(sg);
> -}
> -
> -static inline struct scatterlist *sg_next_arr(struct scatterlist *sg,
> -					      unsigned int *count)
> -{
> -	if (--(*count) == 0)
> -		return NULL;
> -	return sg + 1;
> -}
> -
>  /* Set up an indirect table of descriptors and add it to the queue. */
>  static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  				     struct scatterlist *sgs[],
> -				     struct scatterlist *(*next)
> -				       (struct scatterlist *, unsigned int *),
>  				     unsigned int total_sg,
> -				     unsigned int total_out,
> -				     unsigned int total_in,
>  				     unsigned int out_sgs,
>  				     unsigned int in_sgs,
>  				     gfp_t gfp)
> @@ -155,7 +137,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  	/* Transfer entries from the sg lists into the indirect page */
>  	i = 0;
>  	for (n = 0; n < out_sgs; n++) {
> -		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
> +		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
>  			desc[i].flags = VRING_DESC_F_NEXT;
>  			desc[i].addr = sg_phys(sg);
>  			desc[i].len = sg->length;
> @@ -164,7 +146,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  		}
>  	}
>  	for (; n < (out_sgs + in_sgs); n++) {
> -		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
> +		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
>  			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
>  			desc[i].addr = sg_phys(sg);
>  			desc[i].len = sg->length;
> @@ -197,10 +179,7 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  
>  static inline int virtqueue_add(struct virtqueue *_vq,
>  				struct scatterlist *sgs[],
> -				struct scatterlist *(*next)
> -				  (struct scatterlist *, unsigned int *),
> -				unsigned int total_out,
> -				unsigned int total_in,
> +				unsigned int total_sg,
>  				unsigned int out_sgs,
>  				unsigned int in_sgs,
>  				void *data,
> @@ -208,7 +187,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  {
>  	struct vring_virtqueue *vq = to_vvq(_vq);
>  	struct scatterlist *sg;
> -	unsigned int i, n, avail, uninitialized_var(prev), total_sg;
> +	unsigned int i, n, avail, uninitialized_var(prev);
>  	int head;
>  
>  	START_USE(vq);
> @@ -233,13 +212,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	}
>  #endif
>  
> -	total_sg = total_in + total_out;
> -
>  	/* If the host supports indirect descriptor tables, and we have multiple
>  	 * buffers, then go indirect. FIXME: tune this threshold */
>  	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
> -		head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
> -					  total_in,
> +		head = vring_add_indirect(vq, sgs, total_sg, 
>  					  out_sgs, in_sgs, gfp);
>  		if (likely(head >= 0))
>  			goto add_head;
> @@ -265,7 +241,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  
>  	head = i = vq->free_head;
>  	for (n = 0; n < out_sgs; n++) {
> -		for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
> +		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
>  			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
>  			vq->vring.desc[i].addr = sg_phys(sg);
>  			vq->vring.desc[i].len = sg->length;
> @@ -274,7 +250,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  		}
>  	}
>  	for (; n < (out_sgs + in_sgs); n++) {
> -		for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
> +		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
>  			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
>  			vq->vring.desc[i].addr = sg_phys(sg);
>  			vq->vring.desc[i].len = sg->length;
> @@ -335,29 +311,23 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
>  		      void *data,
>  		      gfp_t gfp)
>  {
> -	unsigned int i, total_out, total_in;
> +	unsigned int i, total_sg = 0;
>  
>  	/* Count them first. */
> -	for (i = total_out = total_in = 0; i < out_sgs; i++) {
> -		struct scatterlist *sg;
> -		for (sg = sgs[i]; sg; sg = sg_next(sg))
> -			total_out++;
> -	}
> -	for (; i < out_sgs + in_sgs; i++) {
> +	for (i = 0; i < out_sgs + in_sgs; i++) {
>  		struct scatterlist *sg;
>  		for (sg = sgs[i]; sg; sg = sg_next(sg))
> -			total_in++;
> +			total_sg++;
>  	}
> -	return virtqueue_add(_vq, sgs, sg_next_chained,
> -			     total_out, total_in, out_sgs, in_sgs, data, gfp);
> +	return virtqueue_add(_vq, sgs, total_sg, out_sgs, in_sgs, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
>  
>  /**
>   * virtqueue_add_outbuf - expose output buffers to other end
>   * @vq: the struct virtqueue we're talking about.
> - * @sgs: array of scatterlists (need not be terminated!)
> - * @num: the number of scatterlists readable by other side
> + * @sg: scatterlist (must be well-formed and terminated!)
> + * @num: the number of entries in @sg readable by other side
>   * @data: the token identifying the buffer.
>   * @gfp: how to do memory allocations (if necessary).
>   *
> @@ -367,19 +337,19 @@ EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
>   * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
>   */
>  int virtqueue_add_outbuf(struct virtqueue *vq,
> -			 struct scatterlist sg[], unsigned int num,
> +			 struct scatterlist *sg, unsigned int num,
>  			 void *data,
>  			 gfp_t gfp)
>  {
> -	return virtqueue_add(vq, &sg, sg_next_arr, num, 0, 1, 0, data, gfp);
> +	return virtqueue_add(vq, &sg, num, 1, 0, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
>  
>  /**
>   * virtqueue_add_inbuf - expose input buffers to other end
>   * @vq: the struct virtqueue we're talking about.
> - * @sgs: array of scatterlists (need not be terminated!)
> - * @num: the number of scatterlists writable by other side
> + * @sg: scatterlist (must be well-formed and terminated!)
> + * @num: the number of entries in @sg writable by other side
>   * @data: the token identifying the buffer.
>   * @gfp: how to do memory allocations (if necessary).
>   *
> @@ -389,11 +359,11 @@ EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
>   * Returns zero or a negative error (ie. ENOSPC, ENOMEM, EIO).
>   */
>  int virtqueue_add_inbuf(struct virtqueue *vq,
> -			struct scatterlist sg[], unsigned int num,
> +			struct scatterlist *sg, unsigned int num,
>  			void *data,
>  			gfp_t gfp)
>  {
> -	return virtqueue_add(vq, &sg, sg_next_arr, 0, num, 0, 1, data, gfp);
> +	return virtqueue_add(vq, &sg, num, 0, 1, data, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
>  
> -- 
> 1.9.1


* Re: [PATCH 4/4] virtio_ring: unify direct/indirect code paths.
  2014-05-29  7:26                 ` [PATCH 4/4] virtio_ring: unify direct/indirect code paths Rusty Russell
  2014-05-29  7:52                   ` Peter Zijlstra
@ 2014-05-29 11:29                   ` Michael S. Tsirkin
  2014-05-30  2:37                     ` Rusty Russell
  1 sibling, 1 reply; 107+ messages in thread
From: Michael S. Tsirkin @ 2014-05-29 11:29 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Dave Hansen,
	Steven Rostedt

On Thu, May 29, 2014 at 04:56:45PM +0930, Rusty Russell wrote:
> virtqueue_add() populates the virtqueue descriptor table from the sgs
> given.  If it uses an indirect descriptor table, then it puts a single
> descriptor in the descriptor table pointing to the kmalloc'ed indirect
> table where the sg is populated.
> 
> Previously vring_add_indirect() did the allocation and the simple
> linear layout.  We replace that with alloc_indirect() which allocates
> the indirect table then chains it like the normal descriptor table so
> we can reuse the core logic.
> 
> Before:
> 	gcc 4.8.2: virtio_blk: stack used = 392
> 	gcc 4.6.4: virtio_blk: stack used = 480
> 
> After:
> 	gcc 4.8.2: virtio_blk: stack used = 408
> 	gcc 4.6.4: virtio_blk: stack used = 432
> 
> Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
> ---
>  drivers/virtio/virtio_ring.c | 120 ++++++++++++++++---------------------------
>  1 file changed, 45 insertions(+), 75 deletions(-)

It's nice that we have less code now, but this is the data path -
are you sure it's worth the performance cost?

> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 5d29cd85d6cf..3adf5978b92b 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -107,18 +107,10 @@ struct vring_virtqueue
>  
>  #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
>  
> -/* Set up an indirect table of descriptors and add it to the queue. */
> -static inline int vring_add_indirect(struct vring_virtqueue *vq,
> -				     struct scatterlist *sgs[],
> -				     unsigned int total_sg,
> -				     unsigned int out_sgs,
> -				     unsigned int in_sgs,
> -				     gfp_t gfp)
> +static struct vring_desc *alloc_indirect(unsigned int total_sg, gfp_t gfp)
>  {
> -	struct vring_desc *desc;
> -	unsigned head;
> -	struct scatterlist *sg;
> -	int i, n;
> +	struct vring_desc *desc;
> +	unsigned int i;
>  
>  	/*
>  	 * We require lowmem mappings for the descriptors because
> @@ -130,51 +122,13 @@ static inline int vring_add_indirect(struct vring_virtqueue *vq,
>  	if (record_stack == current)
>  		__asm__ __volatile__("movq %%rsp,%0" : "=g" (stack_top));
>  
> -	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
> -	if (!desc)
> -		return -ENOMEM;
> -
> -	/* Transfer entries from the sg lists into the indirect page */
> -	i = 0;
> -	for (n = 0; n < out_sgs; n++) {
> -		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -			desc[i].flags = VRING_DESC_F_NEXT;
> -			desc[i].addr = sg_phys(sg);
> -			desc[i].len = sg->length;
> -			desc[i].next = i+1;
> -			i++;
> -		}
> -	}
> -	for (; n < (out_sgs + in_sgs); n++) {
> -		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
> -			desc[i].addr = sg_phys(sg);
> -			desc[i].len = sg->length;
> -			desc[i].next = i+1;
> -			i++;
> -		}
> -	}
> -	BUG_ON(i != total_sg);
> -
> -	/* Last one doesn't continue. */
> -	desc[i-1].flags &= ~VRING_DESC_F_NEXT;
> -	desc[i-1].next = 0;
> -
> -	/* We're about to use a buffer */
> -	vq->vq.num_free--;
> -
> -	/* Use a single buffer which doesn't continue */
> -	head = vq->free_head;
> -	vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
> -	vq->vring.desc[head].addr = virt_to_phys(desc);
> -	/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
> -	kmemleak_ignore(desc);
> -	vq->vring.desc[head].len = i * sizeof(struct vring_desc);
> -
> -	/* Update free pointer */
> -	vq->free_head = vq->vring.desc[head].next;
> +	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp);
> +	if (!desc)
> +		return NULL;
>  
> -	return head;
> +	for (i = 0; i < total_sg; i++)
> +		desc[i].next = i+1;
> +	return desc;

Hmm we are doing an extra walk over descriptors here.
This might hurt performance esp for big descriptors.

>  }
>  
>  static inline int virtqueue_add(struct virtqueue *_vq,
> @@ -187,6 +141,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  {
>  	struct vring_virtqueue *vq = to_vvq(_vq);
>  	struct scatterlist *sg;
> +	struct vring_desc *desc = NULL;
>  	unsigned int i, n, avail, uninitialized_var(prev);
>  	int head;
>  
> @@ -212,18 +167,32 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	}
>  #endif
>  
> +	BUG_ON(total_sg > vq->vring.num);
> +	BUG_ON(total_sg == 0);
> +
> +	head = vq->free_head;
> +
>  	/* If the host supports indirect descriptor tables, and we have multiple
>  	 * buffers, then go indirect. FIXME: tune this threshold */
> -	if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
> -		head = vring_add_indirect(vq, sgs, total_sg, 
> -					  out_sgs, in_sgs, gfp);
> -		if (likely(head >= 0))
> -			goto add_head;
> +	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
> +		desc = alloc_indirect(total_sg, gfp);

else desc = NULL will be a bit clearer won't it?

> +
> +	if (desc) {
> +		/* Use a single buffer which doesn't continue */
> +		vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
> +		vq->vring.desc[head].addr = virt_to_phys(desc);
> +		/* avoid kmemleak false positive (tis hidden by virt_to_phys) */
> +		kmemleak_ignore(desc);
> +		vq->vring.desc[head].len = total_sg * sizeof(struct vring_desc);
> +
> +		/* Set up rest to use this indirect table. */
> +		i = 0;
> +		total_sg = 1;
> +	} else {
> +		desc = vq->vring.desc;
> +		i = head;
>  	}
>  
> -	BUG_ON(total_sg > vq->vring.num);
> -	BUG_ON(total_sg == 0);
> -
>  	if (vq->vq.num_free < total_sg) {
>  		pr_debug("Can't add buf len %i - avail = %i\n",
>  			 total_sg, vq->vq.num_free);
> @@ -239,32 +208,33 @@ static inline int virtqueue_add(struct virtqueue *_vq,
>  	/* We're about to use some buffers from the free list. */
>  	vq->vq.num_free -= total_sg;
>  
> -	head = i = vq->free_head;
>  	for (n = 0; n < out_sgs; n++) {
>  		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -			vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
> -			vq->vring.desc[i].addr = sg_phys(sg);
> -			vq->vring.desc[i].len = sg->length;
> +			desc[i].flags = VRING_DESC_F_NEXT;
> +			desc[i].addr = sg_phys(sg);
> +			desc[i].len = sg->length;
>  			prev = i;
> -			i = vq->vring.desc[i].next;
> +			i = desc[i].next;
>  		}
>  	}
>  	for (; n < (out_sgs + in_sgs); n++) {
>  		for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -			vq->vring.desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
> -			vq->vring.desc[i].addr = sg_phys(sg);
> -			vq->vring.desc[i].len = sg->length;
> +			desc[i].flags = VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
> +			desc[i].addr = sg_phys(sg);
> +			desc[i].len = sg->length;
>  			prev = i;
> -			i = vq->vring.desc[i].next;
> +			i = desc[i].next;
>  		}
>  	}
>  	/* Last one doesn't continue. */
> -	vq->vring.desc[prev].flags &= ~VRING_DESC_F_NEXT;
> +	desc[prev].flags &= ~VRING_DESC_F_NEXT;
>  
>  	/* Update free pointer */
> -	vq->free_head = i;
> +	if (desc == vq->vring.desc)
> +		vq->free_head = i;
> +	else
> +		vq->free_head = vq->vring.desc[head].next;

This one is slightly ugly isn't it?


>  
> -add_head:
>  	/* Set token. */
>  	vq->data[head] = data;
>  
> -- 
> 1.9.1

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [PATCH 4/4] virtio_ring: unify direct/indirect code paths.
  2014-05-29 11:05                     ` Rusty Russell
@ 2014-05-29 11:33                       ` Michael S. Tsirkin
  0 siblings, 0 replies; 107+ messages in thread
From: Michael S. Tsirkin @ 2014-05-29 11:33 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Peter Zijlstra, Linus Torvalds, Dave Chinner, Jens Axboe,
	Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Mel Gorman, Rik van Riel,
	Johannes Weiner, Hugh Dickins, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 08:35:58PM +0930, Rusty Russell wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> > On Thu, May 29, 2014 at 04:56:45PM +0930, Rusty Russell wrote:
> >> Before:
> >> 	gcc 4.8.2: virtio_blk: stack used = 392
> >> 	gcc 4.6.4: virtio_blk: stack used = 480
> >> 
> >> After:
> >> 	gcc 4.8.2: virtio_blk: stack used = 408
> >> 	gcc 4.6.4: virtio_blk: stack used = 432
> >
> > Is it worth it to make the good compiler worse? People are going to use
> > the newer GCC more as time goes on anyhow.
> 
> No, but it's only 16 bytes of stack loss for a simplicity win:
> 
>  virtio_ring.c |  120 +++++++++++++++++++++-------------------------------------
>  1 file changed, 45 insertions(+), 75 deletions(-)
> 
> Cheers,
> Rusty.

I'm concerned that we are doing an extra descriptor walk now though.
And desc == &vq.desc at the end is kind of ugly too.

How about
		if (indirect)
			vq->vring.desc[i].next = i + 1;
		else
			i = vq->vring.desc[i].next;

or something like this?



* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28  9:27   ` [RFC 2/2] x86_64: expand kernel stack to 16K Borislav Petkov
@ 2014-05-29 13:23     ` One Thousand Gnomes
  0 siblings, 0 replies; 107+ messages in thread
From: One Thousand Gnomes @ 2014-05-29 13:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Minchan Kim, linux-kernel, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, rusty, mst,
	Dave Hansen, Steven Rostedt

> Hmm, stupid question: what happens when 16K is not enough too, do we
> increase again? When do we stop increasing? 1M, 2M... ?

It's not a stupid question; it's IMHO the most important question.

> Sounds like we want to make it a config option with a couple of sizes
> for everyone to be happy. :-)

At the moment it goes bang if you freakily get three layers of recursion
through allocations. But show me the proof we can't already hit four, or
five, or six...

We don't *need* to allocate tons of stack memory to each task just because
we might recursively allocate. We don't solve the problem by doing so
either. We at best fudge over it.

Why is *any* recursive memory allocation not ending up waiting for other
kernel worker threads to free up memory (beyond it being rather hard to
go and retrofit)?




* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  7:26             ` [RFC 2/2] x86_64: expand kernel stack to 16K Dave Chinner
@ 2014-05-29 15:24               ` Linus Torvalds
  2014-05-29 23:40                 ` Minchan Kim
                                   ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-29 15:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Jens Axboe, Minchan Kim, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On Thu, May 29, 2014 at 12:26 AM, Dave Chinner <david@fromorbit.com> wrote:
>
> What concerns me about both __alloc_pages_nodemask() and
> kernel_map_pages is that when I look at the code I see functions
> that have no obvious stack usage problem. However, the compiler is
> producing functions with huge stack footprints and it's not at all
> obvious when I read the code. So in this case I'm more concerned
> that we have a major disconnect between the source code structure
> and the code that the compiler produces...

I agree. In fact, this is the main reason that Minchan's call trace
and this thread has actually convinced me that yes, we really do need
to make x86-64 have a 16kB stack (well, 16kB allocation - there's
still the thread info etc too).

Usually when we see the stack-smashing traces, they are because
somebody did something stupid. In this case, there are certainly
stupid details, and things I think we should fix, but there is *not*
the usual red flag of "Christ, somebody did something _really_ wrong".

So I'm not in fact arguing against Minchan's patch of upping
THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
remain one of my "we really need to be careful" issues, so while I am
basically planning on applying that patch, I _also_ want to make sure
that we fix the problems we do see and not just paper them over.

The 8kB stack has been somewhat restrictive and painful for a while,
and I'm ok with admitting that it is just getting _too_ damn painful,
but I don't want to just give up entirely when we have a known deep
stack case.

                      Linus


* Re: [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk
  2014-05-29  7:26                 ` [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk Rusty Russell
@ 2014-05-29 15:39                   ` Linus Torvalds
  0 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-29 15:39 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Dave Chinner, Jens Axboe, Minchan Kim, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 12:26 AM, Rusty Russell <rusty@rustcorp.com.au> wrote:
> Results (x86-64, Minchan's .config):
>
> gcc 4.8.2: virtio_blk: stack used = 392
> gcc 4.6.4: virtio_blk: stack used = 528

I wonder if that's just random luck (although 35% more stack use seems
to be bigger than "random" - that's quite a big difference), or
whether the gcc guys are aware of having fixed some major stack spill
issue.

But yeah, Minchan uses gcc 4.6.3 according to one of his emails, so
_part_ of his stack smashing is probably due to compiler version.

                  Linus


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  1:58           ` Dave Chinner
  2014-05-29  2:51             ` Linus Torvalds
@ 2014-05-29 23:36             ` Minchan Kim
  2014-05-30  0:05               ` Linus Torvalds
  2014-05-30  0:15               ` Dave Chinner
  1 sibling, 2 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29 23:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton,
	linux-mm, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Johannes Weiner, Hugh Dickins,
	Rusty Russell, Michael S. Tsirkin, Dave Hansen, Steven Rostedt

Hello Dave,

On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > Author: Jens Axboe <jaxboe@fusionio.com>
> > Date:   Sat Apr 16 13:27:55 2011 +0200
> > 
> >     block: let io_schedule() flush the plug inline
> >     
> >     Linus correctly observes that the most important dispatch cases
> >     are now done from kblockd, this isn't ideal for latency reasons.
> >     The original reason for switching dispatches out-of-line was to
> >     avoid too deep a stack, so by _only_ letting the "accidental"
> >     flush directly in schedule() be guarded by offload to kblockd,
> >     we should be able to get the best of both worlds.
> >     
> >     So add a blk_schedule_flush_plug() that offloads to kblockd,
> >     and only use that from the schedule() path.
> >     
> >     Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> > 
> > And now we have too deep a stack due to unplugging from io_schedule()...
> 
> So, if we make io_schedule() push the plug list off to the kblockd
> like is done for schedule()....
> 
> > > IOW, swap-out directly caused that extra 3kB of stack use in what was
> > > a deep call chain (due to memory allocation). I really don't
> > > understand why you are arguing anything else on a pure technicality.
> > >
> > > I thought you had some other argument for why swap was different, and
> > > against removing that "page_is_file_cache()" special case in
> > > shrink_page_list().
> > 
> > I've said in the past that swap is different to filesystem
> > ->writepage implementations because it doesn't require significant
> > stack to do block allocation and doesn't trigger IO deep in that
> > allocation stack. Hence it has much lower stack overhead than the
> > filesystem ->writepage implementations and so is much less likely to
> > have stack issues.
> > 
> > This stack overflow shows us that just the memory reclaim + IO
> > layers are sufficient to cause a stack overflow,
> 
> .... we solve this problem directly by being able to remove the IO
> stack usage from the direct reclaim swap path.
> 
> IOWs, we don't need to turn swap off at all in direct reclaim
> because all the swap IO can be captured in a plug list and
> dispatched via kblockd. This could be done either by io_schedule()
> or a new blk_flush_plug_list() wrapper that pushes the work to
> kblockd...

I did the hacky test below to apply your idea and the result is an overflow
again. So, again, this seconds the stack expansion. Otherwise, we should
prevent swapout in direct reclaim.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..95f169e85dbe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
 void __sched io_schedule(void)
 {
 	struct rq *rq = raw_rq();
+	struct blk_plug *plug = current->plug;
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;


[ 1209.764725] kworker/u24:0 (23627) used greatest stack depth: 304 bytes left
[ 1510.835509] kworker/u24:1 (25817) used greatest stack depth: 144 bytes left
[ 3701.482790] PANIC: double fault, error_code: 0x0
[ 3701.483297] CPU: 8 PID: 6117 Comm: kworker/u24:1 Not tainted 3.14.0+ #201
[ 3701.483980] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3701.484366] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[ 3701.484366] task: ffff8800353c41c0 ti: ffff880000106000 task.ti: ffff880000106000
[ 3701.484366] RIP: 0010:[<ffffffff810a5390>]  [<ffffffff810a5390>] __lock_acquire+0x170/0x1ca0
[ 3701.484366] RSP: 0000:ffff880000105f58  EFLAGS: 00010046
[ 3701.484366] RAX: 0000000000000001 RBX: ffff8800353c41c0 RCX: 0000000000000002
[ 3701.484366] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff81c4a1e0
[ 3701.484366] RBP: ffff880000106048 R08: 0000000000000001 R09: 0000000000000001
[ 3701.484366] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[ 3701.484366] R13: 0000000000000000 R14: ffffffff81c4a1e0 R15: 0000000000000000
[ 3701.484366] FS:  0000000000000000(0000) GS:ffff880037d00000(0000) knlGS:0000000000000000
[ 3701.484366] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3701.484366] CR2: ffff880000105f48 CR3: 0000000001c0b000 CR4: 00000000000006e0
[ 3701.484366] Stack:
[ 3701.484366] BUG: unable to handle kernel paging request at ffff880000105f58
[ 3701.484366] IP: [<ffffffff81004e14>] show_stack_log_lvl+0x134/0x1a0
[ 3701.484366] PGD 28c5067 PUD 28c6067 PMD 28c7067 PTE 8000000000105060
[ 3701.484366] Thread overran stack, or stack corrupted
[ 3701.484366] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 3701.484366] Dumping ftrace buffer:
[ 3701.484366] ---------------------------------
[ 3701.484366]    <...>-6117    8d..4 3786719374us : stack_trace_call:         Depth    Size   Location    (46 entries)
[ 3701.484366]         -----    ----   --------
[ 3701.484366]    <...>-6117    8d..4 3786719395us : stack_trace_call:   0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
[ 3701.484366]    <...>-6117    8d..4 3786719395us : stack_trace_call:   1)     7192     296   get_page_from_freelist+0x886/0x920
[ 3701.484366]    <...>-6117    8d..4 3786719395us : stack_trace_call:   2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
[ 3701.484366]    <...>-6117    8d..4 3786719396us : stack_trace_call:   3)     6544       8   alloc_pages_current+0x10f/0x1f0
[ 3701.484366]    <...>-6117    8d..4 3786719396us : stack_trace_call:   4)     6536     168   new_slab+0x2c5/0x370
[ 3701.484366]    <...>-6117    8d..4 3786719396us : stack_trace_call:   5)     6368       8   __slab_alloc+0x3a9/0x501
[ 3701.484366]    <...>-6117    8d..4 3786719396us : stack_trace_call:   6)     6360      80   __kmalloc+0x1cb/0x200
[ 3701.484366]    <...>-6117    8d..4 3786719396us : stack_trace_call:   7)     6280     376   vring_add_indirect+0x36/0x200
[ 3701.484366]    <...>-6117    8d..4 3786719397us : stack_trace_call:   8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
[ 3701.484366]    <...>-6117    8d..4 3786719397us : stack_trace_call:   9)     5760     288   __virtblk_add_req+0xda/0x1b0
[ 3701.484366]    <...>-6117    8d..4 3786719397us : stack_trace_call:  10)     5472      96   virtio_queue_rq+0xd3/0x1d0
[ 3701.484366]    <...>-6117    8d..4 3786719397us : stack_trace_call:  11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
[ 3701.484366]    <...>-6117    8d..4 3786719397us : stack_trace_call:  12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
[ 3701.484366]    <...>-6117    8d..4 3786719397us : stack_trace_call:  13)     5232      96   blk_mq_insert_requests+0xdb/0x160
[ 3701.484366]    <...>-6117    8d..4 3786719398us : stack_trace_call:  14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
[ 3701.484366]    <...>-6117    8d..4 3786719398us : stack_trace_call:  15)     5024     112   blk_flush_plug_list+0xc7/0x220
[ 3701.484366]    <...>-6117    8d..4 3786719398us : stack_trace_call:  16)     4912     128   blk_mq_make_request+0x42a/0x600
[ 3701.484366]    <...>-6117    8d..4 3786719398us : stack_trace_call:  17)     4784      48   generic_make_request+0xc0/0x100
[ 3701.484366]    <...>-6117    8d..4 3786719398us : stack_trace_call:  18)     4736     112   submit_bio+0x86/0x160
[ 3701.484366]    <...>-6117    8d..4 3786719398us : stack_trace_call:  19)     4624     160   __swap_writepage+0x198/0x230
[ 3701.484366]    <...>-6117    8d..4 3786719399us : stack_trace_call:  20)     4464      32   swap_writepage+0x42/0x90
[ 3701.484366]    <...>-6117    8d..4 3786719399us : stack_trace_call:  21)     4432     320   shrink_page_list+0x676/0xa80
[ 3701.484366]    <...>-6117    8d..4 3786719399us : stack_trace_call:  22)     4112     208   shrink_inactive_list+0x262/0x4e0
[ 3701.484366]    <...>-6117    8d..4 3786719399us : stack_trace_call:  23)     3904     304   shrink_lruvec+0x3e1/0x6a0
[ 3701.484366]    <...>-6117    8d..4 3786719399us : stack_trace_call:  24)     3600      80   shrink_zone+0x3f/0x110
[ 3701.484366]    <...>-6117    8d..4 3786719400us : stack_trace_call:  25)     3520     128   do_try_to_free_pages+0x156/0x4c0
[ 3701.484366]    <...>-6117    8d..4 3786719400us : stack_trace_call:  26)     3392     208   try_to_free_pages+0xf7/0x1e0
[ 3701.484366]    <...>-6117    8d..4 3786719400us : stack_trace_call:  27)     3184     352   __alloc_pages_nodemask+0x783/0xb20
[ 3701.484366]    <...>-6117    8d..4 3786719400us : stack_trace_call:  28)     2832       8   alloc_pages_current+0x10f/0x1f0
[ 3701.484366]    <...>-6117    8d..4 3786719400us : stack_trace_call:  29)     2824     200   __page_cache_alloc+0x13f/0x160
[ 3701.484366]    <...>-6117    8d..4 3786719400us : stack_trace_call:  30)     2624      80   find_or_create_page+0x4c/0xb0
[ 3701.484366]    <...>-6117    8d..4 3786719401us : stack_trace_call:  31)     2544     112   __getblk+0x109/0x2f0
[ 3701.484366]    <...>-6117    8d..4 3786719401us : stack_trace_call:  32)     2432     224   ext4_ext_insert_extent+0x4d8/0x1270
[ 3701.484366]    <...>-6117    8d..4 3786719401us : stack_trace_call:  33)     2208     256   ext4_ext_map_blocks+0x8d4/0x1010
[ 3701.484366]    <...>-6117    8d..4 3786719401us : stack_trace_call:  34)     1952     160   ext4_map_blocks+0x325/0x530
[ 3701.484366]    <...>-6117    8d..4 3786719401us : stack_trace_call:  35)     1792     384   ext4_writepages+0x6d1/0xce0
[ 3701.484366]    <...>-6117    8d..4 3786719402us : stack_trace_call:  36)     1408      16   do_writepages+0x23/0x40
[ 3701.484366]    <...>-6117    8d..4 3786719402us : stack_trace_call:  37)     1392      96   __writeback_single_inode+0x45/0x2e0
[ 3701.484366]    <...>-6117    8d..4 3786719402us : stack_trace_call:  38)     1296     176   writeback_sb_inodes+0x2ad/0x500
[ 3701.484366]    <...>-6117    8d..4 3786719402us : stack_trace_call:  39)     1120      80   __writeback_inodes_wb+0x9e/0xd0
[ 3701.484366]    <...>-6117    8d..4 3786719402us : stack_trace_call:  40)     1040     160   wb_writeback+0x29b/0x350
[ 3701.484366]    <...>-6117    8d..4 3786719402us : stack_trace_call:  41)      880     208   bdi_writeback_workfn+0x11c/0x480
[ 3701.484366]    <...>-6117    8d..4 3786719403us : stack_trace_call:  42)      672     144   process_one_work+0x1d2/0x570
[ 3701.484366]    <...>-6117    8d..4 3786719403us : stack_trace_call:  43)      528     112   worker_thread+0x116/0x370
[ 3701.484366]    <...>-6117    8d..4 3786719403us : stack_trace_call:  44)      416     240   kthread+0xf3/0x110
[ 3701.484366]    <...>-6117    8d..4 3786719403us : stack_trace_call:  45)      176     176   ret_from_fork+0x7c/0xb0
[ 3701.484366] ---------------------------------
[ 3701.484366] Modules linked in:
[ 3701.484366] CPU: 8 PID: 6117 Comm: kworker/u24:1 Not tainted 3.14.0+ #201
[ 3701.484366] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 3701.484366] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[ 3701.484366] task: ffff8800353c41c0 ti: ffff880000106000 task.ti: ffff880000106000
[ 3701.484366] RIP: 0010:[<ffffffff81004e14>]  [<ffffffff81004e14>] show_stack_log_lvl+0x134/0x1a0
[ 3701.484366] RSP: 0000:ffff880037d06e58  EFLAGS: 00010046
[ 3701.484366] RAX: 000000000000000c RBX: 0000000000000000 RCX: 0000000000000000
[ 3701.484366] RDX: ffff880037cfffc0 RSI: ffff880037d06f58 RDI: 0000000000000000
[ 3701.484366] RBP: ffff880037d06ea8 R08: ffffffff81a0804c R09: 0000000000000000
[ 3701.484366] R10: 0000000000000002 R11: 0000000000000002 R12: ffff880037d06f58
[ 3701.484366] R13: 0000000000000000 R14: ffff880000105f58 R15: ffff880037d03fc0
[ 3701.484366] FS:  0000000000000000(0000) GS:ffff880037d00000(0000) knlGS:0000000000000000
[ 3701.484366] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3701.484366] CR2: ffff880000105f58 CR3: 0000000001c0b000 CR4: 00000000000006e0
[ 3701.484366] Stack:
[ 3701.484366]  0000000000000000 ffff880000105f58 ffff880037d06f58 ffff880000105f58
[ 3701.484366]  ffff880000106000 ffff880037d06f58 0000000000000040 ffff880037d06f58
[ 3701.484366]  ffff880000105f58 0000000000000000 ffff880037d06ef8 ffffffff81004f1c
[ 3701.484366] Call Trace:
[ 3701.484366]  <#DF> 
[ 3701.484366]  [<ffffffff81004f1c>] show_regs+0x9c/0x1f0
[ 3701.484366]  [<ffffffff8103aa37>] df_debug+0x27/0x40
[ 3701.484366]  [<ffffffff81003361>] do_double_fault+0x61/0x80
[ 3701.484366]  [<ffffffff816f0907>] double_fault+0x27/0x30
[ 3701.484366]  [<ffffffff810a5390>] ? __lock_acquire+0x170/0x1ca0
[ 3701.484366]  <<EOE>> 

-- 
Kind regards,
Minchan Kim


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29 15:24               ` Linus Torvalds
@ 2014-05-29 23:40                 ` Minchan Kim
  2014-05-29 23:53                 ` Dave Chinner
  2014-05-30  9:48                 ` Richard Weinberger
  2 siblings, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29 23:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jens Axboe, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

Hello Linus,

On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 12:26 AM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > What concerns me about both __alloc_pages_nodemask() and
> > kernel_map_pages is that when I look at the code I see functions
> > that have no obvious stack usage problem. However, the compiler is
> > producing functions with huge stack footprints and it's not at all
> > obvious when I read the code. So in this case I'm more concerned
> > that we have a major disconnect between the source code structure
> > and the code that the compiler produces...
> 
> I agree. In fact, this is the main reason that Minchan's call trace
> and this thread has actually convinced me that yes, we really do need
> to make x86-64 have a 16kB stack (well, 16kB allocation - there's
> still the thread info etc too).
> 
> Usually when we see the stack-smashing traces, they are because
> somebody did something stupid. In this case, there are certainly
> stupid details, and things I think we should fix, but there is *not*
> the usual red flag of "Christ, somebody did something _really_ wrong".
> 
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.

So, should I resend the patch w/o RFC in the subject but with Dave's Acked-by?
Or will you do it yourself?

> 
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.
> 
>                       Linus
> 

-- 
Kind regards,
Minchan Kim


* Re: virtio ring cleanups, which save stack on older gcc
  2014-05-29 11:08                   ` Rusty Russell
@ 2014-05-29 23:45                     ` Minchan Kim
  2014-05-30  1:06                       ` Minchan Kim
  2014-05-30  6:56                       ` Rusty Russell
  0 siblings, 2 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-29 23:45 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 08:38:33PM +0930, Rusty Russell wrote:
> Minchan Kim <minchan@kernel.org> writes:
> > Hello Rusty,
> >
> > On Thu, May 29, 2014 at 04:56:41PM +0930, Rusty Russell wrote:
> >> They don't make much difference: the easier fix is use gcc 4.8
> >> which drops stack required across virtio block's virtio_queue_rq
> >> down to that kmalloc in virtio_ring from 528 to 392 bytes.
> >> 
> >> Still, these (*lightly tested*) patches reduce to 432 bytes,
> >> even for gcc 4.6.4.  Posted here FYI.
> >
> > I am testing with the hack below, which implements Dave's idea, so I don't
> > have a machine to test your patches until tomorrow.
> > So, I will queue your patches into the testing machine tomorrow morning.
> 
> More interesting would be updating your compiler to 4.8, I think.
> Saving <100 bytes on virtio is not going to save you, right?

But in my report, virtio_ring consumes more stack than in yours.
As I mentioned to Steven in the other thread, I don't know why the stack
tracer reports that vring_add_indirect consumes 376 bytes. Apparently,
objdump says it doesn't consume that much, so I'd like to test your
patches and see the result.

Thanks.

[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0

> 
> Cheers,
> Rusty.
> 

-- 
Kind regards,
Minchan Kim


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29 15:24               ` Linus Torvalds
  2014-05-29 23:40                 ` Minchan Kim
@ 2014-05-29 23:53                 ` Dave Chinner
  2014-05-30  0:06                   ` Dave Jones
  2014-05-30  9:48                 ` Richard Weinberger
  2 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-29 23:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Minchan Kim, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On Thu, May 29, 2014 at 08:24:49AM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 12:26 AM, Dave Chinner <david@fromorbit.com> wrote:
> >
> > What concerns me about both __alloc_pages_nodemask() and
> > kernel_map_pages is that when I look at the code I see functions
> > that have no obvious stack usage problem. However, the compiler is
> > producing functions with huge stack footprints and it's not at all
> > obvious when I read the code. So in this case I'm more concerned
> > that we have a major disconnect between the source code structure
> > and the code that the compiler produces...
> 
> I agree. In fact, this is the main reason that Minchan's call trace
> and this thread has actually convinced me that yes, we really do need
> to make x86-64 have a 16kB stack (well, 16kB allocation - there's
> still the thread info etc too).
> 
> Usually when we see the stack-smashing traces, they are because
> somebody did something stupid. In this case, there are certainly
> stupid details, and things I think we should fix, but there is *not*
> the usual red flag of "Christ, somebody did something _really_ wrong".
> 
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.
> 
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.

That sounds like a plan. Perhaps it would be useful to add a
WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
8k) so that we get some indication that we're hitting a deep stack
but the system otherwise keeps functioning. That gives us some
motivation to keep stack usage down but isn't a fatal problem like
it is now....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29 23:36             ` Minchan Kim
@ 2014-05-30  0:05               ` Linus Torvalds
  2014-05-30  0:20                 ` Minchan Kim
  2014-05-30  1:30                 ` Linus Torvalds
  2014-05-30  0:15               ` Dave Chinner
  1 sibling, 2 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30  0:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

[-- Attachment #1: Type: text/plain, Size: 918 bytes --]

On Thu, May 29, 2014 at 4:36 PM, Minchan Kim <minchan@kernel.org> wrote:
>
> I did the hacky test below to apply your idea and the result is an overflow
> again. So, again, this seconds the stack expansion. Otherwise, we should
> prevent swapout in direct reclaim.

So changing io_schedule() is bad, for the reasons I outlined elsewhere
(we use it for wait_for_page*() - see sleep_on_page()).

It's the congestion waiting where the io_schedule() should be avoided.

So maybe test a patch something like the attached.

NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible
horrible things. It seems to compile, but I have absolutely no reason
to believe that it would work. I didn't actually test that this moves
anything at all to kblockd. So think of it as a concept patch that
*might* work, but as Dave said, there might also be other things that
cause unplugging and need some tough love.

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/plain, Size: 2753 bytes --]

 mm/backing-dev.c | 28 ++++++++++++++++++----------
 mm/vmscan.c      |  4 +---
 2 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 09d9591b7708..cb26b24c2da2 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include <linux/writeback.h>
 #include <linux/device.h>
 #include <trace/events/writeback.h>
+#include <linux/blkdev.h>
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+	struct blk_plug *plug = current->plug;
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	if (plug)
+		blk_flush_plug_list(plug, true);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync,timeout);
 
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
@@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
@@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	}
 
 	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);
 
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 32c661d66a45..1e524000b83e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * avoid risk of stack overflow but only writeback
 			 * if many dirty pages have been encountered.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() ||
-					 !zone_is_reclaim_dirty(zone))) {
+			if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29 23:53                 ` Dave Chinner
@ 2014-05-30  0:06                   ` Dave Jones
  2014-05-30  0:21                     ` Dave Chinner
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Jones @ 2014-05-30  0:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:

 > That sounds like a plan. Perhaps it would be useful to add a
 > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
 > 8k) so that we get some indication that we're hitting a deep stack
 > but the system otherwise keeps functioning. That gives us some
 > motivation to keep stack usage down but isn't a fatal problem like
 > it is now....

We have check_stack_usage() and DEBUG_STACK_USAGE for this.
Though it needs some tweaking if we move to 16K

I gave it a try yesterday, and noticed a spew of noisy warnings as soon
as I gave it a workload to chew on. (More so than usual with 8K stacks.)

	Dave



* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29 23:36             ` Minchan Kim
  2014-05-30  0:05               ` Linus Torvalds
@ 2014-05-30  0:15               ` Dave Chinner
  2014-05-30  2:12                 ` Minchan Kim
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-30  0:15 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton,
	linux-mm, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Johannes Weiner, Hugh Dickins,
	Rusty Russell, Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> Hello Dave,
> 
> On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > Author: Jens Axboe <jaxboe@fusionio.com>
> > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > 
> > >     block: let io_schedule() flush the plug inline
> > >     
> > >     Linus correctly observes that the most important dispatch cases
> > >     are now done from kblockd, this isn't ideal for latency reasons.
> > >     The original reason for switching dispatches out-of-line was to
> > >     avoid too deep a stack, so by _only_ letting the "accidental"
> > >     flush directly in schedule() be guarded by offload to kblockd,
> > >     we should be able to get the best of both worlds.
> > >     
> > >     So add a blk_schedule_flush_plug() that offloads to kblockd,
> > >     and only use that from the schedule() path.
> > >     
> > >     Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> > > 
> > > And now we have too deep a stack due to unplugging from io_schedule()...
> > 
> > So, if we make io_schedule() push the plug list off to the kblockd
> > like is done for schedule()....
....
> I did the hacky test below to apply your idea, and the result is an overflow again.
> So, once more it argues for stack expansion. Otherwise, we should prevent
> swapout in direct reclaim.
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5c6635b806c..95f169e85dbe 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
>  void __sched io_schedule(void)
>  {
>  	struct rq *rq = raw_rq();
> +	struct blk_plug *plug = current->plug;
>  
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> -	blk_flush_plug(current);
> +	if (plug)
> +		blk_flush_plug_list(plug, true);
> +
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;

.....

>         Depth    Size   Location    (46 entries)
>
>   0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
>   1)     7192     296   get_page_from_freelist+0x886/0x920
>   2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
>   3)     6544       8   alloc_pages_current+0x10f/0x1f0
>   4)     6536     168   new_slab+0x2c5/0x370
>   5)     6368       8   __slab_alloc+0x3a9/0x501
>   6)     6360      80   __kmalloc+0x1cb/0x200
>   7)     6280     376   vring_add_indirect+0x36/0x200
>   8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
>   9)     5760     288   __virtblk_add_req+0xda/0x1b0
>  10)     5472      96   virtio_queue_rq+0xd3/0x1d0
>  11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
>  12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
>  13)     5232      96   blk_mq_insert_requests+0xdb/0x160
>  14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
>  15)     5024     112   blk_flush_plug_list+0xc7/0x220
>  16)     4912     128   blk_mq_make_request+0x42a/0x600
>  17)     4784      48   generic_make_request+0xc0/0x100
>  18)     4736     112   submit_bio+0x86/0x160
>  19)     4624     160   __swap_writepage+0x198/0x230
>  20)     4464      32   swap_writepage+0x42/0x90
>  21)     4432     320   shrink_page_list+0x676/0xa80
>  22)     4112     208   shrink_inactive_list+0x262/0x4e0
>  23)     3904     304   shrink_lruvec+0x3e1/0x6a0

The device is supposed to be plugged here in shrink_lruvec().

Oh, a plug can only hold 16 individual bios, and then it does a
synchronous flush. Hmmm - perhaps that should also defer the flush
to the kblockd, because if we are overrunning a plug then we've
already surrendered IO dispatch latency....

So, in blk_mq_make_request(), can you do:

			if (list_empty(&plug->mq_list))
				trace_block_plug(q);
			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
				trace_block_plug(q);
			}
			list_add_tail(&rq->queuelist, &plug->mq_list);

To see if that defers all the swap IO to kblockd?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:05               ` Linus Torvalds
@ 2014-05-30  0:20                 ` Minchan Kim
  2014-05-30  0:31                   ` Linus Torvalds
  2014-05-30  1:30                 ` Linus Torvalds
  1 sibling, 1 reply; 107+ messages in thread
From: Minchan Kim @ 2014-05-30  0:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

Hello Linus,

On Thu, May 29, 2014 at 05:05:17PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 4:36 PM, Minchan Kim <minchan@kernel.org> wrote:
> >
> > I did the hacky test below to apply your idea, and the result is an overflow again.
> > So, once more it argues for stack expansion. Otherwise, we should prevent
> > swapout in direct reclaim.
> 
> So changing io_schedule() is bad, for the reasons I outlined elsewhere
> (we use it for wait_for_page*() - see sleep_on_page().
> 
> It's the congestion waiting where the io_schedule() should be avoided.
> 
> So maybe test a patch something like the attached.
> 
> NOTE! This is absolutely TOTALLY UNTESTED! It might do horrible
> horrible things. It seems to compile, but I have absolutely no reason
> to believe that it would work. I didn't actually test that this moves
> anything at all to kblockd. So think of it as a concept patch that
> *might* work, but as Dave said, there might also be other things that
> cause unplugging and need some tough love.
> 
>                    Linus

>  mm/backing-dev.c | 28 ++++++++++++++++++----------
>  mm/vmscan.c      |  4 +---
>  2 files changed, 19 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 09d9591b7708..cb26b24c2da2 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -11,6 +11,7 @@
>  #include <linux/writeback.h>
>  #include <linux/device.h>
>  #include <trace/events/writeback.h>
> +#include <linux/blkdev.h>
>  
>  static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
>  
> @@ -573,6 +574,21 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
>  }
>  EXPORT_SYMBOL(set_bdi_congested);
>  
> +static long congestion_timeout(int sync, long timeout)
> +{
> +	long ret;
> +	DEFINE_WAIT(wait);
> +	struct blk_plug *plug = current->plug;
> +	wait_queue_head_t *wqh = &congestion_wqh[sync];
> +
> +	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +	if (plug)
> +		blk_flush_plug_list(plug, true);
> +	ret = schedule_timeout(timeout);
> +	finish_wait(wqh, &wait);
> +	return ret;
> +}
> +
>  /**
>   * congestion_wait - wait for a backing_dev to become uncongested
>   * @sync: SYNC or ASYNC IO
> @@ -586,12 +602,8 @@ long congestion_wait(int sync, long timeout)
>  {
>  	long ret;
>  	unsigned long start = jiffies;
> -	DEFINE_WAIT(wait);
> -	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> -	ret = io_schedule_timeout(timeout);
> -	finish_wait(wqh, &wait);
> +	ret = congestion_timeout(sync, timeout);
>  
>  	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
>  					jiffies_to_usecs(jiffies - start));
> @@ -622,8 +634,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>  {
>  	long ret;
>  	unsigned long start = jiffies;
> -	DEFINE_WAIT(wait);
> -	wait_queue_head_t *wqh = &congestion_wqh[sync];
>  
>  	/*
>  	 * If there is no congestion, or heavy congestion is not being
> @@ -643,9 +653,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>  	}
>  
>  	/* Sleep until uncongested or a write happens */
> -	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> -	ret = io_schedule_timeout(timeout);
> -	finish_wait(wqh, &wait);
> +	ret = congestion_timeout(sync, timeout);
>  
>  out:
>  	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 32c661d66a45..1e524000b83e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -989,9 +989,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  			 * avoid risk of stack overflow but only writeback
>  			 * if many dirty pages have been encountered.
>  			 */
> -			if (page_is_file_cache(page) &&
> -					(!current_is_kswapd() ||
> -					 !zone_is_reclaim_dirty(zone))) {
> +			if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
>  				/*
>  				 * Immediately reclaim when written back.
>  				 * Similar in principal to deactivate_page()

I guess this part, which avoids swapout in direct reclaim, would be the key
if this patch were successful. But it could make anon pages rotate from the
tail back to the head of the inactive list in the direct reclaim path until
kswapd can catch up. And since kswapd swaps out anon pages from the tail of
the inactive LRU, I suspect it could cause LRU churning as a side effect.

Anyway, I will queue it into testing machine since Rusty's test is done.
Thanks!

-- 
Kind regards,
Minchan Kim


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:06                   ` Dave Jones
@ 2014-05-30  0:21                     ` Dave Chinner
  2014-05-30  0:29                       ` Dave Jones
  2014-05-30  0:32                       ` Minchan Kim
  0 siblings, 2 replies; 107+ messages in thread
From: Dave Chinner @ 2014-05-30  0:21 UTC (permalink / raw)
  To: Dave Jones, Linus Torvalds, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> 
>  > That sounds like a plan. Perhaps it would be useful to add a
>  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
>  > 8k) so that we get some indication that we're hitting a deep stack
>  > but the system otherwise keeps functioning. That gives us some
>  > motivation to keep stack usage down but isn't a fatal problem like
>  > it is now....
> 
> We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> Though it needs some tweaking if we move to 16K

Right, but it doesn't throw loud warnings when a specific threshold
is reached - it just issues a quiet message when a process exits
telling you what the maximum was without giving us a stack to chew
on....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:21                     ` Dave Chinner
@ 2014-05-30  0:29                       ` Dave Jones
  2014-05-30  0:32                       ` Minchan Kim
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Jones @ 2014-05-30  0:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
 > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
 > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
 > > 
 > >  > That sounds like a plan. Perhaps it would be useful to add a
 > >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
 > >  > 8k) so that we get some indication that we're hitting a deep stack
 > >  > but the system otherwise keeps functioning. That gives us some
 > >  > motivation to keep stack usage down but isn't a fatal problem like
 > >  > it is now....
 > > 
 > > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
 > > Though it needs some tweaking if we move to 16K
 > 
 > Right, but it doesn't throw loud warnings when a specific threshold
 > is reached - it just issues a quiet message when a process exits
 > telling you what the maximum was without giving us a stack to chew
 > on....

Ah, right, good point. That would be more useful.

	Dave



* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:20                 ` Minchan Kim
@ 2014-05-30  0:31                   ` Linus Torvalds
  2014-05-30  0:50                     ` Minchan Kim
  0 siblings, 1 reply; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30  0:31 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 5:20 PM, Minchan Kim <minchan@kernel.org> wrote:
>
> I guess this part, which avoids swapout in direct reclaim, would be the key
> if this patch were successful. But it could make anon pages rotate from the
> tail back to the head of the inactive list in the direct reclaim path until
> kswapd can catch up. And since kswapd swaps out anon pages from the tail of
> the inactive LRU, I suspect it could cause LRU churning as a side effect.

Oh, it could make bad things happen, no question about that.

That said, those bad things are what happens to shared mapped pages
today, so in that sense it's not new. But large dirty shared mmap's
have traditionally been a great way to really hurt our VM, so "it
should work as well as shared mapping pages" is definitely not a
ringing endorsement!

(Of course, *if* we can improve kswapd behavior for both swap-out and
shared dirty pages, that would then be a double win, so there is
_some_ argument for saying that we should aim to handle both kinds of
pages equally).

> Anyway, I will queue it into testing machine since Rusty's test is done.

You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

            Linus


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:21                     ` Dave Chinner
  2014-05-30  0:29                       ` Dave Jones
@ 2014-05-30  0:32                       ` Minchan Kim
  2014-05-30  1:34                         ` Dave Chinner
  1 sibling, 1 reply; 107+ messages in thread
From: Minchan Kim @ 2014-05-30  0:32 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Dave Jones, Linus Torvalds, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
> On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> > 
> >  > That sounds like a plan. Perhaps it would be useful to add a
> >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
> >  > 8k) so that we get some indication that we're hitting a deep stack
> >  > but the system otherwise keeps functioning. That gives us some
> >  > motivation to keep stack usage down but isn't a fatal problem like
> >  > it is now....
> > 
> > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> > Though it needs some tweaking if we move to 16K
> 
> Right, but it doesn't throw loud warnings when a specific threshold
> is reached - it just issues a quiet message when a process exits
> telling you what the maximum was without giving us a stack to chew
> on....

But we could enhance the message so it notifies the user of the risk,
as follows:

...
"kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means
there is some horrible stack hog in your kernel. Please report it to
LKML and enable stacktrace to investigate who the culprit is"


> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

-- 
Kind regards,
Minchan Kim


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:31                   ` Linus Torvalds
@ 2014-05-30  0:50                     ` Minchan Kim
  2014-05-30  1:24                       ` Linus Torvalds
  0 siblings, 1 reply; 107+ messages in thread
From: Minchan Kim @ 2014-05-30  0:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 05:31:42PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:20 PM, Minchan Kim <minchan@kernel.org> wrote:
> >
> > I guess this part, which avoids swapout in direct reclaim, would be the key
> > if this patch were successful. But it could make anon pages rotate from the
> > tail back to the head of the inactive list in the direct reclaim path until
> > kswapd can catch up. And since kswapd swaps out anon pages from the tail of
> > the inactive LRU, I suspect it could cause LRU churning as a side effect.
> 
> Oh, it could make bad things happen, no question about that.
> 
> That said, those bad things are what happens to shared mapped pages
> today, so in that sense it's not new. But large dirty shared mmap's
> have traditionally been a great way to really hurt our VM, so "it
> should work as well as shared mapping pages" is definitely not a
> ringing endorsement!

True.

> 
> (Of course, *if* we can improve kswapd behavior for both swap-out and
> shared dirty pages, that would then be a double win, so there is
> _some_ argument for saying that we should aim to handle both kinds of
> pages equally).

Just an idea for preventing LRU churn:
we could return the pages to the tail of the inactive list instead of the
head when they are not suitable in this context, and the reclaimer would
use a cursor as the list_head to scan for victim pages instead of the LRU
head, recording the cursor somewhere like the lruvec after shrinking is
done. It makes the VM code more complicated, but it would be worth trying
if we go that way.

> 
> > Anyway, I will queue it into testing machine since Rusty's test is done.
> 
> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.

Sure. While I was writing this, Rusty's test crashed, so I will try Dave's
patch, then yours without the vmscan.c part.

Thanks.

> 
>             Linus
> 

-- 
Kind regards,
Minchan Kim


* Re: virtio ring cleanups, which save stack on older gcc
  2014-05-29 23:45                     ` Minchan Kim
@ 2014-05-30  1:06                       ` Minchan Kim
  2014-05-30  6:56                       ` Rusty Russell
  1 sibling, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-30  1:06 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

On Fri, May 30, 2014 at 08:45:22AM +0900, Minchan Kim wrote:
> On Thu, May 29, 2014 at 08:38:33PM +0930, Rusty Russell wrote:
> > Minchan Kim <minchan@kernel.org> writes:
> > > Hello Rusty,
> > >
> > > On Thu, May 29, 2014 at 04:56:41PM +0930, Rusty Russell wrote:
> > >> They don't make much difference: the easier fix is use gcc 4.8
> > >> which drops stack required across virtio block's virtio_queue_rq
> > >> down to that kmalloc in virtio_ring from 528 to 392 bytes.
> > >> 
> > >> Still, these (*lightly tested*) patches reduce to 432 bytes,
> > >> even for gcc 4.6.4.  Posted here FYI.
> > >
> > > I am testing with the hack below for Dave's idea, so I don't have
> > > a machine to test your patches until tomorrow.
> > > So, I will queue your patches on the testing machine tomorrow morning.
> > 
> > More interesting would be updating your compiler to 4.8, I think.
> > Saving <100 bytes on virtio is not going to save you, right?
> 
> But in my report, virtio_ring consumes more than in yours.
> As I mentioned to Steven in another thread, I don't know why the stack
> tracer reports that vring_add_indirect consumes 376 bytes. Apparently,
> objdump says it doesn't consume that much, so I'd like to test your
> patches and see the result.
> 
> Thanks.
> 
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
> [ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
> 

As you expected, virtio_ring consumes less than before, though not enough.
The interesting thing is that the reported stack usage of __kmalloc and
__slab_alloc looks right this time. Hmm....

In my previous report,

[ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   6)     6640       8   alloc_pages_current+0x10f/0x1f0
[ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   7)     6632     168   new_slab+0x2c5/0x370
[ 1065.604404] kworker/-5766    0d..2 1071625992us : stack_trace_call:   8)     6464       8   __slab_alloc+0x3a9/0x501
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:   9)     6456      80   __kmalloc+0x1cb/0x200
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  10)     6376     376   vring_add_indirect+0x36/0x200
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  11)     6000     144   virtqueue_add_sgs+0x2e2/0x320
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  12)     5856     288   __virtblk_add_req+0xda/0x1b0
[ 1065.604404] kworker/-5766    0d..2 1071625993us : stack_trace_call:  13)     5568      96   virtio_queue_rq+0xd3/0x1d0
[ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  14)     5472     128   __blk_mq_run_hw_queue+0x1ef/0x440
[ 1065.604404] kworker/-5766    0d..2 1071625994us : stack_trace_call:  15)     5344      16   blk_mq_run_hw_queue+0x35/0x40

In this time,

[ 2069.135929] kworker/u24:2 (26991) used greatest stack depth: 408 bytes left
[ 2580.413428] ------------[ cut here ]------------
[ 2580.413926] kernel BUG at kernel/trace/trace_stack.c:177!
[ 2580.414479] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 2580.415073] Dumping ftrace buffer:
[ 2580.415465] ---------------------------------
[ 2580.415763]    <...>-18634   9d..2 2598341673us : stack_trace_call:         Depth    Size   Location    (49 entries)
[ 2580.415763]         -----    ----   --------
[ 2580.415763]    <...>-18634   9d..2 2598341697us : stack_trace_call:   0)     7280       8   __alloc_pages_nodemask+0x199/0xb20
[ 2580.415763]    <...>-18634   9d..2 2598341698us : stack_trace_call:   1)     7272     352   alloc_pages_current+0x10f/0x1f0
[ 2580.415763]    <...>-18634   9d..2 2598341698us : stack_trace_call:   2)     6920     168   new_slab+0x2c5/0x370
[ 2580.415763]    <...>-18634   9d..2 2598341698us : stack_trace_call:   3)     6752     256   __slab_alloc+0x3a9/0x501
[ 2580.415763]    <...>-18634   9d..2 2598341699us : stack_trace_call:   4)     6496     112   __kmalloc+0x1cb/0x200
[ 2580.415763]    <...>-18634   9d..2 2598341699us : stack_trace_call:   5)     6384      32   alloc_indirect+0x1e/0x50
[ 2580.415763]    <...>-18634   9d..2 2598341699us : stack_trace_call:   6)     6352     112   virtqueue_add_sgs+0xc7/0x300
[ 2580.415763]    <...>-18634   9d..2 2598341699us : stack_trace_call:   7)     6240     288   __virtblk_add_req+0xda/0x1b0
[ 2580.415763]    <...>-18634   9d..2 2598341699us : stack_trace_call:   8)     5952      96   virtio_queue_rq+0xd3/0x1d0
[ 2580.415763]    <...>-18634   9d..2 2598341699us : stack_trace_call:   9)     5856     128   __blk_mq_run_hw_queue+0x1ef/0x440
[ 2580.415763]    <...>-18634   9d..2 2598341700us : stack_trace_call:  10)     5728      16   blk_mq_run_hw_queue+0x35/0x40
[ 2580.415763]    <...>-18634   9d..2 2598341700us : stack_trace_call:  11)     5712      96   blk_mq_insert_requests+0xdb/0x160
[ 2580.415763]    <...>-18634   9d..2 2598341700us : stack_trace_call:  12)     5616     112   blk_mq_flush_plug_list+0x12b/0x140
[ 2580.415763]    <...>-18634   9d..2 2598341700us : stack_trace_call:  13)     5504     112   blk_flush_plug_list+0xc7/0x220
[ 2580.415763]    <...>-18634   9d..2 2598341700us : stack_trace_call:  14)     5392      64   io_schedule_timeout+0x88/0x100
[ 2580.415763]    <...>-18634   9d..2 2598341701us : stack_trace_call:  15)     5328     128   mempool_alloc+0x145/0x170
[ 2580.415763]    <...>-18634   9d..2 2598341701us : stack_trace_call:  16)     5200      96   bio_alloc_bioset+0x10b/0x1d0
[ 2580.415763]    <...>-18634   9d..2 2598341701us : stack_trace_call:  17)     5104      48   get_swap_bio+0x30/0x90
[ 2580.415763]    <...>-18634   9d..2 2598341701us : stack_trace_call:  18)     5056     160   __swap_writepage+0x150/0x230
[ 2580.415763]    <...>-18634   9d..2 2598341701us : stack_trace_call:  19)     4896      32   swap_writepage+0x42/0x90
[ 2580.415763]    <...>-18634   9d..2 2598341701us : stack_trace_call:  20)     4864     320   shrink_page_list+0x676/0xa80
[ 2580.415763]    <...>-18634   9d..2 2598341702us : stack_trace_call:  21)     4544     208   shrink_inactive_list+0x262/0x4e0
[ 2580.415763]    <...>-18634   9d..2 2598341702us : stack_trace_call:  22)     4336     304   shrink_lruvec+0x3e1/0x6a0
[ 2580.415763]    <...>-18634   9d..2 2598341702us : stack_trace_call:  23)     4032      80   shrink_zone+0x3f/0x110
[ 2580.415763]    <...>-18634   9d..2 2598341702us : stack_trace_call:  24)     3952     128   do_try_to_free_pages+0x156/0x4c0
[ 2580.415763]    <...>-18634   9d..2 2598341702us : stack_trace_call:  25)     3824     208   try_to_free_pages+0xf7/0x1e0
[ 2580.415763]    <...>-18634   9d..2 2598341703us : stack_trace_call:  26)     3616     352   __alloc_pages_nodemask+0x783/0xb20
[ 2580.415763]    <...>-18634   9d..2 2598341703us : stack_trace_call:  27)     3264       8   alloc_pages_current+0x10f/0x1f0
[ 2580.415763]    <...>-18634   9d..2 2598341703us : stack_trace_call:  28)     3256     200   __page_cache_alloc+0x13f/0x160
[ 2580.415763]    <...>-18634   9d..2 2598341703us : stack_trace_call:  29)     3056      80   find_or_create_page+0x4c/0xb0
[ 2580.415763]    <...>-18634   9d..2 2598341703us : stack_trace_call:  30)     2976     112   __getblk+0x109/0x2f0
[ 2580.415763]    <...>-18634   9d..2 2598341703us : stack_trace_call:  31)     2864      80   ext4_read_block_bitmap_nowait+0x5e/0x330
[ 2580.415763]    <...>-18634   9d..2 2598341704us : stack_trace_call:  32)     2784     192   ext4_mb_init_cache+0x158/0x780
[ 2580.415763]    <...>-18634   9d..2 2598341704us : stack_trace_call:  33)     2592      80   ext4_mb_load_buddy+0x28a/0x370
[ 2580.415763]    <...>-18634   9d..2 2598341704us : stack_trace_call:  34)     2512     176   ext4_mb_regular_allocator+0x1b7/0x460
[ 2580.415763]    <...>-18634   9d..2 2598341704us : stack_trace_call:  35)     2336     128   ext4_mb_new_blocks+0x458/0x5f0
[ 2580.415763]    <...>-18634   9d..2 2598341704us : stack_trace_call:  36)     2208     256   ext4_ext_map_blocks+0x70b/0x1010
[ 2580.415763]    <...>-18634   9d..2 2598341704us : stack_trace_call:  37)     1952     160   ext4_map_blocks+0x325/0x530
[ 2580.415763]    <...>-18634   9d..2 2598341705us : stack_trace_call:  38)     1792     384   ext4_writepages+0x6d1/0xce0
[ 2580.415763]    <...>-18634   9d..2 2598341705us : stack_trace_call:  39)     1408      16   do_writepages+0x23/0x40
[ 2580.415763]    <...>-18634   9d..2 2598341705us : stack_trace_call:  40)     1392      96   __writeback_single_inode+0x45/0x2e0
[ 2580.415763]    <...>-18634   9d..2 2598341705us : stack_trace_call:  41)     1296     176   writeback_sb_inodes+0x2ad/0x500
[ 2580.415763]    <...>-18634   9d..2 2598341705us : stack_trace_call:  42)     1120      80   __writeback_inodes_wb+0x9e/0xd0
[ 2580.415763]    <...>-18634   9d..2 2598341706us : stack_trace_call:  43)     1040     160   wb_writeback+0x29b/0x350
[ 2580.415763]    <...>-18634   9d..2 2598341706us : stack_trace_call:  44)      880     208   bdi_writeback_workfn+0x11c/0x480
[ 2580.415763]    <...>-18634   9d..2 2598341706us : stack_trace_call:  45)      672     144   process_one_work+0x1d2/0x570
[ 2580.415763]    <...>-18634   9d..2 2598341706us : stack_trace_call:  46)      528     112   worker_thread+0x116/0x370
[ 2580.415763]    <...>-18634   9d..2 2598341706us : stack_trace_call:  47)      416     240   kthread+0xf3/0x110
[ 2580.415763]    <...>-18634   9d..2 2598341706us : stack_trace_call:  48)      176     176   ret_from_fork+0x7c/0xb0
[ 2580.415763] ---------------------------------
[ 2580.415763] Modules linked in:
[ 2580.415763] CPU: 9 PID: 18634 Comm: kworker/u24:1 Not tainted 3.14.0+ #202
[ 2580.415763] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 2580.415763] Workqueue: writeback bdi_writeback_workfn (flush-253:0)
[ 2580.415763] task: ffff88001e9ca0e0 ti: ffff880029c52000 task.ti: ffff880029c52000
[ 2580.415763] RIP: 0010:[<ffffffff8112340f>]  [<ffffffff8112340f>] stack_trace_call+0x37f/0x390
[ 2580.415763] RSP: 0000:ffff880029c52270  EFLAGS: 00010096
[ 2580.415763] RAX: ffff880029c52000 RBX: 0000000000000009 RCX: 0000000000000002
[ 2580.415763] RDX: 0000000000000006 RSI: 0000000000000002 RDI: ffff88003780be00
[ 2580.415763] RBP: ffff880029c522d0 R08: 00000000000009e8 R09: ffffffffffffffff
[ 2580.415763] R10: ffff880029c53fd8 R11: 0000000000000001 R12: 000000000000f2e8
[ 2580.415763] R13: 0000000000000009 R14: ffffffff82768dfc R15: 00000000000000f8
[ 2580.415763] FS:  0000000000000000(0000) GS:ffff880037d20000(0000) knlGS:0000000000000000
[ 2580.415763] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2580.415763] CR2: 00002aea5db57000 CR3: 0000000001c0b000 CR4: 00000000000006e0
[ 2580.415763] Stack:
[ 2580.415763]  0000000000000009 ffffffff81150819 0000000000000083 0000000000001c70
[ 2580.415763]  ffff880029c52300 ffffffff81005e11 ffffffff81c55ef0 0000000000000000
[ 2580.415763]  0000000000000002 ffff88001e9ca0e0 ffff88001e9cb108 0000000000000000
[ 2580.415763] Call Trace:
[ 2580.415763]  [<ffffffff81150819>] ? __alloc_pages_nodemask+0x199/0xb20
[ 2580.415763]  [<ffffffff81005e11>] ? print_context_stack+0x81/0x140
[ 2580.415763]  [<ffffffff816eedbf>] ftrace_call+0x5/0x2f
[ 2580.415763]  [<ffffffff8119097f>] ? alloc_pages_current+0x10f/0x1f0
[ 2580.415763]  [<ffffffff8119097f>] ? alloc_pages_current+0x10f/0x1f0
[ 2580.415763]  [<ffffffff811650b5>] ? next_zones_zonelist+0x5/0x70
[ 2580.415763]  [<ffffffff810a22dd>] ? trace_hardirqs_off+0xd/0x10
[ 2580.415763]  [<ffffffff811650b5>] ? next_zones_zonelist+0x5/0x70
[ 2580.415763]  [<ffffffff81150819>] ? __alloc_pages_nodemask+0x199/0xb20
[ 2580.415763]  [<ffffffff8119097f>] ? alloc_pages_current+0x10f/0x1f0
[ 2580.415763]  [<ffffffff810a22dd>] ? trace_hardirqs_off+0xd/0x10
[ 2580.415763]  [<ffffffff811231a9>] ? stack_trace_call+0x119/0x390
[ 2580.415763]  [<ffffffff816eedbf>] ? ftrace_call+0x5/0x2f
[ 2580.415763]  [<ffffffff8119097f>] alloc_pages_current+0x10f/0x1f0
[ 2580.415763]  [<ffffffff81199d25>] ? new_slab+0x2c5/0x370
[ 2580.415763]  [<ffffffff81199d25>] new_slab+0x2c5/0x370
[ 2580.415763]  [<ffffffff816dafb2>] __slab_alloc+0x3a9/0x501
[ 2580.415763]  [<ffffffff8141daee>] ? alloc_indirect+0x1e/0x50
[ 2580.415763]  [<ffffffff8141daee>] ? alloc_indirect+0x1e/0x50
[ 2580.415763]  [<ffffffff8141daee>] ? alloc_indirect+0x1e/0x50
[ 2580.415763]  [<ffffffff8119afdb>] __kmalloc+0x1cb/0x200
[ 2580.415763]  [<ffffffff8141daee>] alloc_indirect+0x1e/0x50
[ 2580.415763]  [<ffffffff8141e297>] virtqueue_add_sgs+0xc7/0x300
[ 2580.415763]  [<ffffffff8148e2fa>] __virtblk_add_req+0xda/0x1b0
[ 2580.415763]  [<ffffffff8148e4a3>] virtio_queue_rq+0xd3/0x1d0
[ 2580.415763]  [<ffffffff8134aa5f>] __blk_mq_run_hw_queue+0x1ef/0x440
[ 2580.415763]  [<ffffffff8134b125>] blk_mq_run_hw_queue+0x35/0x40
[ 2580.415763]  [<ffffffff8134b80b>] blk_mq_insert_requests+0xdb/0x160
[ 2580.415763]  [<ffffffff8134beab>] blk_mq_flush_plug_list+0x12b/0x140
[ 2580.415763]  [<ffffffff81342287>] blk_flush_plug_list+0xc7/0x220
[ 2580.415763]  [<ffffffff816e609f>] ? _raw_spin_unlock_irqrestore+0x3f/0x70
[ 2580.415763]  [<ffffffff816e1698>] io_schedule_timeout+0x88/0x100
[ 2580.415763]  [<ffffffff816e1615>] ? io_schedule_timeout+0x5/0x100
[ 2580.415763]  [<ffffffff81149465>] mempool_alloc+0x145/0x170
[ 2580.415763]  [<ffffffff8109baf0>] ? __init_waitqueue_head+0x60/0x60
[ 2580.415763]  [<ffffffff811e24bb>] bio_alloc_bioset+0x10b/0x1d0
[ 2580.415763]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
[ 2580.415763]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
[ 2580.415763]  [<ffffffff81184160>] get_swap_bio+0x30/0x90
[ 2580.415763]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
[ 2580.415763]  [<ffffffff811846b0>] __swap_writepage+0x150/0x230
[ 2580.415763]  [<ffffffff810ab405>] ? do_raw_spin_unlock+0x5/0xa0
[ 2580.415763]  [<ffffffff81184280>] ? end_swap_bio_read+0xc0/0xc0
[ 2580.415763]  [<ffffffff81184565>] ? __swap_writepage+0x5/0x230
[ 2580.415763]  [<ffffffff811847d2>] swap_writepage+0x42/0x90
[ 2580.415763]  [<ffffffff8115aee6>] shrink_page_list+0x676/0xa80
[ 2580.415763]  [<ffffffff816eedbf>] ? ftrace_call+0x5/0x2f
[ 2580.415763]  [<ffffffff8115b8c2>] shrink_inactive_list+0x262/0x4e0
[ 2580.415763]  [<ffffffff8115c211>] shrink_lruvec+0x3e1/0x6a0
[ 2580.415763]  [<ffffffff8115c50f>] shrink_zone+0x3f/0x110
[ 2580.415763]  [<ffffffff816eedbf>] ? ftrace_call+0x5/0x2f
[ 2580.415763]  [<ffffffff8115ca36>] do_try_to_free_pages+0x156/0x4c0
[ 2580.415763]  [<ffffffff8115cf97>] try_to_free_pages+0xf7/0x1e0
[ 2580.415763]  [<ffffffff81150e03>] __alloc_pages_nodemask+0x783/0xb20
[ 2580.415763]  [<ffffffff8119097f>] alloc_pages_current+0x10f/0x1f0
[ 2580.415763]  [<ffffffff81145c5f>] ? __page_cache_alloc+0x13f/0x160
[ 2580.415763]  [<ffffffff81145c5f>] __page_cache_alloc+0x13f/0x160
[ 2580.415763]  [<ffffffff81146cbc>] find_or_create_page+0x4c/0xb0
[ 2580.415763]  [<ffffffff811ded09>] __getblk+0x109/0x2f0
[ 2580.415763]  [<ffffffff8124629e>] ext4_read_block_bitmap_nowait+0x5e/0x330
[ 2580.415763]  [<ffffffff81282bf8>] ext4_mb_init_cache+0x158/0x780
[ 2580.415763]  [<ffffffff816eedbf>] ? ftrace_call+0x5/0x2f
[ 2580.415763]  [<ffffffff81155d15>] ? __lru_cache_add+0x5/0x90
[ 2580.415763]  [<ffffffff81146435>] ? find_get_page+0x5/0x130
[ 2580.415763]  [<ffffffff812838aa>] ext4_mb_load_buddy+0x28a/0x370
[ 2580.415763]  [<ffffffff81284c57>] ext4_mb_regular_allocator+0x1b7/0x460
[ 2580.415763]  [<ffffffff812810c0>] ? ext4_mb_use_preallocated+0x40/0x360
[ 2580.415763]  [<ffffffff816eedbf>] ? ftrace_call+0x5/0x2f
[ 2580.415763]  [<ffffffff81287f08>] ext4_mb_new_blocks+0x458/0x5f0
[ 2580.415763]  [<ffffffff8127d88b>] ext4_ext_map_blocks+0x70b/0x1010
[ 2580.415763]  [<ffffffff8124e725>] ext4_map_blocks+0x325/0x530
[ 2580.415763]  [<ffffffff812538c1>] ext4_writepages+0x6d1/0xce0
[ 2580.415763]  [<ffffffff812531f0>] ? ext4_journalled_write_end+0x330/0x330
[ 2580.415763]  [<ffffffff81153a03>] do_writepages+0x23/0x40
[ 2580.415763]  [<ffffffff811d23b5>] __writeback_single_inode+0x45/0x2e0
[ 2580.415763]  [<ffffffff811d373d>] writeback_sb_inodes+0x2ad/0x500
[ 2580.415763]  [<ffffffff811d3a2e>] __writeback_inodes_wb+0x9e/0xd0
[ 2580.415763]  [<ffffffff811d410b>] wb_writeback+0x29b/0x350
[ 2580.415763]  [<ffffffff81057c3d>] ? __local_bh_enable_ip+0x6d/0xd0
[ 2580.415763]  [<ffffffff811d6eec>] bdi_writeback_workfn+0x11c/0x480
[ 2580.415763]  [<ffffffff81070610>] ? process_one_work+0x170/0x570
[ 2580.415763]  [<ffffffff81070672>] process_one_work+0x1d2/0x570
[ 2580.415763]  [<ffffffff81070610>] ? process_one_work+0x170/0x570
[ 2580.415763]  [<ffffffff81071bb6>] worker_thread+0x116/0x370
[ 2580.415763]  [<ffffffff81071aa0>] ? manage_workers.isra.19+0x2e0/0x2e0
[ 2580.415763]  [<ffffffff81078e53>] kthread+0xf3/0x110
[ 2580.415763]  [<ffffffff81078d60>] ? flush_kthread_worker+0x150/0x150
[ 2580.415763]  [<ffffffff816ef06c>] ret_from_fork+0x7c/0xb0
[ 2580.415763]  [<ffffffff81078d60>] ? flush_kthread_worker+0x150/0x150

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:50                     ` Minchan Kim
@ 2014-05-30  1:24                       ` Linus Torvalds
  2014-05-30  1:58                         ` Dave Chinner
  2014-05-30  6:21                         ` Minchan Kim
  0 siblings, 2 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30  1:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 5:50 PM, Minchan Kim <minchan@kernel.org> wrote:
>>
>> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
>
> Sure. While I write this, Rusty's test crashed, so I will try Dave's patch,
> then yours except the vmscan.c part.

Looking more at Dave's patch (well, description), I don't think there
is any way in hell we can ever apply it. If I read it right, it will
cause all IO that overflows the max request count to go through the
scheduler to get it flushed. Maybe I misread it, but that's definitely
not acceptable. Maybe it's not noticeable with a slow rotational
device, but modern ssd hardware? No way.

I'd *much* rather slow down the swap side. Not "real IO". So I think
my mm/vmscan.c patch is preferable (but yes, it might require some
work to make kswapd do better).

So you can try Dave's patch just to see what it does for stack depth,
but other than that it looks unacceptable unless I misread things.

             Linus

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:05               ` Linus Torvalds
  2014-05-30  0:20                 ` Minchan Kim
@ 2014-05-30  1:30                 ` Linus Torvalds
  1 sibling, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30  1:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 5:05 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So maybe test a patch something like the attached.
>
> NOTE! This is absolutely TOTALLY UNTESTED!

It's still untested, but I realized that the whole
"blk_flush_plug_list(plug, true);" thing is pointless, since
schedule() itself will do that for us.

So I think you can remove the

+       struct blk_plug *plug = current->plug;
+       if (plug)
+               blk_flush_plug_list(plug, true);

part from congestion_timeout().

Not that it should *hurt* to have it there, so I'm not bothering to
send a changed patch.

And again, no actual testing by me on any of this, just looking at the code.

           Linus

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:32                       ` Minchan Kim
@ 2014-05-30  1:34                         ` Dave Chinner
  2014-05-30 15:25                           ` H. Peter Anvin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-30  1:34 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Jones, Linus Torvalds, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Fri, May 30, 2014 at 09:32:19AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:21:13AM +1000, Dave Chinner wrote:
> > On Thu, May 29, 2014 at 08:06:49PM -0400, Dave Jones wrote:
> > > On Fri, May 30, 2014 at 09:53:08AM +1000, Dave Chinner wrote:
> > > 
> > >  > That sounds like a plan. Perhaps it would be useful to add a
> > >  > WARN_ON_ONCE(stack_usage > 8k) (or some other arbitrary depth beyond
> > >  > 8k) so that we get some indication that we're hitting a deep stack
> > >  > but the system otherwise keeps functioning. That gives us some
> > >  > motivation to keep stack usage down but isn't a fatal problem like
> > >  > it is now....
> > > 
> > > We have check_stack_usage() and DEBUG_STACK_USAGE for this.
> > > Though it needs some tweaking if we move to 16K
> > 
> > Right, but it doesn't throw loud warnings when a specific threshold
> > is reached - it just issues a quiet message when a process exits
> > telling you what the maximum was without giving us a stack to chew
> > on....
> 
> But we could enhance the message so the user notices the risk,
> as follows:
> 
> ...
> "kworker/u24:1 (94) used greatest stack depth: 8K bytes left; it means
> there is some horrible stack hog in your kernel. Please report it to
> LKML and enable the stack tracer to investigate who the culprit is."

That, however, presumes that a user can reproduce the problem on
demand. Experience tells me that this is the exception rather than
the norm for production systems, and so capturing the stack in real
time is IMO the only useful thing we could add...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  1:24                       ` Linus Torvalds
@ 2014-05-30  1:58                         ` Dave Chinner
  2014-05-30  2:13                           ` Linus Torvalds
  2014-05-30  6:21                         ` Minchan Kim
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Chinner @ 2014-05-30  1:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:50 PM, Minchan Kim <minchan@kernel.org> wrote:
> >>
> >> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
> >
> > Sure. While I write this, Rusty's test crashed, so I will try Dave's patch,
> > then yours except the vmscan.c part.
> 
> Looking more at Dave's patch (well, description), I don't think there
> is any way in hell we can ever apply it. If I read it right, it will
> cause all IO that overflows the max request count to go through the
> scheduler to get it flushed. Maybe I misread it, but that's definitely
> not acceptable. Maybe it's not noticeable with a slow rotational
> device, but modern ssd hardware? No way.
> 
> I'd *much* rather slow down the swap side. Not "real IO". So I think
> my mm/vmscan.c patch is preferable (but yes, it might require some
> work to make kswapd do better).
> 
> So you can try Dave's patch just to see what it does for stack depth,
> but other than that it looks unacceptable unless I misread things.

Yeah, it's a hack, not intended as a potential solution.

I'm thinking, though, that plug flushing behaviour is actually
dependent on plugger context and there is no one "correct"
behaviour. If we are doing process driven IO, then we want to do
immediate dispatch, but for IO where stack is an issue or is for
bulk throughput (e.g. background writeback) async dispatch through
kblockd is desirable.

If the patch I sent solves the swap stack usage issue, then perhaps
we should look towards adding "blk_plug_start_async()" to pass such
hints to the plug flushing. I'd want to use the same behaviour in
__xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
probably also in mpage_writepages() for bulk data writeback in
WB_SYNC_NONE context....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  0:15               ` Dave Chinner
@ 2014-05-30  2:12                 ` Minchan Kim
  2014-05-30  4:37                   ` Linus Torvalds
                                     ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-30  2:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton,
	linux-mm, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Johannes Weiner, Hugh Dickins,
	Rusty Russell, Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > Hello Dave,
> > 
> > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > Author: Jens Axboe <jaxboe@fusionio.com>
> > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > 
> > > >     block: let io_schedule() flush the plug inline
> > > >     
> > > >     Linus correctly observes that the most important dispatch cases
> > > >     are now done from kblockd, this isn't ideal for latency reasons.
> > > >     The original reason for switching dispatches out-of-line was to
> > > >     avoid too deep a stack, so by _only_ letting the "accidental"
> > > >     flush directly in schedule() be guarded by offload to kblockd,
> > > >     we should be able to get the best of both worlds.
> > > >     
> > > >     So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > >     and only use that from the schedule() path.
> > > >     
> > > >     Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> > > > 
> > > > And now we have too deep a stack due to unplugging from io_schedule()...
> > > 
> > > So, if we make io_schedule() push the plug list off to the kblockd
> > > like is done for schedule()....
> ....
> > I did the hacky test below to apply your idea, and the result is overflow again.
> > So, again, it seconds stack expansion. Otherwise, we should prevent
> > swapout in direct reclaim.
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index f5c6635b806c..95f169e85dbe 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> >  void __sched io_schedule(void)
> >  {
> >  	struct rq *rq = raw_rq();
> > +	struct blk_plug *plug = current->plug;
> >  
> >  	delayacct_blkio_start();
> >  	atomic_inc(&rq->nr_iowait);
> > -	blk_flush_plug(current);
> > +	if (plug)
> > +		blk_flush_plug_list(plug, true);
> > +
> >  	current->in_iowait = 1;
> >  	schedule();
> >  	current->in_iowait = 0;
> 
> .....
> 
> >         Depth    Size   Location    (46 entries)
> >
> >   0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
> >   1)     7192     296   get_page_from_freelist+0x886/0x920
> >   2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
> >   3)     6544       8   alloc_pages_current+0x10f/0x1f0
> >   4)     6536     168   new_slab+0x2c5/0x370
> >   5)     6368       8   __slab_alloc+0x3a9/0x501
> >   6)     6360      80   __kmalloc+0x1cb/0x200
> >   7)     6280     376   vring_add_indirect+0x36/0x200
> >   8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
> >   9)     5760     288   __virtblk_add_req+0xda/0x1b0
> >  10)     5472      96   virtio_queue_rq+0xd3/0x1d0
> >  11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
> >  12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
> >  13)     5232      96   blk_mq_insert_requests+0xdb/0x160
> >  14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
> >  15)     5024     112   blk_flush_plug_list+0xc7/0x220
> >  16)     4912     128   blk_mq_make_request+0x42a/0x600
> >  17)     4784      48   generic_make_request+0xc0/0x100
> >  18)     4736     112   submit_bio+0x86/0x160
> >  19)     4624     160   __swap_writepage+0x198/0x230
> >  20)     4464      32   swap_writepage+0x42/0x90
> >  21)     4432     320   shrink_page_list+0x676/0xa80
> >  22)     4112     208   shrink_inactive_list+0x262/0x4e0
> >  23)     3904     304   shrink_lruvec+0x3e1/0x6a0
> 
> The device is supposed to be plugged here in shrink_lruvec().
> 
> Oh, a plug can only hold 16 individual bios, and then it does a
> synchronous flush. Hmmm - perhaps that should also defer the flush
> to the kblockd, because if we are overrunning a plug then we've
> already surrendered IO dispatch latency....
> 
> So, in blk_mq_make_request(), can you do:
> 
> 			if (list_empty(&plug->mq_list))
> 				trace_block_plug(q);
> 			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> -				blk_flush_plug_list(plug, false);
> +				blk_flush_plug_list(plug, true);
> 				trace_block_plug(q);
> 			}
> 			list_add_tail(&rq->queuelist, &plug->mq_list);
> 
> To see if that defers all the swap IO to kblockd?
> 

Interim report,

I applied the patch below (we need to fix io_schedule_timeout as well,
since mempool_alloc goes through it):

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
 			trace_block_plug(q);
 		else {
 			if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 		}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;

And the result is as follows. It reduces usage by about 800 bytes compared
to my first report, but stack usage still seems high.
The VM functions really need a diet.

        -----    ----   --------
  0)     6896      16   lookup_address+0x28/0x30
  1)     6880      16   _lookup_address_cpa.isra.3+0x3b/0x40
  2)     6864     304   __change_page_attr_set_clr+0xe0/0xb50
  3)     6560     112   kernel_map_pages+0x6c/0x120
  4)     6448     256   get_page_from_freelist+0x489/0x920
  5)     6192     352   __alloc_pages_nodemask+0x5e1/0xb20
  6)     5840       8   alloc_pages_current+0x10f/0x1f0
  7)     5832     168   new_slab+0x35d/0x370
  8)     5664       8   __slab_alloc+0x3a9/0x501
  9)     5656      80   kmem_cache_alloc+0x1ac/0x1c0
 10)     5576     296   mempool_alloc_slab+0x15/0x20
 11)     5280     128   mempool_alloc+0x5e/0x170
 12)     5152      96   bio_alloc_bioset+0x10b/0x1d0
 13)     5056      48   get_swap_bio+0x30/0x90
 14)     5008     160   __swap_writepage+0x150/0x230
 15)     4848      32   swap_writepage+0x42/0x90
 16)     4816     320   shrink_page_list+0x676/0xa80
 17)     4496     208   shrink_inactive_list+0x262/0x4e0
 18)     4288     304   shrink_lruvec+0x3e1/0x6a0
 19)     3984      80   shrink_zone+0x3f/0x110
 20)     3904     128   do_try_to_free_pages+0x156/0x4c0
 21)     3776     208   try_to_free_pages+0xf7/0x1e0
 22)     3568     352   __alloc_pages_nodemask+0x783/0xb20
 23)     3216       8   alloc_pages_current+0x10f/0x1f0
 24)     3208     168   new_slab+0x2c5/0x370
 25)     3040       8   __slab_alloc+0x3a9/0x501
 26)     3032      80   kmem_cache_alloc+0x1ac/0x1c0
 27)     2952     296   mempool_alloc_slab+0x15/0x20
 28)     2656     128   mempool_alloc+0x5e/0x170
 29)     2528      96   bio_alloc_bioset+0x10b/0x1d0
 30)     2432      48   mpage_alloc+0x38/0xa0
 31)     2384     208   do_mpage_readpage+0x49b/0x5d0
 32)     2176     224   mpage_readpages+0xcf/0x120
 33)     1952      48   ext4_readpages+0x45/0x60
 34)     1904     224   __do_page_cache_readahead+0x222/0x2d0
 35)     1680      16   ra_submit+0x21/0x30
 36)     1664     112   filemap_fault+0x2d7/0x4f0
 37)     1552     144   __do_fault+0x6d/0x4c0
 38)     1408     160   handle_mm_fault+0x1a6/0xaf0
 39)     1248     272   __do_page_fault+0x18a/0x590
 40)      976      16   do_page_fault+0xc/0x10
 41)      960     208   page_fault+0x22/0x30
 42)      752      16   clear_user+0x2e/0x40
 43)      736      16   padzero+0x2d/0x40
 44)      720     304   load_elf_binary+0xa47/0x1a40
 45)      416      48   search_binary_handler+0x9c/0x1a0
 46)      368     144   do_execve_common.isra.25+0x58d/0x700
 47)      224      16   do_execve+0x18/0x20
 48)      208      32   SyS_execve+0x2e/0x40
 49)      176     176   stub_execve+0x69/0xa0



-- 
Kind regards,
Minchan Kim

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  1:58                         ` Dave Chinner
@ 2014-05-30  2:13                           ` Linus Torvalds
  0 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30  2:13 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 6:58 PM, Dave Chinner <david@fromorbit.com> wrote:
>
> If the patch I sent solves the swap stack usage issue, then perhaps
> we should look towards adding "blk_plug_start_async()" to pass such
> hints to the plug flushing. I'd want to use the same behaviour in
> __xfs_buf_delwri_submit() for bulk metadata writeback in XFS, and
> probably also in mpage_writepages() for bulk data writeback in
> WB_SYNC_NONE context...

Yeah, adding a flag to the plug about what kind of plug it is does
sound quite reasonable. It already has that "magic" field, it could
easily be extended to have a "async" vs "sync" bit to it..

Of course, it's also possible that the unplugging code could just look
at the actual requests that are plugged to determine that, and maybe
we wouldn't even need to mark things specially. I don't think we ever
end up mixing reads and writes under the same plug, so "first request
is a write" is probably a good approximation for "async".

             Linus

* Re: [PATCH 4/4] virtio_ring: unify direct/indirect code paths.
  2014-05-29 11:29                   ` Michael S. Tsirkin
@ 2014-05-30  2:37                     ` Rusty Russell
  2014-05-30  6:21                       ` Rusty Russell
  0 siblings, 1 reply; 107+ messages in thread
From: Rusty Russell @ 2014-05-30  2:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Dave Hansen,
	Steven Rostedt

"Michael S. Tsirkin" <mst@redhat.com> writes:
> On Thu, May 29, 2014 at 04:56:45PM +0930, Rusty Russell wrote:
>> virtqueue_add() populates the virtqueue descriptor table from the sgs
>> given.  If it uses an indirect descriptor table, then it puts a single
>> descriptor in the descriptor table pointing to the kmalloc'ed indirect
>> table where the sg is populated.
>> +	for (i = 0; i < total_sg; i++)
>> +		desc[i].next = i+1;
>> +	return desc;
>
> Hmm we are doing an extra walk over descriptors here.
> This might hurt performance esp for big descriptors.

Yes, this needs to be benchmarked; since it's cache hot my gut feel is
that it's a NOOP, but on modern machines my gut feel is always wrong.

>> +	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
>> +		desc = alloc_indirect(total_sg, gfp);
>
> else desc = NULL will be a bit clearer won't it?

Agreed.

>>  	/* Update free pointer */
>> -	vq->free_head = i;
>> +	if (desc == vq->vring.desc)
>> +		vq->free_head = i;
>> +	else
>> +		vq->free_head = vq->vring.desc[head].next;
>
> This one is slightly ugly isn't it?

Yes, but it avoided another variable, and I was originally aiming
at stack conservation.  Turns out adding 'bool indirect' adds 32 bytes
more stack for gcc 4.6.4 :(

virtio_ring: minor neatening

Before:
	gcc 4.8.2: virtio_blk: stack used = 408
	gcc 4.6.4: virtio_blk: stack used = 432

After:
	gcc 4.8.2: virtio_blk: stack used = 408
	gcc 4.6.4: virtio_blk: stack used = 464

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 3adf5978b92b..7a7849bc26af 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -141,9 +141,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 {
 	struct vring_virtqueue *vq = to_vvq(_vq);
 	struct scatterlist *sg;
-	struct vring_desc *desc = NULL;
+	struct vring_desc *desc;
 	unsigned int i, n, avail, uninitialized_var(prev);
 	int head;
+	bool indirect;
 
 	START_USE(vq);
 
@@ -176,21 +177,25 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	 * buffers, then go indirect. FIXME: tune this threshold */
 	if (vq->indirect && total_sg > 1 && vq->vq.num_free)
 		desc = alloc_indirect(total_sg, gfp);
+	else
+		desc = NULL;
 
 	if (desc) {
 		/* Use a single buffer which doesn't continue */
 		vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
 		vq->vring.desc[head].addr = virt_to_phys(desc);
-		/* avoid kmemleak false positive (tis hidden by virt_to_phys) */
+		/* avoid kmemleak false positive (hidden by virt_to_phys) */
 		kmemleak_ignore(desc);
 		vq->vring.desc[head].len = total_sg * sizeof(struct vring_desc);
 
 		/* Set up rest to use this indirect table. */
 		i = 0;
 		total_sg = 1;
+		indirect = true;
 	} else {
 		desc = vq->vring.desc;
 		i = head;
+		indirect = false;
 	}
 
 	if (vq->vq.num_free < total_sg) {
@@ -230,10 +235,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
 	desc[prev].flags &= ~VRING_DESC_F_NEXT;
 
 	/* Update free pointer */
-	if (desc == vq->vring.desc)
-		vq->free_head = i;
-	else
+	if (indirect)
 		vq->free_head = vq->vring.desc[head].next;
+	else
+		vq->free_head = i;
 
 	/* Set token. */
 	vq->data[head] = data;

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  2:12                 ` Minchan Kim
@ 2014-05-30  4:37                   ` Linus Torvalds
  2014-05-31  1:45                     ` Linus Torvalds
  2014-05-30  6:12                   ` Minchan Kim
  2014-06-03 13:28                   ` Rasmus Villemoes
  2 siblings, 1 reply; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30  4:37 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 7:12 PM, Minchan Kim <minchan@kernel.org> wrote:
>
> Interim report,
>
> And the result is as follows. It reduces usage by about 800 bytes compared
> to my first report, but stack usage still seems high.
> The VM functions really need a diet.

Yes. And in this case uninlining things might actually help, because
it's not actually performing reclaim in the second case, so
inlining the reclaim code into that huge __alloc_pages_nodemask()
function means that it has the stack frame for all those cases even if
they don't actually get used.

That said, the way those functions are set up (with lots of arguments
passed from one to the other), not inlining will cause huge costs too
for the argument setup.

It really might be very good to create a "struct alloc_info" that
contains those shared arguments, and just pass a (const) pointer to
that around. Gcc would likely tend to be *much* better at generating
code for that, because it avoids a ton of temporaries being created
by function calls. Even when it's inlined, the argument itself ends up
being a new temporary internally, and I suspect one reason gcc
(especially your 4.6.3 version, apparently) generates those big spill
frames is because there's tons of these duplicate temporaries that
apparently don't get merged properly.

Ugh. I think I'll try looking at that tomorrow.

                Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  2:12                 ` Minchan Kim
  2014-05-30  4:37                   ` Linus Torvalds
@ 2014-05-30  6:12                   ` Minchan Kim
  2014-06-03 13:28                   ` Rasmus Villemoes
  2 siblings, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-30  6:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Linus Torvalds, Linux Kernel Mailing List, Andrew Morton,
	linux-mm, H. Peter Anvin, Ingo Molnar, Peter Zijlstra,
	Mel Gorman, Rik van Riel, Johannes Weiner, Hugh Dickins,
	Rusty Russell, Michael S. Tsirkin, Dave Hansen, Steven Rostedt

Final result:

I tested the machine with the patch below (Dave's suggestion plus some
parts I modified) and I couldn't see the problem any more (tested for
4 hours; I will queue it on the machine over the weekend for a long
running test if I don't get a more refined version before leaving the
office today). But as I reported in the interim result, the VM's stack
usage is still high.

Anyway, putting the VM functions on a diet is another issue we should
really tackle (e.g., uninlining the slow-path part of
__alloc_pages_nodemask, Linus's alloc_info idea, and more).

Looking forward to seeing the blk_plug_start_async approach.
Thanks, Dave!

---
 block/blk-core.c    | 2 +-
 block/blk-mq.c      | 2 +-
 kernel/sched/core.c | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index bfe16d5af9f9..0c81aacec75b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1585,7 +1585,7 @@ get_rq:
 			trace_block_plug(q);
 		else {
 			if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 		}
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 883f72089015..6e72e700d11e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -897,7 +897,7 @@ static void blk_mq_make_request(struct request_queue *q, struct bio *bio)
 			if (list_empty(&plug->mq_list))
 				trace_block_plug(q);
 			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
-				blk_flush_plug_list(plug, false);
+				blk_flush_plug_list(plug, true);
 				trace_block_plug(q);
 			}
 			list_add_tail(&rq->queuelist, &plug->mq_list);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f5c6635b806c..ebca9e1f200f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	schedule();
 	current->in_iowait = 0;
@@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
 
 	delayacct_blkio_start();
 	atomic_inc(&rq->nr_iowait);
-	blk_flush_plug(current);
+	blk_schedule_flush_plug(current);
 	current->in_iowait = 1;
 	ret = schedule_timeout(timeout);
 	current->in_iowait = 0;
-- 
1.9.2


On Fri, May 30, 2014 at 11:12:47AM +0900, Minchan Kim wrote:
> On Fri, May 30, 2014 at 10:15:58AM +1000, Dave Chinner wrote:
> > On Fri, May 30, 2014 at 08:36:38AM +0900, Minchan Kim wrote:
> > > Hello Dave,
> > > 
> > > On Thu, May 29, 2014 at 11:58:30AM +1000, Dave Chinner wrote:
> > > > On Thu, May 29, 2014 at 11:30:07AM +1000, Dave Chinner wrote:
> > > > > On Wed, May 28, 2014 at 03:41:11PM -0700, Linus Torvalds wrote:
> > > > > commit a237c1c5bc5dc5c76a21be922dca4826f3eca8ca
> > > > > Author: Jens Axboe <jaxboe@fusionio.com>
> > > > > Date:   Sat Apr 16 13:27:55 2011 +0200
> > > > > 
> > > > >     block: let io_schedule() flush the plug inline
> > > > >     
> > > > >     Linus correctly observes that the most important dispatch cases
> > > > >     are now done from kblockd, this isn't ideal for latency reasons.
> > > > >     The original reason for switching dispatches out-of-line was to
> > > > >     avoid too deep a stack, so by _only_ letting the "accidental"
> > > > >     flush directly in schedule() be guarded by offload to kblockd,
> > > > >     we should be able to get the best of both worlds.
> > > > >     
> > > > >     So add a blk_schedule_flush_plug() that offloads to kblockd,
> > > > >     and only use that from the schedule() path.
> > > > >     
> > > > >     Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
> > > > > 
> > > > > And now we have too deep a stack due to unplugging from io_schedule()...
> > > > 
> > > > So, if we make io_schedule() push the plug list off to the kblockd
> > > > like is done for schedule()....
> > ....
> > > I did the hacky test below to apply your idea and the result is an overflow again.
> > > So, again, this argues for stack expansion. Otherwise, we should prevent
> > > swapout in direct reclaim.
> > > 
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index f5c6635b806c..95f169e85dbe 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -4241,10 +4241,13 @@ EXPORT_SYMBOL_GPL(yield_to);
> > >  void __sched io_schedule(void)
> > >  {
> > >  	struct rq *rq = raw_rq();
> > > +	struct blk_plug *plug = current->plug;
> > >  
> > >  	delayacct_blkio_start();
> > >  	atomic_inc(&rq->nr_iowait);
> > > -	blk_flush_plug(current);
> > > +	if (plug)
> > > +		blk_flush_plug_list(plug, true);
> > > +
> > >  	current->in_iowait = 1;
> > >  	schedule();
> > >  	current->in_iowait = 0;
> > 
> > .....
> > 
> > >         Depth    Size   Location    (46 entries)
> > >
> > >   0)     7200       8   _raw_spin_lock_irqsave+0x51/0x60
> > >   1)     7192     296   get_page_from_freelist+0x886/0x920
> > >   2)     6896     352   __alloc_pages_nodemask+0x5e1/0xb20
> > >   3)     6544       8   alloc_pages_current+0x10f/0x1f0
> > >   4)     6536     168   new_slab+0x2c5/0x370
> > >   5)     6368       8   __slab_alloc+0x3a9/0x501
> > >   6)     6360      80   __kmalloc+0x1cb/0x200
> > >   7)     6280     376   vring_add_indirect+0x36/0x200
> > >   8)     5904     144   virtqueue_add_sgs+0x2e2/0x320
> > >   9)     5760     288   __virtblk_add_req+0xda/0x1b0
> > >  10)     5472      96   virtio_queue_rq+0xd3/0x1d0
> > >  11)     5376     128   __blk_mq_run_hw_queue+0x1ef/0x440
> > >  12)     5248      16   blk_mq_run_hw_queue+0x35/0x40
> > >  13)     5232      96   blk_mq_insert_requests+0xdb/0x160
> > >  14)     5136     112   blk_mq_flush_plug_list+0x12b/0x140
> > >  15)     5024     112   blk_flush_plug_list+0xc7/0x220
> > >  16)     4912     128   blk_mq_make_request+0x42a/0x600
> > >  17)     4784      48   generic_make_request+0xc0/0x100
> > >  18)     4736     112   submit_bio+0x86/0x160
> > >  19)     4624     160   __swap_writepage+0x198/0x230
> > >  20)     4464      32   swap_writepage+0x42/0x90
> > >  21)     4432     320   shrink_page_list+0x676/0xa80
> > >  22)     4112     208   shrink_inactive_list+0x262/0x4e0
> > >  23)     3904     304   shrink_lruvec+0x3e1/0x6a0
> > 
> > The device is supposed to be plugged here in shrink_lruvec().
> > 
> > Oh, a plug can only hold 16 individual bios, and then it does a
> > synchronous flush. Hmmm - perhaps that should also defer the flush
> > to the kblockd, because if we are overrunning a plug then we've
> > already surrendered IO dispatch latency....
> > 
> > So, in blk_mq_make_request(), can you do:
> > 
> > 			if (list_empty(&plug->mq_list))
> > 				trace_block_plug(q);
> > 			else if (request_count >= BLK_MAX_REQUEST_COUNT) {
> > -				blk_flush_plug_list(plug, false);
> > +				blk_flush_plug_list(plug, true);
> > 				trace_block_plug(q);
> > 			}
> > 			list_add_tail(&rq->queuelist, &plug->mq_list);
> > 
> > To see if that defers all the swap IO to kblockd?
> > 
> 
> Interim report,
> 
> I applied the patch below (we also need to fix io_schedule_timeout, since mempool_alloc goes through it)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bfe16d5af9f9..0c81aacec75b 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1585,7 +1585,7 @@ get_rq:
>  			trace_block_plug(q);
>  		else {
>  			if (request_count >= BLK_MAX_REQUEST_COUNT) {
> -				blk_flush_plug_list(plug, false);
> +				blk_flush_plug_list(plug, true);
>  				trace_block_plug(q);
>  			}
>  		}
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5c6635b806c..ebca9e1f200f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4244,7 +4244,7 @@ void __sched io_schedule(void)
>  
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> -	blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  	current->in_iowait = 1;
>  	schedule();
>  	current->in_iowait = 0;
> @@ -4260,7 +4260,7 @@ long __sched io_schedule_timeout(long timeout)
>  
>  	delayacct_blkio_start();
>  	atomic_inc(&rq->nr_iowait);
> -	blk_flush_plug(current);
> +	blk_schedule_flush_plug(current);
>  	current->in_iowait = 1;
>  	ret = schedule_timeout(timeout);
>  	current->in_iowait = 0;
> 
> And the result is as follows. It reduces stack usage by about 800 bytes
> compared to my first report, but stack usage still seems high.
> The VM functions really need a diet.
> 
>         -----    ----   --------
>   0)     6896      16   lookup_address+0x28/0x30
>   1)     6880      16   _lookup_address_cpa.isra.3+0x3b/0x40
>   2)     6864     304   __change_page_attr_set_clr+0xe0/0xb50
>   3)     6560     112   kernel_map_pages+0x6c/0x120
>   4)     6448     256   get_page_from_freelist+0x489/0x920
>   5)     6192     352   __alloc_pages_nodemask+0x5e1/0xb20
>   6)     5840       8   alloc_pages_current+0x10f/0x1f0
>   7)     5832     168   new_slab+0x35d/0x370
>   8)     5664       8   __slab_alloc+0x3a9/0x501
>   9)     5656      80   kmem_cache_alloc+0x1ac/0x1c0
>  10)     5576     296   mempool_alloc_slab+0x15/0x20
>  11)     5280     128   mempool_alloc+0x5e/0x170
>  12)     5152      96   bio_alloc_bioset+0x10b/0x1d0
>  13)     5056      48   get_swap_bio+0x30/0x90
>  14)     5008     160   __swap_writepage+0x150/0x230
>  15)     4848      32   swap_writepage+0x42/0x90
>  16)     4816     320   shrink_page_list+0x676/0xa80
>  17)     4496     208   shrink_inactive_list+0x262/0x4e0
>  18)     4288     304   shrink_lruvec+0x3e1/0x6a0
>  19)     3984      80   shrink_zone+0x3f/0x110
>  20)     3904     128   do_try_to_free_pages+0x156/0x4c0
>  21)     3776     208   try_to_free_pages+0xf7/0x1e0
>  22)     3568     352   __alloc_pages_nodemask+0x783/0xb20
>  23)     3216       8   alloc_pages_current+0x10f/0x1f0
>  24)     3208     168   new_slab+0x2c5/0x370
>  25)     3040       8   __slab_alloc+0x3a9/0x501
>  26)     3032      80   kmem_cache_alloc+0x1ac/0x1c0
>  27)     2952     296   mempool_alloc_slab+0x15/0x20
>  28)     2656     128   mempool_alloc+0x5e/0x170
>  29)     2528      96   bio_alloc_bioset+0x10b/0x1d0
>  30)     2432      48   mpage_alloc+0x38/0xa0
>  31)     2384     208   do_mpage_readpage+0x49b/0x5d0
>  32)     2176     224   mpage_readpages+0xcf/0x120
>  33)     1952      48   ext4_readpages+0x45/0x60
>  34)     1904     224   __do_page_cache_readahead+0x222/0x2d0
>  35)     1680      16   ra_submit+0x21/0x30
>  36)     1664     112   filemap_fault+0x2d7/0x4f0
>  37)     1552     144   __do_fault+0x6d/0x4c0
>  38)     1408     160   handle_mm_fault+0x1a6/0xaf0
>  39)     1248     272   __do_page_fault+0x18a/0x590
>  40)      976      16   do_page_fault+0xc/0x10
>  41)      960     208   page_fault+0x22/0x30
>  42)      752      16   clear_user+0x2e/0x40
>  43)      736      16   padzero+0x2d/0x40
>  44)      720     304   load_elf_binary+0xa47/0x1a40
>  45)      416      48   search_binary_handler+0x9c/0x1a0
>  46)      368     144   do_execve_common.isra.25+0x58d/0x700
>  47)      224      16   do_execve+0x18/0x20
>  48)      208      32   SyS_execve+0x2e/0x40
>  49)      176     176   stub_execve+0x69/0xa0
> 
> 
> 
> -- 
> Kind regards,
> Minchan Kim
> 

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  1:24                       ` Linus Torvalds
  2014-05-30  1:58                         ` Dave Chinner
@ 2014-05-30  6:21                         ` Minchan Kim
  1 sibling, 0 replies; 107+ messages in thread
From: Minchan Kim @ 2014-05-30  6:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 06:24:02PM -0700, Linus Torvalds wrote:
> On Thu, May 29, 2014 at 5:50 PM, Minchan Kim <minchan@kernel.org> wrote:
> >>
> >> You could also try Dave's patch, and _not_ do my mm/vmscan.c part.
> >
> > Sure. While I write this, Rusty's test was crached so I will try Dave's patch,
> > them yours except vmscan.c part.
> 
> Looking more at Dave's patch (well, description), I don't think there
> is any way in hell we can ever apply it. If I read it right, it will
> cause all IO that overflows the max request count to go through the
> scheduler to get it flushed. Maybe I misread it, but that's definitely
> not acceptable. Maybe it's not noticeable with a slow rotational
> device, but modern ssd hardware? No way.
> 
> I'd *much* rather slow down the swap side. Not "real IO". So I think
> my mm/vmscan.c patch is preferable (but yes, it might require some
> work to make kswapd do better).
> 
> So you can try Dave's patch just to see what it does for stack depth,
> but other than that it looks unacceptable unless I misread things.
> 
>              Linus

I tested the patch below and the result is an endless OOM, even though
there are lots of anon pages and plenty of free swap space.

I guess __alloc_pages_direct_reclaim can't make progress on the anon
pages once the VM has dropped most of the file-backed pages, so it goes
to OOM.

---
 mm/backing-dev.c | 25 +++++++++++++++----------
 mm/vmscan.c      |  4 +---
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index ce682f7a4f29..2762b16404bd 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -11,6 +11,7 @@
 #include <linux/writeback.h>
 #include <linux/device.h>
 #include <trace/events/writeback.h>
+#include <linux/blkdev.h>
 
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
@@ -565,6 +566,18 @@ void set_bdi_congested(struct backing_dev_info *bdi, int sync)
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+static long congestion_timeout(int sync, long timeout)
+{
+	long ret;
+	DEFINE_WAIT(wait);
+
+	wait_queue_head_t *wqh = &congestion_wqh[sync];
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	return ret;
+}
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -578,12 +591,8 @@ long congestion_wait(int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync,timeout);
 
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
@@ -614,8 +623,6 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 {
 	long ret;
 	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
 
 	/*
 	 * If there is no congestion, or heavy congestion is not being
@@ -635,9 +642,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	}
 
 	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
+	ret = congestion_timeout(sync, timeout);
 
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9c74b409681..e4ad7cd1885b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -975,9 +975,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			 * avoid risk of stack overflow but only writeback
 			 * if many dirty pages have been encountered.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() ||
-					 !zone_is_reclaim_dirty(zone))) {
+			if (!current_is_kswapd() || !zone_is_reclaim_dirty(zone)) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
-- 
1.9.2

-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH 4/4] virtio_ring: unify direct/indirect code paths.
  2014-05-30  2:37                     ` Rusty Russell
@ 2014-05-30  6:21                       ` Rusty Russell
  0 siblings, 0 replies; 107+ messages in thread
From: Rusty Russell @ 2014-05-30  6:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Dave Chinner, Jens Axboe, Minchan Kim, Linux Kernel Mailing List

Rusty Russell <rusty@rustcorp.com.au> writes:
> "Michael S. Tsirkin" <mst@redhat.com> writes:
>> On Thu, May 29, 2014 at 04:56:45PM +0930, Rusty Russell wrote:
>>> virtqueue_add() populates the virtqueue descriptor table from the sgs
>>> given.  If it uses an indirect descriptor table, then it puts a single
>>> descriptor in the descriptor table pointing to the kmalloc'ed indirect
>>> table where the sg is populated.
>>> +	for (i = 0; i < total_sg; i++)
>>> +		desc[i].next = i+1;
>>> +	return desc;
>>
>> Hmm we are doing an extra walk over descriptors here.
>> This might hurt performance esp for big descriptors.
>
> Yes, this needs to be benchmarked; since it's cache hot my gut feel is
> that it's a NOOP, but on modern machines my gut feel is always wrong.

CC's trimmed.

Well, I was almost right about being wrong.

I wrote a userspace virtio_ring microbench which does 10000000
virtqueue_add_outbuf() calls (which go indirect) and not much else.

Read as <MIN>-<MAX>(<MEAN>+/-<STDDEV>):
Current kernel:           936153354- 967745359(9.44739e+08+/-6.1e+06)ns
Using sg_next:           1061485790-1104800648(1.08254e+09+/-6.6e+06)ns
Unifying indirect path:  1214289435-1272686712(1.22564e+09+/-8e+06)ns
Using indirect flag:     1125610268-1183528965(1.14172e+09+/-8e+06)ns

Of course this might be lost in the noise on real networking, so that's
my job on Monday.

Subject: vring_bench: simple benchmark for adding descriptors to a virtqueue.

This userspace benchmark uses the kernel code to add 8 16-element
scatterlists to a virtqueue, then consumes them and starts again.

For example:
	$ for i in `seq 10`; do ./vring_bench; done | stats --trim-outliers
	936153354-967745359(9.44739e+08+/-6.1e+06)ns
	9999872 returned

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
index 1cfbb0157a46..ff32cca971d8 100644
--- a/tools/virtio/.gitignore
+++ b/tools/virtio/.gitignore
@@ -1,3 +1,4 @@
 *.d
 virtio_test
 vringh_test
+vring_bench
diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
index 3187c62d9814..103101273049 100644
--- a/tools/virtio/Makefile
+++ b/tools/virtio/Makefile
@@ -1,6 +1,7 @@
 all: test mod
-test: virtio_test vringh_test
+test: virtio_test vringh_test vring_bench
 virtio_test: virtio_ring.o virtio_test.o
+vring_bench: virtio_ring.o vring_bench.o
 vringh_test: vringh_test.o vringh.o virtio_ring.o
 
 CFLAGS += -g -O2 -Wall -I. -I ../../usr/include/ -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE
@@ -9,6 +10,6 @@ mod:
 	${MAKE} -C `pwd`/../.. M=`pwd`/vhost_test
 .PHONY: all test mod clean
 clean:
-	${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
+	${RM} *.o vringh_test virtio_test vring_bench vhost_test/*.o vhost_test/.*.cmd \
               vhost_test/Module.symvers vhost_test/modules.order *.d
 -include *.d
diff --git a/tools/virtio/linux/kernel.h b/tools/virtio/linux/kernel.h
index fba705963968..8dcff8e3374c 100644
--- a/tools/virtio/linux/kernel.h
+++ b/tools/virtio/linux/kernel.h
@@ -109,4 +109,7 @@ static inline void free_page(unsigned long addr)
 	(void) (&_min1 == &_min2);		\
 	_min1 < _min2 ? _min1 : _min2; })
 
+/* Just make it compile */
+#define list_for_each_entry(iter, list, member)
+
 #endif /* KERNEL_H */
diff --git a/tools/virtio/vring_bench.c b/tools/virtio/vring_bench.c
new file mode 100644
index 000000000000..0d7544fd26ad
--- /dev/null
+++ b/tools/virtio/vring_bench.c
@@ -0,0 +1,125 @@
+#define _GNU_SOURCE
+#include <time.h>
+#include <getopt.h>
+#include <string.h>
+#include <poll.h>
+#include <sys/eventfd.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <unistd.h>
+#include <sys/ioctl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <fcntl.h>
+#include <stdbool.h>
+#include <linux/virtio.h>
+#include <linux/virtio_ring.h>
+
+/* Unused */
+void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
+
+static struct vring vring;
+static uint16_t last_avail_idx;
+static unsigned int returned;
+
+static bool vq_notify(struct virtqueue *vq)
+{
+	/* "Use" them all. */
+	while (vring.avail->idx != last_avail_idx) {
+		unsigned int i, head;
+
+		i = last_avail_idx++ & (vring.num - 1);
+		head = vring.avail->ring[i];
+		assert(head < vring.num);
+
+		i = vring.used->idx & (vring.num - 1);
+		vring.used->ring[i].id = head;
+		vring.used->ring[i].len = 0;
+		vring.used->idx++;
+	}
+	return true;
+}
+
+static void vq_callback(struct virtqueue *vq)
+{
+	unsigned int len;
+	void *p;
+
+	while ((p = virtqueue_get_buf(vq, &len)) != NULL)
+		returned++;
+}
+
+/* Ring size 128, just like qemu uses */
+#define VRING_NUM 128
+#define SG_SIZE 16
+
+static inline struct timespec time_sub(struct timespec recent,
+				       struct timespec old)
+{
+	struct timespec diff;
+
+	diff.tv_sec = recent.tv_sec - old.tv_sec;
+	if (old.tv_nsec > recent.tv_nsec) {
+		diff.tv_sec--;
+		diff.tv_nsec = 1000000000 + recent.tv_nsec - old.tv_nsec;
+	} else
+		diff.tv_nsec = recent.tv_nsec - old.tv_nsec;
+
+	return diff;
+}
+
+static struct timespec time_now(void)
+{
+	struct timespec ret;
+	clock_gettime(CLOCK_REALTIME, &ret);
+	return ret;
+}
+
+static inline uint64_t time_to_nsec(struct timespec t)
+{
+	uint64_t nsec;
+
+	nsec = t.tv_nsec + (uint64_t)t.tv_sec * 1000000000;
+	return nsec;
+}
+
+int main(int argc, char *argv[])
+{
+	struct virtqueue *vq;
+	struct virtio_device vdev;
+	void *ring;
+	unsigned int i, num;
+	int e;
+	struct scatterlist sg[SG_SIZE];
+	struct timespec start;
+
+	sg_init_table(sg, SG_SIZE);
+
+	e = posix_memalign(&ring, 4096, vring_size(VRING_NUM, 4096));
+	assert(e >= 0);
+
+	vdev.features[0] = (1UL << VIRTIO_RING_F_INDIRECT_DESC) |
+		(1UL << VIRTIO_RING_F_EVENT_IDX);
+
+	vq = vring_new_virtqueue(0, VRING_NUM, 4096, &vdev, true, ring,
+				 vq_notify, vq_callback, "benchmark");
+	assert(vq);
+	vring_init(&vring, VRING_NUM, ring, 4096);
+
+	num = atoi(argv[1] ?: "10000000");
+
+	start = time_now();
+	for (i = 0; i < num; i++) {
+	again:
+		e = virtqueue_add_outbuf(vq, sg, SG_SIZE, sg, GFP_ATOMIC);
+		if (e < 0) {
+			virtqueue_kick(vq);
+			vring_interrupt(0, vq);
+			goto again;
+		}
+	}
+	printf("%lluns\n",
+	       (long long)time_to_nsec(time_sub(time_now(), start)));
+	printf("%u returned\n", returned);
+	return 0;
+}



^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: virtio ring cleanups, which save stack on older gcc
  2014-05-29 23:45                     ` Minchan Kim
  2014-05-30  1:06                       ` Minchan Kim
@ 2014-05-30  6:56                       ` Rusty Russell
  1 sibling, 0 replies; 107+ messages in thread
From: Rusty Russell @ 2014-05-30  6:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Linus Torvalds, Dave Chinner, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Michael S. Tsirkin,
	Dave Hansen, Steven Rostedt

Minchan Kim <minchan@kernel.org> writes:
> On Thu, May 29, 2014 at 08:38:33PM +0930, Rusty Russell wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>> > Hello Rusty,
>> >
>> > On Thu, May 29, 2014 at 04:56:41PM +0930, Rusty Russell wrote:
>> >> They don't make much difference: the easier fix is use gcc 4.8
>> >> which drops stack required across virtio block's virtio_queue_rq
>> >> down to that kmalloc in virtio_ring from 528 to 392 bytes.
>> >> 
>> >> Still, these (*lightly tested*) patches reduce to 432 bytes,
>> >> even for gcc 4.6.4.  Posted here FYI.
>> >
>> > I am testing with the hack below (based on Dave's idea), so I don't have
>> > a machine to test your patches until tomorrow.
>> > So, I will queue your patches on the testing machine tomorrow morning.
>> 
>> More interesting would be updating your compiler to 4.8, I think.
>> Saving <100 bytes on virtio is not going to save you, right?
>
> But in my report, virtio_ring consumes more stack than in yours.

Yeah, weird.  I wonder if it's because I'm measuring before the call to
kmalloc; gcc probably spills extra crap on the stack before that.

You got 904 bytes:

   8)     5928     376   vring_add_indirect+0x36/0x200
   9)     5552     144   virtqueue_add_sgs+0x2e2/0x320
  10)     5408     288   __virtblk_add_req+0xda/0x1b0
  11)     5120      96   virtio_queue_rq+0xd3/0x1d0

When I move my "stack_top" save code into __kmalloc, with gcc 4.6 and your
.config I get:

[    2.506869] virtio_blk: stack used = 640

So I don't know quite what's going on :(

Cheers,
Rusty.

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index cb9b1f8326c3..894e290b4bd2 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -151,15 +151,19 @@ static void virtblk_done(struct virtqueue *vq)
 	spin_unlock_irqrestore(&vblk->vq_lock, flags);
 }
 
+extern struct task_struct *record_stack;
+extern unsigned long stack_top;
+
 static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 {
 	struct virtio_blk *vblk = hctx->queue->queuedata;
 	struct virtblk_req *vbr = req->special;
 	unsigned long flags;
 	unsigned int num;
+	unsigned long stack_bottom;
 	const bool last = (req->cmd_flags & REQ_END) != 0;
 	int err;
-
+	
 	BUG_ON(req->nr_phys_segments + 2 > vblk->sg_elems);
 
 	vbr->req = req;
@@ -199,7 +203,12 @@ static int virtio_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *req)
 	}
 
 	spin_lock_irqsave(&vblk->vq_lock, flags);
+	record_stack = current;
+	__asm__ __volatile__("movq %%rsp,%0" : "=g" (stack_bottom));
 	err = __virtblk_add_req(vblk->vq, vbr, vbr->sg, num);
+	record_stack = NULL;
+
+	printk("virtio_blk: stack used = %lu\n", stack_bottom - stack_top);
 	if (err) {
 		virtqueue_kick(vblk->vq);
 		blk_mq_stop_hw_queue(hctx);
diff --git a/mm/slub.c b/mm/slub.c
index 2b1ce697fc4b..0f9a1a6b381e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3278,11 +3278,22 @@ static int __init setup_slub_nomerge(char *str)
 
 __setup("slub_nomerge", setup_slub_nomerge);
 
+extern struct task_struct *record_stack;
+struct task_struct *record_stack;
+EXPORT_SYMBOL(record_stack);
+
+extern unsigned long stack_top;
+unsigned long stack_top;
+EXPORT_SYMBOL(stack_top);
+
 void *__kmalloc(size_t size, gfp_t flags)
 {
 	struct kmem_cache *s;
 	void *ret;
 
+	if (record_stack == current)
+		__asm__ __volatile__("movq %%rsp,%0" : "=g" (stack_top));
+
 	if (unlikely(size > KMALLOC_MAX_CACHE_SIZE))
 		return kmalloc_large(size, flags);
 

^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29 15:24               ` Linus Torvalds
  2014-05-29 23:40                 ` Minchan Kim
  2014-05-29 23:53                 ` Dave Chinner
@ 2014-05-30  9:48                 ` Richard Weinberger
  2014-05-30 15:36                   ` Linus Torvalds
  2 siblings, 1 reply; 107+ messages in thread
From: Richard Weinberger @ 2014-05-30  9:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Jens Axboe, Minchan Kim, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On Thu, May 29, 2014 at 5:24 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So I'm not in fact arguing against Minchan's patch of upping
> THREAD_SIZE_ORDER to 2 on x86-64, but at the same time stack size does
> remain one of my "we really need to be careful" issues, so while I am
> basically planning on applying that patch, I _also_ want to make sure
> that we fix the problems we do see and not just paper them over.
>
> The 8kB stack has been somewhat restrictive and painful for a while,
> and I'm ok with admitting that it is just getting _too_ damn painful,
> but I don't want to just give up entirely when we have a known deep
> stack case.

If we raise the stack size on x86_64 to 16k, what about i386?
Besides the fact that most of you consider 32-bit dead and think it must die... ;)

-- 
Thanks,
//richard

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  1:34                         ` Dave Chinner
@ 2014-05-30 15:25                           ` H. Peter Anvin
  2014-05-30 15:41                             ` Linus Torvalds
  0 siblings, 1 reply; 107+ messages in thread
From: H. Peter Anvin @ 2014-05-30 15:25 UTC (permalink / raw)
  To: Dave Chinner, Minchan Kim
  Cc: Dave Jones, Linus Torvalds, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On 05/29/2014 06:34 PM, Dave Chinner wrote:
>> ...
>> "kworker/u24:1 (94) used greatest stack depth: 8K bytes left, it means
>> there is some horrible stack hogger in your kernel. Please report it
>> the LKML and enable stacktrace to investigate who is culprit"
> 
> That, however, presumes that a user can reproduce the problem on
> demand. Experience tells me that this is the exception rather than
> the norm for production systems, and so capturing the stack in real
> time is IMO the only useful thing we could add...
> 

If we removed struct thread_info from the stack allocation then one
could do a guard page below the stack.  Of course, we'd have to use IST
for #PF in that case, which makes it a non-production option.

	-hpa



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  9:48                 ` Richard Weinberger
@ 2014-05-30 15:36                   ` Linus Torvalds
  0 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30 15:36 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: Dave Chinner, Jens Axboe, Minchan Kim, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On Fri, May 30, 2014 at 2:48 AM, Richard Weinberger
<richard.weinberger@gmail.com> wrote:
>
> If we raise the stack size on x86_64 to 16k, what about i386?
> Besides the fact that most of you consider 32-bit as dead and doomed to die... ;)

x86-32 doesn't have nearly the same issue, since a large portion of
stack content tends to be pointers and longs. So it's not like it uses
half the stack, but a 32-bit environment does use a lot less stack
than a 64-bit one.

                Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30 15:25                           ` H. Peter Anvin
@ 2014-05-30 15:41                             ` Linus Torvalds
  2014-05-30 15:52                               ` H. Peter Anvin
  2014-10-21  2:00                               ` Dave Jones
  0 siblings, 2 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30 15:41 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dave Chinner, Minchan Kim, Dave Jones, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>
> If we removed struct thread_info from the stack allocation then one
> could do a guard page below the stack.  Of course, we'd have to use IST
> for #PF in that case, which makes it a non-production option.

We could just have the guard page in between the stack and the
thread_info, take a double fault, and then just map it back in on
double fault.

That would give us 8kB of "normal" stack, with a very loud fault - and
then an extra 7kB or so of stack (whatever the size of thread-info is)
- after the first time it traps.

That said, it's still likely a non-production option due to the page
table games we'd have to play at fork/clone time.

               Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30 15:41                             ` Linus Torvalds
@ 2014-05-30 15:52                               ` H. Peter Anvin
  2014-05-30 16:06                                 ` Linus Torvalds
  2014-10-21  2:00                               ` Dave Jones
  1 sibling, 1 reply; 107+ messages in thread
From: H. Peter Anvin @ 2014-05-30 15:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Chinner, Minchan Kim, Dave Jones, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt, PJ Waskiewicz

On 05/30/2014 08:41 AM, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> If we removed struct thread_info from the stack allocation then one
>> could do a guard page below the stack.  Of course, we'd have to use IST
>> for #PF in that case, which makes it a non-production option.
> 
> We could just have the guard page in between the stack and the
> thread_info, take a double fault, and then just map it back in on
> double fault.
> 

Oh, duh.  Right, much better.  Similar to the espfix64 hack, too.

> That would give us 8kB of "normal" stack, with a very loud fault - and
> then an extra 7kB or so of stack (whatever the size of thread-info is)
> - after the first time it traps.
> 
> That said, it's still likely a non-production option due to the page
> table games we'd have to play at fork/clone time.

Still, seems much more tractable.

I would still like struct thread_info off the stack allocation for other
reasons (as we have discussed in the past.)

	-hpa


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30 15:52                               ` H. Peter Anvin
@ 2014-05-30 16:06                                 ` Linus Torvalds
  2014-05-30 17:24                                   ` Dave Hansen
  0 siblings, 1 reply; 107+ messages in thread
From: Linus Torvalds @ 2014-05-30 16:06 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Dave Chinner, Minchan Kim, Dave Jones, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt, PJ Waskiewicz

On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>
>> That said, it's still likely a non-production option due to the page
>> table games we'd have to play at fork/clone time.
>
> Still, seems much more tractable.

We might be able to make it more attractive by having a small
front-end cache of the 16kB allocations with the second page unmapped.
That would at least capture the common "lots of short-lived processes"
case without having to do kernel page table work.

               Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30 16:06                                 ` Linus Torvalds
@ 2014-05-30 17:24                                   ` Dave Hansen
  2014-05-30 18:12                                     ` H. Peter Anvin
  0 siblings, 1 reply; 107+ messages in thread
From: Dave Hansen @ 2014-05-30 17:24 UTC (permalink / raw)
  To: Linus Torvalds, H. Peter Anvin
  Cc: Dave Chinner, Minchan Kim, Dave Jones, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Steven Rostedt,
	PJ Waskiewicz

On 05/30/2014 09:06 AM, Linus Torvalds wrote:
> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> That said, it's still likely a non-production option due to the page
>>> table games we'd have to play at fork/clone time.
>>
>> Still, seems much more tractable.
> 
> We might be able to make it more attractive by having a small
> front-end cache of the 16kB allocations with the second page unmapped.
> That would at least capture the common "lots of short-lived processes"
> case without having to do kernel page table work.

If we want to use 4k mappings, we'd need to move the stack over to using
vmalloc() (or at least be out of the linear mapping) to avoid breaking
up the linear map's page tables too much.  Doing that, we'd actually not
_have_ to worry about fragmentation, and we could actually utilize the
per-cpu-pageset code since we could be back to using order-0 pages.
So it's at least not all a loss.  Although, I do remember playing with
4k stacks back in the 32-bit days and not getting much of a win with it.

We'd definitely need that cache, if for no other reason than that the
vmalloc/vmap code as-is isn't super-scalable.



^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30 17:24                                   ` Dave Hansen
@ 2014-05-30 18:12                                     ` H. Peter Anvin
  0 siblings, 0 replies; 107+ messages in thread
From: H. Peter Anvin @ 2014-05-30 18:12 UTC (permalink / raw)
  To: Dave Hansen, Linus Torvalds
  Cc: Dave Chinner, Minchan Kim, Dave Jones, Jens Axboe,
	Linux Kernel Mailing List, Andrew Morton, linux-mm, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Steven Rostedt,
	PJ Waskiewicz

On 05/30/2014 10:24 AM, Dave Hansen wrote:
> On 05/30/2014 09:06 AM, Linus Torvalds wrote:
>> On Fri, May 30, 2014 at 8:52 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>>> That said, it's still likely a non-production option due to the page
>>>> table games we'd have to play at fork/clone time.
>>>
>>> Still, seems much more tractable.
>>
>> We might be able to make it more attractive by having a small
>> front-end cache of the 16kB allocations with the second page unmapped.
>> That would at least capture the common "lots of short-lived processes"
>> case without having to do kernel page table work.
> 
> If we want to use 4k mappings, we'd need to move the stack over to using
> vmalloc() (or at least be out of the linear mapping) to avoid breaking
> up the linear map's page tables too much.  Doing that, we'd actually not
> _have_ to worry about fragmentation, and we could actually utilize the
> per-cpu-pageset code since we could be back to using order-0 pages.
> So it's at least not all a loss.  Although, I do remember playing with
> 4k stacks back in the 32-bit days and not getting much of a win with it.
> 
> We'd definitely need that cache, if for no other reason than that the
> vmalloc/vmap code as-is isn't super-scalable.
> 

I don't think we want to use 4K mappings for production...

	-hpa


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-28 16:09   ` Linus Torvalds
  2014-05-28 22:31     ` Dave Chinner
  2014-05-29  3:46     ` Minchan Kim
@ 2014-05-30 21:23     ` Andi Kleen
  2 siblings, 0 replies; 107+ messages in thread
From: Andi Kleen @ 2014-05-30 21:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

Linus Torvalds <torvalds@linux-foundation.org> writes:

> From a quick glance at the frame usage, some of it seems to be gcc
> being rather bad at stack allocation, but lots of it is just nasty
> spilling around the disgusting call-sites with tons of arguments. A
> _lot_ of the stack slots are marked as "%sfp" (which is gcc'ese for
> "spill frame pointer", afaik).

> Avoiding some inlining, and using a single flag value rather than the
> collection of "bool"s would probably help. But nothing really
> trivially obvious stands out.

One thing that may be worth playing around with gcc's
--param large-stack-frame and --param large-stack-frame-growth

This tells the inliner to stop inlining when too much stack would be used.

We use -fconserve-stack, I believe. So perhaps values smaller than 100
and 400 would make sense to try.

       -fconserve-stack
           Attempt to minimize stack usage.  The compiler attempts to
           use less stack space, even if that makes the program slower.
           This option
           implies setting the large-stack-frame parameter to 100 and
           the large-stack-frame-growth parameter to 400.


-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  4:37                   ` Linus Torvalds
@ 2014-05-31  1:45                     ` Linus Torvalds
  0 siblings, 0 replies; 107+ messages in thread
From: Linus Torvalds @ 2014-05-31  1:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Chinner, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Thu, May 29, 2014 at 9:37 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> It really might be very good to create a "struct alloc_info" that
> contains those shared arguments, and just pass a (const) pointer to
> that around. [ .. ]
>
> Ugh. I think I'll try looking at that tomorrow.

I did look at it, but the thing is horrible. I started on this
something like ten times, and always ended up running away screaming.
Some things are truly fixed (notably "order"), but most things end up
changing subtly halfway through the callchain.

I might look at it some more later, but people may have noticed that I
decided to just apply Minchan's original patch in the meantime. I'll
make an rc8 this weekend..

        Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-29  2:42           ` [RFC 2/2] x86_64: expand kernel stack to 16K Linus Torvalds
                               ` (2 preceding siblings ...)
  2014-05-29  7:26             ` [RFC 2/2] x86_64: expand kernel stack to 16K Dave Chinner
@ 2014-05-31  2:06             ` Jens Axboe
  2014-06-02 22:59               ` Dave Chinner
  2014-06-03 13:02               ` Konstantin Khlebnikov
  3 siblings, 2 replies; 107+ messages in thread
From: Jens Axboe @ 2014-05-31  2:06 UTC (permalink / raw)
  To: Linus Torvalds, Dave Chinner
  Cc: Minchan Kim, Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On 2014-05-28 20:42, Linus Torvalds wrote:
>> Regardless of whether it is swap or something external queues the
>> bio on the plug, perhaps we should look at why it's done inline
>> rather than by kblockd, where it was moved because it was blowing
>> the stack from schedule():
>
> So it sounds like we need to do this for io_schedule() too.
>
> In fact, we've generally found it to be a mistake every time we
> "automatically" unblock some IO queue. And I'm not saying that because
> of stack space, but because we've _often_ had the situation that eager
> unblocking results in IO that could have been done as bigger requests.

We definitely need to auto-unplug on the schedule path, otherwise we run 
into all sorts of trouble. But making it async off the IO schedule path 
is fine. By definition, it's not latency sensitive if we are hitting 
unplug on schedule. I'm pretty sure it was run inline due to CPU concerns 
here, as running inline is certainly cheaper than punting to kblockd.

> Looking at that callchain, I have to say that ext4 doesn't look
> horrible compared to the whole block layer and virtio.. Yes,
> "ext4_writepages()" is using almost 400 bytes of stack, and most of
> that seems to be due to:
>
>          struct mpage_da_data mpd;
>          struct blk_plug plug;

Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the 
magic part of it, since that's never caught any bugs. Only saves 8 
bytes, but may as well take that. Especially if we end up with nested plugs.

> Well, we've definitely had some issues with deeper callchains
> with md, but I suspect virtio might be worse, and the new blk-mq code
> is likely worse in this respect too.

I don't think blk-mq is worse than the older stack, in fact it should be 
better. The call chains are shorter, and a lot less cruft on the stack. 
Historically the stack issues have been nested devices, however. And for 
sync IO, we do run it inline, so if the driver chews up a lot of stack, 
well...

Looks like I'm late here and the decision has been made to go with 16K 
stacks, which I think is a good one. We've been living on the edge (and 
sometimes over) for heavy dm/md setups for a while, and have been 
patching around that fact in the IO stack for years.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-31  2:06             ` Jens Axboe
@ 2014-06-02 22:59               ` Dave Chinner
  2014-06-03 13:02               ` Konstantin Khlebnikov
  1 sibling, 0 replies; 107+ messages in thread
From: Dave Chinner @ 2014-06-02 22:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Minchan Kim, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On Fri, May 30, 2014 at 08:06:53PM -0600, Jens Axboe wrote:
> On 2014-05-28 20:42, Linus Torvalds wrote:
> >Well, we've definitely had some issues with deeper callchains
> >with md, but I suspect virtio might be worse, and the new blk-mq code
> >is likely worse in this respect too.
> 
> I don't think blk-mq is worse than the older stack, in fact it
> should be better. The call chains are shorter, and a lot less cruft
> on the stack. Historically the stack issues have been nested
> devices, however. And for sync IO, we do run it inline, so if the
> driver chews up a lot of stack, well...

Hi Jens - as we found out with the mm code, there's a significant
disconnect between what the code looks like (i.e. it may use very
little stack directly) and what the compiler is generating.

Before blk-mq:

  9)     3952     112   scsi_request_fn+0x4b/0x490
 10)     3840      32   __blk_run_queue+0x37/0x50
 11)     3808      64   queue_unplugged+0x39/0xb0
 12)     3744     112   blk_flush_plug_list+0x20b/0x240

Now with blk-mq:

  3)     4672      96   virtio_queue_rq+0xd2/0x1e0
  4)     4576     128   __blk_mq_run_hw_queue+0x1f0/0x3e0
  5)     4448      16   blk_mq_run_hw_queue+0x35/0x40
  6)     4432      80   blk_mq_insert_requests+0xc7/0x130
  7)     4352      96   blk_mq_flush_plug_list+0x129/0x140
  8)     4256     112   blk_flush_plug_list+0xe7/0x230

So previously flushing a plug used roughly 200 bytes of stack.  With
blk-mq, it's over 400 bytes. IOWs, blk-mq has more than doubled the
block layer stack usage...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-31  2:06             ` Jens Axboe
  2014-06-02 22:59               ` Dave Chinner
@ 2014-06-03 13:02               ` Konstantin Khlebnikov
  1 sibling, 0 replies; 107+ messages in thread
From: Konstantin Khlebnikov @ 2014-06-03 13:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Dave Chinner, Minchan Kim,
	Linux Kernel Mailing List, Andrew Morton, linux-mm,
	H. Peter Anvin, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Rik van Riel, Johannes Weiner, Hugh Dickins, Rusty Russell,
	Michael S. Tsirkin, Dave Hansen, Steven Rostedt

On Sat, May 31, 2014 at 6:06 AM, Jens Axboe <axboe@kernel.dk> wrote:
> On 2014-05-28 20:42, Linus Torvalds wrote:
>>>
>>> Regardless of whether it is swap or something external queues the
>>> bio on the plug, perhaps we should look at why it's done inline
>>> rather than by kblockd, where it was moved because it was blowing
>>> the stack from schedule():
>>
>>
>> So it sounds like we need to do this for io_schedule() too.
>>
>> In fact, we've generally found it to be a mistake every time we
>> "automatically" unblock some IO queue. And I'm not saying that because
>> of stack space, but because we've _often_ had the situation that eager
>> unblocking results in IO that could have been done as bigger requests.
>
>
> We definitely need to auto-unplug on the schedule path, otherwise we run
> into all sorts of trouble. But making it async off the IO schedule path is
> fine. By definition, it's not latency sensitive if we are hitting unplug on
> schedule. I'm pretty sure it was run inline due to CPU concerns here, as running
> inline is certainly cheaper than punting to kblockd.
>
>
>> Looking at that callchain, I have to say that ext4 doesn't look
>> horrible compared to the whole block layer and virtio.. Yes,
>> "ext4_writepages()" is using almost 400 bytes of stack, and most of
>> that seems to be due to:
>>
>>          struct mpage_da_data mpd;
>>          struct blk_plug plug;
>
>
> Plus blk_plug is pretty tiny as it is. I queued up a patch to kill the magic
> part of it, since that's never caught any bugs. Only saves 8 bytes, but may
> as well take that. Especially if we end up with nested plugs.

In the case of nested plugs, only the first one is used, right?
If so, it could be embedded into task_struct together with an integer
recursion counter. This would save a bit of precious stack and look cleaner.


>
>
>> Well, we've definitely had some issues with deeper callchains
>> with md, but I suspect virtio might be worse, and the new blk-mq code
>> is likely worse in this respect too.
>
>
> I don't think blk-mq is worse than the older stack, in fact it should be
> better. The call chains are shorter, and a lot less cruft on the stack.
> Historically the stack issues have been nested devices, however. And for
> sync IO, we do run it inline, so if the driver chews up a lot of stack,
> well...
>
> Looks like I'm late here and the decision has been made to go with 16K stacks,
> which I think is a good one. We've been living on the edge (and sometimes
> over) for heavy dm/md setups for a while, and have been patching around that
> fact in the IO stack for years.
>
>
> --
> Jens Axboe
>
>

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30  2:12                 ` Minchan Kim
  2014-05-30  4:37                   ` Linus Torvalds
  2014-05-30  6:12                   ` Minchan Kim
@ 2014-06-03 13:28                   ` Rasmus Villemoes
  2014-06-03 19:04                     ` Linus Torvalds
  2 siblings, 1 reply; 107+ messages in thread
From: Rasmus Villemoes @ 2014-06-03 13:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Dave Chinner, Linus Torvalds, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

Possibly stupid question: Is it true that any given task can only be
using one wait_queue_t at a time? If so, would it make sense to put a
wait_queue_t into struct task_struct [maybe union'ed with a struct
wait_bit_queue] and avoid allocating this 40-byte structure repeatedly
on the stack?

E.g., in one of Minchan's stack traces, there are two calls of
mempool_alloc (which itself declares a wait_queue_t) and one
try_to_free_pages (which is the only caller of throttle_direct_reclaim,
which in turn uses wait_event_interruptible_timeout and
wait_event_killable).

Rasmus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-06-03 13:28                   ` Rasmus Villemoes
@ 2014-06-03 19:04                     ` Linus Torvalds
  2014-06-10 12:29                       ` [PATCH 0/2] Per-task wait_queue_t Rasmus Villemoes
  0 siblings, 1 reply; 107+ messages in thread
From: Linus Torvalds @ 2014-06-03 19:04 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Minchan Kim, Dave Chinner, Linux Kernel Mailing List,
	Andrew Morton, linux-mm, H. Peter Anvin, Ingo Molnar,
	Peter Zijlstra, Mel Gorman, Rik van Riel, Johannes Weiner,
	Hugh Dickins, Rusty Russell, Michael S. Tsirkin, Dave Hansen,
	Steven Rostedt

On Tue, Jun 3, 2014 at 6:28 AM, Rasmus Villemoes
<linux@rasmusvillemoes.dk> wrote:
> Possibly stupid question: Is it true that any given task can only be
> using one wait_queue_t at a time?

Nope.

Being on multiple different wait-queues is actually very common. The
obvious case is select/poll, but there are others. The more subtle
ones involve being on a wait-queue while doing something that can
cause nested waiting (iow, it's technically not wrong to be on a
wait-queue and then do a user space access, which obviously can end up
doing IO).

That said, the case of a single wait-queue entry is another common
case, and it wouldn't necessarily be wrong to have one pre-initialized
wait queue entry in the task structure for that special case, for when
you know that there is no possible nesting. And even if it *does*
nest, if it's the "leaf" entry it could be used for that innermost
nesting without worrying about other wait queue users (who use stack
allocations or actual explicit allocations like poll).

So it might certainly be worth looking at. In fact, it might be worth
it having multiple per-thread entries, so that we could get rid of the
special on-stack allocation for poll too (and making one of them
special and not available to poll, to handle the "leaf waiter" case).

So it's not necessarily a bad idea at all, even if the general case
requires more than one (or a few) static per-thread allocations.

Anybody want to try to code it up?

              Linus

^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 0/2] Per-task wait_queue_t
  2014-06-03 19:04                     ` Linus Torvalds
@ 2014-06-10 12:29                       ` Rasmus Villemoes
  2014-06-10 12:29                         ` [PATCH 1/2] wait: Introduce per-task wait_queue_t Rasmus Villemoes
                                           ` (2 more replies)
  0 siblings, 3 replies; 107+ messages in thread
From: Rasmus Villemoes @ 2014-06-10 12:29 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Oleg Nesterov, Rik van Riel, David Rientjes, Eric W. Biederman,
	Davidlohr Bueso, Michal Simek
  Cc: linux-kernel, Rasmus Villemoes

This is an attempt to reduce the stack footprint of various functions
(those using any of the wait_event_* macros), by removing the need to
allocate a wait_queue_t on the stack and instead use a single instance
embedded in task_struct. I'm not really sure where the best place to
put it is; I just placed it next to other list bookkeeping fields.

For now, there is a little unconditional debugging. This could later
be removed or maybe be made dependent on some CONFIG_* variable. The
idea of using ->flags is taken from Pavel [1] (I originally stored
(void*)1 into ->private).

Compiles, but not actually tested.

[1] http://thread.gmane.org/gmane.linux.kernel/1720670

Rasmus Villemoes (2):
  wait: Introduce per-task wait_queue_t
  wait: Use the per-task wait_queue_t in ___wait_event macro

 include/linux/sched.h | 23 +++++++++++++++++++++++
 include/linux/wait.h  | 18 ++++++++++--------
 kernel/fork.c         |  1 +
 3 files changed, 34 insertions(+), 8 deletions(-)

-- 
1.9.2


^ permalink raw reply	[flat|nested] 107+ messages in thread

* [PATCH 1/2] wait: Introduce per-task wait_queue_t
  2014-06-10 12:29                       ` [PATCH 0/2] Per-task wait_queue_t Rasmus Villemoes
@ 2014-06-10 12:29                         ` Rasmus Villemoes
  2014-06-11 15:16                           ` Oleg Nesterov
  2014-06-10 12:29                         ` [PATCH 2/2] wait: Use the per-task wait_queue_t in ___wait_event macro Rasmus Villemoes
  2014-06-10 15:50                         ` [PATCH 0/2] Per-task wait_queue_t Peter Zijlstra
  2 siblings, 1 reply; 107+ messages in thread
From: Rasmus Villemoes @ 2014-06-10 12:29 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Oleg Nesterov, Rik van Riel, David Rientjes, Eric W. Biederman,
	Davidlohr Bueso, Michal Simek
  Cc: linux-kernel, Rasmus Villemoes

This introduces a single wait_queue_t into the task structure.
Functions which need to wait, but which do not call other functions
that might wait while on the wait queue, may use current->__wq for
bookkeeping instead of allocating an instance on the stack. To help
ensure that the users are indeed "leaf waiters", make all access
through two helper functions.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 include/linux/sched.h | 23 +++++++++++++++++++++++
 include/linux/wait.h  |  1 +
 kernel/fork.c         |  1 +
 3 files changed, 25 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 221b2bd..c7c97fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1303,6 +1303,15 @@ struct task_struct {
 	struct list_head thread_group;
 	struct list_head thread_node;
 
+	/*
+	 * "Leaf" waiters may use this instead of allocating a
+	 * wait_queue_t on the stack. To help ensure exclusive use of
+	 * __wq, one should use the helper functions current_wq_get(),
+	 * current_wq_put() below. Leaf waiters include the
+	 * wait_event_* macros.
+	 */
+	wait_queue_t __wq;
+
 	struct completion *vfork_done;		/* for vfork() */
 	int __user *set_child_tid;		/* CLONE_CHILD_SETTID */
 	int __user *clear_child_tid;		/* CLONE_CHILD_CLEARTID */
@@ -1612,6 +1621,20 @@ struct task_struct {
 #endif
 };
 
+static inline wait_queue_t *current_wq_get(void)
+{
+	wait_queue_t *wq = &current->__wq;
+	BUG_ON(wq->flags != WQ_FLAG_AVAILABLE);
+	wq->flags = 0;
+	return wq;
+}
+static inline void current_wq_put(wait_queue_t *wq)
+{
+	BUG_ON(wq != &current->__wq);
+	wq->flags = WQ_FLAG_AVAILABLE;
+}
+
+
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
diff --git a/include/linux/wait.h b/include/linux/wait.h
index bd68819..94279be 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -16,6 +16,7 @@ int default_wake_function(wait_queue_t *wait, unsigned mode, int flags, void *ke
 struct __wait_queue {
 	unsigned int		flags;
 #define WQ_FLAG_EXCLUSIVE	0x01
+#define WQ_FLAG_AVAILABLE	0x80000000U
 	void			*private;
 	wait_queue_func_t	func;
 	struct list_head	task_list;
diff --git a/kernel/fork.c b/kernel/fork.c
index 54a8d26..7166955 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1315,6 +1315,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->sequential_io	= 0;
 	p->sequential_io_avg	= 0;
 #endif
+	p->__wq.flags = WQ_FLAG_AVAILABLE;
 
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	retval = sched_fork(clone_flags, p);
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* [PATCH 2/2] wait: Use the per-task wait_queue_t in ___wait_event macro
  2014-06-10 12:29                       ` [PATCH 0/2] Per-task wait_queue_t Rasmus Villemoes
  2014-06-10 12:29                         ` [PATCH 1/2] wait: Introduce per-task wait_queue_t Rasmus Villemoes
@ 2014-06-10 12:29                         ` Rasmus Villemoes
  2014-06-10 15:50                         ` [PATCH 0/2] Per-task wait_queue_t Peter Zijlstra
  2 siblings, 0 replies; 107+ messages in thread
From: Rasmus Villemoes @ 2014-06-10 12:29 UTC (permalink / raw)
  To: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Oleg Nesterov, Rik van Riel, David Rientjes, Eric W. Biederman,
	Davidlohr Bueso, Michal Simek
  Cc: linux-kernel, Rasmus Villemoes

The ___wait_event macro satisfies the requirements for making use of
the per-task wait_queue_t, so use it. This should make the stack
footprint of all users of the wait_event_* macros smaller.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 include/linux/wait.h | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 94279be..5f51252 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -207,17 +207,17 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 #define ___wait_event(wq, condition, state, exclusive, ret, cmd)	\
 ({									\
 	__label__ __out;						\
-	wait_queue_t __wait;						\
+	wait_queue_t *__wait = current_wq_get();			\
 	long __ret = ret;	/* explicit shadow */			\
 									\
-	INIT_LIST_HEAD(&__wait.task_list);				\
+	INIT_LIST_HEAD(&__wait->task_list);				\
 	if (exclusive)							\
-		__wait.flags = WQ_FLAG_EXCLUSIVE;			\
+		__wait->flags = WQ_FLAG_EXCLUSIVE;			\
 	else								\
-		__wait.flags = 0;					\
+		__wait->flags = 0;					\
 									\
 	for (;;) {							\
-		long __int = prepare_to_wait_event(&wq, &__wait, state);\
+		long __int = prepare_to_wait_event(&wq, __wait, state);	\
 									\
 		if (condition)						\
 			break;						\
@@ -225,7 +225,7 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 		if (___wait_is_interruptible(state) && __int) {		\
 			__ret = __int;					\
 			if (exclusive) {				\
-				abort_exclusive_wait(&wq, &__wait,	\
+				abort_exclusive_wait(&wq, __wait,	\
 						     state, NULL);	\
 				goto __out;				\
 			}						\
@@ -234,8 +234,9 @@ wait_queue_head_t *bit_waitqueue(void *, int);
 									\
 		cmd;							\
 	}								\
-	finish_wait(&wq, &__wait);					\
-__out:	__ret;								\
+	finish_wait(&wq, __wait);					\
+__out:	current_wq_put(__wait);						\
+	__ret;								\
 })
 
 #define __wait_event(wq, condition)					\
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 107+ messages in thread

* Re: [PATCH 0/2] Per-task wait_queue_t
  2014-06-10 12:29                       ` [PATCH 0/2] Per-task wait_queue_t Rasmus Villemoes
  2014-06-10 12:29                         ` [PATCH 1/2] wait: Introduce per-task wait_queue_t Rasmus Villemoes
  2014-06-10 12:29                         ` [PATCH 2/2] wait: Use the per-task wait_queue_t in ___wait_event macro Rasmus Villemoes
@ 2014-06-10 15:50                         ` Peter Zijlstra
  2014-06-12 21:46                           ` Rasmus Villemoes
  2 siblings, 1 reply; 107+ messages in thread
From: Peter Zijlstra @ 2014-06-10 15:50 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Oleg Nesterov,
	Rik van Riel, David Rientjes, Eric W. Biederman, Davidlohr Bueso,
	Michal Simek, linux-kernel


On Tue, Jun 10, 2014 at 02:29:17PM +0200, Rasmus Villemoes wrote:
> This is an attempt to reduce the stack footprint of various functions
> (those using any of the wait_event_* macros), by removing the need to
> allocate a wait_queue_t on the stack and instead use a single instance
> embedded in task_struct. I'm not really sure where the best place to
> put it is; I just placed it next to other list bookkeeping fields.
> 
> For now, there is a little unconditional debugging. This could later
> be removed or maybe be made dependent on some CONFIG_* variable. The
> idea of using ->flags is taken from Pavel [1] (I originally stored
> (void*)1 into ->private).
> 
> Compiles, but not actually tested.
> 

Doesn't look too bad, would be good to be tested and have some numbers
on the amount of stack saved etc..



* Re: [PATCH 1/2] wait: Introduce per-task wait_queue_t
  2014-06-10 12:29                         ` [PATCH 1/2] wait: Introduce per-task wait_queue_t Rasmus Villemoes
@ 2014-06-11 15:16                           ` Oleg Nesterov
  0 siblings, 0 replies; 107+ messages in thread
From: Oleg Nesterov @ 2014-06-11 15:16 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Linus Torvalds, Ingo Molnar, Peter Zijlstra, Andrew Morton,
	Rik van Riel, David Rientjes, Eric W. Biederman, Davidlohr Bueso,
	Michal Simek, linux-kernel

On 06/10, Rasmus Villemoes wrote:
>
> This introduces a single wait_queue_t into the task structure.
> Functions which need to wait, but which do not call other functions
> that might wait while on the wait queue, may use current->__wq

I am not going to argue, but I am not sure that wait_event() (changed
by the next patch) meets this criterion...

wait_event(wq, something_nontrivial_which_uses_wait_event_too()) is
legal currently although perhaps nobody does this.

> +static inline wait_queue_t *current_wq_get(void)
> +{
> +	wait_queue_t *wq = &current->__wq;
> +	BUG_ON(wq->flags != WQ_FLAG_AVAILABLE);
> +	wq->flags = 0;
> +	return wq;
> +}
> +static inline void current_wq_put(wait_queue_t *wq)
> +{
> +	BUG_ON(wq != &current->__wq);
> +	wq->flags = WQ_FLAG_AVAILABLE;
> +}

Or, perhaps, current_wq_get() can simply check list_empty(->task_list)
initialized by copy_process().

This way you do not need current_wq_put(), WQ_FLAG_AVAILABLE, and you
can kill INIT_LIST_HEAD() in ___wait_event().

Honestly, I am not sure about this patch... sizeof(wait_queue_t) is not
that large, and otoh it is not good to have yet another "rarely used"
member in the already huge task_struct. But again, I won't insist.

Oleg.



* Re: [PATCH 0/2] Per-task wait_queue_t
  2014-06-10 15:50                         ` [PATCH 0/2] Per-task wait_queue_t Peter Zijlstra
@ 2014-06-12 21:46                           ` Rasmus Villemoes
  0 siblings, 0 replies; 107+ messages in thread
From: Rasmus Villemoes @ 2014-06-12 21:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Andrew Morton, Oleg Nesterov,
	Rik van Riel, David Rientjes, Eric W. Biederman, Davidlohr Bueso,
	Michal Simek, linux-kernel

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Jun 10, 2014 at 02:29:17PM +0200, Rasmus Villemoes wrote:
>> This is an attempt to reduce the stack footprint of various functions
>> (those using any of the wait_event_* macros), by removing the need to
>> allocate a wait_queue_t on the stack and instead use a single instance
>> embedded in task_struct. I'm not really sure where the best place to
>> put it is; I just placed it next to other list bookkeeping fields.
>> 
>> For now, there is a little unconditional debugging. This could later
>> be removed or maybe be made dependent on some CONFIG_* variable. The
>> idea of using ->flags is taken from Pavel [1] (I originally stored
>> (void*)1 into ->private).
>> 
>> Compiles, but not actually tested.
>> 
>
> Doesn't look too bad, would be good to be tested and have some numbers
> on the amount of stack saved etc..

Here are some numbers, and the fact that it only has a positive effect
on 28 functions, together with Oleg's concerns, makes me think that it's
probably not worth it.

(defconfig on x86_64, based on 3.15)

file                 function                       old  new  delta
vmlinux              i915_pipe_crc_read              120  136  +16
vmlinux              try_to_free_pages               144  136   -8
vmlinux              mousedev_read                   112   96  -16
vmlinux              sky2_probe                      192  176  -16
vmlinux              gss_cred_init                   136  120  -16
vmlinux              do_coredump                     296  280  -16
vmlinux              md_do_sync                      360  344  -16
vmlinux              save_image_lzo                  200  168  -32
vmlinux              read_events                     184  152  -32
vmlinux              loop_make_request                96   64  -32
vmlinux              i801_access                     104   72  -32
vmlinux              tty_port_block_til_ready        104   72  -32
vmlinux              loop_thread                     152  120  -32
vmlinux              evdev_read                      136  104  -32
vmlinux              __sb_start_write                 96   64  -32
vmlinux              rfkill_fop_read                 104   72  -32
vmlinux              intel_dp_aux_ch                 152  120  -32
vmlinux              i801_transaction                 96   64  -32
vmlinux              blk_mq_queue_enter               96   64  -32
vmlinux              cypress_send_ext_cmd            152  104  -48
vmlinux              rcu_gp_kthread                  120   72  -48
vmlinux              locks_mandatory_area            264  216  -48
vmlinux              hub_thread                      232  184  -48
vmlinux              start_this_handle               120   72  -48
vmlinux              autofs4_wait                    136   88  -48
vmlinux              fcntl_setlk                     136   88  -48
vmlinux              load_module                     264  216  -48
vmlinux              serport_ldisc_read              144   96  -48
vmlinux              sg_read                         160  112  -48

Rasmus


* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-05-30 15:41                             ` Linus Torvalds
  2014-05-30 15:52                               ` H. Peter Anvin
@ 2014-10-21  2:00                               ` Dave Jones
  2014-10-21  4:59                                 ` Andy Lutomirski
  1 sibling, 1 reply; 107+ messages in thread
From: Dave Jones @ 2014-10-21  2:00 UTC (permalink / raw)
  To: Linux Kernel Mailing List; +Cc: Linus Torvalds

On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
 > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <hpa@zytor.com> wrote:
 > >
 > > If we removed struct thread_info from the stack allocation then one
 > > could do a guard page below the stack.  Of course, we'd have to use IST
 > > for #PF in that case, which makes it a non-production option.
 > 
 > We could just have the guard page in between the stack and the
 > thread_info, take a double fault, and then just map it back in on
 > double fault.
 > 
 > That would give us 8kB of "normal" stack, with a very loud fault - and
 > then an extra 7kB or so of stack (whatever the size of thread-info is)
 > - after the first time it traps.
 > 
 > That said, it's still likely a non-production option due to the page
 > table games we'd have to play at fork/clone time.

[thread necrophilia]

So digging this back up, it occurs to me that after we bumped to 16K,
we never did anything like the debug stuff you suggested here.

The reason I'm bringing this up, is that the last few weeks, I've been
seeing things like..

[27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left

So we're now eating past that first 8KB in some situations.

Do we care ? Or shall we only start worrying if it gets even deeper ?

	Dave



* Re: [RFC 2/2] x86_64: expand kernel stack to 16K
  2014-10-21  2:00                               ` Dave Jones
@ 2014-10-21  4:59                                 ` Andy Lutomirski
  0 siblings, 0 replies; 107+ messages in thread
From: Andy Lutomirski @ 2014-10-21  4:59 UTC (permalink / raw)
  To: Dave Jones, Linux Kernel Mailing List, Linus Torvalds

On 10/20/2014 07:00 PM, Dave Jones wrote:
> On Fri, May 30, 2014 at 08:41:00AM -0700, Linus Torvalds wrote:
>  > On Fri, May 30, 2014 at 8:25 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>  > >
>  > > If we removed struct thread_info from the stack allocation then one
>  > > could do a guard page below the stack.  Of course, we'd have to use IST
>  > > for #PF in that case, which makes it a non-production option.

Why is thread_info in the stack allocation anyway?  Every time I look at
the entry asm, one (minor) thing that contributes to general
brain-hurtingness / sense of horrified awe is the incomprehensible (to
me) split between task_struct and thread_info.

struct thread_info is at the bottom of the stack, right?  If we don't
want to merge it into task_struct, couldn't we stick it at the top of
the stack instead?  Anything that can overwrite the *top* of the stack
gives trivial user-controlled CPL0 execution regardless.

>  > 
>  > We could just have the guard page in between the stack and the
>  > thread_info, take a double fault, and then just map it back in on
>  > double fault.
>  > 
>  > That would give us 8kB of "normal" stack, with a very loud fault - and
>  > then an extra 7kB or so of stack (whatever the size of thread-info is)
>  > - after the first time it traps.
>  > 
>  > That said, it's still likely a non-production option due to the page
>  > table games we'd have to play at fork/clone time.

What's wrong with vmalloc?  Doesn't it already have guard pages?

(Also, we have a shiny hardware dirty bit, so we could relatively
cheaply check whether we're near the limit without any weird
#PF-in-weird-context issues.)

Also, muahaha, I've infected more people with the crazy idea that
intentional double-faults are okay.  Suckers!  Soon I'll have Linux
returning from interrupts with lret!  (IIRC Windows used to do
intentional *triple* faults on context switches, so this should be
considered entirely sensible.)

> 
> [thread necrophilia]
> 
> So digging this back up, it occurs to me that after we bumped to 16K,
> we never did anything like the debug stuff you suggested here.
> 
> The reason I'm bringing this up, is that the last few weeks, I've been
> seeing things like..
> 
> [27871.793753] trinity-c386 (28793) used greatest stack depth: 7728 bytes left
> 
> So we're now eating past that first 8KB in some situations.
> 
> Do we care ? Or shall we only start worrying if it gets even deeper ?

I would *love* to have an immediate, loud failure when we overrun the
stack.  This will unavoidably increase the number of TLB misses, but
that probably isn't so bad.

--Andy



end of thread, other threads:[~2014-10-21  4:59 UTC | newest]

Thread overview: 107+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-28  6:53 [PATCH 1/2] ftrace: print stack usage right before Oops Minchan Kim
2014-05-28  6:53 ` [RFC 2/2] x86_64: expand kernel stack to 16K Minchan Kim
2014-05-28  8:37   ` Dave Chinner
2014-05-28  9:13     ` Dave Chinner
2014-05-28 16:06       ` Johannes Weiner
2014-05-28 21:55         ` Dave Chinner
2014-05-29  6:06         ` Minchan Kim
2014-05-28  9:04   ` Michael S. Tsirkin
2014-05-29  1:09     ` Minchan Kim
2014-05-29  2:44       ` Steven Rostedt
2014-05-29  4:11         ` Minchan Kim
2014-05-29  2:47       ` Rusty Russell
2014-05-29  4:10     ` virtio_ring stack usage Rusty Russell
2014-05-28  9:27   ` [RFC 2/2] x86_64: expand kernel stack to 16K Borislav Petkov
2014-05-29 13:23     ` One Thousand Gnomes
2014-05-28 14:14   ` Steven Rostedt
2014-05-28 14:23     ` H. Peter Anvin
2014-05-28 22:11       ` Dave Chinner
2014-05-28 22:42         ` H. Peter Anvin
2014-05-28 23:17           ` Dave Chinner
2014-05-28 23:21             ` H. Peter Anvin
2014-05-28 15:43   ` Richard Weinberger
2014-05-28 16:08     ` Steven Rostedt
2014-05-28 16:11       ` Richard Weinberger
2014-05-28 16:13       ` Linus Torvalds
2014-05-28 16:09   ` Linus Torvalds
2014-05-28 22:31     ` Dave Chinner
2014-05-28 22:41       ` Linus Torvalds
2014-05-29  1:30         ` Dave Chinner
2014-05-29  1:58           ` Dave Chinner
2014-05-29  2:51             ` Linus Torvalds
2014-05-29 23:36             ` Minchan Kim
2014-05-30  0:05               ` Linus Torvalds
2014-05-30  0:20                 ` Minchan Kim
2014-05-30  0:31                   ` Linus Torvalds
2014-05-30  0:50                     ` Minchan Kim
2014-05-30  1:24                       ` Linus Torvalds
2014-05-30  1:58                         ` Dave Chinner
2014-05-30  2:13                           ` Linus Torvalds
2014-05-30  6:21                         ` Minchan Kim
2014-05-30  1:30                 ` Linus Torvalds
2014-05-30  0:15               ` Dave Chinner
2014-05-30  2:12                 ` Minchan Kim
2014-05-30  4:37                   ` Linus Torvalds
2014-05-31  1:45                     ` Linus Torvalds
2014-05-30  6:12                   ` Minchan Kim
2014-06-03 13:28                   ` Rasmus Villemoes
2014-06-03 19:04                     ` Linus Torvalds
2014-06-10 12:29                       ` [PATCH 0/2] Per-task wait_queue_t Rasmus Villemoes
2014-06-10 12:29                         ` [PATCH 1/2] wait: Introduce per-task wait_queue_t Rasmus Villemoes
2014-06-11 15:16                           ` Oleg Nesterov
2014-06-10 12:29                         ` [PATCH 2/2] wait: Use the per-task wait_queue_t in ___wait_event macro Rasmus Villemoes
2014-06-10 15:50                         ` [PATCH 0/2] Per-task wait_queue_t Peter Zijlstra
2014-06-12 21:46                           ` Rasmus Villemoes
2014-05-29  2:42           ` [RFC 2/2] x86_64: expand kernel stack to 16K Linus Torvalds
2014-05-29  5:14             ` H. Peter Anvin
2014-05-29  6:01             ` Rusty Russell
2014-05-29  7:26               ` virtio ring cleanups, which save stack on older gcc Rusty Russell
2014-05-29  7:26                 ` [PATCH 1/4] Hack: measure stack taken by vring from virtio_blk Rusty Russell
2014-05-29 15:39                   ` Linus Torvalds
2014-05-29  7:26                 ` [PATCH 2/4] virtio_net: pass well-formed sg to virtqueue_add_inbuf() Rusty Russell
2014-05-29 10:07                   ` Michael S. Tsirkin
2014-05-29  7:26                 ` [PATCH 3/4] virtio_ring: assume sgs are always well-formed Rusty Russell
2014-05-29 11:18                   ` Michael S. Tsirkin
2014-05-29  7:26                 ` [PATCH 4/4] virtio_ring: unify direct/indirect code paths Rusty Russell
2014-05-29  7:52                   ` Peter Zijlstra
2014-05-29 11:05                     ` Rusty Russell
2014-05-29 11:33                       ` Michael S. Tsirkin
2014-05-29 11:29                   ` Michael S. Tsirkin
2014-05-30  2:37                     ` Rusty Russell
2014-05-30  6:21                       ` Rusty Russell
2014-05-29  7:41                 ` virtio ring cleanups, which save stack on older gcc Minchan Kim
2014-05-29 10:39                   ` Dave Chinner
2014-05-29 11:08                   ` Rusty Russell
2014-05-29 23:45                     ` Minchan Kim
2014-05-30  1:06                       ` Minchan Kim
2014-05-30  6:56                       ` Rusty Russell
2014-05-29  7:26             ` [RFC 2/2] x86_64: expand kernel stack to 16K Dave Chinner
2014-05-29 15:24               ` Linus Torvalds
2014-05-29 23:40                 ` Minchan Kim
2014-05-29 23:53                 ` Dave Chinner
2014-05-30  0:06                   ` Dave Jones
2014-05-30  0:21                     ` Dave Chinner
2014-05-30  0:29                       ` Dave Jones
2014-05-30  0:32                       ` Minchan Kim
2014-05-30  1:34                         ` Dave Chinner
2014-05-30 15:25                           ` H. Peter Anvin
2014-05-30 15:41                             ` Linus Torvalds
2014-05-30 15:52                               ` H. Peter Anvin
2014-05-30 16:06                                 ` Linus Torvalds
2014-05-30 17:24                                   ` Dave Hansen
2014-05-30 18:12                                     ` H. Peter Anvin
2014-10-21  2:00                               ` Dave Jones
2014-10-21  4:59                                 ` Andy Lutomirski
2014-05-30  9:48                 ` Richard Weinberger
2014-05-30 15:36                   ` Linus Torvalds
2014-05-31  2:06             ` Jens Axboe
2014-06-02 22:59               ` Dave Chinner
2014-06-03 13:02               ` Konstantin Khlebnikov
2014-05-29  3:46     ` Minchan Kim
2014-05-29  4:13       ` Linus Torvalds
2014-05-29  5:10         ` Minchan Kim
2014-05-30 21:23     ` Andi Kleen
2014-05-28 16:18 ` [PATCH 1/2] ftrace: print stack usage right before Oops Steven Rostedt
2014-05-29  3:52   ` Minchan Kim
2014-05-29  3:01 ` Steven Rostedt
2014-05-29  3:49   ` Minchan Kim

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).