From: Joerg Roedel <jroedel@suse.de>
To: Steven Rostedt <rostedt@goodmis.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@kernel.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Borislav Petkov <bp@alien8.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shile Zhang <shile.zhang@linux.alibaba.com>,
	Andy Lutomirski <luto@amacapital.net>,
	"Rafael J. Wysocki" <rafael.j.wysocki@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	Tzvetomir Stoyanov <tz.stoyanov@gmail.com>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Subject: Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()
Date: Thu, 30 Apr 2020 16:11:21 +0200
Message-ID: <20200430141120.GA8135@suse.de>
In-Reply-To: <20200429100731.201312a9@gandalf.local.home>

Hi,

On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> Talking with Mathieu about this on IRC, he pointed out that my code does
> have a vzalloc() that is called:
> 
> in trace_pid_write()
> 
> 	pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
> 
> This is done when -P1,2 is on the trace-cmd command line.
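
For readers unfamiliar with the sizing expression in the quoted line, here is a
minimal userspace sketch of a one-bit-per-pid bitmap. It is not the kernel code:
struct pid_bitmap and the pid_bitmap_*() helpers are hypothetical names, and
calloc() merely stands in for vzalloc(). The only part taken from the quote is
the (pid_max + 7) >> 3 rounding from bits to bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for the pid list; not the kernel struct. */
struct pid_bitmap {
	unsigned int pid_max;	/* number of bits in the map */
	unsigned char *bits;	/* one bit per pid */
};

/* Bytes needed for nbits bits, rounded up: (nbits + 7) >> 3. */
static size_t bitmap_bytes(unsigned int nbits)
{
	return ((size_t)nbits + 7) >> 3;
}

/* Allocate a zeroed map; calloc() plays the role of vzalloc() here. */
static int pid_bitmap_init(struct pid_bitmap *pb, unsigned int pid_max)
{
	pb->pid_max = pid_max;
	pb->bits = calloc(bitmap_bytes(pid_max), 1);
	return pb->bits ? 0 : -1;
}

/* Mark a pid as traced. */
static void pid_bitmap_set(struct pid_bitmap *pb, unsigned int pid)
{
	assert(pid < pb->pid_max);
	pb->bits[pid >> 3] |= (unsigned char)(1u << (pid & 7));
}

/* Return 1 if the pid is marked, 0 otherwise (including out of range). */
static int pid_bitmap_test(const struct pid_bitmap *pb, unsigned int pid)
{
	if (pid >= pb->pid_max)
		return 0;
	return (pb->bits[pid >> 3] >> (pid & 7)) & 1;
}
```

With an assumed pid_max of 32768, bitmap_bytes() yields 4096 bytes. In the
kernel the lookup side is the interesting part: every test of a bit is a read
of vmalloc'ed memory, which is where a missing mapping would show up.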

Okay, tracked it down. Some instrumentation in the page-fault and
double-fault handlers gave me the stack-traces. Here is what happens:

As already pointed out, it all happens because of page-faults on the
vzalloc'ed pid bitmap. It starts with this stack-trace:

 RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
 Code: e9 c2 4b 6b 00 cc cc 48 8b 57 28 48 8b 8a b8 00 00 00 48 8b 82 c0 00 00 00 48 85 c0 74 11 48 8b 42 28 65 48 03 05 5d 9c e6 7e <0f> b6 40 7c c3 48 85 c9 75 ea f3 c3 90 48 8b 4f 70 48 83 02 01 48
 RSP: 0018:ffffc90000673bd8 EFLAGS: 00010082
 RAX: ffffe8ffffd8c870 RBX: 0000000000000203 RCX: ffff88810734ca90
 RDX: ffff888451578000 RSI: ffffffff820f3d2a RDI: ffff888453594d68
 RBP: ffff888453594d68 R08: 000000000001845e R09: ffffffff81114ba0
 R10: 0000000000000000 R11: 000000000000000e R12: 4000000000000000
 R13: ffffffff820f3d2a R14: 000000000001845e R15: 4000000000000002
  ? trace_event_raw_event_rcu_fqs+0xa0/0xa0
  trace_event_raw_event_rcu_dyntick+0x89/0xa0
  ? trace_event_raw_event_rcu_dyntick+0x89/0xa0
  ? trace_event_raw_event_rcu_dyntick+0x89/0xa0
  ? insn_get_prefixes.part.2+0x174/0x2d0
  rcu_irq_enter+0xf0/0x1d0
  rcu_irq_enter_irqson+0x21/0x50
  switch_mm_irqs_off+0x43c/0x570
  ? do_one_initcall+0x51/0x210
  __text_poke+0x1a9/0x470
  text_poke_bp_batch+0x73/0x180
  text_poke_flush+0x43/0x50
  arch_jump_label_transform_apply+0x16/0x30
  __static_key_slow_dec_cpuslocked+0x42/0x50
  static_key_slow_dec+0x1f/0x50
  tracepoint_probe_unregister+0x1e2/0x220
  trace_event_reg+0x6a/0x80
  __ftrace_event_enable_disable+0x1ca/0x240
  __ftrace_set_clr_event_nolock+0xe1/0x140
  __ftrace_set_clr_event+0x3d/0x60
  system_enable_write+0x76/0xa0
  vfs_write+0xad/0x1a0
  ? rcu_irq_exit+0xb8/0x170
  ksys_write+0x48/0xb0
  do_syscall_64+0x60/0x200
  entry_SYSCALL_64_after_hwframe+0x49/0xb3
 RIP: 0033:0x7f875b0ecdb0
 Code: Bad RIP value.
 RSP: 002b:00007ffc746be918 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f875b0ecdb0
 RDX: 0000000000000001 RSI: 00007ffc746be93f RDI: 0000000000000004
 RBP: 00007ffc746be9a0 R08: 636172742f6c6500 R09: 0000000002242ee0
 R10: 6f662f7365636e61 R11: 0000000000000246 R12: 000000000040a4b0
 R13: 00007ffc746bebd0 R14: 0000000000000000 R15: 0000000000000000

I haven't figured out how rcu_irq_enter() calls down into
trace_event_raw_event_rcu_dyntick() and further into
trace_event_ignore_this_pid(), but the stacktrace shows it does.

So trace_event_ignore_this_pid() faults on the vzalloc()'ed memory, calling
into the page-fault handler. What happens there is:

 RIP: 0010:trace_event_ignore_this_pid+0x23/0x30
 Code: e9 c2 4b 6b 00 cc cc 48 8b 57 28 48 8b 8a b8 00 00 00 48 8b 82 c0 00 00 00 48 85 c0 74 11 48 8b 42 28 65 48 03 05 5d 9c e6 7e <0f> b6 40 7c c3 48 85 c9 75 ea f3 c3 90 48 8b 4f 70 48 83 02 01 48
 RSP: 0018:ffffc90000673a50 EFLAGS: 00010082
 RAX: ffffe8ffffd8c870 RBX: 0000000000000203 RCX: ffff88810734ca90
 RDX: ffff888451578000 RSI: ffffe8ffffd8c8ec RDI: ffff88844fd98478
 RBP: ffff88844fd98478 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000000 R12: ffffe8ffffd8c8ec
 R13: 0000000000000000 R14: ffffc90000673b28 R15: 0000000000000000
  trace_event_raw_event_x86_exceptions+0x87/0xa0
  ? trace_event_buffer_lock_reserve+0x6e/0x110
  do_page_fault+0x45e/0x630
  ? trace_hardirqs_off_thunk+0x1a/0x37
  page_fault+0x43/0x50

The page-fault handler calls a tracing function which again ends up in
trace_event_ignore_this_pid(), where it faults again. From here on the CPU is in
a page-fault loop, which continues until the stack overflows (with
CONFIG_VMAP_STACK).

Then there is no mapped stack anymore, so the page-fault results in a
double-fault, which uses an IST stack. The double-fault handler does
ist_enter(), which calls into rcu_nmi_enter(), which also has trace-events down
its call-path. I have no stack-trace for this, but what likely happens
now is that it page-faults again while on the IST stack and the
page-fault loops until the #DF IST stack overflows. Then the next #DF
happens and the stack pointer is reset to the top of the #DF IST stack,
starting the loop over again. This loops forever, causing the hang.
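
The sequence described above can be mimicked with a toy userspace model. This
is only a sketch under simplifying assumptions: struct cpu_model, fault_step()
and the frame counts are made up for illustration, each nested page-fault is
counted as one "frame", and nothing here corresponds to real kernel code or
real stack sizes. It just shows why the loop never terminates: every #DF
starts again at the top of the same IST stack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes: nested fault frames that fit on each stack. */
#define TASK_FRAMES 4
#define IST_FRAMES  2

enum stack_id { TASK_STACK, DF_IST_STACK };

struct cpu_model {
	enum stack_id stack;		/* stack currently in use */
	unsigned int frames_left;	/* nested faults that still fit */
	unsigned int page_faults;
	unsigned int double_faults;
};

static void cpu_init(struct cpu_model *c)
{
	c->stack = TASK_STACK;
	c->frames_left = TASK_FRAMES;
	c->page_faults = 0;
	c->double_faults = 0;
}

/*
 * One fault on the unmapped bitmap. While the current stack has room,
 * the page-fault handler runs, its tracing hook touches the bitmap and
 * faults again (modelled by the next call). Once the stack is
 * exhausted, the fault escalates to a double-fault, which always
 * starts at the top of the #DF IST stack.
 */
static void fault_step(struct cpu_model *c)
{
	if (c->frames_left > 0) {
		c->frames_left--;
		c->page_faults++;
		return;
	}
	c->double_faults++;
	c->stack = DF_IST_STACK;
	c->frames_left = IST_FRAMES;
}

/* Run a bounded number of steps; the real loop has no such bound. */
static void run(struct cpu_model *c, unsigned int steps)
{
	assert(c != NULL);
	while (steps--)
		fault_step(c);
}
```

Running the model for any number of steps only ever increases both counters;
there is no state from which it returns to the interrupted code, which matches
the observed hang.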

Regards,

	Joerg

