linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support
@ 2012-01-10 11:48 Srikar Dronamraju
  2012-01-10 11:48 ` [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints Srikar Dronamraju
                   ` (8 more replies)
  0 siblings, 9 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:48 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


This patchset implements Uprobes which enables you to dynamically probe
any routine in a user space application and collect information
non-disruptively.

This patchset resolves most of the comments on the previous posting
() patchset applies on top of
commit (805a6af8dba Linux 3.2)

This patchset depends on Paul McKenney's "rcu: Introduce raw SRCU
read-side primitives" - 0c53dd8b3; and my patches "x86: Clean up and
extend do_int3()" - cc3a1bf52; "x86: Call do_notify_resume() with
interrupts enabled" - 3596ff4e6b
All three commits in -tip tree.

uprobes git is hosted at git://github.com/srikard/linux.git
with branch inode_uprobes_v32. Branch for-next is also 
updated with these changes.

Uprobes Patches
This patchset implements inode based uprobes which are specified as
<file>:<offset> where offset is the offset from start of the map.

When a uprobe is registered, Uprobes makes a copy of the probed
instruction, replaces the first byte(s) of the probed instruction with a
breakpoint instruction. (Uprobes uses background page replacement
mechanism and ensures that the breakpoint affects only that process.)

When a CPU hits the breakpoint instruction, Uprobes gets notified of
trap and finds the associated uprobe. It then executes the associated
handler. Uprobes single-steps its copy of the probed instruction and
resumes execution of the probed process at the instruction following the
probepoint. Instruction copies to be single-stepped are stored in a
per-mm "execution out of line (XOL) area". Currently XOL area is
allocated as one page vma.

For previous postings: please refer: https://lkml.org/lkml/2011/11/18/149
https://lkml.org/lkml/2011/11/10/408 https://lkml.org/lkml/2011/9/20/123
https://lkml.org/lkml/2011/6/7/232 https://lkml.org/lkml/2011/4/1/176
http://lkml.org/lkml/2011/3/14/171/ http://lkml.org/lkml/2010/12/16/65
http://lkml.org/lkml/2010/8/25/165 http://lkml.org/lkml/2010/7/27/121
http://lkml.org/lkml/2010/7/12/67 http://lkml.org/lkml/2010/7/8/239
http://lkml.org/lkml/2010/6/29/299 http://lkml.org/lkml/2010/6/14/41
http://lkml.org/lkml/2010/3/20/107 and http://lkml.org/lkml/2010/5/18/307

This patchset is a rework based on suggestions from discussions on lkml
in September, March and January 2010 (http://lkml.org/lkml/2010/1/11/92,
http://lkml.org/lkml/2010/1/27/19, http://lkml.org/lkml/2010/3/20/107
and http://lkml.org/lkml/2010/3/31/199 ). This implementation of uprobes
doesnt depend on utrace.

Advantages of uprobes over conventional debugging include:

1. Non-disruptive.
Unlike current ptrace based mechanisms, uprobes tracing wouldnt
involve signals, stopping threads and context switching between the
tracer and tracee.

2. Much better handling of multithreaded programs because of XOL.
Current ptrace based mechanisms use single stepping inline, i.e they
copy back the original instruction on hitting a breakpoint.  In such
mechanisms tracers have to stop all the threads on a breakpoint hit or
tracers will not be able to handle all hits to the location of
interest. Uprobes uses execution out of line, where the instruction to
be traced is analysed at the time of breakpoint insertion and a copy
of instruction is stored at a different location.  On breakpoint hit,
uprobes jumps to that copied location and singlesteps the same
instruction and does the necessary fixups post singlestepping.

3. Multiple tracers for an application.
Multiple uprobes based tracer could work in unison to trace an
application. There could one tracer that could be interested in
generic events for a particular set of process. While there could be
another tracer that is just interested in one specific event of a
particular process thats part of the previous set of process.

4. Corelating events from kernels and userspace.
Uprobes could be used with other tools like kprobes, tracepoints or as
part of higher level tools like perf to give a consolidated set of
events from kernel and userspace.  In future we could look at a single
backtrace showing application, library and kernel calls.

Changes from last patchset:
- Rebased to 805a6af8dba Linux 3.2
- Handled comments from Masami on 'perf probe'
- Verify vma returned from find_vma as suggested by Oleg.

Here is the list of TODO Items.

- worker thread if unregister_uprobe were to fail.
- Prefiltering (i.e filtering at the time of probe insertion)
- Return probes.
- Support for other architectures.
- Uprobes booster.
- replace macro W with bits in inat table.

Please refer "[PATCH 3.2 7/9] tracing: uprobes trace_event interface".

Please refer "[PATCH 3.2 9/9] perf: perf interface for uprobes".

Please do provide your valuable comments.

Thanks in advance.
Srikar

Srikar Dronamraju (9)
 0: Uprobes patchset with perf probe support
 1: uprobes: Install and remove breakpoints.
 2: uprobes: handle breakpoint and signal step exception.
 3: uprobes: slot allocation.
 4: uprobes: counter to optimize probe hits.
 5: tracing: modify is_delete, is_return from ints to bool.
 6: tracing: Extract out common code for kprobes/uprobes traceevents.
 7: tracing: uprobes trace_event interface
 8: perf: rename target_module to target
 9: perf: perf interface for uprobes


 Documentation/trace/uprobetracer.txt    |   93 ++
 arch/Kconfig                            |    3 +
 arch/x86/Kconfig                        |    5 +-
 arch/x86/include/asm/thread_info.h      |    2 +
 arch/x86/include/asm/uprobes.h          |   59 ++
 arch/x86/kernel/Makefile                |    1 +
 arch/x86/kernel/signal.c                |    6 +
 arch/x86/kernel/uprobes.c               |  675 ++++++++++++++
 include/linux/mm_types.h                |    5 +
 include/linux/sched.h                   |    4 +
 include/linux/uprobes.h                 |  171 ++++
 kernel/Makefile                         |    1 +
 kernel/fork.c                           |   15 +
 kernel/signal.c                         |    3 +
 kernel/trace/Kconfig                    |   20 +
 kernel/trace/Makefile                   |    2 +
 kernel/trace/trace.h                    |    5 +
 kernel/trace/trace_kprobe.c             |  899 +-----------------
 kernel/trace/trace_probe.c              |  786 ++++++++++++++++
 kernel/trace/trace_probe.h              |  162 ++++
 kernel/trace/trace_uprobe.c             |  768 +++++++++++++++
 kernel/uprobes.c                        | 1546 +++++++++++++++++++++++++++++++
 mm/mmap.c                               |   33 +-
 tools/perf/Documentation/perf-probe.txt |   14 +
 tools/perf/builtin-probe.c              |   49 +-
 tools/perf/util/probe-event.c           |  430 +++++++--
 tools/perf/util/probe-event.h           |   12 +-
 tools/perf/util/symbol.c                |    8 +
 tools/perf/util/symbol.h                |    1 +
 29 files changed, 4790 insertions(+), 988 deletions(-)


^ permalink raw reply	[flat|nested] 45+ messages in thread

* [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
@ 2012-01-10 11:48 ` Srikar Dronamraju
  2012-01-25 14:39   ` Denys Vlasenko
                     ` (2 more replies)
  2012-01-10 11:48 ` [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception Srikar Dronamraju
                   ` (7 subsequent siblings)
  8 siblings, 3 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:48 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


Uprobes are maintained in an rb-tree indexed by inode and offset (the
offset here is from the start of the mapping). For a unique (inode,
offset) tuple, there can be atmost one uprobe in the rb-tree.

Since the (inode, offset) tuple identifies a unique uprobe, more
than one user may be interested in the same uprobe. Provides the
ability to hook multiple 'consumers' for the same uprobe.

Each consumer defines a handler and a filter (optional). The
'handler' is run every time the uprobe is hit, if it matches the
'filter' criteria.

The first consumer of a uprobe causes the breakpoint to be inserted
at the specified address and subsequent consumers are appended to
this list.  On subsequent probes, the consumer gets appended to the
existing list of consumers. The breakpoint is removed when the last
consumer unregisters. For all other unregisterations, the consumer
is removed from the list of consumers.

Given a inode, we get a list of the mms that have mapped the inode. Do
the actual registration if mm maps the page where a probe needs to
be inserted/removed.

We use a temporary list to walk thro the vmas that map the inode.
- The number of maps that map the inode, is not known before we walk
  the rmap and keeps changing.
- extending vm_area_struct wasnt recommended.
- There can be more than one maps of the inode in the same mm.

Hook mmap to keep an eye on text vmas that are of interest to
uprobes.  When a vma of interest is mapped, insert the breakpoint at
the right address.

Uprobe works by replacing the instruction at the address defined by
(inode, offset) with the arch specific breakpoint instruction. Save a
copy of the original instruction at the uprobed address.

This is needed for:
a. executing the instruction out-of-line (xol).
b. instruction analysis for any subsequent fixups.
c. restoring the instruction back when the uprobe is unregistered.

We insert or delete a breakpoint instruction, and this breakpoint
instruction is assumed to be the smallest instruction available on
the platform. For fixed size instruction platforms this is trivially
true, for variable size instruction platforms the breakpoint
instruction is typically the smallest (often a single byte).

Writing the instruction is done by COWing the page and changing the
instruction during the copy, this even though most platforms allow
atomic writes of the breakpoint instruction. This also mirrors the
behaviour of a ptrace() memory write to a PRIVATE file map.

The core worker is derived from ksm's replace_page() logic.

In essence, similar to ksm:
a. allocate a new page and copy over contents of the page that has the
   uprobed vaddr
b. modify the copy and insert the breakpoint at the required address
c. switch the original page with the copy containing the breakpoint
d. flush page tables.

replace_page is being replicated here because of some minor changes
in the type of pages and also because Hugh Dickins had plans to
improve replace_page for ksm specific work.

Instruction analysis on x86 is based on instruction decoder and
determines if an instruction can be probed and determines the
necessary fixups after singlestep.  Instruction analysis is done at
probe insertion time so that we avoid having to repeat the same
analysis every time a probe is hit.

A lot of code here is due to the improvement/suggestions/inputs from
Peter Zijlstra.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog:(Since v5)
- Modified del_consumer as per comments from Peter.
- Drop reference to inode before dropping reference to uprobe.
- Use i_size_read(inode) instead of inode->i_size.
- Ensure uprobe->consumers is NULL, before __unregister_uprobe() is called.
- Includes errno.h as recommended by Stephen Rothwell to fix a build issue
  on sparc defconfig
- Remove restrictions while unregistering.
- Earlier code leaked inode references under some conditions while
   registering/unregistering.
- Continue the vma-rmap walk even if the intermediate vma doesnt
   meet the requirements.
- Validate the vma found by find_vma before inserting/removing the
   breakpoint
- Call del_consumer under mutex_lock.
- Use hash locks.
- Handle mremap.
- Introduce find_least_offset_node() instead of close match logic in
  find_uprobe
- Uprobes no more depends on MM_OWNER; No reference to task_structs
  while inserting/removing a probe.
- Uses read_mapping_page instead of grab_cache_page so that the pages
  have valid content.
- pass NULL to get_user_pages for the task parameter.
- call SetPageUptodate on the new page allocated in write_opcode.
- fix leaking a reference to the new page under certain conditions.
- Include Instruction Decoder if Uprobes gets defined.
- Remove const attributes for instruction prefix arrays.
- Uses mm_context to know if the application is 32 bit.

Changelog: (Since v7) : Handle comments from Peter Zijlstra
- Dont take reference to inode. (expect inode to register_uprobe to be sane).
- Use PTR_ERR to set the return value.
- No need to take reference to inode.
- use PTR_ERR to return error value.
- register and unregister_uprobe share code.

 arch/Kconfig                   |   11 
 arch/x86/Kconfig               |    5 
 arch/x86/include/asm/uprobes.h |   42 ++
 arch/x86/kernel/Makefile       |    1 
 arch/x86/kernel/uprobes.c      |  403 +++++++++++++++++
 include/linux/uprobes.h        |   98 ++++
 kernel/Makefile                |    1 
 kernel/uprobes.c               |  976 ++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                      |   23 +
 9 files changed, 1559 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/include/asm/uprobes.h
 create mode 100644 arch/x86/kernel/uprobes.c
 create mode 100644 include/linux/uprobes.h
 create mode 100644 kernel/uprobes.c

diff --git a/arch/Kconfig b/arch/Kconfig
index 4b0669c..6328b49 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -61,6 +61,17 @@ config OPTPROBES
 	depends on KPROBES && HAVE_OPTPROBES
 	depends on !PREEMPT
 
+config UPROBES
+	bool "User-space probes (EXPERIMENTAL)"
+	depends on ARCH_SUPPORTS_UPROBES
+	default n
+	help
+	  Uprobes enables kernel subsystems to establish probepoints
+	  in user applications and execute handler functions when
+	  the probepoints are hit.
+
+	  If in doubt, say "N".
+
 config HAVE_EFFICIENT_UNALIGNED_ACCESS
 	bool
 	help
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index efb4294..9c821ee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -77,7 +77,7 @@ config X86
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 
 config INSTRUCTION_DECODER
-	def_bool (KPROBES || PERF_EVENTS)
+	def_bool (KPROBES || PERF_EVENTS || UPROBES)
 
 config OUTPUT_FORMAT
 	string
@@ -249,6 +249,9 @@ config ARCH_CPU_PROBE_RELEASE
 	def_bool y
 	depends on HOTPLUG_CPU
 
+config ARCH_SUPPORTS_UPROBES
+	def_bool y
+
 source "init/Kconfig"
 source "kernel/Kconfig.freezer"
 
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
new file mode 100644
index 0000000..8208234
--- /dev/null
+++ b/arch/x86/include/asm/uprobes.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_UPROBES_H
+#define _ASM_UPROBES_H
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+typedef u8 uprobe_opcode_t;
+#define MAX_UINSN_BYTES 16
+#define UPROBES_XOL_SLOT_BYTES	128	/* to keep it cache aligned */
+
+#define UPROBES_BKPT_INSN 0xcc
+#define UPROBES_BKPT_INSN_SIZE 1
+
+struct uprobe_arch_info {
+	u16			fixups;
+#ifdef CONFIG_X86_64
+	unsigned long rip_rela_target_address;
+#endif
+};
+
+struct uprobe;
+extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
+#endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8baca3c..8f28be8 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
 
 obj-$(CONFIG_SWIOTLB)			+= pci-swiotlb.o
 obj-$(CONFIG_OF)			+= devicetree.o
+obj-$(CONFIG_UPROBES)			+= uprobes.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..a980d1e
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,403 @@
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/uprobes.h>
+
+#include <linux/kdebug.h>
+#include <asm/insn.h>
+
+/* Post-execution fixups. */
+
+/* No fixup needed */
+#define UPROBES_FIX_NONE	0x0
+/* Adjust IP back to vicinity of actual insn */
+#define UPROBES_FIX_IP		0x1
+/* Adjust the return address of a call insn */
+#define UPROBES_FIX_CALL	0x2
+
+#define UPROBES_FIX_RIP_AX	0x8000
+#define UPROBES_FIX_RIP_CX	0x4000
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG(insn->modrm.value)
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
+	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
+	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
+	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
+	 << (row % 32))
+
+#ifdef CONFIG_X86_64
+static volatile u32 good_insns_64[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 20 */
+	W(0x30, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 30 */
+	W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+	W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+	W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+#endif
+
+/* Good-instruction tables for 32-bit apps */
+
+static volatile u32 good_insns_32[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* 20 */
+	W(0x30, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) , /* 30 */
+	W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+	W(0xd0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+	W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+
+/* Using this for both 64-bit and 32-bit apps */
+static volatile u32 good_2byte_insns[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+	W(0x30, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+	W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+	W(0xd0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+	W(0xf0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+
+#undef W
+
+/*
+ * opcodes we'll probably never support:
+ * 6c-6d, e4-e5, ec-ed - in
+ * 6e-6f, e6-e7, ee-ef - out
+ * cc, cd - int3, int
+ * cf - iret
+ * d6 - illegal instruction
+ * f1 - int1/icebp
+ * f4 - hlt
+ * fa, fb - cli, sti
+ * 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2
+ *
+ * invalid opcodes in 64-bit mode:
+ * 06, 0e, 16, 1e, 27, 2f, 37, 3f, 60-62, 82, c4-c5, d4-d5
+ *
+ * 63 - we support this opcode in x86_64 but not in i386.
+ *
+ * opcodes we may need to refine support for:
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field.  On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes.  These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ *	but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+	int i;
+
+	for (i = 0; i < insn->prefixes.nbytes; i++) {
+		switch (insn->prefixes.bytes[i]) {
+		case 0x26:	/*INAT_PFX_ES   */
+		case 0x2E:	/*INAT_PFX_CS   */
+		case 0x36:	/*INAT_PFX_DS   */
+		case 0x3E:	/*INAT_PFX_SS   */
+		case 0xF0:	/*INAT_PFX_LOCK */
+			return true;
+		}
+	}
+	return false;
+}
+
+static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
+{
+	insn_init(insn, uprobe->insn, false);
+
+	/* Skip good instruction prefixes; reject "bad" ones. */
+	insn_get_opcode(insn);
+	if (is_prefix_bad(insn))
+		return -ENOTSUPP;
+	if (test_bit(OPCODE1(insn), (unsigned long *)good_insns_32))
+		return 0;
+	if (insn->opcode.nbytes == 2) {
+		if (test_bit(OPCODE2(insn), (unsigned long *)good_2byte_insns))
+			return 0;
+	}
+	return -ENOTSUPP;
+}
+
+/*
+ * Figure out which fixups post_xol() will need to perform, and annotate
+ * uprobe->arch_info.fixups accordingly.  To start with,
+ * uprobe->arch_info.fixups is either zero or it reflects rip-related
+ * fixups.
+ */
+static void prepare_fixups(struct uprobe *uprobe, struct insn *insn)
+{
+	bool fix_ip = true, fix_call = false;	/* defaults */
+	int reg;
+
+	insn_get_opcode(insn);	/* should be a nop */
+
+	switch (OPCODE1(insn)) {
+	case 0xc3:		/* ret/lret */
+	case 0xcb:
+	case 0xc2:
+	case 0xca:
+		/* ip is correct */
+		fix_ip = false;
+		break;
+	case 0xe8:		/* call relative - Fix return addr */
+		fix_call = true;
+		break;
+	case 0x9a:		/* call absolute - Fix return addr, not ip */
+		fix_call = true;
+		fix_ip = false;
+		break;
+	case 0xff:
+		insn_get_modrm(insn);
+		reg = MODRM_REG(insn);
+		if (reg == 2 || reg == 3) {
+			/* call or lcall, indirect */
+			/* Fix return addr; ip is correct. */
+			fix_call = true;
+			fix_ip = false;
+		} else if (reg == 4 || reg == 5) {
+			/* jmp or ljmp, indirect */
+			/* ip is correct. */
+			fix_ip = false;
+		}
+		break;
+	case 0xea:		/* jmp absolute -- ip is correct */
+		fix_ip = false;
+		break;
+	default:
+		break;
+	}
+	if (fix_ip)
+		uprobe->arch_info.fixups |= UPROBES_FIX_IP;
+	if (fix_call)
+		uprobe->arch_info.fixups |= UPROBES_FIX_CALL;
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * If uprobe->insn doesn't use rip-relative addressing, return
+ * immediately.  Otherwise, rewrite the instruction so that it accesses
+ * its memory operand indirectly through a scratch register.  Set
+ * uprobe->arch_info.fixups and uprobe->arch_info.rip_rela_target_address
+ * accordingly.  (The contents of the scratch register will be saved
+ * before we single-step the modified instruction, and restored
+ * afterward.)
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area.  At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ * - There's always a modrm byte.
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ */
+static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
+							struct insn *insn)
+{
+	u8 *cursor;
+	u8 reg;
+
+	if (mm->context.ia32_compat)
+		return;
+
+	uprobe->arch_info.rip_rela_target_address = 0x0;
+	if (!insn_rip_relative(insn))
+		return;
+
+	/*
+	 * Point cursor at the modrm byte.  The next 4 bytes are the
+	 * displacement.  Beyond the displacement, for some instructions,
+	 * is the immediate operand.
+	 */
+	cursor = uprobe->insn + insn->prefixes.nbytes
+			+ insn->rex_prefix.nbytes + insn->opcode.nbytes;
+	insn_get_length(insn);
+
+	/*
+	 * Convert from rip-relative addressing to indirect addressing
+	 * via a scratch register.  Change the r/m field from 0x5 (%rip)
+	 * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
+	 */
+	reg = MODRM_REG(insn);
+	if (reg == 0) {
+		/*
+		 * The register operand (if any) is either the A register
+		 * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
+		 * REX prefix) %r8.  In any case, we know the C register
+		 * is NOT the register operand, so we use %rcx (register
+		 * #1) for the scratch register.
+		 */
+		uprobe->arch_info.fixups = UPROBES_FIX_RIP_CX;
+		/* Change modrm from 00 000 101 to 00 000 001. */
+		*cursor = 0x1;
+	} else {
+		/* Use %rax (register #0) for the scratch register. */
+		uprobe->arch_info.fixups = UPROBES_FIX_RIP_AX;
+		/* Change modrm from 00 xxx 101 to 00 xxx 000 */
+		*cursor = (reg << 3);
+	}
+
+	/* Target address = address of next instruction + (signed) offset */
+	uprobe->arch_info.rip_rela_target_address = (long)insn->length
+					+ insn->displacement.value;
+	/* Displacement field is gone; slide immediate field (if any) over. */
+	if (insn->immediate.nbytes) {
+		cursor++;
+		memmove(cursor, cursor + insn->displacement.nbytes,
+						insn->immediate.nbytes);
+	}
+	return;
+}
+
+static int validate_insn_64bits(struct uprobe *uprobe, struct insn *insn)
+{
+	insn_init(insn, uprobe->insn, true);
+
+	/* Skip good instruction prefixes; reject "bad" ones. */
+	insn_get_opcode(insn);
+	if (is_prefix_bad(insn))
+		return -ENOTSUPP;
+	if (test_bit(OPCODE1(insn), (unsigned long *)good_insns_64))
+		return 0;
+	if (insn->opcode.nbytes == 2) {
+		if (test_bit(OPCODE2(insn), (unsigned long *)good_2byte_insns))
+			return 0;
+	}
+	return -ENOTSUPP;
+}
+
+static int validate_insn_bits(struct mm_struct *mm, struct uprobe *uprobe,
+				struct insn *insn)
+{
+	if (mm->context.ia32_compat)
+		return validate_insn_32bits(uprobe, insn);
+	return validate_insn_64bits(uprobe, insn);
+}
+#else
+static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
+							struct insn *insn)
+{
+	return;
+}
+
+static int validate_insn_bits(struct mm_struct *mm, struct uprobe *uprobe,
+				struct insn *insn)
+{
+	return validate_insn_32bits(uprobe, insn);
+}
+#endif /* CONFIG_X86_64 */
+
+/**
+ * analyze_insn - instruction analysis including validity and fixups.
+ * @mm: the probed address space.
+ * @uprobe: the probepoint information.
+ * Return 0 on success or a -ve number on error.
+ */
+int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe)
+{
+	int ret;
+	struct insn insn;
+
+	uprobe->arch_info.fixups = 0;
+	ret = validate_insn_bits(mm, uprobe, &insn);
+	if (ret != 0)
+		return ret;
+	handle_riprel_insn(mm, uprobe, &insn);
+	prepare_fixups(uprobe, &insn);
+	return 0;
+}
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
new file mode 100644
index 0000000..f1d13fd
--- /dev/null
+++ b/include/linux/uprobes.h
@@ -0,0 +1,98 @@
+#ifndef _LINUX_UPROBES_H
+#define _LINUX_UPROBES_H
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/errno.h>
+#include <linux/rbtree.h>
+
+struct vm_area_struct;
+#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
+#include <asm/uprobes.h>
+#else
+
+typedef u8 uprobe_opcode_t;
+struct uprobe_arch_info {};
+
+#define MAX_UINSN_BYTES 4
+#endif
+
+#define uprobe_opcode_sz sizeof(uprobe_opcode_t)
+
+/* flags that denote/change uprobes behaviour */
+/* Have a copy of original instruction */
+#define UPROBES_COPY_INSN	0x1
+/* Dont run handlers when first register/ last unregister in progress*/
+#define UPROBES_RUN_HANDLER	0x2
+
+struct uprobe_consumer {
+	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
+	/*
+	 * filter is optional; If a filter exists, handler is run
+	 * if and only if filter returns true.
+	 */
+	bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
+
+	struct uprobe_consumer *next;
+};
+
+struct uprobe {
+	struct rb_node		rb_node;	/* node in the rb tree */
+	atomic_t		ref;
+	struct rw_semaphore	consumer_rwsem;
+	struct list_head	pending_list;
+	struct uprobe_arch_info arch_info;
+	struct uprobe_consumer	*consumers;
+	struct inode		*inode;		/* Also hold a ref to inode */
+	loff_t			offset;
+	int			flags;
+	u8			insn[MAX_UINSN_BYTES];
+};
+
+#ifdef CONFIG_UPROBES
+extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
+							unsigned long vaddr);
+extern int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe,
+					unsigned long vaddr, bool verify);
+extern bool __weak is_bkpt_insn(uprobe_opcode_t *insn);
+extern int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer);
+extern void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer);
+extern int mmap_uprobe(struct vm_area_struct *vma);
+#else /* CONFIG_UPROBES is not defined */
+static inline int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	return -ENOSYS;
+}
+static inline void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+}
+static inline int mmap_uprobe(struct vm_area_struct *vma)
+{
+	return 0;
+}
+#endif /* CONFIG_UPROBES */
+#endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..9fb670d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_UPROBES) += uprobes.o
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
new file mode 100644
index 0000000..72e8bb3
--- /dev/null
+++ b/kernel/uprobes.c
@@ -0,0 +1,976 @@
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/highmem.h>
+#include <linux/pagemap.h>	/* read_mapping_page */
+#include <linux/slab.h>
+#include <linux/sched.h>
+#include <linux/rmap.h>		/* anon_vma_prepare */
+#include <linux/mmu_notifier.h>	/* set_pte_at_notify */
+#include <linux/swap.h>		/* try_to_free_swap */
+#include <linux/uprobes.h>
+
+static struct rb_root uprobes_tree = RB_ROOT;
+static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
+
+#define UPROBES_HASH_SZ	13
+/* serialize (un)register */
+static struct mutex uprobes_mutex[UPROBES_HASH_SZ];
+#define uprobes_hash(v)	(&uprobes_mutex[((unsigned long)(v)) %\
+						UPROBES_HASH_SZ])
+
+/* serialize uprobe->pending_list */
+static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
+#define uprobes_mmap_hash(v)	(&uprobes_mmap_mutex[((unsigned long)(v)) %\
+						UPROBES_HASH_SZ])
+
+/*
+ * uprobe_events allows us to skip the mmap_uprobe if there are no uprobe
+ * events active at this time.  Probably a fine grained per inode count is
+ * better?
+ */
+static atomic_t uprobe_events = ATOMIC_INIT(0);
+
+/*
+ * Maintain a temporary per vma info that can be used to search if a vma
+ * has already been handled. This structure is introduced since extending
+ * vm_area_struct wasnt recommended.
+ */
+struct vma_info {
+	struct list_head probe_list;
+	struct mm_struct *mm;
+	loff_t vaddr;
+};
+
+/*
+ * valid_vma: Verify if the specified vma is an executable vma
+ * Relax restrictions while unregistering: vm_flags might have
+ * changed after breakpoint was inserted.
+ *	- is_register: indicates if we are in register context.
+ *	- Return 1 if the specified virtual address is in an
+ *	  executable vma.
+ */
+static bool valid_vma(struct vm_area_struct *vma, bool is_register)
+{
+	if (!vma->vm_file)
+		return false;
+
+	if (!is_register)
+		return true;
+
+	if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
+						(VM_READ|VM_EXEC))
+		return true;
+
+	return false;
+}
+
+static loff_t vma_address(struct vm_area_struct *vma, loff_t offset)
+{
+	loff_t vaddr;
+
+	vaddr = vma->vm_start + offset;
+	vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+	return vaddr;
+}
+
+/**
+ * __replace_page - replace page in vma by new page.
+ * based on replace_page in mm/ksm.c
+ *
+ * @vma:      vma that holds the pte pointing to page
+ * @page:     the cowed page we are replacing by kpage
+ * @kpage:    the modified page we replace page by
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int __replace_page(struct vm_area_struct *vma, struct page *page,
+					struct page *kpage)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int err = -EFAULT;
+
+	addr = page_address_in_vma(page, vma);
+	if (addr == -EFAULT)
+		goto out;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out;
+
+	get_page(kpage);
+	page_add_new_anon_rmap(kpage, vma, addr);
+
+	flush_cache_page(vma, addr, pte_pfn(*ptep));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+
+	page_remove_rmap(page);
+	if (!page_mapped(page))
+		try_to_free_swap(page);
+	put_page(page);
+	pte_unmap_unlock(ptep, ptl);
+	err = 0;
+
+out:
+	return err;
+}
+
+/**
+ * is_bkpt_insn - check if instruction is breakpoint instruction.
+ * @insn: instruction to be checked.
+ * Default implementation of is_bkpt_insn
+ * Returns true if @insn is a breakpoint instruction.
+ */
+bool __weak is_bkpt_insn(uprobe_opcode_t *insn)
+{
+	return (*insn == UPROBES_BKPT_INSN);
+}
+
+/*
+ * NOTE:
+ * Expect the breakpoint instruction to be the smallest size instruction for
+ * the architecture. If an arch has variable length instruction and the
+ * breakpoint instruction is not of the smallest length instruction
+ * supported by that architecture then we need to modify read_opcode /
+ * write_opcode accordingly. This would never be a problem for archs that
+ * have fixed length instructions.
+ */
+
+/*
+ * write_opcode - write the opcode at a given virtual address.
+ * @mm: the probed process address space.
+ * @uprobe: the breakpointing information.
+ * @vaddr: the virtual address to store the opcode.
+ * @opcode: opcode to be written at @vaddr.
+ *
+ * Called with mm->mmap_sem held (for read and with a reference to
+ * mm).
+ *
+ * For mm @mm, write the opcode at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe,
+			unsigned long vaddr, uprobe_opcode_t opcode)
+{
+	struct page *old_page, *new_page;
+	struct address_space *mapping;
+	void *vaddr_old, *vaddr_new;
+	struct vm_area_struct *vma;
+	loff_t addr;
+	int ret;
+
+	/* Read the page with vaddr into memory */
+	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma);
+	if (ret <= 0)
+		return ret;
+	ret = -EINVAL;
+
+	/*
+	 * We are interested in text pages only. Our pages of interest
+	 * should be mapped for read and execute only. We desist from
+	 * adding probes in write mapped pages since the breakpoints
+	 * might end up in the file copy.
+	 */
+	if (!valid_vma(vma, is_bkpt_insn(&opcode)))
+		goto put_out;
+
+	mapping = uprobe->inode->i_mapping;
+	if (mapping != vma->vm_file->f_mapping)
+		goto put_out;
+
+	addr = vma_address(vma, uprobe->offset);
+	if (vaddr != (unsigned long)addr)
+		goto put_out;
+
+	ret = -ENOMEM;
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
+	if (!new_page)
+		goto put_out;
+
+	__SetPageUptodate(new_page);
+
+	/*
+	 * lock page will serialize against do_wp_page()'s
+	 * PageAnon() handling
+	 */
+	lock_page(old_page);
+	/* copy the page now that we've got it stable */
+	vaddr_old = kmap_atomic(old_page);
+	vaddr_new = kmap_atomic(new_page);
+
+	memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
+	/* poke the new insn in, ASSUMES we don't cross page boundary */
+	vaddr &= ~PAGE_MASK;
+	BUG_ON(vaddr + uprobe_opcode_sz > PAGE_SIZE);
+	memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
+
+	kunmap_atomic(vaddr_new);
+	kunmap_atomic(vaddr_old);
+
+	ret = anon_vma_prepare(vma);
+	if (ret)
+		goto unlock_out;
+
+	lock_page(new_page);
+	ret = __replace_page(vma, old_page, new_page);
+	unlock_page(new_page);
+
+unlock_out:
+	unlock_page(old_page);
+	page_cache_release(new_page);
+
+put_out:
+	put_page(old_page);	/* we did a get_page in the beginning */
+	return ret;
+}
+
+/**
+ * read_opcode - read the opcode at a given virtual address.
+ * @mm: the probed process address space.
+ * @vaddr: the virtual address to read the opcode.
+ * @opcode: location to store the read opcode.
+ *
+ * Called with mm->mmap_sem held (for read and with a reference to
+ * mm.
+ *
+ * For mm @mm, read the opcode at @vaddr and store it in @opcode.
+ * Return 0 (success) or a negative errno.
+ */
+static int read_opcode(struct mm_struct *mm, unsigned long vaddr,
+						uprobe_opcode_t *opcode)
+{
+	struct page *page;
+	void *vaddr_new;
+	int ret;
+
+	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &page, NULL);
+	if (ret <= 0)
+		return ret;
+
+	lock_page(page);
+	vaddr_new = kmap_atomic(page);
+	vaddr &= ~PAGE_MASK;
+	memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
+	kunmap_atomic(vaddr_new);
+	unlock_page(page);
+	put_page(page);		/* we did a get_user_pages in the beginning */
+	return 0;
+}
+
+static int is_bkpt_at_addr(struct mm_struct *mm, unsigned long vaddr)
+{
+	uprobe_opcode_t opcode;
+	int result = read_opcode(mm, vaddr, &opcode);
+
+	if (result)
+		return result;
+
+	if (is_bkpt_insn(&opcode))
+		return 1;
+
+	return 0;
+}
+
+/**
+ * set_bkpt - store breakpoint at a given address.
+ * @mm: the probed process address space.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ *
+ * For mm @mm, store the breakpoint instruction at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
+						unsigned long vaddr)
+{
+	int result = is_bkpt_at_addr(mm, vaddr);
+
+	if (result == 1)
+		return -EEXIST;
+
+	if (result)
+		return result;
+
+	return write_opcode(mm, uprobe, vaddr, UPROBES_BKPT_INSN);
+}
+
+/**
+ * set_orig_insn - Restore the original instruction.
+ * @mm: the probed process address space.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ * @verify: if true, verify existance of breakpoint instruction.
+ *
+ * For mm @mm, restore the original opcode (opcode) at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe,
+					unsigned long vaddr, bool verify)
+{
+	if (verify) {
+		int result = is_bkpt_at_addr(mm, vaddr);
+
+		if (!result)
+			return -EINVAL;
+
+		if (result != 1)
+			return result;
+	}
+	return write_opcode(mm, uprobe, vaddr,
+				*(uprobe_opcode_t *)uprobe->insn);
+}
+
+static int match_uprobe(struct uprobe *l, struct uprobe *r)
+{
+	if (l->inode < r->inode)
+		return -1;
+	if (l->inode > r->inode)
+		return 1;
+	else {
+		if (l->offset < r->offset)
+			return -1;
+
+		if (l->offset > r->offset)
+			return 1;
+	}
+
+	return 0;
+}
+
+static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
+{
+	struct uprobe u = { .inode = inode, .offset = offset };
+	struct rb_node *n = uprobes_tree.rb_node;
+	struct uprobe *uprobe;
+	int match;
+
+	while (n) {
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		match = match_uprobe(&u, uprobe);
+		if (!match) {
+			atomic_inc(&uprobe->ref);
+			return uprobe;
+		}
+		if (match < 0)
+			n = n->rb_left;
+		else
+			n = n->rb_right;
+	}
+	return NULL;
+}
+
+/*
+ * Find a uprobe corresponding to a given inode:offset
+ * Acquires uprobes_treelock
+ */
+static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
+{
+	struct uprobe *uprobe;
+	unsigned long flags;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	uprobe = __find_uprobe(inode, offset);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	return uprobe;
+}
+
+static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
+{
+	struct rb_node **p = &uprobes_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct uprobe *u;
+	int match;
+
+	while (*p) {
+		parent = *p;
+		u = rb_entry(parent, struct uprobe, rb_node);
+		match = match_uprobe(uprobe, u);
+		if (!match) {
+			atomic_inc(&u->ref);
+			return u;
+		}
+
+		if (match < 0)
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+
+	}
+	u = NULL;
+	rb_link_node(&uprobe->rb_node, parent, p);
+	rb_insert_color(&uprobe->rb_node, &uprobes_tree);
+	/* get access + creation ref */
+	atomic_set(&uprobe->ref, 2);
+	return u;
+}
+
+/*
+ * Acquires uprobes_treelock.
+ * Matching uprobe already exists in rbtree;
+ *	increment (access refcount) and return the matching uprobe.
+ *
+ * No matching uprobe; insert the uprobe in rb_tree;
+ *	get a double refcount (access + creation) and return NULL.
+ */
+static struct uprobe *insert_uprobe(struct uprobe *uprobe)
+{
+	unsigned long flags;
+	struct uprobe *u;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	u = __insert_uprobe(uprobe);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	return u;
+}
+
+static void put_uprobe(struct uprobe *uprobe)
+{
+	if (atomic_dec_and_test(&uprobe->ref))
+		kfree(uprobe);
+}
+
+static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
+{
+	struct uprobe *uprobe, *cur_uprobe;
+
+	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+	if (!uprobe)
+		return NULL;
+
+	uprobe->inode = igrab(inode);
+	uprobe->offset = offset;
+	init_rwsem(&uprobe->consumer_rwsem);
+	INIT_LIST_HEAD(&uprobe->pending_list);
+
+	/* add to uprobes_tree, sorted on inode:offset */
+	cur_uprobe = insert_uprobe(uprobe);
+
+	/* a uprobe exists for this inode:offset combination */
+	if (cur_uprobe) {
+		kfree(uprobe);
+		uprobe = cur_uprobe;
+		iput(inode);
+	} else
+		atomic_inc(&uprobe_events);
+	return uprobe;
+}
+
+/* Returns the previous consumer */
+static struct uprobe_consumer *add_consumer(struct uprobe *uprobe,
+				struct uprobe_consumer *consumer)
+{
+	down_write(&uprobe->consumer_rwsem);
+	consumer->next = uprobe->consumers;
+	uprobe->consumers = consumer;
+	up_write(&uprobe->consumer_rwsem);
+	return consumer->next;
+}
+
+/*
+ * For uprobe @uprobe, delete the consumer @consumer.
+ * Return true if the @consumer is deleted successfully
+ * or return false.
+ */
+static bool del_consumer(struct uprobe *uprobe,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe_consumer **con;
+	bool ret = false;
+
+	down_write(&uprobe->consumer_rwsem);
+	for (con = &uprobe->consumers; *con; con = &(*con)->next) {
+		if (*con == consumer) {
+			*con = consumer->next;
+			ret = true;
+			break;
+		}
+	}
+	up_write(&uprobe->consumer_rwsem);
+	return ret;
+}
+
+static int __copy_insn(struct address_space *mapping,
+			struct vm_area_struct *vma, char *insn,
+			unsigned long nbytes, unsigned long offset)
+{
+	struct file *filp = vma->vm_file;
+	struct page *page;
+	void *vaddr;
+	unsigned long off1;
+	unsigned long idx;
+
+	if (!filp)
+		return -EINVAL;
+
+	idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT);
+	off1 = offset &= ~PAGE_MASK;
+
+	/*
+	 * Ensure that the page that has the original instruction is
+	 * populated and in page-cache.
+	 */
+	page = read_mapping_page(mapping, idx, filp);
+	if (IS_ERR(page))
+		return PTR_ERR(page);
+
+	vaddr = kmap_atomic(page);
+	memcpy(insn, vaddr + off1, nbytes);
+	kunmap_atomic(vaddr);
+	page_cache_release(page);
+	return 0;
+}
+
+static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma,
+					unsigned long addr)
+{
+	struct address_space *mapping;
+	int bytes;
+	unsigned long nbytes;
+
+	addr &= ~PAGE_MASK;
+	nbytes = PAGE_SIZE - addr;
+	mapping = uprobe->inode->i_mapping;
+
+	/* Instruction at end of binary; copy only available bytes */
+	if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size)
+		bytes = uprobe->inode->i_size - uprobe->offset;
+	else
+		bytes = MAX_UINSN_BYTES;
+
+	/* Instruction at the page-boundary; copy bytes in second page */
+	if (nbytes < bytes) {
+		if (__copy_insn(mapping, vma, uprobe->insn + nbytes,
+				bytes - nbytes, uprobe->offset + nbytes))
+			return -ENOMEM;
+
+		bytes = nbytes;
+	}
+	return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset);
+}
+
+static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
+				struct vm_area_struct *vma, loff_t vaddr)
+{
+	unsigned long addr;
+	int ret;
+
+	/*
+	 * If probe is being deleted, unregister thread could be done with
+	 * the vma-rmap-walk through. Adding a probe now can be fatal since
+	 * nobody will be able to cleanup. Also we could be from fork or
+	 * mremap path, where the probe might have already been inserted.
+	 * Hence behave as if probe already existed.
+	 */
+	if (!uprobe->consumers)
+		return -EEXIST;
+
+	addr = (unsigned long)vaddr;
+	if (!(uprobe->flags & UPROBES_COPY_INSN)) {
+		ret = copy_insn(uprobe, vma, addr);
+		if (ret)
+			return ret;
+
+		if (is_bkpt_insn((uprobe_opcode_t *)uprobe->insn))
+			return -EEXIST;
+
+		ret = analyze_insn(mm, uprobe);
+		if (ret)
+			return ret;
+
+		uprobe->flags |= UPROBES_COPY_INSN;
+	}
+	ret = set_bkpt(mm, uprobe, addr);
+
+	return ret;
+}
+
+static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
+							loff_t vaddr)
+{
+	set_orig_insn(mm, uprobe, (unsigned long)vaddr, true);
+}
+
+static void delete_uprobe(struct uprobe *uprobe)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	rb_erase(&uprobe->rb_node, &uprobes_tree);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	iput(uprobe->inode);
+	put_uprobe(uprobe);
+	atomic_dec(&uprobe_events);
+}
+
+static struct vma_info *__find_next_vma_info(struct list_head *head,
+			loff_t offset, struct address_space *mapping,
+			struct vma_info *vi, bool is_register)
+{
+	struct prio_tree_iter iter;
+	struct vm_area_struct *vma;
+	struct vma_info *tmpvi;
+	loff_t vaddr;
+	unsigned long pgoff = offset >> PAGE_SHIFT;
+	int existing_vma;
+
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (!valid_vma(vma, is_register))
+			continue;
+
+		existing_vma = 0;
+		vaddr = vma_address(vma, offset);
+		list_for_each_entry(tmpvi, head, probe_list) {
+			if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
+				existing_vma = 1;
+				break;
+			}
+		}
+
+		/*
+		 * Another vma needs a probe to be installed. However skip
+		 * installing the probe if the vma is about to be unlinked.
+		 */
+		if (!existing_vma &&
+				atomic_inc_not_zero(&vma->vm_mm->mm_users)) {
+			vi->mm = vma->vm_mm;
+			vi->vaddr = vaddr;
+			list_add(&vi->probe_list, head);
+			return vi;
+		}
+	}
+	return NULL;
+}
+
+/*
+ * Iterate in the rmap prio tree  and find a vma where a probe has not
+ * yet been inserted.
+ */
+static struct vma_info *find_next_vma_info(struct list_head *head,
+			loff_t offset, struct address_space *mapping,
+			bool is_register)
+{
+	struct vma_info *vi, *retvi;
+	vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
+	if (!vi)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	retvi = __find_next_vma_info(head, offset, mapping, vi, is_register);
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (!retvi)
+		kfree(vi);
+	return retvi;
+}
+
+static int register_for_each_vma(struct uprobe *uprobe, bool is_register)
+{
+	struct list_head try_list;
+	struct vm_area_struct *vma;
+	struct address_space *mapping;
+	struct vma_info *vi, *tmpvi;
+	struct mm_struct *mm;
+	loff_t vaddr;
+	int ret = 0;
+
+	mapping = uprobe->inode->i_mapping;
+	INIT_LIST_HEAD(&try_list);
+	while ((vi = find_next_vma_info(&try_list, uprobe->offset,
+					mapping, is_register)) != NULL) {
+		if (IS_ERR(vi)) {
+			ret = PTR_ERR(vi);
+			break;
+		}
+		mm = vi->mm;
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, (unsigned long)vi->vaddr);
+		if (!vma || !valid_vma(vma, is_register)) {
+			list_del(&vi->probe_list);
+			kfree(vi);
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			continue;
+		}
+		vaddr = vma_address(vma, uprobe->offset);
+		if (vma->vm_file->f_mapping->host != uprobe->inode ||
+						vaddr != vi->vaddr) {
+			list_del(&vi->probe_list);
+			kfree(vi);
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			continue;
+		}
+
+		if (is_register)
+			ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
+		else
+			remove_breakpoint(mm, uprobe, vi->vaddr);
+
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+		if (is_register) {
+			if (ret && ret == -EEXIST)
+				ret = 0;
+			if (ret)
+				break;
+		}
+	}
+	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
+		list_del(&vi->probe_list);
+		kfree(vi);
+	}
+	return ret;
+}
+
+static int __register_uprobe(struct uprobe *uprobe)
+{
+	return register_for_each_vma(uprobe, true);
+}
+
+static void __unregister_uprobe(struct uprobe *uprobe)
+{
+	if (!register_for_each_vma(uprobe, false))
+		delete_uprobe(uprobe);
+
+	/* TODO : cant unregister? schedule a worker thread */
+}
+
+/*
+ * register_uprobe - register a probe
+ * @inode: the file in which the probe has to be placed.
+ * @offset: offset from the start of the file.
+ * @consumer: information on howto handle the probe..
+ *
+ * Apart from the access refcount, register_uprobe() takes a creation
+ * refcount (thro alloc_uprobe) if and only if this @uprobe is getting
+ * inserted into the rbtree (i.e first consumer for a @inode:@offset
+ * tuple).  Creation refcount stops unregister_uprobe from freeing the
+ * @uprobe even before the register operation is complete. Creation
+ * refcount is released when the last @consumer for the @uprobe
+ * unregisters.
+ *
+ * Return errno if it cannot successully install probes
+ * else return 0 (success)
+ */
+int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe *uprobe;
+	int ret = -EINVAL;
+
+	if (!inode || !consumer || consumer->next)
+		return ret;
+
+	if (offset > i_size_read(inode))
+		return ret;
+
+	ret = 0;
+	mutex_lock(uprobes_hash(inode));
+	uprobe = alloc_uprobe(inode, offset);
+	if (uprobe && !add_consumer(uprobe, consumer)) {
+		ret = __register_uprobe(uprobe);
+		if (ret) {
+			uprobe->consumers = NULL;
+			__unregister_uprobe(uprobe);
+		} else
+			uprobe->flags |= UPROBES_RUN_HANDLER;
+	}
+
+	mutex_unlock(uprobes_hash(inode));
+	put_uprobe(uprobe);
+
+	return ret;
+}
+
+/*
+ * unregister_uprobe - unregister a already registered probe.
+ * @inode: the file in which the probe has to be removed.
+ * @offset: offset from the start of the file.
+ * @consumer: identify which probe if multiple probes are colocated.
+ */
+void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe *uprobe = NULL;
+
+	if (!inode || !consumer)
+		return;
+
+	uprobe = find_uprobe(inode, offset);
+	if (!uprobe)
+		return;
+
+	mutex_lock(uprobes_hash(inode));
+	if (!del_consumer(uprobe, consumer))
+		goto unreg_out;
+
+	if (!uprobe->consumers) {
+		__unregister_uprobe(uprobe);
+		uprobe->flags &= ~UPROBES_RUN_HANDLER;
+	}
+
+unreg_out:
+	mutex_unlock(uprobes_hash(inode));
+	if (uprobe)
+		put_uprobe(uprobe);
+}
+
+/*
+ * Of all the nodes that correspond to the given inode, return the node
+ * with the least offset.
+ */
+static struct rb_node *find_least_offset_node(struct inode *inode)
+{
+	struct uprobe u = { .inode = inode, .offset = 0};
+	struct rb_node *n = uprobes_tree.rb_node;
+	struct rb_node *close_node = NULL;
+	struct uprobe *uprobe;
+	int match;
+
+	while (n) {
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		match = match_uprobe(&u, uprobe);
+		if (uprobe->inode == inode)
+			close_node = n;
+
+		if (!match)
+			return close_node;
+
+		if (match < 0)
+			n = n->rb_left;
+		else
+			n = n->rb_right;
+	}
+	return close_node;
+}
+
+/*
+ * For a given inode, build a list of probes that need to be inserted.
+ */
+static void build_probe_list(struct inode *inode, struct list_head *head)
+{
+	struct uprobe *uprobe;
+	struct rb_node *n;
+	unsigned long flags;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	n = find_least_offset_node(inode);
+	for (; n; n = rb_next(n)) {
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		if (uprobe->inode != inode)
+			break;
+
+		list_add(&uprobe->pending_list, head);
+		atomic_inc(&uprobe->ref);
+	}
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+}
+
+/*
+ * Called from mmap_region.
+ * called with mm->mmap_sem acquired.
+ *
+ * Return -ve no if we fail to insert probes and we cannot
+ * bail-out.
+ * Return 0 otherwise. i.e :
+ *	- successful insertion of probes
+ *	- (or) no possible probes to be inserted.
+ *	- (or) insertion of probes failed but we can bail-out.
+ */
+int mmap_uprobe(struct vm_area_struct *vma)
+{
+	struct list_head tmp_list;
+	struct uprobe *uprobe, *u;
+	struct inode *inode;
+	int ret = 0;
+
+	if (!atomic_read(&uprobe_events) || !valid_vma(vma, true))
+		return ret;	/* Bail-out */
+
+	inode = vma->vm_file->f_mapping->host;
+	if (!inode)
+		return ret;
+
+	INIT_LIST_HEAD(&tmp_list);
+	mutex_lock(uprobes_mmap_hash(inode));
+	build_probe_list(inode, &tmp_list);
+	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
+		loff_t vaddr;
+
+		list_del(&uprobe->pending_list);
+		if (!ret) {
+			vaddr = vma_address(vma, uprobe->offset);
+			if (vaddr < vma->vm_start || vaddr >= vma->vm_end) {
+				put_uprobe(uprobe);
+				continue;
+			}
+			ret = install_breakpoint(vma->vm_mm, uprobe, vma,
+								vaddr);
+			if (ret == -EEXIST)
+				ret = 0;
+		}
+		put_uprobe(uprobe);
+	}
+
+	mutex_unlock(uprobes_mmap_hash(inode));
+
+	return ret;
+}
+
+static int __init init_uprobes(void)
+{
+	int i;
+
+	for (i = 0; i < UPROBES_HASH_SZ; i++) {
+		mutex_init(&uprobes_mutex[i]);
+		mutex_init(&uprobes_mmap_mutex[i]);
+	}
+	return 0;
+}
+
+static void __exit exit_uprobes(void)
+{
+}
+
+module_init(init_uprobes);
+module_exit(exit_uprobes);
diff --git a/mm/mmap.c b/mm/mmap.c
index eae90af..0966418 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -30,6 +30,7 @@
 #include <linux/perf_event.h>
 #include <linux/audit.h>
 #include <linux/khugepaged.h>
+#include <linux/uprobes.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -616,6 +617,13 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 
+	if (root) {
+		mmap_uprobe(vma);
+
+		if (adjust_next)
+			mmap_uprobe(next);
+	}
+
 	if (remove_next) {
 		if (file) {
 			fput(file);
@@ -637,6 +645,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 			goto again;
 		}
 	}
+	if (insert && file)
+		mmap_uprobe(insert);
 
 	validate_mm(mm);
 
@@ -1329,6 +1339,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			mm->locked_vm += (len >> PAGE_SHIFT);
 	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
 		make_pages_present(addr, addr + len);
+
+	if (file && mmap_uprobe(vma))
+		/* matching probes but cannot insert */
+		goto unmap_and_free_vma;
+
 	return addr;
 
 unmap_and_free_vma:
@@ -2305,6 +2320,10 @@ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
 	if ((vma->vm_flags & VM_ACCOUNT) &&
 	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
 		return -ENOMEM;
+
+	if (vma->vm_file && mmap_uprobe(vma))
+		return -EINVAL;
+
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	return 0;
 }
@@ -2356,6 +2375,10 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			new_vma->vm_pgoff = pgoff;
 			if (new_vma->vm_file) {
 				get_file(new_vma->vm_file);
+
+				if (mmap_uprobe(new_vma))
+					goto out_free_mempol;
+
 				if (vma->vm_flags & VM_EXECUTABLE)
 					added_exe_file_vma(mm);
 			}


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception.
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
  2012-01-10 11:48 ` [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints Srikar Dronamraju
@ 2012-01-10 11:48 ` Srikar Dronamraju
  2012-01-18  8:39   ` Anton Arapov
  2012-01-10 11:48 ` [PATCH v9 3.2 3/9] uprobes: slot allocation Srikar Dronamraju
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:48 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


Uprobes uses exception notifiers to get to know if a thread hit a
breakpoint or singlestep exception.

When a thread hits a uprobe or is singlestepping post a uprobe hit,
the uprobe exception notifier, sets its TIF_UPROBE bit, which will
then be checked on its return to userspace path (do_notify_resume()
->uprobe_notify_resume()), where the consumers handlers are run
(in task context) based on the defined filters.

Uprobe hits are thread specific and hence we need to maintain
information about if a task hit a uprobe, what uprobe was hit, the slot
where the original instruction was copied for xol so that it can be
singlestepped with appropriate fixups.

In some cases, special care is needed for instructions that are
executed out of line (xol). These are architecture specific artefacts,
such as handling RIP relative instructions on x86_64.

Since the instruction at which the uprobe was inserted is executed out
of line, architecture specific fixups are added so that the thread
continues normal execution in the presence of a uprobe.

Postpone the signals until we execute the probed insn. post_xol()
path does a recalc_sigpending() before return to user-mode, this
ensures the signal can't be lost.

Uprobes relies on DIE_DEBUG notification to notify if a singlestep is
complete.

Adds x86 specific uprobe exception notifiers and appropriate hooks
needed to determine a uprobe hit and subsequent post processing.

Add requisite x86 fixups for xol for uprobes. Specific cases needing
fixups include relative jumps (x86_64), calls, etc.

Where possible, we check and skip singlestepping the breakpointed
instructions. For now we skip single byte as well as few multibyte
nop instructions. However this can be extended to other
instructions too.

Credits to Oleg Nesterov for suggestions/patches related to signal,
breakpoint, singlestep handling code.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
(Changelog (since v8))
- verify the vma returned by find_vma (suggested by Oleg). 

(Changelog (since v7)): Resolve comments from Dan Carpenter.
- add_utask returns NULL on error.
- Handles architecture agnostic parts of a uprobe breakpoint hit and
  subsequent xol singlestepping.

Changelog (since v6)
- added x86 specific hook for aborting xol.

Changelog (since v5)
- Use srcu_raw instead of synchronize_sched
- Introduce per task srcu_id to store the srcu_id
- Modified comments.
- No more do a i386 specific enable interrupts. (Its not part of another
  patch posted separately)

 arch/x86/include/asm/thread_info.h |    2 
 arch/x86/include/asm/uprobes.h     |   17 ++
 arch/x86/kernel/signal.c           |    6 +
 arch/x86/kernel/uprobes.c          |  272 ++++++++++++++++++++++++++++++++++++
 include/linux/sched.h              |    4 +
 include/linux/uprobes.h            |   46 ++++++
 kernel/fork.c                      |    7 +
 kernel/signal.c                    |    3 
 kernel/uprobes.c                   |  275 ++++++++++++++++++++++++++++++++++++
 9 files changed, 630 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index a1fe5c1..aeb3e04 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,6 +84,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
+#define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
+#define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 8208234..475563b 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -23,6 +23,8 @@
  *	Jim Keniston
  */
 
+#include <linux/notifier.h>
+
 typedef u8 uprobe_opcode_t;
 #define MAX_UINSN_BYTES 16
 #define UPROBES_XOL_SLOT_BYTES	128	/* to keep it cache aligned */
@@ -37,6 +39,21 @@ struct uprobe_arch_info {
 #endif
 };
 
+struct uprobe_task_arch_info {
+	unsigned long saved_trap_no;
+#ifdef CONFIG_X86_64
+	unsigned long saved_scratch_register;
+#endif
+};
+
 struct uprobe;
+
 extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
+extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
+extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern bool xol_was_trapped(struct task_struct *tsk);
+extern int uprobe_exception_notify(struct notifier_block *self,
+				       unsigned long val, void *data);
+extern void abort_xol(struct pt_regs *regs, struct uprobe *uprobe);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 54ddaeb..4fdf470 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/personality.h>
 #include <linux/uaccess.h>
 #include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
 
 #include <asm/processor.h>
 #include <asm/ucontext.h>
@@ -820,6 +821,11 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		mce_notify_process();
 #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
 
+	if (thread_info_flags & _TIF_UPROBE) {
+		clear_thread_flag(TIF_UPROBE);
+		uprobe_notify_resume(regs);
+	}
+
 	/* deal with pending signal delivery */
 	if (thread_info_flags & _TIF_SIGPENDING)
 		do_signal(regs);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index a980d1e..e4e0dfd 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -25,10 +25,17 @@
 #include <linux/sched.h>
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
+#include <linux/uaccess.h>
 
 #include <linux/kdebug.h>
 #include <asm/insn.h>
 
+#ifdef CONFIG_X86_32
+#define is_32bit_app(tsk) 1
+#else
+#define is_32bit_app(tsk) (test_tsk_thread_flag(tsk, TIF_IA32))
+#endif
+
 /* Post-execution fixups. */
 
 /* No fixup needed */
@@ -401,3 +408,268 @@ int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe)
 	prepare_fixups(uprobe, &insn);
 	return 0;
 }
+
+/*
+ * @reg: reflects the saved state of the task
+ * @vaddr: the virtual address to jump to.
+ * Return 0 on success or a -ve number on error.
+ */
+void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
+{
+	regs->ip = vaddr;
+}
+
+#define	UPROBE_TRAP_NO		UINT_MAX
+
+/*
+ * pre_xol - prepare to execute out of line.
+ * @uprobe: the probepoint information.
+ * @regs: reflects the saved user state of @tsk.
+ *
+ * If we're emulating a rip-relative instruction, save the contents
+ * of the scratch register and store the target address in that register.
+ *
+ * Returns true if @uprobe->opcode is @bkpt_insn.
+ */
+#ifdef CONFIG_X86_64
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
+
+	tskinfo->saved_trap_no = current->thread.trap_no;
+	current->thread.trap_no = UPROBE_TRAP_NO;
+
+	regs->ip = current->utask->xol_vaddr;
+	if (uprobe->arch_info.fixups & UPROBES_FIX_RIP_AX) {
+		tskinfo->saved_scratch_register = regs->ax;
+		regs->ax = current->utask->vaddr;
+		regs->ax += uprobe->arch_info.rip_rela_target_address;
+	} else if (uprobe->arch_info.fixups & UPROBES_FIX_RIP_CX) {
+		tskinfo->saved_scratch_register = regs->cx;
+		regs->cx = current->utask->vaddr;
+		regs->cx += uprobe->arch_info.rip_rela_target_address;
+	}
+	return 0;
+}
+#else
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
+
+	tskinfo->saved_trap_no = current->thread.trap_no;
+	current->thread.trap_no = UPROBE_TRAP_NO;
+
+	regs->ip = current->utask->xol_vaddr;
+	return 0;
+}
+#endif
+
+/*
+ * Called by post_xol() to adjust the return address pushed by a call
+ * instruction executed out of line.
+ */
+static int adjust_ret_addr(unsigned long sp, long correction)
+{
+	int rasize, ncopied;
+	long ra = 0;
+
+	if (is_32bit_app(current))
+		rasize = 4;
+	else
+		rasize = 8;
+
+	ncopied = copy_from_user(&ra, (void __user *)sp, rasize);
+	if (unlikely(ncopied))
+		return -EFAULT;
+
+	ra += correction;
+	ncopied = copy_to_user((void __user *)sp, &ra, rasize);
+	if (unlikely(ncopied))
+		return -EFAULT;
+
+	return 0;
+}
+
+#ifdef CONFIG_X86_64
+static bool is_riprel_insn(struct uprobe *uprobe)
+{
+	return ((uprobe->arch_info.fixups &
+			(UPROBES_FIX_RIP_AX | UPROBES_FIX_RIP_CX)) != 0);
+}
+
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+			struct pt_regs *regs, long *correction)
+{
+	if (is_riprel_insn(uprobe)) {
+		struct uprobe_task_arch_info *tskinfo;
+		tskinfo = &current->utask->tskinfo;
+
+		if (uprobe->arch_info.fixups & UPROBES_FIX_RIP_AX)
+			regs->ax = tskinfo->saved_scratch_register;
+		else
+			regs->cx = tskinfo->saved_scratch_register;
+		/*
+		 * The original instruction includes a displacement, and so
+		 * is 4 bytes longer than what we've just single-stepped.
+		 * Fall through to handle stuff like "jmpq *...(%rip)" and
+		 * "callq *...(%rip)".
+		 */
+		if (correction)
+			*correction += 4;
+	}
+}
+#else
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+			struct pt_regs *regs, long *correction)
+{
+}
+#endif
+
+/*
+ * If xol insn itself traps and generates a signal(Say,
+ * SIGILL/SIGSEGV/etc), then detect the case where a singlestepped
+ * instruction jumps back to its own address. It is assumed that anything
+ * like do_page_fault/do_trap/etc sets thread.trap_no != -1.
+ *
+ * pre_xol/post_xol save/restore thread.trap_no, xol_was_trapped() simply
+ * checks that ->trap_no is not equal to UPROBE_TRAP_NO == -1 set by
+ * pre_xol().
+ */
+bool xol_was_trapped(struct task_struct *tsk)
+{
+	if (tsk->thread.trap_no != UPROBE_TRAP_NO)
+		return true;
+
+	return false;
+}
+
+/*
+ * Called after single-stepping. To avoid the SMP problems that can
+ * occur when we temporarily put back the original opcode to
+ * single-step, we single-stepped a copy of the instruction.
+ *
+ * This function prepares to resume execution after the single-step.
+ * We have to fix things up as follows:
+ *
+ * Typically, the new ip is relative to the copied instruction.  We need
+ * to make it relative to the original instruction (FIX_IP).  Exceptions
+ * are return instructions and absolute or indirect jump or call instructions.
+ *
+ * If the single-stepped instruction was a call, the return address that
+ * is atop the stack is the address following the copied instruction.  We
+ * need to make it the address following the original instruction (FIX_CALL).
+ *
+ * If the original instruction was a rip-relative instruction such as
+ * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
+ * instruction using a scratch register -- e.g., "movl %edx,(%rax)".
+ * We need to restore the contents of the scratch register and adjust
+ * the ip, keeping in mind that the instruction we executed is 4 bytes
+ * shorter than the original instruction (since we squeezed out the offset
+ * field).  (FIX_RIP_AX or FIX_RIP_CX)
+ */
+int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_task *utask = current->utask;
+	int result = 0;
+	long correction;
+
+	WARN_ON_ONCE(current->thread.trap_no != UPROBE_TRAP_NO);
+
+	current->thread.trap_no = utask->tskinfo.saved_trap_no;
+	correction = (long)(utask->vaddr - utask->xol_vaddr);
+	handle_riprel_post_xol(uprobe, regs, &correction);
+	if (uprobe->arch_info.fixups & UPROBES_FIX_IP)
+		regs->ip += correction;
+	if (uprobe->arch_info.fixups & UPROBES_FIX_CALL)
+		result = adjust_ret_addr(regs->sp, correction);
+	return result;
+}
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int uprobe_exception_notify(struct notifier_block *self,
+				       unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+	int ret = NOTIFY_DONE;
+
+	/* We are only interested in userspace traps */
+	if (regs && !user_mode_vm(regs))
+		return NOTIFY_DONE;
+
+	switch (val) {
+	case DIE_INT3:
+		/* Run your handler here */
+		if (uprobe_bkpt_notifier(regs))
+			ret = NOTIFY_STOP;
+		break;
+	case DIE_DEBUG:
+		if (uprobe_post_notifier(regs))
+			ret = NOTIFY_STOP;
+	default:
+		break;
+	}
+	return ret;
+}
+
+/*
+ * xol insn either trapped or thread has a fatal signal, so reset the
+ * instruction pointer to its probed address.
+ */
+void abort_xol(struct pt_regs *regs, struct uprobe *uprobe)
+{
+	struct uprobe_task *utask = current->utask;
+
+	current->thread.trap_no = utask->tskinfo.saved_trap_no;
+	handle_riprel_post_xol(uprobe, regs, NULL);
+	set_instruction_pointer(regs, utask->vaddr);
+}
+
+/*
+ * Skip these instructions:
+ *
+ * 0f 19 90 90 90 90 90		nopl   -0x6f6f6f70(%rax)
+ * 0f 1f 00			nopl (%rax)
+ * 0f 1f 40 00			nopl 0x0(%rax)
+ * 0f 1f 44 00 00		nopl 0x0(%rax,%rax,1)
+ * 0f 1f 80 00 00 00 00		nopl 0x0(%rax)
+ * 0f 1f 84 00 00 00 00		nopl 0x0(%rax,%rax,1)
+ * 66 0f 1f 44 00 00 00		nopw 0x0(%rax,%rax,1)
+ * 66 0f 1f 84 00 00 00		nopw 0x0(%rax,%rax,1)
+ * 66 87 c0			xchg %eax,%eax
+ * 66 90			nop
+ * 87 c0			xchg %eax,%eax
+ * 90 				nop
+ */
+
+bool can_skip_xol(struct pt_regs *regs, struct uprobe *u)
+{
+	int i;
+
+	for (i = 0; i < MAX_UINSN_BYTES; i++) {
+		if ((u->insn[i] == 0x66))
+			continue;
+
+		if (u->insn[i] == 0x90)
+			return true;
+
+		if (i == (MAX_UINSN_BYTES - 1))
+			break;
+
+		if ((u->insn[i] == 0x0f) && (u->insn[i+1] == 0x1f))
+			return true;
+
+		if ((u->insn[i] == 0x0f) && (u->insn[i+1] == 0x19))
+			return true;
+
+		if ((u->insn[i] == 0x87) && (u->insn[i+1] == 0xc0))
+			return true;
+
+		break;
+	}
+
+	u->flags &= ~UPROBES_SKIP_SSTEP;
+	return false;
+}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..bab3da7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1572,6 +1572,10 @@ struct task_struct {
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
 #endif
+#ifdef CONFIG_UPROBES
+	struct uprobe_task *utask;
+	int uprobes_srcu_id;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f1d13fd..41037e9 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,7 +33,7 @@ struct vm_area_struct;
 
 typedef u8 uprobe_opcode_t;
 struct uprobe_arch_info {};
-
+struct uprobe_task_arch_info {};	/* arch specific task info */
 #define MAX_UINSN_BYTES 4
 #endif
 
@@ -44,6 +44,8 @@ struct uprobe_arch_info {};
 #define UPROBES_COPY_INSN	0x1
 /* Dont run handlers when first register/ last unregister in progress*/
 #define UPROBES_RUN_HANDLER	0x2
+/* Can skip singlestep */
+#define UPROBES_SKIP_SSTEP	0x4
 
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
@@ -69,6 +71,27 @@ struct uprobe {
 	u8			insn[MAX_UINSN_BYTES];
 };
 
+enum uprobe_task_state {
+	UTASK_RUNNING,
+	UTASK_BP_HIT,
+	UTASK_SSTEP,
+	UTASK_SSTEP_ACK,
+	UTASK_SSTEP_TRAPPED,
+};
+
+/*
+ * uprobe_task: Metadata of a task while it singlesteps.
+ */
+struct uprobe_task {
+	unsigned long xol_vaddr;
+	unsigned long vaddr;
+
+	enum uprobe_task_state state;
+	struct uprobe_task_arch_info tskinfo;
+
+	struct uprobe *active_uprobe;
+};
+
 #ifdef CONFIG_UPROBES
 extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
 							unsigned long vaddr);
@@ -79,7 +102,14 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
+extern void free_uprobe_utask(struct task_struct *tsk);
 extern int mmap_uprobe(struct vm_area_struct *vma);
+extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
+extern int uprobe_post_notifier(struct pt_regs *regs);
+extern int uprobe_bkpt_notifier(struct pt_regs *regs);
+extern void uprobe_notify_resume(struct pt_regs *regs);
+extern bool uprobe_deny_signal(void);
+extern bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
@@ -94,5 +124,19 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
 {
 	return 0;
 }
+static inline void uprobe_notify_resume(struct pt_regs *regs)
+{
+}
+static inline bool uprobe_deny_signal(void)
+{
+	return false;
+}
+static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+	return 0;
+}
+static inline void free_uprobe_utask(struct task_struct *tsk)
+{
+}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index da4a6a1..7a0b7d7 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -66,6 +66,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
+#include <linux/uprobes.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -677,6 +678,8 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		exit_pi_state_list(tsk);
 #endif
 
+	free_uprobe_utask(tsk);
+
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
 
@@ -1272,6 +1275,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	INIT_LIST_HEAD(&p->pi_state_list);
 	p->pi_state_cache = NULL;
 #endif
+#ifdef CONFIG_UPROBES
+	p->utask = NULL;
+	p->uprobes_srcu_id = -1;
+#endif
 	/*
 	 * sigaltstack should be cleared when sharing the same VM
 	 */
diff --git a/kernel/signal.c b/kernel/signal.c
index 2065515..a9d8a50 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2147,6 +2147,9 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
 	struct signal_struct *signal = current->signal;
 	int signr;
 
+	if (unlikely(uprobe_deny_signal()))
+		return 0;
+
 relock:
 	/*
 	 * We'll jump back here after any time we were stopped in TASK_STOPPED.
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 72e8bb3..674d6da 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -29,8 +29,11 @@
 #include <linux/rmap.h>		/* anon_vma_prepare */
 #include <linux/mmu_notifier.h>	/* set_pte_at_notify */
 #include <linux/swap.h>		/* try_to_free_swap */
+#include <linux/ptrace.h>	/* user_enable_single_step */
+#include <linux/kdebug.h>	/* notifier mechanism */
 #include <linux/uprobes.h>
 
+static struct srcu_struct uprobes_srcu;
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
 
@@ -460,6 +463,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	spin_lock_irqsave(&uprobes_treelock, flags);
 	u = __insert_uprobe(uprobe);
 	spin_unlock_irqrestore(&uprobes_treelock, flags);
+
+	/* For now assume that the instruction need not be single-stepped */
+	uprobe->flags |= UPROBES_SKIP_SSTEP;
 	return u;
 }
 
@@ -495,6 +501,24 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_consumer *consumer;
+
+	if (!(uprobe->flags & UPROBES_RUN_HANDLER))
+		return;
+
+	down_read(&uprobe->consumer_rwsem);
+	consumer = uprobe->consumers;
+	for (consumer = uprobe->consumers; consumer;
+					consumer = consumer->next) {
+		if (!consumer->filter ||
+				consumer->filter(consumer, current))
+			consumer->handler(consumer, regs);
+	}
+	up_read(&uprobe->consumer_rwsem);
+}
+
 /* Returns the previous consumer */
 static struct uprobe_consumer *add_consumer(struct uprobe *uprobe,
 				struct uprobe_consumer *consumer)
@@ -630,10 +654,21 @@ static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 	set_orig_insn(mm, uprobe, (unsigned long)vaddr, true);
 }
 
+/*
+ * There could be threads that have hit the breakpoint and are entering the
+ * notifier code and trying to acquire the uprobes_treelock. The thread
+ * calling delete_uprobe() that is removing the uprobe from the rb_tree can
+ * race with these threads and might acquire the uprobes_treelock compared
+ * to some of the breakpoint hit threads. In such a case, the breakpoint hit
+ * threads will not find the uprobe. Hence wait till the current breakpoint
+ * hit threads acquire the uprobes_treelock before the uprobe is removed
+ * from the rbtree.
+ */
 static void delete_uprobe(struct uprobe *uprobe)
 {
 	unsigned long flags;
 
+	synchronize_srcu(&uprobes_srcu);
 	spin_lock_irqsave(&uprobes_treelock, flags);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
 	spin_unlock_irqrestore(&uprobes_treelock, flags);
@@ -957,6 +992,243 @@ int mmap_uprobe(struct vm_area_struct *vma)
 	return ret;
 }
 
+/**
+ * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
+ * @regs: Reflects the saved state of the task after it has hit a breakpoint
+ * instruction.
+ * Return the address of the breakpoint instruction.
+ */
+unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+	return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
+}
+
+/*
+ * Called with no locks held.
+ * Called in context of a exiting or a exec-ing thread.
+ */
+void free_uprobe_utask(struct task_struct *tsk)
+{
+	struct uprobe_task *utask = tsk->utask;
+
+	if (tsk->uprobes_srcu_id != -1)
+		srcu_read_unlock_raw(&uprobes_srcu, tsk->uprobes_srcu_id);
+
+	if (!utask)
+		return;
+
+	if (utask->active_uprobe)
+		put_uprobe(utask->active_uprobe);
+
+	kfree(utask);
+	tsk->utask = NULL;
+}
+
+/*
+ * Allocate a uprobe_task object for the task.
+ * Called when the thread hits a breakpoint for the first time.
+ *
+ * Returns:
+ * - pointer to new uprobe_task on success
+ * - NULL otherwise
+ */
+static struct uprobe_task *add_utask(void)
+{
+	struct uprobe_task *utask;
+
+	utask = kzalloc(sizeof *utask, GFP_KERNEL);
+	if (unlikely(utask == NULL))
+		return NULL;
+
+	utask->active_uprobe = NULL;
+	current->utask = utask;
+	return utask;
+}
+
+/* Prepare to single-step probed instruction out of line. */
+static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+				unsigned long vaddr)
+{
+	return -EFAULT;
+}
+
+/*
+ * If we are singlestepping, then ensure this thread is not connected to
+ * non-fatal signals until completion of singlestep.  When xol insn itself
+ * triggers the signal,  restart the original insn even if the task is
+ * already SIGKILL'ed (since coredump should report the correct ip).  This
+ * is even more important if the task has a handler for SIGSEGV/etc, The
+ * _same_ instruction should be repeated again after return from the signal
+ * handler, and SSTEP can never finish in this case.
+ */
+bool uprobe_deny_signal(void)
+{
+	struct task_struct *tsk = current;
+	struct uprobe_task *utask = tsk->utask;
+
+	if (likely(!utask || !utask->active_uprobe))
+		return false;
+
+	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
+
+	if (signal_pending(tsk)) {
+		spin_lock_irq(&tsk->sighand->siglock);
+		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
+		spin_unlock_irq(&tsk->sighand->siglock);
+
+		if (__fatal_signal_pending(tsk) || xol_was_trapped(tsk)) {
+			utask->state = UTASK_SSTEP_TRAPPED;
+			set_tsk_thread_flag(tsk, TIF_UPROBE);
+			set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+		}
+	}
+
+	return true;
+}
+
+bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u)
+{
+	u->flags &= ~UPROBES_SKIP_SSTEP;
+	return false;
+}
+
+/*
+ * On breakpoint hit, breakpoint notifier sets the TIF_UPROBE flag.  (and on
+ * subsequent probe hits on the thread sets the state to UTASK_BP_HIT) and
+ * allows the thread to return from interrupt.  While returning to
+ * userspace, thread noticies the TIF_UPROBE flag and calls
+ * uprobe_notify_resume(). uprobe_notify_resume will run the handler and ask
+ * the thread to singlestep.
+ *
+ * On subsequent singlestep exception, singlestep notifier sets the
+ * TIF_UPROBE flag and also sets the state to UTASK_SSTEP_ACK and allows the
+ * thread to return from interrupt. While returning to userspace, thread
+ * notices the TIF_UPROBE and calls uprobe_notify_resume().
+ * uprobe_notify_resume disables singlestep and performs the required
+ * fix-ups.
+ *
+ * All non-fatal signals cannot interrupt thread while the thread singlesteps.
+ */
+void uprobe_notify_resume(struct pt_regs *regs)
+{
+	struct vm_area_struct *vma;
+	struct uprobe_task *utask;
+	struct mm_struct *mm;
+	struct uprobe *u = NULL;
+	unsigned long probept;
+
+	utask = current->utask;
+	mm = current->mm;
+	if (!utask || utask->state == UTASK_BP_HIT) {
+		probept = get_uprobe_bkpt_addr(regs);
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, probept);
+		if (vma && vma->vm_start <= probept && valid_vma(vma, false))
+			u = find_uprobe(vma->vm_file->f_mapping->host,
+					probept - vma->vm_start +
+					(vma->vm_pgoff << PAGE_SHIFT));
+
+		srcu_read_unlock_raw(&uprobes_srcu,
+					current->uprobes_srcu_id);
+		current->uprobes_srcu_id = -1;
+		up_read(&mm->mmap_sem);
+		if (!u)
+			/* No matching uprobe; signal SIGTRAP. */
+			goto cleanup_ret;
+		if (!utask) {
+			utask = add_utask();
+			/* Cannot Allocate; re-execute the instruction. */
+			if (!utask)
+				goto cleanup_ret;
+		}
+		utask->active_uprobe = u;
+		handler_chain(u, regs);
+
+		if (u->flags & UPROBES_SKIP_SSTEP && can_skip_xol(regs, u))
+			goto cleanup_ret;
+
+		utask->state = UTASK_SSTEP;
+		if (!pre_ssout(u, regs, probept))
+			user_enable_single_step(current);
+		else
+			/* Cannot Singlestep; re-execute the instruction. */
+			goto cleanup_ret;
+	} else {
+		u = utask->active_uprobe;
+		if (utask->state == UTASK_SSTEP_ACK)
+			post_xol(u, regs);
+		else if (utask->state == UTASK_SSTEP_TRAPPED)
+			abort_xol(regs, u);
+		else
+			WARN_ON_ONCE(1);
+
+		put_uprobe(u);
+		utask->active_uprobe = NULL;
+		utask->state = UTASK_RUNNING;
+		user_disable_single_step(current);
+
+		spin_lock_irq(&current->sighand->siglock);
+		recalc_sigpending(); /* see uprobe_deny_signal() */
+		spin_unlock_irq(&current->sighand->siglock);
+	}
+	return;
+
+cleanup_ret:
+	if (utask) {
+		utask->active_uprobe = NULL;
+		utask->state = UTASK_RUNNING;
+	}
+	if (u) {
+		if (!(u->flags & UPROBES_SKIP_SSTEP))
+			set_instruction_pointer(regs, probept);
+
+		put_uprobe(u);
+	} else
+		send_sig(SIGTRAP, current, 0);
+}
+
+/*
+ * uprobe_bkpt_notifier gets called from interrupt context as part of
+ * notifier mechanism. Set TIF_UPROBE flag and indicate breakpoint hit.
+ */
+int uprobe_bkpt_notifier(struct pt_regs *regs)
+{
+	struct uprobe_task *utask;
+
+	if (!current->mm)
+		return 0;
+
+	utask = current->utask;
+	if (utask)
+		utask->state = UTASK_BP_HIT;
+
+	set_thread_flag(TIF_UPROBE);
+	current->uprobes_srcu_id = srcu_read_lock_raw(&uprobes_srcu);
+	return 1;
+}
+
+/*
+ * uprobe_post_notifier gets called in interrupt context as part of notifier
+ * mechanism. Set TIF_UPROBE flag and indicate completion of singlestep.
+ */
+int uprobe_post_notifier(struct pt_regs *regs)
+{
+	struct uprobe_task *utask = current->utask;
+
+	if (!current->mm || !utask || !utask->active_uprobe)
+		/* task is currently not uprobed */
+		return 0;
+
+	utask->state = UTASK_SSTEP_ACK;
+	set_thread_flag(TIF_UPROBE);
+	return 1;
+}
+
+struct notifier_block uprobe_exception_nb = {
+	.notifier_call = uprobe_exception_notify,
+	.priority = INT_MAX - 1,	/* notified after kprobes, kgdb */
+};
+
 static int __init init_uprobes(void)
 {
 	int i;
@@ -965,7 +1237,8 @@ static int __init init_uprobes(void)
 		mutex_init(&uprobes_mutex[i]);
 		mutex_init(&uprobes_mmap_mutex[i]);
 	}
-	return 0;
+	init_srcu_struct(&uprobes_srcu);
+	return register_die_notifier(&uprobe_exception_nb);
 }
 
 static void __exit exit_uprobes(void)


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 3.2 3/9] uprobes: slot allocation.
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
  2012-01-10 11:48 ` [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints Srikar Dronamraju
  2012-01-10 11:48 ` [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception Srikar Dronamraju
@ 2012-01-10 11:48 ` Srikar Dronamraju
  2012-01-10 11:49 ` [PATCH v9 3.2 4/9] uprobes: counter to optimize probe hits Srikar Dronamraju
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:48 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


Uprobes executes the original instruction at a probed location out of
line. For this, we allocate a page (per mm) upon the first uprobe hit,
in the process' user address space, divide it into slots that are used
to store the actual instructions to be singlestepped.

Care is taken to ensure that the allocation is in an unmapped area as
close to the top of the user address space as possible, with appropriate
permission settings to keep selinux like frameworks happy.

Upon a uprobe hit, a free slot is acquired, and is released after the
singlestep completes.

[ Folded a fix for build issue on powerpc fixed and reported by Stephen
Rothwell]

Lots of improvements courtesy suggestions/inputs from Peter and Oleg.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog (since v5)
- no more spin lock needed for slot allocation.
- use install_special_mapping to add a vma. (previous approach used
  init_creds)
- set uprobes_xol_area while holding map_sem exclusively.

 include/linux/mm_types.h |    4 +
 include/linux/uprobes.h  |   25 ++++++
 kernel/fork.c            |    4 +
 kernel/uprobes.c         |  202 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 235 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5b42f1b..e9eac38 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/uprobes.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -389,6 +390,9 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_UPROBES
+	struct uprobes_xol_area *uprobes_xol_area;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 41037e9..9e25c41 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -27,6 +27,7 @@
 #include <linux/rbtree.h>
 
 struct vm_area_struct;
+struct mm_struct;
 #ifdef CONFIG_ARCH_SUPPORTS_UPROBES
 #include <asm/uprobes.h>
 #else
@@ -92,6 +93,26 @@ struct uprobe_task {
 	struct uprobe *active_uprobe;
 };
 
+/*
+ * On a breakpoint hit, thread contests for a slot.  It free the
+ * slot after singlestep.  Only definite number of slots are
+ * allocated.
+ */
+
+struct uprobes_xol_area {
+	wait_queue_head_t wq;	/* if all slots are busy */
+	atomic_t slot_count;	/* currently in use slots */
+	unsigned long *bitmap;	/* 0 = free slot */
+	struct page *page;
+
+	/*
+	 * We keep the vma's vm_start rather than a pointer to the vma
+	 * itself.  The probed process or a naughty kernel module could make
+	 * the vma go away, and we must handle that reasonably gracefully.
+	 */
+	unsigned long vaddr;		/* Page(s) of instruction slots */
+};
+
 #ifdef CONFIG_UPROBES
 extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
 							unsigned long vaddr);
@@ -103,6 +124,7 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void free_uprobe_utask(struct task_struct *tsk);
+extern void free_uprobes_xol_area(struct mm_struct *mm);
 extern int mmap_uprobe(struct vm_area_struct *vma);
 extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
 extern int uprobe_post_notifier(struct pt_regs *regs);
@@ -138,5 +160,8 @@ static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
 static inline void free_uprobe_utask(struct task_struct *tsk)
 {
 }
+static inline void free_uprobes_xol_area(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7a0b7d7..2da316e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -550,6 +550,7 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
+		free_uprobes_xol_area(mm);
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
@@ -736,6 +737,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
+#ifdef CONFIG_UPROBES
+	mm->uprobes_xol_area = NULL;
+#endif
 
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 674d6da..73d223e 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -33,6 +33,9 @@
 #include <linux/kdebug.h>	/* notifier mechanism */
 #include <linux/uprobes.h>
 
+#define UINSNS_PER_PAGE	(PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
+#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
+
 static struct srcu_struct uprobes_srcu;
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
@@ -992,6 +995,201 @@ int mmap_uprobe(struct vm_area_struct *vma)
 	return ret;
 }
 
+/* Slot allocation for XOL */
+static int xol_add_vma(struct uprobes_xol_area *area)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	area->page = alloc_page(GFP_HIGHUSER);
+	if (!area->page)
+		return -ENOMEM;
+
+	mm = current->mm;
+	down_write(&mm->mmap_sem);
+	ret = -EALREADY;
+	if (mm->uprobes_xol_area)
+		goto fail;
+
+	ret = -ENOMEM;
+
+	/* Try to map as high as possible, this is only a hint. */
+	area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
+							PAGE_SIZE, 0, 0);
+	if (area->vaddr & ~PAGE_MASK) {
+		ret = area->vaddr;
+		goto fail;
+	}
+
+	ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
+				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
+				&area->page);
+	if (ret)
+		goto fail;
+
+	smp_wmb();	/* pairs with get_uprobes_xol_area() */
+	mm->uprobes_xol_area = area;
+	ret = 0;
+
+fail:
+	up_write(&mm->mmap_sem);
+	if (ret)
+		__free_page(area->page);
+
+	return ret;
+}
+
+static struct uprobes_xol_area *get_uprobes_xol_area(struct mm_struct *mm)
+{
+	struct uprobes_xol_area *area = mm->uprobes_xol_area;
+	smp_read_barrier_depends();/* pairs with wmb in xol_add_vma() */
+	return area;
+}
+
+/*
+ * xol_alloc_area - Allocate process's uprobes_xol_area.
+ * This area will be used for storing instructions for execution out of
+ * line.
+ *
+ * Returns the allocated area or NULL.
+ */
+static struct uprobes_xol_area *xol_alloc_area(void)
+{
+	struct uprobes_xol_area *area;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (unlikely(!area))
+		return NULL;
+
+	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
+								GFP_KERNEL);
+
+	if (!area->bitmap)
+		goto fail;
+
+	init_waitqueue_head(&area->wq);
+	if (!xol_add_vma(area))
+		return area;
+
+fail:
+	kfree(area->bitmap);
+	kfree(area);
+	return get_uprobes_xol_area(current->mm);
+}
+
+/*
+ * free_uprobes_xol_area - Free the area allocated for slots.
+ */
+void free_uprobes_xol_area(struct mm_struct *mm)
+{
+	struct uprobes_xol_area *area = mm->uprobes_xol_area;
+
+	if (!area)
+		return;
+
+	put_page(area->page);
+	kfree(area->bitmap);
+	kfree(area);
+}
+
+/*
+ *  - search for a free slot.
+ */
+static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
+{
+	unsigned long slot_addr;
+	int slot_nr;
+
+	do {
+		slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
+		if (slot_nr < UINSNS_PER_PAGE) {
+			if (!test_and_set_bit(slot_nr, area->bitmap))
+				break;
+
+			slot_nr = UINSNS_PER_PAGE;
+			continue;
+		}
+		wait_event(area->wq,
+			(atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
+	} while (slot_nr >= UINSNS_PER_PAGE);
+
+	slot_addr = area->vaddr + (slot_nr * UPROBES_XOL_SLOT_BYTES);
+	atomic_inc(&area->slot_count);
+	return slot_addr;
+}
+
+/*
+ * xol_get_insn_slot - If was not allocated a slot, then
+ * allocate a slot.
+ * Returns the allocated slot address or 0.
+ */
+static unsigned long xol_get_insn_slot(struct uprobe *uprobe,
+					unsigned long slot_addr)
+{
+	struct uprobes_xol_area *area;
+	unsigned long offset;
+	void *vaddr;
+
+	area = get_uprobes_xol_area(current->mm);
+	if (!area) {
+		area = xol_alloc_area();
+		if (!area)
+			return 0;
+	}
+	current->utask->xol_vaddr = xol_take_insn_slot(area);
+
+	/*
+	 * Initialize the slot if xol_vaddr points to valid
+	 * instruction slot.
+	 */
+	if (unlikely(!current->utask->xol_vaddr))
+		return 0;
+
+	current->utask->vaddr = slot_addr;
+	offset = current->utask->xol_vaddr & ~PAGE_MASK;
+	vaddr = kmap_atomic(area->page);
+	memcpy(vaddr + offset, uprobe->insn, MAX_UINSN_BYTES);
+	kunmap_atomic(vaddr);
+	return current->utask->xol_vaddr;
+}
+
+/*
+ * xol_free_insn_slot - If slot was earlier allocated by
+ * @xol_get_insn_slot(), make the slot available for
+ * subsequent requests.
+ */
+static void xol_free_insn_slot(struct task_struct *tsk)
+{
+	struct uprobes_xol_area *area;
+	unsigned long vma_end;
+	unsigned long slot_addr;
+
+	if (!tsk->mm || !tsk->mm->uprobes_xol_area || !tsk->utask)
+		return;
+
+	slot_addr = tsk->utask->xol_vaddr;
+
+	if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
+		return;
+
+	area = tsk->mm->uprobes_xol_area;
+	vma_end = area->vaddr + PAGE_SIZE;
+	if (area->vaddr <= slot_addr && slot_addr < vma_end) {
+		int slot_nr;
+		unsigned long offset = slot_addr - area->vaddr;
+
+		slot_nr = offset / UPROBES_XOL_SLOT_BYTES;
+		if (slot_nr >= UINSNS_PER_PAGE)
+			return;
+
+		clear_bit(slot_nr, area->bitmap);
+		atomic_dec(&area->slot_count);
+		if (waitqueue_active(&area->wq))
+			wake_up(&area->wq);
+		tsk->utask->xol_vaddr = 0;
+	}
+}
+
 /**
  * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
  * @regs: Reflects the saved state of the task after it has hit a breakpoint
@@ -1020,6 +1218,7 @@ void free_uprobe_utask(struct task_struct *tsk)
 	if (utask->active_uprobe)
 		put_uprobe(utask->active_uprobe);
 
+	xol_free_insn_slot(tsk);
 	kfree(utask);
 	tsk->utask = NULL;
 }
@@ -1049,6 +1248,8 @@ static struct uprobe_task *add_utask(void)
 static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
 				unsigned long vaddr)
 {
+	if (xol_get_insn_slot(uprobe, vaddr) && !pre_xol(uprobe, regs))
+		return 0;
 	return -EFAULT;
 }
 
@@ -1166,6 +1367,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		utask->active_uprobe = NULL;
 		utask->state = UTASK_RUNNING;
 		user_disable_single_step(current);
+		xol_free_insn_slot(current);
 
 		spin_lock_irq(&current->sighand->siglock);
 		recalc_sigpending(); /* see uprobe_deny_signal() */


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 3.2 4/9] uprobes: counter to optimize probe hits.
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2012-01-10 11:48 ` [PATCH v9 3.2 3/9] uprobes: slot allocation Srikar Dronamraju
@ 2012-01-10 11:49 ` Srikar Dronamraju
  2012-01-10 11:49 ` [PATCH v9 3.2 5/9] tracing: modify is_delete, is_return from ints to bool Srikar Dronamraju
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:49 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


Maintain a per-mm counter (number of uprobes that are inserted on
this process address space). This counter can be used at probe hit
time to determine if we need to do a uprobe lookup in the uprobes
rbtree. Everytime a probe gets inserted successfully, the probe
count is incremented and everytime a probe gets removed successfully
the probe count is removed.

A new hook munmap_uprobe is added to keep the counter to be correct
even when a region is unmapped or remapped. This patch expects that
once a munmap_uprobe() is called, the vma either go away or a
subsequent mmap_uprobe gets called before a removal of a probe from
unregister_uprobe in the same address space.

On every executable vma thats cowed at fork, mmap_uprobe is called
so that the mm_uprobes_count is in sync.

When a vma of interest is mapped, insert the breakpoint at the right
address. Upon munmap, just make sure the data structures are
adjusted/cleaned up.

On process creation, make sure the probes count in the child is set
correctly.

Special cases that are taken care include:
a. mremap()
b. VM_DONTCOPY vmas on fork()
c. insertion/removal races in the parent during fork().

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog (since v5)
- While forking, handle vma's that have VM_DONTCOPY.
- While forking, handle race of new breakpoints being inserted / removed
  in the parent process.

Changelog (since v7)
- Separate this patch from the patch that implements mmap_uprobe.
- Increment counter only after verifying that the probe lies underneath.

 include/linux/mm_types.h |    1 
 include/linux/uprobes.h  |    4 ++
 kernel/fork.c            |    4 ++
 kernel/uprobes.c         |  103 ++++++++++++++++++++++++++++++++++++++++++++--
 mm/mmap.c                |   10 ++++
 5 files changed, 117 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e9eac38..2595c9c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -391,6 +391,7 @@ struct mm_struct {
 	struct cpumask cpumask_allocation;
 #endif
 #ifdef CONFIG_UPROBES
+	atomic_t mm_uprobes_count;
 	struct uprobes_xol_area *uprobes_xol_area;
 #endif
 };
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 9e25c41..adaa8d3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -126,6 +126,7 @@ extern void unregister_uprobe(struct inode *inode, loff_t offset,
 extern void free_uprobe_utask(struct task_struct *tsk);
 extern void free_uprobes_xol_area(struct mm_struct *mm);
 extern int mmap_uprobe(struct vm_area_struct *vma);
+extern void munmap_uprobe(struct vm_area_struct *vma);
 extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
 extern int uprobe_post_notifier(struct pt_regs *regs);
 extern int uprobe_bkpt_notifier(struct pt_regs *regs);
@@ -146,6 +147,9 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
 {
 	return 0;
 }
+static inline void munmap_uprobe(struct vm_area_struct *vma)
+{
+}
 static inline void uprobe_notify_resume(struct pt_regs *regs)
 {
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 2da316e..2b3a5e0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -417,6 +417,9 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 
 		if (retval)
 			goto out;
+
+		if (file && mmap_uprobe(tmp))
+			goto out;
 	}
 	/* a new mm has just been created */
 	arch_dup_mmap(oldmm, mm);
@@ -738,6 +741,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 	mm->pmd_huge_pte = NULL;
 #endif
 #ifdef CONFIG_UPROBES
+	atomic_set(&mm->mm_uprobes_count, 0);
 	mm->uprobes_xol_area = NULL;
 #endif
 
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 73d223e..21ce23c 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -615,6 +615,30 @@ static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma,
 	return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset);
 }
 
+/*
+ * How mm_uprobes_count gets updated
+ * mmap_uprobe() increments the count if
+ * 	- it successfully adds a breakpoint.
+ * 	- it not add a breakpoint, but sees that there is a underlying
+ * 	  breakpoint (via a is_bkpt_at_addr()).
+ *
+ * munmap_uprobe() decrements the count if
+ * 	- it sees a underlying breakpoint, (via is_bkpt_at_addr)
+ * 	- Subsequent unregister_uprobe wouldnt find the breakpoint
+ * 	  unless a mmap_uprobe kicks in, since the old vma would be
+ * 	  dropped just after munmap_uprobe.
+ *
+ * register_uprobe increments the count if:
+ * 	- it successfully adds a breakpoint.
+ *
+ * unregister_uprobe decrements the count if:
+ * 	- it sees a underlying breakpoint and removes successfully.
+ * 			(via is_bkpt_at_addr)
+ * 	- Subsequent munmap_uprobe wouldnt find the breakpoint
+ * 	  since there is no underlying breakpoint after the
+ * 	  breakpoint removal.
+ */
+
 static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 				struct vm_area_struct *vma, loff_t vaddr)
 {
@@ -646,7 +670,19 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 
 		uprobe->flags |= UPROBES_COPY_INSN;
 	}
+
+	/*
+	 * Ideally, should be updating the probe count after the breakpoint
+	 * has been successfully inserted. However a thread could hit the
+	 * breakpoint we just inserted even before the probe count is
+	 * incremented. If this is the first breakpoint placed, breakpoint
+	 * notifier might ignore uprobes and pass the trap to the thread.
+	 * Hence increment before and decrement on failure.
+	 */
+	atomic_inc(&mm->mm_uprobes_count);
 	ret = set_bkpt(mm, uprobe, addr);
+	if (ret)
+		atomic_dec(&mm->mm_uprobes_count);
 
 	return ret;
 }
@@ -654,7 +690,8 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 							loff_t vaddr)
 {
-	set_orig_insn(mm, uprobe, (unsigned long)vaddr, true);
+	if (!set_orig_insn(mm, uprobe, (unsigned long)vaddr, true))
+		atomic_dec(&mm->mm_uprobes_count);
 }
 
 /*
@@ -960,7 +997,7 @@ int mmap_uprobe(struct vm_area_struct *vma)
 	struct list_head tmp_list;
 	struct uprobe *uprobe, *u;
 	struct inode *inode;
-	int ret = 0;
+	int ret = 0, count = 0;
 
 	if (!atomic_read(&uprobe_events) || !valid_vma(vma, true))
 		return ret;	/* Bail-out */
@@ -984,17 +1021,74 @@ int mmap_uprobe(struct vm_area_struct *vma)
 			}
 			ret = install_breakpoint(vma->vm_mm, uprobe, vma,
 								vaddr);
-			if (ret == -EEXIST)
+			if (ret == -EEXIST) {
 				ret = 0;
+				if (!is_bkpt_at_addr(vma->vm_mm, vaddr))
+					continue;
+
+				/*
+				 * Unable to insert a breakpoint, but
+				 * breakpoint lies underneath. Increment the
+				 * probe count.
+				 */
+				atomic_inc(&vma->vm_mm->mm_uprobes_count);
+			}
+			if (!ret)
+				count++;
+
 		}
 		put_uprobe(uprobe);
 	}
 
 	mutex_unlock(uprobes_mmap_hash(inode));
+	if (ret)
+		atomic_sub(count, &vma->vm_mm->mm_uprobes_count);
 
 	return ret;
 }
 
+/*
+ * Called in context of a munmap of a vma.
+ */
+void munmap_uprobe(struct vm_area_struct *vma)
+{
+	struct list_head tmp_list;
+	struct uprobe *uprobe, *u;
+	struct inode *inode;
+
+	if (!atomic_read(&uprobe_events) || !valid_vma(vma, false))
+		return;		/* Bail-out */
+
+	if (!atomic_read(&vma->vm_mm->mm_uprobes_count))
+		return;
+
+	inode = vma->vm_file->f_mapping->host;
+	if (!inode)
+		return;
+
+	INIT_LIST_HEAD(&tmp_list);
+	mutex_lock(uprobes_mmap_hash(inode));
+	build_probe_list(inode, &tmp_list);
+	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
+		loff_t vaddr;
+
+		list_del(&uprobe->pending_list);
+		vaddr = vma_address(vma, uprobe->offset);
+		if (vaddr >= vma->vm_start && vaddr < vma->vm_end) {
+
+			/*
+			 * An unregister could have removed the probe before
+			 * unmap. So check before we decrement the count.
+			 */
+			if (is_bkpt_at_addr(vma->vm_mm, vaddr) == 1)
+				atomic_dec(&vma->vm_mm->mm_uprobes_count);
+		}
+		put_uprobe(uprobe);
+	}
+	mutex_unlock(uprobes_mmap_hash(inode));
+	return;
+}
+
 /* Slot allocation for XOL */
 static int xol_add_vma(struct uprobes_xol_area *area)
 {
@@ -1397,7 +1491,8 @@ int uprobe_bkpt_notifier(struct pt_regs *regs)
 {
 	struct uprobe_task *utask;
 
-	if (!current->mm)
+	if (!current->mm || !atomic_read(&current->mm->mm_uprobes_count))
+		/* task is currently not uprobed */
 		return 0;
 
 	utask = current->utask;
diff --git a/mm/mmap.c b/mm/mmap.c
index 0966418..83813fa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -218,6 +218,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
 		mutex_lock(&mapping->i_mmap_mutex);
 		__remove_shared_vm_struct(vma, file, mapping);
 		mutex_unlock(&mapping->i_mmap_mutex);
+		munmap_uprobe(vma);
 	}
 }
 
@@ -546,8 +547,14 @@ again:			remove_next = 1 + (end > next->vm_end);
 
 	if (file) {
 		mapping = file->f_mapping;
-		if (!(vma->vm_flags & VM_NONLINEAR))
+		if (!(vma->vm_flags & VM_NONLINEAR)) {
 			root = &mapping->i_mmap;
+			munmap_uprobe(vma);
+
+			if (adjust_next)
+				munmap_uprobe(next);
+		}
+
 		mutex_lock(&mapping->i_mmap_mutex);
 		if (insert) {
 			/*
@@ -626,6 +633,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 
 	if (remove_next) {
 		if (file) {
+			munmap_uprobe(next);
 			fput(file);
 			if (next->vm_flags & VM_EXECUTABLE)
 				removed_exe_file_vma(mm);


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 3.2 5/9] tracing: modify is_delete, is_return from ints to bool.
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2012-01-10 11:49 ` [PATCH v9 3.2 4/9] uprobes: counter to optimize probe hits Srikar Dronamraju
@ 2012-01-10 11:49 ` Srikar Dronamraju
  2012-01-10 11:49 ` [PATCH v9 3.2 6/9] tracing: Extract out common code for kprobes/uprobes traceevents Srikar Dronamraju
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:49 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


is_delete and is_return can take atmost 2 values and
are better of being a boolean than a int.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5):
- extracted from the next patch on Masami's suggestion.

 kernel/trace/trace_kprobe.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 00d527c..2490dd1 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -651,7 +651,7 @@ static struct trace_probe *alloc_trace_probe(const char *group,
 					     void *addr,
 					     const char *symbol,
 					     unsigned long offs,
-					     int nargs, int is_return)
+					     int nargs, bool is_return)
 {
 	struct trace_probe *tp;
 	int ret = -ENOMEM;
@@ -944,7 +944,7 @@ static int split_symbol_offset(char *symbol, unsigned long *offset)
 #define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
 
 static int parse_probe_vars(char *arg, const struct fetch_type *t,
-			    struct fetch_param *f, int is_return)
+			    struct fetch_param *f, bool is_return)
 {
 	int ret = 0;
 	unsigned long param;
@@ -977,7 +977,7 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,
 
 /* Recursive argument parser */
 static int __parse_probe_arg(char *arg, const struct fetch_type *t,
-			     struct fetch_param *f, int is_return)
+			     struct fetch_param *f, bool is_return)
 {
 	int ret = 0;
 	unsigned long param;
@@ -1089,7 +1089,7 @@ static int __parse_bitfield_probe_arg(const char *bf,
 
 /* String length checking wrapper */
 static int parse_probe_arg(char *arg, struct trace_probe *tp,
-			   struct probe_arg *parg, int is_return)
+			   struct probe_arg *parg, bool is_return)
 {
 	const char *t;
 	int ret;
@@ -1162,7 +1162,7 @@ static int create_trace_probe(int argc, char **argv)
 	 */
 	struct trace_probe *tp;
 	int i, ret = 0;
-	int is_return = 0, is_delete = 0;
+	bool is_return = false, is_delete = false;
 	char *symbol = NULL, *event = NULL, *group = NULL;
 	char *arg;
 	unsigned long offset = 0;
@@ -1171,11 +1171,11 @@ static int create_trace_probe(int argc, char **argv)
 
 	/* argc must be >= 1 */
 	if (argv[0][0] == 'p')
-		is_return = 0;
+		is_return = false;
 	else if (argv[0][0] == 'r')
-		is_return = 1;
+		is_return = true;
 	else if (argv[0][0] == '-')
-		is_delete = 1;
+		is_delete = true;
 	else {
 		pr_info("Probe definition must be started with 'p', 'r' or"
 			" '-'.\n");


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 3.2 6/9] tracing: Extract out common code for kprobes/uprobes traceevents.
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2012-01-10 11:49 ` [PATCH v9 3.2 5/9] tracing: modify is_delete, is_return from ints to bool Srikar Dronamraju
@ 2012-01-10 11:49 ` Srikar Dronamraju
  2012-01-10 11:49 ` [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface Srikar Dronamraju
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:49 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


Move parts of trace_kprobe.c that can be shared with upcoming
trace_uprobe.c. Common code to kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog (since v5)
- Extracted out int to bool changes to a separate patch.
- Fix a bug in kprobe_trace_self_tests_init that was introduced
  in previous patchset.

Changelog (since v7)
- Modified Copyright as suggested by Steven Rostedt.

 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/trace_kprobe.c |  889 +------------------------------------------
 kernel/trace/trace_probe.c  |  780 ++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_probe.h  |  161 ++++++++
 5 files changed, 964 insertions(+), 871 deletions(-)
 create mode 100644 kernel/trace/trace_probe.c
 create mode 100644 kernel/trace/trace_probe.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..520106a 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -373,6 +373,7 @@ config KPROBE_EVENT
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
 	bool "Enable kprobes-based dynamic events"
 	select TRACING
+	select PROBE_EVENTS
 	default y
 	help
 	  This allows the user to add tracing events (similar to tracepoints)
@@ -385,6 +386,9 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config PROBE_EVENTS
+	def_bool n
+
 config DYNAMIC_FTRACE
 	bool "enable/disable ftrace tracepoints dynamically"
 	depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 5f39a07..fa10d5c 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -61,5 +61,6 @@ endif
 ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
+obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 2490dd1..967e634 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,547 +19,15 @@
 
 #include <linux/module.h>
 #include <linux/uaccess.h>
-#include <linux/kprobes.h>
-#include <linux/seq_file.h>
-#include <linux/slab.h>
-#include <linux/smp.h>
-#include <linux/debugfs.h>
-#include <linux/types.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/ptrace.h>
-#include <linux/perf_event.h>
-#include <linux/stringify.h>
-#include <linux/limits.h>
-#include <asm/bitsperlong.h>
-
-#include "trace.h"
-#include "trace_output.h"
-
-#define MAX_TRACE_ARGS 128
-#define MAX_ARGSTR_LEN 63
-#define MAX_EVENT_NAME_LEN 64
-#define MAX_STRING_SIZE PATH_MAX
-#define KPROBE_EVENT_SYSTEM "kprobes"
-
-/* Reserved field names */
-#define FIELD_STRING_IP "__probe_ip"
-#define FIELD_STRING_RETIP "__probe_ret_ip"
-#define FIELD_STRING_FUNC "__probe_func"
-
-const char *reserved_field_names[] = {
-	"common_type",
-	"common_flags",
-	"common_preempt_count",
-	"common_pid",
-	"common_tgid",
-	FIELD_STRING_IP,
-	FIELD_STRING_RETIP,
-	FIELD_STRING_FUNC,
-};
-
-/* Printing function type */
-typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
-				 void *);
-#define PRINT_TYPE_FUNC_NAME(type)	print_type_##type
-#define PRINT_TYPE_FMT_NAME(type)	print_type_format_##type
-
-/* Printing  in basic type function template */
-#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast)			\
-static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s,	\
-						const char *name,	\
-						void *data, void *ent)\
-{									\
-	return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
-}									\
-static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
-
-DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
-
-/* data_rloc: data relative location, compatible with u32 */
-#define make_data_rloc(len, roffs)	\
-	(((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
-#define get_rloc_len(dl)	((u32)(dl) >> 16)
-#define get_rloc_offs(dl)	((u32)(dl) & 0xffff)
-
-static inline void *get_rloc_data(u32 *dl)
-{
-	return (u8 *)dl + get_rloc_offs(*dl);
-}
-
-/* For data_loc conversion */
-static inline void *get_loc_data(u32 *dl, void *ent)
-{
-	return (u8 *)ent + get_rloc_offs(*dl);
-}
-
-/*
- * Convert data_rloc to data_loc:
- *  data_rloc stores the offset from data_rloc itself, but data_loc
- *  stores the offset from event entry.
- */
-#define convert_rloc_to_loc(dl, offs)	((u32)(dl) + (offs))
-
-/* For defining macros, define string/string_size types */
-typedef u32 string;
-typedef u32 string_size;
-
-/* Print type function for string type */
-static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
-						  const char *name,
-						  void *data, void *ent)
-{
-	int len = *(u32 *)data >> 16;
-
-	if (!len)
-		return trace_seq_printf(s, " %s=(fault)", name);
-	else
-		return trace_seq_printf(s, " %s=\"%s\"", name,
-					(const char *)get_loc_data(data, ent));
-}
-static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
-
-/* Data fetch function type */
-typedef	void (*fetch_func_t)(struct pt_regs *, void *, void *);
-
-struct fetch_param {
-	fetch_func_t	fn;
-	void *data;
-};
-
-static __kprobes void call_fetch(struct fetch_param *fprm,
-				 struct pt_regs *regs, void *dest)
-{
-	return fprm->fn(regs, fprm->data, dest);
-}
-
-#define FETCH_FUNC_NAME(method, type)	fetch_##method##_##type
-/*
- * Define macro for basic types - we don't need to define s* types, because
- * we have to care only about bitwidth at recording time.
- */
-#define DEFINE_BASIC_FETCH_FUNCS(method) \
-DEFINE_FETCH_##method(u8)		\
-DEFINE_FETCH_##method(u16)		\
-DEFINE_FETCH_##method(u32)		\
-DEFINE_FETCH_##method(u64)
-
-#define CHECK_FETCH_FUNCS(method, fn)			\
-	(((FETCH_FUNC_NAME(method, u8) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u16) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u32) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u64) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, string) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, string_size) == fn)) \
-	 && (fn != NULL))
-
-/* Data fetch function templates */
-#define DEFINE_FETCH_reg(type)						\
-static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs,	\
-					void *offset, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_get_register(regs,			\
-				(unsigned int)((unsigned long)offset));	\
-}
-DEFINE_BASIC_FETCH_FUNCS(reg)
-/* No string on the register */
-#define fetch_reg_string NULL
-#define fetch_reg_string_size NULL
-
-#define DEFINE_FETCH_stack(type)					\
-static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
-					  void *offset, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_get_kernel_stack_nth(regs,		\
-				(unsigned int)((unsigned long)offset));	\
-}
-DEFINE_BASIC_FETCH_FUNCS(stack)
-/* No string on the stack entry */
-#define fetch_stack_string NULL
-#define fetch_stack_string_size NULL
-
-#define DEFINE_FETCH_retval(type)					\
-static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
-					  void *dummy, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_return_value(regs);			\
-}
-DEFINE_BASIC_FETCH_FUNCS(retval)
-/* No string on the retval */
-#define fetch_retval_string NULL
-#define fetch_retval_string_size NULL
-
-#define DEFINE_FETCH_memory(type)					\
-static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
-					  void *addr, void *dest)	\
-{									\
-	type retval;							\
-	if (probe_kernel_address(addr, retval))				\
-		*(type *)dest = 0;					\
-	else								\
-		*(type *)dest = retval;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(memory)
-/*
- * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
- * length and relative data location.
- */
-static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
-						      void *addr, void *dest)
-{
-	long ret;
-	int maxlen = get_rloc_len(*(u32 *)dest);
-	u8 *dst = get_rloc_data(dest);
-	u8 *src = addr;
-	mm_segment_t old_fs = get_fs();
-	if (!maxlen)
-		return;
-	/*
-	 * Try to get string again, since the string can be changed while
-	 * probing.
-	 */
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-	do
-		ret = __copy_from_user_inatomic(dst++, src++, 1);
-	while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
-	dst[-1] = '\0';
-	pagefault_enable();
-	set_fs(old_fs);
-
-	if (ret < 0) {	/* Failed to fetch string */
-		((u8 *)get_rloc_data(dest))[0] = '\0';
-		*(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
-	} else
-		*(u32 *)dest = make_data_rloc(src - (u8 *)addr,
-					      get_rloc_offs(*(u32 *)dest));
-}
-/* Return the length of string -- including null terminal byte */
-static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
-							void *addr, void *dest)
-{
-	int ret, len = 0;
-	u8 c;
-	mm_segment_t old_fs = get_fs();
-
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-	do {
-		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
-		len++;
-	} while (c && ret == 0 && len < MAX_STRING_SIZE);
-	pagefault_enable();
-	set_fs(old_fs);
-
-	if (ret < 0)	/* Failed to check the length */
-		*(u32 *)dest = 0;
-	else
-		*(u32 *)dest = len;
-}
-
-/* Memory fetching by symbol */
-struct symbol_cache {
-	char *symbol;
-	long offset;
-	unsigned long addr;
-};
-
-static unsigned long update_symbol_cache(struct symbol_cache *sc)
-{
-	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
-	if (sc->addr)
-		sc->addr += sc->offset;
-	return sc->addr;
-}
-
-static void free_symbol_cache(struct symbol_cache *sc)
-{
-	kfree(sc->symbol);
-	kfree(sc);
-}
-
-static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
-{
-	struct symbol_cache *sc;
-
-	if (!sym || strlen(sym) == 0)
-		return NULL;
-	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
-	if (!sc)
-		return NULL;
-
-	sc->symbol = kstrdup(sym, GFP_KERNEL);
-	if (!sc->symbol) {
-		kfree(sc);
-		return NULL;
-	}
-	sc->offset = offset;
 
-	update_symbol_cache(sc);
-	return sc;
-}
-
-#define DEFINE_FETCH_symbol(type)					\
-static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
-					  void *data, void *dest)	\
-{									\
-	struct symbol_cache *sc = data;					\
-	if (sc->addr)							\
-		fetch_memory_##type(regs, (void *)sc->addr, dest);	\
-	else								\
-		*(type *)dest = 0;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(symbol)
-DEFINE_FETCH_symbol(string)
-DEFINE_FETCH_symbol(string_size)
-
-/* Dereference memory access function */
-struct deref_fetch_param {
-	struct fetch_param orig;
-	long offset;
-};
-
-#define DEFINE_FETCH_deref(type)					\
-static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
-					    void *data, void *dest)	\
-{									\
-	struct deref_fetch_param *dprm = data;				\
-	unsigned long addr;						\
-	call_fetch(&dprm->orig, regs, &addr);				\
-	if (addr) {							\
-		addr += dprm->offset;					\
-		fetch_memory_##type(regs, (void *)addr, dest);		\
-	} else								\
-		*(type *)dest = 0;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(deref)
-DEFINE_FETCH_deref(string)
-DEFINE_FETCH_deref(string_size)
-
-static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data)
-{
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		update_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		update_symbol_cache(data->orig.data);
-}
-
-static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
-{
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		free_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		free_symbol_cache(data->orig.data);
-	kfree(data);
-}
-
-/* Bitfield fetch function */
-struct bitfield_fetch_param {
-	struct fetch_param orig;
-	unsigned char hi_shift;
-	unsigned char low_shift;
-};
+#include "trace_probe.h"
 
-#define DEFINE_FETCH_bitfield(type)					\
-static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
-					    void *data, void *dest)	\
-{									\
-	struct bitfield_fetch_param *bprm = data;			\
-	type buf = 0;							\
-	call_fetch(&bprm->orig, regs, &buf);				\
-	if (buf) {							\
-		buf <<= bprm->hi_shift;					\
-		buf >>= bprm->low_shift;				\
-	}								\
-	*(type *)dest = buf;						\
-}
-DEFINE_BASIC_FETCH_FUNCS(bitfield)
-#define fetch_bitfield_string NULL
-#define fetch_bitfield_string_size NULL
-
-static __kprobes void
-update_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
-	/*
-	 * Don't check the bitfield itself, because this must be the
-	 * last fetch function.
-	 */
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		update_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		update_symbol_cache(data->orig.data);
-}
-
-static __kprobes void
-free_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
-	/*
-	 * Don't check the bitfield itself, because this must be the
-	 * last fetch function.
-	 */
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		free_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		free_symbol_cache(data->orig.data);
-	kfree(data);
-}
-
-/* Default (unsigned long) fetch type */
-#define __DEFAULT_FETCH_TYPE(t) u##t
-#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
-#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
-#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
-
-/* Fetch types */
-enum {
-	FETCH_MTD_reg = 0,
-	FETCH_MTD_stack,
-	FETCH_MTD_retval,
-	FETCH_MTD_memory,
-	FETCH_MTD_symbol,
-	FETCH_MTD_deref,
-	FETCH_MTD_bitfield,
-	FETCH_MTD_END,
-};
-
-#define ASSIGN_FETCH_FUNC(method, type)	\
-	[FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
-
-#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype)	\
-	{.name = _name,				\
-	 .size = _size,					\
-	 .is_signed = sign,				\
-	 .print = PRINT_TYPE_FUNC_NAME(ptype),		\
-	 .fmt = PRINT_TYPE_FMT_NAME(ptype),		\
-	 .fmttype = _fmttype,				\
-	 .fetch = {					\
-ASSIGN_FETCH_FUNC(reg, ftype),				\
-ASSIGN_FETCH_FUNC(stack, ftype),			\
-ASSIGN_FETCH_FUNC(retval, ftype),			\
-ASSIGN_FETCH_FUNC(memory, ftype),			\
-ASSIGN_FETCH_FUNC(symbol, ftype),			\
-ASSIGN_FETCH_FUNC(deref, ftype),			\
-ASSIGN_FETCH_FUNC(bitfield, ftype),			\
-	  }						\
-	}
-
-#define ASSIGN_FETCH_TYPE(ptype, ftype, sign)			\
-	__ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
-
-#define FETCH_TYPE_STRING 0
-#define FETCH_TYPE_STRSIZE 1
-
-/* Fetch type information table */
-static const struct fetch_type {
-	const char	*name;		/* Name of type */
-	size_t		size;		/* Byte size of type */
-	int		is_signed;	/* Signed flag */
-	print_type_func_t	print;	/* Print functions */
-	const char	*fmt;		/* Fromat string */
-	const char	*fmttype;	/* Name in format file */
-	/* Fetch functions */
-	fetch_func_t	fetch[FETCH_MTD_END];
-} fetch_type_table[] = {
-	/* Special types */
-	[FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
-					sizeof(u32), 1, "__data_loc char[]"),
-	[FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
-					string_size, sizeof(u32), 0, "u32"),
-	/* Basic types */
-	ASSIGN_FETCH_TYPE(u8,  u8,  0),
-	ASSIGN_FETCH_TYPE(u16, u16, 0),
-	ASSIGN_FETCH_TYPE(u32, u32, 0),
-	ASSIGN_FETCH_TYPE(u64, u64, 0),
-	ASSIGN_FETCH_TYPE(s8,  u8,  1),
-	ASSIGN_FETCH_TYPE(s16, u16, 1),
-	ASSIGN_FETCH_TYPE(s32, u32, 1),
-	ASSIGN_FETCH_TYPE(s64, u64, 1),
-};
-
-static const struct fetch_type *find_fetch_type(const char *type)
-{
-	int i;
-
-	if (!type)
-		type = DEFAULT_FETCH_TYPE_STR;
-
-	/* Special case: bitfield */
-	if (*type == 'b') {
-		unsigned long bs;
-		type = strchr(type, '/');
-		if (!type)
-			goto fail;
-		type++;
-		if (strict_strtoul(type, 0, &bs))
-			goto fail;
-		switch (bs) {
-		case 8:
-			return find_fetch_type("u8");
-		case 16:
-			return find_fetch_type("u16");
-		case 32:
-			return find_fetch_type("u32");
-		case 64:
-			return find_fetch_type("u64");
-		default:
-			goto fail;
-		}
-	}
-
-	for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
-		if (strcmp(type, fetch_type_table[i].name) == 0)
-			return &fetch_type_table[i];
-fail:
-	return NULL;
-}
-
-/* Special function : only accept unsigned long */
-static __kprobes void fetch_stack_address(struct pt_regs *regs,
-					  void *dummy, void *dest)
-{
-	*(unsigned long *)dest = kernel_stack_pointer(regs);
-}
-
-static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
-					    fetch_func_t orig_fn)
-{
-	int i;
-
-	if (type != &fetch_type_table[FETCH_TYPE_STRING])
-		return NULL;	/* Only string type needs size function */
-	for (i = 0; i < FETCH_MTD_END; i++)
-		if (type->fetch[i] == orig_fn)
-			return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
-
-	WARN_ON(1);	/* This should not happen */
-	return NULL;
-}
+#define KPROBE_EVENT_SYSTEM "kprobes"
 
 /**
  * Kprobe event core functions
  */
 
-struct probe_arg {
-	struct fetch_param	fetch;
-	struct fetch_param	fetch_size;
-	unsigned int		offset;	/* Offset from argument entry */
-	const char		*name;	/* Name of this argument */
-	const char		*comm;	/* Command of this argument */
-	const struct fetch_type	*type;	/* Type of this argument */
-};
-
-/* Flags for trace_probe */
-#define TP_FLAG_TRACE	1
-#define TP_FLAG_PROFILE	2
-#define TP_FLAG_REGISTERED 4
-
 struct trace_probe {
 	struct list_head	list;
 	struct kretprobe	rp;	/* Use rp.kp for kprobe use */
@@ -631,18 +99,6 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs);
 static int kretprobe_dispatcher(struct kretprobe_instance *ri,
 				struct pt_regs *regs);
 
-/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
-{
-	if (!isalpha(*name) && *name != '_')
-		return 0;
-	while (*++name != '\0') {
-		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
-			return 0;
-	}
-	return 1;
-}
-
 /*
  * Allocate new trace_probe and initialize it (including kprobes).
  */
@@ -702,34 +158,12 @@ static struct trace_probe *alloc_trace_probe(const char *group,
 	return ERR_PTR(ret);
 }
 
-static void update_probe_arg(struct probe_arg *arg)
-{
-	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
-		update_bitfield_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
-		update_deref_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
-		update_symbol_cache(arg->fetch.data);
-}
-
-static void free_probe_arg(struct probe_arg *arg)
-{
-	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
-		free_bitfield_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
-		free_deref_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
-		free_symbol_cache(arg->fetch.data);
-	kfree(arg->name);
-	kfree(arg->comm);
-}
-
 static void free_trace_probe(struct trace_probe *tp)
 {
 	int i;
 
 	for (i = 0; i < tp->nr_args; i++)
-		free_probe_arg(&tp->args[i]);
+		traceprobe_free_probe_arg(&tp->args[i]);
 
 	kfree(tp->call.class->system);
 	kfree(tp->call.name);
@@ -787,7 +221,7 @@ static int __register_trace_probe(struct trace_probe *tp)
 		return -EINVAL;
 
 	for (i = 0; i < tp->nr_args; i++)
-		update_probe_arg(&tp->args[i]);
+		traceprobe_update_arg(&tp->args[i]);
 
 	/* Set/clear disabled flag according to tp->flag */
 	if (trace_probe_is_enabled(tp))
@@ -919,227 +353,6 @@ static struct notifier_block trace_probe_module_nb = {
 	.priority = 1	/* Invoked after kprobe module callback */
 };
 
-/* Split symbol and offset. */
-static int split_symbol_offset(char *symbol, unsigned long *offset)
-{
-	char *tmp;
-	int ret;
-
-	if (!offset)
-		return -EINVAL;
-
-	tmp = strchr(symbol, '+');
-	if (tmp) {
-		/* skip sign because strict_strtol doesn't accept '+' */
-		ret = strict_strtoul(tmp + 1, 0, offset);
-		if (ret)
-			return ret;
-		*tmp = '\0';
-	} else
-		*offset = 0;
-	return 0;
-}
-
-#define PARAM_MAX_ARGS 16
-#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
-
-static int parse_probe_vars(char *arg, const struct fetch_type *t,
-			    struct fetch_param *f, bool is_return)
-{
-	int ret = 0;
-	unsigned long param;
-
-	if (strcmp(arg, "retval") == 0) {
-		if (is_return)
-			f->fn = t->fetch[FETCH_MTD_retval];
-		else
-			ret = -EINVAL;
-	} else if (strncmp(arg, "stack", 5) == 0) {
-		if (arg[5] == '\0') {
-			if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
-				f->fn = fetch_stack_address;
-			else
-				ret = -EINVAL;
-		} else if (isdigit(arg[5])) {
-			ret = strict_strtoul(arg + 5, 10, &param);
-			if (ret || param > PARAM_MAX_STACK)
-				ret = -EINVAL;
-			else {
-				f->fn = t->fetch[FETCH_MTD_stack];
-				f->data = (void *)param;
-			}
-		} else
-			ret = -EINVAL;
-	} else
-		ret = -EINVAL;
-	return ret;
-}
-
-/* Recursive argument parser */
-static int __parse_probe_arg(char *arg, const struct fetch_type *t,
-			     struct fetch_param *f, bool is_return)
-{
-	int ret = 0;
-	unsigned long param;
-	long offset;
-	char *tmp;
-
-	switch (arg[0]) {
-	case '$':
-		ret = parse_probe_vars(arg + 1, t, f, is_return);
-		break;
-	case '%':	/* named register */
-		ret = regs_query_register_offset(arg + 1);
-		if (ret >= 0) {
-			f->fn = t->fetch[FETCH_MTD_reg];
-			f->data = (void *)(unsigned long)ret;
-			ret = 0;
-		}
-		break;
-	case '@':	/* memory or symbol */
-		if (isdigit(arg[1])) {
-			ret = strict_strtoul(arg + 1, 0, &param);
-			if (ret)
-				break;
-			f->fn = t->fetch[FETCH_MTD_memory];
-			f->data = (void *)param;
-		} else {
-			ret = split_symbol_offset(arg + 1, &offset);
-			if (ret)
-				break;
-			f->data = alloc_symbol_cache(arg + 1, offset);
-			if (f->data)
-				f->fn = t->fetch[FETCH_MTD_symbol];
-		}
-		break;
-	case '+':	/* deref memory */
-		arg++;	/* Skip '+', because strict_strtol() rejects it. */
-	case '-':
-		tmp = strchr(arg, '(');
-		if (!tmp)
-			break;
-		*tmp = '\0';
-		ret = strict_strtol(arg, 0, &offset);
-		if (ret)
-			break;
-		arg = tmp + 1;
-		tmp = strrchr(arg, ')');
-		if (tmp) {
-			struct deref_fetch_param *dprm;
-			const struct fetch_type *t2 = find_fetch_type(NULL);
-			*tmp = '\0';
-			dprm = kzalloc(sizeof(struct deref_fetch_param),
-				       GFP_KERNEL);
-			if (!dprm)
-				return -ENOMEM;
-			dprm->offset = offset;
-			ret = __parse_probe_arg(arg, t2, &dprm->orig,
-						is_return);
-			if (ret)
-				kfree(dprm);
-			else {
-				f->fn = t->fetch[FETCH_MTD_deref];
-				f->data = (void *)dprm;
-			}
-		}
-		break;
-	}
-	if (!ret && !f->fn) {	/* Parsed, but do not find fetch method */
-		pr_info("%s type has no corresponding fetch method.\n",
-			t->name);
-		ret = -EINVAL;
-	}
-	return ret;
-}
-
-#define BYTES_TO_BITS(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
-
-/* Bitfield type needs to be parsed into a fetch function */
-static int __parse_bitfield_probe_arg(const char *bf,
-				      const struct fetch_type *t,
-				      struct fetch_param *f)
-{
-	struct bitfield_fetch_param *bprm;
-	unsigned long bw, bo;
-	char *tail;
-
-	if (*bf != 'b')
-		return 0;
-
-	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
-	if (!bprm)
-		return -ENOMEM;
-	bprm->orig = *f;
-	f->fn = t->fetch[FETCH_MTD_bitfield];
-	f->data = (void *)bprm;
-
-	bw = simple_strtoul(bf + 1, &tail, 0);	/* Use simple one */
-	if (bw == 0 || *tail != '@')
-		return -EINVAL;
-
-	bf = tail + 1;
-	bo = simple_strtoul(bf, &tail, 0);
-	if (tail == bf || *tail != '/')
-		return -EINVAL;
-
-	bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
-	bprm->low_shift = bprm->hi_shift + bo;
-	return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
-}
-
-/* String length checking wrapper */
-static int parse_probe_arg(char *arg, struct trace_probe *tp,
-			   struct probe_arg *parg, bool is_return)
-{
-	const char *t;
-	int ret;
-
-	if (strlen(arg) > MAX_ARGSTR_LEN) {
-		pr_info("Argument is too long.: %s\n",  arg);
-		return -ENOSPC;
-	}
-	parg->comm = kstrdup(arg, GFP_KERNEL);
-	if (!parg->comm) {
-		pr_info("Failed to allocate memory for command '%s'.\n", arg);
-		return -ENOMEM;
-	}
-	t = strchr(parg->comm, ':');
-	if (t) {
-		arg[t - parg->comm] = '\0';
-		t++;
-	}
-	parg->type = find_fetch_type(t);
-	if (!parg->type) {
-		pr_info("Unsupported type: %s\n", t);
-		return -EINVAL;
-	}
-	parg->offset = tp->size;
-	tp->size += parg->type->size;
-	ret = __parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
-	if (ret >= 0 && t != NULL)
-		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
-	if (ret >= 0) {
-		parg->fetch_size.fn = get_fetch_size_function(parg->type,
-							      parg->fetch.fn);
-		parg->fetch_size.data = parg->fetch.data;
-	}
-	return ret;
-}
-
-/* Return 1 if name is reserved or already used by another argument */
-static int conflict_field_name(const char *name,
-			       struct probe_arg *args, int narg)
-{
-	int i;
-	for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
-		if (strcmp(reserved_field_names[i], name) == 0)
-			return 1;
-	for (i = 0; i < narg; i++)
-		if (strcmp(args[i].name, name) == 0)
-			return 1;
-	return 0;
-}
-
 static int create_trace_probe(int argc, char **argv)
 {
 	/*
@@ -1240,7 +453,7 @@ static int create_trace_probe(int argc, char **argv)
 		/* a symbol specified */
 		symbol = argv[1];
 		/* TODO: support .init module functions */
-		ret = split_symbol_offset(symbol, &offset);
+		ret = traceprobe_split_symbol_offset(symbol, &offset);
 		if (ret) {
 			pr_info("Failed to parse symbol.\n");
 			return ret;
@@ -1302,7 +515,8 @@ static int create_trace_probe(int argc, char **argv)
 			goto error;
 		}
 
-		if (conflict_field_name(tp->args[i].name, tp->args, i)) {
+		if (traceprobe_conflict_field_name(tp->args[i].name,
+							tp->args, i)) {
 			pr_info("Argument[%d] name '%s' conflicts with "
 				"another field.\n", i, argv[i]);
 			ret = -EINVAL;
@@ -1310,7 +524,8 @@ static int create_trace_probe(int argc, char **argv)
 		}
 
 		/* Parse fetch argument */
-		ret = parse_probe_arg(arg, tp, &tp->args[i], is_return);
+		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+								is_return);
 		if (ret) {
 			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
 			goto error;
@@ -1412,70 +627,11 @@ static int probes_open(struct inode *inode, struct file *file)
 	return seq_open(file, &probes_seq_op);
 }
 
-static int command_trace_probe(const char *buf)
-{
-	char **argv;
-	int argc = 0, ret = 0;
-
-	argv = argv_split(GFP_KERNEL, buf, &argc);
-	if (!argv)
-		return -ENOMEM;
-
-	if (argc)
-		ret = create_trace_probe(argc, argv);
-
-	argv_free(argv);
-	return ret;
-}
-
-#define WRITE_BUFSIZE 4096
-
 static ssize_t probes_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
-	char *kbuf, *tmp;
-	int ret;
-	size_t done;
-	size_t size;
-
-	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
-	if (!kbuf)
-		return -ENOMEM;
-
-	ret = done = 0;
-	while (done < count) {
-		size = count - done;
-		if (size >= WRITE_BUFSIZE)
-			size = WRITE_BUFSIZE - 1;
-		if (copy_from_user(kbuf, buffer + done, size)) {
-			ret = -EFAULT;
-			goto out;
-		}
-		kbuf[size] = '\0';
-		tmp = strchr(kbuf, '\n');
-		if (tmp) {
-			*tmp = '\0';
-			size = tmp - kbuf + 1;
-		} else if (done + size < count) {
-			pr_warning("Line length is too long: "
-				   "Should be less than %d.", WRITE_BUFSIZE);
-			ret = -EINVAL;
-			goto out;
-		}
-		done += size;
-		/* Remove comments */
-		tmp = strchr(kbuf, '#');
-		if (tmp)
-			*tmp = '\0';
-
-		ret = command_trace_probe(kbuf);
-		if (ret)
-			goto out;
-	}
-	ret = done;
-out:
-	kfree(kbuf);
-	return ret;
+	return traceprobe_probes_write(file, buffer, count, ppos,
+			create_trace_probe);
 }
 
 static const struct file_operations kprobe_events_ops = {
@@ -1711,16 +867,6 @@ print_kretprobe_event(struct trace_iterator *iter, int flags,
 	return TRACE_TYPE_PARTIAL_LINE;
 }
 
-#undef DEFINE_FIELD
-#define DEFINE_FIELD(type, item, name, is_signed)			\
-	do {								\
-		ret = trace_define_field(event_call, #type, name,	\
-					 offsetof(typeof(field), item),	\
-					 sizeof(field.item), is_signed, \
-					 FILTER_OTHER);			\
-		if (ret)						\
-			return ret;					\
-	} while (0)
 
 static int kprobe_event_define_fields(struct ftrace_event_call *event_call)
 {
@@ -2045,8 +1191,9 @@ static __init int kprobe_trace_self_tests_init(void)
 
 	pr_info("Testing kprobe tracing: ");
 
-	ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target "
-				  "$stack $stack0 +0($stack)");
+	ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target "
+				  "$stack $stack0 +0($stack)",
+				  create_trace_probe);
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on probing function entry.\n");
 		warn++;
@@ -2060,8 +1207,8 @@ static __init int kprobe_trace_self_tests_init(void)
 			enable_trace_probe(tp, TP_FLAG_TRACE);
 	}
 
-	ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target "
-				  "$retval");
+	ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target "
+				  "$retval", create_trace_probe);
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on probing function return.\n");
 		warn++;
@@ -2095,13 +1242,13 @@ static __init int kprobe_trace_self_tests_init(void)
 	} else
 		disable_trace_probe(tp, TP_FLAG_TRACE);
 
-	ret = command_trace_probe("-:testprobe");
+	ret = traceprobe_command("-:testprobe", create_trace_probe);
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on deleting a probe.\n");
 		warn++;
 	}
 
-	ret = command_trace_probe("-:testprobe2");
+	ret = traceprobe_command("-:testprobe2", create_trace_probe);
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on deleting a probe.\n");
 		warn++;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
new file mode 100644
index 0000000..e79a9e5
--- /dev/null
+++ b/kernel/trace/trace_probe.c
@@ -0,0 +1,780 @@
+/*
+ * Common code for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * This code was copied from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
+ *
+ * Updates to make this generic:
+ * Copyright (C) IBM Corporation, 2010-2011
+ * Author:     Srikar Dronamraju
+ */
+
+#include "trace_probe.h"
+
+const char *reserved_field_names[] = {
+	"common_type",
+	"common_flags",
+	"common_preempt_count",
+	"common_pid",
+	"common_tgid",
+	FIELD_STRING_IP,
+	FIELD_STRING_RETIP,
+	FIELD_STRING_FUNC,
+};
+
+/* Printing function type */
+#define PRINT_TYPE_FUNC_NAME(type)	print_type_##type
+#define PRINT_TYPE_FMT_NAME(type)	print_type_format_##type
+
+/* Printing  in basic type function template */
+#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast)			\
+static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s,	\
+						const char *name,	\
+						void *data, void *ent)\
+{									\
+	return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
+}									\
+static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
+
+DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
+
+static inline void *get_rloc_data(u32 *dl)
+{
+	return (u8 *)dl + get_rloc_offs(*dl);
+}
+
+/* For data_loc conversion */
+static inline void *get_loc_data(u32 *dl, void *ent)
+{
+	return (u8 *)ent + get_rloc_offs(*dl);
+}
+
+/* For defining macros, define string/string_size types */
+typedef u32 string;
+typedef u32 string_size;
+
+/* Print type function for string type */
+static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
+						  const char *name,
+						  void *data, void *ent)
+{
+	int len = *(u32 *)data >> 16;
+
+	if (!len)
+		return trace_seq_printf(s, " %s=(fault)", name);
+	else
+		return trace_seq_printf(s, " %s=\"%s\"", name,
+					(const char *)get_loc_data(data, ent));
+}
+
+static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
+
+#define FETCH_FUNC_NAME(method, type)	fetch_##method##_##type
+/*
+ * Define macro for basic types - we don't need to define s* types, because
+ * we have to care only about bitwidth at recording time.
+ */
+#define DEFINE_BASIC_FETCH_FUNCS(method) \
+DEFINE_FETCH_##method(u8)		\
+DEFINE_FETCH_##method(u16)		\
+DEFINE_FETCH_##method(u32)		\
+DEFINE_FETCH_##method(u64)
+
+#define CHECK_FETCH_FUNCS(method, fn)			\
+	(((FETCH_FUNC_NAME(method, u8) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u16) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u32) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u64) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, string) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, string_size) == fn)) \
+	 && (fn != NULL))
+
+/* Data fetch function templates */
+#define DEFINE_FETCH_reg(type)						\
+static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs,	\
+					void *offset, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_get_register(regs,			\
+				(unsigned int)((unsigned long)offset));	\
+}
+DEFINE_BASIC_FETCH_FUNCS(reg)
+/* No string on the register */
+#define fetch_reg_string NULL
+#define fetch_reg_string_size NULL
+
+#define DEFINE_FETCH_stack(type)					\
+static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
+					  void *offset, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_get_kernel_stack_nth(regs,		\
+				(unsigned int)((unsigned long)offset));	\
+}
+DEFINE_BASIC_FETCH_FUNCS(stack)
+/* No string on the stack entry */
+#define fetch_stack_string NULL
+#define fetch_stack_string_size NULL
+
+#define DEFINE_FETCH_retval(type)					\
+static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
+					  void *dummy, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_return_value(regs);			\
+}
+DEFINE_BASIC_FETCH_FUNCS(retval)
+/* No string on the retval */
+#define fetch_retval_string NULL
+#define fetch_retval_string_size NULL
+
+#define DEFINE_FETCH_memory(type)					\
+static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
+					  void *addr, void *dest)	\
+{									\
+	type retval;							\
+	if (probe_kernel_address(addr, retval))				\
+		*(type *)dest = 0;					\
+	else								\
+		*(type *)dest = retval;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(memory)
+/*
+ * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
+ * length and relative data location.
+ */
+static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
+						      void *addr, void *dest)
+{
+	long ret;
+	int maxlen = get_rloc_len(*(u32 *)dest);
+	u8 *dst = get_rloc_data(dest);
+	u8 *src = addr;
+	mm_segment_t old_fs = get_fs();
+	if (!maxlen)
+		return;
+	/*
+	 * Try to get string again, since the string can be changed while
+	 * probing.
+	 */
+	set_fs(KERNEL_DS);
+	pagefault_disable();
+	do
+		ret = __copy_from_user_inatomic(dst++, src++, 1);
+	while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
+	dst[-1] = '\0';
+	pagefault_enable();
+	set_fs(old_fs);
+
+	if (ret < 0) {	/* Failed to fetch string */
+		((u8 *)get_rloc_data(dest))[0] = '\0';
+		*(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
+	} else
+		*(u32 *)dest = make_data_rloc(src - (u8 *)addr,
+					      get_rloc_offs(*(u32 *)dest));
+}
+
+/* Return the length of string -- including null terminal byte */
+static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
+							void *addr, void *dest)
+{
+	int ret, len = 0;
+	u8 c;
+	mm_segment_t old_fs = get_fs();
+
+	set_fs(KERNEL_DS);
+	pagefault_disable();
+	do {
+		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+		len++;
+	} while (c && ret == 0 && len < MAX_STRING_SIZE);
+	pagefault_enable();
+	set_fs(old_fs);
+
+	if (ret < 0)	/* Failed to check the length */
+		*(u32 *)dest = 0;
+	else
+		*(u32 *)dest = len;
+}
+
+/* Memory fetching by symbol */
+struct symbol_cache {
+	char *symbol;
+	long offset;
+	unsigned long addr;
+};
+
+static unsigned long update_symbol_cache(struct symbol_cache *sc)
+{
+	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
+	if (sc->addr)
+		sc->addr += sc->offset;
+	return sc->addr;
+}
+
+static void free_symbol_cache(struct symbol_cache *sc)
+{
+	kfree(sc->symbol);
+	kfree(sc);
+}
+
+static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
+{
+	struct symbol_cache *sc;
+
+	if (!sym || strlen(sym) == 0)
+		return NULL;
+	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
+	if (!sc)
+		return NULL;
+
+	sc->symbol = kstrdup(sym, GFP_KERNEL);
+	if (!sc->symbol) {
+		kfree(sc);
+		return NULL;
+	}
+	sc->offset = offset;
+
+	update_symbol_cache(sc);
+	return sc;
+}
+
+#define DEFINE_FETCH_symbol(type)					\
+static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
+					  void *data, void *dest)	\
+{									\
+	struct symbol_cache *sc = data;					\
+	if (sc->addr)							\
+		fetch_memory_##type(regs, (void *)sc->addr, dest);	\
+	else								\
+		*(type *)dest = 0;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(symbol)
+DEFINE_FETCH_symbol(string)
+DEFINE_FETCH_symbol(string_size)
+
+/* Dereference memory access function */
+struct deref_fetch_param {
+	struct fetch_param orig;
+	long offset;
+};
+
+#define DEFINE_FETCH_deref(type)					\
+static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
+					    void *data, void *dest)	\
+{									\
+	struct deref_fetch_param *dprm = data;				\
+	unsigned long addr;						\
+	call_fetch(&dprm->orig, regs, &addr);				\
+	if (addr) {							\
+		addr += dprm->offset;					\
+		fetch_memory_##type(regs, (void *)addr, dest);		\
+	} else								\
+		*(type *)dest = 0;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(deref)
+DEFINE_FETCH_deref(string)
+DEFINE_FETCH_deref(string_size)
+
+static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data)
+{
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		update_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		update_symbol_cache(data->orig.data);
+}
+
+static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
+{
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		free_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		free_symbol_cache(data->orig.data);
+	kfree(data);
+}
+
+/* Bitfield fetch function */
+struct bitfield_fetch_param {
+	struct fetch_param orig;
+	unsigned char hi_shift;
+	unsigned char low_shift;
+};
+
+#define DEFINE_FETCH_bitfield(type)					\
+static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
+					    void *data, void *dest)	\
+{									\
+	struct bitfield_fetch_param *bprm = data;			\
+	type buf = 0;							\
+	call_fetch(&bprm->orig, regs, &buf);				\
+	if (buf) {							\
+		buf <<= bprm->hi_shift;					\
+		buf >>= bprm->low_shift;				\
+	}								\
+	*(type *)dest = buf;						\
+}
+
+DEFINE_BASIC_FETCH_FUNCS(bitfield)
+#define fetch_bitfield_string NULL
+#define fetch_bitfield_string_size NULL
+
+static __kprobes void
+update_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+	/*
+	 * Don't check the bitfield itself, because this must be the
+	 * last fetch function.
+	 */
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		update_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		update_symbol_cache(data->orig.data);
+}
+
+static __kprobes void
+free_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+	/*
+	 * Don't check the bitfield itself, because this must be the
+	 * last fetch function.
+	 */
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		free_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		free_symbol_cache(data->orig.data);
+	kfree(data);
+}
+
+/* Default (unsigned long) fetch type */
+#define __DEFAULT_FETCH_TYPE(t) u##t
+#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
+#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
+#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
+
+#define ASSIGN_FETCH_FUNC(method, type)	\
+	[FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
+
+#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype)	\
+	{.name = _name,				\
+	 .size = _size,					\
+	 .is_signed = sign,				\
+	 .print = PRINT_TYPE_FUNC_NAME(ptype),		\
+	 .fmt = PRINT_TYPE_FMT_NAME(ptype),		\
+	 .fmttype = _fmttype,				\
+	 .fetch = {					\
+ASSIGN_FETCH_FUNC(reg, ftype),				\
+ASSIGN_FETCH_FUNC(stack, ftype),			\
+ASSIGN_FETCH_FUNC(retval, ftype),			\
+ASSIGN_FETCH_FUNC(memory, ftype),			\
+ASSIGN_FETCH_FUNC(symbol, ftype),			\
+ASSIGN_FETCH_FUNC(deref, ftype),			\
+ASSIGN_FETCH_FUNC(bitfield, ftype),			\
+	  }						\
+	}
+
+#define ASSIGN_FETCH_TYPE(ptype, ftype, sign)			\
+	__ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
+
+#define FETCH_TYPE_STRING 0
+#define FETCH_TYPE_STRSIZE 1
+
+/* Fetch type information table */
+static const struct fetch_type fetch_type_table[] = {
+	/* Special types */
+	[FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
+					sizeof(u32), 1, "__data_loc char[]"),
+	[FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
+					string_size, sizeof(u32), 0, "u32"),
+	/* Basic types */
+	ASSIGN_FETCH_TYPE(u8,  u8,  0),
+	ASSIGN_FETCH_TYPE(u16, u16, 0),
+	ASSIGN_FETCH_TYPE(u32, u32, 0),
+	ASSIGN_FETCH_TYPE(u64, u64, 0),
+	ASSIGN_FETCH_TYPE(s8,  u8,  1),
+	ASSIGN_FETCH_TYPE(s16, u16, 1),
+	ASSIGN_FETCH_TYPE(s32, u32, 1),
+	ASSIGN_FETCH_TYPE(s64, u64, 1),
+};
+
+static const struct fetch_type *find_fetch_type(const char *type)
+{
+	int i;
+
+	if (!type)
+		type = DEFAULT_FETCH_TYPE_STR;
+
+	/* Special case: bitfield */
+	if (*type == 'b') {
+		unsigned long bs;
+		type = strchr(type, '/');
+		if (!type)
+			goto fail;
+		type++;
+		if (strict_strtoul(type, 0, &bs))
+			goto fail;
+		switch (bs) {
+		case 8:
+			return find_fetch_type("u8");
+		case 16:
+			return find_fetch_type("u16");
+		case 32:
+			return find_fetch_type("u32");
+		case 64:
+			return find_fetch_type("u64");
+		default:
+			goto fail;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
+		if (strcmp(type, fetch_type_table[i].name) == 0)
+			return &fetch_type_table[i];
+fail:
+	return NULL;
+}
+
+/* Special function : only accept unsigned long */
+static __kprobes void fetch_stack_address(struct pt_regs *regs,
+					void *dummy, void *dest)
+{
+	*(unsigned long *)dest = kernel_stack_pointer(regs);
+}
+
+static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
+					fetch_func_t orig_fn)
+{
+	int i;
+
+	if (type != &fetch_type_table[FETCH_TYPE_STRING])
+		return NULL;	/* Only string type needs size function */
+	for (i = 0; i < FETCH_MTD_END; i++)
+		if (type->fetch[i] == orig_fn)
+			return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
+
+	WARN_ON(1);	/* This should not happen */
+	return NULL;
+}
+
+/* Split symbol and offset. */
+int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset)
+{
+	char *tmp;
+	int ret;
+
+	if (!offset)
+		return -EINVAL;
+
+	tmp = strchr(symbol, '+');
+	if (tmp) {
+		/* skip sign because strict_strtol doesn't accept '+' */
+		ret = strict_strtoul(tmp + 1, 0, offset);
+		if (ret)
+			return ret;
+		*tmp = '\0';
+	} else
+		*offset = 0;
+	return 0;
+}
+
+#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
+
+static int parse_probe_vars(char *arg, const struct fetch_type *t,
+			    struct fetch_param *f, bool is_return)
+{
+	int ret = 0;
+	unsigned long param;
+
+	if (strcmp(arg, "retval") == 0) {
+		if (is_return)
+			f->fn = t->fetch[FETCH_MTD_retval];
+		else
+			ret = -EINVAL;
+	} else if (strncmp(arg, "stack", 5) == 0) {
+		if (arg[5] == '\0') {
+			if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
+				f->fn = fetch_stack_address;
+			else
+				ret = -EINVAL;
+		} else if (isdigit(arg[5])) {
+			ret = strict_strtoul(arg + 5, 10, &param);
+			if (ret || param > PARAM_MAX_STACK)
+				ret = -EINVAL;
+			else {
+				f->fn = t->fetch[FETCH_MTD_stack];
+				f->data = (void *)param;
+			}
+		} else
+			ret = -EINVAL;
+	} else
+		ret = -EINVAL;
+	return ret;
+}
+
+/* Recursive argument parser */
+static int parse_probe_arg(char *arg, const struct fetch_type *t,
+		     struct fetch_param *f, bool is_return)
+{
+	int ret = 0;
+	unsigned long param;
+	long offset;
+	char *tmp;
+
+	switch (arg[0]) {
+	case '$':
+		ret = parse_probe_vars(arg + 1, t, f, is_return);
+		break;
+	case '%':	/* named register */
+		ret = regs_query_register_offset(arg + 1);
+		if (ret >= 0) {
+			f->fn = t->fetch[FETCH_MTD_reg];
+			f->data = (void *)(unsigned long)ret;
+			ret = 0;
+		}
+		break;
+	case '@':	/* memory or symbol */
+		if (isdigit(arg[1])) {
+			ret = strict_strtoul(arg + 1, 0, &param);
+			if (ret)
+				break;
+			f->fn = t->fetch[FETCH_MTD_memory];
+			f->data = (void *)param;
+		} else {
+			ret = traceprobe_split_symbol_offset(arg + 1, &offset);
+			if (ret)
+				break;
+			f->data = alloc_symbol_cache(arg + 1, offset);
+			if (f->data)
+				f->fn = t->fetch[FETCH_MTD_symbol];
+		}
+		break;
+	case '+':	/* deref memory */
+		arg++;	/* Skip '+', because strict_strtol() rejects it. */
+	case '-':
+		tmp = strchr(arg, '(');
+		if (!tmp)
+			break;
+		*tmp = '\0';
+		ret = strict_strtol(arg, 0, &offset);
+		if (ret)
+			break;
+		arg = tmp + 1;
+		tmp = strrchr(arg, ')');
+		if (tmp) {
+			struct deref_fetch_param *dprm;
+			const struct fetch_type *t2 = find_fetch_type(NULL);
+			*tmp = '\0';
+			dprm = kzalloc(sizeof(struct deref_fetch_param),
+				       GFP_KERNEL);
+			if (!dprm)
+				return -ENOMEM;
+			dprm->offset = offset;
+			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+			if (ret)
+				kfree(dprm);
+			else {
+				f->fn = t->fetch[FETCH_MTD_deref];
+				f->data = (void *)dprm;
+			}
+		}
+		break;
+	}
+	if (!ret && !f->fn) {	/* Parsed, but do not find fetch method */
+		pr_info("%s type has no corresponding fetch method.\n",
+			t->name);
+		ret = -EINVAL;
+	}
+	return ret;
+}
+
+#define BYTES_TO_BITS(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
+
+/* Bitfield type needs to be parsed into a fetch function */
+static int __parse_bitfield_probe_arg(const char *bf,
+				      const struct fetch_type *t,
+				      struct fetch_param *f)
+{
+	struct bitfield_fetch_param *bprm;
+	unsigned long bw, bo;
+	char *tail;
+
+	if (*bf != 'b')
+		return 0;
+
+	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+	if (!bprm)
+		return -ENOMEM;
+	bprm->orig = *f;
+	f->fn = t->fetch[FETCH_MTD_bitfield];
+	f->data = (void *)bprm;
+
+	bw = simple_strtoul(bf + 1, &tail, 0);	/* Use simple one */
+	if (bw == 0 || *tail != '@')
+		return -EINVAL;
+
+	bf = tail + 1;
+	bo = simple_strtoul(bf, &tail, 0);
+	if (tail == bf || *tail != '/')
+		return -EINVAL;
+
+	bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
+	bprm->low_shift = bprm->hi_shift + bo;
+	return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
+}
+
+/* String length checking wrapper */
+int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+		struct probe_arg *parg, bool is_return)
+{
+	const char *t;
+	int ret;
+
+	if (strlen(arg) > MAX_ARGSTR_LEN) {
+		pr_info("Argument is too long.: %s\n",  arg);
+		return -ENOSPC;
+	}
+	parg->comm = kstrdup(arg, GFP_KERNEL);
+	if (!parg->comm) {
+		pr_info("Failed to allocate memory for command '%s'.\n", arg);
+		return -ENOMEM;
+	}
+	t = strchr(parg->comm, ':');
+	if (t) {
+		arg[t - parg->comm] = '\0';
+		t++;
+	}
+	parg->type = find_fetch_type(t);
+	if (!parg->type) {
+		pr_info("Unsupported type: %s\n", t);
+		return -EINVAL;
+	}
+	parg->offset = *size;
+	*size += parg->type->size;
+	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+	if (ret >= 0 && t != NULL)
+		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
+	if (ret >= 0) {
+		parg->fetch_size.fn = get_fetch_size_function(parg->type,
+							      parg->fetch.fn);
+		parg->fetch_size.data = parg->fetch.data;
+	}
+	return ret;
+}
+
+/* Return 1 if name is reserved or already used by another argument */
+int traceprobe_conflict_field_name(const char *name,
+			       struct probe_arg *args, int narg)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
+		if (strcmp(reserved_field_names[i], name) == 0)
+			return 1;
+	for (i = 0; i < narg; i++)
+		if (strcmp(args[i].name, name) == 0)
+			return 1;
+	return 0;
+}
+
+void traceprobe_update_arg(struct probe_arg *arg)
+{
+	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+		update_bitfield_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+		update_deref_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+		update_symbol_cache(arg->fetch.data);
+}
+
+void traceprobe_free_probe_arg(struct probe_arg *arg)
+{
+	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+		free_bitfield_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+		free_deref_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+		free_symbol_cache(arg->fetch.data);
+	kfree(arg->name);
+	kfree(arg->comm);
+}
+
+int traceprobe_command(const char *buf, int (*createfn)(int, char **))
+{
+	char **argv;
+	int argc = 0, ret = 0;
+
+	argv = argv_split(GFP_KERNEL, buf, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	if (argc)
+		ret = createfn(argc, argv);
+
+	argv_free(argv);
+	return ret;
+}
+
+#define WRITE_BUFSIZE 128
+
+ssize_t traceprobe_probes_write(struct file *file, const char __user *buffer,
+				size_t count, loff_t *ppos,
+				int (*createfn)(int, char **))
+{
+	char *kbuf, *tmp;
+	int ret = 0;
+	size_t done = 0;
+	size_t size;
+
+	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	while (done < count) {
+		size = count - done;
+		if (size >= WRITE_BUFSIZE)
+			size = WRITE_BUFSIZE - 1;
+		if (copy_from_user(kbuf, buffer + done, size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		kbuf[size] = '\0';
+		tmp = strchr(kbuf, '\n');
+		if (tmp) {
+			*tmp = '\0';
+			size = tmp - kbuf + 1;
+		} else if (done + size < count) {
+			pr_warning("Line length is too long: "
+				   "Should be less than %d.", WRITE_BUFSIZE);
+			ret = -EINVAL;
+			goto out;
+		}
+		done += size;
+		/* Remove comments */
+		tmp = strchr(kbuf, '#');
+		if (tmp)
+			*tmp = '\0';
+
+		ret = traceprobe_command(kbuf, createfn);
+		if (ret)
+			goto out;
+	}
+	ret = done;
+out:
+	kfree(kbuf);
+	return ret;
+}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
new file mode 100644
index 0000000..84eef0a
--- /dev/null
+++ b/kernel/trace/trace_probe.h
@@ -0,0 +1,161 @@
+/*
+ * Common header file for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * This code was copied from kernel/trace/trace_kprobe.h written by
+ * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
+ *
+ * Updates to make this generic:
+ * Copyright (C) IBM Corporation, 2010-2011
+ * Author:     Srikar Dronamraju
+ */
+
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/debugfs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/ptrace.h>
+#include <linux/perf_event.h>
+#include <linux/kprobes.h>
+#include <linux/stringify.h>
+#include <linux/limits.h>
+#include <linux/uaccess.h>
+#include <asm/bitsperlong.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+#define MAX_TRACE_ARGS 128
+#define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
+#define MAX_STRING_SIZE PATH_MAX
+
+/* Reserved field names */
+#define FIELD_STRING_IP "__probe_ip"
+#define FIELD_STRING_RETIP "__probe_ret_ip"
+#define FIELD_STRING_FUNC "__probe_func"
+
+#undef DEFINE_FIELD
+#define DEFINE_FIELD(type, item, name, is_signed)			\
+	do {								\
+		ret = trace_define_field(event_call, #type, name,	\
+					 offsetof(typeof(field), item),	\
+					 sizeof(field.item), is_signed, \
+					 FILTER_OTHER);			\
+		if (ret)						\
+			return ret;					\
+	} while (0)
+
+
+/* Flags for trace_probe */
+#define TP_FLAG_TRACE	1
+#define TP_FLAG_PROFILE	2
+#define TP_FLAG_REGISTERED 4
+
+
+/* data_rloc: data relative location, compatible with u32 */
+#define make_data_rloc(len, roffs)	\
+	(((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
+#define get_rloc_len(dl)	((u32)(dl) >> 16)
+#define get_rloc_offs(dl)	((u32)(dl) & 0xffff)
+
+/*
+ * Convert data_rloc to data_loc:
+ *  data_rloc stores the offset from data_rloc itself, but data_loc
+ *  stores the offset from event entry.
+ */
+#define convert_rloc_to_loc(dl, offs)	((u32)(dl) + (offs))
+
+/* Data fetch function type */
+typedef	void (*fetch_func_t)(struct pt_regs *, void *, void *);
+/* Printing function type */
+typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
+				 void *);
+
+/* Fetch types */
+enum {
+	FETCH_MTD_reg = 0,
+	FETCH_MTD_stack,
+	FETCH_MTD_retval,
+	FETCH_MTD_memory,
+	FETCH_MTD_symbol,
+	FETCH_MTD_deref,
+	FETCH_MTD_bitfield,
+	FETCH_MTD_END,
+};
+
+/* Fetch type information table */
+struct fetch_type {
+	const char	*name;		/* Name of type */
+	size_t		size;		/* Byte size of type */
+	int		is_signed;	/* Signed flag */
+	print_type_func_t	print;	/* Print functions */
+	const char	*fmt;		/* Fromat string */
+	const char	*fmttype;	/* Name in format file */
+	/* Fetch functions */
+	fetch_func_t	fetch[FETCH_MTD_END];
+};
+
+struct fetch_param {
+	fetch_func_t	fn;
+	void *data;
+};
+
+struct probe_arg {
+	struct fetch_param	fetch;
+	struct fetch_param	fetch_size;
+	unsigned int		offset;	/* Offset from argument entry */
+	const char		*name;	/* Name of this argument */
+	const char		*comm;	/* Command of this argument */
+	const struct fetch_type	*type;	/* Type of this argument */
+};
+
+static inline __kprobes void call_fetch(struct fetch_param *fprm,
+				 struct pt_regs *regs, void *dest)
+{
+	return fprm->fn(regs, fprm->data, dest);
+}
+
+/* Check the name is good for event/group/fields */
+static inline int is_good_name(const char *name)
+{
+	if (!isalpha(*name) && *name != '_')
+		return 0;
+	while (*++name != '\0') {
+		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
+			return 0;
+	}
+	return 1;
+}
+
+extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+		   struct probe_arg *parg, bool is_return);
+
+extern int traceprobe_conflict_field_name(const char *name,
+			       struct probe_arg *args, int narg);
+
+extern void traceprobe_update_arg(struct probe_arg *arg);
+extern void traceprobe_free_probe_arg(struct probe_arg *arg);
+
+extern int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset);
+
+extern ssize_t traceprobe_probes_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *ppos,
+		int (*createfn)(int, char**));
+
+extern int traceprobe_command(const char *buf, int (*createfn)(int, char**));


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2012-01-10 11:49 ` [PATCH v9 3.2 6/9] tracing: Extract out common code for kprobes/uprobes traceevents Srikar Dronamraju
@ 2012-01-10 11:49 ` Srikar Dronamraju
  2012-01-16 13:11   ` Jiri Olsa
  2012-01-10 11:49 ` [PATCH v9 3.2 8/9] perf: rename target_module to target Srikar Dronamraju
  2012-01-16  8:34 ` [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Ingo Molnar
  8 siblings, 1 reply; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:49 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


Implements trace_event support for uprobes. In its current form it can
be used to put probes at a specified offset in a file and dump the
required registers when the code flow reaches the probed address.

The following example shows how to dump the instruction pointer and %ax
a register at the probed text address.  Here we are trying to probe
zfree in /bin/zsh

# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g    DF .text  0000000000000012  Base        zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79

TODO: Connect a filter to a consumer.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- Added uprobe tracer documentation to this patch.

 Documentation/trace/uprobetracer.txt |   93 ++++
 arch/Kconfig                         |   10 
 kernel/trace/Kconfig                 |   16 +
 kernel/trace/Makefile                |    1 
 kernel/trace/trace.h                 |    5 
 kernel/trace/trace_kprobe.c          |    4 
 kernel/trace/trace_probe.c           |   14 -
 kernel/trace/trace_probe.h           |    3 
 kernel/trace/trace_uprobe.c          |  768 ++++++++++++++++++++++++++++++++++
 9 files changed, 898 insertions(+), 16 deletions(-)
 create mode 100644 Documentation/trace/uprobetracer.txt
 create mode 100644 kernel/trace/trace_uprobe.c

diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
new file mode 100644
index 0000000..457932f
--- /dev/null
+++ b/Documentation/trace/uprobetracer.txt
@@ -0,0 +1,93 @@
+		Uprobe-tracer: Uprobe-based Event Tracing
+		=========================================
+                 Documentation is written by Srikar Dronamraju
+
+Overview
+--------
+These events are similar to kprobe based events.
+To enable this feature, build your kernel with CONFIG_UPROBE_EVENTS=y.
+
+Similar to the kprobe-event tracer, this doesn't need to be activated via
+current_tracer. Instead of that, add probe points via
+/sys/kernel/debug/tracing/uprobe_events, and enable it via
+/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled.
+
+
+Synopsis of uprobe_tracer
+-------------------------
+  p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS]	: Set a probe
+
+ GRP		: Group name. If omitted, use "uprobes" for it.
+ EVENT		: Event name. If omitted, the event name is generated
+		  based on SYMBOL+offs.
+ PATH		: path to an executable or a library.
+ SYMBOL[+offs]	: Symbol+offset where the probe is inserted.
+
+ FETCHARGS	: Arguments. Each probe can have up to 128 args.
+  %REG		: Fetch register REG
+
+Event Profiling
+---------------
+ You can check the total number of probe hits and probe miss-hits via
+/sys/kernel/debug/tracing/uprobe_profile.
+ The first column is event name, the second is the number of probe hits,
+the third is the number of probe miss-hits.
+
+Usage examples
+--------------
+To add a probe as a new event, write a new definition to uprobe_events
+as below.
+
+  echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
+
+ This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash
+
+
+  echo > /sys/kernel/debug/tracing/uprobe_events
+
+ This clears all probe points.
+
+The following example shows how to dump the instruction pointer and %ax
+a register at the probed text address.  Here we are trying to probe
+function zfree in /bin/zsh
+
+    # cd /sys/kernel/debug/tracing/
+    # cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
+    00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
+    # objdump -T /bin/zsh | grep -w zfree
+    0000000000446420 g    DF .text  0000000000000012  Base        zfree
+
+0x46420 is the offset of zfree in object /bin/zsh that is loaded at
+0x00400000. Hence the command to probe would be :
+
+    # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
+
+We can see the events that are registered by looking at the uprobe_events
+file.
+
+    # cat uprobe_events
+    p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
+
+Right after definition, each event is disabled by default. For tracing these
+events, you need to enable it by:
+
+    # echo 1 > events/uprobes/enable
+
+Lets disable the event after sleeping for some time.
+    # sleep 20
+    # echo 0 > events/uprobes/enable
+
+And you can see the traced information via /sys/kernel/debug/tracing/trace.
+
+    # cat trace
+    # tracer: nop
+    #
+    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
+    #              | |       |          |         |
+                 zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+
+Each line shows us probes were triggered for a pid 24842 with ip being
+0x446421 and contents of ax register being 79.
diff --git a/arch/Kconfig b/arch/Kconfig
index 6328b49..6c6df9f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -62,15 +62,7 @@ config OPTPROBES
 	depends on !PREEMPT
 
 config UPROBES
-	bool "User-space probes (EXPERIMENTAL)"
-	depends on ARCH_SUPPORTS_UPROBES
-	default n
-	help
-	  Uprobes enables kernel subsystems to establish probepoints
-	  in user applications and execute handler functions when
-	  the probepoints are hit.
-
-	  If in doubt, say "N".
+	def_bool n
 
 config HAVE_EFFICIENT_UNALIGNED_ACCESS
 	bool
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 520106a..b001fb1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -386,6 +386,22 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config UPROBE_EVENT
+	bool "Enable uprobes-based dynamic events"
+	depends on ARCH_SUPPORTS_UPROBES
+	depends on MMU
+	select UPROBES
+	select PROBE_EVENTS
+	select TRACING
+	default n
+	help
+	  This allows the user to add tracing events on top of userspace dynamic
+	  events (similar to tracepoints) on the fly via the traceevents interface.
+	  Those events can be inserted wherever uprobes can probe, and record
+	  various registers.
+	  This option is required if you plan to use perf-probe subcommand of perf
+	  tools on user space applications.
+
 config PROBE_EVENTS
 	def_bool n
 
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index fa10d5c..1734c03 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -62,5 +62,6 @@ ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
 obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
+obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 092e1f8..f5f7bb3 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -97,6 +97,11 @@ struct kretprobe_trace_entry_head {
 	unsigned long		ret_ip;
 };
 
+struct uprobe_trace_entry_head {
+	struct trace_entry	ent;
+	unsigned long		ip;
+};
+
 /*
  * trace_flag_type is an enumeration that holds different
  * states when a trace occurs. These are:
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 967e634..60384df 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -524,8 +524,8 @@ static int create_trace_probe(int argc, char **argv)
 		}
 
 		/* Parse fetch argument */
-		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
-								is_return);
+		ret = traceprobe_parse_probe_arg(arg, &tp->size,
+					&tp->args[i], is_return, true);
 		if (ret) {
 			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
 			goto error;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index e79a9e5..c0a98e6 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -529,13 +529,17 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,
 
 /* Recursive argument parser */
 static int parse_probe_arg(char *arg, const struct fetch_type *t,
-		     struct fetch_param *f, bool is_return)
+		     struct fetch_param *f, bool is_return, bool is_kprobe)
 {
 	int ret = 0;
 	unsigned long param;
 	long offset;
 	char *tmp;
 
+	/* Until uprobe_events supports only reg arguments */
+	if (!is_kprobe && arg[0] != '%')
+		return -EINVAL;
+
 	switch (arg[0]) {
 	case '$':
 		ret = parse_probe_vars(arg + 1, t, f, is_return);
@@ -585,7 +589,8 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t,
 			if (!dprm)
 				return -ENOMEM;
 			dprm->offset = offset;
-			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return,
+							is_kprobe);
 			if (ret)
 				kfree(dprm);
 			else {
@@ -640,7 +645,7 @@ static int __parse_bitfield_probe_arg(const char *bf,
 
 /* String length checking wrapper */
 int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
-		struct probe_arg *parg, bool is_return)
+		struct probe_arg *parg, bool is_return, bool is_kprobe)
 {
 	const char *t;
 	int ret;
@@ -666,7 +671,8 @@ int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
 	}
 	parg->offset = *size;
 	*size += parg->type->size;
-	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return,
+							is_kprobe);
 	if (ret >= 0 && t != NULL)
 		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
 	if (ret >= 0) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 84eef0a..31e635e 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -66,6 +66,7 @@
 #define TP_FLAG_TRACE	1
 #define TP_FLAG_PROFILE	2
 #define TP_FLAG_REGISTERED 4
+#define TP_FLAG_UPROBE	8
 
 
 /* data_rloc: data relative location, compatible with u32 */
@@ -144,7 +145,7 @@ static inline int is_good_name(const char *name)
 }
 
 extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
-		   struct probe_arg *parg, bool is_return);
+		   struct probe_arg *parg, bool is_return, bool is_kprobe);
 
 extern int traceprobe_conflict_field_name(const char *name,
 			       struct probe_arg *args, int narg);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
new file mode 100644
index 0000000..af29368
--- /dev/null
+++ b/kernel/trace/trace_uprobe.c
@@ -0,0 +1,768 @@
+/*
+ * uprobes-based tracing events
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:	Srikar Dronamraju
+ */
+
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/uprobes.h>
+#include <linux/namei.h>
+
+#include "trace_probe.h"
+
+#define UPROBE_EVENT_SYSTEM "uprobes"
+
+/**
+ * uprobe event core functions
+ */
+struct trace_uprobe;
+struct uprobe_trace_consumer {
+	struct uprobe_consumer cons;
+	struct trace_uprobe *tp;
+};
+
+struct trace_uprobe {
+	struct list_head	list;
+	struct ftrace_event_class	class;
+	struct ftrace_event_call	call;
+	struct uprobe_trace_consumer	*consumer;
+	struct inode		*inode;
+	char			*filename;
+	unsigned long		offset;
+	unsigned long		nhit;
+	unsigned int		flags;	/* For TP_FLAG_* */
+	ssize_t			size;		/* trace entry size */
+	unsigned int		nr_args;
+	struct probe_arg	args[];
+};
+
+#define SIZEOF_TRACE_UPROBE(n)			\
+	(offsetof(struct trace_uprobe, args) +	\
+	(sizeof(struct probe_arg) * (n)))
+
+static int register_uprobe_event(struct trace_uprobe *tp);
+static void unregister_uprobe_event(struct trace_uprobe *tp);
+
+static DEFINE_MUTEX(uprobe_lock);
+static LIST_HEAD(uprobe_list);
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs);
+
+/*
+ * Allocate new trace_uprobe and initialize it (including uprobes).
+ */
+static struct trace_uprobe *alloc_trace_uprobe(const char *group,
+				const char *event, int nargs)
+{
+	struct trace_uprobe *tp;
+
+	if (!event || !is_good_name(event))
+		return ERR_PTR(-EINVAL);
+
+	if (!group || !is_good_name(group))
+		return ERR_PTR(-EINVAL);
+
+	tp = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
+	if (!tp)
+		return ERR_PTR(-ENOMEM);
+
+	tp->call.class = &tp->class;
+	tp->call.name = kstrdup(event, GFP_KERNEL);
+	if (!tp->call.name)
+		goto error;
+
+	tp->class.system = kstrdup(group, GFP_KERNEL);
+	if (!tp->class.system)
+		goto error;
+
+	INIT_LIST_HEAD(&tp->list);
+	return tp;
+error:
+	kfree(tp->call.name);
+	kfree(tp);
+	return ERR_PTR(-ENOMEM);
+}
+
+static void free_trace_uprobe(struct trace_uprobe *tp)
+{
+	int i;
+
+	for (i = 0; i < tp->nr_args; i++)
+		traceprobe_free_probe_arg(&tp->args[i]);
+
+	iput(tp->inode);
+	kfree(tp->call.class->system);
+	kfree(tp->call.name);
+	kfree(tp->filename);
+	kfree(tp);
+}
+
+static struct trace_uprobe *find_probe_event(const char *event,
+					const char *group)
+{
+	struct trace_uprobe *tp;
+
+	list_for_each_entry(tp, &uprobe_list, list)
+		if (strcmp(tp->call.name, event) == 0 &&
+		    strcmp(tp->call.class->system, group) == 0)
+			return tp;
+	return NULL;
+}
+
+/* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock */
+static void unregister_trace_uprobe(struct trace_uprobe *tp)
+{
+	list_del(&tp->list);
+	unregister_uprobe_event(tp);
+	free_trace_uprobe(tp);
+}
+
+/* Register a trace_uprobe and probe_event */
+static int register_trace_uprobe(struct trace_uprobe *tp)
+{
+	struct trace_uprobe *old_tp;
+	int ret;
+
+	mutex_lock(&uprobe_lock);
+
+	/* register as an event */
+	old_tp = find_probe_event(tp->call.name, tp->call.class->system);
+	if (old_tp)
+		/* delete old event */
+		unregister_trace_uprobe(old_tp);
+
+	ret = register_uprobe_event(tp);
+	if (ret) {
+		pr_warning("Failed to register probe event(%d)\n", ret);
+		goto end;
+	}
+
+	list_add_tail(&tp->list, &uprobe_list);
+end:
+	mutex_unlock(&uprobe_lock);
+	return ret;
+}
+
+static int create_trace_uprobe(int argc, char **argv)
+{
+	/*
+	 * Argument syntax:
+	 *  - Add uprobe: p[:[GRP/]EVENT] VADDR@PID [%REG]
+	 *
+	 *  - Remove uprobe: -:[GRP/]EVENT
+	 */
+	struct path path;
+	struct inode *inode = NULL;
+	struct trace_uprobe *tp;
+	int i, ret = 0;
+	int is_delete = 0;
+	char *arg = NULL, *event = NULL, *group = NULL;
+	unsigned long offset;
+	char buf[MAX_EVENT_NAME_LEN];
+	char *filename;
+
+	/* argc must be >= 1 */
+	if (argv[0][0] == '-')
+		is_delete = 1;
+	else if (argv[0][0] != 'p') {
+		pr_info("Probe definition must be started with 'p', 'r' or"
+			" '-'.\n");
+		return -EINVAL;
+	}
+
+	if (argv[0][1] == ':') {
+		event = &argv[0][2];
+		if (strchr(event, '/')) {
+			group = event;
+			event = strchr(group, '/') + 1;
+			event[-1] = '\0';
+			if (strlen(group) == 0) {
+				pr_info("Group name is not specified\n");
+				return -EINVAL;
+			}
+		}
+		if (strlen(event) == 0) {
+			pr_info("Event name is not specified\n");
+			return -EINVAL;
+		}
+	}
+	if (!group)
+		group = UPROBE_EVENT_SYSTEM;
+
+	if (is_delete) {
+		if (!event) {
+			pr_info("Delete command needs an event name.\n");
+			return -EINVAL;
+		}
+		mutex_lock(&uprobe_lock);
+		tp = find_probe_event(event, group);
+		if (!tp) {
+			mutex_unlock(&uprobe_lock);
+			pr_info("Event %s/%s doesn't exist.\n", group, event);
+			return -ENOENT;
+		}
+		/* delete an event */
+		unregister_trace_uprobe(tp);
+		mutex_unlock(&uprobe_lock);
+		return 0;
+	}
+
+	if (argc < 2) {
+		pr_info("Probe point is not specified.\n");
+		return -EINVAL;
+	}
+	if (isdigit(argv[1][0])) {
+		pr_info("probe point must be have a filename.\n");
+		return -EINVAL;
+	}
+	arg = strchr(argv[1], ':');
+	if (!arg)
+		goto fail_address_parse;
+
+	*arg++ = '\0';
+	filename = argv[1];
+	ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+	if (ret)
+		goto fail_address_parse;
+
+	inode = igrab(path.dentry->d_inode);
+
+	ret = strict_strtoul(arg, 0, &offset);
+		if (ret)
+			goto fail_address_parse;
+
+	argc -= 2;
+	argv += 2;
+
+	/* setup a probe */
+	if (!event) {
+		char *tail = strrchr(filename, '/');
+		char *ptr;
+
+		ptr = kstrdup((tail ? tail + 1 : filename), GFP_KERNEL);
+		if (!ptr) {
+			ret = -ENOMEM;
+			goto fail_address_parse;
+		}
+
+		tail = ptr;
+		ptr = strpbrk(tail, ".-_");
+		if (ptr)
+			*ptr = '\0';
+
+		snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_0x%lx", 'p', tail,
+				offset);
+		event = buf;
+		kfree(tail);
+	}
+	tp = alloc_trace_uprobe(group, event, argc);
+	if (IS_ERR(tp)) {
+		pr_info("Failed to allocate trace_uprobe.(%d)\n",
+			(int)PTR_ERR(tp));
+		iput(inode);
+		return PTR_ERR(tp);
+	}
+	tp->offset = offset;
+	tp->inode = inode;
+	tp->filename = kstrdup(filename, GFP_KERNEL);
+	if (!tp->filename) {
+			pr_info("Failed to allocate filename.\n");
+			ret = -ENOMEM;
+			goto error;
+	}
+
+	/* parse arguments */
+	ret = 0;
+	for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
+		/* Increment count for freeing args in error case */
+		tp->nr_args++;
+
+		/* Parse argument name */
+		arg = strchr(argv[i], '=');
+		if (arg) {
+			*arg++ = '\0';
+			tp->args[i].name = kstrdup(argv[i], GFP_KERNEL);
+		} else {
+			arg = argv[i];
+			/* If argument name is omitted, set "argN" */
+			snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1);
+			tp->args[i].name = kstrdup(buf, GFP_KERNEL);
+		}
+
+		if (!tp->args[i].name) {
+			pr_info("Failed to allocate argument[%d] name.\n", i);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		if (!is_good_name(tp->args[i].name)) {
+			pr_info("Invalid argument[%d] name: %s\n",
+				i, tp->args[i].name);
+			ret = -EINVAL;
+			goto error;
+		}
+
+		if (traceprobe_conflict_field_name(tp->args[i].name,
+							tp->args, i)) {
+			pr_info("Argument[%d] name '%s' conflicts with "
+				"another field.\n", i, argv[i]);
+			ret = -EINVAL;
+			goto error;
+		}
+
+		/* Parse fetch argument */
+		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+								false, false);
+		if (ret) {
+			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
+			goto error;
+		}
+	}
+
+	ret = register_trace_uprobe(tp);
+	if (ret)
+		goto error;
+	return 0;
+
+error:
+	free_trace_uprobe(tp);
+	return ret;
+
+fail_address_parse:
+	if (inode)
+		iput(inode);
+	pr_info("Failed to parse address.\n");
+	return ret;
+}
+
+static void cleanup_all_probes(void)
+{
+	struct trace_uprobe *tp;
+
+	mutex_lock(&uprobe_lock);
+	while (!list_empty(&uprobe_list)) {
+		tp = list_entry(uprobe_list.next, struct trace_uprobe, list);
+		unregister_trace_uprobe(tp);
+	}
+	mutex_unlock(&uprobe_lock);
+}
+
+/* Probes listing interfaces */
+static void *probes_seq_start(struct seq_file *m, loff_t *pos)
+{
+	mutex_lock(&uprobe_lock);
+	return seq_list_start(&uprobe_list, *pos);
+}
+
+static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	return seq_list_next(v, &uprobe_list, pos);
+}
+
+static void probes_seq_stop(struct seq_file *m, void *v)
+{
+	mutex_unlock(&uprobe_lock);
+}
+
+static int probes_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_uprobe *tp = v;
+	int i;
+
+	seq_printf(m, "p:%s/%s", tp->call.class->system, tp->call.name);
+	seq_printf(m, " %s:0x%p", tp->filename, (void *)tp->offset);
+
+	for (i = 0; i < tp->nr_args; i++)
+		seq_printf(m, " %s=%s", tp->args[i].name, tp->args[i].comm);
+	seq_printf(m, "\n");
+	return 0;
+}
+
+static const struct seq_operations probes_seq_op = {
+	.start  = probes_seq_start,
+	.next   = probes_seq_next,
+	.stop   = probes_seq_stop,
+	.show   = probes_seq_show
+};
+
+static int probes_open(struct inode *inode, struct file *file)
+{
+	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC))
+		cleanup_all_probes();
+
+	return seq_open(file, &probes_seq_op);
+}
+
+static ssize_t probes_write(struct file *file, const char __user *buffer,
+			    size_t count, loff_t *ppos)
+{
+	return traceprobe_probes_write(file, buffer, count, ppos,
+			create_trace_uprobe);
+}
+
+static const struct file_operations uprobe_events_ops = {
+	.owner          = THIS_MODULE,
+	.open           = probes_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = seq_release,
+	.write		= probes_write,
+};
+
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_uprobe *tp = v;
+
+	seq_printf(m, "  %s %-44s %15lu\n", tp->filename, tp->call.name,
+								tp->nhit);
+	return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+	.start  = probes_seq_start,
+	.next   = probes_seq_next,
+	.stop   = probes_seq_stop,
+	.show   = probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations uprobe_profile_ops = {
+	.owner          = THIS_MODULE,
+	.open           = profile_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = seq_release,
+};
+
+/* uprobe handler */
+static void uprobe_trace_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+	struct uprobe_trace_entry_head *entry;
+	struct ring_buffer_event *event;
+	struct ring_buffer *buffer;
+	u8 *data;
+	int size, i, pc;
+	unsigned long irq_flags;
+	struct ftrace_event_call *call = &tp->call;
+
+	tp->nhit++;
+
+	local_save_flags(irq_flags);
+	pc = preempt_count();
+
+	size = sizeof(*entry) + tp->size;
+
+	event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
+						  size, irq_flags, pc);
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+	data = (u8 *)&entry[1];
+	for (i = 0; i < tp->nr_args; i++)
+		call_fetch(&tp->args[i].fetch, regs,
+						data + tp->args[i].offset);
+
+	if (!filter_current_check_discard(buffer, call, entry, event))
+		trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
+}
+
+/* Event entry printers */
+static enum print_line_t
+print_uprobe_event(struct trace_iterator *iter, int flags,
+		   struct trace_event *event)
+{
+	struct uprobe_trace_entry_head *field;
+	struct trace_seq *s = &iter->seq;
+	struct trace_uprobe *tp;
+	u8 *data;
+	int i;
+
+	field = (struct uprobe_trace_entry_head *)iter->ent;
+	tp = container_of(event, struct trace_uprobe, call.event);
+
+	if (!trace_seq_printf(s, "%s: (", tp->call.name))
+		goto partial;
+
+	if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
+		goto partial;
+
+	if (!trace_seq_puts(s, ")"))
+		goto partial;
+
+	data = (u8 *)&field[1];
+	for (i = 0; i < tp->nr_args; i++)
+		if (!tp->args[i].type->print(s, tp->args[i].name,
+					     data + tp->args[i].offset, field))
+			goto partial;
+
+	if (!trace_seq_puts(s, "\n"))
+		goto partial;
+
+	return TRACE_TYPE_HANDLED;
+partial:
+	return TRACE_TYPE_PARTIAL_LINE;
+}
+
+static int probe_event_enable(struct trace_uprobe *tp, int flag)
+{
+	struct uprobe_trace_consumer *utc;
+	int ret = 0;
+
+	if (!tp->inode || tp->consumer)
+		return -EINTR;
+
+	utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+	if (!utc)
+		return -EINTR;
+
+	utc->cons.handler = uprobe_dispatcher;
+	utc->cons.filter = NULL;
+	ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+	if (ret) {
+		kfree(utc);
+		return ret;
+	}
+
+	tp->flags |= flag;
+	utc->tp = tp;
+	tp->consumer = utc;
+	return 0;
+}
+
+static void probe_event_disable(struct trace_uprobe *tp, int flag)
+{
+	if (!tp->inode || !tp->consumer)
+		return;
+
+	unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+	tp->flags &= ~flag;
+	kfree(tp->consumer);
+	tp->consumer = NULL;
+}
+
+static int uprobe_event_define_fields(struct ftrace_event_call *event_call)
+{
+	int ret, i;
+	struct uprobe_trace_entry_head field;
+	struct trace_uprobe *tp = (struct trace_uprobe *)event_call->data;
+
+	DEFINE_FIELD(unsigned long, ip, FIELD_STRING_IP, 0);
+	/* Set argument names as fields */
+	for (i = 0; i < tp->nr_args; i++) {
+		ret = trace_define_field(event_call, tp->args[i].type->fmttype,
+					 tp->args[i].name,
+					 sizeof(field) + tp->args[i].offset,
+					 tp->args[i].type->size,
+					 tp->args[i].type->is_signed,
+					 FILTER_OTHER);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int __set_print_fmt(struct trace_uprobe *tp, char *buf, int len)
+{
+	int i;
+	int pos = 0;
+
+	const char *fmt, *arg;
+
+	fmt = "(%lx)";
+	arg = "REC->" FIELD_STRING_IP;
+
+	/* When len=0, we just calculate the needed length */
+#define LEN_OR_ZERO (len ? len - pos : 0)
+
+	pos += snprintf(buf + pos, LEN_OR_ZERO, "\"%s", fmt);
+
+	for (i = 0; i < tp->nr_args; i++) {
+		pos += snprintf(buf + pos, LEN_OR_ZERO, " %s=%s",
+				tp->args[i].name, tp->args[i].type->fmt);
+	}
+
+	pos += snprintf(buf + pos, LEN_OR_ZERO, "\", %s", arg);
+
+	for (i = 0; i < tp->nr_args; i++) {
+		pos += snprintf(buf + pos, LEN_OR_ZERO, ", REC->%s",
+				tp->args[i].name);
+	}
+
+#undef LEN_OR_ZERO
+
+	/* return the length of print_fmt */
+	return pos;
+}
+
+static int set_print_fmt(struct trace_uprobe *tp)
+{
+	int len;
+	char *print_fmt;
+
+	/* First: called with 0 length to calculate the needed length */
+	len = __set_print_fmt(tp, NULL, 0);
+	print_fmt = kmalloc(len + 1, GFP_KERNEL);
+	if (!print_fmt)
+		return -ENOMEM;
+
+	/* Second: actually write the @print_fmt */
+	__set_print_fmt(tp, print_fmt, len + 1);
+	tp->call.print_fmt = print_fmt;
+
+	return 0;
+}
+
+#ifdef CONFIG_PERF_EVENTS
+
+/* uprobe profile handler */
+static void uprobe_perf_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+	struct ftrace_event_call *call = &tp->call;
+	struct uprobe_trace_entry_head *entry;
+	struct hlist_head *head;
+	u8 *data;
+	int size, __size, i;
+	int rctx;
+
+	__size = sizeof(*entry) + tp->size;
+	size = ALIGN(__size + sizeof(u32), sizeof(u64));
+	size -= sizeof(u32);
+	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
+		     "profile buffer not large enough"))
+		return;
+
+	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+	if (!entry)
+		return;
+
+	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+	data = (u8 *)&entry[1];
+	for (i = 0; i < tp->nr_args; i++)
+		call_fetch(&tp->args[i].fetch, regs,
+						data + tp->args[i].offset);
+
+	head = this_cpu_ptr(call->perf_events);
+	perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+}
+#endif	/* CONFIG_PERF_EVENTS */
+
+static
+int uprobe_register(struct ftrace_event_call *event, enum trace_reg type)
+{
+	switch (type) {
+	case TRACE_REG_REGISTER:
+		return probe_event_enable(event->data, TP_FLAG_TRACE);
+	case TRACE_REG_UNREGISTER:
+		probe_event_disable(event->data, TP_FLAG_TRACE);
+		return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+	case TRACE_REG_PERF_REGISTER:
+		return probe_event_enable(event->data, TP_FLAG_PROFILE);
+	case TRACE_REG_PERF_UNREGISTER:
+		probe_event_disable(event->data, TP_FLAG_PROFILE);
+		return 0;
+#endif
+	}
+	return 0;
+}
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
+{
+	struct uprobe_trace_consumer *utc;
+	struct trace_uprobe *tp;
+
+	utc = container_of(con, struct uprobe_trace_consumer, cons);
+	tp = utc->tp;
+	if (!tp || tp->consumer != utc)
+		return 0;
+
+	if (tp->flags & TP_FLAG_TRACE)
+		uprobe_trace_func(tp, regs);
+#ifdef CONFIG_PERF_EVENTS
+	if (tp->flags & TP_FLAG_PROFILE)
+		uprobe_perf_func(tp, regs);
+#endif
+	return 0;
+}
+
+static struct trace_event_functions uprobe_funcs = {
+	.trace		= print_uprobe_event
+};
+
+static int register_uprobe_event(struct trace_uprobe *tp)
+{
+	struct ftrace_event_call *call = &tp->call;
+	int ret;
+
+	/* Initialize ftrace_event_call */
+	INIT_LIST_HEAD(&call->class->fields);
+	call->event.funcs = &uprobe_funcs;
+	call->class->define_fields = uprobe_event_define_fields;
+	if (set_print_fmt(tp) < 0)
+		return -ENOMEM;
+	ret = register_ftrace_event(&call->event);
+	if (!ret) {
+		kfree(call->print_fmt);
+		return -ENODEV;
+	}
+	call->flags = 0;
+	call->class->reg = uprobe_register;
+	call->data = tp;
+	ret = trace_add_event_call(call);
+	if (ret) {
+		pr_info("Failed to register uprobe event: %s\n", call->name);
+		kfree(call->print_fmt);
+		unregister_ftrace_event(&call->event);
+	}
+	return ret;
+}
+
+static void unregister_uprobe_event(struct trace_uprobe *tp)
+{
+	/* tp->event is unregistered in trace_remove_event_call() */
+	trace_remove_event_call(&tp->call);
+	kfree(tp->call.print_fmt);
+	tp->call.print_fmt = NULL;
+}
+
+/* Make a trace interface for controling probe points */
+static __init int init_uprobe_trace(void)
+{
+	struct dentry *d_tracer;
+
+	d_tracer = tracing_init_dentry();
+	if (!d_tracer)
+		return 0;
+
+	trace_create_file("uprobe_events", 0644, d_tracer,
+				    NULL, &uprobe_events_ops);
+	/* Profile interface */
+	trace_create_file("uprobe_profile", 0444, d_tracer,
+				    NULL, &uprobe_profile_ops);
+	return 0;
+}
+
+fs_initcall(init_uprobe_trace);


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* [PATCH v9 3.2 8/9] perf: rename target_module to target
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2012-01-10 11:49 ` [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface Srikar Dronamraju
@ 2012-01-10 11:49 ` Srikar Dronamraju
  2012-01-16  8:34 ` [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Ingo Molnar
  8 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-10 11:49 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Ingo Molnar, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


This is a precursor patch that modifies names that refer to
kernel/module to also refer to user space names.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 tools/perf/builtin-probe.c    |   12 ++++++------
 tools/perf/util/probe-event.c |   26 +++++++++++++-------------
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 710ae3d..93d5171 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -61,7 +61,7 @@ static struct {
 	struct perf_probe_event events[MAX_PROBES];
 	struct strlist *dellist;
 	struct line_range line_range;
-	const char *target_module;
+	const char *target;
 	int max_probe_points;
 	struct strfilter *filter;
 } params;
@@ -249,7 +249,7 @@ static const struct option options[] = {
 		   "file", "vmlinux pathname"),
 	OPT_STRING('s', "source", &symbol_conf.source_prefix,
 		   "directory", "path to kernel source"),
-	OPT_STRING('m', "module", &params.target_module,
+	OPT_STRING('m', "module", &params.target,
 		   "modname|path",
 		   "target module name (for online) or path (for offline)"),
 #endif
@@ -336,7 +336,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 		if (!params.filter)
 			params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
 						       NULL);
-		ret = show_available_funcs(params.target_module,
+		ret = show_available_funcs(params.target,
 					   params.filter);
 		strfilter__delete(params.filter);
 		if (ret < 0)
@@ -357,7 +357,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 			usage_with_options(probe_usage, options);
 		}
 
-		ret = show_line_range(&params.line_range, params.target_module);
+		ret = show_line_range(&params.line_range, params.target);
 		if (ret < 0)
 			pr_err("  Error: Failed to show lines. (%d)\n", ret);
 		return ret;
@@ -374,7 +374,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 
 		ret = show_available_vars(params.events, params.nevents,
 					  params.max_probe_points,
-					  params.target_module,
+					  params.target,
 					  params.filter,
 					  params.show_ext_vars);
 		strfilter__delete(params.filter);
@@ -396,7 +396,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 	if (params.nevents) {
 		ret = add_perf_probe_events(params.events, params.nevents,
 					    params.max_probe_points,
-					    params.target_module,
+					    params.target,
 					    params.force_add);
 		if (ret < 0) {
 			pr_err("  Error: Failed to add events. (%d)\n", ret);
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index eb25900..d54eefb 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -275,10 +275,10 @@ static int add_module_to_probe_trace_events(struct probe_trace_event *tevs,
 /* Try to find perf_probe_event with debuginfo */
 static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 					  struct probe_trace_event **tevs,
-					  int max_tevs, const char *module)
+					  int max_tevs, const char *target)
 {
 	bool need_dwarf = perf_probe_event_need_dwarf(pev);
-	struct debuginfo *dinfo = open_debuginfo(module);
+	struct debuginfo *dinfo = open_debuginfo(target);
 	int ntevs, ret = 0;
 
 	if (!dinfo) {
@@ -297,9 +297,9 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 
 	if (ntevs > 0) {	/* Succeeded to find trace events */
 		pr_debug("find %d probe_trace_events.\n", ntevs);
-		if (module)
+		if (target)
 			ret = add_module_to_probe_trace_events(*tevs, ntevs,
-							       module);
+							       target);
 		return ret < 0 ? ret : ntevs;
 	}
 
@@ -1798,14 +1798,14 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
 
 static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 					  struct probe_trace_event **tevs,
-					  int max_tevs, const char *module)
+					  int max_tevs, const char *target)
 {
 	struct symbol *sym;
 	int ret = 0, i;
 	struct probe_trace_event *tev;
 
 	/* Convert perf_probe_event with debuginfo */
-	ret = try_to_find_probe_trace_events(pev, tevs, max_tevs, module);
+	ret = try_to_find_probe_trace_events(pev, tevs, max_tevs, target);
 	if (ret != 0)
 		return ret;	/* Found in debuginfo or got an error */
 
@@ -1821,8 +1821,8 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 		goto error;
 	}
 
-	if (module) {
-		tev->point.module = strdup(module);
+	if (target) {
+		tev->point.module = strdup(target);
 		if (tev->point.module == NULL) {
 			ret = -ENOMEM;
 			goto error;
@@ -1886,7 +1886,7 @@ struct __event_package {
 };
 
 int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
-			  int max_tevs, const char *module, bool force_add)
+			  int max_tevs, const char *target, bool force_add)
 {
 	int i, j, ret;
 	struct __event_package *pkgs;
@@ -1909,7 +1909,7 @@ int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
 		ret  = convert_to_probe_trace_events(pkgs[i].pev,
 						     &pkgs[i].tevs,
 						     max_tevs,
-						     module);
+						     target);
 		if (ret < 0)
 			goto end;
 		pkgs[i].ntevs = ret;
@@ -2065,7 +2065,7 @@ static int filter_available_functions(struct map *map __unused,
 	return 1;
 }
 
-int show_available_funcs(const char *module, struct strfilter *_filter)
+int show_available_funcs(const char *target, struct strfilter *_filter)
 {
 	struct map *map;
 	int ret;
@@ -2076,9 +2076,9 @@ int show_available_funcs(const char *module, struct strfilter *_filter)
 	if (ret < 0)
 		return ret;
 
-	map = kernel_get_module_map(module);
+	map = kernel_get_module_map(target);
 	if (!map) {
-		pr_err("Failed to find %s map.\n", (module) ? : "kernel");
+		pr_err("Failed to find %s map.\n", (target) ? : "kernel");
 		return -EINVAL;
 	}
 	available_func_filter = _filter;


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support
  2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2012-01-10 11:49 ` [PATCH v9 3.2 8/9] perf: rename target_module to target Srikar Dronamraju
@ 2012-01-16  8:34 ` Ingo Molnar
  2012-01-16 15:17   ` Srikar Dronamraju
  8 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2012-01-16  8:34 UTC (permalink / raw)
  To: Srikar Dronamraju, Andrew Morton, Arnaldo Carvalho de Melo
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> This patchset implements Uprobes which enables you to 
> dynamically probe any routine in a user space application and 
> collect information non-disruptively.

Did all review feedback get addressed in your latest tree?

If yes then it would be nice to hear the opinion of Andrew about 
this bit:

>  mm/mmap.c                               |   33 +-

The relevant portion of the patch is:

> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -30,6 +30,7 @@
>  #include <linux/perf_event.h>
>  #include <linux/audit.h>
>  #include <linux/khugepaged.h>
> +#include <linux/uprobes.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/cacheflush.h>
> @@ -616,6 +617,13 @@ again:			remove_next = 1 + (end > next->vm_end);
>  	if (mapping)
>  		mutex_unlock(&mapping->i_mmap_mutex);
>  
> +	if (root) {
> +		mmap_uprobe(vma);
> +
> +		if (adjust_next)
> +			mmap_uprobe(next);
> +	}
> +
>  	if (remove_next) {
>  		if (file) {
>  			fput(file);
> @@ -637,6 +645,8 @@ again:			remove_next = 1 + (end > next->vm_end);
>  			goto again;
>  		}
>  	}
> +	if (insert && file)
> +		mmap_uprobe(insert);
>  
>  	validate_mm(mm);
>  
> @@ -1329,6 +1339,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>  			mm->locked_vm += (len >> PAGE_SHIFT);
>  	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
>  		make_pages_present(addr, addr + len);
> +
> +	if (file && mmap_uprobe(vma))
> +		/* matching probes but cannot insert */
> +		goto unmap_and_free_vma;
> +
>  	return addr;
>  
>  unmap_and_free_vma:
> @@ -2305,6 +2320,10 @@ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
>  	if ((vma->vm_flags & VM_ACCOUNT) &&
>  	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
>  		return -ENOMEM;
> +
> +	if (vma->vm_file && mmap_uprobe(vma))
> +		return -EINVAL;
> +
>  	vma_link(mm, vma, prev, rb_link, rb_parent);
>  	return 0;
>  }
> @@ -2356,6 +2375,10 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
>  			new_vma->vm_pgoff = pgoff;
>  			if (new_vma->vm_file) {
>  				get_file(new_vma->vm_file);
> +
> +				if (mmap_uprobe(new_vma))
> +					goto out_free_mempol;
> +
>  				if (vma->vm_flags & VM_EXECUTABLE)
>  					added_exe_file_vma(mm);
>  			}

it's named mmap_uprobe(), which makes it rather single-purpose. 
The uprobes code wants to track vma life-time so that it can 
manage uprobes breakpoints installed here, correct?

We already have some other vma tracking goodies in perf itself 
(see perf_event_mmap() et al) - would it make sense to merge the 
two vma instrumentation facilities and not burden mm/ with two 
separate sets of callbacks?

If all such issues are resolved then i guess we could queue up 
uprobes in -tip, conditional on it remaining sufficiently 
regression-, problem- and NAK-free.

Also, it would be nice to hear Arnaldo's opinion about the 
tools/perf/ bits.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-10 11:49 ` [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface Srikar Dronamraju
@ 2012-01-16 13:11   ` Jiri Olsa
  2012-01-16 14:45     ` Srikar Dronamraju
  2012-01-17  9:28     ` Ingo Molnar
  0 siblings, 2 replies; 45+ messages in thread
From: Jiri Olsa @ 2012-01-16 13:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

On Tue, Jan 10, 2012 at 05:19:43PM +0530, Srikar Dronamraju wrote:
> 
> Implements trace_event support for uprobes. In its current form it can
> be used to put probes at a specified offset in a file and dump the
> required registers when the code flow reaches the probed address.
> 
> The following example shows how to dump the instruction pointer and %ax
> a register at the probed text address.  Here we are trying to probe
> zfree in /bin/zsh
> 
> # cd /sys/kernel/debug/tracing/
> # cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
> 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
> # objdump -T /bin/zsh | grep -w zfree
> 0000000000446420 g    DF .text  0000000000000012  Base        zfree
> # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
> # cat uprobe_events
> p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
> # echo 1 > events/uprobes/enable
> # sleep 20
> # echo 0 > events/uprobes/enable
> # cat trace
> # tracer: nop
> #
> #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> #              | |       |          |         |
>              zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
>              zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
>              zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
>              zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
> 
> TODO: Connect a filter to a consumer.

hi,

It looks to me the perf/trace filtering is already working
with the small fix attached.

In another email discussion I think you suggested to use
the 'filter property of uprobe consumer' for the perf/trace
filtering.  The current code does the filtering in the handler
itself, and it's same for other perf/trace based events.

I discussed the 'filter property of uprobe consumer' with Oleg,
and he is not sure why it was added.. was it only for the perf/trace
filtering?

I've tested following event:
        echo "p:probe_libc/free /lib64/libc-2.13.so:0x7a4f0 %ax" > ./uprobe_events

and commands like:
        perf record -a -e probe_libc:free  --filter "common_pid == 1127"
        perf record -e probe_libc:free --filter "arg1 == 0xa" ls

got me proper results.

thanks,
jirka

---
The preemption needs to be disabled when submitting data into perf.

---
 kernel/trace/trace_uprobe.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index af29368..4d3857c 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -653,9 +653,11 @@ static void uprobe_perf_func(struct trace_uprobe *tp, struct pt_regs *regs)
 		     "profile buffer not large enough"))
 		return;
 
+	preempt_disable();
+
 	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
 	if (!entry)
-		return;
+		goto out;
 
 	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
 	data = (u8 *)&entry[1];
@@ -665,6 +667,8 @@ static void uprobe_perf_func(struct trace_uprobe *tp, struct pt_regs *regs)
 
 	head = this_cpu_ptr(call->perf_events);
 	perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+ out:
+	preempt_enable();
 }
 #endif	/* CONFIG_PERF_EVENTS */
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-16 13:11   ` Jiri Olsa
@ 2012-01-16 14:45     ` Srikar Dronamraju
  2012-01-16 15:33       ` Jiri Olsa
  2012-01-17  9:28     ` Ingo Molnar
  1 sibling, 1 reply; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-16 14:45 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

> 
> I've tested following event:
>         echo "p:probe_libc/free /lib64/libc-2.13.so:0x7a4f0 %ax" > ./uprobe_events
> 
> and commands like:
>         perf record -a -e probe_libc:free  --filter "common_pid == 1127"
>         perf record -e probe_libc:free --filter "arg1 == 0xa" ls
> 
> got me proper results.
> 

Okay thanks for the inputs.

> thanks,
> jirka
> 
> ---
> The preemption needs to be disabled when submitting data into perf.

I actually looked at other places where perf_trace_buf_prepare and
perf_trace_buf_submit are being called. for example perf_syscall_enter
and perf_syscall_exit both call the above routines and they didnt seem
to be called with premption disabled. Is that the way perf probe is
called in our case that needs us to call pre-emption here? Did you see a
case where calling these without preemption disabled caused a problem?


> ---
>  kernel/trace/trace_uprobe.c |    6 +++++-
>  1 files changed, 5 insertions(+), 1 deletions(-)
> 
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index af29368..4d3857c 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -653,9 +653,11 @@ static void uprobe_perf_func(struct trace_uprobe *tp, struct pt_regs *regs)
>  		     "profile buffer not large enough"))
>  		return;
> 
> +	preempt_disable();
> +
>  	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
>  	if (!entry)
> -		return;
> +		goto out;
> 
>  	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
>  	data = (u8 *)&entry[1];
> @@ -665,6 +667,8 @@ static void uprobe_perf_func(struct trace_uprobe *tp, struct pt_regs *regs)
> 
>  	head = this_cpu_ptr(call->perf_events);
>  	perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
> + out:
> +	preempt_enable();
>  }
>  #endif	/* CONFIG_PERF_EVENTS */
> 

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support
  2012-01-16  8:34 ` [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Ingo Molnar
@ 2012-01-16 15:17   ` Srikar Dronamraju
  2012-01-17  9:39     ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-16 15:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andrew Morton, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell

* Ingo Molnar <mingo@elte.hu> [2012-01-16 09:34:42]:

> 
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> 
> > This patchset implements Uprobes which enables you to 
> > dynamically probe any routine in a user space application and 
> > collect information non-disruptively.
> 
> Did all review feedback get addressed in your latest tree?

I think this question would be better answered by Peter, Oleg and
Masami.  For my part, I have fixed all comments till now.  Also uprobes
has been part of -next for quite sometime.

> 
> If yes then it would be nice to hear the opinion of Andrew about 
> this bit:
> 
> >  mm/mmap.c                               |   33 +-
> 
> The relevant portion of the patch is:
> 
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -30,6 +30,7 @@
> >  #include <linux/perf_event.h>
> >  #include <linux/audit.h>
> >  #include <linux/khugepaged.h>
> > +#include <linux/uprobes.h>
> >  
> >  #include <asm/uaccess.h>
> >  #include <asm/cacheflush.h>
> > @@ -616,6 +617,13 @@ again:			remove_next = 1 + (end > next->vm_end);
> >  	if (mapping)
> >  		mutex_unlock(&mapping->i_mmap_mutex);
> >  
> > +	if (root) {
> > +		mmap_uprobe(vma);
> > +
> > +		if (adjust_next)
> > +			mmap_uprobe(next);
> > +	}
> > +
> >  	if (remove_next) {
> >  		if (file) {
> >  			fput(file);
> > @@ -637,6 +645,8 @@ again:			remove_next = 1 + (end > next->vm_end);
> >  			goto again;
> >  		}
> >  	}
> > +	if (insert && file)
> > +		mmap_uprobe(insert);
> >  
> >  	validate_mm(mm);
> >  
> > @@ -1329,6 +1339,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> >  			mm->locked_vm += (len >> PAGE_SHIFT);
> >  	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
> >  		make_pages_present(addr, addr + len);
> > +
> > +	if (file && mmap_uprobe(vma))
> > +		/* matching probes but cannot insert */
> > +		goto unmap_and_free_vma;
> > +
> >  	return addr;
> >  
> >  unmap_and_free_vma:
> > @@ -2305,6 +2320,10 @@ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
> >  	if ((vma->vm_flags & VM_ACCOUNT) &&
> >  	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
> >  		return -ENOMEM;
> > +
> > +	if (vma->vm_file && mmap_uprobe(vma))
> > +		return -EINVAL;
> > +
> >  	vma_link(mm, vma, prev, rb_link, rb_parent);
> >  	return 0;
> >  }
> > @@ -2356,6 +2375,10 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
> >  			new_vma->vm_pgoff = pgoff;
> >  			if (new_vma->vm_file) {
> >  				get_file(new_vma->vm_file);
> > +
> > +				if (mmap_uprobe(new_vma))
> > +					goto out_free_mempol;
> > +
> >  				if (vma->vm_flags & VM_EXECUTABLE)
> >  					added_exe_file_vma(mm);
> >  			}
> 
> it's named mmap_uprobe(), which makes it rather single-purpose. 
> The uprobes code wants to track vma life-time so that it can 
> manage uprobes breakpoints installed here, correct?
> 

Yes, 

> We already have some other vma tracking goodies in perf itself 
> (see perf_event_mmap() et al) - would it make sense to merge the 
> two vma instrumentation facilities and not burden mm/ with two 
> separate sets of callbacks?

Atleast for file backed vmas, perf_event_mmap seems to be interested in
just the new vma creations. Uprobes would also be interested in the size
changes like the vma growing/shrinking/remap. Is perf_event_mmap
interested in such changes? From what i could see, perf_event_mmap seems
to be interested in stack vma size changes but not file vma size
changes.

Also mmap_uprobe gets called in fork path. Currently we have a hook in
copy_mm/dup_mm so that we get to know the context of each vma that gets
added to the child and add its breakpoints. At dup_mm/dup_mmap we would
have taken mmap_sem for both parent and child so there is no way we
could have missed a register/unregister in the parent not reflected in
the child.

I see the perf_event_fork but that would have to enhanced to do a lot
more to help us do a mmap_uprobe.

> 
> If all such issues are resolved then i guess we could queue up 
> uprobes in -tip, conditional on it remaining sufficiently 
> regression-, problem- and NAK-free.

Okay. Accepting uprobes into tip, would provide more testing/feedback.

> 
> Also, it would be nice to hear Arnaldo's opinion about the 
> tools/perf/ bits.

Whatever comments Arnaldo/Masami have given till now have been resolved.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-16 14:45     ` Srikar Dronamraju
@ 2012-01-16 15:33       ` Jiri Olsa
  2012-01-16 16:41         ` Srikar Dronamraju
  0 siblings, 1 reply; 45+ messages in thread
From: Jiri Olsa @ 2012-01-16 15:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

On Mon, Jan 16, 2012 at 08:15:38PM +0530, Srikar Dronamraju wrote:
> > 
> > I've tested following event:
> >         echo "p:probe_libc/free /lib64/libc-2.13.so:0x7a4f0 %ax" > ./uprobe_events
> > 
> > and commands like:
> >         perf record -a -e probe_libc:free  --filter "common_pid == 1127"
> >         perf record -e probe_libc:free --filter "arg1 == 0xa" ls
> > 
> > got me proper results.
> > 
> 
> Okay thanks for the inputs.
> 
> > thanks,
> > jirka
> > 
> > ---
> > The preemption needs to be disabled when submitting data into perf.
> 
> I actually looked at other places where perf_trace_buf_prepare and
> perf_trace_buf_submit are being called. for example perf_syscall_enter
> and perf_syscall_exit both call the above routines and they didnt seem
> to be called with premption disabled. Is that the way perf probe is
> called in our case that needs us to call pre-emption here? Did you see a
> case where calling these without preemption disabled caused a problem?

the perf_trace_buf_prepare touches per cpu variables,
hence the preemption disabling

the perf_trace_buf_prepare code is used by syscalls,
kprobes, and trace events

- both syscalls and trace events are implemented by
  tracepoints which disable preemption before calling the probe
  (see __DO_TRACE macro in include/linux/tracepoint.h)

- kprobes disable preemption as well
  (kprobe_handler in arch/x86/kernel/kprobes.c)
  haven't checked the optimalized kprobes,
  but should be the same case

jirka

> 
> 
> > ---
> >  kernel/trace/trace_uprobe.c |    6 +++++-
> >  1 files changed, 5 insertions(+), 1 deletions(-)
> > 
> > diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> > index af29368..4d3857c 100644
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -653,9 +653,11 @@ static void uprobe_perf_func(struct trace_uprobe *tp, struct pt_regs *regs)
> >  		     "profile buffer not large enough"))
> >  		return;
> > 
> > +	preempt_disable();
> > +
> >  	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
> >  	if (!entry)
> > -		return;
> > +		goto out;
> > 
> >  	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
> >  	data = (u8 *)&entry[1];
> > @@ -665,6 +667,8 @@ static void uprobe_perf_func(struct trace_uprobe *tp, struct pt_regs *regs)
> > 
> >  	head = this_cpu_ptr(call->perf_events);
> >  	perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
> > + out:
> > +	preempt_enable();
> >  }
> >  #endif	/* CONFIG_PERF_EVENTS */
> > 
> 
> -- 
> Thanks and Regards
> Srikar
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-16 15:33       ` Jiri Olsa
@ 2012-01-16 16:41         ` Srikar Dronamraju
  0 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-16 16:41 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

* Jiri Olsa <jolsa@redhat.com> [2012-01-16 16:33:27]:

> On Mon, Jan 16, 2012 at 08:15:38PM +0530, Srikar Dronamraju wrote:
> > > 
> > > I've tested following event:
> > >         echo "p:probe_libc/free /lib64/libc-2.13.so:0x7a4f0 %ax" > ./uprobe_events
> > > 
> > > and commands like:
> > >         perf record -a -e probe_libc:free  --filter "common_pid == 1127"
> > >         perf record -e probe_libc:free --filter "arg1 == 0xa" ls
> > > 
> > > got me proper results.
> > > 
> > 
> > Okay thanks for the inputs.
> > 
> > > thanks,
> > > jirka
> > > 
> > > ---
> > > The preemption needs to be disabled when submitting data into perf.
> > 
> > I actually looked at other places where perf_trace_buf_prepare and
> > perf_trace_buf_submit are being called. for example perf_syscall_enter
> > and perf_syscall_exit both call the above routines and they didnt seem
> > to be called with premption disabled. Is that the way perf probe is
> > called in our case that needs us to call pre-emption here? Did you see a
> > case where calling these without preemption disabled caused a problem?
> 
> the perf_trace_buf_prepare touches per cpu variables,
> hence the preemption disabling
> 
> the perf_trace_buf_prepare code is used by syscalls,
> kprobes, and trace events
> 
> - both syscalls and trace events are implemented by
>   tracepoints which disable preemption before calling the probe
>   (see __DO_TRACE macro in include/linux/tracepoint.h)
> 
> - kprobes disable preemption as well
>   (kprobe_handler in arch/x86/kernel/kprobes.c)
>   haven't checked the optimalized kprobes,
>   but should be the same case
> 
> jirka
> 

Okay, thanks for the clarification, I will pick the fix.

--
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-16 13:11   ` Jiri Olsa
  2012-01-16 14:45     ` Srikar Dronamraju
@ 2012-01-17  9:28     ` Ingo Molnar
  2012-01-17 10:22       ` Srikar Dronamraju
  1 sibling, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2012-01-17  9:28 UTC (permalink / raw)
  To: Jiri Olsa, Arnaldo Carvalho de Melo
  Cc: Srikar Dronamraju, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell


* Jiri Olsa <jolsa@redhat.com> wrote:

> I've tested following event:
>         echo "p:probe_libc/free /lib64/libc-2.13.so:0x7a4f0 %ax" > ./uprobe_events
> 
> and commands like:
>         perf record -a -e probe_libc:free  --filter "common_pid == 1127"
>         perf record -e probe_libc:free --filter "arg1 == 0xa" ls
> 
> got me proper results.

Btw., Srikar, if that's the primary UI today then we'll need to 
make it a *lot* more user-friendly than the above usage 
workflow.

In particular this line:

>         echo "p:probe_libc/free /lib64/libc-2.13.so:0x7a4f0 %ax" > ./uprobe_events

is not something a mere mortal will be able to figure out.

There needs to be perf probe integration, that allows intuitive 
usage, such as:

   perf probe add libc:free

Using the perf symbols code it should first search a libc*so DSO 
in the system, finding say /lib64/libc-2.15.so. The 'free' 
symbol is readily available there:

  aldebaran:~> eu-readelf -s /lib64/libc-2.15.so  | grep ' free$'
  7186: 00000039ff47f080    224 FUNC    GLOBAL DEFAULT       12 free

then the tool can automatically turn that symbol information 
into the specific probe.

Will it all work with DSO randomization, prelinking and default 
placement as well?

Users should not be expected to enter magic hexa numbers to get 
a trivial usecase going ...

this bit:

>         perf record -a -e probe_libc:free  --filter "common_pid == 1127"
>         perf record -e probe_libc:free --filter "arg1 == 0xa" ls

looks good and intuitive and 'perf list' should list all the 
available uprobes.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support
  2012-01-16 15:17   ` Srikar Dronamraju
@ 2012-01-17  9:39     ` Ingo Molnar
  2012-01-25 14:11       ` Peter Zijlstra
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2012-01-17  9:39 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Andrew Morton, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> > We already have some other vma tracking goodies in perf 
> > itself (see perf_event_mmap() et al) - would it make sense 
> > to merge the two vma instrumentation facilities and not 
> > burden mm/ with two separate sets of callbacks?
> 
> Atleast for file backed vmas, perf_event_mmap seems to be 
> interested in just the new vma creations. Uprobes would also 
> be interested in the size changes like the vma 
> growing/shrinking/remap. Is perf_event_mmap interested in such 
> changes? From what i could see, perf_event_mmap seems to be 
> interested in stack vma size changes but not file vma size 
> changes.

I don't think perf_event_mmap() would be hurt from getting more 
callbacks, as long as there's enough metadata to differentiate 
the various events from each other.

( In the long run we really want all such callbacks to be
  tracepoints so that we can make them even less intrusive via
  jump label patching optimizations, etc. - but we are not there
  yet. )

> Also mmap_uprobe gets called in fork path. Currently we have a 
> hook in copy_mm/dup_mm so that we get to know the context of 
> each vma that gets added to the child and add its breakpoints. 
> At dup_mm/dup_mmap we would have taken mmap_sem for both 
> parent and child so there is no way we could have missed a 
> register/unregister in the parent not reflected in the child.
> 
> I see the perf_event_fork but that would have to enhanced to 
> do a lot more to help us do a mmap_uprobe.

Andrew's call i guess.

I did not suggest anything complex or intrusive: just basically 
unify the namespace, have a single set of callbacks, and call 
into the uprobes and perf code from those callbacks - out of the 
sight of MM code.

That unified namespace could be called:

    event_mmap(...);
    event_fork(...);

etc. - and from event_mmap() you could do a simple:

	perf_event_mmap(...)
	uprobes_event_mmap(...)

[ Once all this is updated to use tracepoints it would turn into 
  a notification callback chain kind of thing. ]

> > If all such issues are resolved then i guess we could queue 
> > up uprobes in -tip, conditional on it remaining sufficiently 
> > regression-, problem- and NAK-free.
> 
> Okay. Accepting uprobes into tip, would provide more 
> testing/feedback.
> 
> > Also, it would be nice to hear Arnaldo's opinion about the 
> > tools/perf/ bits.
> 
> Whatever comments Arnaldo/Masami have given till now have been 
> resolved.

Please see the probe installation syntax feedback i've sent in 
the previous mail - that needs to be addressed too.

You should think with the head of an ordinary user-space 
developer who wants to say debug a library and wants to use 
uprobes for that. What would he have to type to achieve that and 
is what he types the minimum number of characters to reasonably 
achieve that goal?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17  9:28     ` Ingo Molnar
@ 2012-01-17 10:22       ` Srikar Dronamraju
  2012-01-17 10:59         ` Ingo Molnar
                           ` (3 more replies)
  0 siblings, 4 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-17 10:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell

> > 
> > and commands like:
> >         perf record -a -e probe_libc:free  --filter "common_pid == 1127"
> >         perf record -e probe_libc:free --filter "arg1 == 0xa" ls
> > 
> > got me proper results.
> 
> Btw., Srikar, if that's the primary UI today then we'll need to 
> make it a *lot* more user-friendly than the above usage 
> workflow.
> 
> In particular this line:
> 
> >         echo "p:probe_libc/free /lib64/libc-2.13.so:0x7a4f0 %ax" > ./uprobe_events
> 
> is not something a mere mortal will be able to figure out.

Agree, perf probe is the primary interface to use uprobes.

> 
> There needs to be perf probe integration, that allows intuitive 
> usage, such as:
> 
>    perf probe add libc:free

Current usage is like perf probe -x <executable> -a <func1> -a <func2>
So we could use
perf probe -x /lib64/libc.so.6 free

or 
perf probe -x /lib64/libc.so.6 -a free -a malloc -a strcpy

The -x option helps perf to identify  that its a user space based probing.
This currently restricts that all probes defined per "perf probe"
invocation to just one executable. This usage was suggested by Masami.

Earlier we used perf probe free@/lib/libc.so.6 malloc@/lib/libc.so.6
The objection for this was that perf was already using @ to signify
source file. Similarly : is already used for Relative line number.

This also goes with perf probe -F -x /lib64/libc.so to list the
available probes in libc.

> 
> Using the perf symbols code it should first search a libc*so DSO 
> in the system, finding say /lib64/libc-2.15.so. The 'free' 
> symbol is readily available there:

While I understand the ease of using a libc instead of the full path, I
think it does have few issues.
- Do we need to keep checking if the new files created in the system
  match the pattern and if they match the pattern, dynamically add the
  files?

- Also the current model helps if the user wants to restrict his trace
  to particular executable where there are more that one executable with
  the same name.

> 
>   aldebaran:~> eu-readelf -s /lib64/libc-2.15.so  | grep ' free$'
>   7186: 00000039ff47f080    224 FUNC    GLOBAL DEFAULT       12 free
> 
> then the tool can automatically turn that symbol information 
> into the specific probe.
> 

Given a function in an executable, we are 

> Will it all work with DSO randomization, prelinking and default 
> placement as well?

Works with DSO randomization, I havent tried with prelinking.  did you
mean http://en.wikipedia.org/wiki/Placement_syntax#Default_placement
when you said default placement?


> 
> Users should not be expected to enter magic hexa numbers to get 
> a trivial usecase going ...

yes, that why we dont allow perf probe -x /lib/libc.so.6 0x1234

> 
> this bit:
> 
> >         perf record -a -e probe_libc:free  --filter "common_pid == 1127"
> >         perf record -e probe_libc:free --filter "arg1 == 0xa" ls
> 
> looks good and intuitive and 'perf list' should list all the 
> available uprobes.

Currently "perf probe -F -x /bin/zsh" lists all available functions in
zsh. perf probe --list has been enhanced to show uprobes that are
already registered.

Similar to kprobes based probes, available user space probes are not
part of "perf list".
> 
> Thanks,
> 
> 	Ingo
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 10:22       ` Srikar Dronamraju
@ 2012-01-17 10:59         ` Ingo Molnar
  2012-01-17 11:03           ` Srikar Dronamraju
  2012-01-17 11:22         ` Ingo Molnar
                           ` (2 subsequent siblings)
  3 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2012-01-17 10:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


btw., i'd write the CONFIG_UPROBE_EVENT .config help text:

 This allows the user to add tracing events on top of userspace
 dynamic events (similar to tracepoints) on the fly via the
 traceevents interface. Those events can be inserted wherever 
 uprobes can probe, and record various registers.

 This option is required if you plan to use perf-probe 
 subcommand of perf tools on user space applications.

In a clearer form:

 This allows the user to add user-space dynamic events on the 
 fly, similar to tracepoints. These events can be inserted
 into just about any user-space code transparently, and can 
 record register state there on the fly, which is then used by
 instrumentation tools.

 This option is required if you plan to use the "perf probe"
 subcommand of the perf tool on user space applications.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 10:59         ` Ingo Molnar
@ 2012-01-17 11:03           ` Srikar Dronamraju
  0 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-17 11:03 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell

* Ingo Molnar <mingo@elte.hu> [2012-01-17 11:59:16]:

> 
> btw., i'd write the CONFIG_UPROBE_EVENT .config help text:
> 
>  This allows the user to add tracing events on top of userspace
>  dynamic events (similar to tracepoints) on the fly via the
>  traceevents interface. Those events can be inserted wherever 
>  uprobes can probe, and record various registers.
> 
>  This option is required if you plan to use perf-probe 
>  subcommand of perf tools on user space applications.
> 
> In a clearer form:
> 
>  This allows the user to add user-space dynamic events on the 
>  fly, similar to tracepoints. These events can be inserted
>  into just about any user-space code transparently, and can 
>  record register state there on the fly, which is then used by
>  instrumentation tools.
> 
>  This option is required if you plan to use the "perf probe"
>  subcommand of the perf tool on user space applications.
> 

Okay, Will change as suggested.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 10:22       ` Srikar Dronamraju
  2012-01-17 10:59         ` Ingo Molnar
@ 2012-01-17 11:22         ` Ingo Molnar
  2012-01-17 11:57           ` Srikar Dronamraju
  2012-01-17 12:11         ` Ingo Molnar
  2012-01-17 12:21         ` Ingo Molnar
  3 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2012-01-17 11:22 UTC (permalink / raw)
  To: Srikar Dronamraju, Paul E. McKenney
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


Srikar, rebased commits like this one:

  c5af743: rcu: Introduce raw SRCU read-side primitives

are absolutely inacceptable:

 commit c5af7439f322db86e347991e184b95dd80676967
 Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
 Date:   Sun Oct 9 15:13:11 2011 -0700

    rcu: Introduce raw SRCU read-side primitives

    ...

    Requested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

If you pull a commit from Paul into your tree then you must 
never rebase it. If you applied Paul's commit out of email then 
you should add a Signed-off-by, not a Tested-by tag at the end.

Also, please test whether it works when merged to latest -tip, 
there's been various RCU changes in this area. AFAICS these 
read-side primitives are already upstream via:

  0c53dd8b3140: rcu: Introduce raw SRCU read-side primitives

so pulling your tree would break the build.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 11:22         ` Ingo Molnar
@ 2012-01-17 11:57           ` Srikar Dronamraju
  2012-01-17 12:23             ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-17 11:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Paul E. McKenney, Jiri Olsa, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

> Srikar, rebased commits like this one:
> 
>   c5af743: rcu: Introduce raw SRCU read-side primitives
> 
> are absolutely inacceptable:
> 

Okay, I wasnt sure whats the best way for me to expose a tree such that
people who just use my tree to try (without basing on any other trees
except Linus) can build and use uprobes but people who already have this
commit arent affected by me having this commit.

So couple of people did tell me that git was intelligent enuf that if
the commit was already in, it would create a empty commit. 

do you have suggestions on how to work around such situations in future?

>  commit c5af7439f322db86e347991e184b95dd80676967
>  Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>  Date:   Sun Oct 9 15:13:11 2011 -0700
> 
>     rcu: Introduce raw SRCU read-side primitives
> 
>     ...
> 
>     Requested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>     Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> 
> If you pull a commit from Paul into your tree then you must 
> never rebase it. If you applied Paul's commit out of email then 
> you should add a Signed-off-by, not a Tested-by tag at the end.

I actually cherry-picked commit from your tree i.e -tip. 
The Tested-by tag was added by Paul after I confirmed to him that it
works.  Even the commit 0c53dd8b3140: that you refer has the same tags.

> 
> Also, please test whether it works when merged to latest -tip, 
> there's been various RCU changes in this area. AFAICS these 
> read-side primitives are already upstream via:
> 
>   0c53dd8b3140: rcu: Introduce raw SRCU read-side primitives
> 
> so pulling your tree would break the build.

I had used v3.2 as a base for my tree and 0c53dd8b3140: was not part of
that tag. Now that its gone into mainline, I will drop the commit.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 10:22       ` Srikar Dronamraju
  2012-01-17 10:59         ` Ingo Molnar
  2012-01-17 11:22         ` Ingo Molnar
@ 2012-01-17 12:11         ` Ingo Molnar
  2012-01-17 12:21         ` Ingo Molnar
  3 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2012-01-17 12:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


A couple of 'perf probe' usability issues.

When running it as unprivileged user right now it fails with:

 $ perf probe -x /lib64/libc.so.6 free
 Failed to open uprobe_events file: Permission denied

That error message should reference the full file name in 
question, i.e. /sys/kernel/debug/tracing/uprobe_events - that 
way the user can make that file writable if it's secure to do 
that on that system.

The other thing is the text that gets printed:

 $ perf probe --del free
 Remove event: probe_libc:free

that's not how tools generally communicate - it should be 
something like:

 $ perf probe --del free
 Removed event: probe_libc:free

Note the past tense - this tells the user that the action has 
been performed successfully.

Likewise, 'perf probe --add' should talk in past tense as well, 
to indicate success. So it should say something like:

 $ perf probe -x /lib64/libc.so.6 free
 Added new event:
  probe_libc:free      (on 0x7f080)

 You can now use it in all perf tools, such as:

	perf record -e probe_libc:free -aR sleep 1

(Also note the s/on/in change in the other text.)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 10:22       ` Srikar Dronamraju
                           ` (2 preceding siblings ...)
  2012-01-17 12:11         ` Ingo Molnar
@ 2012-01-17 12:21         ` Ingo Molnar
  2012-01-18 10:07           ` Srikar Dronamraju
  3 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2012-01-17 12:21 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


Have you tried to use 'perf probe' to achieve any useful 
instrumentation on a real app?

I just tried out the 'glibc:free' usecase and it's barely 
usable.

Firstly, the recording very frequently produces overruns:

 $ perf record -e probe_libc:free -aR sleep 1
 [ perf record: Woken up 169 times to write data ]
 [ perf record: Captured and wrote 89.674 MB perf.data (~3917919 samples) ]
 Warning:Processed 1349133 events and lost 1 chunks!

Using -m 4096 made it work better.

Adding -g for call-graph profiling caused 'perf report' to lock 
up:

  perf record -m 4096 -e probe_libc:free -agR sleep 1
  perf report
  [ loops forever ]

I've sent a testcase to Arnaldo separately. Note that perf 
report --stdio appears to work.

Regular '-e cycles -g' works fine, so this is a uprobes specific 
bug.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 11:57           ` Srikar Dronamraju
@ 2012-01-17 12:23             ` Ingo Molnar
  2012-01-17 12:27               ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Ingo Molnar @ 2012-01-17 12:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Paul E. McKenney, Jiri Olsa, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell


* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:

> > Srikar, rebased commits like this one:
> > 
> >   c5af743: rcu: Introduce raw SRCU read-side primitives
> > 
> > are absolutely inacceptable:
> > 
> 
> Okay, I wasnt sure whats the best way for me to expose a tree 
> such that people who just use my tree to try (without basing 
> on any other trees except Linus) can build and use uprobes but 
> people who already have this commit arent affected by me 
> having this commit.
> 
> So couple of people did tell me that git was intelligent enuf 
> that if the commit was already in, it would create a empty 
> commit.
> 
> do you have suggestions on how to work around such situations 
> in future?

You can just apply it out of email, as a regular commit.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 12:23             ` Ingo Molnar
@ 2012-01-17 12:27               ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2012-01-17 12:27 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Paul E. McKenney, Jiri Olsa, Arnaldo Carvalho de Melo,
	Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell


a couple of other 'perf probe' suggestions:

i think the 'perf probe' UI should be streamlined.

Right now we have kprobes assumptions like this syntax:

  perf probe <kernel_symbol>

will add a probe.

That is nonsensical once we merge uprobes, as i'd expect much 
more developers to be interested in user-space symbols than 
kernel-space symbols.

So we could change this syntax and add the more intuitive:

  perf probe add ...
  perf probe del ...
  perf probe list

Variants. Also, the -x syntax isnt particularly intuitive either 
- it should be a 'perf probe add ...' variant.

If someone uses the old syntax it will produce a meaningful 
error message.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception.
  2012-01-10 11:48 ` [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception Srikar Dronamraju
@ 2012-01-18  8:39   ` Anton Arapov
  2012-01-18  9:02     ` Srikar Dronamraju
  0 siblings, 1 reply; 45+ messages in thread
From: Anton Arapov @ 2012-01-18  8:39 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

On Tue, Jan 10, 2012 at 05:18:42PM +0530, Srikar Dronamraju wrote:
[snip]
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 8208234..475563b 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
[snip]
> @@ -37,6 +39,21 @@ struct uprobe_arch_info {
>  #endif
>  };
>  
> +struct uprobe_task_arch_info {
> +	unsigned long saved_trap_no;
> +#ifdef CONFIG_X86_64
> +	unsigned long saved_scratch_register;
> +#endif
> +};
> +
>  struct uprobe;
> +
>  extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
> +extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
Srikar,

  Can we use existing SET_IP() instead of set_instruction_pointer() ?

[snip]  
>  static void __exit exit_uprobes(void)
> 

===
[PATCH] uprobes: cleanup, eliminate set_instruction_pointer(), use existing SET_IP() instead

Use SET_IP() available in include/asm-generic/ptrace.h

Signed-off-by: Anton Arapov <anton@redhat.com>
---
 arch/x86/include/asm/uprobes.h |    1 -
 arch/x86/kernel/uprobes.c      |   12 +-----------
 kernel/uprobes.c               |    2 +-
 3 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 475563b..88df7ec 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -49,7 +49,6 @@ struct uprobe_task_arch_info {
 struct uprobe;
 
 extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
-extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
 extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
 extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
 extern bool xol_was_trapped(struct task_struct *tsk);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index e4e0dfd..08b633f 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -409,16 +409,6 @@ int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe)
 	return 0;
 }
 
-/*
- * @reg: reflects the saved state of the task
- * @vaddr: the virtual address to jump to.
- * Return 0 on success or a -ve number on error.
- */
-void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
-{
-	regs->ip = vaddr;
-}
-
 #define	UPROBE_TRAP_NO		UINT_MAX
 
 /*
@@ -624,7 +614,7 @@ void abort_xol(struct pt_regs *regs, struct uprobe *uprobe)
 
 	current->thread.trap_no = utask->tskinfo.saved_trap_no;
 	handle_riprel_post_xol(uprobe, regs, NULL);
-	set_instruction_pointer(regs, utask->vaddr);
+	SET_IP(regs, utask->vaddr);
 }
 
 /*
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 0918448..b0db46b 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1479,7 +1479,7 @@ cleanup_ret:
 	}
 	if (u) {
 		if (!(u->flags & UPROBES_SKIP_SSTEP))
-			set_instruction_pointer(regs, probept);
+			SET_IP(regs, probept);
 
 		put_uprobe(u);
 	} else
-- 
1.7.7.5


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception.
  2012-01-18  8:39   ` Anton Arapov
@ 2012-01-18  9:02     ` Srikar Dronamraju
  2012-01-18 10:18       ` Mike Frysinger
  0 siblings, 1 reply; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-18  9:02 UTC (permalink / raw)
  To: Anton Arapov
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell,
	Mike Frysinger

> Srikar,
> 
>   Can we use existing SET_IP() instead of set_instruction_pointer() ?
> 

Oleg had already commented about this in one his uprobes reviews.

The GET_IP/SET_IP available in include/asm-generic/ptrace.h doesnt work
on all archs. Atleast it doesnt work on powerpc when I tried it.

Also most archs define instruction_pointer(). So I thought (rather Peter
Zijlstra suggested the name set_instruction_pointer())
set_instruction_pointer was a better bet than SET_IP. I 

Also I dont see any usage for SET_IP/GET_IP.

May be we should have something like this in
include/asm-generic/ptrace.h

#ifdef instruction_pointer
#define GET_IP(regs)		(instruction_pointer(regs))

#define set_instruction_pointer(regs, val) (instruction_pointer(regs) = (val))

#define SET_IP(regs, val)	(set_instruction_pointer(regs,val))

#endif

or should we do away with GET_IP/SET_IP esp since there are no many
users?

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface
  2012-01-17 12:21         ` Ingo Molnar
@ 2012-01-18 10:07           ` Srikar Dronamraju
  0 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-18 10:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Arnaldo Carvalho de Melo, Peter Zijlstra,
	Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell

> 
> Have you tried to use 'perf probe' to achieve any useful 
> instrumentation on a real app?
> 
> I just tried out the 'glibc:free' usecase and it's barely 
> usable.
> 

Thanks for trying.

Yes, I have actually place probes on free and malloc while building a
kernel and it has worked for me.

> Firstly, the recording very frequently produces overruns:
> 
>  $ perf record -e probe_libc:free -aR sleep 1
>  [ perf record: Woken up 169 times to write data ]
>  [ perf record: Captured and wrote 89.674 MB perf.data (~3917919 samples) ]
>  Warning:Processed 1349133 events and lost 1 chunks!
> 

I have seen the "lost chunks" but thats not always and mostly when I
have run perf record for a very long time. But I have seen this even
with just kernel probes too. So I didnt debug that.

> Using -m 4096 made it work better.
> 
> Adding -g for call-graph profiling caused 'perf report' to lock 
> up:
> 
>   perf record -m 4096 -e probe_libc:free -agR sleep 1
>   perf report
>   [ loops forever ]
> 

This actually works for me.
I had CONFIG_PREEMPT not set. So I set it and tried with and without
Jiri's fix. It still worked for me. Can you pass me your config that
shows this hang so that I can try and fix it.

> I've sent a testcase to Arnaldo separately. Note that perf 
> report --stdio appears to work.

There are no changes in perf report because of uprobes. So the data
collected from uprobes could trigger the hang. I was speculating that
not disabling preemption that Jiri pointed out could be causing this. So
I tried with and without but couldnt reproduce. Can I request you to see
if Jiri's patch helps?

Today with -g option, on one machine I do see a 
$ sudo perf report
perf: Floating point exception

I will take a look at this too. Other machines dont seem to show this
error.

> Regular '-e cycles -g' works fine, so this is a uprobes specific 
> bug.
> 

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception.
  2012-01-18  9:02     ` Srikar Dronamraju
@ 2012-01-18 10:18       ` Mike Frysinger
  2012-01-18 10:47         ` Srikar Dronamraju
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Frysinger @ 2012-01-18 10:18 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Anton Arapov, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

[-- Attachment #1: Type: Text/Plain, Size: 1767 bytes --]

On Wednesday 18 January 2012 04:02:32 Srikar Dronamraju wrote:
> >   Can we use existing SET_IP() instead of set_instruction_pointer() ?
> 
> Oleg had already commented about this in one his uprobes reviews.
> 
> The GET_IP/SET_IP available in include/asm-generic/ptrace.h doesnt work
> on all archs. Atleast it doesnt work on powerpc when I tried it.

so migrate the arches you need over to it.

> Also most archs define instruction_pointer(). So I thought (rather Peter
> Zijlstra suggested the name set_instruction_pointer())
> set_instruction_pointer was a better bet than SET_IP. I

asm-generic/ptrace.h already has instruction_pointer_set()

> Also I dont see any usage for SET_IP/GET_IP.

i think you mean "users" here ?  the usage should be fairly obvious.  both 
macros are used by asm-generic/ptrace.h internally, but (currently) rarely 
defined by arches themselves (by design).  the funcs that are based on these 
GET/SET helpers though do get used in many places.

simply grep arch/*/include/asm/ptrace.h

> May be we should have something like this in
> include/asm-generic/ptrace.h
> 
> #ifdef instruction_pointer
> #define GET_IP(regs)		(instruction_pointer(regs))
> #define set_instruction_pointer(regs, val) (instruction_pointer(regs) =
> (val))
> #define SET_IP(regs, val)	(set_instruction_pointer(regs,val))
> #endif
> 

what you propose here won't work on all arches which is the whole point of 
{G,S}ET_IP in the first place.  i proposed a similar idea before and was shot 
down for exactly that reason.  look at ia64 for an obvious example.

> or should we do away with GET_IP/SET_IP esp since there are no many
> users?

no, the point is to migrate to asm-generic/ptrace.h, not away from it.
-mike

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception.
  2012-01-18 10:18       ` Mike Frysinger
@ 2012-01-18 10:47         ` Srikar Dronamraju
  2012-01-18 11:01           ` Mike Frysinger
  0 siblings, 1 reply; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-18 10:47 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: Anton Arapov, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Stephen Rothwell


> On Wednesday 18 January 2012 04:02:32 Srikar Dronamraju wrote:
> > >   Can we use existing SET_IP() instead of set_instruction_pointer() ?
> > 
> > Oleg had already commented about this in one his uprobes reviews.
> > 
> > The GET_IP/SET_IP available in include/asm-generic/ptrace.h doesnt work
> > on all archs. Atleast it doesnt work on powerpc when I tried it.
> 
> so migrate the arches you need over to it.

One question that could be asked is why arent we using instruction_pointer
instead of GET_IP since instruction_pointer is being defined in 25
places and with references in 120 places.

> 
> > Also most archs define instruction_pointer(). So I thought (rather Peter
> > Zijlstra suggested the name set_instruction_pointer())
> > set_instruction_pointer was a better bet than SET_IP. I
> 
> asm-generic/ptrace.h already has instruction_pointer_set()
> 
> > Also I dont see any usage for SET_IP/GET_IP.
> 
> i think you mean "users" here ?  the usage should be fairly obvious.  both 
> macros are used by asm-generic/ptrace.h internally, but (currently) rarely 
> defined by arches themselves (by design).  the funcs that are based on these 
> GET/SET helpers though do get used in many places.
> 
> simply grep arch/*/include/asm/ptrace.h


here are the stats

$ grep -r -w GET_IP * | wc -l 
5
$ grep -r -w SET_IP * | wc -l 
3
$ grep -r -w instruction_pointer * | wc -l 
120
$ grep -r -w instruction_pointer_set * | wc -l
3

The only place I saw GET_IP was used was to define SET_IP
The only place I saw SET_IP was used was to define
instruction_pointer_set.
The only place  I saw instruction_pointer_set being used is drivers/misc/kgdbts.c 

instruction_pointer was defined in close to 25 places.

> 
> > May be we should have something like this in
> > include/asm-generic/ptrace.h
> > 
> > #ifdef instruction_pointer
> > #define GET_IP(regs)		(instruction_pointer(regs))
> > #define set_instruction_pointer(regs, val) (instruction_pointer(regs) =
> > (val))
> > #define SET_IP(regs, val)	(set_instruction_pointer(regs,val))
> > #endif
> > 
> 
> what you propose here won't work on all arches which is the whole point of 
> {G,S}ET_IP in the first place.  i proposed a similar idea before and was shot 
> down for exactly that reason.  look at ia64 for an obvious example.

Sorry, I didnt quite understand this.
Was it that people objected to instruction_pointer or 
Is it that instruction_pointer and GET_IP will work differently on few
architectures or 
Is it people had an objection to defining instruction_pointer.

So let me rephrase here. Initially we used set_ip. But Peter suggested
that the name be changed to set_instruction_pointer so that it goes with 
instruction_pointer. I also felt that set_instruction_pointer was
better.  However I am okay with any other name including
SET_IP/instruction_pointer_set. I have no issues in moving the
set_instruction_pointer to arch/*/ptrace.h files it it helps (including
include/asm-generic/ptrace.h).

But I think  we should either have GET_IP or instruction_pointer.
Similarly either SET_IP/set_instruction_pointer{_set}. Since
instruction_pointer is more widely used, I would side by the
instruction_pointer.

> 
> > or should we do away with GET_IP/SET_IP esp since there are no many
> > users?
> 
> no, the point is to migrate to asm-generic/ptrace.h, not away from it.

I think the rational for having asm-generic/ptrace.h was to have define
a way to get the instruction_pointer such that the each archs dont have
to define their own definition unless and untill its necessary.

If yes, then why did we choose the names GET_IP/SET_IP instead of
instruction_pointer and the like.

-- 
Thanks and Regards
Srikar



^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception.
  2012-01-18 10:47         ` Srikar Dronamraju
@ 2012-01-18 11:01           ` Mike Frysinger
  2012-01-20 22:57             ` Mike Frysinger
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Frysinger @ 2012-01-18 11:01 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Anton Arapov, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Stephen Rothwell

[-- Attachment #1: Type: Text/Plain, Size: 1034 bytes --]

On Wednesday 18 January 2012 05:47:49 Srikar Dronamraju wrote:
> > On Wednesday 18 January 2012 04:02:32 Srikar Dronamraju wrote:
> > > >   Can we use existing SET_IP() instead of set_instruction_pointer() ?
> > > 
> > > Oleg had already commented about this in one his uprobes reviews.
> > > 
> > > The GET_IP/SET_IP available in include/asm-generic/ptrace.h doesnt work
> > > on all archs. Atleast it doesnt work on powerpc when I tried it.
> > 
> > so migrate the arches you need over to it.
> 
> One question that could be asked is why arent we using instruction_pointer
> instead of GET_IP since instruction_pointer is being defined in 25
> places and with references in 120 places.

i think you misunderstand the point.  {G,S}ET_IP() is the glue between the 
arch's pt_regs struct and the public facing API.  the only people who should 
be touching those macros are the ptrace core.  instruction_pointer() and 
instruction_pointer_set() are the API that asm/ptrace.h exports to the rest of 
the tree.
-mike

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception.
  2012-01-18 11:01           ` Mike Frysinger
@ 2012-01-20 22:57             ` Mike Frysinger
  2012-01-25  8:12               ` Srikar Dronamraju
  0 siblings, 1 reply; 45+ messages in thread
From: Mike Frysinger @ 2012-01-20 22:57 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Anton Arapov, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Stephen Rothwell

On Wed, Jan 18, 2012 at 06:01, Mike Frysinger wrote:
> On Wednesday 18 January 2012 05:47:49 Srikar Dronamraju wrote:
>> > On Wednesday 18 January 2012 04:02:32 Srikar Dronamraju wrote:
>> > > >   Can we use existing SET_IP() instead of set_instruction_pointer() ?
>> > >
>> > > Oleg had already commented about this in one his uprobes reviews.
>> > >
>> > > The GET_IP/SET_IP available in include/asm-generic/ptrace.h doesnt work
>> > > on all archs. Atleast it doesnt work on powerpc when I tried it.
>> >
>> > so migrate the arches you need over to it.
>>
>> One question that could be asked is why arent we using instruction_pointer
>> instead of GET_IP since instruction_pointer is being defined in 25
>> places and with references in 120 places.
>
> i think you misunderstand the point.  {G,S}ET_IP() is the glue between the
> arch's pt_regs struct and the public facing API.  the only people who should
> be touching those macros are the ptrace core.  instruction_pointer() and
> instruction_pointer_set() are the API that asm/ptrace.h exports to the rest of
> the tree.

Srikar: does that make sense ?  i'm happy to help with improving
asm-generic/ptrace.h.
-mike

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception.
  2012-01-20 22:57             ` Mike Frysinger
@ 2012-01-25  8:12               ` Srikar Dronamraju
  0 siblings, 0 replies; 45+ messages in thread
From: Srikar Dronamraju @ 2012-01-25  8:12 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: Anton Arapov, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Stephen Rothwell

> >>
> >> One question that could be asked is why arent we using instruction_pointer
> >> instead of GET_IP since instruction_pointer is being defined in 25
> >> places and with references in 120 places.
> >
> > i think you misunderstand the point.  {G,S}ET_IP() is the glue between the
> > arch's pt_regs struct and the public facing API.  the only people who should
> > be touching those macros are the ptrace core.  instruction_pointer() and
> > instruction_pointer_set() are the API that asm/ptrace.h exports to the rest of
> > the tree.
> 
> Srikar: does that make sense ?  i'm happy to help with improving
> asm-generic/ptrace.h.
> -mike
> 

Yes, I think it makes sense. I have modified the code to use
instruction_pointer_set instead of set_instruction_pointer.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support
  2012-01-17  9:39     ` Ingo Molnar
@ 2012-01-25 14:11       ` Peter Zijlstra
  2012-01-26 11:10         ` Ingo Molnar
  0 siblings, 1 reply; 45+ messages in thread
From: Peter Zijlstra @ 2012-01-25 14:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Srikar Dronamraju, Andrew Morton, Arnaldo Carvalho de Melo,
	Linus Torvalds, Oleg Nesterov, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell

On Tue, 2012-01-17 at 10:39 +0100, Ingo Molnar wrote:

> I did not suggest anything complex or intrusive: just basically 
> unify the namespace, have a single set of callbacks, and call 
> into the uprobes and perf code from those callbacks - out of the 
> sight of MM code.
> 
> That unified namespace could be called:
> 
>     event_mmap(...);
>     event_fork(...);
> 
> etc. - and from event_mmap() you could do a simple:
> 
> 	perf_event_mmap(...)
> 	uprobes_event_mmap(...)
> 
> [ Once all this is updated to use tracepoints it would turn into 
>   a notification callback chain kind of thing. ]

We keep disagreeing on this. I utterly loathe hiding stuff in notifier
lists. It makes it completely non-obvious who all does what.

Another very good reason to not do what you suggest is that
perf_event_mmap() is a pure consumer, it doesn't have a return value,
whereas uprobes_mmap() can actually fail the mmap.




^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-10 11:48 ` [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints Srikar Dronamraju
@ 2012-01-25 14:39   ` Denys Vlasenko
  2012-01-25 16:31     ` Oleg Nesterov
  2012-01-25 15:13   ` Denys Vlasenko
  2012-01-26 13:45   ` Masami Hiramatsu
  2 siblings, 1 reply; 45+ messages in thread
From: Denys Vlasenko @ 2012-01-25 14:39 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

On Tue, Jan 10, 2012 at 12:48 PM, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
> +/*
> + * opcodes we'll probably never support:
> + * 6c-6d, e4-e5, ec-ed - in
> + * 6e-6f, e6-e7, ee-ef - out
> + * cc, cd - int3, int

I imagine desire to set a breakpoint on int 0x80 will be rather typical.
(Same for sysenter).

> + * cf - iret

Iret does work. Test program for 32-bit x86:

/* gcc -nostartfiles -nostdlib -o iret iret.S */
_start: .globl  _start
        pushf
        push    %cs
        push    $_e
        iret  /* will this jump to _e? Yes! */
        hlt   /* segv if reached */
_e:     movl    $42, %ebx
        movl    $1, %eax
        int     $0x80

I guess supporting probes in weird stuff like ancient DOS emulators
(they actually use iret) is not important.

OTOH iret doesn't seem to be too hard: if it fails (bad cs/eflags
on stack), then the location of iret instruction per se is not
terribly important.
If it works, then you need to be careful to not mess up eip,
same as you already do with ordinary [l]ret, nothing more.


Come to think of it, why do you bother checking for
invalid instructions? What can happen if you would just copy
and run all instructions? You already are prepared to handle
exceptions, right?


-- 
vda

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-10 11:48 ` [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints Srikar Dronamraju
  2012-01-25 14:39   ` Denys Vlasenko
@ 2012-01-25 15:13   ` Denys Vlasenko
  2012-01-25 15:32     ` Denys Vlasenko
  2012-01-26 13:38     ` Masami Hiramatsu
  2012-01-26 13:45   ` Masami Hiramatsu
  2 siblings, 2 replies; 45+ messages in thread
From: Denys Vlasenko @ 2012-01-25 15:13 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

On Tue, Jan 10, 2012 at 12:48 PM, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
> +/*
> + * If uprobe->insn doesn't use rip-relative addressing, return
> + * immediately.  Otherwise, rewrite the instruction so that it accesses
> + * its memory operand indirectly through a scratch register.  Set
> + * uprobe->arch_info.fixups and uprobe->arch_info.rip_rela_target_address
> + * accordingly.  (The contents of the scratch register will be saved
> + * before we single-step the modified instruction, and restored
> + * afterward.)
> + *
> + * We do this because a rip-relative instruction can access only a
> + * relatively small area (+/- 2 GB from the instruction), and the XOL
> + * area typically lies beyond that area.  At least for instructions
> + * that store to memory, we can't execute the original instruction
> + * and "fix things up" later, because the misdirected store could be
> + * disastrous.
> + *
> + * Some useful facts about rip-relative instructions:
> + * - There's always a modrm byte.
> + * - There's never a SIB byte.
> + * - The displacement is always 4 bytes.
> + */
> +static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
> +                                                       struct insn *insn)
> +{
> +       u8 *cursor;
> +       u8 reg;
> +
> +       if (mm->context.ia32_compat)
> +               return;
> +
> +       uprobe->arch_info.rip_rela_target_address = 0x0;
> +       if (!insn_rip_relative(insn))
> +               return;
> +
> +       /*
> +        * Point cursor at the modrm byte.  The next 4 bytes are the
> +        * displacement.  Beyond the displacement, for some instructions,
> +        * is the immediate operand.
> +        */
> +       cursor = uprobe->insn + insn->prefixes.nbytes
> +                       + insn->rex_prefix.nbytes + insn->opcode.nbytes;
> +       insn_get_length(insn);
> +
> +       /*
> +        * Convert from rip-relative addressing to indirect addressing
> +        * via a scratch register.  Change the r/m field from 0x5 (%rip)
> +        * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
> +        */
> +       reg = MODRM_REG(insn);
> +       if (reg == 0) {
> +               /*
> +                * The register operand (if any) is either the A register
> +                * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
> +                * REX prefix) %r8.  In any case, we know the C register
> +                * is NOT the register operand, so we use %rcx (register
> +                * #1) for the scratch register.
> +                */
> +               uprobe->arch_info.fixups = UPROBES_FIX_RIP_CX;
> +               /* Change modrm from 00 000 101 to 00 000 001. */
> +               *cursor = 0x1;
> +       } else {
> +               /* Use %rax (register #0) for the scratch register. */
> +               uprobe->arch_info.fixups = UPROBES_FIX_RIP_AX;
> +               /* Change modrm from 00 xxx 101 to 00 xxx 000 */
> +               *cursor = (reg << 3);
> +       }
> +
> +       /* Target address = address of next instruction + (signed) offset */
> +       uprobe->arch_info.rip_rela_target_address = (long)insn->length
> +                                       + insn->displacement.value;
> +       /* Displacement field is gone; slide immediate field (if any) over. */
> +       if (insn->immediate.nbytes) {
> +               cursor++;
> +               memmove(cursor, cursor + insn->displacement.nbytes,
> +                                               insn->immediate.nbytes);
> +       }
> +       return;
> +}

It seems to be possible to store RIP value *without displacement*
into AX/CX and convert rip-relative instruction into AX/CX *relative* one.
Example:
c7 05 78 56 34 12 2a 00 00 00 	movl   $0x2a,0x12345678(%rip)
converts to:
c7 81 78 56 34 12 2a 00 00 00 	movl   $0x2a,0x12345678(%rcx)

This way instruction size stays the same and you don't need
to memmove immediate value.

-- 
vda

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-25 15:13   ` Denys Vlasenko
@ 2012-01-25 15:32     ` Denys Vlasenko
  2012-01-26 14:14       ` Masami Hiramatsu
  2012-01-26 13:38     ` Masami Hiramatsu
  1 sibling, 1 reply; 45+ messages in thread
From: Denys Vlasenko @ 2012-01-25 15:32 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

On Wed, Jan 25, 2012 at 4:13 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
>> +       /*
>> +        * Convert from rip-relative addressing to indirect addressing
>> +        * via a scratch register.  Change the r/m field from 0x5 (%rip)
>> +        * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
>> +        */
>> +       reg = MODRM_REG(insn);
>> +       if (reg == 0) {
>> +               /*
>> +                * The register operand (if any) is either the A register
>> +                * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
>> +                * REX prefix) %r8.  In any case, we know the C register
>> +                * is NOT the register operand, so we use %rcx (register
>> +                * #1) for the scratch register.
>> +                */
>> +               uprobe->arch_info.fixups = UPROBES_FIX_RIP_CX;
>> +               /* Change modrm from 00 000 101 to 00 000 001. */
>> +               *cursor = 0x1;

Hmm. I think we have a bug here.

What if this instruction has REX.B = 1? Granted, REX.B = 1 has no effect on
rip-relative addressing and therefore normally won't be generated by gcc/as,
but still. If you replace md and r/m fields as above, you are trying to convert
0x12345678(%rip) reference to (%rcx), but if REX.B = 1, then you in fact
converted it to (%r9)!

-- 
vda

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-25 14:39   ` Denys Vlasenko
@ 2012-01-25 16:31     ` Oleg Nesterov
  0 siblings, 0 replies; 45+ messages in thread
From: Oleg Nesterov @ 2012-01-25 16:31 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Srikar Dronamraju, Peter Zijlstra, Linus Torvalds, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

On 01/25, Denys Vlasenko wrote:
>
> On Tue, Jan 10, 2012 at 12:48 PM, Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> > +/*
> > + * opcodes we'll probably never support:
> > + * 6c-6d, e4-e5, ec-ed - in
> > + * 6e-6f, e6-e7, ee-ef - out
> > + * cc, cd - int3, int
>
> I imagine desire to set a breakpoint on int 0x80 will be rather typical.
> (Same for sysenter).

May be uprobes will support this later. But imho we should not
try to do this now.

With the current code, afaics we do not want to allow the
UTASK_SSTEP/TIF_SINGLESTEP task to enter the kernel mode,
this state is "too special". Just for example, suppose it
clones another task and the child gets the invalid uprobe
state.

Oleg.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support
  2012-01-25 14:11       ` Peter Zijlstra
@ 2012-01-26 11:10         ` Ingo Molnar
  0 siblings, 0 replies; 45+ messages in thread
From: Ingo Molnar @ 2012-01-26 11:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Andrew Morton, Arnaldo Carvalho de Melo,
	Linus Torvalds, Oleg Nesterov, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Rothwell


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, 2012-01-17 at 10:39 +0100, Ingo Molnar wrote:
> 
> > I did not suggest anything complex or intrusive: just basically 
> > unify the namespace, have a single set of callbacks, and call 
> > into the uprobes and perf code from those callbacks - out of the 
> > sight of MM code.
> > 
> > That unified namespace could be called:
> > 
> >     event_mmap(...);
> >     event_fork(...);
> > 
> > etc. - and from event_mmap() you could do a simple:
> > 
> > 	perf_event_mmap(...)
> > 	uprobes_event_mmap(...)
> > 
> > [ Once all this is updated to use tracepoints it would turn into 
> >   a notification callback chain kind of thing. ]
> 
> We keep disagreeing on this. I utterly loathe hiding stuff in 
> notifier lists. It makes it completely non-obvious who all 
> does what.

My immediate suggestion was not a notifier list but an 
open-coded list of function calls done in helper inline 
functions - to minimize the impact of the callbacks on mm/.

> Another very good reason to not do what you suggest is that 
> perf_event_mmap() is a pure consumer, it doesn't have a return 
> value, whereas uprobes_mmap() can actually fail the mmap.

You know that i disagree with that, there is no fundamental 
reason why event callbacks couldnt participate in program logic, 
as long as the call site explicitly wants such side effects. It 
avoids senseless duplication of callbacks.

Anyway, if Andrew is fine with the current callbacks as-is then 
it's fine to me as well.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-25 15:13   ` Denys Vlasenko
  2012-01-25 15:32     ` Denys Vlasenko
@ 2012-01-26 13:38     ` Masami Hiramatsu
  1 sibling, 0 replies; 45+ messages in thread
From: Masami Hiramatsu @ 2012-01-26 13:38 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Srikar Dronamraju, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell,
	yrl.pp-manager.tt

(2012/01/26 0:13), Denys Vlasenko wrote:
> On Tue, Jan 10, 2012 at 12:48 PM, Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
>> +/*
>> + * If uprobe->insn doesn't use rip-relative addressing, return
>> + * immediately.  Otherwise, rewrite the instruction so that it accesses
>> + * its memory operand indirectly through a scratch register.  Set
>> + * uprobe->arch_info.fixups and uprobe->arch_info.rip_rela_target_address
>> + * accordingly.  (The contents of the scratch register will be saved
>> + * before we single-step the modified instruction, and restored
>> + * afterward.)
>> + *
>> + * We do this because a rip-relative instruction can access only a
>> + * relatively small area (+/- 2 GB from the instruction), and the XOL
>> + * area typically lies beyond that area.  At least for instructions
>> + * that store to memory, we can't execute the original instruction
>> + * and "fix things up" later, because the misdirected store could be
>> + * disastrous.
>> + *
>> + * Some useful facts about rip-relative instructions:
>> + * - There's always a modrm byte.
>> + * - There's never a SIB byte.
>> + * - The displacement is always 4 bytes.
>> + */
>> +static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
>> +                                                       struct insn *insn)
>> +{
>> +       u8 *cursor;
>> +       u8 reg;
>> +
>> +       if (mm->context.ia32_compat)
>> +               return;
>> +
>> +       uprobe->arch_info.rip_rela_target_address = 0x0;
>> +       if (!insn_rip_relative(insn))
>> +               return;
>> +
>> +       /*
>> +        * Point cursor at the modrm byte.  The next 4 bytes are the
>> +        * displacement.  Beyond the displacement, for some instructions,
>> +        * is the immediate operand.
>> +        */
>> +       cursor = uprobe->insn + insn->prefixes.nbytes
>> +                       + insn->rex_prefix.nbytes + insn->opcode.nbytes;
>> +       insn_get_length(insn);
>> +
>> +       /*
>> +        * Convert from rip-relative addressing to indirect addressing
>> +        * via a scratch register.  Change the r/m field from 0x5 (%rip)
>> +        * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
>> +        */
>> +       reg = MODRM_REG(insn);
>> +       if (reg == 0) {
>> +               /*
>> +                * The register operand (if any) is either the A register
>> +                * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
>> +                * REX prefix) %r8.  In any case, we know the C register
>> +                * is NOT the register operand, so we use %rcx (register
>> +                * #1) for the scratch register.
>> +                */
>> +               uprobe->arch_info.fixups = UPROBES_FIX_RIP_CX;
>> +               /* Change modrm from 00 000 101 to 00 000 001. */
>> +               *cursor = 0x1;
>> +       } else {
>> +               /* Use %rax (register #0) for the scratch register. */
>> +               uprobe->arch_info.fixups = UPROBES_FIX_RIP_AX;
>> +               /* Change modrm from 00 xxx 101 to 00 xxx 000 */
>> +               *cursor = (reg << 3);
>> +       }
>> +
>> +       /* Target address = address of next instruction + (signed) offset */
>> +       uprobe->arch_info.rip_rela_target_address = (long)insn->length
>> +                                       + insn->displacement.value;
>> +       /* Displacement field is gone; slide immediate field (if any) over. */
>> +       if (insn->immediate.nbytes) {
>> +               cursor++;
>> +               memmove(cursor, cursor + insn->displacement.nbytes,
>> +                                               insn->immediate.nbytes);
>> +       }
>> +       return;
>> +}
> 
> It seems to be possible to store RIP value *without displacement*
> into AX/CX and convert rip-relative instruction into AX/CX *relative* one.
> Example:
> c7 05 78 56 34 12 2a 00 00 00 	movl   $0x2a,0x12345678(%rip)
> converts to:
> c7 81 78 56 34 12 2a 00 00 00 	movl   $0x2a,0x12345678(%rcx)
> 
> This way instruction size stays the same and you don't need
> to memmove immediate value.

Right, I agree there is a possibility of optimizing.
However, for ease of review, I think it's better to be
a separate patch.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-10 11:48 ` [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints Srikar Dronamraju
  2012-01-25 14:39   ` Denys Vlasenko
  2012-01-25 15:13   ` Denys Vlasenko
@ 2012-01-26 13:45   ` Masami Hiramatsu
  2 siblings, 0 replies; 45+ messages in thread
From: Masami Hiramatsu @ 2012-01-26 13:45 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Ingo Molnar,
	Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell

(2012/01/10 20:48), Srikar Dronamraju wrote:
> +static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
> +							struct insn *insn)
> +{
> +	u8 *cursor;
> +	u8 reg;
> +
> +	if (mm->context.ia32_compat)
> +		return;
> +
> +	uprobe->arch_info.rip_rela_target_address = 0x0;
> +	if (!insn_rip_relative(insn))
> +		return;
> +
> +	/*
> +	 * Point cursor at the modrm byte.  The next 4 bytes are the
> +	 * displacement.  Beyond the displacement, for some instructions,
> +	 * is the immediate operand.
> +	 */
> +	cursor = uprobe->insn + insn->prefixes.nbytes
> +			+ insn->rex_prefix.nbytes + insn->opcode.nbytes;

FYI, insn.h already provide a macro for this purpose.
You can write this as below;

	cursor = uprobe->insn + insn_offset_modrm(insn);

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-25 15:32     ` Denys Vlasenko
@ 2012-01-26 14:14       ` Masami Hiramatsu
  2012-01-26 18:28         ` Denys Vlasenko
  0 siblings, 1 reply; 45+ messages in thread
From: Masami Hiramatsu @ 2012-01-26 14:14 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Srikar Dronamraju, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell,
	yrl.pp-manager.tt

(2012/01/26 0:32), Denys Vlasenko wrote:
> On Wed, Jan 25, 2012 at 4:13 PM, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
>>> +       /*
>>> +        * Convert from rip-relative addressing to indirect addressing
>>> +        * via a scratch register.  Change the r/m field from 0x5 (%rip)
>>> +        * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
>>> +        */
>>> +       reg = MODRM_REG(insn);
>>> +       if (reg == 0) {
>>> +               /*
>>> +                * The register operand (if any) is either the A register
>>> +                * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
>>> +                * REX prefix) %r8.  In any case, we know the C register
>>> +                * is NOT the register operand, so we use %rcx (register
>>> +                * #1) for the scratch register.
>>> +                */
>>> +               uprobe->arch_info.fixups = UPROBES_FIX_RIP_CX;
>>> +               /* Change modrm from 00 000 101 to 00 000 001. */
>>> +               *cursor = 0x1;
> 
> Hmm. I think we have a bug here.
> 
> What if this instruction has REX.B = 1? Granted, REX.B = 1 has no effect on
> rip-relative addressing and therefore normally won't be generated by gcc/as,
> but still. If you replace md and r/m fields as above, you are trying to convert
> 0x12345678(%rip) reference to (%rcx), but if REX.B = 1, then you in fact
> converted it to (%r9)!

Right, thanks for finding :)
And %rax register reference encoding has same problem, doesn't it?

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-26 14:14       ` Masami Hiramatsu
@ 2012-01-26 18:28         ` Denys Vlasenko
  2012-01-30  9:51           ` Masami Hiramatsu
  0 siblings, 1 reply; 45+ messages in thread
From: Denys Vlasenko @ 2012-01-26 18:28 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Srikar Dronamraju, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell,
	yrl.pp-manager.tt

On Thu, Jan 26, 2012 at 3:14 PM, Masami Hiramatsu
<masami.hiramatsu.pt@hitachi.com> wrote:
> (2012/01/26 0:32), Denys Vlasenko wrote:
>> On Wed, Jan 25, 2012 at 4:13 PM, Denys Vlasenko
>> <vda.linux@googlemail.com> wrote:
>>>> +       /*
>>>> +        * Convert from rip-relative addressing to indirect addressing
>>>> +        * via a scratch register.  Change the r/m field from 0x5 (%rip)
>>>> +        * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
>>>> +        */
>>>> +       reg = MODRM_REG(insn);
>>>> +       if (reg == 0) {
>>>> +               /*
>>>> +                * The register operand (if any) is either the A register
>>>> +                * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
>>>> +                * REX prefix) %r8.  In any case, we know the C register
>>>> +                * is NOT the register operand, so we use %rcx (register
>>>> +                * #1) for the scratch register.
>>>> +                */
>>>> +               uprobe->arch_info.fixups = UPROBES_FIX_RIP_CX;
>>>> +               /* Change modrm from 00 000 101 to 00 000 001. */
>>>> +               *cursor = 0x1;
>>
>> Hmm. I think we have a bug here.
>>
>> What if this instruction has REX.B = 1? Granted, REX.B = 1 has no effect on
>> rip-relative addressing and therefore normally won't be generated by gcc/as,
>> but still. If you replace md and r/m fields as above, you are trying to convert
>> 0x12345678(%rip) reference to (%rcx), but if REX.B = 1, then you in fact
>> converted it to (%r9)!
>
> Right, thanks for finding :)
> And %rax register reference encoding has same problem, doesn't it?

Yes.

The solution is trivial: "if (REX pfx exists) REX.B = 0;"

Also, I don't remember whether (%rip) addressing is
affected by 0x67 (address size) prefix. If it is, then
nothing needs to be done.

But if there is an exception and CPU ignores 0x67 pfx
for (%rip) addressing, we will need to check for and
remove this pfx.

Otherwise, it'll instruct to use 32-bit registers
for addressing, thus it will turn our (%rax) operand
into (%eax). Not good.

-- 
vda

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints.
  2012-01-26 18:28         ` Denys Vlasenko
@ 2012-01-30  9:51           ` Masami Hiramatsu
  0 siblings, 0 replies; 45+ messages in thread
From: Masami Hiramatsu @ 2012-01-30  9:51 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Srikar Dronamraju, Peter Zijlstra, Linus Torvalds, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Rothwell,
	yrl.pp-manager.tt

(2012/01/27 3:28), Denys Vlasenko wrote:
> On Thu, Jan 26, 2012 at 3:14 PM, Masami Hiramatsu
> <masami.hiramatsu.pt@hitachi.com> wrote:
>> (2012/01/26 0:32), Denys Vlasenko wrote:
>>> On Wed, Jan 25, 2012 at 4:13 PM, Denys Vlasenko
>>> <vda.linux@googlemail.com> wrote:
>>>>> +       /*
>>>>> +        * Convert from rip-relative addressing to indirect addressing
>>>>> +        * via a scratch register.  Change the r/m field from 0x5 (%rip)
>>>>> +        * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
>>>>> +        */
>>>>> +       reg = MODRM_REG(insn);
>>>>> +       if (reg == 0) {
>>>>> +               /*
>>>>> +                * The register operand (if any) is either the A register
>>>>> +                * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
>>>>> +                * REX prefix) %r8.  In any case, we know the C register
>>>>> +                * is NOT the register operand, so we use %rcx (register
>>>>> +                * #1) for the scratch register.
>>>>> +                */
>>>>> +               uprobe->arch_info.fixups = UPROBES_FIX_RIP_CX;
>>>>> +               /* Change modrm from 00 000 101 to 00 000 001. */
>>>>> +               *cursor = 0x1;
>>>
>>> Hmm. I think we have a bug here.
>>>
>>> What if this instruction has REX.B = 1? Granted, REX.B = 1 has no effect on
>>> rip-relative addressing and therefore normally won't be generated by gcc/as,
>>> but still. If you replace md and r/m fields as above, you are trying to convert
>>> 0x12345678(%rip) reference to (%rcx), but if REX.B = 1, then you in fact
>>> converted it to (%r9)!
>>
>> Right, thanks for finding :)
>> And %rax register reference encoding has same problem, doesn't it?
>
> Yes.
>
> The solution is trivial: "if (REX pfx exists) REX.B = 0;"
>
> Also, I don't remember whether (%rip) addressing is
> affected by 0x67 (address size) prefix. If it is, then
> nothing needs to be done.

In Intel SDM vol.2, "2.2.1.6 RIP-Relative Addressing" explains
that as below;

---
RIP-relative addressing is enabled by 64-bit mode, not by a 64-bit address-size.
The use of the address-size prefix does not disable RIP-relative addressing. The
effect of the address-size prefix is to truncate and zero-extend the computed
effective address to 32 bits.
---

This means 0x67 pfx can be used with RIP-relative, and
effective address(EA) will be 32bit, as same as others.
So, I think it should work :)


Thank you!

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2012-01-30  9:51 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-10 11:48 [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Srikar Dronamraju
2012-01-10 11:48 ` [PATCH v9 3.2 1/9] uprobes: Install and remove breakpoints Srikar Dronamraju
2012-01-25 14:39   ` Denys Vlasenko
2012-01-25 16:31     ` Oleg Nesterov
2012-01-25 15:13   ` Denys Vlasenko
2012-01-25 15:32     ` Denys Vlasenko
2012-01-26 14:14       ` Masami Hiramatsu
2012-01-26 18:28         ` Denys Vlasenko
2012-01-30  9:51           ` Masami Hiramatsu
2012-01-26 13:38     ` Masami Hiramatsu
2012-01-26 13:45   ` Masami Hiramatsu
2012-01-10 11:48 ` [PATCH v9 3.2 2/9] uprobes: handle breakpoint and signal step exception Srikar Dronamraju
2012-01-18  8:39   ` Anton Arapov
2012-01-18  9:02     ` Srikar Dronamraju
2012-01-18 10:18       ` Mike Frysinger
2012-01-18 10:47         ` Srikar Dronamraju
2012-01-18 11:01           ` Mike Frysinger
2012-01-20 22:57             ` Mike Frysinger
2012-01-25  8:12               ` Srikar Dronamraju
2012-01-10 11:48 ` [PATCH v9 3.2 3/9] uprobes: slot allocation Srikar Dronamraju
2012-01-10 11:49 ` [PATCH v9 3.2 4/9] uprobes: counter to optimize probe hits Srikar Dronamraju
2012-01-10 11:49 ` [PATCH v9 3.2 5/9] tracing: modify is_delete, is_return from ints to bool Srikar Dronamraju
2012-01-10 11:49 ` [PATCH v9 3.2 6/9] tracing: Extract out common code for kprobes/uprobes traceevents Srikar Dronamraju
2012-01-10 11:49 ` [PATCH v9 3.2 7/9] tracing: uprobes trace_event interface Srikar Dronamraju
2012-01-16 13:11   ` Jiri Olsa
2012-01-16 14:45     ` Srikar Dronamraju
2012-01-16 15:33       ` Jiri Olsa
2012-01-16 16:41         ` Srikar Dronamraju
2012-01-17  9:28     ` Ingo Molnar
2012-01-17 10:22       ` Srikar Dronamraju
2012-01-17 10:59         ` Ingo Molnar
2012-01-17 11:03           ` Srikar Dronamraju
2012-01-17 11:22         ` Ingo Molnar
2012-01-17 11:57           ` Srikar Dronamraju
2012-01-17 12:23             ` Ingo Molnar
2012-01-17 12:27               ` Ingo Molnar
2012-01-17 12:11         ` Ingo Molnar
2012-01-17 12:21         ` Ingo Molnar
2012-01-18 10:07           ` Srikar Dronamraju
2012-01-10 11:49 ` [PATCH v9 3.2 8/9] perf: rename target_module to target Srikar Dronamraju
2012-01-16  8:34 ` [PATCH v9 3.2 0/9] Uprobes patchset with perf probe support Ingo Molnar
2012-01-16 15:17   ` Srikar Dronamraju
2012-01-17  9:39     ` Ingo Molnar
2012-01-25 14:11       ` Peter Zijlstra
2012-01-26 11:10         ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).