linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support
@ 2011-11-18 11:06 Srikar Dronamraju
  2011-11-18 11:06 ` [PATCH v7 3.2-rc2 1/30] uprobes: Auxillary routines to insert, find, delete uprobes Srikar Dronamraju
                   ` (31 more replies)
  0 siblings, 32 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:06 UTC
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


This patchset implements Uprobes, which enables you to dynamically probe
any routine in a user space application and collect information
non-disruptively.

This patchset resolves most of the comments on the previous posting
(https://lkml.org/lkml/2011/11/10/408). It applies on top of commit
cfcfc9eca2b.

This patchset depends on the bulkref patch from Paul McKenney
(https://lkml.org/lkml/2011/11/2/365) and on the "enable interrupts
before calling do_notify_resume on i686" patch
(https://lkml.org/lkml/2011/10/25/265).

The uprobes git tree is hosted at git://github.com/srikard/linux.git
on branch inode_uprobes_v32rc2.
(The previous patchset posted to lkml, rebased to 3.2-rc2, is also
available at branch inode_uprobes_v32rc2_prev. This is to help
reviewers of the previous patchsets quickly identify the changes.)

Uprobes Patches
This patchset implements inode based uprobes which are specified as
<file>:<offset> where offset is the offset from start of the map.

When a uprobe is registered, Uprobes makes a copy of the probed
instruction and replaces the first byte(s) of the probed instruction with
a breakpoint instruction. (Uprobes uses a background page replacement
mechanism and ensures that the breakpoint affects only that process.)

When a CPU hits the breakpoint instruction, Uprobes gets notified of
the trap and finds the associated uprobe. It then executes the associated
handler. Uprobes single-steps its copy of the probed instruction and
resumes execution of the probed process at the instruction following the
probepoint. Instruction copies to be single-stepped are stored in a
per-mm "execution out of line (XOL) area". Currently the XOL area is
allocated as a one-page vma.
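
The register-time steps described above (copy the instruction, overwrite
its first byte, keep the copy for out-of-line single-stepping) can be
modelled in user space. This is only an illustrative sketch, not code from
the patchset: probe_model, model_insert, model_copy_to_slot and MAX_INSN
are hypothetical names, the "text" is a plain byte array rather than a
replaced page, and 0xCC is assumed to be the x86 int3 opcode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BP_INSN  0xCC	/* x86 int3 opcode (assumption: x86 only) */
#define MAX_INSN 16	/* hypothetical upper bound on instruction length */

/* Hypothetical user-space model of one probe; not the kernel's struct uprobe. */
struct probe_model {
	size_t  offset;			/* where the breakpoint was written */
	size_t  len;			/* length of the saved instruction */
	uint8_t saved[MAX_INSN];	/* copy of the original instruction */
};

/* Save the original instruction, then overwrite its first byte. */
static int model_insert(struct probe_model *p, uint8_t *text,
			size_t text_len, size_t offset, size_t insn_len)
{
	assert(p != NULL && text != NULL);

	if (insn_len == 0 || insn_len > MAX_INSN)
		return -1;
	if (offset > text_len || insn_len > text_len - offset)
		return -1;

	memcpy(p->saved, text + offset, insn_len);
	p->offset = offset;
	p->len = insn_len;
	text[offset] = BP_INSN;
	return 0;
}

/* Copy the saved instruction into an out-of-line slot for single-stepping. */
static size_t model_copy_to_slot(const struct probe_model *p,
				 uint8_t *slot, size_t slot_len)
{
	assert(p != NULL && slot != NULL);

	if (slot_len < p->len)
		return 0;
	memcpy(slot, p->saved, p->len);
	return p->len;
}
```

Because the original bytes live in the slot, the probed location keeps its
breakpoint while the copy is stepped, which is what lets other threads
keep hitting the probe.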

For previous postings, please refer to: https://lkml.org/lkml/2011/9/20/123
https://lkml.org/lkml/2011/6/7/232 https://lkml.org/lkml/2011/4/1/176
http://lkml.org/lkml/2011/3/14/171/ http://lkml.org/lkml/2010/12/16/65
http://lkml.org/lkml/2010/8/25/165 http://lkml.org/lkml/2010/7/27/121
http://lkml.org/lkml/2010/7/12/67 http://lkml.org/lkml/2010/7/8/239
http://lkml.org/lkml/2010/6/29/299 http://lkml.org/lkml/2010/6/14/41
http://lkml.org/lkml/2010/3/20/107 and http://lkml.org/lkml/2010/5/18/307

This patchset is a rework based on suggestions from discussions on lkml
in September, March and January 2010 (http://lkml.org/lkml/2010/1/11/92,
http://lkml.org/lkml/2010/1/27/19, http://lkml.org/lkml/2010/3/20/107
and http://lkml.org/lkml/2010/3/31/199 ). This implementation of uprobes
doesn't depend on utrace.

Advantages of uprobes over conventional debugging include:

1. Non-disruptive.
Unlike current ptrace-based mechanisms, uprobes tracing does not
involve signals, stopping threads, or context switching between the
tracer and tracee.

2. Much better handling of multithreaded programs because of XOL.
Current ptrace-based mechanisms use inline single-stepping, i.e. they
copy back the original instruction on hitting a breakpoint.  In such
mechanisms, tracers have to stop all the threads on a breakpoint hit, or
they will not be able to handle all hits to the location of
interest. Uprobes uses execution out of line, where the instruction to
be traced is analysed at the time of breakpoint insertion and a copy
of the instruction is stored at a different location.  On a breakpoint hit,
uprobes jumps to that copied location, single-steps the copied
instruction, and does the necessary fixups after single-stepping.

3. Multiple tracers for an application.
Multiple uprobes-based tracers could work in unison to trace an
application. One tracer could be interested in generic events for a
particular set of processes, while another tracer could be interested
in just one specific event of a particular process that is part of
that set.

4. Correlating events from kernel and userspace.
Uprobes could be used with other tools like kprobes and tracepoints, or as
part of higher-level tools like perf, to give a consolidated set of
events from kernel and userspace.  In future we could look at a single
backtrace showing application, library and kernel calls.

Changes from last patchset:
- Rebased to Linus's 3.2-rc2 (cfcfc9eca2b)
- abort_xol does arch specific cleanups.
- added nop optimization.

Here is the list of TODO Items.

- Prefiltering (i.e. filtering at the time of probe insertion)
- Return probes.
- Support for other architectures.
- Uprobes booster.
- replace macro W with bits in inat table.

Please refer to "[PATCH 3.2-rc2 21/30] tracing: uprobes trace_event interface".

Please refer to "[PATCH 3.2-rc2 23/30] perf: perf interface for uprobes".

Please do provide your valuable comments.

Thanks in advance.
Srikar

Srikar Dronamraju (30)
 0: Uprobes patchset with perf probe support
 1: uprobes: Auxillary routines to insert, find, delete uprobes
 2: Uprobes: Allow multiple consumers for an uprobe.
 3: Uprobes: register/unregister probes.
 4: uprobes: Define hooks for mmap/munmap.
 5: Uprobes: copy of the original instruction.
 6: Uprobes: define fixups.
 7: Uprobes: uprobes arch info
 8: x86: analyze instruction and determine fixups.
 9: Uprobes: Background page replacement.
10: x86: Set instruction pointer.
11: x86: Introduce TIF_UPROBE FLAG.
12: Uprobes: Handle breakpoint and Singlestep
13: x86: define a x86 specific exception notifier.
14: uprobe: register exception notifier
15: x86: Define x86_64 specific uprobe_task_arch_info structure
16: uprobes: Introduce uprobe_task_arch_info structure.
17: x86: arch specific hooks for pre/post singlestep handling.
18: uprobes: slot allocation.
19: tracing: modify is_delete, is_return from ints to bool.
20: tracing: Extract out common code for kprobes/uprobes traceevents.
21: tracing: uprobes trace_event interface
22: perf: rename target_module to target
23: perf: perf interface for uprobes
24: perf: show possible probes in a given executable file or library.
25: uprobes: call post_xol() unconditionally
26: uprobes: introduce uprobe_deny_signal()
27: uprobes: x86: introduce xol_was_trapped()
28: uprobes: introduce UTASK_SSTEP_TRAPPED logic
29: uprobes: Introduce uprobe flags
30: x86: skip singlestep where possible


 Documentation/trace/uprobetracer.txt    |   93 ++
 arch/Kconfig                            |    3 +
 arch/x86/Kconfig                        |    5 +-
 arch/x86/include/asm/thread_info.h      |    2 +
 arch/x86/include/asm/uprobes.h          |   59 ++
 arch/x86/kernel/Makefile                |    1 +
 arch/x86/kernel/signal.c                |    6 +
 arch/x86/kernel/uprobes.c               |  648 +++++++++++++
 include/linux/mm_types.h                |    5 +
 include/linux/sched.h                   |    4 +
 include/linux/uprobes.h                 |  178 ++++
 kernel/Makefile                         |    1 +
 kernel/fork.c                           |   15 +
 kernel/signal.c                         |    3 +
 kernel/trace/Kconfig                    |   20 +
 kernel/trace/Makefile                   |    2 +
 kernel/trace/trace.h                    |    5 +
 kernel/trace/trace_kprobe.c             |  899 +------------------
 kernel/trace/trace_probe.c              |  785 ++++++++++++++++
 kernel/trace/trace_probe.h              |  161 ++++
 kernel/trace/trace_uprobe.c             |  768 ++++++++++++++++
 kernel/uprobes.c                        | 1500 +++++++++++++++++++++++++++++++
 mm/mmap.c                               |   33 +-
 tools/perf/Documentation/perf-probe.txt |   14 +
 tools/perf/builtin-probe.c              |   49 +-
 tools/perf/util/probe-event.c           |  411 +++++++--
 tools/perf/util/probe-event.h           |   12 +-
 tools/perf/util/symbol.c                |    8 +
 tools/perf/util/symbol.h                |    1 +
 29 files changed, 4710 insertions(+), 981 deletions(-)



* [PATCH v7 3.2-rc2 1/30] uprobes: Auxillary routines to insert, find, delete uprobes
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
@ 2011-11-18 11:06 ` Srikar Dronamraju
  2011-11-23 18:23   ` Peter Zijlstra
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 2/30] uprobes: Allow multiple consumers for an uprobe Srikar Dronamraju
                   ` (30 subsequent siblings)
  31 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:06 UTC
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Uprobes are maintained in an rb-tree indexed by inode and offset (offset
from the start of the map). For each unique (inode, offset) combination,
there can be only one uprobe in the rbtree. Provide routines that
insert a given uprobe and find a uprobe given an inode and offset.
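
The (inode, offset) ordering used to index the tree can be tried out in
user space. The sketch below is an assumption-laden illustration, not the
patch's code: it substitutes a sorted array and bsearch() for the rbtree,
and probe_key, key_cmp and key_find are hypothetical names, with
inode_id standing in for the inode pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical key: inode identity first, then offset within the file. */
struct probe_key {
	uintptr_t inode_id;
	long long offset;
};

/* Order by inode first and by offset second. */
static int key_cmp(const void *a, const void *b)
{
	const struct probe_key *l = a;
	const struct probe_key *r = b;

	if (l->inode_id != r->inode_id)
		return l->inode_id < r->inode_id ? -1 : 1;
	if (l->offset != r->offset)
		return l->offset < r->offset ? -1 : 1;
	return 0;
}

/* Look up a key in an array already sorted with key_cmp(). */
static const struct probe_key *key_find(const struct probe_key *sorted,
					size_t n, uintptr_t inode_id,
					long long offset)
{
	struct probe_key k = { inode_id, offset };

	assert(sorted != NULL || n == 0);
	return bsearch(&k, sorted, n, sizeof(*sorted), key_cmp);
}
```

A comparison result of 0 means the two keys name the same probe location,
which is the case where an insert would reuse the existing entry.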

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog: (from v5)
1. drop reference to inode before dropping reference to uprobe.

 include/linux/uprobes.h |   35 +++++++++
 kernel/uprobes.c        |  174 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 209 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/uprobes.h
 create mode 100644 kernel/uprobes.c

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
new file mode 100644
index 0000000..bfb85c4
--- /dev/null
+++ b/include/linux/uprobes.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_UPROBES_H
+#define _LINUX_UPROBES_H
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/rbtree.h>
+
+struct uprobe {
+	struct rb_node		rb_node;	/* node in the rb tree */
+	atomic_t		ref;
+	struct inode		*inode;		/* Also hold a ref to inode */
+	loff_t			offset;
+};
+
+#endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
new file mode 100644
index 0000000..cacf333
--- /dev/null
+++ b/kernel/uprobes.c
@@ -0,0 +1,174 @@
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/uprobes.h>
+
+static struct rb_root uprobes_tree = RB_ROOT;
+static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
+
+static int match_uprobe(struct uprobe *l, struct uprobe *r)
+{
+	if (l->inode < r->inode)
+		return -1;
+	if (l->inode > r->inode)
+		return 1;
+	else {
+		if (l->offset < r->offset)
+			return -1;
+
+		if (l->offset > r->offset)
+			return 1;
+	}
+
+	return 0;
+}
+
+static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
+{
+	struct uprobe u = { .inode = inode, .offset = offset };
+	struct rb_node *n = uprobes_tree.rb_node;
+	struct uprobe *uprobe;
+	int match;
+
+	while (n) {
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		match = match_uprobe(&u, uprobe);
+		if (!match) {
+			atomic_inc(&uprobe->ref);
+			return uprobe;
+		}
+		if (match < 0)
+			n = n->rb_left;
+		else
+			n = n->rb_right;
+
+	}
+	return NULL;
+}
+
+/*
+ * Find a uprobe corresponding to a given inode:offset
+ * Acquires uprobes_treelock
+ */
+static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
+{
+	struct uprobe *uprobe;
+	unsigned long flags;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	uprobe = __find_uprobe(inode, offset);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	return uprobe;
+}
+
+static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
+{
+	struct rb_node **p = &uprobes_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct uprobe *u;
+	int match;
+
+	while (*p) {
+		parent = *p;
+		u = rb_entry(parent, struct uprobe, rb_node);
+		match = match_uprobe(uprobe, u);
+		if (!match) {
+			atomic_inc(&u->ref);
+			return u;
+		}
+
+		if (match < 0)
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+
+	}
+	u = NULL;
+	rb_link_node(&uprobe->rb_node, parent, p);
+	rb_insert_color(&uprobe->rb_node, &uprobes_tree);
+	/* get access + creation ref */
+	atomic_set(&uprobe->ref, 2);
+	return u;
+}
+
+/*
+ * Acquires uprobes_treelock.
+ * Matching uprobe already exists in rbtree;
+ *	increment (access refcount) and return the matching uprobe.
+ *
+ * No matching uprobe; insert the uprobe in rb_tree;
+ *	get a double refcount (access + creation) and return NULL.
+ */
+static struct uprobe *insert_uprobe(struct uprobe *uprobe)
+{
+	unsigned long flags;
+	struct uprobe *u;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	u = __insert_uprobe(uprobe);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	return u;
+}
+
+static void put_uprobe(struct uprobe *uprobe)
+{
+	if (atomic_dec_and_test(&uprobe->ref))
+		kfree(uprobe);
+}
+
+static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
+{
+	struct uprobe *uprobe, *cur_uprobe;
+
+	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+	if (!uprobe)
+		return NULL;
+
+	uprobe->inode = igrab(inode);
+	uprobe->offset = offset;
+
+	/* add to uprobes_tree, sorted on inode:offset */
+	cur_uprobe = insert_uprobe(uprobe);
+
+	/* a uprobe exists for this inode:offset combination */
+	if (cur_uprobe) {
+		kfree(uprobe);
+		uprobe = cur_uprobe;
+		iput(inode);
+	}
+	return uprobe;
+}
+
+static void delete_uprobe(struct uprobe *uprobe)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	rb_erase(&uprobe->rb_node, &uprobes_tree);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	iput(uprobe->inode);
+	put_uprobe(uprobe);
+}



* [PATCH v7 3.2-rc2 2/30] uprobes: Allow multiple consumers for an uprobe.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
  2011-11-18 11:06 ` [PATCH v7 3.2-rc2 1/30] uprobes: Auxillary routines to insert, find, delete uprobes Srikar Dronamraju
@ 2011-11-18 11:07 ` Srikar Dronamraju
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
                   ` (29 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:07 UTC
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Since there is a unique uprobe for each (inode, offset) combination,
allow users to attach more than one consumer to a uprobe.

Each consumer defines a handler and an optional filter.  The handler
specifies the routine to run on hitting a probepoint.  The filter allows
the handler to be run selectively on hitting the probepoint.  Both the
handler and the filter are consulted only on a probe hit.
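
How a handler and an optional filter might interact on a probe hit can be
sketched in user space. The probe-hit path itself is not part of this
patch, so the loop below is an assumption about the intended semantics
("handler is run if and only if filter returns true"); consumer_model,
dispatch_model, count_hit and only_task_42 are hypothetical names, and
an int task id stands in for struct task_struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for struct uprobe_consumer. */
struct consumer_model {
	int  (*handler)(struct consumer_model *self, void *regs);
	bool (*filter)(struct consumer_model *self, int task_id);
	struct consumer_model *next;
	int hits;			/* bookkeeping for the example only */
};

/* Example handler: count how often this consumer was run. */
static int count_hit(struct consumer_model *self, void *regs)
{
	(void)regs;
	self->hits++;
	return 0;
}

/* Example filter: only accept hits from task 42. */
static bool only_task_42(struct consumer_model *self, int task_id)
{
	(void)self;
	return task_id == 42;
}

/* Walk the list; run each handler whose filter is absent or returns true. */
static int dispatch_model(struct consumer_model *head, int task_id, void *regs)
{
	struct consumer_model *c;
	int ran = 0;

	for (c = head; c; c = c->next) {
		assert(c->handler != NULL);
		if (c->filter && !c->filter(c, task_id))
			continue;
		c->handler(c, regs);
		ran++;
	}
	return ran;
}
```

A consumer without a filter runs on every hit, so a generic tracer and a
task-specific tracer can share the same probepoint.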

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog:(Since v5)
modified del_consumer as per comments from Peter.

 include/linux/uprobes.h |   13 +++++++++++++
 kernel/uprobes.c        |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bfb85c4..bf31f7c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -25,9 +25,22 @@
 
 #include <linux/rbtree.h>
 
+struct uprobe_consumer {
+	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
+	/*
+	 * filter is optional; If a filter exists, handler is run
+	 * if and only if filter returns true.
+	 */
+	bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
+
+	struct uprobe_consumer *next;
+};
+
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	atomic_t		ref;
+	struct rw_semaphore	consumer_rwsem;
+	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
 };
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index cacf333..2c92b9a 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -149,6 +149,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 
 	uprobe->inode = igrab(inode);
 	uprobe->offset = offset;
+	init_rwsem(&uprobe->consumer_rwsem);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
@@ -162,6 +163,40 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/* Returns the previous consumer */
+static struct uprobe_consumer *add_consumer(struct uprobe *uprobe,
+				struct uprobe_consumer *consumer)
+{
+	down_write(&uprobe->consumer_rwsem);
+	consumer->next = uprobe->consumers;
+	uprobe->consumers = consumer;
+	up_write(&uprobe->consumer_rwsem);
+	return consumer->next;
+}
+
+/*
+ * For uprobe @uprobe, delete the consumer @consumer.
+ * Return true if the @consumer is deleted successfully
+ * or return false.
+ */
+static bool del_consumer(struct uprobe *uprobe,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe_consumer **con;
+	bool ret = false;
+
+	down_write(&uprobe->consumer_rwsem);
+	for (con = &uprobe->consumers; *con; con = &(*con)->next) {
+		if (*con == consumer) {
+			*con = consumer->next;
+			ret = true;
+			break;
+		}
+	}
+	up_write(&uprobe->consumer_rwsem);
+	return ret;
+}
+
 static void delete_uprobe(struct uprobe *uprobe)
 {
 	unsigned long flags;



* [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
  2011-11-18 11:06 ` [PATCH v7 3.2-rc2 1/30] uprobes: Auxillary routines to insert, find, delete uprobes Srikar Dronamraju
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 2/30] uprobes: Allow multiple consumers for an uprobe Srikar Dronamraju
@ 2011-11-18 11:07 ` Srikar Dronamraju
  2011-11-23 16:09   ` Peter Zijlstra
                     ` (5 more replies)
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap Srikar Dronamraju
                   ` (28 subsequent siblings)
  31 siblings, 6 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:07 UTC
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


A probe is specified by a file:offset. Probe specifications are maintained
in an rb tree. A uprobe can be shared by many consumers.  While registering
a probe, a breakpoint is inserted for the first consumer. On subsequent
registrations, the consumer is added to the existing list of consumers.
While unregistering a probe, the breakpoint is removed if and only if the
consumer happens to be the only remaining consumer for the probe.  For all
other unregistrations, the consumer is only removed from the list of
consumers.

Given an inode, we get a list of mm's that have mapped the inode. Do the
actual registration if the mm maps the page where a probe needs to be
inserted/removed.

We use a temporary list to walk through the vmas that map the inode, because:
- The number of vmas that map the inode is not known before we walk
  the rmap, and it keeps changing.
- Extending vm_area_struct wasn't recommended.
- There can be more than one mapping of the inode in the same mm.
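
For each mapping found, the file offset has to be translated into a
virtual address in that mm. The patch does this with
vaddr = vma->vm_start + offset - (vma->vm_pgoff << PAGE_SHIFT). The
user-space sketch below reproduces that arithmetic; vma_model,
offset_to_vaddr and MODEL_PAGE_SHIFT are hypothetical names, a 4 KiB page
size is assumed, and the explicit range check is an addition for the
example (the patch instead relies on the prio tree lookup by page offset).

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical subset of vm_area_struct used by the calculation. */
struct vma_model {
	uint64_t vm_start;	/* first virtual address of the mapping */
	uint64_t vm_end;	/* one past the last virtual address */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/*
 * Translate a file offset into a virtual address within one mapping.
 * Returns 0 and stores the address if the mapping covers the offset,
 * otherwise returns -1 and leaves *vaddr untouched.
 */
static int offset_to_vaddr(const struct vma_model *vma, uint64_t offset,
			   uint64_t *vaddr)
{
	uint64_t map_start = vma->vm_pgoff << MODEL_PAGE_SHIFT;
	uint64_t map_len;

	assert(vma->vm_end >= vma->vm_start);
	map_len = vma->vm_end - vma->vm_start;

	if (offset < map_start || offset - map_start >= map_len)
		return -1;

	*vaddr = vma->vm_start + (offset - map_start);
	return 0;
}
```

Because the same inode can be mapped several times in one mm at different
vm_start values, the (mm, vaddr) pair rather than the mm alone identifies
a place where the breakpoint has to go.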

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog: (Since v5)
1. Use i_size_read(inode) instead of inode->i_size.
2. Ensure uprobe->consumers is NULL, before __unregister_uprobe() is
   called.
3. remove restriction while unregistering.
4. Earlier code leaked inode references under some conditions while
   registering/unregistering.
5. continue the vma-rmap walk even if the intermediate vma doesn't
   meet the requirements.
6. validate the vma found by find_vma before inserting/removing the
   breakpoint
7. call del_consumer under mutex_lock.

 arch/Kconfig            |    9 +
 include/linux/uprobes.h |   16 ++
 kernel/Makefile         |    1 
 kernel/uprobes.c        |  323 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 349 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4b0669c..dedd489 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -61,6 +61,15 @@ config OPTPROBES
 	depends on KPROBES && HAVE_OPTPROBES
 	depends on !PREEMPT
 
+config UPROBES
+	bool "User-space probes (EXPERIMENTAL)"
+	help
+	  Uprobes enables kernel subsystems to establish probepoints
+	  in user applications and execute handler functions when
+	  the probepoints are hit.
+
+	  If in doubt, say "N".
+
 config HAVE_EFFICIENT_UNALIGNED_ACCESS
 	bool
 	help
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bf31f7c..6d5a3fe 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -45,4 +45,20 @@ struct uprobe {
 	loff_t			offset;
 };
 
+#ifdef CONFIG_UPROBES
+extern int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer);
+extern void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer);
+#else /* CONFIG_UPROBES is not defined */
+static inline int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	return -ENOSYS;
+}
+static inline void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+}
+#endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index e898c5b..9fb670d 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -109,6 +109,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_UPROBES) += uprobes.o
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 2c92b9a..70ab372 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -24,11 +24,52 @@
 #include <linux/kernel.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 #include <linux/uprobes.h>
 
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
 
+#define UPROBES_HASH_SZ	13
+/* serialize (un)register */
+static struct mutex uprobes_mutex[UPROBES_HASH_SZ];
+#define uprobes_hash(v)	(&uprobes_mutex[((unsigned long)(v)) %\
+						UPROBES_HASH_SZ])
+
+/*
+ * Maintain a temporary per vma info that can be used to search if a vma
+ * has already been handled. This structure is introduced since extending
+ * vm_area_struct wasn't recommended.
+ */
+struct vma_info {
+	struct list_head probe_list;
+	struct mm_struct *mm;
+	loff_t vaddr;
+};
+
+/*
+ * valid_vma: Verify if the specified vma is an executable vma
+ * Relax restrictions while unregistering: vm_flags might have
+ * changed after breakpoint was inserted.
+ *	- is_reg: indicates if we are in register context.
+ *	- Return true if the specified virtual address is in an
+ *	  executable vma.
+ */
+static bool valid_vma(struct vm_area_struct *vma, bool is_reg)
+{
+	if (!vma->vm_file)
+		return false;
+
+	if (!is_reg)
+		return true;
+
+	if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
+						(VM_READ|VM_EXEC))
+		return true;
+
+	return false;
+}
+
 static int match_uprobe(struct uprobe *l, struct uprobe *r)
 {
 	if (l->inode < r->inode)
@@ -197,6 +238,18 @@ static bool del_consumer(struct uprobe *uprobe,
 	return ret;
 }
 
+static int install_breakpoint(struct mm_struct *mm)
+{
+	/* Placeholder: Yet to be implemented */
+	return 0;
+}
+
+static void remove_breakpoint(struct mm_struct *mm)
+{
+	/* Placeholder: Yet to be implemented */
+	return;
+}
+
 static void delete_uprobe(struct uprobe *uprobe)
 {
 	unsigned long flags;
@@ -207,3 +260,273 @@ static void delete_uprobe(struct uprobe *uprobe)
 	iput(uprobe->inode);
 	put_uprobe(uprobe);
 }
+
+static struct vma_info *__find_next_vma_info(struct list_head *head,
+			loff_t offset, struct address_space *mapping,
+			struct vma_info *vi, bool is_register)
+{
+	struct prio_tree_iter iter;
+	struct vm_area_struct *vma;
+	struct vma_info *tmpvi;
+	loff_t vaddr;
+	unsigned long pgoff = offset >> PAGE_SHIFT;
+	int existing_vma;
+
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (!valid_vma(vma, is_register))
+			continue;
+
+		existing_vma = 0;
+		vaddr = vma->vm_start + offset;
+		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+		list_for_each_entry(tmpvi, head, probe_list) {
+			if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
+				existing_vma = 1;
+				break;
+			}
+		}
+
+		/*
+		 * Another vma needs a probe to be installed. However skip
+		 * installing the probe if the vma is about to be unlinked.
+		 */
+		if (!existing_vma &&
+				atomic_inc_not_zero(&vma->vm_mm->mm_users)) {
+			vi->mm = vma->vm_mm;
+			vi->vaddr = vaddr;
+			list_add(&vi->probe_list, head);
+			return vi;
+		}
+	}
+	return NULL;
+}
+
+/*
+ * Iterate in the rmap prio tree  and find a vma where a probe has not
+ * yet been inserted.
+ */
+static struct vma_info *find_next_vma_info(struct list_head *head,
+			loff_t offset, struct address_space *mapping,
+			bool is_register)
+{
+	struct vma_info *vi, *retvi;
+	vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
+	if (!vi)
+		return ERR_PTR(-ENOMEM);
+
+	mutex_lock(&mapping->i_mmap_mutex);
+	retvi = __find_next_vma_info(head, offset, mapping, vi, is_register);
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (!retvi)
+		kfree(vi);
+	return retvi;
+}
+
+static int __register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe *uprobe)
+{
+	struct list_head try_list;
+	struct vm_area_struct *vma;
+	struct address_space *mapping;
+	struct vma_info *vi, *tmpvi;
+	struct mm_struct *mm;
+	loff_t vaddr;
+	int ret = 0;
+
+	mapping = inode->i_mapping;
+	INIT_LIST_HEAD(&try_list);
+	while ((vi = find_next_vma_info(&try_list, offset,
+						mapping, true)) != NULL) {
+		if (IS_ERR(vi)) {
+			ret = -ENOMEM;
+			break;
+		}
+		mm = vi->mm;
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, (unsigned long)vi->vaddr);
+		if (!vma || !valid_vma(vma, true)) {
+			list_del(&vi->probe_list);
+			kfree(vi);
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			continue;
+		}
+		vaddr = vma->vm_start + offset;
+		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+		if (vma->vm_file->f_mapping->host != inode ||
+						vaddr != vi->vaddr) {
+			list_del(&vi->probe_list);
+			kfree(vi);
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			continue;
+		}
+		ret = install_breakpoint(mm);
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+		if (ret && ret == -EEXIST)
+			ret = 0;
+		if (ret)
+			break;
+	}
+	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
+		list_del(&vi->probe_list);
+		kfree(vi);
+	}
+	return ret;
+}
+
+static void __unregister_uprobe(struct inode *inode, loff_t offset,
+						struct uprobe *uprobe)
+{
+	struct list_head try_list;
+	struct address_space *mapping;
+	struct vma_info *vi, *tmpvi;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	loff_t vaddr;
+
+	mapping = inode->i_mapping;
+	INIT_LIST_HEAD(&try_list);
+	while ((vi = find_next_vma_info(&try_list, offset,
+						mapping, false)) != NULL) {
+		if (IS_ERR(vi))
+			break;
+		mm = vi->mm;
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, (unsigned long)vi->vaddr);
+		if (!vma || !valid_vma(vma, false)) {
+			list_del(&vi->probe_list);
+			kfree(vi);
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			continue;
+		}
+		vaddr = vma->vm_start + offset;
+		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+		if (vma->vm_file->f_mapping->host != inode ||
+						vaddr != vi->vaddr) {
+			list_del(&vi->probe_list);
+			kfree(vi);
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			continue;
+		}
+		remove_breakpoint(mm);
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+	}
+
+	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
+		list_del(&vi->probe_list);
+		kfree(vi);
+	}
+	delete_uprobe(uprobe);
+}
+
+/*
+ * register_uprobe - register a probe
+ * @inode: the file in which the probe has to be placed.
+ * @offset: offset from the start of the file.
+ * @consumer: information on how to handle the probe.
+ *
+ * Apart from the access refcount, register_uprobe() takes a creation
+ * refcount (through alloc_uprobe) if and only if this @uprobe is getting
+ * inserted into the rbtree (i.e. first consumer for a @inode:@offset
+ * tuple).  Creation refcount stops unregister_uprobe from freeing the
+ * @uprobe even before the register operation is complete. Creation
+ * refcount is released when the last @consumer for the @uprobe
+ * unregisters.
+ *
+ * Return errno if it cannot successfully install probes
+ * else return 0 (success)
+ */
+int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe *uprobe;
+	int ret = -EINVAL;
+
+	if (!consumer || consumer->next)
+		return ret;
+
+	inode = igrab(inode);
+	if (!inode)
+		return ret;
+
+	if (offset > i_size_read(inode))
+		goto reg_out;
+
+	ret = 0;
+	mutex_lock(uprobes_hash(inode));
+	uprobe = alloc_uprobe(inode, offset);
+	if (uprobe && !add_consumer(uprobe, consumer)) {
+		ret = __register_uprobe(inode, offset, uprobe);
+		if (ret) {
+			uprobe->consumers = NULL;
+			__unregister_uprobe(inode, offset, uprobe);
+		}
+	}
+
+	mutex_unlock(uprobes_hash(inode));
+	put_uprobe(uprobe);
+
+reg_out:
+	iput(inode);
+	return ret;
+}
+
+/*
+ * unregister_uprobe - unregister an already registered probe.
+ * @inode: the file in which the probe has to be removed.
+ * @offset: offset from the start of the file.
+ * @consumer: identify which probe if multiple probes are colocated.
+ */
+void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe *uprobe = NULL;
+
+	inode = igrab(inode);
+	if (!inode || !consumer)
+		goto unreg_out;
+
+	uprobe = find_uprobe(inode, offset);
+	if (!uprobe)
+		goto unreg_out;
+
+	mutex_lock(uprobes_hash(inode));
+	if (!del_consumer(uprobe, consumer)) {
+		mutex_unlock(uprobes_hash(inode));
+		goto unreg_out;
+	}
+
+	if (!uprobe->consumers)
+		__unregister_uprobe(inode, offset, uprobe);
+
+	mutex_unlock(uprobes_hash(inode));
+
+unreg_out:
+	if (uprobe)
+		put_uprobe(uprobe);
+	if (inode)
+		iput(inode);
+}
+
+static int __init init_uprobes(void)
+{
+	int i;
+
+	for (i = 0; i < UPROBES_HASH_SZ; i++)
+		mutex_init(&uprobes_mutex[i]);
+
+	return 0;
+}
+
+static void __exit exit_uprobes(void)
+{
+}
+
+module_init(init_uprobes);
+module_exit(exit_uprobes);



* [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
@ 2011-11-18 11:07 ` Srikar Dronamraju
  2011-11-23 17:13   ` Peter Zijlstra
                     ` (2 more replies)
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction Srikar Dronamraju
                   ` (27 subsequent siblings)
  31 siblings, 3 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:07 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


If an executable vma is getting mapped, search for and insert the
corresponding probes. On unmap, make sure the probes count is decremented
by the appropriate amount.

On process creation, make sure the probes count in the child is set
correctly.
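
The hooks in this patch decide whether a probe belongs to a vma by
translating the probe's file offset into a virtual address and checking
that it falls inside the mapping. A minimal user-space sketch of that
translation follows. The struct, the sketch_ names and the 4 KiB page
size are assumptions made for illustration; the kernel code works on
struct vm_area_struct and PAGE_SHIFT directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the three vm_area_struct fields used here. */
struct sketch_vma {
	uint64_t vm_start;	/* first mapped virtual address */
	uint64_t vm_end;	/* one past the last mapped address */
	uint64_t vm_pgoff;	/* file offset of vm_start, in pages */
};

/*
 * Translate a file offset into a virtual address inside the vma.
 * Returns true and stores the address in *vaddr when the offset is
 * covered by the mapping, false otherwise.
 */
bool sketch_offset_to_vaddr(const struct sketch_vma *vma,
			    uint64_t offset, uint64_t *vaddr)
{
	uint64_t map_start = vma->vm_pgoff << SKETCH_PAGE_SHIFT;
	uint64_t delta;

	assert(vma->vm_start <= vma->vm_end);

	/* offset lies before the part of the file this vma maps */
	if (offset < map_start)
		return false;

	/* offset lies beyond the part of the file this vma maps */
	delta = offset - map_start;
	if (delta >= vma->vm_end - vma->vm_start)
		return false;

	*vaddr = vma->vm_start + delta;
	return true;
}
```

The sketch checks the offset against the mapped file range before adding
it to vm_start, which avoids relying on wrap-around in unsigned
arithmetic; the patch itself computes the address first and then
range-checks it.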

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog: (Since v5)
- use hash locks.
- Handle mremap.
- while forking, handle vmas that have VM_DONTCOPY.
- while forking, handle race of new breakpoints being inserted / removed
  in the parent process.
- Introduce find_least_offset_node() instead of close match logic in
  find_uprobe
- munmap now reuses build_probe_list instead of dec_mm_uprobes_count.

 include/linux/mm_types.h |    3 +
 include/linux/uprobes.h  |   12 +++
 kernel/fork.c            |    7 ++
 kernel/uprobes.c         |  188 ++++++++++++++++++++++++++++++++++++++++++++--
 mm/mmap.c                |   33 ++++++++
 5 files changed, 233 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5b42f1b..544a0b6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -389,6 +389,9 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_UPROBES
+	atomic_t mm_uprobes_count;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 6d5a3fe..b4de058 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -25,6 +25,8 @@
 
 #include <linux/rbtree.h>
 
+struct vm_area_struct;
+
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	/*
@@ -40,6 +42,7 @@ struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	atomic_t		ref;
 	struct rw_semaphore	consumer_rwsem;
+	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
@@ -50,6 +53,8 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
+extern int mmap_uprobe(struct vm_area_struct *vma);
+extern void munmap_uprobe(struct vm_area_struct *vma);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
@@ -60,5 +65,12 @@ static inline void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
 {
 }
+static inline int mmap_uprobe(struct vm_area_struct *vma)
+{
+	return 0;
+}
+static inline void munmap_uprobe(struct vm_area_struct *vma)
+{
+}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index ba0d172..c8c287a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -66,6 +66,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
+#include <linux/uprobes.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -421,6 +422,9 @@ static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 
 		if (retval)
 			goto out;
+
+		if (file && mmap_uprobe(tmp))
+			goto out;
 	}
 	/* a new mm has just been created */
 	arch_dup_mmap(oldmm, mm);
@@ -738,6 +742,9 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
+#ifdef CONFIG_UPROBES
+	atomic_set(&mm->mm_uprobes_count, 0);
+#endif
 
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 70ab372..1baae40 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -36,6 +36,18 @@ static struct mutex uprobes_mutex[UPROBES_HASH_SZ];
 #define uprobes_hash(v)	(&uprobes_mutex[((unsigned long)(v)) %\
 						UPROBES_HASH_SZ])
 
+/* serialize uprobe->pending_list */
+static struct mutex uprobes_mmap_mutex[UPROBES_HASH_SZ];
+#define uprobes_mmap_hash(v)	(&uprobes_mmap_mutex[((unsigned long)(v)) %\
+						UPROBES_HASH_SZ])
+
+/*
+ * uprobe_events allows us to skip the mmap_uprobe if there are no uprobe
+ * events active at this time.  Probably a fine grained per inode count is
+ * better?
+ */
+static atomic_t uprobe_events = ATOMIC_INIT(0);
+
 /*
  * Maintain a temporary per vma info that can be used to search if a vma
  * has already been handled. This structure is introduced since extending
@@ -105,7 +117,6 @@ static struct uprobe *__find_uprobe(struct inode *inode, loff_t offset)
 			n = n->rb_left;
 		else
 			n = n->rb_right;
-
 	}
 	return NULL;
 }
@@ -191,6 +202,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 	uprobe->inode = igrab(inode);
 	uprobe->offset = offset;
 	init_rwsem(&uprobe->consumer_rwsem);
+	INIT_LIST_HEAD(&uprobe->pending_list);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
@@ -200,7 +212,8 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 		kfree(uprobe);
 		uprobe = cur_uprobe;
 		iput(inode);
-	}
+	} else
+		atomic_inc(&uprobe_events);
 	return uprobe;
 }
 
@@ -238,15 +251,24 @@ static bool del_consumer(struct uprobe *uprobe,
 	return ret;
 }
 
-static int install_breakpoint(struct mm_struct *mm)
+static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
 {
-	/* Placeholder: Yet to be implemented */
+	/*
+	 * Probe is to be deleted;
+	 * Dont know if somebody already inserted the probe;
+	 * behave as if probe already exists.
+	 */
+	if (!uprobe->consumers)
+		return -EEXIST;
+
+	atomic_inc(&mm->mm_uprobes_count);
 	return 0;
 }
 
-static void remove_breakpoint(struct mm_struct *mm)
+static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
 {
 	/* Placeholder: Yet to be implemented */
+	atomic_dec(&mm->mm_uprobes_count);
 	return;
 }
 
@@ -259,6 +281,7 @@ static void delete_uprobe(struct uprobe *uprobe)
 	spin_unlock_irqrestore(&uprobes_treelock, flags);
 	iput(uprobe->inode);
 	put_uprobe(uprobe);
+	atomic_dec(&uprobe_events);
 }
 
 static struct vma_info *__find_next_vma_info(struct list_head *head,
@@ -362,7 +385,7 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		ret = install_breakpoint(mm);
+		ret = install_breakpoint(mm, uprobe);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 		if (ret && ret == -EEXIST)
@@ -413,7 +436,7 @@ static void __unregister_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		remove_breakpoint(mm);
+		remove_breakpoint(mm, uprobe);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
@@ -514,13 +537,160 @@ void unregister_uprobe(struct inode *inode, loff_t offset,
 		iput(inode);
 }
 
+/*
+ * Of all the nodes that correspond to the given inode, return the node
+ * with the least offset.
+ */
+static struct rb_node *find_least_offset_node(struct inode *inode)
+{
+	struct uprobe u = { .inode = inode, .offset = 0};
+	struct rb_node *n = uprobes_tree.rb_node;
+	struct rb_node *close_node = NULL;
+	struct uprobe *uprobe;
+	int match;
+
+	while (n) {
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		match = match_uprobe(&u, uprobe);
+		if (uprobe->inode == inode)
+			close_node = n;
+
+		if (!match)
+			return close_node;
+
+		if (match < 0)
+			n = n->rb_left;
+		else
+			n = n->rb_right;
+	}
+	return close_node;
+}
+
+/*
+ * For a given inode, build a list of probes that need to be inserted.
+ */
+static void build_probe_list(struct inode *inode, struct list_head *head)
+{
+	struct uprobe *uprobe;
+	struct rb_node *n;
+	unsigned long flags;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	n = find_least_offset_node(inode);
+	for (; n; n = rb_next(n)) {
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		if (uprobe->inode != inode)
+			break;
+
+		list_add(&uprobe->pending_list, head);
+		atomic_inc(&uprobe->ref);
+	}
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+}
+
+/*
+ * Called from mmap_region.
+ * called with mm->mmap_sem acquired.
+ *
+ * Return a negative errno if we fail to insert probes and we cannot
+ * bail-out.
+ * Return 0 otherwise, i.e.:
+ *	- successful insertion of probes
+ *	- (or) no possible probes to be inserted.
+ *	- (or) insertion of probes failed but we can bail-out.
+ */
+int mmap_uprobe(struct vm_area_struct *vma)
+{
+	struct list_head tmp_list;
+	struct uprobe *uprobe, *u;
+	struct inode *inode;
+	int ret = 0, count = 0;
+
+	if (!atomic_read(&uprobe_events) || !valid_vma(vma, true))
+		return ret;	/* Bail-out */
+
+	inode = igrab(vma->vm_file->f_mapping->host);
+	if (!inode)
+		return ret;
+
+	INIT_LIST_HEAD(&tmp_list);
+	mutex_lock(uprobes_mmap_hash(inode));
+	build_probe_list(inode, &tmp_list);
+	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
+		loff_t vaddr;
+
+		list_del(&uprobe->pending_list);
+		if (!ret) {
+			vaddr = vma->vm_start + uprobe->offset;
+			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+			if (vaddr < vma->vm_start || vaddr >= vma->vm_end) {
+				put_uprobe(uprobe);
+				continue;
+			}
+			ret = install_breakpoint(vma->vm_mm, uprobe);
+			if (ret == -EEXIST) {
+				atomic_inc(&vma->vm_mm->mm_uprobes_count);
+				ret = 0;
+			}
+			if (!ret)
+				count++;
+		}
+		put_uprobe(uprobe);
+	}
+
+	mutex_unlock(uprobes_mmap_hash(inode));
+	iput(inode);
+	if (ret)
+		atomic_sub(count, &vma->vm_mm->mm_uprobes_count);
+
+	return ret;
+}
+
+/*
+ * Called in context of a munmap of a vma.
+ */
+void munmap_uprobe(struct vm_area_struct *vma)
+{
+	struct list_head tmp_list;
+	struct uprobe *uprobe, *u;
+	struct inode *inode;
+
+	if (!atomic_read(&uprobe_events) || !valid_vma(vma, false))
+		return;		/* Bail-out */
+
+	if (!atomic_read(&vma->vm_mm->mm_uprobes_count))
+		return;
+
+	inode = igrab(vma->vm_file->f_mapping->host);
+	if (!inode)
+		return;
+
+	INIT_LIST_HEAD(&tmp_list);
+	mutex_lock(uprobes_mmap_hash(inode));
+	build_probe_list(inode, &tmp_list);
+	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
+		loff_t vaddr;
+
+		list_del(&uprobe->pending_list);
+		vaddr = vma->vm_start + uprobe->offset;
+		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+		if (vaddr >= vma->vm_start && vaddr < vma->vm_end)
+			atomic_dec(&vma->vm_mm->mm_uprobes_count);
+		put_uprobe(uprobe);
+	}
+	mutex_unlock(uprobes_mmap_hash(inode));
+	iput(inode);
+	return;
+}
+
 static int __init init_uprobes(void)
 {
 	int i;
 
-	for (i = 0; i < UPROBES_HASH_SZ; i++)
+	for (i = 0; i < UPROBES_HASH_SZ; i++) {
 		mutex_init(&uprobes_mutex[i]);
-
+		mutex_init(&uprobes_mmap_mutex[i]);
+	}
 	return 0;
 }
 
diff --git a/mm/mmap.c b/mm/mmap.c
index eae90af..83813fa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -30,6 +30,7 @@
 #include <linux/perf_event.h>
 #include <linux/audit.h>
 #include <linux/khugepaged.h>
+#include <linux/uprobes.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -217,6 +218,7 @@ void unlink_file_vma(struct vm_area_struct *vma)
 		mutex_lock(&mapping->i_mmap_mutex);
 		__remove_shared_vm_struct(vma, file, mapping);
 		mutex_unlock(&mapping->i_mmap_mutex);
+		munmap_uprobe(vma);
 	}
 }
 
@@ -545,8 +547,14 @@ again:			remove_next = 1 + (end > next->vm_end);
 
 	if (file) {
 		mapping = file->f_mapping;
-		if (!(vma->vm_flags & VM_NONLINEAR))
+		if (!(vma->vm_flags & VM_NONLINEAR)) {
 			root = &mapping->i_mmap;
+			munmap_uprobe(vma);
+
+			if (adjust_next)
+				munmap_uprobe(next);
+		}
+
 		mutex_lock(&mapping->i_mmap_mutex);
 		if (insert) {
 			/*
@@ -616,8 +624,16 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (mapping)
 		mutex_unlock(&mapping->i_mmap_mutex);
 
+	if (root) {
+		mmap_uprobe(vma);
+
+		if (adjust_next)
+			mmap_uprobe(next);
+	}
+
 	if (remove_next) {
 		if (file) {
+			munmap_uprobe(next);
 			fput(file);
 			if (next->vm_flags & VM_EXECUTABLE)
 				removed_exe_file_vma(mm);
@@ -637,6 +653,8 @@ again:			remove_next = 1 + (end > next->vm_end);
 			goto again;
 		}
 	}
+	if (insert && file)
+		mmap_uprobe(insert);
 
 	validate_mm(mm);
 
@@ -1329,6 +1347,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			mm->locked_vm += (len >> PAGE_SHIFT);
 	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
 		make_pages_present(addr, addr + len);
+
+	if (file && mmap_uprobe(vma))
+		/* matching probes but cannot insert */
+		goto unmap_and_free_vma;
+
 	return addr;
 
 unmap_and_free_vma:
@@ -2305,6 +2328,10 @@ int insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma)
 	if ((vma->vm_flags & VM_ACCOUNT) &&
 	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
 		return -ENOMEM;
+
+	if (vma->vm_file && mmap_uprobe(vma))
+		return -EINVAL;
+
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	return 0;
 }
@@ -2356,6 +2383,10 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			new_vma->vm_pgoff = pgoff;
 			if (new_vma->vm_file) {
 				get_file(new_vma->vm_file);
+
+				if (mmap_uprobe(new_vma))
+					goto out_free_mempol;
+
 				if (vma->vm_flags & VM_EXECUTABLE)
 					added_exe_file_vma(mm);
 			}



* [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap Srikar Dronamraju
@ 2011-11-18 11:07 ` Srikar Dronamraju
  2011-11-23 18:26   ` Peter Zijlstra
                     ` (2 more replies)
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 6/30] uprobes: define fixups Srikar Dronamraju
                   ` (26 subsequent siblings)
  31 siblings, 3 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:07 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


When inserting the first probepoint, save a copy of the original
instruction.  This copy is later used for fixup analysis, is copied to the
slot on a probe hit, and is used to restore the original instruction.
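
Two edge cases shape the copy: an instruction near the end of the file
has fewer bytes available than the buffer holds, and an instruction
that straddles a page boundary has to be read from two pages. The
following user-space sketch illustrates both under simplifying
assumptions: the file is a plain byte array, pages are 4 KiB, and all
sketch_ names are hypothetical rather than part of this patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */
#define SKETCH_MAX_INSN  16	/* buffer size, as MAX_UINSN_BYTES on x86 */

/*
 * Copy n bytes that all live inside one page of the file image.
 * Stands in for reading a single page-cache page.
 */
void sketch_copy_from_page(const unsigned char *file, size_t offset,
			   unsigned char *dst, size_t n)
{
	/* the caller must never ask for bytes beyond the page boundary */
	assert((offset % SKETCH_PAGE_SIZE) + n <= SKETCH_PAGE_SIZE);
	memcpy(dst, file + offset, n);
}

/*
 * Copy up to SKETCH_MAX_INSN bytes starting at offset into insn.
 * Returns the number of bytes copied.
 */
size_t sketch_copy_insn(const unsigned char *file, size_t file_size,
			size_t offset, unsigned char *insn)
{
	size_t bytes = SKETCH_MAX_INSN;
	size_t in_first_page = SKETCH_PAGE_SIZE - (offset % SKETCH_PAGE_SIZE);

	if (offset >= file_size)
		return 0;

	/* instruction at the end of the file: copy only what exists */
	if (file_size - offset < bytes)
		bytes = file_size - offset;

	if (in_first_page < bytes) {
		/* the tail of the instruction lives in the next page */
		sketch_copy_from_page(file, offset + in_first_page,
				      insn + in_first_page,
				      bytes - in_first_page);
		bytes = in_first_page;
		sketch_copy_from_page(file, offset, insn, bytes);
		return in_first_page + (SKETCH_MAX_INSN < file_size - offset ?
			SKETCH_MAX_INSN : file_size - offset) - in_first_page;
	}

	sketch_copy_from_page(file, offset, insn, bytes);
	return bytes;
}
```

As in the patch, the second page is read first and the byte count is
then trimmed to what remains in the first page; the return value
reports the total number of bytes placed in the buffer.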

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog: (Since v5)
- Uprobes no longer depends on MM_OWNER; no reference to task_structs
  while inserting/removing a probe.
- Uses read_mapping_page instead of grab_cache_page so that the pages
  have valid content.

 include/linux/uprobes.h |   12 +++++
 kernel/uprobes.c        |  111 +++++++++++++++++++++++++++++++++++++++++++----
 2 files changed, 113 insertions(+), 10 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b4de058..fa2b663 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -26,6 +26,12 @@
 #include <linux/rbtree.h>
 
 struct vm_area_struct;
+#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
+#include <asm/uprobes.h>
+#else
+
+#define MAX_UINSN_BYTES 4
+#endif
 
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
@@ -46,9 +52,15 @@ struct uprobe {
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
+	int			copy;
+	u8			insn[MAX_UINSN_BYTES];
 };
 
 #ifdef CONFIG_UPROBES
+extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
+							unsigned long vaddr);
+extern int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe,
+					unsigned long vaddr, bool verify);
 extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 1baae40..f4574fd 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -23,6 +23,7 @@
 
 #include <linux/kernel.h>
 #include <linux/highmem.h>
+#include <linux/pagemap.h>	/* read_mapping_page */
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/uprobes.h>
@@ -82,6 +83,20 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_reg)
 	return false;
 }
 
+int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
+						unsigned long vaddr)
+{
+	/* placeholder: yet to be implemented */
+	return 0;
+}
+
+int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe,
+					unsigned long vaddr, bool verify)
+{
+	/* placeholder: yet to be implemented */
+	return 0;
+}
+
 static int match_uprobe(struct uprobe *l, struct uprobe *r)
 {
 	if (l->inode < r->inode)
@@ -251,8 +266,71 @@ static bool del_consumer(struct uprobe *uprobe,
 	return ret;
 }
 
-static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
+static int __copy_insn(struct address_space *mapping,
+			struct vm_area_struct *vma, char *insn,
+			unsigned long nbytes, unsigned long offset)
+{
+	struct file *filp = vma->vm_file;
+	struct page *page;
+	void *vaddr;
+	unsigned long off1;
+	unsigned long idx;
+
+	if (!filp)
+		return -EINVAL;
+
+	idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT);
+	off1 = offset &= ~PAGE_MASK;
+
+	/*
+	 * Ensure that the page that has the original instruction is
+	 * populated and in page-cache.
+	 */
+	page = read_mapping_page(mapping, idx, filp);
+	if (IS_ERR(page))
+		return -ENOMEM;
+
+	vaddr = kmap_atomic(page);
+	memcpy(insn, vaddr + off1, nbytes);
+	kunmap_atomic(vaddr);
+	page_cache_release(page);
+	return 0;
+}
+
+static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma,
+					unsigned long addr)
+{
+	struct address_space *mapping;
+	int bytes;
+	unsigned long nbytes;
+
+	addr &= ~PAGE_MASK;
+	nbytes = PAGE_SIZE - addr;
+	mapping = uprobe->inode->i_mapping;
+
+	/* Instruction at end of binary; copy only available bytes */
+	if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size)
+		bytes = uprobe->inode->i_size - uprobe->offset;
+	else
+		bytes = MAX_UINSN_BYTES;
+
+	/* Instruction at the page-boundary; copy bytes in second page */
+	if (nbytes < bytes) {
+		if (__copy_insn(mapping, vma, uprobe->insn + nbytes,
+				bytes - nbytes, uprobe->offset + nbytes))
+			return -ENOMEM;
+
+		bytes = nbytes;
+	}
+	return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset);
+}
+
+static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
+				struct vm_area_struct *vma, loff_t vaddr)
 {
+	unsigned long addr;
+	int ret = -EINVAL;
+
 	/*
 	 * Probe is to be deleted;
 	 * Dont know if somebody already inserted the probe;
@@ -261,15 +339,27 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
 	if (!uprobe->consumers)
 		return -EEXIST;
 
-	atomic_inc(&mm->mm_uprobes_count);
-	return 0;
+	addr = (unsigned long)vaddr;
+	if (!uprobe->copy) {
+		ret = copy_insn(uprobe, vma, addr);
+		if (ret)
+			return ret;
+
+		/* TODO : Analysis and verification of instruction */
+		uprobe->copy = 1;
+	}
+	ret = set_bkpt(mm, uprobe, addr);
+	if (!ret)
+		atomic_inc(&mm->mm_uprobes_count);
+
+	return ret;
 }
 
-static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
+static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
+							loff_t vaddr)
 {
-	/* Placeholder: Yet to be implemented */
-	atomic_dec(&mm->mm_uprobes_count);
-	return;
+	if (!set_orig_insn(mm, uprobe, (unsigned long)vaddr, true))
+		atomic_dec(&mm->mm_uprobes_count);
 }
 
 static void delete_uprobe(struct uprobe *uprobe)
@@ -385,7 +475,7 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		ret = install_breakpoint(mm, uprobe);
+		ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 		if (ret && ret == -EEXIST)
@@ -436,7 +526,7 @@ static void __unregister_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		remove_breakpoint(mm, uprobe);
+		remove_breakpoint(mm, uprobe, vi->vaddr);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
@@ -627,7 +717,8 @@ int mmap_uprobe(struct vm_area_struct *vma)
 				put_uprobe(uprobe);
 				continue;
 			}
-			ret = install_breakpoint(vma->vm_mm, uprobe);
+			ret = install_breakpoint(vma->vm_mm, uprobe, vma,
+								vaddr);
 			if (ret == -EEXIST) {
 				atomic_inc(&vma->vm_mm->mm_uprobes_count);
 				ret = 0;



* [PATCH v7 3.2-rc2 6/30] uprobes: define fixups.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction Srikar Dronamraju
@ 2011-11-18 11:07 ` Srikar Dronamraju
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 7/30] uprobes: uprobes arch info Srikar Dronamraju
                   ` (25 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:07 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


During the first insertion of a probepoint, the instruction is analyzed
for fixups and the result is cached in the per-uprobe struct. On a probe
hit, the cached fixup is used. Fixup analysis and caching are done in
arch-specific code.
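
Because the fixup values are distinct bits, several of them can be
cached together in one field and tested individually on a probe hit. A
small sketch of that pattern follows. The flag values are the ones
defined in this patch; the cut-down struct and the sketch_ helpers are
hypothetical and only illustrate the caching idea, not the arch code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined in this patch. */
#define UPROBES_FIX_NONE	0x0
#define UPROBES_FIX_IP		0x1
#define UPROBES_FIX_CALL	0x2

/* Hypothetical cut-down uprobe holding only the cached fixups. */
struct sketch_uprobe {
	uint16_t fixups;
};

/* Record a fixup at analysis time; flags accumulate by OR. */
void sketch_add_fixup(struct sketch_uprobe *u, uint16_t flag)
{
	/* UPROBES_FIX_NONE is the empty set, not a flag to add */
	assert(flag != UPROBES_FIX_NONE);
	u->fixups = (uint16_t)(u->fixups | flag);
}

/* Query a cached fixup on a probe hit, without re-analysing. */
bool sketch_needs_fixup(const struct sketch_uprobe *u, uint16_t flag)
{
	return (u->fixups & flag) != 0;
}
```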

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index fa2b663..dd308fa 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,6 +33,17 @@ struct vm_area_struct;
 #define MAX_UINSN_BYTES 4
 #endif
 
+#define uprobe_opcode_sz sizeof(uprobe_opcode_t)
+
+/* Post-execution fixups.  Some architectures may define others. */
+
+/* No fixup needed */
+#define UPROBES_FIX_NONE	0x0
+/* Adjust IP back to vicinity of actual insn */
+#define UPROBES_FIX_IP	0x1
+/* Adjust the return address of a call insn */
+#define UPROBES_FIX_CALL	0x2
+
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	/*
@@ -53,6 +64,7 @@ struct uprobe {
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
 	int			copy;
+	u16			fixups;
 	u8			insn[MAX_UINSN_BYTES];
 };
 



* [PATCH v7 3.2-rc2 7/30] uprobes: uprobes arch info
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 6/30] uprobes: define fixups Srikar Dronamraju
@ 2011-11-18 11:07 ` Srikar Dronamraju
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 8/30] x86: analyze instruction and determine fixups Srikar Dronamraju
                   ` (24 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:07 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Introduce a per-uprobe arch info structure, used to store arch-specific
details, for example the details needed to handle RIP-relative
instructions on x86_64.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index dd308fa..44f28dc 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -29,7 +29,7 @@ struct vm_area_struct;
 #ifdef CONFIG_ARCH_SUPPORTS_UPROBES
 #include <asm/uprobes.h>
 #else
-
+struct uprobe_arch_info {};
 #define MAX_UINSN_BYTES 4
 #endif
 
@@ -60,6 +60,7 @@ struct uprobe {
 	atomic_t		ref;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
+	struct uprobe_arch_info arch_info;
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;



* [PATCH v7 3.2-rc2 8/30] x86: analyze instruction and determine fixups.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 7/30] uprobes: uprobes arch info Srikar Dronamraju
@ 2011-11-18 11:08 ` Srikar Dronamraju
  2011-11-30 18:57   ` Oleg Nesterov
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement Srikar Dronamraju
                   ` (23 subsequent siblings)
  31 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:08 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


The instruction analysis is based on the x86 instruction decoder. It
determines whether an instruction can be probed and which fixups are
necessary after singlestep.  Instruction analysis is done at probe
insertion time so that we avoid having to repeat the same analysis every
time a probe is hit.
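
Part of the analysis is a table lookup: the patch packs one bit per
one-byte opcode into 32-bit words using the W() macro, two 16-opcode
rows per word. The sketch below reuses that macro with only the first
two rows of the 64-bit table (opcodes 0x00 to 0x1f). The lookup helper
and the sketch_ names are assumptions for illustration; they are not
the lookup code used by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same shape as W() in this patch: one row of 16 opcode bits. */
#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
	 << (row % 32))

/* Only the rows for opcodes 0x00-0x1f, copied from good_insns_64. */
const uint32_t sketch_good_insns[1] = {
	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
	W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
	W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0)   /* 10 */
};

/* Hypothetical lookup: is this one-byte opcode marked as probeable? */
bool sketch_opcode_ok(uint8_t opcode)
{
	/* this cut-down table only covers opcodes 0x00-0x1f */
	assert(opcode < 0x20);
	return (sketch_good_insns[opcode / 32] >> (opcode % 32)) & 1;
}
```

Opcode N is found in word N / 32 at bit N % 32, so a full 256-entry
table needs eight 32-bit words, matching the 256 / 32 array size in the
patch.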

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- Include Instruction Decoder if Uprobes gets defined.
- Remove const attributes for instruction prefix arrays.
- Uses mm_context to know if the application is 32 bit.

 arch/x86/Kconfig               |    5 -
 arch/x86/include/asm/uprobes.h |   42 ++++
 arch/x86/kernel/Makefile       |    1 
 arch/x86/kernel/uprobes.c      |  399 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 446 insertions(+), 1 deletions(-)
 create mode 100644 arch/x86/include/asm/uprobes.h
 create mode 100644 arch/x86/kernel/uprobes.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cb9a104..029b4cc 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -77,7 +77,7 @@ config X86
 	select ARCH_HAVE_NMI_SAFE_CMPXCHG
 
 config INSTRUCTION_DECODER
-	def_bool (KPROBES || PERF_EVENTS)
+	def_bool (KPROBES || PERF_EVENTS || UPROBES)
 
 config OUTPUT_FORMAT
 	string
@@ -249,6 +249,9 @@ config ARCH_CPU_PROBE_RELEASE
 	def_bool y
 	depends on HOTPLUG_CPU
 
+config ARCH_SUPPORTS_UPROBES
+	def_bool y
+
 source "init/Kconfig"
 source "kernel/Kconfig.freezer"
 
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
new file mode 100644
index 0000000..f0b4b2b
--- /dev/null
+++ b/arch/x86/include/asm/uprobes.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_UPROBES_H
+#define _ASM_UPROBES_H
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+typedef u8 uprobe_opcode_t;
+#define MAX_UINSN_BYTES 16
+#define UPROBES_XOL_SLOT_BYTES	128	/* to keep it cache aligned */
+
+#define UPROBES_BKPT_INSN 0xcc
+#define UPROBES_BKPT_INSN_SIZE 1
+
+#ifdef CONFIG_X86_64
+struct uprobe_arch_info {
+	unsigned long rip_rela_target_address;
+};
+#else
+struct uprobe_arch_info {};
+#endif
+struct uprobe;
+extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
+#endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 8baca3c..8f28be8 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
 
 obj-$(CONFIG_SWIOTLB)			+= pci-swiotlb.o
 obj-$(CONFIG_OF)			+= devicetree.o
+obj-$(CONFIG_UPROBES)			+= uprobes.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..0be7e67
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,399 @@
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/uprobes.h>
+
+#include <linux/kdebug.h>
+#include <asm/insn.h>
+
+#ifdef CONFIG_X86_32
+#define is_32bit_app(tsk) 1
+#else
+#define is_32bit_app(tsk) (test_tsk_thread_flag(tsk, TIF_IA32))
+#endif
+
+#define UPROBES_FIX_RIP_AX	0x8000
+#define UPROBES_FIX_RIP_CX	0x4000
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG((insn)->modrm.value)
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
+	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
+	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
+	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
+	 << (row % 32))
+
+#ifdef CONFIG_X86_64
+static volatile u32 good_insns_64[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 20 */
+	W(0x30, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 30 */
+	W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+	W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+	W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+#endif
+
+/* Good-instruction tables for 32-bit apps */
+
+static volatile u32 good_insns_32[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* 20 */
+	W(0x30, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) , /* 30 */
+	W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+	W(0xd0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+	W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+
+/* Using this for both 64-bit and 32-bit apps */
+static volatile u32 good_2byte_insns[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+	W(0x30, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+	W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+	W(0xd0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+	W(0xf0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+
+#undef W
+
+/*
+ * opcodes we'll probably never support:
+ * 6c-6d, e4-e5, ec-ed - in
+ * 6e-6f, e6-e7, ee-ef - out
+ * cc, cd - int3, int
+ * cf - iret
+ * d6 - illegal instruction
+ * f1 - int1/icebp
+ * f4 - hlt
+ * fa, fb - cli, sti
+ * 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2
+ *
+ * invalid opcodes in 64-bit mode:
+ * 06, 0e, 16, 1e, 27, 2f, 37, 3f, 60-62, 82, c4-c5, d4-d5
+ *
+ * 63 - we support this opcode in x86_64 but not in i386.
+ *
+ * opcodes we may need to refine support for:
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field.  On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes.  These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ *	but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+	int i;
+
+	for (i = 0; i < insn->prefixes.nbytes; i++) {
+		switch (insn->prefixes.bytes[i]) {
+		case 0x26:	/*INAT_PFX_ES   */
+		case 0x2E:	/*INAT_PFX_CS   */
+		case 0x36:	/*INAT_PFX_SS   */
+		case 0x3E:	/*INAT_PFX_DS   */
+		case 0xF0:	/*INAT_PFX_LOCK */
+			return true;
+		}
+	}
+	return false;
+}
+
+static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
+{
+	insn_init(insn, uprobe->insn, false);
+
+	/* Skip good instruction prefixes; reject "bad" ones. */
+	insn_get_opcode(insn);
+	if (is_prefix_bad(insn))
+		return -ENOTSUPP;
+	if (test_bit(OPCODE1(insn), (unsigned long *)good_insns_32))
+		return 0;
+	if (insn->opcode.nbytes == 2) {
+		if (test_bit(OPCODE2(insn), (unsigned long *)good_2byte_insns))
+			return 0;
+	}
+	return -ENOTSUPP;
+}
+
+/*
+ * Figure out which fixups post_xol() will need to perform, and annotate
+ * uprobe->fixups accordingly.  To start with, uprobe->fixups is
+ * either zero or it reflects rip-related fixups.
+ */
+static void prepare_fixups(struct uprobe *uprobe, struct insn *insn)
+{
+	bool fix_ip = true, fix_call = false;	/* defaults */
+	int reg;
+
+	insn_get_opcode(insn);	/* should be a nop */
+
+	switch (OPCODE1(insn)) {
+	case 0xc3:		/* ret/lret */
+	case 0xcb:
+	case 0xc2:
+	case 0xca:
+		/* ip is correct */
+		fix_ip = false;
+		break;
+	case 0xe8:		/* call relative - Fix return addr */
+		fix_call = true;
+		break;
+	case 0x9a:		/* call absolute - Fix return addr, not ip */
+		fix_call = true;
+		fix_ip = false;
+		break;
+	case 0xff:
+		insn_get_modrm(insn);
+		reg = MODRM_REG(insn);
+		if (reg == 2 || reg == 3) {
+			/* call or lcall, indirect */
+			/* Fix return addr; ip is correct. */
+			fix_call = true;
+			fix_ip = false;
+		} else if (reg == 4 || reg == 5) {
+			/* jmp or ljmp, indirect */
+			/* ip is correct. */
+			fix_ip = false;
+		}
+		break;
+	case 0xea:		/* jmp absolute -- ip is correct */
+		fix_ip = false;
+		break;
+	default:
+		break;
+	}
+	if (fix_ip)
+		uprobe->fixups |= UPROBES_FIX_IP;
+	if (fix_call)
+		uprobe->fixups |= UPROBES_FIX_CALL;
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * If uprobe->insn doesn't use rip-relative addressing, return
+ * immediately.  Otherwise, rewrite the instruction so that it accesses
+ * its memory operand indirectly through a scratch register.  Set
+ * uprobe->fixups and uprobe->arch_info.rip_rela_target_address
+ * accordingly.  (The contents of the scratch register will be saved
+ * before we single-step the modified instruction, and restored
+ * afterward.)
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area.  At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ * - There's always a modrm byte.
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ */
+static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
+							struct insn *insn)
+{
+	u8 *cursor;
+	u8 reg;
+
+	if (mm->context.ia32_compat)
+		return;
+
+	uprobe->arch_info.rip_rela_target_address = 0x0;
+	if (!insn_rip_relative(insn))
+		return;
+
+	/*
+	 * Point cursor at the modrm byte.  The next 4 bytes are the
+	 * displacement.  Beyond the displacement, for some instructions,
+	 * is the immediate operand.
+	 */
+	cursor = uprobe->insn + insn->prefixes.nbytes
+			+ insn->rex_prefix.nbytes + insn->opcode.nbytes;
+	insn_get_length(insn);
+
+	/*
+	 * Convert from rip-relative addressing to indirect addressing
+	 * via a scratch register.  Change the r/m field from 0x5 (%rip)
+	 * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
+	 */
+	reg = MODRM_REG(insn);
+	if (reg == 0) {
+		/*
+		 * The register operand (if any) is either the A register
+		 * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
+		 * REX prefix) %r8.  In any case, we know the C register
+		 * is NOT the register operand, so we use %rcx (register
+		 * #1) for the scratch register.
+		 */
+		uprobe->fixups = UPROBES_FIX_RIP_CX;
+		/* Change modrm from 00 000 101 to 00 000 001. */
+		*cursor = 0x1;
+	} else {
+		/* Use %rax (register #0) for the scratch register. */
+		uprobe->fixups = UPROBES_FIX_RIP_AX;
+		/* Change modrm from 00 xxx 101 to 00 xxx 000 */
+		*cursor = (reg << 3);
+	}
+
+	/* Target address = address of next instruction + (signed) offset */
+	uprobe->arch_info.rip_rela_target_address = (long)insn->length
+					+ insn->displacement.value;
+	/* Displacement field is gone; slide immediate field (if any) over. */
+	if (insn->immediate.nbytes) {
+		cursor++;
+		memmove(cursor, cursor + insn->displacement.nbytes,
+						insn->immediate.nbytes);
+	}
+	return;
+}
+
+static int validate_insn_64bits(struct uprobe *uprobe, struct insn *insn)
+{
+	insn_init(insn, uprobe->insn, true);
+
+	/* Skip good instruction prefixes; reject "bad" ones. */
+	insn_get_opcode(insn);
+	if (is_prefix_bad(insn))
+		return -ENOTSUPP;
+	if (test_bit(OPCODE1(insn), (unsigned long *)good_insns_64))
+		return 0;
+	if (insn->opcode.nbytes == 2) {
+		if (test_bit(OPCODE2(insn), (unsigned long *)good_2byte_insns))
+			return 0;
+	}
+	return -ENOTSUPP;
+}
+
+static int validate_insn_bits(struct mm_struct *mm, struct uprobe *uprobe,
+				struct insn *insn)
+{
+	if (mm->context.ia32_compat)
+		return validate_insn_32bits(uprobe, insn);
+	return validate_insn_64bits(uprobe, insn);
+}
+#else
+static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
+							struct insn *insn)
+{
+	return;
+}
+
+static int validate_insn_bits(struct mm_struct *mm, struct uprobe *uprobe,
+				struct insn *insn)
+{
+	return validate_insn_32bits(uprobe, insn);
+}
+#endif /* CONFIG_X86_64 */
+
+/**
+ * analyze_insn - instruction analysis including validity and fixups.
+ * @mm: the probed address space.
+ * @uprobe: the probepoint information.
+ * Return 0 on success or a -ve number on error.
+ */
+int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe)
+{
+	int ret;
+	struct insn insn;
+
+	uprobe->fixups = 0;
+	ret = validate_insn_bits(mm, uprobe, &insn);
+	if (ret != 0)
+		return ret;
+	handle_riprel_insn(mm, uprobe, &insn);
+	prepare_fixups(uprobe, &insn);
+	return 0;
+}
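
Note on the good_insns tables in this patch: each W() row packs the 16
flags of one opcode row, two rows are OR-ed into one u32, and the
validators look up OPCODE1(insn) with test_bit(). A minimal userspace
sketch of that one-bit-per-opcode packing, assuming hypothetical helpers
mark_good() and opcode_is_good() that are not part of the patch. The
sketch indexes 32-bit words directly instead of casting the array to
unsigned long * as the patch does.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bit per opcode: 256 opcodes packed into eight 32-bit words. */
#define OPCODE_WORDS (256 / 32)

/* Hypothetical helper: set the bit for one opcode in a table. */
static void mark_good(uint32_t table[OPCODE_WORDS], uint8_t opcode)
{
	table[opcode / 32] |= (uint32_t)1 << (opcode % 32);
}

/* Hypothetical stand-in for test_bit(): is this opcode's bit set? */
static bool opcode_is_good(const uint32_t table[OPCODE_WORDS], uint8_t opcode)
{
	return (table[opcode / 32] >> (opcode % 32)) & 1u;
}
```

With a zeroed table in which only 0x90 (nop) and 0xc3 (ret) are marked,
opcode_is_good() accepts those two opcodes and rejects 0xcc (int3).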

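Note on handle_riprel_insn() in this patch: the rip-relative modrm byte
(mod = 00, r/m = 101) is rewritten so the operand is reached through a
scratch register. A minimal sketch of just that byte rewrite, assuming
hypothetical helper names modrm_reg() and riprel_to_scratch(); it does
not cover the displacement removal or the register save/restore.

```c
#include <stdint.h>

/* Extract the 3-bit reg field (bits 5:3) of a ModRM byte. */
static uint8_t modrm_reg(uint8_t modrm)
{
	return (modrm >> 3) & 0x7;
}

/*
 * Rewrite a rip-relative ModRM byte (mod = 00, r/m = 101).
 * Follows the rule in handle_riprel_insn(): use %rcx (r/m = 001) when
 * the reg field is 0, otherwise use %rax (r/m = 000) and keep reg.
 */
static uint8_t riprel_to_scratch(uint8_t modrm)
{
	uint8_t reg = modrm_reg(modrm);

	if (reg == 0)
		return 0x01;		/* 00 000 101 -> 00 000 001 */
	return (uint8_t)(reg << 3);	/* 00 xxx 101 -> 00 xxx 000 */
}
```

For example, modrm 0x05 (reg = 0) becomes 0x01 and modrm 0x0d (reg = 1)
becomes 0x08.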


* [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 8/30] x86: analyze instruction and determine fixups Srikar Dronamraju
@ 2011-11-18 11:08 ` Srikar Dronamraju
  2011-11-25 14:29   ` Peter Zijlstra
                     ` (3 more replies)
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 10/30] x86: Set instruction pointer Srikar Dronamraju
                   ` (22 subsequent siblings)
  31 siblings, 4 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:08 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Provides background page replacement, which works as follows:
 - COW the page that needs replacement.
 - modify a copy of the COWed page.
 - replace the COWed page with the modified copy.
 - flush the page tables.

Also provides additional routines to read an opcode from a given virtual
address and to verify whether an instruction is a breakpoint instruction.
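
The copy-and-modify step can be illustrated in userspace. This is only a
sketch under stated assumptions: a 4096-byte page, plain malloc() in
place of alloc_page_vma(), and hypothetical helper names
copy_page_with_opcode() and sketch_is_bkpt_insn(). It does not model the
pte replacement or the flush.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed page size for the sketch */
#define SKETCH_BKPT_INSN 0xcc	/* x86 int3, same value as UPROBES_BKPT_INSN */

/*
 * Hypothetical helper: return a modified private copy of old_page with
 * one opcode byte replaced at the in-page offset of vaddr.  The caller
 * owns the returned buffer; old_page is left untouched.
 */
static uint8_t *copy_page_with_opcode(const uint8_t *old_page,
				      unsigned long vaddr, uint8_t opcode)
{
	unsigned long offset = vaddr & (SKETCH_PAGE_SIZE - 1);
	uint8_t *new_page = malloc(SKETCH_PAGE_SIZE);

	if (!new_page)
		return NULL;
	memcpy(new_page, old_page, SKETCH_PAGE_SIZE);
	new_page[offset] = opcode;
	return new_page;
}

/* Same check as the default is_bkpt_insn(): compare the first byte. */
static int sketch_is_bkpt_insn(const uint8_t *insn)
{
	return insn[0] == SKETCH_BKPT_INSN;
}
```

Only the copy carries the breakpoint byte; the original page still holds
the original instruction, which is why other mappings of the file are
not affected.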

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- pass NULL to get_user_pages for the task parameter.
- call SetPageUptodate on the new page allocated in write_opcode.
- fix leaking a reference to the new page under certain conditions.

 include/linux/uprobes.h |    2 
 kernel/uprobes.c        |  264 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 258 insertions(+), 8 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 44f28dc..bc1f190 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -29,6 +29,7 @@ struct vm_area_struct;
 #ifdef CONFIG_ARCH_SUPPORTS_UPROBES
 #include <asm/uprobes.h>
 #else
+typedef u8 uprobe_opcode_t;
 struct uprobe_arch_info {};
 #define MAX_UINSN_BYTES 4
 #endif
@@ -74,6 +75,7 @@ extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
 							unsigned long vaddr);
 extern int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe,
 					unsigned long vaddr, bool verify);
+extern bool __weak is_bkpt_insn(u8 *insn);
 extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index f4574fd..1acf020 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -26,6 +26,9 @@
 #include <linux/pagemap.h>	/* read_mapping_page */
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/rmap.h>		/* anon_vma_prepare */
+#include <linux/mmu_notifier.h>	/* set_pte_at_notify */
+#include <linux/swap.h>		/* try_to_free_swap */
 #include <linux/uprobes.h>
 
 static struct rb_root uprobes_tree = RB_ROOT;
@@ -83,18 +86,248 @@ static bool valid_vma(struct vm_area_struct *vma, bool is_reg)
 	return false;
 }
 
+/**
+ * __replace_page - replace page in vma by new page.
+ * based on replace_page in mm/ksm.c
+ *
+ * @vma:      vma that holds the pte pointing to page
+ * @page:     the cowed page we are replacing by kpage
+ * @kpage:    the modified page we replace page by
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int __replace_page(struct vm_area_struct *vma, struct page *page,
+					struct page *kpage)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int err = -EFAULT;
+
+	addr = page_address_in_vma(page, vma);
+	if (addr == -EFAULT)
+		goto out;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out;
+
+	get_page(kpage);
+	page_add_new_anon_rmap(kpage, vma, addr);
+
+	flush_cache_page(vma, addr, pte_pfn(*ptep));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+
+	page_remove_rmap(page);
+	if (!page_mapped(page))
+		try_to_free_swap(page);
+	put_page(page);
+	pte_unmap_unlock(ptep, ptl);
+	err = 0;
+
+out:
+	return err;
+}
+
+/*
+ * NOTE:
+ * Expect the breakpoint instruction to be the smallest size instruction for
+ * the architecture. If an arch has variable length instruction and the
+ * breakpoint instruction is not of the smallest length instruction
+ * supported by that architecture then we need to modify read_opcode /
+ * write_opcode accordingly. This would never be a problem for archs that
+ * have fixed length instructions.
+ */
+
+/*
+ * write_opcode - write the opcode at a given virtual address.
+ * @mm: the probed process address space.
+ * @uprobe: the breakpointing information.
+ * @vaddr: the virtual address to store the opcode.
+ * @opcode: opcode to be written at @vaddr.
+ *
+ * Called with mm->mmap_sem held (for read and with a reference to
+ * mm).
+ *
+ * For mm @mm, write the opcode at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+static int write_opcode(struct mm_struct *mm, struct uprobe *uprobe,
+			unsigned long vaddr, uprobe_opcode_t opcode)
+{
+	struct page *old_page, *new_page;
+	struct address_space *mapping;
+	void *vaddr_old, *vaddr_new;
+	struct vm_area_struct *vma;
+	unsigned long addr;
+	int ret;
+
+	/* Read the page with vaddr into memory */
+	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &old_page, &vma);
+	if (ret <= 0)
+		return ret;
+	ret = -EINVAL;
+
+	/*
+	 * We are interested in text pages only. Our pages of interest
+	 * should be mapped for read and execute only. We desist from
+	 * adding probes in write mapped pages since the breakpoints
+	 * might end up in the file copy.
+	 */
+	if (!valid_vma(vma, opcode == UPROBES_BKPT_INSN))
+		goto put_out;
+
+	mapping = uprobe->inode->i_mapping;
+	if (mapping != vma->vm_file->f_mapping)
+		goto put_out;
+
+	addr = vma->vm_start + uprobe->offset;
+	addr -= vma->vm_pgoff << PAGE_SHIFT;
+	if (vaddr != (unsigned long)addr)
+		goto put_out;
+
+	ret = -ENOMEM;
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
+	if (!new_page)
+		goto put_out;
+
+	__SetPageUptodate(new_page);
+
+	/*
+	 * lock page will serialize against do_wp_page()'s
+	 * PageAnon() handling
+	 */
+	lock_page(old_page);
+	/* copy the page now that we've got it stable */
+	vaddr_old = kmap_atomic(old_page);
+	vaddr_new = kmap_atomic(new_page);
+
+	memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
+	/* poke the new insn in, ASSUMES we don't cross page boundary */
+	vaddr &= ~PAGE_MASK;
+	memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
+
+	kunmap_atomic(vaddr_new);
+	kunmap_atomic(vaddr_old);
+
+	ret = anon_vma_prepare(vma);
+	if (ret)
+		goto unlock_out;
+
+	lock_page(new_page);
+	ret = __replace_page(vma, old_page, new_page);
+	unlock_page(new_page);
+
+unlock_out:
+	unlock_page(old_page);
+	page_cache_release(new_page);
+
+put_out:
+	put_page(old_page);	/* we did a get_user_pages in the beginning */
+	return ret;
+}
+
+/**
+ * read_opcode - read the opcode at a given virtual address.
+ * @mm: the probed process address space.
+ * @vaddr: the virtual address to read the opcode.
+ * @opcode: location to store the read opcode.
+ *
+ * Called with mm->mmap_sem held (for read and with a reference to
+ * mm).
+ *
+ * For mm @mm, read the opcode at @vaddr and store it in @opcode.
+ * Return 0 (success) or a negative errno.
+ */
+static int read_opcode(struct mm_struct *mm, unsigned long vaddr,
+						uprobe_opcode_t *opcode)
+{
+	struct page *page;
+	void *vaddr_new;
+	int ret;
+
+	ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &page, NULL);
+	if (ret <= 0)
+		return ret;
+
+	lock_page(page);
+	vaddr_new = kmap_atomic(page);
+	vaddr &= ~PAGE_MASK;
+	memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
+	kunmap_atomic(vaddr_new);
+	unlock_page(page);
+	put_page(page);		/* we did a get_user_pages in the beginning */
+	return 0;
+}
+
+/**
+ * set_bkpt - store breakpoint at a given address.
+ * @mm: the probed process address space.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ *
+ * For mm @mm, store the breakpoint instruction at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
 int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
 						unsigned long vaddr)
 {
-	/* placeholder: yet to be implemented */
-	return 0;
+	return write_opcode(mm, uprobe, vaddr, UPROBES_BKPT_INSN);
 }
 
+/**
+ * set_orig_insn - Restore the original instruction.
+ * @mm: the probed process address space.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ * @verify: if true, verify existence of breakpoint instruction.
+ *
+ * For mm @mm, restore the original opcode at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
 int __weak set_orig_insn(struct mm_struct *mm, struct uprobe *uprobe,
 					unsigned long vaddr, bool verify)
 {
-	/* placeholder: yet to be implemented */
-	return 0;
+	if (verify) {
+		uprobe_opcode_t opcode;
+		int result = read_opcode(mm, vaddr, &opcode);
+
+		if (result)
+			return result;
+
+		if (opcode != UPROBES_BKPT_INSN)
+			return -EINVAL;
+	}
+	return write_opcode(mm, uprobe, vaddr,
+				*(uprobe_opcode_t *)uprobe->insn);
+}
+
+/**
+ * is_bkpt_insn - check if instruction is breakpoint instruction.
+ * @insn: instruction to be checked.
+ * Default implementation of is_bkpt_insn
+ * Returns true if @insn is a breakpoint instruction.
+ */
+bool __weak is_bkpt_insn(u8 *insn)
+{
+	return (insn[0] == UPROBES_BKPT_INSN);
 }
 
 static int match_uprobe(struct uprobe *l, struct uprobe *r)
@@ -329,7 +562,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 				struct vm_area_struct *vma, loff_t vaddr)
 {
 	unsigned long addr;
-	int ret = -EINVAL;
+	int ret;
 
 	/*
 	 * Probe is to be deleted;
@@ -345,7 +578,13 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 		if (ret)
 			return ret;
 
-		/* TODO : Analysis and verification of instruction */
+		if (is_bkpt_insn(uprobe->insn))
+			return -EEXIST;
+
+		ret = analyze_insn(mm, uprobe);
+		if (ret)
+			return ret;
+
 		uprobe->copy = 1;
 	}
 	ret = set_bkpt(mm, uprobe, addr);
@@ -761,12 +1000,21 @@ void munmap_uprobe(struct vm_area_struct *vma)
 	build_probe_list(inode, &tmp_list);
 	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
 		loff_t vaddr;
+		uprobe_opcode_t opcode;
 
 		list_del(&uprobe->pending_list);
 		vaddr = vma->vm_start + uprobe->offset;
 		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
-		if (vaddr >= vma->vm_start && vaddr < vma->vm_end)
-			atomic_dec(&vma->vm_mm->mm_uprobes_count);
+		if (vaddr >= vma->vm_start && vaddr < vma->vm_end) {
+
+			/*
+			 * An unregister could have removed the probe before
+			 * unmap. So check before we decrement the count.
+			 */
+			if (!read_opcode(vma->vm_mm, vaddr, &opcode) &&
+						(opcode == UPROBES_BKPT_INSN))
+				atomic_dec(&vma->vm_mm->mm_uprobes_count);
+		}
 		put_uprobe(uprobe);
 	}
 	mutex_unlock(uprobes_mmap_hash(inode));



* [PATCH v7 3.2-rc2 10/30] x86: Set instruction pointer.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (8 preceding siblings ...)
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement Srikar Dronamraju
@ 2011-11-18 11:08 ` Srikar Dronamraju
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 11/30] x86: Introduce TIF_UPROBE FLAG Srikar Dronamraju
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:08 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Provides x86 specific routine to set the instruction pointer to the
given address.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/uprobes.h |    1 +
 arch/x86/kernel/uprobes.c      |   10 ++++++++++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index f0b4b2b..509c023 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -39,4 +39,5 @@ struct uprobe_arch_info {};
 #endif
 struct uprobe;
 extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
+extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 0be7e67..67b926f 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -397,3 +397,13 @@ int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe)
 	prepare_fixups(uprobe, &insn);
 	return 0;
 }
+
+/*
+ * set_instruction_pointer - set the saved instruction pointer of a task.
+ * @regs: reflects the saved state of the task
+ * @vaddr: the virtual address to jump to.
+ */
+void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
+{
+	regs->ip = vaddr;
+}



* [PATCH v7 3.2-rc2 11/30] x86: Introduce TIF_UPROBE FLAG.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (9 preceding siblings ...)
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 10/30] x86: Set instruction pointer Srikar Dronamraju
@ 2011-11-18 11:08 ` Srikar Dronamraju
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 12/30] uprobes: Handle breakpoint and Singlestep Srikar Dronamraju
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:08 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


On a breakpoint or singlestep exception, the exception notifier just
sets this thread_info flag so that do_notify_resume knows that a
breakpoint or singlestep has occurred.
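
The set-then-consume pattern can be sketched in userspace as follows.
The struct and helper names are hypothetical stand-ins, the updates are
not atomic (unlike the kernel's thread-flag helpers), and clearing the
flag on the consumer side is an assumption that this patch does not
show.

```c
#include <stdbool.h>

#define SKETCH_TIF_UPROBE	12	/* bit number, as TIF_UPROBE here */
#define SKETCH_TIF_UPROBE_MASK	(1UL << SKETCH_TIF_UPROBE)

/* Hypothetical stand-in for the flags word in struct thread_info. */
struct sketch_thread_info {
	unsigned long flags;
};

/* What the exception notifier does: only record that work is pending. */
static void sketch_note_uprobe_event(struct sketch_thread_info *ti)
{
	ti->flags |= SKETCH_TIF_UPROBE_MASK;
}

/*
 * What the return-to-user path could do: if the flag is set, clear it
 * and report that the uprobe handling should run.
 */
static bool sketch_consume_uprobe_event(struct sketch_thread_info *ti)
{
	if (!(ti->flags & SKETCH_TIF_UPROBE_MASK))
		return false;
	ti->flags &= ~SKETCH_TIF_UPROBE_MASK;
	return true;
}
```

Keeping the notifier this small defers the real work to the
return-to-user path.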

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/thread_info.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index a1fe5c1..aeb3e04 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,6 +84,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
+#define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
+#define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)



* [PATCH v7 3.2-rc2 12/30] uprobes: Handle breakpoint and Singlestep
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (10 preceding siblings ...)
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 11/30] x86: Introduce TIF_UPROBE FLAG Srikar Dronamraju
@ 2011-11-18 11:09 ` Srikar Dronamraju
  2011-11-25 15:24   ` Peter Zijlstra
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 13/30] x86: define a x86 specific exception notifier Srikar Dronamraju
                   ` (19 subsequent siblings)
  31 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:09 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Provides routines to create, manage and free the task-specific
information. Uses the bulkref interface.
Adds a hook in uprobe_notify_resume to handle breakpoint and singlestep
exceptions.

Uprobes needs to maintain some task-specific information, including
whether a task has hit a probepoint, the uprobe corresponding to the
probe hit, and the slot to which the original instruction is copied
before single-stepping.
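
The per-task state is tracked with enum uprobe_task_state (UTASK_RUNNING,
UTASK_BP_HIT, UTASK_SSTEP), added to include/linux/uprobes.h by this
patch. A rough sketch of how such a state could advance is given here.
The event names and the transition order RUNNING -> BP_HIT -> SSTEP ->
RUNNING are assumptions inferred from the state names, not a copy of the
code in kernel/uprobes.c.

```c
/* Mirrors enum uprobe_task_state from this patch, with a sketch prefix. */
enum sketch_utask_state {
	SKETCH_UTASK_RUNNING,
	SKETCH_UTASK_BP_HIT,
	SKETCH_UTASK_SSTEP
};

/* Hypothetical events; the names are not taken from the patch. */
enum sketch_utask_event {
	SKETCH_EV_BKPT_TRAP,	/* breakpoint exception at a probepoint */
	SKETCH_EV_XOL_READY,	/* handlers ran, copy is ready to step */
	SKETCH_EV_SSTEP_TRAP	/* singlestep exception after the copy ran */
};

/*
 * Assumed transition order: RUNNING -> BP_HIT -> SSTEP -> RUNNING.
 * Any event that does not fit the current state leaves it unchanged.
 */
static enum sketch_utask_state
sketch_utask_next(enum sketch_utask_state state, enum sketch_utask_event ev)
{
	switch (state) {
	case SKETCH_UTASK_RUNNING:
		if (ev == SKETCH_EV_BKPT_TRAP)
			return SKETCH_UTASK_BP_HIT;
		break;
	case SKETCH_UTASK_BP_HIT:
		if (ev == SKETCH_EV_XOL_READY)
			return SKETCH_UTASK_SSTEP;
		break;
	case SKETCH_UTASK_SSTEP:
		if (ev == SKETCH_EV_SSTEP_TRAP)
			return SKETCH_UTASK_RUNNING;
		break;
	}
	return state;
}
```

In this sketch a stray singlestep trap while the task is in the RUNNING
state leaves the state unchanged.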

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- Use bulkref instead of synchronize_sched
- Introduce per task bulkref_id to store the bulkref_id
- Modified comments.

 include/linux/sched.h   |    4 +
 include/linux/uprobes.h |   33 +++++++
 kernel/fork.c           |    6 +
 kernel/uprobes.c        |  208 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 251 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68daf4f..bb274de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1573,6 +1573,10 @@ struct task_struct {
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
 #endif
+#ifdef CONFIG_UPROBES
+	struct uprobe_task *utask;
+	int uprobes_bulkref_id;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bc1f190..0882223 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -70,6 +70,24 @@ struct uprobe {
 	u8			insn[MAX_UINSN_BYTES];
 };
 
+enum uprobe_task_state {
+	UTASK_RUNNING,
+	UTASK_BP_HIT,
+	UTASK_SSTEP
+};
+
+/*
+ * uprobe_task: Metadata of a task while it singlesteps.
+ */
+struct uprobe_task {
+	unsigned long xol_vaddr;
+	unsigned long vaddr;
+
+	enum uprobe_task_state state;
+
+	struct uprobe *active_uprobe;
+};
+
 #ifdef CONFIG_UPROBES
 extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
 							unsigned long vaddr);
@@ -80,8 +98,13 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
+extern void free_uprobe_utask(struct task_struct *tsk);
 extern int mmap_uprobe(struct vm_area_struct *vma);
 extern void munmap_uprobe(struct vm_area_struct *vma);
+extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
+extern int uprobe_post_notifier(struct pt_regs *regs);
+extern int uprobe_bkpt_notifier(struct pt_regs *regs);
+extern void uprobe_notify_resume(struct pt_regs *regs);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
@@ -99,5 +122,15 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
 static inline void munmap_uprobe(struct vm_area_struct *vma)
 {
 }
+static inline void uprobe_notify_resume(struct pt_regs *regs)
+{
+}
+static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+	return 0;
+}
+static inline void free_uprobe_utask(struct task_struct *tsk)
+{
+}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index c8c287a..a03f436 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -686,6 +686,8 @@ void mm_release(struct task_struct *tsk, struct mm_struct *mm)
 		exit_pi_state_list(tsk);
 #endif
 
+	free_uprobe_utask(tsk);
+
 	/* Get rid of any cached register state */
 	deactivate_mm(tsk, mm);
 
@@ -1284,6 +1286,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	INIT_LIST_HEAD(&p->pi_state_list);
 	p->pi_state_cache = NULL;
 #endif
+#ifdef CONFIG_UPROBES
+	p->utask = NULL;
+	p->uprobes_bulkref_id = -1;
+#endif
 	/*
 	 * sigaltstack should be cleared when sharing the same VM
 	 */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 1acf020..9789b65 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -29,8 +29,10 @@
 #include <linux/rmap.h>		/* anon_vma_prepare */
 #include <linux/mmu_notifier.h>	/* set_pte_at_notify */
 #include <linux/swap.h>		/* try_to_free_swap */
+#include <linux/ptrace.h>	/* user_enable_single_step */
 #include <linux/uprobes.h>
 
+static bulkref_t uprobes_srcu;
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
 
@@ -465,6 +467,21 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_consumer *consumer;
+
+	down_read(&uprobe->consumer_rwsem);
+	consumer = uprobe->consumers;
+	for (consumer = uprobe->consumers; consumer;
+					consumer = consumer->next) {
+		if (!consumer->filter ||
+				consumer->filter(consumer, current))
+			consumer->handler(consumer, regs);
+	}
+	up_read(&uprobe->consumer_rwsem);
+}
+
 /* Returns the previous consumer */
 static struct uprobe_consumer *add_consumer(struct uprobe *uprobe,
 				struct uprobe_consumer *consumer)
@@ -601,10 +618,21 @@ static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 		atomic_dec(&mm->mm_uprobes_count);
 }
 
+/*
+ * There could be threads that have hit the breakpoint and are entering the
+ * notifier code and trying to acquire the uprobes_treelock. The thread
+ * calling delete_uprobe() that is removing the uprobe from the rb_tree can
+ * race with these threads and might acquire the uprobes_treelock before
+ * some of the breakpoint hit threads. In such a case, the breakpoint hit
+ * threads will not find the uprobe. Hence wait till the current breakpoint
+ * hit threads acquire the uprobes_treelock before the uprobe is removed
+ * from the rbtree.
+ */
 static void delete_uprobe(struct uprobe *uprobe)
 {
 	unsigned long flags;
 
+	bulkref_wait_old(&uprobes_srcu);
 	spin_lock_irqsave(&uprobes_treelock, flags);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
 	spin_unlock_irqrestore(&uprobes_treelock, flags);
@@ -1022,6 +1050,185 @@ void munmap_uprobe(struct vm_area_struct *vma)
 	return;
 }
 
+/**
+ * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
+ * @regs: Reflects the saved state of the task after it has hit a breakpoint
+ * instruction.
+ * Return the address of the breakpoint instruction.
+ */
+unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+	return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
+}
+
+/*
+ * Called with no locks held.
+ * Called in context of an exiting or an exec-ing thread.
+ */
+void free_uprobe_utask(struct task_struct *tsk)
+{
+	struct uprobe_task *utask = tsk->utask;
+
+	if (tsk->uprobes_bulkref_id != -1)
+		bulkref_put(&uprobes_srcu, tsk->uprobes_bulkref_id);
+
+	if (!utask)
+		return;
+
+	if (utask->active_uprobe)
+		put_uprobe(utask->active_uprobe);
+
+	kfree(utask);
+	tsk->utask = NULL;
+}
+
+/*
+ * Allocate a uprobe_task object for the task.
+ * Called when the thread hits a breakpoint for the first time.
+ *
+ * Returns:
+ * - pointer to new uprobe_task on success
+ * - NULL otherwise
+ */
+static struct uprobe_task *add_utask(void)
+{
+	struct uprobe_task *utask;
+
+	utask = kzalloc(sizeof *utask, GFP_KERNEL);
+	if (unlikely(utask == NULL))
+		return NULL;
+
+	utask->active_uprobe = NULL;
+	current->utask = utask;
+	return utask;
+}
+
+/* Prepare to single-step probed instruction out of line. */
+static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+				unsigned long vaddr)
+{
+	/* TODO: Yet to be implemented */
+	return -EFAULT;
+}
+
+/*
+ * Verify from Instruction Pointer if singlestep has indeed occurred.
+ * If Singlestep has occurred, then do post singlestep fix-ups.
+ */
+static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	/* TODO: Yet to be implemented */
+	return false;
+}
+
+/*
+ * uprobe_notify_resume gets called in task context just before returning
+ * to userspace.
+ *
+ *  If it's the first time the probepoint is hit, slot gets allocated here.
+ *  If it's the first time the thread hit a breakpoint, utask gets
+ *  allocated here.
+ */
+void uprobe_notify_resume(struct pt_regs *regs)
+{
+	struct vm_area_struct *vma;
+	struct uprobe_task *utask;
+	struct mm_struct *mm;
+	struct uprobe *u = NULL;
+	unsigned long probept;
+
+	utask = current->utask;
+	mm = current->mm;
+	if (!utask || utask->state == UTASK_BP_HIT) {
+		probept = get_uprobe_bkpt_addr(regs);
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, probept);
+		if (vma && valid_vma(vma, false))
+			u = find_uprobe(vma->vm_file->f_mapping->host,
+					probept - vma->vm_start +
+					(vma->vm_pgoff << PAGE_SHIFT));
+
+		bulkref_put(&uprobes_srcu, current->uprobes_bulkref_id);
+		current->uprobes_bulkref_id = -1;
+		up_read(&mm->mmap_sem);
+		if (!u)
+			/* No matching uprobe; signal SIGTRAP. */
+			goto cleanup_ret;
+		if (!utask) {
+			utask = add_utask();
+			/* Cannot Allocate; re-execute the instruction. */
+			if (!utask)
+				goto cleanup_ret;
+		}
+		utask->active_uprobe = u;
+		handler_chain(u, regs);
+		utask->state = UTASK_SSTEP;
+		if (!pre_ssout(u, regs, probept))
+			user_enable_single_step(current);
+		else
+			/* Cannot Singlestep; re-execute the instruction. */
+			goto cleanup_ret;
+	} else if (utask->state == UTASK_SSTEP) {
+		u = utask->active_uprobe;
+		if (sstep_complete(u, regs)) {
+			put_uprobe(u);
+			utask->active_uprobe = NULL;
+			utask->state = UTASK_RUNNING;
+			user_disable_single_step(current);
+		}
+	}
+	return;
+
+cleanup_ret:
+	if (utask) {
+		utask->active_uprobe = NULL;
+		utask->state = UTASK_RUNNING;
+	}
+	if (u) {
+		put_uprobe(u);
+		set_instruction_pointer(regs, probept);
+	} else {
+		/*TODO Return SIGTRAP signal */
+	}
+}
+
+/*
+ * uprobe_bkpt_notifier gets called from interrupt context.
+ * It gets a reference to the probepoint and sets the TIF_UPROBE flag.
+ */
+int uprobe_bkpt_notifier(struct pt_regs *regs)
+{
+	struct uprobe_task *utask;
+
+	if (!current->mm || !atomic_read(&current->mm->mm_uprobes_count))
+		/* task is currently not uprobed */
+		return 0;
+
+	utask = current->utask;
+	if (utask)
+		utask->state = UTASK_BP_HIT;
+
+	set_thread_flag(TIF_UPROBE);
+	current->uprobes_bulkref_id = bulkref_get(&uprobes_srcu);
+	return 1;
+}
+
+/*
+ * uprobe_post_notifier gets called in interrupt context.
+ * It completes the single step operation.
+ */
+int uprobe_post_notifier(struct pt_regs *regs)
+{
+	struct uprobe_task *utask = current->utask;
+
+	if (!current->mm || !utask || !utask->active_uprobe)
+		/* task is currently not uprobed */
+		return 0;
+
+	set_thread_flag(TIF_UPROBE);
+	return 1;
+}
+
 static int __init init_uprobes(void)
 {
 	int i;
@@ -1030,6 +1237,7 @@ static int __init init_uprobes(void)
 		mutex_init(&uprobes_mutex[i]);
 		mutex_init(&uprobes_mmap_mutex[i]);
 	}
+	init_bulkref(&uprobes_srcu);
 	return 0;
 }
 



* [PATCH v7 3.2-rc2 13/30] x86: define a x86 specific exception notifier.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (11 preceding siblings ...)
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 12/30] uprobes: Handle breakpoint and Singlestep Srikar Dronamraju
@ 2011-11-18 11:09 ` Srikar Dronamraju
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 14/30] uprobe: register " Srikar Dronamraju
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:09 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Uprobes uses the notifier mechanism to gain control when an application
encounters a breakpoint or a singlestep exception.
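
The dispatch this describes can be sketched in user space roughly as
follows. This is a simplified model, not the kernel notifier API: every
model_* name is hypothetical, and the fields are stand-ins for
user_mode_vm(), mm_uprobes_count, utask->active_uprobe and TIF_UPROBE.

```c
#include <assert.h>

enum model_event { MODEL_INT3, MODEL_DEBUG, MODEL_OTHER };
enum model_verdict { MODEL_NOTIFY_DONE, MODEL_NOTIFY_STOP };

/* Hypothetical stand-in for the bits of task state the notifier reads. */
struct model_task {
	int user_mode;    /* trap was taken from user space */
	int probes_in_mm; /* number of probes installed in this mm */
	int stepping;     /* a probed instruction is being single-stepped */
	int work_pending; /* ask the return-to-user path to do the work */
};

/*
 * Claim the exception (STOP) only when it belongs to uprobes; otherwise
 * let other handlers on the chain see it (DONE).
 */
static enum model_verdict model_exception_notify(struct model_task *t,
						 enum model_event ev)
{
	if (!t->user_mode)
		return MODEL_NOTIFY_DONE;

	switch (ev) {
	case MODEL_INT3:
		if (t->probes_in_mm > 0) {
			t->work_pending = 1;
			return MODEL_NOTIFY_STOP;
		}
		break;
	case MODEL_DEBUG:
		if (t->stepping) {
			t->work_pending = 1;
			return MODEL_NOTIFY_STOP;
		}
		break;
	default:
		break;
	}
	return MODEL_NOTIFY_DONE;
}
```

Deferring the heavy work to a flag checked on the way back to user space
keeps the exception path short, which appears to be the intent of the
TIF_UPROBE hook added to do_notify_resume() in this patch.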

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- No longer do an i386-specific enable of interrupts. (It's now part of
  another patchset posted separately.)

 arch/x86/include/asm/uprobes.h |    4 ++++
 arch/x86/kernel/signal.c       |    6 ++++++
 arch/x86/kernel/uprobes.c      |   29 +++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 509c023..19a5949 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -23,6 +23,8 @@
  *	Jim Keniston
  */
 
+#include <linux/notifier.h>
+
 typedef u8 uprobe_opcode_t;
 #define MAX_UINSN_BYTES 16
 #define UPROBES_XOL_SLOT_BYTES	128	/* to keep it cache aligned */
@@ -40,4 +42,6 @@ struct uprobe_arch_info {};
 struct uprobe;
 extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
 extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
+extern int uprobe_exception_notify(struct notifier_block *self,
+				       unsigned long val, void *data);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 54ddaeb..4fdf470 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/personality.h>
 #include <linux/uaccess.h>
 #include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
 
 #include <asm/processor.h>
 #include <asm/ucontext.h>
@@ -820,6 +821,11 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		mce_notify_process();
 #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
 
+	if (thread_info_flags & _TIF_UPROBE) {
+		clear_thread_flag(TIF_UPROBE);
+		uprobe_notify_resume(regs);
+	}
+
 	/* deal with pending signal delivery */
 	if (thread_info_flags & _TIF_SIGPENDING)
 		do_signal(regs);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 67b926f..2ee5ddc 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -407,3 +407,32 @@ void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
 {
 	regs->ip = vaddr;
 }
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int uprobe_exception_notify(struct notifier_block *self,
+				       unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+	int ret = NOTIFY_DONE;
+
+	/* We are only interested in userspace traps */
+	if (regs && !user_mode_vm(regs))
+		return NOTIFY_DONE;
+
+	switch (val) {
+	case DIE_INT3:
+		/* Run your handler here */
+		if (uprobe_bkpt_notifier(regs))
+			ret = NOTIFY_STOP;
+		break;
+	case DIE_DEBUG:
+		if (uprobe_post_notifier(regs))
+			ret = NOTIFY_STOP;
+	default:
+		break;
+	}
+	return ret;
+}



* [PATCH v7 3.2-rc2 14/30] uprobe: register exception notifier
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (12 preceding siblings ...)
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 13/30] x86: define a x86 specific exception notifier Srikar Dronamraju
@ 2011-11-18 11:09 ` Srikar Dronamraju
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 15/30] x86: Define x86_64 specific uprobe_task_arch_info structure Srikar Dronamraju
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:09 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Use the notifier mechanism to register the uprobes exception notifier.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/uprobes.c |    8 +++++++-
 1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 9789b65..b9e1932 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -30,6 +30,7 @@
 #include <linux/mmu_notifier.h>	/* set_pte_at_notify */
 #include <linux/swap.h>		/* try_to_free_swap */
 #include <linux/ptrace.h>	/* user_enable_single_step */
+#include <linux/kdebug.h>	/* notifier mechanism */
 #include <linux/uprobes.h>
 
 static bulkref_t uprobes_srcu;
@@ -1229,6 +1230,11 @@ int uprobe_post_notifier(struct pt_regs *regs)
 	return 1;
 }
 
+struct notifier_block uprobe_exception_nb = {
+	.notifier_call = uprobe_exception_notify,
+	.priority = INT_MAX - 1,	/* notified after kprobes, kgdb */
+};
+
 static int __init init_uprobes(void)
 {
 	int i;
@@ -1238,7 +1244,7 @@ static int __init init_uprobes(void)
 		mutex_init(&uprobes_mmap_mutex[i]);
 	}
 	init_bulkref(&uprobes_srcu);
-	return 0;
+	return register_die_notifier(&uprobe_exception_nb);
 }
 
 static void __exit exit_uprobes(void)



* [PATCH v7 3.2-rc2 15/30] x86: Define x86_64 specific uprobe_task_arch_info structure
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (13 preceding siblings ...)
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 14/30] uprobe: register " Srikar Dronamraju
@ 2011-11-18 11:09 ` Srikar Dronamraju
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 16/30] uprobes: Introduce " Srikar Dronamraju
                   ` (16 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:09 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


On x86_64, uprobes needs to handle RIP-relative instructions, which
requires saving and restoring a scratch register.
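
The save/restore idea can be sketched in user space as below. This is an
illustrative model only: the model_* names are hypothetical, and it
assumes the scratch register is one the probed instruction does not
otherwise use, with target_offset standing in for the per-probe
rip_rela_target_address value.

```c
#include <assert.h>

/* Hypothetical, trimmed-down register file. */
struct model_regs {
	unsigned long ip;
	unsigned long ax;
};

/* Hypothetical per-task state, loosely after uprobe_task_arch_info. */
struct model_task {
	unsigned long vaddr;     /* address of the probed instruction */
	unsigned long xol_vaddr; /* address of the out-of-line copy */
	unsigned long saved_scratch_register;
};

/*
 * Before the step: park the scratch register and load it with the
 * address the original instruction would have computed from rip.
 */
static void model_pre_step(struct model_regs *regs, struct model_task *t,
			   unsigned long target_offset)
{
	regs->ip = t->xol_vaddr;
	t->saved_scratch_register = regs->ax;
	regs->ax = t->vaddr + target_offset;
}

/* After the step: put the application's register value back. */
static void model_post_step(struct model_regs *regs,
			    const struct model_task *t)
{
	regs->ax = t->saved_scratch_register;
}
```

The per-task storage is what makes this safe when several threads hit
the same probe at once, since each one parks its own register value.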

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/uprobes.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 19a5949..cf794bf 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -36,8 +36,13 @@ typedef u8 uprobe_opcode_t;
 struct uprobe_arch_info {
 	unsigned long rip_rela_target_address;
 };
+
+struct uprobe_task_arch_info {
+	unsigned long saved_scratch_register;
+};
 #else
 struct uprobe_arch_info {};
+struct uprobe_task_arch_info {};
 #endif
 struct uprobe;
 extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);



* [PATCH v7 3.2-rc2 16/30] uprobes: Introduce uprobe_task_arch_info structure.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (14 preceding siblings ...)
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 15/30] x86: Define x86_64 specific uprobe_task_arch_info structure Srikar Dronamraju
@ 2011-11-18 11:09 ` Srikar Dronamraju
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 17/30] x86: arch specific hooks for pre/post singlestep handling Srikar Dronamraju
                   ` (15 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:09 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


The uprobe_task_arch_info structure helps save and restore
architecture-specific state at probe hit, singlestep, and
original-instruction restore time.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0882223..c1378a9 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -31,6 +31,7 @@ struct vm_area_struct;
 #else
 typedef u8 uprobe_opcode_t;
 struct uprobe_arch_info {};
+struct uprobe_task_arch_info {};	/* arch specific task info */
 #define MAX_UINSN_BYTES 4
 #endif
 
@@ -84,6 +85,7 @@ struct uprobe_task {
 	unsigned long vaddr;
 
 	enum uprobe_task_state state;
+	struct uprobe_task_arch_info tskinfo;
 
 	struct uprobe *active_uprobe;
 };



* [PATCH v7 3.2-rc2 17/30] x86: arch specific hooks for pre/post singlestep handling.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (15 preceding siblings ...)
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 16/30] uprobes: Introduce " Srikar Dronamraju
@ 2011-11-18 11:09 ` Srikar Dronamraju
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 18/30] uprobes: slot allocation Srikar Dronamraju
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:09 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Adds the x86 hooks pre_xol() and post_xol(), which run before and after
single-stepping the out-of-line copy of the probed instruction.
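
The address fix-up done after the single-step can be illustrated with a
small user-space calculation. The helpers below are hypothetical and
only mirror the arithmetic in post_xol(): the correction is the distance
from the copy back to the original, plus 4 bytes when a rip-relative
displacement was squeezed out of the copy.

```c
#include <assert.h>

/*
 * Distance to add to an address computed relative to the out-of-line
 * copy so that it becomes relative to the original instruction.
 * Assumes the usual two's-complement wraparound for the cast.
 */
static long model_correction(unsigned long vaddr, unsigned long xol_vaddr,
			     int riprel)
{
	long correction = (long)(vaddr - xol_vaddr);

	if (riprel)
		correction += 4; /* copy is 4 bytes shorter than original */
	return correction;
}

/* Apply the correction to an instruction pointer or return address. */
static unsigned long model_fixup(unsigned long addr, long correction)
{
	return addr + (unsigned long)correction;
}
```

For example, with the original instruction at 0x400500 and the copy at
0x7fff0000, a 5-byte call leaves ip at 0x7fff0005 and the fix-up yields
0x400505; a 7-byte rip-relative instruction runs as a 3-byte copy,
leaves ip at 0x7fff0003, and the fix-up yields 0x400507.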

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/uprobes.h |    2 +
 arch/x86/kernel/uprobes.c      |  135 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 137 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index cf794bf..99d7d4b 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -47,6 +47,8 @@ struct uprobe_task_arch_info {};
 struct uprobe;
 extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
 extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
+extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
 extern int uprobe_exception_notify(struct notifier_block *self,
 				       unsigned long val, void *data);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 2ee5ddc..0792fc8 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -25,6 +25,7 @@
 #include <linux/sched.h>
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
+#include <linux/uaccess.h>
 
 #include <linux/kdebug.h>
 #include <asm/insn.h>
@@ -409,6 +410,140 @@ void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
 }
 
 /*
+ * pre_xol - prepare to execute out of line.
+ * @uprobe: the probepoint information.
+ * @regs: reflects the saved user state of the current task.
+ *
+ * If we're emulating a rip-relative instruction, save the contents
+ * of the scratch register and store the target address in that register.
+ *
+ * Returns 0.
+ */
+#ifdef CONFIG_X86_64
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
+
+	regs->ip = current->utask->xol_vaddr;
+	if (uprobe->fixups & UPROBES_FIX_RIP_AX) {
+		tskinfo->saved_scratch_register = regs->ax;
+		regs->ax = current->utask->vaddr;
+		regs->ax += uprobe->arch_info.rip_rela_target_address;
+	} else if (uprobe->fixups & UPROBES_FIX_RIP_CX) {
+		tskinfo->saved_scratch_register = regs->cx;
+		regs->cx = current->utask->vaddr;
+		regs->cx += uprobe->arch_info.rip_rela_target_address;
+	}
+	return 0;
+}
+#else
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	regs->ip = current->utask->xol_vaddr;
+	return 0;
+}
+#endif
+
+/*
+ * Called by post_xol() to adjust the return address pushed by a call
+ * instruction executed out of line.
+ */
+static int adjust_ret_addr(unsigned long sp, long correction)
+{
+	int rasize, ncopied;
+	long ra = 0;
+
+	if (is_32bit_app(current))
+		rasize = 4;
+	else
+		rasize = 8;
+
+	ncopied = copy_from_user(&ra, (void __user *)sp, rasize);
+	if (unlikely(ncopied))
+		return -EFAULT;
+
+	ra += correction;
+	ncopied = copy_to_user((void __user *)sp, &ra, rasize);
+	if (unlikely(ncopied))
+		return -EFAULT;
+
+	return 0;
+}
+
+#ifdef CONFIG_X86_64
+static bool is_riprel_insn(struct uprobe *uprobe)
+{
+	return ((uprobe->fixups &
+			(UPROBES_FIX_RIP_AX | UPROBES_FIX_RIP_CX)) != 0);
+}
+
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+			struct pt_regs *regs, long *correction)
+{
+	if (is_riprel_insn(uprobe)) {
+		struct uprobe_task_arch_info *tskinfo;
+		tskinfo = &current->utask->tskinfo;
+
+		if (uprobe->fixups & UPROBES_FIX_RIP_AX)
+			regs->ax = tskinfo->saved_scratch_register;
+		else
+			regs->cx = tskinfo->saved_scratch_register;
+		/*
+		 * The original instruction includes a displacement, and so
+		 * is 4 bytes longer than what we've just single-stepped.
+		 * Fall through to handle stuff like "jmpq *...(%rip)" and
+		 * "callq *...(%rip)".
+		 */
+		*correction += 4;
+	}
+}
+#else
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+			struct pt_regs *regs, long *correction)
+{
+}
+#endif
+
+/*
+ * Called after single-stepping. To avoid the SMP problems that can
+ * occur when we temporarily put back the original opcode to
+ * single-step, we single-stepped a copy of the instruction.
+ *
+ * This function prepares to resume execution after the single-step.
+ * We have to fix things up as follows:
+ *
+ * Typically, the new ip is relative to the copied instruction.  We need
+ * to make it relative to the original instruction (FIX_IP).  Exceptions
+ * are return instructions and absolute or indirect jump or call instructions.
+ *
+ * If the single-stepped instruction was a call, the return address that
+ * is atop the stack is the address following the copied instruction.  We
+ * need to make it the address following the original instruction (FIX_CALL).
+ *
+ * If the original instruction was a rip-relative instruction such as
+ * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
+ * instruction using a scratch register -- e.g., "movl %edx,(%rax)".
+ * We need to restore the contents of the scratch register and adjust
+ * the ip, keeping in mind that the instruction we executed is 4 bytes
+ * shorter than the original instruction (since we squeezed out the offset
+ * field).  (FIX_RIP_AX or FIX_RIP_CX)
+ */
+int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_task *utask = current->utask;
+	int result = 0;
+	long correction;
+
+	correction = (long)(utask->vaddr - utask->xol_vaddr);
+	handle_riprel_post_xol(uprobe, regs, &correction);
+	if (uprobe->fixups & UPROBES_FIX_IP)
+		regs->ip += correction;
+	if (uprobe->fixups & UPROBES_FIX_CALL)
+		result = adjust_ret_addr(regs->sp, correction);
+	return result;
+}
+
+/*
  * Wrapper routine for handling exceptions.
  */
 int uprobe_exception_notify(struct notifier_block *self,



* [PATCH v7 3.2-rc2 18/30] uprobes: slot allocation.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (16 preceding siblings ...)
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 17/30] x86: arch specific hooks for pre/post singlestep handling Srikar Dronamraju
@ 2011-11-18 11:10 ` Srikar Dronamraju
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 19/30] tracing: modify is_delete, is_return from ints to bool Srikar Dronamraju
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:10 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


One page of slots is allocated per mm.
On a probe hit, one free slot is acquired; it is released after the
singlestep operation completes.
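
A single-threaded user-space sketch of such a slot pool is shown below.
It is an approximation under stated assumptions: the model_* and MODEL_*
names are hypothetical, a 4096-byte page is assumed, and where the
kernel code uses atomic bit operations and sleeps on a wait queue when
the page is full, this model simply returns 0.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_PAGE_SIZE  4096u /* assumed page size */
#define MODEL_SLOT_BYTES 128u  /* UPROBES_XOL_SLOT_BYTES in this series */
#define MODEL_NSLOTS     (MODEL_PAGE_SIZE / MODEL_SLOT_BYTES)
#define MODEL_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_NWORDS     ((MODEL_NSLOTS + MODEL_WORD_BITS - 1) / MODEL_WORD_BITS)

/* Hypothetical stand-in for struct uprobes_xol_area. */
struct model_xol_area {
	unsigned long bitmap[MODEL_NWORDS]; /* 0 = free slot */
	unsigned long vaddr;                /* base address of the slot page */
	unsigned int slot_count;            /* slots currently in use */
};

/* Take the first free slot; returns its address, or 0 if all are busy. */
static unsigned long model_take_slot(struct model_xol_area *area)
{
	unsigned int i;

	for (i = 0; i < MODEL_NSLOTS; i++) {
		unsigned long *word = &area->bitmap[i / MODEL_WORD_BITS];
		unsigned long mask = 1UL << (i % MODEL_WORD_BITS);

		if (!(*word & mask)) {
			*word |= mask;
			area->slot_count++;
			return area->vaddr + i * MODEL_SLOT_BYTES;
		}
	}
	return 0;
}

/* Release a slot previously returned by model_take_slot(). */
static void model_free_slot(struct model_xol_area *area,
			    unsigned long slot_addr)
{
	unsigned long offset = slot_addr - area->vaddr;
	unsigned int i = (unsigned int)(offset / MODEL_SLOT_BYTES);

	assert(slot_addr >= area->vaddr && offset < MODEL_PAGE_SIZE);
	assert(offset % MODEL_SLOT_BYTES == 0);

	area->bitmap[i / MODEL_WORD_BITS] &= ~(1UL << (i % MODEL_WORD_BITS));
	area->slot_count--;
}
```

With these sizes the page holds 32 slots, so at most 32 threads of one
process can be single-stepping out of line at the same time.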

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- no more spin lock needed for slot allocation.
- use install_special_mapping to add a vma. (previous approach used
  init_creds)
- set uprobes_xol_area while holding mmap_sem exclusively.

 include/linux/mm_types.h |    2 
 include/linux/uprobes.h  |   24 +++++
 kernel/fork.c            |    2 
 kernel/uprobes.c         |  215 +++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 240 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 544a0b6..2595c9c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/uprobes.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -391,6 +392,7 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_UPROBES
 	atomic_t mm_uprobes_count;
+	struct uprobes_xol_area *uprobes_xol_area;
 #endif
 };
 
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index c1378a9..add5222 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -90,6 +90,26 @@ struct uprobe_task {
 	struct uprobe *active_uprobe;
 };
 
+/*
+ * On a breakpoint hit, a thread contends for a slot.  It frees the
+ * slot after the singlestep.  Only a fixed number of slots are
+ * allocated.
+ */
+
+struct uprobes_xol_area {
+	wait_queue_head_t wq;	/* if all slots are busy */
+	atomic_t slot_count;	/* currently in use slots */
+	unsigned long *bitmap;	/* 0 = free slot */
+	struct page *page;
+
+	/*
+	 * We keep the vma's vm_start rather than a pointer to the vma
+	 * itself.  The probed process or a naughty kernel module could make
+	 * the vma go away, and we must handle that reasonably gracefully.
+	 */
+	unsigned long vaddr;		/* Page(s) of instruction slots */
+};
+
 #ifdef CONFIG_UPROBES
 extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
 							unsigned long vaddr);
@@ -101,6 +121,7 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void free_uprobe_utask(struct task_struct *tsk);
+extern void free_uprobes_xol_area(struct mm_struct *mm);
 extern int mmap_uprobe(struct vm_area_struct *vma);
 extern void munmap_uprobe(struct vm_area_struct *vma);
 extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
@@ -134,5 +155,8 @@ static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
 static inline void free_uprobe_utask(struct task_struct *tsk)
 {
 }
+static inline void free_uprobes_xol_area(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index a03f436..c605f2a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -558,6 +558,7 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
+		free_uprobes_xol_area(mm);
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
@@ -746,6 +747,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #endif
 #ifdef CONFIG_UPROBES
 	atomic_set(&mm->mm_uprobes_count, 0);
+	mm->uprobes_xol_area = NULL;
 #endif
 
 	if (!mm_init(mm, tsk))
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index b9e1932..b440acd 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -33,6 +33,9 @@
 #include <linux/kdebug.h>	/* notifier mechanism */
 #include <linux/uprobes.h>
 
+#define UINSNS_PER_PAGE	(PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
+#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
+
 static bulkref_t uprobes_srcu;
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
@@ -1051,6 +1054,201 @@ void munmap_uprobe(struct vm_area_struct *vma)
 	return;
 }
 
+/* Slot allocation for XOL */
+static int xol_add_vma(struct uprobes_xol_area *area)
+{
+	struct mm_struct *mm;
+	int ret;
+
+	area->page = alloc_page(GFP_HIGHUSER);
+	if (!area->page)
+		return -ENOMEM;
+
+	mm = current->mm;
+	down_write(&mm->mmap_sem);
+	ret = -EALREADY;
+	if (mm->uprobes_xol_area)
+		goto fail;
+
+	ret = -ENOMEM;
+
+	/* Try to map as high as possible, this is only a hint. */
+	area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
+							PAGE_SIZE, 0, 0);
+	if (area->vaddr & ~PAGE_MASK) {
+		ret = area->vaddr;
+		goto fail;
+	}
+
+	ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
+				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
+				&area->page);
+	if (ret)
+		goto fail;
+
+	smp_wmb();	/* pairs with get_uprobes_xol_area() */
+	mm->uprobes_xol_area = area;
+	ret = 0;
+
+fail:
+	up_write(&mm->mmap_sem);
+	if (ret)
+		__free_page(area->page);
+
+	return ret;
+}
+
+static struct uprobes_xol_area *get_uprobes_xol_area(struct mm_struct *mm)
+{
+	struct uprobes_xol_area *area = mm->uprobes_xol_area;
+	smp_read_barrier_depends();/* pairs with wmb in xol_add_vma() */
+	return area;
+}
+
+/*
+ * xol_alloc_area - Allocate process's uprobes_xol_area.
+ * This area will be used for storing instructions for execution out of
+ * line.
+ *
+ * Returns the allocated area or NULL.
+ */
+static struct uprobes_xol_area *xol_alloc_area(void)
+{
+	struct uprobes_xol_area *area;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (unlikely(!area))
+		return NULL;
+
+	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
+								GFP_KERNEL);
+
+	if (!area->bitmap)
+		goto fail;
+
+	init_waitqueue_head(&area->wq);
+	if (!xol_add_vma(area))
+		return area;
+
+fail:
+	kfree(area->bitmap);
+	kfree(area);
+	return get_uprobes_xol_area(current->mm);
+}
+
+/*
+ * free_uprobes_xol_area - Free the area allocated for slots.
+ */
+void free_uprobes_xol_area(struct mm_struct *mm)
+{
+	struct uprobes_xol_area *area = mm->uprobes_xol_area;
+
+	if (!area)
+		return;
+
+	put_page(area->page);
+	kfree(area->bitmap);
+	kfree(area);
+}
+
+/*
+ * xol_take_insn_slot - search for a free slot.
+ */
+static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
+{
+	unsigned long slot_addr;
+	int slot_nr;
+
+	do {
+		slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
+		if (slot_nr < UINSNS_PER_PAGE) {
+			if (!test_and_set_bit(slot_nr, area->bitmap))
+				break;
+
+			slot_nr = UINSNS_PER_PAGE;
+			continue;
+		}
+		wait_event(area->wq,
+			(atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
+	} while (slot_nr >= UINSNS_PER_PAGE);
+
+	slot_addr = area->vaddr + (slot_nr * UPROBES_XOL_SLOT_BYTES);
+	atomic_inc(&area->slot_count);
+	return slot_addr;
+}
+
+/*
+ * xol_get_insn_slot - If the task was not allocated a slot, then
+ * allocate one.
+ * Returns the allocated slot address or 0.
+ */
+static unsigned long xol_get_insn_slot(struct uprobe *uprobe,
+					unsigned long slot_addr)
+{
+	struct uprobes_xol_area *area;
+	unsigned long offset;
+	void *vaddr;
+
+	area = get_uprobes_xol_area(current->mm);
+	if (!area) {
+		area = xol_alloc_area();
+		if (!area)
+			return 0;
+	}
+	current->utask->xol_vaddr = xol_take_insn_slot(area);
+
+	/*
+	 * Initialize the slot if xol_vaddr points to valid
+	 * instruction slot.
+	 */
+	if (unlikely(!current->utask->xol_vaddr))
+		return 0;
+
+	current->utask->vaddr = slot_addr;
+	offset = current->utask->xol_vaddr & ~PAGE_MASK;
+	vaddr = kmap_atomic(area->page);
+	memcpy(vaddr + offset, uprobe->insn, MAX_UINSN_BYTES);
+	kunmap_atomic(vaddr);
+	return current->utask->xol_vaddr;
+}
+
+/*
+ * xol_free_insn_slot - If slot was earlier allocated by
+ * @xol_get_insn_slot(), make the slot available for
+ * subsequent requests.
+ */
+static void xol_free_insn_slot(struct task_struct *tsk)
+{
+	struct uprobes_xol_area *area;
+	unsigned long vma_end;
+	unsigned long slot_addr;
+
+	if (!tsk->mm || !tsk->mm->uprobes_xol_area || !tsk->utask)
+		return;
+
+	slot_addr = tsk->utask->xol_vaddr;
+
+	if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
+		return;
+
+	area = tsk->mm->uprobes_xol_area;
+	vma_end = area->vaddr + PAGE_SIZE;
+	if (area->vaddr <= slot_addr && slot_addr < vma_end) {
+		int slot_nr;
+		unsigned long offset = slot_addr - area->vaddr;
+
+		slot_nr = offset / UPROBES_XOL_SLOT_BYTES;
+		if (slot_nr >= UINSNS_PER_PAGE)
+			return;
+
+		clear_bit(slot_nr, area->bitmap);
+		atomic_dec(&area->slot_count);
+		if (waitqueue_active(&area->wq))
+			wake_up(&area->wq);
+		tsk->utask->xol_vaddr = 0;
+	}
+}
+
 /**
  * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
  * @regs: Reflects the saved state of the task after it has hit a breakpoint
@@ -1079,6 +1277,7 @@ void free_uprobe_utask(struct task_struct *tsk)
 	if (utask->active_uprobe)
 		put_uprobe(utask->active_uprobe);
 
+	xol_free_insn_slot(tsk);
 	kfree(utask);
 	tsk->utask = NULL;
 }
@@ -1108,7 +1307,8 @@ static struct uprobe_task *add_utask(void)
 static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
 				unsigned long vaddr)
 {
-	/* TODO: Yet to be implemented */
+	if (xol_get_insn_slot(uprobe, vaddr) && !pre_xol(uprobe, regs))
+		return 0;
 	return -EFAULT;
 }
 
@@ -1118,8 +1318,16 @@ static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
  */
 static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
 {
-	/* TODO: Yet to be implemented */
-	return false;
+	unsigned long vaddr = instruction_pointer(regs);
+
+	/*
+	 * If we have executed out of line, the instruction pointer
+	 * cannot be the same as the virtual address of the XOL slot.
+	 */
+	if (vaddr == current->utask->xol_vaddr)
+		return false;
+	post_xol(uprobe, regs);
+	return true;
 }
 
 /*
@@ -1176,6 +1384,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			utask->active_uprobe = NULL;
 			utask->state = UTASK_RUNNING;
 			user_disable_single_step(current);
+			xol_free_insn_slot(current);
 		}
 	}
 	return;
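
The slot bookkeeping in xol_take_insn_slot() and xol_free_insn_slot()
above can be sketched in user space as follows. This is only an
illustration, not the kernel code: every sketch_* name is hypothetical,
the 4096-byte page and 128-byte slot sizes are assumptions standing in
for PAGE_SIZE and UPROBES_XOL_SLOT_BYTES, and the sketch is
single-threaded, so it uses plain bit operations where the patch uses
test_and_set_bit(), clear_bit() and an atomic counter. Where the patch
sleeps on area->wq until a slot is released, the sketch just returns 0.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Assumed sizes; stand-ins for PAGE_SIZE and UPROBES_XOL_SLOT_BYTES. */
#define SKETCH_PAGE_SIZE	4096UL
#define SKETCH_SLOT_BYTES	128UL
#define SKETCH_NSLOTS		(SKETCH_PAGE_SIZE / SKETCH_SLOT_BYTES)
#define SKETCH_LONG_BITS	(sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NWORDS \
	((SKETCH_NSLOTS + SKETCH_LONG_BITS - 1) / SKETCH_LONG_BITS)

/* Hypothetical user-space analogue of struct uprobes_xol_area. */
struct sketch_xol_area {
	unsigned long bitmap[SKETCH_NWORDS];	/* one bit per slot, 1 = busy */
	unsigned long vaddr;			/* page-aligned base of the slots */
	unsigned int slot_count;		/* number of busy slots */
};

/* Take the first free slot; return its address, or 0 if all are busy. */
static unsigned long sketch_take_slot(struct sketch_xol_area *area)
{
	unsigned long nr;

	for (nr = 0; nr < SKETCH_NSLOTS; nr++) {
		unsigned long *word = &area->bitmap[nr / SKETCH_LONG_BITS];
		unsigned long mask = 1UL << (nr % SKETCH_LONG_BITS);

		if (!(*word & mask)) {
			*word |= mask;
			area->slot_count++;
			return area->vaddr + nr * SKETCH_SLOT_BYTES;
		}
	}
	return 0;	/* the patch waits on area->wq here instead */
}

/* Release a slot previously returned by sketch_take_slot(). */
static void sketch_free_slot(struct sketch_xol_area *area,
			     unsigned long slot_addr)
{
	unsigned long nr, mask;

	/* Ignore addresses outside the slot page. */
	if (slot_addr < area->vaddr ||
	    slot_addr >= area->vaddr + SKETCH_PAGE_SIZE)
		return;

	nr = (slot_addr - area->vaddr) / SKETCH_SLOT_BYTES;
	mask = 1UL << (nr % SKETCH_LONG_BITS);
	assert(area->bitmap[nr / SKETCH_LONG_BITS] & mask);	/* must be busy */
	area->bitmap[nr / SKETCH_LONG_BITS] &= ~mask;
	area->slot_count--;
}
```

With a zero-initialized area whose vaddr is page aligned, the first call
to sketch_take_slot() hands out the base address and the next call the
address one slot further on; freeing a slot makes it the first candidate
again.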



* [PATCH v7 3.2-rc2 19/30] tracing: modify is_delete, is_return from ints to bool.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (17 preceding siblings ...)
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 18/30] uprobes: slot allocation Srikar Dronamraju
@ 2011-11-18 11:10 ` Srikar Dronamraju
  2011-11-23 19:24   ` Steven Rostedt
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 20/30] tracing: Extract out common code for kprobes/uprobes traceevents Srikar Dronamraju
                   ` (12 subsequent siblings)
  31 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:10 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


is_delete and is_return can take at most two values and
are better off being bool than int.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5):
- extracted from the next patch on Masami's suggestion.
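
For reviewers, a minimal user-space sketch of the flag handling this
patch touches. It is not the kernel code: sketch_probe_cmd and
sketch_parse_cmd() are hypothetical names, and only the first-character
dispatch of create_trace_probe() ('p', 'r' or '-') is mirrored. The
point is that two-state flags read more clearly as bool with
true/false than as int with 0/1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical container for the two flags converted by this patch. */
struct sketch_probe_cmd {
	bool is_return;		/* 'r': return probe */
	bool is_delete;		/* '-': delete an existing probe */
};

/*
 * Classify a probe definition by its first character.
 * Returns 0 on success, -1 if the definition does not start
 * with 'p', 'r' or '-'.
 */
static int sketch_parse_cmd(const char *def, struct sketch_probe_cmd *cmd)
{
	assert(def != NULL && cmd != NULL);

	cmd->is_return = false;
	cmd->is_delete = false;

	switch (def[0]) {
	case 'p':
		return 0;
	case 'r':
		cmd->is_return = true;
		return 0;
	case '-':
		cmd->is_delete = true;
		return 0;
	default:
		return -1;
	}
}
```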

 kernel/trace/trace_kprobe.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 00d527c..2490dd1 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -651,7 +651,7 @@ static struct trace_probe *alloc_trace_probe(const char *group,
 					     void *addr,
 					     const char *symbol,
 					     unsigned long offs,
-					     int nargs, int is_return)
+					     int nargs, bool is_return)
 {
 	struct trace_probe *tp;
 	int ret = -ENOMEM;
@@ -944,7 +944,7 @@ static int split_symbol_offset(char *symbol, unsigned long *offset)
 #define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
 
 static int parse_probe_vars(char *arg, const struct fetch_type *t,
-			    struct fetch_param *f, int is_return)
+			    struct fetch_param *f, bool is_return)
 {
 	int ret = 0;
 	unsigned long param;
@@ -977,7 +977,7 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,
 
 /* Recursive argument parser */
 static int __parse_probe_arg(char *arg, const struct fetch_type *t,
-			     struct fetch_param *f, int is_return)
+			     struct fetch_param *f, bool is_return)
 {
 	int ret = 0;
 	unsigned long param;
@@ -1089,7 +1089,7 @@ static int __parse_bitfield_probe_arg(const char *bf,
 
 /* String length checking wrapper */
 static int parse_probe_arg(char *arg, struct trace_probe *tp,
-			   struct probe_arg *parg, int is_return)
+			   struct probe_arg *parg, bool is_return)
 {
 	const char *t;
 	int ret;
@@ -1162,7 +1162,7 @@ static int create_trace_probe(int argc, char **argv)
 	 */
 	struct trace_probe *tp;
 	int i, ret = 0;
-	int is_return = 0, is_delete = 0;
+	bool is_return = false, is_delete = false;
 	char *symbol = NULL, *event = NULL, *group = NULL;
 	char *arg;
 	unsigned long offset = 0;
@@ -1171,11 +1171,11 @@ static int create_trace_probe(int argc, char **argv)
 
 	/* argc must be >= 1 */
 	if (argv[0][0] == 'p')
-		is_return = 0;
+		is_return = false;
 	else if (argv[0][0] == 'r')
-		is_return = 1;
+		is_return = true;
 	else if (argv[0][0] == '-')
-		is_delete = 1;
+		is_delete = true;
 	else {
 		pr_info("Probe definition must be started with 'p', 'r' or"
 			" '-'.\n");



* [PATCH v7 3.2-rc2 20/30] tracing: Extract out common code for kprobes/uprobes traceevents.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (18 preceding siblings ...)
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 19/30] tracing: modify is_delete, is_return from ints to bool Srikar Dronamraju
@ 2011-11-18 11:10 ` Srikar Dronamraju
  2011-11-23 19:32   ` Steven Rostedt
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 21/30] tracing: uprobes trace_event interface Srikar Dronamraju
                   ` (11 subsequent siblings)
  31 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:10 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Move the parts of trace_kprobe.c that can be shared with the upcoming
trace_uprobe.c into kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- Extracted out int to bool changes to a separate patch.
- Fix a bug in kprobe_trace_self_tests_init that was introduced
  in previous patchset.
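
One of the helpers that the diff below removes from trace_kprobe.c is
the data_rloc encoding used for string arguments: a u32 holding the
string length in the high 16 bits and an offset in the low 16 bits.
The following user-space sketch restates those macros as functions to
make the packing explicit. The sketch_* names are hypothetical, and the
length assertion is an addition of the sketch, not something the
kernel macros check.

```c
#include <assert.h>
#include <stdint.h>

/* Pack a length (high 16 bits) and a relative offset (low 16 bits). */
static uint32_t sketch_make_data_rloc(uint32_t len, uint32_t roffs)
{
	assert(len <= 0xffff);	/* only 16 bits hold the length */
	return (len << 16) | (roffs & 0xffff);
}

/* Extract the length field. */
static uint32_t sketch_rloc_len(uint32_t dl)
{
	return dl >> 16;
}

/* Extract the offset field. */
static uint32_t sketch_rloc_offs(uint32_t dl)
{
	return dl & 0xffff;
}

/*
 * data_rloc stores the offset relative to the field itself, while
 * data_loc stores it relative to the start of the event entry, so
 * converting adds the field's own offset within the entry.
 */
static uint32_t sketch_rloc_to_loc(uint32_t dl, uint32_t field_offs)
{
	return dl + field_offs;
}
```

For example, a 5-byte string stored 12 bytes after its descriptor, with
the descriptor itself 8 bytes into the entry, converts to a data_loc
with the same length and an offset of 20.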

 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/trace_kprobe.c |  889 +------------------------------------------
 kernel/trace/trace_probe.c  |  779 ++++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_probe.h  |  160 ++++++++
 5 files changed, 962 insertions(+), 871 deletions(-)
 create mode 100644 kernel/trace/trace_probe.c
 create mode 100644 kernel/trace/trace_probe.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..520106a 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -373,6 +373,7 @@ config KPROBE_EVENT
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
 	bool "Enable kprobes-based dynamic events"
 	select TRACING
+	select PROBE_EVENTS
 	default y
 	help
 	  This allows the user to add tracing events (similar to tracepoints)
@@ -385,6 +386,9 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config PROBE_EVENTS
+	def_bool n
+
 config DYNAMIC_FTRACE
 	bool "enable/disable ftrace tracepoints dynamically"
 	depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 5f39a07..fa10d5c 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -61,5 +61,6 @@ endif
 ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
+obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 2490dd1..967e634 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,547 +19,15 @@
 
 #include <linux/module.h>
 #include <linux/uaccess.h>
-#include <linux/kprobes.h>
-#include <linux/seq_file.h>
-#include <linux/slab.h>
-#include <linux/smp.h>
-#include <linux/debugfs.h>
-#include <linux/types.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/ptrace.h>
-#include <linux/perf_event.h>
-#include <linux/stringify.h>
-#include <linux/limits.h>
-#include <asm/bitsperlong.h>
-
-#include "trace.h"
-#include "trace_output.h"
-
-#define MAX_TRACE_ARGS 128
-#define MAX_ARGSTR_LEN 63
-#define MAX_EVENT_NAME_LEN 64
-#define MAX_STRING_SIZE PATH_MAX
-#define KPROBE_EVENT_SYSTEM "kprobes"
-
-/* Reserved field names */
-#define FIELD_STRING_IP "__probe_ip"
-#define FIELD_STRING_RETIP "__probe_ret_ip"
-#define FIELD_STRING_FUNC "__probe_func"
-
-const char *reserved_field_names[] = {
-	"common_type",
-	"common_flags",
-	"common_preempt_count",
-	"common_pid",
-	"common_tgid",
-	FIELD_STRING_IP,
-	FIELD_STRING_RETIP,
-	FIELD_STRING_FUNC,
-};
-
-/* Printing function type */
-typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
-				 void *);
-#define PRINT_TYPE_FUNC_NAME(type)	print_type_##type
-#define PRINT_TYPE_FMT_NAME(type)	print_type_format_##type
-
-/* Printing  in basic type function template */
-#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast)			\
-static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s,	\
-						const char *name,	\
-						void *data, void *ent)\
-{									\
-	return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
-}									\
-static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
-
-DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
-
-/* data_rloc: data relative location, compatible with u32 */
-#define make_data_rloc(len, roffs)	\
-	(((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
-#define get_rloc_len(dl)	((u32)(dl) >> 16)
-#define get_rloc_offs(dl)	((u32)(dl) & 0xffff)
-
-static inline void *get_rloc_data(u32 *dl)
-{
-	return (u8 *)dl + get_rloc_offs(*dl);
-}
-
-/* For data_loc conversion */
-static inline void *get_loc_data(u32 *dl, void *ent)
-{
-	return (u8 *)ent + get_rloc_offs(*dl);
-}
-
-/*
- * Convert data_rloc to data_loc:
- *  data_rloc stores the offset from data_rloc itself, but data_loc
- *  stores the offset from event entry.
- */
-#define convert_rloc_to_loc(dl, offs)	((u32)(dl) + (offs))
-
-/* For defining macros, define string/string_size types */
-typedef u32 string;
-typedef u32 string_size;
-
-/* Print type function for string type */
-static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
-						  const char *name,
-						  void *data, void *ent)
-{
-	int len = *(u32 *)data >> 16;
-
-	if (!len)
-		return trace_seq_printf(s, " %s=(fault)", name);
-	else
-		return trace_seq_printf(s, " %s=\"%s\"", name,
-					(const char *)get_loc_data(data, ent));
-}
-static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
-
-/* Data fetch function type */
-typedef	void (*fetch_func_t)(struct pt_regs *, void *, void *);
-
-struct fetch_param {
-	fetch_func_t	fn;
-	void *data;
-};
-
-static __kprobes void call_fetch(struct fetch_param *fprm,
-				 struct pt_regs *regs, void *dest)
-{
-	return fprm->fn(regs, fprm->data, dest);
-}
-
-#define FETCH_FUNC_NAME(method, type)	fetch_##method##_##type
-/*
- * Define macro for basic types - we don't need to define s* types, because
- * we have to care only about bitwidth at recording time.
- */
-#define DEFINE_BASIC_FETCH_FUNCS(method) \
-DEFINE_FETCH_##method(u8)		\
-DEFINE_FETCH_##method(u16)		\
-DEFINE_FETCH_##method(u32)		\
-DEFINE_FETCH_##method(u64)
-
-#define CHECK_FETCH_FUNCS(method, fn)			\
-	(((FETCH_FUNC_NAME(method, u8) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u16) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u32) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u64) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, string) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, string_size) == fn)) \
-	 && (fn != NULL))
-
-/* Data fetch function templates */
-#define DEFINE_FETCH_reg(type)						\
-static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs,	\
-					void *offset, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_get_register(regs,			\
-				(unsigned int)((unsigned long)offset));	\
-}
-DEFINE_BASIC_FETCH_FUNCS(reg)
-/* No string on the register */
-#define fetch_reg_string NULL
-#define fetch_reg_string_size NULL
-
-#define DEFINE_FETCH_stack(type)					\
-static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
-					  void *offset, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_get_kernel_stack_nth(regs,		\
-				(unsigned int)((unsigned long)offset));	\
-}
-DEFINE_BASIC_FETCH_FUNCS(stack)
-/* No string on the stack entry */
-#define fetch_stack_string NULL
-#define fetch_stack_string_size NULL
-
-#define DEFINE_FETCH_retval(type)					\
-static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
-					  void *dummy, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_return_value(regs);			\
-}
-DEFINE_BASIC_FETCH_FUNCS(retval)
-/* No string on the retval */
-#define fetch_retval_string NULL
-#define fetch_retval_string_size NULL
-
-#define DEFINE_FETCH_memory(type)					\
-static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
-					  void *addr, void *dest)	\
-{									\
-	type retval;							\
-	if (probe_kernel_address(addr, retval))				\
-		*(type *)dest = 0;					\
-	else								\
-		*(type *)dest = retval;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(memory)
-/*
- * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
- * length and relative data location.
- */
-static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
-						      void *addr, void *dest)
-{
-	long ret;
-	int maxlen = get_rloc_len(*(u32 *)dest);
-	u8 *dst = get_rloc_data(dest);
-	u8 *src = addr;
-	mm_segment_t old_fs = get_fs();
-	if (!maxlen)
-		return;
-	/*
-	 * Try to get string again, since the string can be changed while
-	 * probing.
-	 */
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-	do
-		ret = __copy_from_user_inatomic(dst++, src++, 1);
-	while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
-	dst[-1] = '\0';
-	pagefault_enable();
-	set_fs(old_fs);
-
-	if (ret < 0) {	/* Failed to fetch string */
-		((u8 *)get_rloc_data(dest))[0] = '\0';
-		*(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
-	} else
-		*(u32 *)dest = make_data_rloc(src - (u8 *)addr,
-					      get_rloc_offs(*(u32 *)dest));
-}
-/* Return the length of string -- including null terminal byte */
-static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
-							void *addr, void *dest)
-{
-	int ret, len = 0;
-	u8 c;
-	mm_segment_t old_fs = get_fs();
-
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-	do {
-		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
-		len++;
-	} while (c && ret == 0 && len < MAX_STRING_SIZE);
-	pagefault_enable();
-	set_fs(old_fs);
-
-	if (ret < 0)	/* Failed to check the length */
-		*(u32 *)dest = 0;
-	else
-		*(u32 *)dest = len;
-}
-
-/* Memory fetching by symbol */
-struct symbol_cache {
-	char *symbol;
-	long offset;
-	unsigned long addr;
-};
-
-static unsigned long update_symbol_cache(struct symbol_cache *sc)
-{
-	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
-	if (sc->addr)
-		sc->addr += sc->offset;
-	return sc->addr;
-}
-
-static void free_symbol_cache(struct symbol_cache *sc)
-{
-	kfree(sc->symbol);
-	kfree(sc);
-}
-
-static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
-{
-	struct symbol_cache *sc;
-
-	if (!sym || strlen(sym) == 0)
-		return NULL;
-	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
-	if (!sc)
-		return NULL;
-
-	sc->symbol = kstrdup(sym, GFP_KERNEL);
-	if (!sc->symbol) {
-		kfree(sc);
-		return NULL;
-	}
-	sc->offset = offset;
 
-	update_symbol_cache(sc);
-	return sc;
-}
-
-#define DEFINE_FETCH_symbol(type)					\
-static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
-					  void *data, void *dest)	\
-{									\
-	struct symbol_cache *sc = data;					\
-	if (sc->addr)							\
-		fetch_memory_##type(regs, (void *)sc->addr, dest);	\
-	else								\
-		*(type *)dest = 0;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(symbol)
-DEFINE_FETCH_symbol(string)
-DEFINE_FETCH_symbol(string_size)
-
-/* Dereference memory access function */
-struct deref_fetch_param {
-	struct fetch_param orig;
-	long offset;
-};
-
-#define DEFINE_FETCH_deref(type)					\
-static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
-					    void *data, void *dest)	\
-{									\
-	struct deref_fetch_param *dprm = data;				\
-	unsigned long addr;						\
-	call_fetch(&dprm->orig, regs, &addr);				\
-	if (addr) {							\
-		addr += dprm->offset;					\
-		fetch_memory_##type(regs, (void *)addr, dest);		\
-	} else								\
-		*(type *)dest = 0;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(deref)
-DEFINE_FETCH_deref(string)
-DEFINE_FETCH_deref(string_size)
-
-static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data)
-{
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		update_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		update_symbol_cache(data->orig.data);
-}
-
-static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
-{
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		free_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		free_symbol_cache(data->orig.data);
-	kfree(data);
-}
-
-/* Bitfield fetch function */
-struct bitfield_fetch_param {
-	struct fetch_param orig;
-	unsigned char hi_shift;
-	unsigned char low_shift;
-};
+#include "trace_probe.h"
 
-#define DEFINE_FETCH_bitfield(type)					\
-static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
-					    void *data, void *dest)	\
-{									\
-	struct bitfield_fetch_param *bprm = data;			\
-	type buf = 0;							\
-	call_fetch(&bprm->orig, regs, &buf);				\
-	if (buf) {							\
-		buf <<= bprm->hi_shift;					\
-		buf >>= bprm->low_shift;				\
-	}								\
-	*(type *)dest = buf;						\
-}
-DEFINE_BASIC_FETCH_FUNCS(bitfield)
-#define fetch_bitfield_string NULL
-#define fetch_bitfield_string_size NULL
-
-static __kprobes void
-update_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
-	/*
-	 * Don't check the bitfield itself, because this must be the
-	 * last fetch function.
-	 */
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		update_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		update_symbol_cache(data->orig.data);
-}
-
-static __kprobes void
-free_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
-	/*
-	 * Don't check the bitfield itself, because this must be the
-	 * last fetch function.
-	 */
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		free_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		free_symbol_cache(data->orig.data);
-	kfree(data);
-}
-
-/* Default (unsigned long) fetch type */
-#define __DEFAULT_FETCH_TYPE(t) u##t
-#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
-#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
-#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
-
-/* Fetch types */
-enum {
-	FETCH_MTD_reg = 0,
-	FETCH_MTD_stack,
-	FETCH_MTD_retval,
-	FETCH_MTD_memory,
-	FETCH_MTD_symbol,
-	FETCH_MTD_deref,
-	FETCH_MTD_bitfield,
-	FETCH_MTD_END,
-};
-
-#define ASSIGN_FETCH_FUNC(method, type)	\
-	[FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
-
-#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype)	\
-	{.name = _name,				\
-	 .size = _size,					\
-	 .is_signed = sign,				\
-	 .print = PRINT_TYPE_FUNC_NAME(ptype),		\
-	 .fmt = PRINT_TYPE_FMT_NAME(ptype),		\
-	 .fmttype = _fmttype,				\
-	 .fetch = {					\
-ASSIGN_FETCH_FUNC(reg, ftype),				\
-ASSIGN_FETCH_FUNC(stack, ftype),			\
-ASSIGN_FETCH_FUNC(retval, ftype),			\
-ASSIGN_FETCH_FUNC(memory, ftype),			\
-ASSIGN_FETCH_FUNC(symbol, ftype),			\
-ASSIGN_FETCH_FUNC(deref, ftype),			\
-ASSIGN_FETCH_FUNC(bitfield, ftype),			\
-	  }						\
-	}
-
-#define ASSIGN_FETCH_TYPE(ptype, ftype, sign)			\
-	__ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
-
-#define FETCH_TYPE_STRING 0
-#define FETCH_TYPE_STRSIZE 1
-
-/* Fetch type information table */
-static const struct fetch_type {
-	const char	*name;		/* Name of type */
-	size_t		size;		/* Byte size of type */
-	int		is_signed;	/* Signed flag */
-	print_type_func_t	print;	/* Print functions */
-	const char	*fmt;		/* Fromat string */
-	const char	*fmttype;	/* Name in format file */
-	/* Fetch functions */
-	fetch_func_t	fetch[FETCH_MTD_END];
-} fetch_type_table[] = {
-	/* Special types */
-	[FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
-					sizeof(u32), 1, "__data_loc char[]"),
-	[FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
-					string_size, sizeof(u32), 0, "u32"),
-	/* Basic types */
-	ASSIGN_FETCH_TYPE(u8,  u8,  0),
-	ASSIGN_FETCH_TYPE(u16, u16, 0),
-	ASSIGN_FETCH_TYPE(u32, u32, 0),
-	ASSIGN_FETCH_TYPE(u64, u64, 0),
-	ASSIGN_FETCH_TYPE(s8,  u8,  1),
-	ASSIGN_FETCH_TYPE(s16, u16, 1),
-	ASSIGN_FETCH_TYPE(s32, u32, 1),
-	ASSIGN_FETCH_TYPE(s64, u64, 1),
-};
-
-static const struct fetch_type *find_fetch_type(const char *type)
-{
-	int i;
-
-	if (!type)
-		type = DEFAULT_FETCH_TYPE_STR;
-
-	/* Special case: bitfield */
-	if (*type == 'b') {
-		unsigned long bs;
-		type = strchr(type, '/');
-		if (!type)
-			goto fail;
-		type++;
-		if (strict_strtoul(type, 0, &bs))
-			goto fail;
-		switch (bs) {
-		case 8:
-			return find_fetch_type("u8");
-		case 16:
-			return find_fetch_type("u16");
-		case 32:
-			return find_fetch_type("u32");
-		case 64:
-			return find_fetch_type("u64");
-		default:
-			goto fail;
-		}
-	}
-
-	for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
-		if (strcmp(type, fetch_type_table[i].name) == 0)
-			return &fetch_type_table[i];
-fail:
-	return NULL;
-}
-
-/* Special function : only accept unsigned long */
-static __kprobes void fetch_stack_address(struct pt_regs *regs,
-					  void *dummy, void *dest)
-{
-	*(unsigned long *)dest = kernel_stack_pointer(regs);
-}
-
-static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
-					    fetch_func_t orig_fn)
-{
-	int i;
-
-	if (type != &fetch_type_table[FETCH_TYPE_STRING])
-		return NULL;	/* Only string type needs size function */
-	for (i = 0; i < FETCH_MTD_END; i++)
-		if (type->fetch[i] == orig_fn)
-			return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
-
-	WARN_ON(1);	/* This should not happen */
-	return NULL;
-}
+#define KPROBE_EVENT_SYSTEM "kprobes"
 
 /**
  * Kprobe event core functions
  */
 
-struct probe_arg {
-	struct fetch_param	fetch;
-	struct fetch_param	fetch_size;
-	unsigned int		offset;	/* Offset from argument entry */
-	const char		*name;	/* Name of this argument */
-	const char		*comm;	/* Command of this argument */
-	const struct fetch_type	*type;	/* Type of this argument */
-};
-
-/* Flags for trace_probe */
-#define TP_FLAG_TRACE	1
-#define TP_FLAG_PROFILE	2
-#define TP_FLAG_REGISTERED 4
-
 struct trace_probe {
 	struct list_head	list;
 	struct kretprobe	rp;	/* Use rp.kp for kprobe use */
@@ -631,18 +99,6 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs);
 static int kretprobe_dispatcher(struct kretprobe_instance *ri,
 				struct pt_regs *regs);
 
-/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
-{
-	if (!isalpha(*name) && *name != '_')
-		return 0;
-	while (*++name != '\0') {
-		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
-			return 0;
-	}
-	return 1;
-}
-
 /*
  * Allocate new trace_probe and initialize it (including kprobes).
  */
@@ -702,34 +158,12 @@ static struct trace_probe *alloc_trace_probe(const char *group,
 	return ERR_PTR(ret);
 }
 
-static void update_probe_arg(struct probe_arg *arg)
-{
-	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
-		update_bitfield_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
-		update_deref_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
-		update_symbol_cache(arg->fetch.data);
-}
-
-static void free_probe_arg(struct probe_arg *arg)
-{
-	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
-		free_bitfield_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
-		free_deref_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
-		free_symbol_cache(arg->fetch.data);
-	kfree(arg->name);
-	kfree(arg->comm);
-}
-
 static void free_trace_probe(struct trace_probe *tp)
 {
 	int i;
 
 	for (i = 0; i < tp->nr_args; i++)
-		free_probe_arg(&tp->args[i]);
+		traceprobe_free_probe_arg(&tp->args[i]);
 
 	kfree(tp->call.class->system);
 	kfree(tp->call.name);
@@ -787,7 +221,7 @@ static int __register_trace_probe(struct trace_probe *tp)
 		return -EINVAL;
 
 	for (i = 0; i < tp->nr_args; i++)
-		update_probe_arg(&tp->args[i]);
+		traceprobe_update_arg(&tp->args[i]);
 
 	/* Set/clear disabled flag according to tp->flag */
 	if (trace_probe_is_enabled(tp))
@@ -919,227 +353,6 @@ static struct notifier_block trace_probe_module_nb = {
 	.priority = 1	/* Invoked after kprobe module callback */
 };
 
-/* Split symbol and offset. */
-static int split_symbol_offset(char *symbol, unsigned long *offset)
-{
-	char *tmp;
-	int ret;
-
-	if (!offset)
-		return -EINVAL;
-
-	tmp = strchr(symbol, '+');
-	if (tmp) {
-		/* skip sign because strict_strtol doesn't accept '+' */
-		ret = strict_strtoul(tmp + 1, 0, offset);
-		if (ret)
-			return ret;
-		*tmp = '\0';
-	} else
-		*offset = 0;
-	return 0;
-}
-
-#define PARAM_MAX_ARGS 16
-#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
-
-static int parse_probe_vars(char *arg, const struct fetch_type *t,
-			    struct fetch_param *f, bool is_return)
-{
-	int ret = 0;
-	unsigned long param;
-
-	if (strcmp(arg, "retval") == 0) {
-		if (is_return)
-			f->fn = t->fetch[FETCH_MTD_retval];
-		else
-			ret = -EINVAL;
-	} else if (strncmp(arg, "stack", 5) == 0) {
-		if (arg[5] == '\0') {
-			if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
-				f->fn = fetch_stack_address;
-			else
-				ret = -EINVAL;
-		} else if (isdigit(arg[5])) {
-			ret = strict_strtoul(arg + 5, 10, &param);
-			if (ret || param > PARAM_MAX_STACK)
-				ret = -EINVAL;
-			else {
-				f->fn = t->fetch[FETCH_MTD_stack];
-				f->data = (void *)param;
-			}
-		} else
-			ret = -EINVAL;
-	} else
-		ret = -EINVAL;
-	return ret;
-}
-
-/* Recursive argument parser */
-static int __parse_probe_arg(char *arg, const struct fetch_type *t,
-			     struct fetch_param *f, bool is_return)
-{
-	int ret = 0;
-	unsigned long param;
-	long offset;
-	char *tmp;
-
-	switch (arg[0]) {
-	case '$':
-		ret = parse_probe_vars(arg + 1, t, f, is_return);
-		break;
-	case '%':	/* named register */
-		ret = regs_query_register_offset(arg + 1);
-		if (ret >= 0) {
-			f->fn = t->fetch[FETCH_MTD_reg];
-			f->data = (void *)(unsigned long)ret;
-			ret = 0;
-		}
-		break;
-	case '@':	/* memory or symbol */
-		if (isdigit(arg[1])) {
-			ret = strict_strtoul(arg + 1, 0, &param);
-			if (ret)
-				break;
-			f->fn = t->fetch[FETCH_MTD_memory];
-			f->data = (void *)param;
-		} else {
-			ret = split_symbol_offset(arg + 1, &offset);
-			if (ret)
-				break;
-			f->data = alloc_symbol_cache(arg + 1, offset);
-			if (f->data)
-				f->fn = t->fetch[FETCH_MTD_symbol];
-		}
-		break;
-	case '+':	/* deref memory */
-		arg++;	/* Skip '+', because strict_strtol() rejects it. */
-	case '-':
-		tmp = strchr(arg, '(');
-		if (!tmp)
-			break;
-		*tmp = '\0';
-		ret = strict_strtol(arg, 0, &offset);
-		if (ret)
-			break;
-		arg = tmp + 1;
-		tmp = strrchr(arg, ')');
-		if (tmp) {
-			struct deref_fetch_param *dprm;
-			const struct fetch_type *t2 = find_fetch_type(NULL);
-			*tmp = '\0';
-			dprm = kzalloc(sizeof(struct deref_fetch_param),
-				       GFP_KERNEL);
-			if (!dprm)
-				return -ENOMEM;
-			dprm->offset = offset;
-			ret = __parse_probe_arg(arg, t2, &dprm->orig,
-						is_return);
-			if (ret)
-				kfree(dprm);
-			else {
-				f->fn = t->fetch[FETCH_MTD_deref];
-				f->data = (void *)dprm;
-			}
-		}
-		break;
-	}
-	if (!ret && !f->fn) {	/* Parsed, but do not find fetch method */
-		pr_info("%s type has no corresponding fetch method.\n",
-			t->name);
-		ret = -EINVAL;
-	}
-	return ret;
-}
-
-#define BYTES_TO_BITS(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
-
-/* Bitfield type needs to be parsed into a fetch function */
-static int __parse_bitfield_probe_arg(const char *bf,
-				      const struct fetch_type *t,
-				      struct fetch_param *f)
-{
-	struct bitfield_fetch_param *bprm;
-	unsigned long bw, bo;
-	char *tail;
-
-	if (*bf != 'b')
-		return 0;
-
-	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
-	if (!bprm)
-		return -ENOMEM;
-	bprm->orig = *f;
-	f->fn = t->fetch[FETCH_MTD_bitfield];
-	f->data = (void *)bprm;
-
-	bw = simple_strtoul(bf + 1, &tail, 0);	/* Use simple one */
-	if (bw == 0 || *tail != '@')
-		return -EINVAL;
-
-	bf = tail + 1;
-	bo = simple_strtoul(bf, &tail, 0);
-	if (tail == bf || *tail != '/')
-		return -EINVAL;
-
-	bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
-	bprm->low_shift = bprm->hi_shift + bo;
-	return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
-}
-
-/* String length checking wrapper */
-static int parse_probe_arg(char *arg, struct trace_probe *tp,
-			   struct probe_arg *parg, bool is_return)
-{
-	const char *t;
-	int ret;
-
-	if (strlen(arg) > MAX_ARGSTR_LEN) {
-		pr_info("Argument is too long.: %s\n",  arg);
-		return -ENOSPC;
-	}
-	parg->comm = kstrdup(arg, GFP_KERNEL);
-	if (!parg->comm) {
-		pr_info("Failed to allocate memory for command '%s'.\n", arg);
-		return -ENOMEM;
-	}
-	t = strchr(parg->comm, ':');
-	if (t) {
-		arg[t - parg->comm] = '\0';
-		t++;
-	}
-	parg->type = find_fetch_type(t);
-	if (!parg->type) {
-		pr_info("Unsupported type: %s\n", t);
-		return -EINVAL;
-	}
-	parg->offset = tp->size;
-	tp->size += parg->type->size;
-	ret = __parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
-	if (ret >= 0 && t != NULL)
-		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
-	if (ret >= 0) {
-		parg->fetch_size.fn = get_fetch_size_function(parg->type,
-							      parg->fetch.fn);
-		parg->fetch_size.data = parg->fetch.data;
-	}
-	return ret;
-}
-
-/* Return 1 if name is reserved or already used by another argument */
-static int conflict_field_name(const char *name,
-			       struct probe_arg *args, int narg)
-{
-	int i;
-	for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
-		if (strcmp(reserved_field_names[i], name) == 0)
-			return 1;
-	for (i = 0; i < narg; i++)
-		if (strcmp(args[i].name, name) == 0)
-			return 1;
-	return 0;
-}
-
 static int create_trace_probe(int argc, char **argv)
 {
 	/*
@@ -1240,7 +453,7 @@ static int create_trace_probe(int argc, char **argv)
 		/* a symbol specified */
 		symbol = argv[1];
 		/* TODO: support .init module functions */
-		ret = split_symbol_offset(symbol, &offset);
+		ret = traceprobe_split_symbol_offset(symbol, &offset);
 		if (ret) {
 			pr_info("Failed to parse symbol.\n");
 			return ret;
@@ -1302,7 +515,8 @@ static int create_trace_probe(int argc, char **argv)
 			goto error;
 		}
 
-		if (conflict_field_name(tp->args[i].name, tp->args, i)) {
+		if (traceprobe_conflict_field_name(tp->args[i].name,
+							tp->args, i)) {
 			pr_info("Argument[%d] name '%s' conflicts with "
 				"another field.\n", i, argv[i]);
 			ret = -EINVAL;
@@ -1310,7 +524,8 @@ static int create_trace_probe(int argc, char **argv)
 		}
 
 		/* Parse fetch argument */
-		ret = parse_probe_arg(arg, tp, &tp->args[i], is_return);
+		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+								is_return);
 		if (ret) {
 			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
 			goto error;
@@ -1412,70 +627,11 @@ static int probes_open(struct inode *inode, struct file *file)
 	return seq_open(file, &probes_seq_op);
 }
 
-static int command_trace_probe(const char *buf)
-{
-	char **argv;
-	int argc = 0, ret = 0;
-
-	argv = argv_split(GFP_KERNEL, buf, &argc);
-	if (!argv)
-		return -ENOMEM;
-
-	if (argc)
-		ret = create_trace_probe(argc, argv);
-
-	argv_free(argv);
-	return ret;
-}
-
-#define WRITE_BUFSIZE 4096
-
 static ssize_t probes_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
-	char *kbuf, *tmp;
-	int ret;
-	size_t done;
-	size_t size;
-
-	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
-	if (!kbuf)
-		return -ENOMEM;
-
-	ret = done = 0;
-	while (done < count) {
-		size = count - done;
-		if (size >= WRITE_BUFSIZE)
-			size = WRITE_BUFSIZE - 1;
-		if (copy_from_user(kbuf, buffer + done, size)) {
-			ret = -EFAULT;
-			goto out;
-		}
-		kbuf[size] = '\0';
-		tmp = strchr(kbuf, '\n');
-		if (tmp) {
-			*tmp = '\0';
-			size = tmp - kbuf + 1;
-		} else if (done + size < count) {
-			pr_warning("Line length is too long: "
-				   "Should be less than %d.", WRITE_BUFSIZE);
-			ret = -EINVAL;
-			goto out;
-		}
-		done += size;
-		/* Remove comments */
-		tmp = strchr(kbuf, '#');
-		if (tmp)
-			*tmp = '\0';
-
-		ret = command_trace_probe(kbuf);
-		if (ret)
-			goto out;
-	}
-	ret = done;
-out:
-	kfree(kbuf);
-	return ret;
+	return traceprobe_probes_write(file, buffer, count, ppos,
+			create_trace_probe);
 }
 
 static const struct file_operations kprobe_events_ops = {
@@ -1711,16 +867,6 @@ print_kretprobe_event(struct trace_iterator *iter, int flags,
 	return TRACE_TYPE_PARTIAL_LINE;
 }
 
-#undef DEFINE_FIELD
-#define DEFINE_FIELD(type, item, name, is_signed)			\
-	do {								\
-		ret = trace_define_field(event_call, #type, name,	\
-					 offsetof(typeof(field), item),	\
-					 sizeof(field.item), is_signed, \
-					 FILTER_OTHER);			\
-		if (ret)						\
-			return ret;					\
-	} while (0)
 
 static int kprobe_event_define_fields(struct ftrace_event_call *event_call)
 {
@@ -2045,8 +1191,9 @@ static __init int kprobe_trace_self_tests_init(void)
 
 	pr_info("Testing kprobe tracing: ");
 
-	ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target "
-				  "$stack $stack0 +0($stack)");
+	ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target "
+				  "$stack $stack0 +0($stack)",
+				  create_trace_probe);
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on probing function entry.\n");
 		warn++;
@@ -2060,8 +1207,8 @@ static __init int kprobe_trace_self_tests_init(void)
 			enable_trace_probe(tp, TP_FLAG_TRACE);
 	}
 
-	ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target "
-				  "$retval");
+	ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target "
+				  "$retval", create_trace_probe);
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on probing function return.\n");
 		warn++;
@@ -2095,13 +1242,13 @@ static __init int kprobe_trace_self_tests_init(void)
 	} else
 		disable_trace_probe(tp, TP_FLAG_TRACE);
 
-	ret = command_trace_probe("-:testprobe");
+	ret = traceprobe_command("-:testprobe", create_trace_probe);
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on deleting a probe.\n");
 		warn++;
 	}
 
-	ret = command_trace_probe("-:testprobe2");
+	ret = traceprobe_command("-:testprobe2", create_trace_probe);
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on deleting a probe.\n");
 		warn++;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
new file mode 100644
index 0000000..07790f1
--- /dev/null
+++ b/kernel/trace/trace_probe.c
@@ -0,0 +1,779 @@
+/*
+ * Common code for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:     Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
+ */
+
+#include "trace_probe.h"
+
+const char *reserved_field_names[] = {
+	"common_type",
+	"common_flags",
+	"common_preempt_count",
+	"common_pid",
+	"common_tgid",
+	FIELD_STRING_IP,
+	FIELD_STRING_RETIP,
+	FIELD_STRING_FUNC,
+};
+
+/* Printing function type */
+#define PRINT_TYPE_FUNC_NAME(type)	print_type_##type
+#define PRINT_TYPE_FMT_NAME(type)	print_type_format_##type
+
+/* Print function template for basic types */
+#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast)			\
+static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s,	\
+						const char *name,	\
+						void *data, void *ent)\
+{									\
+	return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
+}									\
+static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
+
+DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
+
+static inline void *get_rloc_data(u32 *dl)
+{
+	return (u8 *)dl + get_rloc_offs(*dl);
+}
+
+/* For data_loc conversion */
+static inline void *get_loc_data(u32 *dl, void *ent)
+{
+	return (u8 *)ent + get_rloc_offs(*dl);
+}
+
+/* For defining macros, define string/string_size types */
+typedef u32 string;
+typedef u32 string_size;
+
+/* Print type function for string type */
+static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
+						  const char *name,
+						  void *data, void *ent)
+{
+	int len = *(u32 *)data >> 16;
+
+	if (!len)
+		return trace_seq_printf(s, " %s=(fault)", name);
+	else
+		return trace_seq_printf(s, " %s=\"%s\"", name,
+					(const char *)get_loc_data(data, ent));
+}
+
+static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
+
+#define FETCH_FUNC_NAME(method, type)	fetch_##method##_##type
+/*
+ * Define macro for basic types - we don't need to define s* types, because
+ * we have to care only about bitwidth at recording time.
+ */
+#define DEFINE_BASIC_FETCH_FUNCS(method) \
+DEFINE_FETCH_##method(u8)		\
+DEFINE_FETCH_##method(u16)		\
+DEFINE_FETCH_##method(u32)		\
+DEFINE_FETCH_##method(u64)
+
+#define CHECK_FETCH_FUNCS(method, fn)			\
+	(((FETCH_FUNC_NAME(method, u8) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u16) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u32) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u64) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, string) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, string_size) == fn)) \
+	 && (fn != NULL))
+
+/* Data fetch function templates */
+#define DEFINE_FETCH_reg(type)						\
+static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs,	\
+					void *offset, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_get_register(regs,			\
+				(unsigned int)((unsigned long)offset));	\
+}
+DEFINE_BASIC_FETCH_FUNCS(reg)
+/* No string on the register */
+#define fetch_reg_string NULL
+#define fetch_reg_string_size NULL
+
+#define DEFINE_FETCH_stack(type)					\
+static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
+					  void *offset, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_get_kernel_stack_nth(regs,		\
+				(unsigned int)((unsigned long)offset));	\
+}
+DEFINE_BASIC_FETCH_FUNCS(stack)
+/* No string on the stack entry */
+#define fetch_stack_string NULL
+#define fetch_stack_string_size NULL
+
+#define DEFINE_FETCH_retval(type)					\
+static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
+					  void *dummy, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_return_value(regs);			\
+}
+DEFINE_BASIC_FETCH_FUNCS(retval)
+/* No string on the retval */
+#define fetch_retval_string NULL
+#define fetch_retval_string_size NULL
+
+#define DEFINE_FETCH_memory(type)					\
+static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
+					  void *addr, void *dest)	\
+{									\
+	type retval;							\
+	if (probe_kernel_address(addr, retval))				\
+		*(type *)dest = 0;					\
+	else								\
+		*(type *)dest = retval;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(memory)
+/*
+ * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
+ * length and relative data location.
+ */
+static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
+						      void *addr, void *dest)
+{
+	long ret;
+	int maxlen = get_rloc_len(*(u32 *)dest);
+	u8 *dst = get_rloc_data(dest);
+	u8 *src = addr;
+	mm_segment_t old_fs = get_fs();
+	if (!maxlen)
+		return;
+	/*
+	 * Try to get string again, since the string can be changed while
+	 * probing.
+	 */
+	set_fs(KERNEL_DS);
+	pagefault_disable();
+	do
+		ret = __copy_from_user_inatomic(dst++, src++, 1);
+	while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
+	dst[-1] = '\0';
+	pagefault_enable();
+	set_fs(old_fs);
+
+	if (ret < 0) {	/* Failed to fetch string */
+		((u8 *)get_rloc_data(dest))[0] = '\0';
+		*(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
+	} else
+		*(u32 *)dest = make_data_rloc(src - (u8 *)addr,
+					      get_rloc_offs(*(u32 *)dest));
+}
+
+/* Return the length of the string, including the terminating null byte */
+static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
+							void *addr, void *dest)
+{
+	int ret, len = 0;
+	u8 c;
+	mm_segment_t old_fs = get_fs();
+
+	set_fs(KERNEL_DS);
+	pagefault_disable();
+	do {
+		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+		len++;
+	} while (c && ret == 0 && len < MAX_STRING_SIZE);
+	pagefault_enable();
+	set_fs(old_fs);
+
+	if (ret < 0)	/* Failed to check the length */
+		*(u32 *)dest = 0;
+	else
+		*(u32 *)dest = len;
+}
+
+/* Memory fetching by symbol */
+struct symbol_cache {
+	char *symbol;
+	long offset;
+	unsigned long addr;
+};
+
+static unsigned long update_symbol_cache(struct symbol_cache *sc)
+{
+	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
+	if (sc->addr)
+		sc->addr += sc->offset;
+	return sc->addr;
+}
+
+static void free_symbol_cache(struct symbol_cache *sc)
+{
+	kfree(sc->symbol);
+	kfree(sc);
+}
+
+static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
+{
+	struct symbol_cache *sc;
+
+	if (!sym || strlen(sym) == 0)
+		return NULL;
+	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
+	if (!sc)
+		return NULL;
+
+	sc->symbol = kstrdup(sym, GFP_KERNEL);
+	if (!sc->symbol) {
+		kfree(sc);
+		return NULL;
+	}
+	sc->offset = offset;
+
+	update_symbol_cache(sc);
+	return sc;
+}
+
+#define DEFINE_FETCH_symbol(type)					\
+static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
+					  void *data, void *dest)	\
+{									\
+	struct symbol_cache *sc = data;					\
+	if (sc->addr)							\
+		fetch_memory_##type(regs, (void *)sc->addr, dest);	\
+	else								\
+		*(type *)dest = 0;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(symbol)
+DEFINE_FETCH_symbol(string)
+DEFINE_FETCH_symbol(string_size)
+
+/* Dereference memory access function */
+struct deref_fetch_param {
+	struct fetch_param orig;
+	long offset;
+};
+
+#define DEFINE_FETCH_deref(type)					\
+static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
+					    void *data, void *dest)	\
+{									\
+	struct deref_fetch_param *dprm = data;				\
+	unsigned long addr;						\
+	call_fetch(&dprm->orig, regs, &addr);				\
+	if (addr) {							\
+		addr += dprm->offset;					\
+		fetch_memory_##type(regs, (void *)addr, dest);		\
+	} else								\
+		*(type *)dest = 0;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(deref)
+DEFINE_FETCH_deref(string)
+DEFINE_FETCH_deref(string_size)
+
+static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data)
+{
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		update_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		update_symbol_cache(data->orig.data);
+}
+
+static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
+{
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		free_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		free_symbol_cache(data->orig.data);
+	kfree(data);
+}
+
+/* Bitfield fetch function */
+struct bitfield_fetch_param {
+	struct fetch_param orig;
+	unsigned char hi_shift;
+	unsigned char low_shift;
+};
+
+#define DEFINE_FETCH_bitfield(type)					\
+static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
+					    void *data, void *dest)	\
+{									\
+	struct bitfield_fetch_param *bprm = data;			\
+	type buf = 0;							\
+	call_fetch(&bprm->orig, regs, &buf);				\
+	if (buf) {							\
+		buf <<= bprm->hi_shift;					\
+		buf >>= bprm->low_shift;				\
+	}								\
+	*(type *)dest = buf;						\
+}
+
+DEFINE_BASIC_FETCH_FUNCS(bitfield)
+#define fetch_bitfield_string NULL
+#define fetch_bitfield_string_size NULL
+
+static __kprobes void
+update_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+	/*
+	 * Don't check the bitfield itself, because this must be the
+	 * last fetch function.
+	 */
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		update_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		update_symbol_cache(data->orig.data);
+}
+
+static __kprobes void
+free_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+	/*
+	 * Don't check the bitfield itself, because this must be the
+	 * last fetch function.
+	 */
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		free_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		free_symbol_cache(data->orig.data);
+	kfree(data);
+}
+
+/* Default (unsigned long) fetch type */
+#define __DEFAULT_FETCH_TYPE(t) u##t
+#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
+#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
+#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
+
+#define ASSIGN_FETCH_FUNC(method, type)	\
+	[FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
+
+#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype)	\
+	{.name = _name,				\
+	 .size = _size,					\
+	 .is_signed = sign,				\
+	 .print = PRINT_TYPE_FUNC_NAME(ptype),		\
+	 .fmt = PRINT_TYPE_FMT_NAME(ptype),		\
+	 .fmttype = _fmttype,				\
+	 .fetch = {					\
+ASSIGN_FETCH_FUNC(reg, ftype),				\
+ASSIGN_FETCH_FUNC(stack, ftype),			\
+ASSIGN_FETCH_FUNC(retval, ftype),			\
+ASSIGN_FETCH_FUNC(memory, ftype),			\
+ASSIGN_FETCH_FUNC(symbol, ftype),			\
+ASSIGN_FETCH_FUNC(deref, ftype),			\
+ASSIGN_FETCH_FUNC(bitfield, ftype),			\
+	  }						\
+	}
+
+#define ASSIGN_FETCH_TYPE(ptype, ftype, sign)			\
+	__ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
+
+#define FETCH_TYPE_STRING 0
+#define FETCH_TYPE_STRSIZE 1
+
+/* Fetch type information table */
+static const struct fetch_type fetch_type_table[] = {
+	/* Special types */
+	[FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
+					sizeof(u32), 1, "__data_loc char[]"),
+	[FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
+					string_size, sizeof(u32), 0, "u32"),
+	/* Basic types */
+	ASSIGN_FETCH_TYPE(u8,  u8,  0),
+	ASSIGN_FETCH_TYPE(u16, u16, 0),
+	ASSIGN_FETCH_TYPE(u32, u32, 0),
+	ASSIGN_FETCH_TYPE(u64, u64, 0),
+	ASSIGN_FETCH_TYPE(s8,  u8,  1),
+	ASSIGN_FETCH_TYPE(s16, u16, 1),
+	ASSIGN_FETCH_TYPE(s32, u32, 1),
+	ASSIGN_FETCH_TYPE(s64, u64, 1),
+};
+
+static const struct fetch_type *find_fetch_type(const char *type)
+{
+	int i;
+
+	if (!type)
+		type = DEFAULT_FETCH_TYPE_STR;
+
+	/* Special case: bitfield */
+	if (*type == 'b') {
+		unsigned long bs;
+		type = strchr(type, '/');
+		if (!type)
+			goto fail;
+		type++;
+		if (strict_strtoul(type, 0, &bs))
+			goto fail;
+		switch (bs) {
+		case 8:
+			return find_fetch_type("u8");
+		case 16:
+			return find_fetch_type("u16");
+		case 32:
+			return find_fetch_type("u32");
+		case 64:
+			return find_fetch_type("u64");
+		default:
+			goto fail;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
+		if (strcmp(type, fetch_type_table[i].name) == 0)
+			return &fetch_type_table[i];
+fail:
+	return NULL;
+}
+
+/* Special function : only accept unsigned long */
+static __kprobes void fetch_stack_address(struct pt_regs *regs,
+					void *dummy, void *dest)
+{
+	*(unsigned long *)dest = kernel_stack_pointer(regs);
+}
+
+static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
+					fetch_func_t orig_fn)
+{
+	int i;
+
+	if (type != &fetch_type_table[FETCH_TYPE_STRING])
+		return NULL;	/* Only string type needs size function */
+	for (i = 0; i < FETCH_MTD_END; i++)
+		if (type->fetch[i] == orig_fn)
+			return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
+
+	WARN_ON(1);	/* This should not happen */
+	return NULL;
+}
+
+/* Split symbol and offset. */
+int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset)
+{
+	char *tmp;
+	int ret;
+
+	if (!offset)
+		return -EINVAL;
+
+	tmp = strchr(symbol, '+');
+	if (tmp) {
+		/* skip the '+' sign before parsing the offset */
+		ret = strict_strtoul(tmp + 1, 0, offset);
+		if (ret)
+			return ret;
+		*tmp = '\0';
+	} else
+		*offset = 0;
+	return 0;
+}
+
+#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
+
+static int parse_probe_vars(char *arg, const struct fetch_type *t,
+			    struct fetch_param *f, bool is_return)
+{
+	int ret = 0;
+	unsigned long param;
+
+	if (strcmp(arg, "retval") == 0) {
+		if (is_return)
+			f->fn = t->fetch[FETCH_MTD_retval];
+		else
+			ret = -EINVAL;
+	} else if (strncmp(arg, "stack", 5) == 0) {
+		if (arg[5] == '\0') {
+			if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
+				f->fn = fetch_stack_address;
+			else
+				ret = -EINVAL;
+		} else if (isdigit(arg[5])) {
+			ret = strict_strtoul(arg + 5, 10, &param);
+			if (ret || param > PARAM_MAX_STACK)
+				ret = -EINVAL;
+			else {
+				f->fn = t->fetch[FETCH_MTD_stack];
+				f->data = (void *)param;
+			}
+		} else
+			ret = -EINVAL;
+	} else
+		ret = -EINVAL;
+	return ret;
+}
+
+/* Recursive argument parser */
+static int parse_probe_arg(char *arg, const struct fetch_type *t,
+		     struct fetch_param *f, bool is_return)
+{
+	int ret = 0;
+	unsigned long param;
+	long offset;
+	char *tmp;
+
+	switch (arg[0]) {
+	case '$':
+		ret = parse_probe_vars(arg + 1, t, f, is_return);
+		break;
+	case '%':	/* named register */
+		ret = regs_query_register_offset(arg + 1);
+		if (ret >= 0) {
+			f->fn = t->fetch[FETCH_MTD_reg];
+			f->data = (void *)(unsigned long)ret;
+			ret = 0;
+		}
+		break;
+	case '@':	/* memory or symbol */
+		if (isdigit(arg[1])) {
+			ret = strict_strtoul(arg + 1, 0, &param);
+			if (ret)
+				break;
+			f->fn = t->fetch[FETCH_MTD_memory];
+			f->data = (void *)param;
+		} else {
+			ret = traceprobe_split_symbol_offset(arg + 1, &offset);
+			if (ret)
+				break;
+			f->data = alloc_symbol_cache(arg + 1, offset);
+			if (f->data)
+				f->fn = t->fetch[FETCH_MTD_symbol];
+		}
+		break;
+	case '+':	/* deref memory */
+		arg++;	/* Skip '+', because strict_strtol() rejects it. */
+	case '-':
+		tmp = strchr(arg, '(');
+		if (!tmp)
+			break;
+		*tmp = '\0';
+		ret = strict_strtol(arg, 0, &offset);
+		if (ret)
+			break;
+		arg = tmp + 1;
+		tmp = strrchr(arg, ')');
+		if (tmp) {
+			struct deref_fetch_param *dprm;
+			const struct fetch_type *t2 = find_fetch_type(NULL);
+			*tmp = '\0';
+			dprm = kzalloc(sizeof(struct deref_fetch_param),
+				       GFP_KERNEL);
+			if (!dprm)
+				return -ENOMEM;
+			dprm->offset = offset;
+			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+			if (ret)
+				kfree(dprm);
+			else {
+				f->fn = t->fetch[FETCH_MTD_deref];
+				f->data = (void *)dprm;
+			}
+		}
+		break;
+	}
+	if (!ret && !f->fn) {	/* Parsed, but found no fetch method */
+		pr_info("%s type has no corresponding fetch method.\n",
+			t->name);
+		ret = -EINVAL;
+	}
+	return ret;
+}
+
+#define BYTES_TO_BITS(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
+
+/* Bitfield type needs to be parsed into a fetch function */
+static int __parse_bitfield_probe_arg(const char *bf,
+				      const struct fetch_type *t,
+				      struct fetch_param *f)
+{
+	struct bitfield_fetch_param *bprm;
+	unsigned long bw, bo;
+	char *tail;
+
+	if (*bf != 'b')
+		return 0;
+
+	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+	if (!bprm)
+		return -ENOMEM;
+	bprm->orig = *f;
+	f->fn = t->fetch[FETCH_MTD_bitfield];
+	f->data = (void *)bprm;
+
+	bw = simple_strtoul(bf + 1, &tail, 0);	/* Use simple one */
+	if (bw == 0 || *tail != '@')
+		return -EINVAL;
+
+	bf = tail + 1;
+	bo = simple_strtoul(bf, &tail, 0);
+	if (tail == bf || *tail != '/')
+		return -EINVAL;
+
+	bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
+	bprm->low_shift = bprm->hi_shift + bo;
+	return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
+}
+
+/* String length checking wrapper */
+int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+		struct probe_arg *parg, bool is_return)
+{
+	const char *t;
+	int ret;
+
+	if (strlen(arg) > MAX_ARGSTR_LEN) {
+		pr_info("Argument is too long.: %s\n",  arg);
+		return -ENOSPC;
+	}
+	parg->comm = kstrdup(arg, GFP_KERNEL);
+	if (!parg->comm) {
+		pr_info("Failed to allocate memory for command '%s'.\n", arg);
+		return -ENOMEM;
+	}
+	t = strchr(parg->comm, ':');
+	if (t) {
+		arg[t - parg->comm] = '\0';
+		t++;
+	}
+	parg->type = find_fetch_type(t);
+	if (!parg->type) {
+		pr_info("Unsupported type: %s\n", t);
+		return -EINVAL;
+	}
+	parg->offset = *size;
+	*size += parg->type->size;
+	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+	if (ret >= 0 && t != NULL)
+		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
+	if (ret >= 0) {
+		parg->fetch_size.fn = get_fetch_size_function(parg->type,
+							      parg->fetch.fn);
+		parg->fetch_size.data = parg->fetch.data;
+	}
+	return ret;
+}
+
+/* Return 1 if name is reserved or already used by another argument */
+int traceprobe_conflict_field_name(const char *name,
+			       struct probe_arg *args, int narg)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
+		if (strcmp(reserved_field_names[i], name) == 0)
+			return 1;
+	for (i = 0; i < narg; i++)
+		if (strcmp(args[i].name, name) == 0)
+			return 1;
+	return 0;
+}
+
+void traceprobe_update_arg(struct probe_arg *arg)
+{
+	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+		update_bitfield_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+		update_deref_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+		update_symbol_cache(arg->fetch.data);
+}
+
+void traceprobe_free_probe_arg(struct probe_arg *arg)
+{
+	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+		free_bitfield_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+		free_deref_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+		free_symbol_cache(arg->fetch.data);
+	kfree(arg->name);
+	kfree(arg->comm);
+}
+
+int traceprobe_command(const char *buf, int (*createfn)(int, char **))
+{
+	char **argv;
+	int argc = 0, ret = 0;
+
+	argv = argv_split(GFP_KERNEL, buf, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	if (argc)
+		ret = createfn(argc, argv);
+
+	argv_free(argv);
+	return ret;
+}
+
+#define WRITE_BUFSIZE 128
+
+ssize_t traceprobe_probes_write(struct file *file, const char __user *buffer,
+				size_t count, loff_t *ppos,
+				int (*createfn)(int, char **))
+{
+	char *kbuf, *tmp;
+	int ret = 0;
+	size_t done = 0;
+	size_t size;
+
+	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	while (done < count) {
+		size = count - done;
+		if (size >= WRITE_BUFSIZE)
+			size = WRITE_BUFSIZE - 1;
+		if (copy_from_user(kbuf, buffer + done, size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		kbuf[size] = '\0';
+		tmp = strchr(kbuf, '\n');
+		if (tmp) {
+			*tmp = '\0';
+			size = tmp - kbuf + 1;
+		} else if (done + size < count) {
+			pr_warning("Line length is too long: "
+				   "Should be less than %d.", WRITE_BUFSIZE);
+			ret = -EINVAL;
+			goto out;
+		}
+		done += size;
+		/* Remove comments */
+		tmp = strchr(kbuf, '#');
+		if (tmp)
+			*tmp = '\0';
+
+		ret = traceprobe_command(kbuf, createfn);
+		if (ret)
+			goto out;
+	}
+	ret = done;
+out:
+	kfree(kbuf);
+	return ret;
+}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
new file mode 100644
index 0000000..c9db197
--- /dev/null
+++ b/kernel/trace/trace_probe.h
@@ -0,0 +1,160 @@
+/*
+ * Common header file for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:     Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
+ */
+
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/debugfs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/ptrace.h>
+#include <linux/perf_event.h>
+#include <linux/kprobes.h>
+#include <linux/stringify.h>
+#include <linux/limits.h>
+#include <linux/uaccess.h>
+#include <asm/bitsperlong.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+#define MAX_TRACE_ARGS 128
+#define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
+#define MAX_STRING_SIZE PATH_MAX
+
+/* Reserved field names */
+#define FIELD_STRING_IP "__probe_ip"
+#define FIELD_STRING_RETIP "__probe_ret_ip"
+#define FIELD_STRING_FUNC "__probe_func"
+
+#undef DEFINE_FIELD
+#define DEFINE_FIELD(type, item, name, is_signed)			\
+	do {								\
+		ret = trace_define_field(event_call, #type, name,	\
+					 offsetof(typeof(field), item),	\
+					 sizeof(field.item), is_signed, \
+					 FILTER_OTHER);			\
+		if (ret)						\
+			return ret;					\
+	} while (0)
+
+
+/* Flags for trace_probe */
+#define TP_FLAG_TRACE	1
+#define TP_FLAG_PROFILE	2
+#define TP_FLAG_REGISTERED 4
+
+
+/* data_rloc: data relative location, compatible with u32 */
+#define make_data_rloc(len, roffs)	\
+	(((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
+#define get_rloc_len(dl)	((u32)(dl) >> 16)
+#define get_rloc_offs(dl)	((u32)(dl) & 0xffff)
+
+/*
+ * Convert data_rloc to data_loc:
+ *  data_rloc stores the offset from data_rloc itself, but data_loc
+ *  stores the offset from event entry.
+ */
+#define convert_rloc_to_loc(dl, offs)	((u32)(dl) + (offs))
+
+/* Data fetch function type */
+typedef	void (*fetch_func_t)(struct pt_regs *, void *, void *);
+/* Printing function type */
+typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
+				 void *);
+
+/* Fetch types */
+enum {
+	FETCH_MTD_reg = 0,
+	FETCH_MTD_stack,
+	FETCH_MTD_retval,
+	FETCH_MTD_memory,
+	FETCH_MTD_symbol,
+	FETCH_MTD_deref,
+	FETCH_MTD_bitfield,
+	FETCH_MTD_END,
+};
+
+/* Fetch type information table */
+struct fetch_type {
+	const char	*name;		/* Name of type */
+	size_t		size;		/* Byte size of type */
+	int		is_signed;	/* Signed flag */
+	print_type_func_t	print;	/* Print functions */
+	const char	*fmt;		/* Format string */
+	const char	*fmttype;	/* Name in format file */
+	/* Fetch functions */
+	fetch_func_t	fetch[FETCH_MTD_END];
+};
+
+struct fetch_param {
+	fetch_func_t	fn;
+	void *data;
+};
+
+struct probe_arg {
+	struct fetch_param	fetch;
+	struct fetch_param	fetch_size;
+	unsigned int		offset;	/* Offset from argument entry */
+	const char		*name;	/* Name of this argument */
+	const char		*comm;	/* Command of this argument */
+	const struct fetch_type	*type;	/* Type of this argument */
+};
+
+static inline __kprobes void call_fetch(struct fetch_param *fprm,
+				 struct pt_regs *regs, void *dest)
+{
+	return fprm->fn(regs, fprm->data, dest);
+}
+
+/* Check the name is good for event/group/fields */
+static inline int is_good_name(const char *name)
+{
+	if (!isalpha(*name) && *name != '_')
+		return 0;
+	while (*++name != '\0') {
+		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
+			return 0;
+	}
+	return 1;
+}
+
+extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+		   struct probe_arg *parg, bool is_return);
+
+extern int traceprobe_conflict_field_name(const char *name,
+			       struct probe_arg *args, int narg);
+
+extern void traceprobe_update_arg(struct probe_arg *arg);
+extern void traceprobe_free_probe_arg(struct probe_arg *arg);
+
+extern int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset);
+
+extern ssize_t traceprobe_probes_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *ppos,
+		int (*createfn)(int, char**));
+
+extern int traceprobe_command(const char *buf, int (*createfn)(int, char**));



* [PATCH v7 3.2-rc2 21/30] tracing: uprobes trace_event interface
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (19 preceding siblings ...)
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 20/30] tracing: Extract out common code for kprobes/uprobes traceevents Srikar Dronamraju
@ 2011-11-18 11:10 ` Srikar Dronamraju
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 22/30] perf: rename target_module to target Srikar Dronamraju
                   ` (10 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:10 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Implements trace_event support for uprobes. In its current form it can
be used to put probes at a specified offset in a file and dump the
required registers when the code flow reaches the probed address.

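The following example shows how to dump the instruction pointer and the
%ax register at the probed text address. Here we probe zfree in
/bin/zsh. The probe offset, 0x46420, is the address of zfree (0x446420)
minus the start of the text mapping (0x00400000).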

# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g    DF .text  0000000000000012  Base        zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
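
The offset used above can be derived from the maps and objdump output.
The following is a minimal userspace sketch, not part of this patch. The
helper name vaddr_to_probe_offset is hypothetical, and the sketch assumes
that the symbol address reported by objdump equals the runtime address,
as is the case for the non-relocated /bin/zsh shown here.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of this patch): translate a virtual
 * address seen in a running process into the file offset that
 * uprobe_events expects.
 *
 * vaddr     - address of the symbol, e.g. from objdump -T
 * map_start - start of the executable mapping, from /proc/<pid>/maps
 * map_pgoff - file offset of that mapping (third column of maps)
 */
static unsigned long vaddr_to_probe_offset(unsigned long vaddr,
					   unsigned long map_start,
					   unsigned long map_pgoff)
{
	/* The address must not lie below the start of the mapping. */
	assert(vaddr >= map_start);

	return vaddr - map_start + map_pgoff;
}
```

With the values from the example, vaddr_to_probe_offset(0x446420,
0x400000, 0x0) gives 0x46420, which is the offset written to
uprobe_events.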

TODO: Connect a filter to a consumer.
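
When no event name is given, the name listed in uprobe_events
(p_zsh_0x46420 in the example) is generated from the file name and the
offset. The sketch below mirrors that logic from create_trace_uprobe() in
userspace C. The function name default_event_name is hypothetical, and
unlike the kernel code it truncates long file names to fit a fixed
buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_EVENT_NAME_LEN 64

/*
 * Userspace sketch of the default event naming in create_trace_uprobe().
 * Writes "p_<name>_0x<offset>" into buf, where <name> is the last path
 * component cut at the first '.', '-' or '_'.
 */
static void default_event_name(const char *path, unsigned long offset,
			       char *buf, size_t len)
{
	const char *tail;
	char name[MAX_EVENT_NAME_LEN];
	char *cut;

	assert(path && buf && len > 0);

	/* Take the last path component, or the whole string if no '/'. */
	tail = strrchr(path, '/');
	snprintf(name, sizeof(name), "%s", tail ? tail + 1 : path);

	/* Cut at the first '.', '-' or '_', as the kernel code does. */
	cut = strpbrk(name, ".-_");
	if (cut)
		*cut = '\0';

	snprintf(buf, len, "p_%s_0x%lx", name, offset);
}
```

For /bin/zsh and offset 0x46420 this produces p_zsh_0x46420, matching
the uprobe_events listing.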

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- Added uprobe tracer documentation to this patch.

 Documentation/trace/uprobetracer.txt |   93 ++++
 arch/Kconfig                         |    8 
 kernel/trace/Kconfig                 |   16 +
 kernel/trace/Makefile                |    1 
 kernel/trace/trace.h                 |    5 
 kernel/trace/trace_kprobe.c          |    4 
 kernel/trace/trace_probe.c           |   14 -
 kernel/trace/trace_probe.h           |    3 
 kernel/trace/trace_uprobe.c          |  768 ++++++++++++++++++++++++++++++++++
 9 files changed, 898 insertions(+), 14 deletions(-)
 create mode 100644 Documentation/trace/uprobetracer.txt
 create mode 100644 kernel/trace/trace_uprobe.c

diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
new file mode 100644
index 0000000..457932f
--- /dev/null
+++ b/Documentation/trace/uprobetracer.txt
@@ -0,0 +1,93 @@
+		Uprobe-tracer: Uprobe-based Event Tracing
+		=========================================
+                 Documentation is written by Srikar Dronamraju
+
+Overview
+--------
+These events are similar to kprobe based events.
+To enable this feature, build your kernel with CONFIG_UPROBE_EVENT=y.
+
+Similar to the kprobe-event tracer, this doesn't need to be activated via
+current_tracer. Instead, add probe points via
+/sys/kernel/debug/tracing/uprobe_events, and enable them via
+/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enable.
+
+
+Synopsis of uprobe_tracer
+-------------------------
+  p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS]	: Set a probe
+
+ GRP		: Group name. If omitted, use "uprobes" for it.
+ EVENT		: Event name. If omitted, the event name is generated
+		  based on PATH and OFFSET.
+ PATH		: Path to an executable or a library.
+ OFFSET		: Offset in the file where the probe is inserted.
+
+ FETCHARGS	: Arguments. Each probe can have up to 128 args.
+  %REG		: Fetch register REG
+
+Event Profiling
+---------------
+ You can check the total number of probe hits via
+/sys/kernel/debug/tracing/uprobe_profile.
+ The first column is the file name, the second is the event name, and
+the third is the number of probe hits.
+
+Usage examples
+--------------
+To add a probe as a new event, write a new definition to uprobe_events
+as below.
+
+  echo 'p /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
+
+ This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash.
+
+
+  echo > /sys/kernel/debug/tracing/uprobe_events
+
+ This clears all probe points.
+
+The following example shows how to dump the instruction pointer and the
+%ax register at the probed text address.  Here we are trying to probe
+function zfree in /bin/zsh.
+
+    # cd /sys/kernel/debug/tracing/
+    # cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
+    00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
+    # objdump -T /bin/zsh | grep -w zfree
+    0000000000446420 g    DF .text  0000000000000012  Base        zfree
+
+0x46420 is the offset of zfree in the object /bin/zsh, which is loaded at
+0x00400000. Hence the command to probe would be:
+
+    # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
+
+We can see the events that are registered by looking at the uprobe_events
+file.
+
+    # cat uprobe_events
+    p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
+
+Right after definition, each event is disabled by default. To trace these
+events, you need to enable them with:
+
+    # echo 1 > events/uprobes/enable
+
+Let's disable the event after sleeping for some time.
+    # sleep 20
+    # echo 0 > events/uprobes/enable
+
+And you can see the traced information via /sys/kernel/debug/tracing/trace.
+
+    # cat trace
+    # tracer: nop
+    #
+    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
+    #              | |       |          |         |
+                 zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+
+Each line shows that the probe was hit by pid 24842, with ip being
+0x446421 and the contents of the ax register being 79.
diff --git a/arch/Kconfig b/arch/Kconfig
index dedd489..6c6df9f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -62,13 +62,7 @@ config OPTPROBES
 	depends on !PREEMPT
 
 config UPROBES
-	bool "User-space probes (EXPERIMENTAL)"
-	help
-	  Uprobes enables kernel subsystems to establish probepoints
-	  in user applications and execute handler functions when
-	  the probepoints are hit.
-
-	  If in doubt, say "N".
+	def_bool n
 
 config HAVE_EFFICIENT_UNALIGNED_ACCESS
 	bool
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 520106a..b001fb1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -386,6 +386,22 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config UPROBE_EVENT
+	bool "Enable uprobes-based dynamic events"
+	depends on ARCH_SUPPORTS_UPROBES
+	depends on MMU
+	select UPROBES
+	select PROBE_EVENTS
+	select TRACING
+	default n
+	help
+	  This allows the user to add tracing events on top of userspace dynamic
+	  events (similar to tracepoints) on the fly via the trace events interface.
+	  Those events can be inserted wherever uprobes can probe, and record
+	  various registers.
+	  This option is required if you plan to use the perf-probe subcommand of
+	  perf tools on user space applications.
+
 config PROBE_EVENTS
 	def_bool n
 
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index fa10d5c..1734c03 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -62,5 +62,6 @@ ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
 obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
+obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 092e1f8..f5f7bb3 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -97,6 +97,11 @@ struct kretprobe_trace_entry_head {
 	unsigned long		ret_ip;
 };
 
+struct uprobe_trace_entry_head {
+	struct trace_entry	ent;
+	unsigned long		ip;
+};
+
 /*
  * trace_flag_type is an enumeration that holds different
  * states when a trace occurs. These are:
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 967e634..60384df 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -524,8 +524,8 @@ static int create_trace_probe(int argc, char **argv)
 		}
 
 		/* Parse fetch argument */
-		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
-								is_return);
+		ret = traceprobe_parse_probe_arg(arg, &tp->size,
+					&tp->args[i], is_return, true);
 		if (ret) {
 			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
 			goto error;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 07790f1..a07420e 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -528,13 +528,17 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,
 
 /* Recursive argument parser */
 static int parse_probe_arg(char *arg, const struct fetch_type *t,
-		     struct fetch_param *f, bool is_return)
+		     struct fetch_param *f, bool is_return, bool is_kprobe)
 {
 	int ret = 0;
 	unsigned long param;
 	long offset;
 	char *tmp;
 
+	/* For now, uprobe_events supports only register arguments */
+	if (!is_kprobe && arg[0] != '%')
+		return -EINVAL;
+
 	switch (arg[0]) {
 	case '$':
 		ret = parse_probe_vars(arg + 1, t, f, is_return);
@@ -584,7 +588,8 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t,
 			if (!dprm)
 				return -ENOMEM;
 			dprm->offset = offset;
-			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return,
+							is_kprobe);
 			if (ret)
 				kfree(dprm);
 			else {
@@ -639,7 +644,7 @@ static int __parse_bitfield_probe_arg(const char *bf,
 
 /* String length checking wrapper */
 int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
-		struct probe_arg *parg, bool is_return)
+		struct probe_arg *parg, bool is_return, bool is_kprobe)
 {
 	const char *t;
 	int ret;
@@ -665,7 +670,8 @@ int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
 	}
 	parg->offset = *size;
 	*size += parg->type->size;
-	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return,
+							is_kprobe);
 	if (ret >= 0 && t != NULL)
 		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
 	if (ret >= 0) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index c9db197..832668f 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -65,6 +65,7 @@
 #define TP_FLAG_TRACE	1
 #define TP_FLAG_PROFILE	2
 #define TP_FLAG_REGISTERED 4
+#define TP_FLAG_UPROBE	8
 
 
 /* data_rloc: data relative location, compatible with u32 */
@@ -143,7 +144,7 @@ static inline int is_good_name(const char *name)
 }
 
 extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
-		   struct probe_arg *parg, bool is_return);
+		   struct probe_arg *parg, bool is_return, bool is_kprobe);
 
 extern int traceprobe_conflict_field_name(const char *name,
 			       struct probe_arg *args, int narg);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
new file mode 100644
index 0000000..af29368
--- /dev/null
+++ b/kernel/trace/trace_uprobe.c
@@ -0,0 +1,768 @@
+/*
+ * uprobes-based tracing events
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:	Srikar Dronamraju
+ */
+
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/uprobes.h>
+#include <linux/namei.h>
+
+#include "trace_probe.h"
+
+#define UPROBE_EVENT_SYSTEM "uprobes"
+
+/**
+ * uprobe event core functions
+ */
+struct trace_uprobe;
+struct uprobe_trace_consumer {
+	struct uprobe_consumer cons;
+	struct trace_uprobe *tp;
+};
+
+struct trace_uprobe {
+	struct list_head	list;
+	struct ftrace_event_class	class;
+	struct ftrace_event_call	call;
+	struct uprobe_trace_consumer	*consumer;
+	struct inode		*inode;
+	char			*filename;
+	unsigned long		offset;
+	unsigned long		nhit;
+	unsigned int		flags;	/* For TP_FLAG_* */
+	ssize_t			size;		/* trace entry size */
+	unsigned int		nr_args;
+	struct probe_arg	args[];
+};
+
+#define SIZEOF_TRACE_UPROBE(n)			\
+	(offsetof(struct trace_uprobe, args) +	\
+	(sizeof(struct probe_arg) * (n)))
+
+static int register_uprobe_event(struct trace_uprobe *tp);
+static void unregister_uprobe_event(struct trace_uprobe *tp);
+
+static DEFINE_MUTEX(uprobe_lock);
+static LIST_HEAD(uprobe_list);
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs);
+
+/*
+ * Allocate new trace_uprobe and initialize it (including uprobes).
+ */
+static struct trace_uprobe *alloc_trace_uprobe(const char *group,
+				const char *event, int nargs)
+{
+	struct trace_uprobe *tp;
+
+	if (!event || !is_good_name(event))
+		return ERR_PTR(-EINVAL);
+
+	if (!group || !is_good_name(group))
+		return ERR_PTR(-EINVAL);
+
+	tp = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
+	if (!tp)
+		return ERR_PTR(-ENOMEM);
+
+	tp->call.class = &tp->class;
+	tp->call.name = kstrdup(event, GFP_KERNEL);
+	if (!tp->call.name)
+		goto error;
+
+	tp->class.system = kstrdup(group, GFP_KERNEL);
+	if (!tp->class.system)
+		goto error;
+
+	INIT_LIST_HEAD(&tp->list);
+	return tp;
+error:
+	kfree(tp->call.name);
+	kfree(tp);
+	return ERR_PTR(-ENOMEM);
+}
+
+static void free_trace_uprobe(struct trace_uprobe *tp)
+{
+	int i;
+
+	for (i = 0; i < tp->nr_args; i++)
+		traceprobe_free_probe_arg(&tp->args[i]);
+
+	iput(tp->inode);
+	kfree(tp->call.class->system);
+	kfree(tp->call.name);
+	kfree(tp->filename);
+	kfree(tp);
+}
+
+static struct trace_uprobe *find_probe_event(const char *event,
+					const char *group)
+{
+	struct trace_uprobe *tp;
+
+	list_for_each_entry(tp, &uprobe_list, list)
+		if (strcmp(tp->call.name, event) == 0 &&
+		    strcmp(tp->call.class->system, group) == 0)
+			return tp;
+	return NULL;
+}
+
+/* Unregister a trace_uprobe and probe_event: call with uprobe_lock held */
+static void unregister_trace_uprobe(struct trace_uprobe *tp)
+{
+	list_del(&tp->list);
+	unregister_uprobe_event(tp);
+	free_trace_uprobe(tp);
+}
+
+/* Register a trace_uprobe and probe_event */
+static int register_trace_uprobe(struct trace_uprobe *tp)
+{
+	struct trace_uprobe *old_tp;
+	int ret;
+
+	mutex_lock(&uprobe_lock);
+
+	/* register as an event */
+	old_tp = find_probe_event(tp->call.name, tp->call.class->system);
+	if (old_tp)
+		/* delete old event */
+		unregister_trace_uprobe(old_tp);
+
+	ret = register_uprobe_event(tp);
+	if (ret) {
+		pr_warning("Failed to register probe event(%d)\n", ret);
+		goto end;
+	}
+
+	list_add_tail(&tp->list, &uprobe_list);
+end:
+	mutex_unlock(&uprobe_lock);
+	return ret;
+}
+
+static int create_trace_uprobe(int argc, char **argv)
+{
+	/*
+	 * Argument syntax:
+	 *  - Add uprobe: p[:[GRP/]EVENT] PATH:OFFSET [%REG]
+	 *
+	 *  - Remove uprobe: -:[GRP/]EVENT
+	 */
+	struct path path;
+	struct inode *inode = NULL;
+	struct trace_uprobe *tp;
+	int i, ret = 0;
+	int is_delete = 0;
+	char *arg = NULL, *event = NULL, *group = NULL;
+	unsigned long offset;
+	char buf[MAX_EVENT_NAME_LEN];
+	char *filename;
+
+	/* argc must be >= 1 */
+	if (argv[0][0] == '-')
+		is_delete = 1;
+	else if (argv[0][0] != 'p') {
+		pr_info("Probe definition must be started with 'p' or"
+			" '-'.\n");
+		return -EINVAL;
+	}
+
+	if (argv[0][1] == ':') {
+		event = &argv[0][2];
+		if (strchr(event, '/')) {
+			group = event;
+			event = strchr(group, '/') + 1;
+			event[-1] = '\0';
+			if (strlen(group) == 0) {
+				pr_info("Group name is not specified\n");
+				return -EINVAL;
+			}
+		}
+		if (strlen(event) == 0) {
+			pr_info("Event name is not specified\n");
+			return -EINVAL;
+		}
+	}
+	if (!group)
+		group = UPROBE_EVENT_SYSTEM;
+
+	if (is_delete) {
+		if (!event) {
+			pr_info("Delete command needs an event name.\n");
+			return -EINVAL;
+		}
+		mutex_lock(&uprobe_lock);
+		tp = find_probe_event(event, group);
+		if (!tp) {
+			mutex_unlock(&uprobe_lock);
+			pr_info("Event %s/%s doesn't exist.\n", group, event);
+			return -ENOENT;
+		}
+		/* delete an event */
+		unregister_trace_uprobe(tp);
+		mutex_unlock(&uprobe_lock);
+		return 0;
+	}
+
+	if (argc < 2) {
+		pr_info("Probe point is not specified.\n");
+		return -EINVAL;
+	}
+	if (isdigit(argv[1][0])) {
+		pr_info("Probe point must have a filename.\n");
+		return -EINVAL;
+	}
+	arg = strchr(argv[1], ':');
+	if (!arg)
+		goto fail_address_parse;
+
+	*arg++ = '\0';
+	filename = argv[1];
+	ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+	if (ret)
+		goto fail_address_parse;
+
+	inode = igrab(path.dentry->d_inode);
+
+	ret = strict_strtoul(arg, 0, &offset);
+	if (ret)
+		goto fail_address_parse;
+
+	argc -= 2;
+	argv += 2;
+
+	/* setup a probe */
+	if (!event) {
+		char *tail = strrchr(filename, '/');
+		char *ptr;
+
+		ptr = kstrdup((tail ? tail + 1 : filename), GFP_KERNEL);
+		if (!ptr) {
+			ret = -ENOMEM;
+			goto fail_address_parse;
+		}
+
+		tail = ptr;
+		ptr = strpbrk(tail, ".-_");
+		if (ptr)
+			*ptr = '\0';
+
+		snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_0x%lx", 'p', tail,
+				offset);
+		event = buf;
+		kfree(tail);
+	}
+	tp = alloc_trace_uprobe(group, event, argc);
+	if (IS_ERR(tp)) {
+		pr_info("Failed to allocate trace_uprobe.(%d)\n",
+			(int)PTR_ERR(tp));
+		iput(inode);
+		return PTR_ERR(tp);
+	}
+	tp->offset = offset;
+	tp->inode = inode;
+	tp->filename = kstrdup(filename, GFP_KERNEL);
+	if (!tp->filename) {
+		pr_info("Failed to allocate filename.\n");
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	/* parse arguments */
+	ret = 0;
+	for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
+		/* Increment count for freeing args in error case */
+		tp->nr_args++;
+
+		/* Parse argument name */
+		arg = strchr(argv[i], '=');
+		if (arg) {
+			*arg++ = '\0';
+			tp->args[i].name = kstrdup(argv[i], GFP_KERNEL);
+		} else {
+			arg = argv[i];
+			/* If argument name is omitted, set "argN" */
+			snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1);
+			tp->args[i].name = kstrdup(buf, GFP_KERNEL);
+		}
+
+		if (!tp->args[i].name) {
+			pr_info("Failed to allocate argument[%d] name.\n", i);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		if (!is_good_name(tp->args[i].name)) {
+			pr_info("Invalid argument[%d] name: %s\n",
+				i, tp->args[i].name);
+			ret = -EINVAL;
+			goto error;
+		}
+
+		if (traceprobe_conflict_field_name(tp->args[i].name,
+							tp->args, i)) {
+			pr_info("Argument[%d] name '%s' conflicts with "
+				"another field.\n", i, argv[i]);
+			ret = -EINVAL;
+			goto error;
+		}
+
+		/* Parse fetch argument */
+		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+								false, false);
+		if (ret) {
+			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
+			goto error;
+		}
+	}
+
+	ret = register_trace_uprobe(tp);
+	if (ret)
+		goto error;
+	return 0;
+
+error:
+	free_trace_uprobe(tp);
+	return ret;
+
+fail_address_parse:
+	if (inode)
+		iput(inode);
+	pr_info("Failed to parse address.\n");
+	return ret;
+}
+
+static void cleanup_all_probes(void)
+{
+	struct trace_uprobe *tp;
+
+	mutex_lock(&uprobe_lock);
+	while (!list_empty(&uprobe_list)) {
+		tp = list_entry(uprobe_list.next, struct trace_uprobe, list);
+		unregister_trace_uprobe(tp);
+	}
+	mutex_unlock(&uprobe_lock);
+}
+
+/* Probes listing interfaces */
+static void *probes_seq_start(struct seq_file *m, loff_t *pos)
+{
+	mutex_lock(&uprobe_lock);
+	return seq_list_start(&uprobe_list, *pos);
+}
+
+static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	return seq_list_next(v, &uprobe_list, pos);
+}
+
+static void probes_seq_stop(struct seq_file *m, void *v)
+{
+	mutex_unlock(&uprobe_lock);
+}
+
+static int probes_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_uprobe *tp = v;
+	int i;
+
+	seq_printf(m, "p:%s/%s", tp->call.class->system, tp->call.name);
+	seq_printf(m, " %s:0x%p", tp->filename, (void *)tp->offset);
+
+	for (i = 0; i < tp->nr_args; i++)
+		seq_printf(m, " %s=%s", tp->args[i].name, tp->args[i].comm);
+	seq_printf(m, "\n");
+	return 0;
+}
+
+static const struct seq_operations probes_seq_op = {
+	.start  = probes_seq_start,
+	.next   = probes_seq_next,
+	.stop   = probes_seq_stop,
+	.show   = probes_seq_show
+};
+
+static int probes_open(struct inode *inode, struct file *file)
+{
+	if ((file->f_mode & FMODE_WRITE) && (file->f_flags & O_TRUNC))
+		cleanup_all_probes();
+
+	return seq_open(file, &probes_seq_op);
+}
+
+static ssize_t probes_write(struct file *file, const char __user *buffer,
+			    size_t count, loff_t *ppos)
+{
+	return traceprobe_probes_write(file, buffer, count, ppos,
+			create_trace_uprobe);
+}
+
+static const struct file_operations uprobe_events_ops = {
+	.owner          = THIS_MODULE,
+	.open           = probes_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = seq_release,
+	.write		= probes_write,
+};
+
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_uprobe *tp = v;
+
+	seq_printf(m, "  %s %-44s %15lu\n", tp->filename, tp->call.name,
+								tp->nhit);
+	return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+	.start  = probes_seq_start,
+	.next   = probes_seq_next,
+	.stop   = probes_seq_stop,
+	.show   = probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations uprobe_profile_ops = {
+	.owner          = THIS_MODULE,
+	.open           = profile_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = seq_release,
+};
+
+/* uprobe handler */
+static void uprobe_trace_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+	struct uprobe_trace_entry_head *entry;
+	struct ring_buffer_event *event;
+	struct ring_buffer *buffer;
+	u8 *data;
+	int size, i, pc;
+	unsigned long irq_flags;
+	struct ftrace_event_call *call = &tp->call;
+
+	tp->nhit++;
+
+	local_save_flags(irq_flags);
+	pc = preempt_count();
+
+	size = sizeof(*entry) + tp->size;
+
+	event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
+						  size, irq_flags, pc);
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+	data = (u8 *)&entry[1];
+	for (i = 0; i < tp->nr_args; i++)
+		call_fetch(&tp->args[i].fetch, regs,
+						data + tp->args[i].offset);
+
+	if (!filter_current_check_discard(buffer, call, entry, event))
+		trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
+}
+
+/* Event entry printers */
+static enum print_line_t
+print_uprobe_event(struct trace_iterator *iter, int flags,
+		   struct trace_event *event)
+{
+	struct uprobe_trace_entry_head *field;
+	struct trace_seq *s = &iter->seq;
+	struct trace_uprobe *tp;
+	u8 *data;
+	int i;
+
+	field = (struct uprobe_trace_entry_head *)iter->ent;
+	tp = container_of(event, struct trace_uprobe, call.event);
+
+	if (!trace_seq_printf(s, "%s: (", tp->call.name))
+		goto partial;
+
+	if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
+		goto partial;
+
+	if (!trace_seq_puts(s, ")"))
+		goto partial;
+
+	data = (u8 *)&field[1];
+	for (i = 0; i < tp->nr_args; i++)
+		if (!tp->args[i].type->print(s, tp->args[i].name,
+					     data + tp->args[i].offset, field))
+			goto partial;
+
+	if (!trace_seq_puts(s, "\n"))
+		goto partial;
+
+	return TRACE_TYPE_HANDLED;
+partial:
+	return TRACE_TYPE_PARTIAL_LINE;
+}
+
+static int probe_event_enable(struct trace_uprobe *tp, int flag)
+{
+	struct uprobe_trace_consumer *utc;
+	int ret = 0;
+
+	if (!tp->inode || tp->consumer)
+		return -EINTR;
+
+	utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+	if (!utc)
+		return -ENOMEM;
+
+	utc->cons.handler = uprobe_dispatcher;
+	utc->cons.filter = NULL;
+	ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+	if (ret) {
+		kfree(utc);
+		return ret;
+	}
+
+	tp->flags |= flag;
+	utc->tp = tp;
+	tp->consumer = utc;
+	return 0;
+}
+
+static void probe_event_disable(struct trace_uprobe *tp, int flag)
+{
+	if (!tp->inode || !tp->consumer)
+		return;
+
+	unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+	tp->flags &= ~flag;
+	kfree(tp->consumer);
+	tp->consumer = NULL;
+}
+
+static int uprobe_event_define_fields(struct ftrace_event_call *event_call)
+{
+	int ret, i;
+	struct uprobe_trace_entry_head field;
+	struct trace_uprobe *tp = (struct trace_uprobe *)event_call->data;
+
+	DEFINE_FIELD(unsigned long, ip, FIELD_STRING_IP, 0);
+	/* Set argument names as fields */
+	for (i = 0; i < tp->nr_args; i++) {
+		ret = trace_define_field(event_call, tp->args[i].type->fmttype,
+					 tp->args[i].name,
+					 sizeof(field) + tp->args[i].offset,
+					 tp->args[i].type->size,
+					 tp->args[i].type->is_signed,
+					 FILTER_OTHER);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int __set_print_fmt(struct trace_uprobe *tp, char *buf, int len)
+{
+	int i;
+	int pos = 0;
+
+	const char *fmt, *arg;
+
+	fmt = "(%lx)";
+	arg = "REC->" FIELD_STRING_IP;
+
+	/* When len=0, we just calculate the needed length */
+#define LEN_OR_ZERO (len ? len - pos : 0)
+
+	pos += snprintf(buf + pos, LEN_OR_ZERO, "\"%s", fmt);
+
+	for (i = 0; i < tp->nr_args; i++) {
+		pos += snprintf(buf + pos, LEN_OR_ZERO, " %s=%s",
+				tp->args[i].name, tp->args[i].type->fmt);
+	}
+
+	pos += snprintf(buf + pos, LEN_OR_ZERO, "\", %s", arg);
+
+	for (i = 0; i < tp->nr_args; i++) {
+		pos += snprintf(buf + pos, LEN_OR_ZERO, ", REC->%s",
+				tp->args[i].name);
+	}
+
+#undef LEN_OR_ZERO
+
+	/* return the length of print_fmt */
+	return pos;
+}
+
+static int set_print_fmt(struct trace_uprobe *tp)
+{
+	int len;
+	char *print_fmt;
+
+	/* First: called with 0 length to calculate the needed length */
+	len = __set_print_fmt(tp, NULL, 0);
+	print_fmt = kmalloc(len + 1, GFP_KERNEL);
+	if (!print_fmt)
+		return -ENOMEM;
+
+	/* Second: actually write the @print_fmt */
+	__set_print_fmt(tp, print_fmt, len + 1);
+	tp->call.print_fmt = print_fmt;
+
+	return 0;
+}
+
+#ifdef CONFIG_PERF_EVENTS
+
+/* uprobe profile handler */
+static void uprobe_perf_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+	struct ftrace_event_call *call = &tp->call;
+	struct uprobe_trace_entry_head *entry;
+	struct hlist_head *head;
+	u8 *data;
+	int size, __size, i;
+	int rctx;
+
+	__size = sizeof(*entry) + tp->size;
+	size = ALIGN(__size + sizeof(u32), sizeof(u64));
+	size -= sizeof(u32);
+	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
+		     "profile buffer not large enough"))
+		return;
+
+	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+	if (!entry)
+		return;
+
+	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+	data = (u8 *)&entry[1];
+	for (i = 0; i < tp->nr_args; i++)
+		call_fetch(&tp->args[i].fetch, regs,
+						data + tp->args[i].offset);
+
+	head = this_cpu_ptr(call->perf_events);
+	perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+}
+#endif	/* CONFIG_PERF_EVENTS */
+
+static
+int uprobe_register(struct ftrace_event_call *event, enum trace_reg type)
+{
+	switch (type) {
+	case TRACE_REG_REGISTER:
+		return probe_event_enable(event->data, TP_FLAG_TRACE);
+	case TRACE_REG_UNREGISTER:
+		probe_event_disable(event->data, TP_FLAG_TRACE);
+		return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+	case TRACE_REG_PERF_REGISTER:
+		return probe_event_enable(event->data, TP_FLAG_PROFILE);
+	case TRACE_REG_PERF_UNREGISTER:
+		probe_event_disable(event->data, TP_FLAG_PROFILE);
+		return 0;
+#endif
+	}
+	return 0;
+}
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
+{
+	struct uprobe_trace_consumer *utc;
+	struct trace_uprobe *tp;
+
+	utc = container_of(con, struct uprobe_trace_consumer, cons);
+	tp = utc->tp;
+	if (!tp || tp->consumer != utc)
+		return 0;
+
+	if (tp->flags & TP_FLAG_TRACE)
+		uprobe_trace_func(tp, regs);
+#ifdef CONFIG_PERF_EVENTS
+	if (tp->flags & TP_FLAG_PROFILE)
+		uprobe_perf_func(tp, regs);
+#endif
+	return 0;
+}
+
+static struct trace_event_functions uprobe_funcs = {
+	.trace		= print_uprobe_event
+};
+
+static int register_uprobe_event(struct trace_uprobe *tp)
+{
+	struct ftrace_event_call *call = &tp->call;
+	int ret;
+
+	/* Initialize ftrace_event_call */
+	INIT_LIST_HEAD(&call->class->fields);
+	call->event.funcs = &uprobe_funcs;
+	call->class->define_fields = uprobe_event_define_fields;
+	if (set_print_fmt(tp) < 0)
+		return -ENOMEM;
+	ret = register_ftrace_event(&call->event);
+	if (!ret) {
+		kfree(call->print_fmt);
+		return -ENODEV;
+	}
+	call->flags = 0;
+	call->class->reg = uprobe_register;
+	call->data = tp;
+	ret = trace_add_event_call(call);
+	if (ret) {
+		pr_info("Failed to register uprobe event: %s\n", call->name);
+		kfree(call->print_fmt);
+		unregister_ftrace_event(&call->event);
+	}
+	return ret;
+}
+
+static void unregister_uprobe_event(struct trace_uprobe *tp)
+{
+	/* tp->event is unregistered in trace_remove_event_call() */
+	trace_remove_event_call(&tp->call);
+	kfree(tp->call.print_fmt);
+	tp->call.print_fmt = NULL;
+}
+
+/* Make a trace interface for controlling probe points */
+static __init int init_uprobe_trace(void)
+{
+	struct dentry *d_tracer;
+
+	d_tracer = tracing_init_dentry();
+	if (!d_tracer)
+		return 0;
+
+	trace_create_file("uprobe_events", 0644, d_tracer,
+				    NULL, &uprobe_events_ops);
+	/* Profile interface */
+	trace_create_file("uprobe_profile", 0444, d_tracer,
+				    NULL, &uprobe_profile_ops);
+	return 0;
+}
+
+fs_initcall(init_uprobe_trace);


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH v7 3.2-rc2 22/30] perf: rename target_module to target
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (20 preceding siblings ...)
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 21/30] tracing: uprobes trace_event interface Srikar Dronamraju
@ 2011-11-18 11:10 ` Srikar Dronamraju
  2011-11-18 11:11 ` [PATCH v7 3.2-rc2 23/30] perf: perf interface for uprobes Srikar Dronamraju
                   ` (9 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:10 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


This is a precursor patch that renames identifiers referring to a
kernel module (target_module, module) to the more generic "target",
so that they can also refer to user space executables and libraries.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 tools/perf/builtin-probe.c    |   12 ++++++------
 tools/perf/util/probe-event.c |   26 +++++++++++++-------------
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 710ae3d..93d5171 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -61,7 +61,7 @@ static struct {
 	struct perf_probe_event events[MAX_PROBES];
 	struct strlist *dellist;
 	struct line_range line_range;
-	const char *target_module;
+	const char *target;
 	int max_probe_points;
 	struct strfilter *filter;
 } params;
@@ -249,7 +249,7 @@ static const struct option options[] = {
 		   "file", "vmlinux pathname"),
 	OPT_STRING('s', "source", &symbol_conf.source_prefix,
 		   "directory", "path to kernel source"),
-	OPT_STRING('m', "module", &params.target_module,
+	OPT_STRING('m', "module", &params.target,
 		   "modname|path",
 		   "target module name (for online) or path (for offline)"),
 #endif
@@ -336,7 +336,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 		if (!params.filter)
 			params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
 						       NULL);
-		ret = show_available_funcs(params.target_module,
+		ret = show_available_funcs(params.target,
 					   params.filter);
 		strfilter__delete(params.filter);
 		if (ret < 0)
@@ -357,7 +357,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 			usage_with_options(probe_usage, options);
 		}
 
-		ret = show_line_range(&params.line_range, params.target_module);
+		ret = show_line_range(&params.line_range, params.target);
 		if (ret < 0)
 			pr_err("  Error: Failed to show lines. (%d)\n", ret);
 		return ret;
@@ -374,7 +374,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 
 		ret = show_available_vars(params.events, params.nevents,
 					  params.max_probe_points,
-					  params.target_module,
+					  params.target,
 					  params.filter,
 					  params.show_ext_vars);
 		strfilter__delete(params.filter);
@@ -396,7 +396,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 	if (params.nevents) {
 		ret = add_perf_probe_events(params.events, params.nevents,
 					    params.max_probe_points,
-					    params.target_module,
+					    params.target,
 					    params.force_add);
 		if (ret < 0) {
 			pr_err("  Error: Failed to add events. (%d)\n", ret);
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index eb25900..d54eefb 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -275,10 +275,10 @@ static int add_module_to_probe_trace_events(struct probe_trace_event *tevs,
 /* Try to find perf_probe_event with debuginfo */
 static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 					  struct probe_trace_event **tevs,
-					  int max_tevs, const char *module)
+					  int max_tevs, const char *target)
 {
 	bool need_dwarf = perf_probe_event_need_dwarf(pev);
-	struct debuginfo *dinfo = open_debuginfo(module);
+	struct debuginfo *dinfo = open_debuginfo(target);
 	int ntevs, ret = 0;
 
 	if (!dinfo) {
@@ -297,9 +297,9 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 
 	if (ntevs > 0) {	/* Succeeded to find trace events */
 		pr_debug("find %d probe_trace_events.\n", ntevs);
-		if (module)
+		if (target)
 			ret = add_module_to_probe_trace_events(*tevs, ntevs,
-							       module);
+							       target);
 		return ret < 0 ? ret : ntevs;
 	}
 
@@ -1798,14 +1798,14 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
 
 static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 					  struct probe_trace_event **tevs,
-					  int max_tevs, const char *module)
+					  int max_tevs, const char *target)
 {
 	struct symbol *sym;
 	int ret = 0, i;
 	struct probe_trace_event *tev;
 
 	/* Convert perf_probe_event with debuginfo */
-	ret = try_to_find_probe_trace_events(pev, tevs, max_tevs, module);
+	ret = try_to_find_probe_trace_events(pev, tevs, max_tevs, target);
 	if (ret != 0)
 		return ret;	/* Found in debuginfo or got an error */
 
@@ -1821,8 +1821,8 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 		goto error;
 	}
 
-	if (module) {
-		tev->point.module = strdup(module);
+	if (target) {
+		tev->point.module = strdup(target);
 		if (tev->point.module == NULL) {
 			ret = -ENOMEM;
 			goto error;
@@ -1886,7 +1886,7 @@ struct __event_package {
 };
 
 int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
-			  int max_tevs, const char *module, bool force_add)
+			  int max_tevs, const char *target, bool force_add)
 {
 	int i, j, ret;
 	struct __event_package *pkgs;
@@ -1909,7 +1909,7 @@ int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
 		ret  = convert_to_probe_trace_events(pkgs[i].pev,
 						     &pkgs[i].tevs,
 						     max_tevs,
-						     module);
+						     target);
 		if (ret < 0)
 			goto end;
 		pkgs[i].ntevs = ret;
@@ -2065,7 +2065,7 @@ static int filter_available_functions(struct map *map __unused,
 	return 1;
 }
 
-int show_available_funcs(const char *module, struct strfilter *_filter)
+int show_available_funcs(const char *target, struct strfilter *_filter)
 {
 	struct map *map;
 	int ret;
@@ -2076,9 +2076,9 @@ int show_available_funcs(const char *module, struct strfilter *_filter)
 	if (ret < 0)
 		return ret;
 
-	map = kernel_get_module_map(module);
+	map = kernel_get_module_map(target);
 	if (!map) {
-		pr_err("Failed to find %s map.\n", (module) ? : "kernel");
+		pr_err("Failed to find %s map.\n", (target) ? : "kernel");
 		return -EINVAL;
 	}
 	available_func_filter = _filter;


^ permalink raw reply related	[flat|nested] 106+ messages in thread

* [PATCH v7 3.2-rc2 23/30] perf: perf interface for uprobes
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (21 preceding siblings ...)
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 22/30] perf: rename target_module to target Srikar Dronamraju
@ 2011-11-18 11:11 ` Srikar Dronamraju
  2011-11-18 11:11 ` [PATCH v7 3.2-rc2 24/30] perf: show possible probes in a given executable file or library Srikar Dronamraju
                   ` (8 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:11 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


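Extends perf probe to user space executables and libraries through the
new -x/--exec option. This provides only basic uprobes support: probes
are specified by function name, and debuginfo-based (DWARF) probes are
not yet supported with -x/--exec.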

[ Probing a function in the executable using function name  ]
-------------------------------------------------------------
[root@localhost ~]# perf probe -x /bin/zsh zfree
Add new event:
  probe_zsh:zfree      (on /bin/zsh:0x45400)

You can now use it on all perf tools, such as:

	perf record -e probe_zsh:zfree -aR sleep 1

[root@localhost ~]# perf record -e probe_zsh:zfree -aR sleep 15
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.314 MB perf.data (~13715 samples) ]
[root@localhost ~]# perf report --stdio
# Events: 3K probe_zsh:zfree
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
   100.00%              zsh  zsh            [.] zfree


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
[root@localhost ~]
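
The event shown above is registered by writing a single line to the
uprobe_events file. The following is a minimal sketch of how such a line
could be assembled, assuming the "%c:%s/%s %s" layout used for uprobes in
this patch, where the last field is "<path>:0x<address>". The helper name
format_uprobe_command() is hypothetical and is not part of perf.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of perf): build the text that would be
 * written to uprobe_events for one probe.
 *
 *   'p' or 'r'  : probe or return probe
 *   group/event : e.g. probe_zsh/zfree
 *   path:0xaddr : probed file and address within it
 *
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int format_uprobe_command(char *buf, size_t size, bool retprobe,
				 const char *group, const char *event,
				 const char *path, unsigned long long addr)
{
	int len;

	assert(buf && group && event && path);

	len = snprintf(buf, size, "%c:%s/%s %s:0x%llx",
		       retprobe ? 'r' : 'p', group, event, path, addr);
	if (len < 0 || (size_t)len >= size)
		return -1;	/* encoding error or truncated output */

	return 0;
}
```

With the values from the zsh example (group "probe_zsh", event "zfree",
path "/bin/zsh", address 0x45400) the helper produces a line of the form
"p:probe_zsh/zfree /bin/zsh:0x45400".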

[ Probing a library function using function name ]
--------------------------------------------------
[root@localhost]#
[root@localhost]# perf probe -x /lib64/libc.so.6 malloc
Add new event:
  probe_libc:malloc    (on /lib64/libc-2.5.so:0x74dc0)

You can now use it on all perf tools, such as:

	perf record -e probe_libc:malloc -aR sleep 1

[root@localhost]#
[root@localhost]# perf probe --list
  probe_libc:malloc    (on /lib64/libc-2.5.so:0x0000000000074dc0)
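
The group names in these examples (probe_zsh, probe_libc) are derived
from the file name of the target. The following is a rough sketch of that
derivation, assuming the group is "probe_" followed by the basename of
the path cut at the first '-', '.' or '_'. The helper name
uprobe_group_name() is hypothetical; the patch itself does this inline
using basename() and strpbrk().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of perf): derive an event group name
 * from the path of an executable or library.
 *
 *   /bin/zsh          -> probe_zsh
 *   /lib64/libc.so.6  -> probe_libc
 *
 * Returns 0 on success, -1 if the output buffer is too small.
 */
static int uprobe_group_name(const char *exec, char *out, size_t size)
{
	const char *base;
	size_t stem;
	int len;

	assert(exec && out);

	/* Take the last path component without modifying the input. */
	base = strrchr(exec, '/');
	base = base ? base + 1 : exec;

	/* Keep only the part before the first '-', '.' or '_'. */
	stem = strcspn(base, "-._");

	len = snprintf(out, size, "probe_%.*s", (int)stem, base);
	if (len < 0 || (size_t)len >= size)
		return -1;	/* encoding error or truncated output */

	return 0;
}
```

Cutting at the first separator is why both /lib64/libc.so.6 and
/lib64/libc-2.5.so end up in the same probe_libc group.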

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5):
- Removed the separate documentation change patch and added the
  documentation changes as part of this patch.

 tools/perf/Documentation/perf-probe.txt |   12 +
 tools/perf/builtin-probe.c              |   37 +++
 tools/perf/util/probe-event.c           |  348 +++++++++++++++++++++++++------
 tools/perf/util/probe-event.h           |    8 -
 tools/perf/util/symbol.c                |    8 +
 tools/perf/util/symbol.h                |    1 
 6 files changed, 336 insertions(+), 78 deletions(-)

diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt
index 2780d9c..469ad6d 100644
--- a/tools/perf/Documentation/perf-probe.txt
+++ b/tools/perf/Documentation/perf-probe.txt
@@ -98,6 +98,11 @@ OPTIONS
 --max-probes::
 	Set the maximum number of probe points for an event. Default is 128.
 
+-x::
+--exec=PATH::
+	Specify path to the executable or shared library file for user
+	space tracing.
+
 PROBE SYNTAX
 ------------
 Probe points are defined by following syntax.
@@ -182,6 +187,13 @@ Delete all probes on schedule().
 
  ./perf probe --del='schedule*'
 
+Add probes at zfree() function on /bin/zsh
+
+ ./perf probe -x /bin/zsh zfree
+
+Add probes at malloc() function on libc
+
+ ./perf probe -x /lib/libc.so.6 malloc
 
 SEE ALSO
 --------
diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 93d5171..43e6321 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -57,6 +57,7 @@ static struct {
 	bool show_ext_vars;
 	bool show_funcs;
 	bool mod_events;
+	bool uprobes;
 	int nevents;
 	struct perf_probe_event events[MAX_PROBES];
 	struct strlist *dellist;
@@ -78,6 +79,7 @@ static int parse_probe_event(const char *str)
 		return -1;
 	}
 
+	pev->uprobes = params.uprobes;
 	/* Parse a perf-probe command into event */
 	ret = parse_perf_probe_command(str, pev);
 	pr_debug("%d arguments\n", pev->nargs);
@@ -128,6 +130,27 @@ static int opt_del_probe_event(const struct option *opt __used,
 	return 0;
 }
 
+static int opt_set_target(const struct option *opt, const char *str,
+			int unset __used)
+{
+	int ret = -ENOENT;
+
+	if  (str && !params.target) {
+		if (!strcmp(opt->long_name, "exec"))
+			params.uprobes = true;
+#ifdef DWARF_SUPPORT
+		else if (!strcmp(opt->long_name, "module"))
+			params.uprobes = false;
+#endif
+		else
+			return ret;
+
+		params.target = str;
+		ret = 0;
+	}
+	return ret;
+}
+
 #ifdef DWARF_SUPPORT
 static int opt_show_lines(const struct option *opt __used,
 			  const char *str, int unset __used)
@@ -249,9 +272,9 @@ static const struct option options[] = {
 		   "file", "vmlinux pathname"),
 	OPT_STRING('s', "source", &symbol_conf.source_prefix,
 		   "directory", "path to kernel source"),
-	OPT_STRING('m', "module", &params.target,
-		   "modname|path",
-		   "target module name (for online) or path (for offline)"),
+	OPT_CALLBACK('m', "module", NULL, "modname|path",
+		"target module name (for online) or path (for offline)",
+		opt_set_target),
 #endif
 	OPT__DRY_RUN(&probe_event_dry_run),
 	OPT_INTEGER('\0', "max-probes", &params.max_probe_points,
@@ -263,6 +286,8 @@ static const struct option options[] = {
 		     "\t\t\t(default: \"" DEFAULT_VAR_FILTER "\" for --vars,\n"
 		     "\t\t\t \"" DEFAULT_FUNC_FILTER "\" for --funcs)",
 		     opt_set_filter),
+	OPT_CALLBACK('x', "exec", NULL, "executable|path",
+			"target executable name or path", opt_set_target),
 	OPT_END()
 };
 
@@ -313,6 +338,10 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 			pr_err("  Error: Don't use --list with --funcs.\n");
 			usage_with_options(probe_usage, options);
 		}
+		if (params.uprobes) {
+			pr_warning("  Error: Don't use --list with --exec.\n");
+			usage_with_options(probe_usage, options);
+		}
 		ret = show_perf_probe_events();
 		if (ret < 0)
 			pr_err("  Error: Failed to show event list. (%d)\n",
@@ -346,7 +375,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 	}
 
 #ifdef DWARF_SUPPORT
-	if (params.show_lines) {
+	if (params.show_lines && !params.uprobes) {
 		if (params.mod_events) {
 			pr_err("  Error: Don't use --line with"
 			       " --add/--del.\n");
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index d54eefb..d4f4c2b 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -73,6 +73,8 @@ static int e_snprintf(char *str, size_t size, const char *format, ...)
 }
 
 static char *synthesize_perf_probe_point(struct perf_probe_point *pp);
+static int convert_name_to_addr(struct perf_probe_event *pev,
+				const char *exec);
 static struct machine machine;
 
 /* Initialize symbol maps and path of vmlinux/modules */
@@ -173,6 +175,31 @@ const char *kernel_get_module_path(const char *module)
 	return (dso) ? dso->long_name : NULL;
 }
 
+static int init_perf_uprobes(void)
+{
+	int ret = 0;
+
+	symbol_conf.try_vmlinux_path = false;
+	symbol_conf.sort_by_name = true;
+	ret = symbol__init();
+	if (ret < 0)
+		pr_debug("Failed to init symbol map.\n");
+
+	return ret;
+}
+
+static int convert_to_perf_probe_point(struct probe_trace_point *tp,
+					struct perf_probe_point *pp)
+{
+	pp->function = strdup(tp->symbol);
+	if (pp->function == NULL)
+		return -ENOMEM;
+	pp->offset = tp->offset;
+	pp->retprobe = tp->retprobe;
+
+	return 0;
+}
+
 #ifdef DWARF_SUPPORT
 /* Open new debuginfo of given module */
 static struct debuginfo *open_debuginfo(const char *module)
@@ -281,6 +308,15 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 	struct debuginfo *dinfo = open_debuginfo(target);
 	int ntevs, ret = 0;
 
+	if (pev->uprobes) {
+		if (need_dwarf) {
+			pr_warning("Debuginfo-analysis is not yet supported"
+					" with -x/--exec option.\n");
+			return -ENOSYS;
+		}
+		return convert_name_to_addr(pev, target);
+	}
+
 	if (!dinfo) {
 		if (need_dwarf) {
 			pr_warning("Failed to open debuginfo file.\n");
@@ -606,23 +642,20 @@ static int kprobe_convert_to_perf_probe(struct probe_trace_point *tp,
 		pr_err("Failed to find symbol %s in kernel.\n", tp->symbol);
 		return -ENOENT;
 	}
-	pp->function = strdup(tp->symbol);
-	if (pp->function == NULL)
-		return -ENOMEM;
-	pp->offset = tp->offset;
-	pp->retprobe = tp->retprobe;
-
-	return 0;
+	return convert_to_perf_probe_point(tp, pp);
 }
 
 static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 				struct probe_trace_event **tevs __unused,
-				int max_tevs __unused, const char *mod __unused)
+				int max_tevs __unused, const char *target)
 {
 	if (perf_probe_event_need_dwarf(pev)) {
 		pr_warning("Debuginfo-analysis is not supported.\n");
 		return -ENOSYS;
 	}
+	if (pev->uprobes)
+		return convert_name_to_addr(pev, target);
+
 	return 0;
 }
 
@@ -887,6 +920,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
 		return -EINVAL;
 	}
 
+	if (pev->uprobes && !pp->function) {
+		semantic_error("No function specified for uprobes");
+		return -EINVAL;
+	}
+
 	if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
 		semantic_error("Offset/Line/Lazy pattern can't be used with "
 			       "return probe.\n");
@@ -896,6 +934,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
 	pr_debug("symbol:%s file:%s line:%d offset:%lu return:%d lazy:%s\n",
 		 pp->function, pp->file, pp->line, pp->offset, pp->retprobe,
 		 pp->lazy_line);
+
+	if (pev->uprobes && perf_probe_event_need_dwarf(pev)) {
+		semantic_error("no dwarf based probes for uprobes.");
+		return -EINVAL;
+	}
 	return 0;
 }
 
@@ -1047,7 +1090,8 @@ bool perf_probe_event_need_dwarf(struct perf_probe_event *pev)
 {
 	int i;
 
-	if (pev->point.file || pev->point.line || pev->point.lazy_line)
+	if ((pev->point.file && !pev->uprobes) || pev->point.line ||
+					pev->point.lazy_line)
 		return true;
 
 	for (i = 0; i < pev->nargs; i++)
@@ -1344,11 +1388,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 	if (buf == NULL)
 		return NULL;
 
-	len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s%s%s+%lu",
-			 tp->retprobe ? 'r' : 'p',
-			 tev->group, tev->event,
-			 tp->module ?: "", tp->module ? ":" : "",
-			 tp->symbol, tp->offset);
+	if (tev->uprobes)
+		len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s",
+				 tp->retprobe ? 'r' : 'p',
+				 tev->group, tev->event, tp->symbol);
+	else
+		len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s%s%s+%lu",
+				 tp->retprobe ? 'r' : 'p',
+				 tev->group, tev->event,
+				 tp->module ?: "", tp->module ? ":" : "",
+				 tp->symbol, tp->offset);
+
 	if (len <= 0)
 		goto error;
 
@@ -1367,7 +1417,7 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 }
 
 static int convert_to_perf_probe_event(struct probe_trace_event *tev,
-				       struct perf_probe_event *pev)
+			       struct perf_probe_event *pev, bool is_kprobe)
 {
 	char buf[64] = "";
 	int i, ret;
@@ -1379,7 +1429,11 @@ static int convert_to_perf_probe_event(struct probe_trace_event *tev,
 		return -ENOMEM;
 
 	/* Convert trace_point to probe_point */
-	ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+	if (is_kprobe)
+		ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+	else
+		ret = convert_to_perf_probe_point(&tev->point, &pev->point);
+
 	if (ret < 0)
 		return ret;
 
@@ -1475,7 +1529,7 @@ static void clear_probe_trace_event(struct probe_trace_event *tev)
 	memset(tev, 0, sizeof(*tev));
 }
 
-static int open_kprobe_events(bool readwrite)
+static int open_probe_events(bool readwrite, bool is_kprobe)
 {
 	char buf[PATH_MAX];
 	const char *__debugfs;
@@ -1486,8 +1540,13 @@ static int open_kprobe_events(bool readwrite)
 		pr_warning("Debugfs is not mounted.\n");
 		return -ENOENT;
 	}
+	if (is_kprobe)
+		ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events",
+							__debugfs);
+	else
+		ret = e_snprintf(buf, PATH_MAX, "%stracing/uprobe_events",
+							__debugfs);
 
-	ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events", __debugfs);
 	if (ret >= 0) {
 		pr_debug("Opening %s write=%d\n", buf, readwrite);
 		if (readwrite && !probe_event_dry_run)
@@ -1498,16 +1557,29 @@ static int open_kprobe_events(bool readwrite)
 
 	if (ret < 0) {
 		if (errno == ENOENT)
-			pr_warning("kprobe_events file does not exist - please"
-				 " rebuild kernel with CONFIG_KPROBE_EVENT.\n");
+			pr_warning("%s file does not exist - please"
+				" rebuild kernel with CONFIG_%s_EVENT.\n",
+				is_kprobe ? "kprobe_events" : "uprobe_events",
+				is_kprobe ? "KPROBE" : "UPROBE");
 		else
-			pr_warning("Failed to open kprobe_events file: %s\n",
-				   strerror(errno));
+			pr_warning("Failed to open %s file: %s\n",
+				is_kprobe ? "kprobe_events" : "uprobe_events",
+				strerror(errno));
 	}
 	return ret;
 }
 
-/* Get raw string list of current kprobe_events */
+static int open_kprobe_events(bool readwrite)
+{
+	return open_probe_events(readwrite, 1);
+}
+
+static int open_uprobe_events(bool readwrite)
+{
+	return open_probe_events(readwrite, 0);
+}
+
+/* Get raw string list of current kprobe_events or uprobe_events */
 static struct strlist *get_probe_trace_command_rawlist(int fd)
 {
 	int ret, idx;
@@ -1572,36 +1644,26 @@ static int show_perf_probe_event(struct perf_probe_event *pev)
 	return ret;
 }
 
-/* List up current perf-probe events */
-int show_perf_probe_events(void)
+static int __show_perf_probe_events(int fd, bool is_kprobe)
 {
-	int fd, ret;
+	int ret = 0;
 	struct probe_trace_event tev;
 	struct perf_probe_event pev;
 	struct strlist *rawlist;
 	struct str_node *ent;
 
-	setup_pager();
-	ret = init_vmlinux();
-	if (ret < 0)
-		return ret;
-
 	memset(&tev, 0, sizeof(tev));
 	memset(&pev, 0, sizeof(pev));
 
-	fd = open_kprobe_events(false);
-	if (fd < 0)
-		return fd;
-
 	rawlist = get_probe_trace_command_rawlist(fd);
-	close(fd);
 	if (!rawlist)
 		return -ENOENT;
 
 	strlist__for_each(ent, rawlist) {
 		ret = parse_probe_trace_command(ent->s, &tev);
 		if (ret >= 0) {
-			ret = convert_to_perf_probe_event(&tev, &pev);
+			ret = convert_to_perf_probe_event(&tev, &pev,
+								is_kprobe);
 			if (ret >= 0)
 				ret = show_perf_probe_event(&pev);
 		}
@@ -1611,6 +1673,31 @@ int show_perf_probe_events(void)
 			break;
 	}
 	strlist__delete(rawlist);
+	return ret;
+}
+
+/* List up current perf-probe events */
+int show_perf_probe_events(void)
+{
+	int fd, ret;
+
+	setup_pager();
+	fd = open_kprobe_events(false);
+	if (fd < 0)
+		return fd;
+
+	ret = init_vmlinux();
+	if (ret < 0)
+		return ret;
+
+	ret = __show_perf_probe_events(fd, true);
+	close(fd);
+
+	fd = open_uprobe_events(false);
+	if (fd >= 0) {
+		ret = __show_perf_probe_events(fd, false);
+		close(fd);
+	}
 
 	return ret;
 }
@@ -1720,7 +1807,10 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
 	const char *event, *group;
 	struct strlist *namelist;
 
-	fd = open_kprobe_events(true);
+	if (pev->uprobes)
+		fd = open_uprobe_events(true);
+	else
+		fd = open_kprobe_events(true);
 	if (fd < 0)
 		return fd;
 	/* Get current event names */
@@ -1832,6 +1922,7 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 	tev->point.offset = pev->point.offset;
 	tev->point.retprobe = pev->point.retprobe;
 	tev->nargs = pev->nargs;
+	tev->uprobes = pev->uprobes;
 	if (tev->nargs) {
 		tev->args = zalloc(sizeof(struct probe_trace_arg)
 				   * tev->nargs);
@@ -1862,6 +1953,9 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 		}
 	}
 
+	if (pev->uprobes)
+		return 1;
+
 	/* Currently just checking function name from symbol map */
 	sym = __find_kernel_function_by_name(tev->point.symbol, NULL);
 	if (!sym) {
@@ -1888,15 +1982,19 @@ struct __event_package {
 int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
 			  int max_tevs, const char *target, bool force_add)
 {
-	int i, j, ret;
+	int i, j, ret = 0;
 	struct __event_package *pkgs;
 
 	pkgs = zalloc(sizeof(struct __event_package) * npevs);
 	if (pkgs == NULL)
 		return -ENOMEM;
 
-	/* Init vmlinux path */
-	ret = init_vmlinux();
+	if (!pevs->uprobes)
+		/* Init vmlinux path */
+		ret = init_vmlinux();
+	else
+		ret = init_perf_uprobes();
+
 	if (ret < 0) {
 		free(pkgs);
 		return ret;
@@ -1968,23 +2066,15 @@ static int __del_trace_probe_event(int fd, struct str_node *ent)
 	return ret;
 }
 
-static int del_trace_probe_event(int fd, const char *group,
-				  const char *event, struct strlist *namelist)
+static int del_trace_probe_event(int fd, const char *buf,
+						  struct strlist *namelist)
 {
-	char buf[128];
 	struct str_node *ent, *n;
-	int found = 0, ret = 0;
-
-	ret = e_snprintf(buf, 128, "%s:%s", group, event);
-	if (ret < 0) {
-		pr_err("Failed to copy event.\n");
-		return ret;
-	}
+	int ret = -1;
 
 	if (strpbrk(buf, "*?")) { /* Glob-exp */
 		strlist__for_each_safe(ent, n, namelist)
 			if (strglobmatch(ent->s, buf)) {
-				found++;
 				ret = __del_trace_probe_event(fd, ent);
 				if (ret < 0)
 					break;
@@ -1993,40 +2083,41 @@ static int del_trace_probe_event(int fd, const char *group,
 	} else {
 		ent = strlist__find(namelist, buf);
 		if (ent) {
-			found++;
 			ret = __del_trace_probe_event(fd, ent);
 			if (ret >= 0)
 				strlist__remove(namelist, ent);
 		}
 	}
-	if (found == 0 && ret >= 0)
-		pr_info("Info: Event \"%s\" does not exist.\n", buf);
-
 	return ret;
 }
 
 int del_perf_probe_events(struct strlist *dellist)
 {
-	int fd, ret = 0;
+	int ret = -1, ufd = -1, kfd = -1;
+	char buf[128];
 	const char *group, *event;
 	char *p, *str;
 	struct str_node *ent;
-	struct strlist *namelist;
-
-	fd = open_kprobe_events(true);
-	if (fd < 0)
-		return fd;
+	struct strlist *namelist = NULL, *unamelist = NULL;
 
 	/* Get current event names */
-	namelist = get_probe_trace_event_names(fd, true);
-	if (namelist == NULL)
-		return -EINVAL;
+	kfd = open_kprobe_events(true);
+	if (kfd < 0)
+		return kfd;
+	namelist = get_probe_trace_event_names(kfd, true);
+
+	ufd = open_uprobe_events(true);
+	if (ufd >= 0)
+		unamelist = get_probe_trace_event_names(ufd, true);
+
+	if (namelist == NULL && unamelist == NULL)
+		goto error;
 
 	strlist__for_each(ent, dellist) {
 		str = strdup(ent->s);
 		if (str == NULL) {
 			ret = -ENOMEM;
-			break;
+			goto error;
 		}
 		pr_debug("Parsing: %s\n", str);
 		p = strchr(str, ':');
@@ -2038,17 +2129,40 @@ int del_perf_probe_events(struct strlist *dellist)
 			group = "*";
 			event = str;
 		}
+
+		ret = e_snprintf(buf, 128, "%s:%s", group, event);
+		if (ret < 0) {
+			pr_err("Failed to copy event.\n");
+			free(str);
+			goto error;
+		}
+
 		pr_debug("Group: %s, Event: %s\n", group, event);
-		ret = del_trace_probe_event(fd, group, event, namelist);
+		if (namelist)
+			ret = del_trace_probe_event(kfd, buf, namelist);
+		if (unamelist && ret != 0)
+			ret = del_trace_probe_event(ufd, buf, unamelist);
+
 		free(str);
-		if (ret < 0)
-			break;
+		if (ret != 0)
+			pr_info("Info: Event \"%s\" does not exist.\n", buf);
 	}
-	strlist__delete(namelist);
-	close(fd);
 
+error:
+	if (kfd >= 0) {
+		if (namelist)
+			strlist__delete(namelist);
+		close(kfd);
+	}
+
+	if (ufd >= 0) {
+		if (unamelist)
+			strlist__delete(unamelist);
+		close(ufd);
+	}
 	return ret;
 }
+
 /* TODO: don't use a global variable for filter ... */
 static struct strfilter *available_func_filter;
 
@@ -2092,3 +2206,95 @@ int show_available_funcs(const char *target, struct strfilter *_filter)
 	dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
 	return 0;
 }
+
+#define DEFAULT_FUNC_FILTER "!_*"
+
+/*
+ * uprobe_events only accepts address:
+ * Convert function and any offset to address
+ */
+static int convert_name_to_addr(struct perf_probe_event *pev, const char *exec)
+{
+	struct perf_probe_point *pp = &pev->point;
+	struct symbol *sym;
+	struct map *map = NULL;
+	char *function = NULL, *name = NULL;
+	int ret = -EINVAL;
+	unsigned long long vaddr = 0;
+
+	if (!pp->function)
+		goto out;
+
+	function = strdup(pp->function);
+	if (!function) {
+		pr_warning("Failed to allocate memory by strdup.\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	name = realpath(exec, NULL);
+	if (!name) {
+		pr_warning("Cannot find realpath for %s.\n", exec);
+		goto out;
+	}
+	map = dso__new_map(name);
+	if (!map) {
+		pr_warning("Cannot find appropriate DSO for %s.\n", name);
+		goto out;
+	}
+	available_func_filter = strfilter__new(DEFAULT_FUNC_FILTER, NULL);
+	if (map__load(map, filter_available_functions)) {
+		pr_err("Failed to load map.\n");
+		goto out;	/* ret is still -EINVAL; frees map, function, name */
+	}
+
+	sym = map__find_symbol_by_name(map, function, NULL);
+	if (!sym) {
+		pr_warning("Cannot find %s in DSO %s\n", function, name);
+		goto out;
+	}
+
+	if (map->start > sym->start)
+		vaddr = map->start;
+	vaddr += sym->start + pp->offset + map->pgoff;
+	pp->offset = 0;
+
+	if (!pev->event) {
+		pev->event = function;
+		function = NULL;
+	}
+	if (!pev->group) {
+		char *ptr1, *ptr2;
+
+		pev->group = zalloc(sizeof(char *) * 64);
+		ptr1 = strdup(basename(exec));
+		if (ptr1) {
+			ptr2 = strpbrk(ptr1, "-._");
+			if (ptr2)
+				*ptr2 = '\0';
+			e_snprintf(pev->group, 64, "%s_%s", PERFPROBE_GROUP,
+					ptr1);
+			free(ptr1);
+		}
+	}
+	free(pp->function);
+	pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+	if (!pp->function) {
+		ret = -ENOMEM;
+		pr_warning("Failed to allocate memory by zalloc.\n");
+		goto out;
+	}
+	e_snprintf(pp->function, MAX_PROBE_ARGS, "%s:0x%llx", name, vaddr);
+	ret = 0;
+
+out:
+	if (map) {
+		dso__delete(map->dso);
+		map__delete(map);
+	}
+	if (function)
+		free(function);
+	if (name)
+		free(name);
+	return ret;
+}
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index a7dee83..9e8c846 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -7,7 +7,7 @@
 
 extern bool probe_event_dry_run;
 
-/* kprobe-tracer tracing point */
+/* kprobe-tracer and uprobe-tracer tracing point */
 struct probe_trace_point {
 	char		*symbol;	/* Base symbol */
 	char		*module;	/* Module name */
@@ -21,7 +21,7 @@ struct probe_trace_arg_ref {
 	long				offset;	/* Offset value */
 };
 
-/* kprobe-tracer tracing argument */
+/* kprobe-tracer and uprobe-tracer tracing argument */
 struct probe_trace_arg {
 	char				*name;	/* Argument name */
 	char				*value;	/* Base value */
@@ -29,12 +29,13 @@ struct probe_trace_arg {
 	struct probe_trace_arg_ref	*ref;	/* Referencing offset */
 };
 
-/* kprobe-tracer tracing event (point + arg) */
+/* kprobe-tracer and uprobe-tracer tracing event (point + arg) */
 struct probe_trace_event {
 	char				*event;	/* Event name */
 	char				*group;	/* Group name */
 	struct probe_trace_point	point;	/* Trace point */
 	int				nargs;	/* Number of args */
+	bool				uprobes;	/* uprobes only */
 	struct probe_trace_arg		*args;	/* Arguments */
 };
 
@@ -70,6 +71,7 @@ struct perf_probe_event {
 	char			*group;	/* Group name */
 	struct perf_probe_point	point;	/* Probe point */
 	int			nargs;	/* Number of arguments */
+	bool			uprobes;
 	struct perf_probe_arg	*args;	/* Arguments */
 };
 
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 632b50c..e81c4fd 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -2767,3 +2767,11 @@ int machine__load_vmlinux_path(struct machine *machine, enum map_type type,
 
 	return ret;
 }
+
+struct map *dso__new_map(const char *name)
+{
+	struct dso *dso = dso__new(name);
+	struct map *map = map__new2(0, dso, MAP__FUNCTION);
+
+	return map;
+}
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 29f8d74..6d28bbd 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -217,6 +217,7 @@ void dso__set_long_name(struct dso *dso, char *name);
 void dso__set_build_id(struct dso *dso, void *build_id);
 void dso__read_running_kernel_build_id(struct dso *dso,
 				       struct machine *machine);
+struct map *dso__new_map(const char *name);
 struct symbol *dso__find_symbol(struct dso *dso, enum map_type type,
 				u64 addr);
 struct symbol *dso__find_symbol_by_name(struct dso *dso, enum map_type type,



* [PATCH v7 3.2-rc2 24/30] perf: show possible probes in a given executable file or library.
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (22 preceding siblings ...)
  2011-11-18 11:11 ` [PATCH v7 3.2-rc2 23/30] perf: perf interface for uprobes Srikar Dronamraju
@ 2011-11-18 11:11 ` Srikar Dronamraju
  2011-11-18 11:11 ` [PATCH v7 3.2-rc2 25/30] uprobes: call post_xol() unconditionally Srikar Dronamraju
                   ` (7 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:11 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Enhance the -F/--funcs option of "perf probe" to list possible probe points
in an executable file or library.

Show the last 10 functions in /bin/zsh:

# perf probe -F -x /bin/zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam

Show the first 10 functions in /lib/libc.so.6:

# perf probe -F -x /lib/libc.so.6 | head
_IO_adjust_column
_IO_adjust_wcolumn
_IO_default_doallocate
_IO_default_finish
_IO_default_pbackfail
_IO_default_uflow
_IO_default_xsgetn
_IO_default_xsputn
_IO_do_write@@GLIBC_2.2.5
_IO_doallocbuf

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v5)
- Removed the separate documentation change patch and added the
  documentation changes as part of this patch.

 tools/perf/Documentation/perf-probe.txt |    4 ++
 tools/perf/builtin-probe.c              |    4 +-
 tools/perf/util/probe-event.c           |   55 ++++++++++++++++++++++++-------
 tools/perf/util/probe-event.h           |    4 +-
 4 files changed, 49 insertions(+), 18 deletions(-)

diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt
index 469ad6d..be88378 100644
--- a/tools/perf/Documentation/perf-probe.txt
+++ b/tools/perf/Documentation/perf-probe.txt
@@ -78,6 +78,8 @@ OPTIONS
 -F::
 --funcs::
 	Show available functions in given module or kernel.
+	With -x/--exec, can also list functions in a user space executable
+	/ shared library.
 
 --filter=FILTER::
 	(Only for --vars and --funcs) Set filter. FILTER is a combination of glob
@@ -101,7 +103,7 @@ OPTIONS
 -x::
 --exec=PATH::
 	Specify path to the executable or shared library file for user
-	space tracing.
+	space tracing. Can also be used with --funcs option.
 
 PROBE SYNTAX
 ------------
diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 43e6321..5e7622c 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -365,8 +365,8 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 		if (!params.filter)
 			params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
 						       NULL);
-		ret = show_available_funcs(params.target,
-					   params.filter);
+		ret = show_available_funcs(params.target, params.filter,
+					params.uprobes);
 		strfilter__delete(params.filter);
 		if (ret < 0)
 			pr_err("  Error: Failed to show functions."
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index d4f4c2b..2c4ec61 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -47,6 +47,7 @@
 #include "trace-event.h"	/* For __unused */
 #include "probe-event.h"
 #include "probe-finder.h"
+#include "session.h"
 
 #define MAX_CMDLEN 256
 #define MAX_PROBE_ARGS 128
@@ -2179,32 +2180,60 @@ static int filter_available_functions(struct map *map __unused,
 	return 1;
 }
 
-int show_available_funcs(const char *target, struct strfilter *_filter)
+static int __show_available_funcs(struct map *map)
+{
+	if (map__load(map, filter_available_functions)) {
+		pr_err("Failed to load map.\n");
+		return -EINVAL;
+	}
+	if (!dso__sorted_by_name(map->dso, map->type))
+		dso__sort_by_name(map->dso, map->type);
+
+	dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
+	return 0;
+}
+
+static int available_kernel_funcs(const char *module)
 {
 	struct map *map;
 	int ret;
 
-	setup_pager();
-
 	ret = init_vmlinux();
 	if (ret < 0)
 		return ret;
 
-	map = kernel_get_module_map(target);
+	map = kernel_get_module_map(module);
 	if (!map) {
-		pr_err("Failed to find %s map.\n", (target) ? : "kernel");
+		pr_err("Failed to find %s map.\n", (module) ? : "kernel");
 		return -EINVAL;
 	}
+	return __show_available_funcs(map);
+}
+
+int show_available_funcs(const char *target, struct strfilter *_filter,
+					bool user)
+{
+	struct map *map;
+	int ret;
+
+	setup_pager();
 	available_func_filter = _filter;
-	if (map__load(map, filter_available_functions)) {
-		pr_err("Failed to load map.\n");
-		return -EINVAL;
-	}
-	if (!dso__sorted_by_name(map->dso, map->type))
-		dso__sort_by_name(map->dso, map->type);
 
-	dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
-	return 0;
+	if (!user)
+		return available_kernel_funcs(target);
+
+	symbol_conf.try_vmlinux_path = false;
+	symbol_conf.sort_by_name = true;
+	ret = symbol__init();
+	if (ret < 0) {
+		pr_err("Failed to init symbol map.\n");
+		return ret;
+	}
+	map = dso__new_map(target);
+	ret = __show_available_funcs(map);
+	dso__delete(map->dso);
+	map__delete(map);
+	return ret;
 }
 
 #define DEFAULT_FUNC_FILTER "!_*"
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 9e8c846..f9f3de8 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -131,8 +131,8 @@ extern int show_line_range(struct line_range *lr, const char *module);
 extern int show_available_vars(struct perf_probe_event *pevs, int npevs,
 			       int max_probe_points, const char *module,
 			       struct strfilter *filter, bool externs);
-extern int show_available_funcs(const char *module, struct strfilter *filter);
-
+extern int show_available_funcs(const char *module, struct strfilter *filter,
+				bool user);
 
 /* Maximum index number of event-name postfix */
 #define MAX_EVENT_INDEX	1024



* [PATCH v7 3.2-rc2 25/30] uprobes: call post_xol() unconditionally
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (23 preceding siblings ...)
  2011-11-18 11:11 ` [PATCH v7 3.2-rc2 24/30] perf: show possible probes in a given executable file or library Srikar Dronamraju
@ 2011-11-18 11:11 ` Srikar Dronamraju
  2011-11-18 11:11 ` [PATCH v7 3.2-rc2 26/30] uprobes: introduce uprobe_deny_signal() Srikar Dronamraju
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:11 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Kill sstep_complete() and change uprobe_notify_resume() to use
post_xol() unconditionally.

It is wrong to assume that regs->ip always changes after the step;
a rep-prefixed insn or a jmp/call to self, for example, can leave it
unchanged. Since we know that this task has already done the step,
we can rely on the DIE_DEBUG notification instead.
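
For illustration only (not part of the patch): a rough user-space model
of why the ip comparison fails and the state check does not. The struct
and the model_* / old_* / new_* names are hypothetical; only the state
names come from the patch.

```c
#include <stdbool.h>

/* Hypothetical user-space model; this is not the kernel code. */
enum utask_state { UTASK_RUNNING, UTASK_BP_HIT, UTASK_SSTEP, UTASK_SSTEP_ACK };

struct model_task {
	enum utask_state state;
	unsigned long xol_vaddr;	/* address of the out-of-line slot */
	unsigned long ip;		/* instruction pointer after the step */
};

/* Old heuristic: "ip still at the slot" is read as "step not done yet". */
bool old_sstep_complete(const struct model_task *t)
{
	return t->ip != t->xol_vaddr;
}

/* New rule: the step is complete once the debug trap has acked it. */
bool new_sstep_complete(const struct model_task *t)
{
	return t->state == UTASK_SSTEP_ACK;
}

/* Models uprobe_post_notifier(), which runs from the single-step trap. */
void model_post_notifier(struct model_task *t)
{
	if (t->state == UTASK_SSTEP)
		t->state = UTASK_SSTEP_ACK;
}
```

With a jmp-to-self in the slot, ip equals xol_vaddr after the step, so
the old check never reports completion while the state-based check does.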

Original-patch-from: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    3 ++-
 kernel/uprobes.c        |   38 ++++++++++++--------------------------
 2 files changed, 14 insertions(+), 27 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index add5222..70d639c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -74,7 +74,8 @@ struct uprobe {
 enum uprobe_task_state {
 	UTASK_RUNNING,
 	UTASK_BP_HIT,
-	UTASK_SSTEP
+	UTASK_SSTEP,
+	UTASK_SSTEP_ACK,
 };
 
 /*
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index b440acd..50cde86 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1313,24 +1313,6 @@ static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
 }
 
 /*
- * Verify from Instruction Pointer if singlestep has indeed occurred.
- * If Singlestep has occurred, then do post singlestep fix-ups.
- */
-static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
-{
-	unsigned long vaddr = instruction_pointer(regs);
-
-	/*
-	 * If we have executed out of line, Instruction pointer
-	 * cannot be same as virtual address of XOL slot.
-	 */
-	if (vaddr == current->utask->xol_vaddr)
-		return false;
-	post_xol(uprobe, regs);
-	return true;
-}
-
-/*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
  *
@@ -1377,15 +1359,18 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		else
 			/* Cannot Singlestep; re-execute the instruction. */
 			goto cleanup_ret;
-	} else if (utask->state == UTASK_SSTEP) {
+	} else {
 		u = utask->active_uprobe;
-		if (sstep_complete(u, regs)) {
-			put_uprobe(u);
-			utask->active_uprobe = NULL;
-			utask->state = UTASK_RUNNING;
-			user_disable_single_step(current);
-			xol_free_insn_slot(current);
-		}
+		if (utask->state == UTASK_SSTEP_ACK)
+			post_xol(u, regs);
+		else
+			WARN_ON_ONCE(1);
+
+		put_uprobe(u);
+		utask->active_uprobe = NULL;
+		utask->state = UTASK_RUNNING;
+		user_disable_single_step(current);
+		xol_free_insn_slot(current);
 	}
 	return;
 
@@ -1435,6 +1420,7 @@ int uprobe_post_notifier(struct pt_regs *regs)
 		/* task is currently not uprobed */
 		return 0;
 
+	utask->state = UTASK_SSTEP_ACK;
 	set_thread_flag(TIF_UPROBE);
 	return 1;
 }



* [PATCH v7 3.2-rc2 26/30] uprobes: introduce uprobe_deny_signal()
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (24 preceding siblings ...)
  2011-11-18 11:11 ` [PATCH v7 3.2-rc2 25/30] uprobes: call post_xol() unconditionally Srikar Dronamraju
@ 2011-11-18 11:11 ` Srikar Dronamraju
  2011-11-18 11:12 ` [PATCH v7 3.2-rc2 27/30] uprobes: x86: introduce xol_was_trapped() Srikar Dronamraju
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:11 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


A task that is not UTASK_RUNNING obviously can't handle signals, nor
should it stop/freeze/etc. It must not even exit if it was
SIGKILL'ed.

This patch adds the new hook, uprobe_deny_signal(), called by
get_signal_to_deliver(). It simply clears TIF_SIGPENDING to ensure
that this thread can do nothing connected to signals until it
becomes UTASK_RUNNING.

We also change the post_xol() path to do recalc_sigpending() before
returning to user mode; this ensures the signal can't be lost.
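
For illustration only (not part of the patch): a simplified user-space
model of the deferral. The struct and the model_* names are hypothetical;
the sig_queued field stands in for the real pending-signal queues, and
the locking done by the patch is left out.

```c
#include <stdbool.h>

/* Hypothetical user-space model; this is not the kernel code. */
struct model_task {
	bool active_uprobe;	/* a probed insn is being single-stepped */
	bool sig_queued;	/* a signal sits in the pending queue */
	bool tif_sigpending;	/* models TIF_SIGPENDING */
};

/* Models uprobe_deny_signal(): hide the pending flag while stepping. */
bool model_deny_signal(struct model_task *t)
{
	if (!t->active_uprobe)
		return false;
	t->tif_sigpending = false;
	return true;
}

/* Models the tail of uprobe_notify_resume() once the step is done. */
void model_step_done(struct model_task *t)
{
	t->active_uprobe = false;
	/* Stands in for recalc_sigpending(): re-derive flag from the queue. */
	t->tif_sigpending = t->sig_queued;
}
```

The signal stays queued the whole time; only the flag is cleared and
later recomputed, which is why it cannot be lost.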

Original-patch-from: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    5 +++++
 kernel/signal.c         |    3 +++
 kernel/uprobes.c        |   23 +++++++++++++++++++++++
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 70d639c..8d12c06 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -129,6 +129,7 @@ extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
 extern int uprobe_post_notifier(struct pt_regs *regs);
 extern int uprobe_bkpt_notifier(struct pt_regs *regs);
 extern void uprobe_notify_resume(struct pt_regs *regs);
+extern bool uprobe_deny_signal(void);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
@@ -149,6 +150,10 @@ static inline void munmap_uprobe(struct vm_area_struct *vma)
 static inline void uprobe_notify_resume(struct pt_regs *regs)
 {
 }
+static inline bool uprobe_deny_signal(void)
+{
+	return false;
+}
 static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
 {
 	return 0;
diff --git a/kernel/signal.c b/kernel/signal.c
index b3f78d0..5d68510 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2149,6 +2149,9 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
 	struct signal_struct *signal = current->signal;
 	int signr;
 
+	if (unlikely(uprobe_deny_signal()))
+		return 0;
+
 relock:
 	/*
 	 * We'll jump back here after any time we were stopped in TASK_STOPPED.
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 50cde86..3e7c4c5 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1312,6 +1312,25 @@ static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
 	return -EFAULT;
 }
 
+bool uprobe_deny_signal(void)
+{
+	struct task_struct *tsk = current;
+	struct uprobe_task *utask = tsk->utask;
+
+	if (likely(!utask || !utask->active_uprobe))
+		return false;
+
+	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
+
+	if (signal_pending(tsk)) {
+		spin_lock_irq(&tsk->sighand->siglock);
+		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
+		spin_unlock_irq(&tsk->sighand->siglock);
+	}
+
+	return true;
+}
+
 /*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
@@ -1371,6 +1390,10 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		utask->state = UTASK_RUNNING;
 		user_disable_single_step(current);
 		xol_free_insn_slot(current);
+
+		spin_lock_irq(&current->sighand->siglock);
+		recalc_sigpending(); /* see uprobe_deny_signal() */
+		spin_unlock_irq(&current->sighand->siglock);
 	}
 	return;
 



* [PATCH v7 3.2-rc2 27/30] uprobes: x86: introduce xol_was_trapped()
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (25 preceding siblings ...)
  2011-11-18 11:11 ` [PATCH v7 3.2-rc2 26/30] uprobes: introduce uprobe_deny_signal() Srikar Dronamraju
@ 2011-11-18 11:12 ` Srikar Dronamraju
  2011-11-18 11:12 ` [PATCH v7 3.2-rc2 28/30] uprobes: introduce UTASK_SSTEP_TRAPPED logic Srikar Dronamraju
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


uprobe_deny_signal() postpones signals until we have executed the probed
insn. This is simply wrong if the xol insn traps and generates the signal
itself, say SIGILL/SIGSEGV/etc.

Add xol_was_trapped() to detect this case. It assumes that anything
like do_page_fault/do_trap/etc sets thread.trap_no != -1.

We add uprobe_task_arch_info->saved_trap_no and change
pre_xol/post_xol to save/restore thread.trap_no; xol_was_trapped()
simply checks that ->trap_no is not equal to UPROBE_TRAP_NO == -1
set by pre_xol().
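
For illustration only (not part of the patch): a small user-space model
of the sentinel scheme. The struct and the model_* names are
hypothetical, and the macro is parenthesized here for safety in the
sketch.

```c
#include <stdbool.h>

#define UPROBE_TRAP_NO	(-1UL)

/* Hypothetical stand-in for the thread.trap_no / saved_trap_no pair. */
struct model_thread {
	unsigned long trap_no;
	unsigned long saved_trap_no;
};

/* Before the step: stash the old value and plant the sentinel. */
void model_pre_xol(struct model_thread *th)
{
	th->saved_trap_no = th->trap_no;
	th->trap_no = UPROBE_TRAP_NO;
}

/* Any trap handler that ran during the step has overwritten the sentinel. */
bool model_xol_was_trapped(const struct model_thread *th)
{
	return th->trap_no != UPROBE_TRAP_NO;
}

/* After the step (or on abort): put the original value back. */
void model_post_xol(struct model_thread *th)
{
	th->trap_no = th->saved_trap_no;
}
```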

Original-patch-from: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---

Changelog (since v6)
- added x86 specific hook for aborting xol.

 arch/x86/include/asm/uprobes.h |    7 ++++++-
 arch/x86/kernel/uprobes.c      |   33 ++++++++++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 99d7d4b..6a47024 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -39,16 +39,21 @@ struct uprobe_arch_info {
 
 struct uprobe_task_arch_info {
 	unsigned long saved_scratch_register;
+	unsigned long saved_trap_no;
 };
 #else
 struct uprobe_arch_info {};
-struct uprobe_task_arch_info {};
+struct uprobe_task_arch_info {
+	unsigned long saved_trap_no;
+};
 #endif
 struct uprobe;
 extern int analyze_insn(struct mm_struct *mm, struct uprobe *uprobe);
 extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
 extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
 extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern bool xol_was_trapped(struct task_struct *tsk);
 extern int uprobe_exception_notify(struct notifier_block *self,
 				       unsigned long val, void *data);
+extern void abort_xol(struct pt_regs *regs, struct uprobe *uprobe);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 0792fc8..3f0eb4e 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -409,6 +409,8 @@ void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
 	regs->ip = vaddr;
 }
 
+#define	UPROBE_TRAP_NO	-1ul
+
 /*
  * pre_xol - prepare to execute out of line.
  * @uprobe: the probepoint information.
@@ -424,6 +426,9 @@ int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 {
 	struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
 
+	tskinfo->saved_trap_no = current->thread.trap_no;
+	current->thread.trap_no = UPROBE_TRAP_NO;
+
 	regs->ip = current->utask->xol_vaddr;
 	if (uprobe->fixups & UPROBES_FIX_RIP_AX) {
 		tskinfo->saved_scratch_register = regs->ax;
@@ -439,6 +444,11 @@ int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 #else
 int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 {
+	struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
+
+	tskinfo->saved_trap_no = current->thread.trap_no;
+	current->thread.trap_no = UPROBE_TRAP_NO;
+
 	regs->ip = current->utask->xol_vaddr;
 	return 0;
 }
@@ -494,7 +504,8 @@ static void handle_riprel_post_xol(struct uprobe *uprobe,
 		 * Fall through to handle stuff like "jmpq *...(%rip)" and
 		 * "callq *...(%rip)".
 		 */
-		*correction += 4;
+		if (correction)
+			*correction += 4;
 	}
 }
 #else
@@ -504,6 +515,14 @@ static void handle_riprel_post_xol(struct uprobe *uprobe,
 }
 #endif
 
+bool xol_was_trapped(struct task_struct *tsk)
+{
+	if (tsk->thread.trap_no != UPROBE_TRAP_NO)
+		return true;
+
+	return false;
+}
+
 /*
  * Called after single-stepping. To avoid the SMP problems that can
  * occur when we temporarily put back the original opcode to
@@ -534,6 +553,9 @@ int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
 	int result = 0;
 	long correction;
 
+	WARN_ON_ONCE(current->thread.trap_no != UPROBE_TRAP_NO);
+
+	current->thread.trap_no = utask->tskinfo.saved_trap_no;
 	correction = (long)(utask->vaddr - utask->xol_vaddr);
 	handle_riprel_post_xol(uprobe, regs, &correction);
 	if (uprobe->fixups & UPROBES_FIX_IP)
@@ -571,3 +593,12 @@ int uprobe_exception_notify(struct notifier_block *self,
 	}
 	return ret;
 }
+
+void abort_xol(struct pt_regs *regs, struct uprobe *uprobe)
+{
+	struct uprobe_task *utask = current->utask;
+
+	current->thread.trap_no = utask->tskinfo.saved_trap_no;
+	handle_riprel_post_xol(uprobe, regs, NULL);
+	set_instruction_pointer(regs, utask->vaddr);
+}



* [PATCH v7 3.2-rc2 28/30] uprobes: introduce UTASK_SSTEP_TRAPPED logic
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (26 preceding siblings ...)
  2011-11-18 11:12 ` [PATCH v7 3.2-rc2 27/30] uprobes: x86: introduce xol_was_trapped() Srikar Dronamraju
@ 2011-11-18 11:12 ` Srikar Dronamraju
  2011-11-18 11:12 ` [PATCH v7 3.2-rc2 29/30] uprobes: Introduce uprobe flags Srikar Dronamraju
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Add UTASK_SSTEP_TRAPPED state/code to handle the case when the
xol insn itself triggers the signal.

In this case we should restart the original insn even if the task is
already SIGKILL'ed (say, the coredump should report the correct ip).
This is even more important if the task has a handler for SIGSEGV/etc.
The _same_ instruction should be repeated again after return from the
signal handler, and SSTEP can never finish in this case.

Change uprobe_deny_signal() to set UTASK_SSTEP_TRAPPED and TIF_UPROBE. It
also sets TIF_NOTIFY_RESUME.

When uprobe_notify_resume() sees UTASK_SSTEP_TRAPPED it does abort_xol()
instead of post_xol().
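
For illustration only (not part of the patch): a simplified user-space
model of the two exit paths. The struct and the model_* names are
hypothetical, and the real post_xol() fix-ups (call return address,
rip-relative scratch register) are reduced to a plain ip translation.

```c
/* Hypothetical user-space model; this is not the kernel code. */
enum model_state {
	UTASK_RUNNING,
	UTASK_SSTEP_ACK,
	UTASK_SSTEP_TRAPPED,
};

struct model_task {
	enum model_state state;
	unsigned long vaddr;		/* address of the probed insn */
	unsigned long xol_vaddr;	/* address of its out-of-line copy */
	unsigned long ip;		/* stands in for regs->ip */
};

/* Step completed: translate ip from the slot back to the original text. */
void model_post_xol(struct model_task *t)
{
	t->ip += t->vaddr - t->xol_vaddr;
}

/* Step trapped: point ip at the original insn so it is restarted. */
void model_abort_xol(struct model_task *t)
{
	t->ip = t->vaddr;
}

/* Models the non-BP_HIT branch of uprobe_notify_resume(). */
void model_finish_step(struct model_task *t)
{
	if (t->state == UTASK_SSTEP_ACK)
		model_post_xol(t);
	else if (t->state == UTASK_SSTEP_TRAPPED)
		model_abort_xol(t);
	t->state = UTASK_RUNNING;
}
```

In the trapped case the ip reported to a signal handler or a coredump is
the address of the probed insn, not an address inside the xol slot.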

Original-patch-from: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog (since v6)
-	abort_xol moved to previous patch.

 include/linux/uprobes.h |    1 +
 kernel/uprobes.c        |   13 ++++++++++---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 8d12c06..6a84332 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,7 @@ enum uprobe_task_state {
 	UTASK_BP_HIT,
 	UTASK_SSTEP,
 	UTASK_SSTEP_ACK,
+	UTASK_SSTEP_TRAPPED,
 };
 
 /*
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 3e7c4c5..f8c0f7c 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1326,6 +1326,12 @@ bool uprobe_deny_signal(void)
 		spin_lock_irq(&tsk->sighand->siglock);
 		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 		spin_unlock_irq(&tsk->sighand->siglock);
+
+		if (__fatal_signal_pending(tsk) || xol_was_trapped(tsk)) {
+			utask->state = UTASK_SSTEP_TRAPPED;
+			set_tsk_thread_flag(tsk, TIF_UPROBE);
+			set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+		}
 	}
 
 	return true;
@@ -1382,6 +1388,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		u = utask->active_uprobe;
 		if (utask->state == UTASK_SSTEP_ACK)
 			post_xol(u, regs);
+		else if (utask->state == UTASK_SSTEP_TRAPPED)
+			abort_xol(regs, u);
 		else
 			WARN_ON_ONCE(1);
 
@@ -1405,9 +1413,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
 	if (u) {
 		put_uprobe(u);
 		set_instruction_pointer(regs, probept);
-	} else {
-		/*TODO Return SIGTRAP signal */
-	}
+	} else
+		send_sig(SIGTRAP, current, 0);
 }
 
 /*



* [PATCH v7 3.2-rc2 29/30] uprobes: Introduce uprobe flags
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (27 preceding siblings ...)
  2011-11-18 11:12 ` [PATCH v7 3.2-rc2 28/30] uprobes: introduce UTASK_SSTEP_TRAPPED logic Srikar Dronamraju
@ 2011-11-18 11:12 ` Srikar Dronamraju
  2011-11-18 11:12 ` [PATCH v7 3.2-rc2 30/30] x86: skip singlestep where possible Srikar Dronamraju
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


While registering a probe, there is a time lag between the register
request and the point where the probe has been inserted in all the
relevant processes. If the register fails after inserting the probe in a
couple of processes, the installed probes are reverted. However, the
probes could have been hit and could have triggered the handler before
they were reverted.

Avoid running the handler until the register is complete, and stop
running it as soon as the last unregister kicks in.

This patch also
- enables skipping singlestep where possible.
- uses a flag to denote if a copy of instruction is made.
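
For illustration only (not part of the patch): a user-space model of how
UPROBES_RUN_HANDLER gates the handler. The flag values match the patch;
the struct, the hits counter and the model_* names are hypothetical.

```c
#define UPROBES_COPY_INSN	0x1
#define UPROBES_RUN_HANDLER	0x2
#define UPROBES_SKIP_SSTEP	0x4

/* Hypothetical stand-in for struct uprobe. */
struct model_uprobe {
	int flags;
	int hits;	/* counts how often the consumers would have run */
};

/* Models handler_chain(): a hit is ignored unless the flag is set. */
void model_handler_chain(struct model_uprobe *u)
{
	if (!(u->flags & UPROBES_RUN_HANDLER))
		return;
	u->hits++;
}

/* Models the end of register_uprobe(): arm handlers only on success. */
void model_register_done(struct model_uprobe *u, int err)
{
	if (!err)
		u->flags |= UPROBES_RUN_HANDLER;
}

/* Models the last unregister: disarm handlers. */
void model_last_unregister(struct model_uprobe *u)
{
	u->flags &= ~UPROBES_RUN_HANDLER;
}
```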

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |   11 ++++++++++-
 kernel/uprobes.c        |   32 ++++++++++++++++++++++++++------
 2 files changed, 36 insertions(+), 7 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 6a84332..20bdd0a 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -46,6 +46,14 @@ struct uprobe_task_arch_info {};	/* arch specific task info */
 /* Adjust the return address of a call insn */
 #define UPROBES_FIX_CALL	0x2
 
+/* flags that denote/change uprobes behaviour */
+/* Have a copy of original instruction */
+#define UPROBES_COPY_INSN	0x1
+/* Dont run handlers when first register/ last unregister in progress*/
+#define UPROBES_RUN_HANDLER	0x2
+/* Can skip singlestep */
+#define UPROBES_SKIP_SSTEP	0x4
+
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	/*
@@ -66,7 +74,7 @@ struct uprobe {
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
-	int			copy;
+	int			flags;
 	u16			fixups;
 	u8			insn[MAX_UINSN_BYTES];
 };
@@ -131,6 +139,7 @@ extern int uprobe_post_notifier(struct pt_regs *regs);
 extern int uprobe_bkpt_notifier(struct pt_regs *regs);
 extern void uprobe_notify_resume(struct pt_regs *regs);
 extern bool uprobe_deny_signal(void);
+extern bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index f8c0f7c..2493191 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -436,6 +436,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	spin_lock_irqsave(&uprobes_treelock, flags);
 	u = __insert_uprobe(uprobe);
 	spin_unlock_irqrestore(&uprobes_treelock, flags);
+
+	/* For now assume that the instruction need not be single-stepped */
+	uprobe->flags |= UPROBES_SKIP_SSTEP;
 	return u;
 }
 
@@ -475,6 +478,9 @@ static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
 {
 	struct uprobe_consumer *consumer;
 
+	if (!(uprobe->flags & UPROBES_RUN_HANDLER))
+		return;
+
 	down_read(&uprobe->consumer_rwsem);
 	consumer = uprobe->consumers;
 	for (consumer = uprobe->consumers; consumer;
@@ -594,7 +600,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 		return -EEXIST;
 
 	addr = (unsigned long)vaddr;
-	if (!uprobe->copy) {
+	if (!(uprobe->flags & UPROBES_COPY_INSN)) {
 		ret = copy_insn(uprobe, vma, addr);
 		if (ret)
 			return ret;
@@ -606,7 +612,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 		if (ret)
 			return ret;
 
-		uprobe->copy = 1;
+		uprobe->flags |= UPROBES_COPY_INSN;
 	}
 	ret = set_bkpt(mm, uprobe, addr);
 	if (!ret)
@@ -850,7 +856,8 @@ int register_uprobe(struct inode *inode, loff_t offset,
 		if (ret) {
 			uprobe->consumers = NULL;
 			__unregister_uprobe(inode, offset, uprobe);
-		}
+		} else
+			uprobe->flags |= UPROBES_RUN_HANDLER;
 	}
 
 	mutex_unlock(uprobes_hash(inode));
@@ -886,9 +893,10 @@ void unregister_uprobe(struct inode *inode, loff_t offset,
 		goto unreg_out;
 	}
 
-	if (!uprobe->consumers)
+	if (!uprobe->consumers) {
 		__unregister_uprobe(inode, offset, uprobe);
-
+		uprobe->flags &= ~UPROBES_RUN_HANDLER;
+	}
 	mutex_unlock(uprobes_hash(inode));
 
 unreg_out:
@@ -1337,6 +1345,12 @@ bool uprobe_deny_signal(void)
 	return true;
 }
 
+bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u)
+{
+	u->flags &= ~UPROBES_SKIP_SSTEP;
+	return false;
+}
+
 /*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
@@ -1378,6 +1392,10 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		}
 		utask->active_uprobe = u;
 		handler_chain(u, regs);
+
+		if (u->flags & UPROBES_SKIP_SSTEP && can_skip_xol(regs, u))
+			goto cleanup_ret;
+
 		utask->state = UTASK_SSTEP;
 		if (!pre_ssout(u, regs, probept))
 			user_enable_single_step(current);
@@ -1411,8 +1429,10 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		utask->state = UTASK_RUNNING;
 	}
 	if (u) {
+		if (!(u->flags & UPROBES_SKIP_SSTEP))
+			set_instruction_pointer(regs, probept);
+
 		put_uprobe(u);
-		set_instruction_pointer(regs, probept);
 	} else
 		send_sig(SIGTRAP, current, 0);
 }



* [PATCH v7 3.2-rc2 30/30] x86: skip singlestep where possible
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (28 preceding siblings ...)
  2011-11-18 11:12 ` [PATCH v7 3.2-rc2 29/30] uprobes: Introduce uprobe flags Srikar Dronamraju
@ 2011-11-18 11:12 ` Srikar Dronamraju
  2011-11-22  5:03 ` [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
  31 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-18 11:12 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds
  Cc: Oleg Nesterov, Andrew Morton, LKML, Linux-mm, Ingo Molnar,
	Andi Kleen, Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Thomas Gleixner, Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Anton Arapov, Ananth N Mavinakayanahalli, Jim Keniston,
	Stephen Wilson


Check for and skip singlestepping the underlying instruction where
possible.

For now this handles single-byte as well as a few multibyte nop
instructions. However, it can be extended to other instructions too.
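
For illustration only (not part of the patch): a stand-alone sketch of
the same opcode test, written so that it never reads past the end of the
buffer. The function name and the explicit len parameter are
hypothetical; the opcode patterns are the ones listed in the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if insn[0..len) starts with a skippable nop. */
bool model_is_skippable_nop(const uint8_t *insn, size_t len)
{
	size_t i = 0;

	/* Skip any leading 0x66 operand-size prefixes. */
	while (i < len && insn[i] == 0x66)
		i++;
	if (i >= len)
		return false;

	/* Single-byte nop. */
	if (insn[i] == 0x90)
		return true;

	/* Everything below needs a second opcode byte. */
	if (i + 1 >= len)
		return false;

	/* 0f 1f /0 is the multibyte nop; 0f 19 is also accepted by the patch. */
	if (insn[i] == 0x0f && (insn[i + 1] == 0x1f || insn[i + 1] == 0x19))
		return true;

	/* xchg %eax,%eax */
	if (insn[i] == 0x87 && insn[i + 1] == 0xc0)
		return true;

	return false;
}
```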

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/kernel/uprobes.c |   44 ++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 44 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 3f0eb4e..f59053f 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -602,3 +602,47 @@ void abort_xol(struct pt_regs *regs, struct uprobe *uprobe)
 	handle_riprel_post_xol(uprobe, regs, NULL);
 	set_instruction_pointer(regs, utask->vaddr);
 }
+
+/*
+ * Skip these instructions:
+ *
+ * 0f 19 90 90 90 90 90		nopl   -0x6f6f6f70(%rax)
+ * 0f 1f 00			nopl (%rax)
+ * 0f 1f 40 00			nopl 0x0(%rax)
+ * 0f 1f 44 00 00		nopl 0x0(%rax,%rax,1)
+ * 0f 1f 80 00 00 00 00		nopl 0x0(%rax)
+ * 0f 1f 84 00 00 00 00		nopl 0x0(%rax,%rax,1)
+ * 66 0f 1f 44 00 00 00		nopw 0x0(%rax,%rax,1)
+ * 66 0f 1f 84 00 00 00		nopw 0x0(%rax,%rax,1)
+ * 66 87 c0			xchg %eax,%eax
+ * 66 90			nop
+ * 87 c0			xchg %eax,%eax
+ * 90 				nop
+ */
+
+bool can_skip_xol(struct pt_regs *regs, struct uprobe *u)
+{
+	int i;
+
+	for (i = 0; i < MAX_UINSN_BYTES; i++) {
+		if ((u->insn[i] == 0x66))
+			continue;
+
+		if (u->insn[i] == 0x90)
+			return true;
+
+		if ((u->insn[i] == 0x0f) && (u->insn[i+1] == 0x1f))
+			return true;
+
+		if ((u->insn[i] == 0x0f) && (u->insn[i+1] == 0x19))
+			return true;
+
+		if ((u->insn[i] == 0x87) && (u->insn[i+1] == 0xc0))
+			return true;
+
+		break;
+	}
+
+	u->flags &= ~UPROBES_SKIP_SSTEP;
+	return false;
+}



* Re: [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (29 preceding siblings ...)
  2011-11-18 11:12 ` [PATCH v7 3.2-rc2 30/30] x86: skip singlestep where possible Srikar Dronamraju
@ 2011-11-22  5:03 ` Srikar Dronamraju
  2011-11-22 14:49   ` Stephen Rothwell
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
  31 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-22  5:03 UTC (permalink / raw)
  To: Peter Zijlstra, Linus Torvalds, Thomas Gleixner, Oleg Nesterov,
	Ingo Molnar, Stephen Rothwell
  Cc: Andrew Morton, LKML, Linux-mm, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Ananth N Mavinakayanahalli,
	Jim Keniston, H. Peter Anvin

Hi Ingo, Thomas, Linus, Stephen, Peter,

> This patchset resolves most of the comments on the previous posting
> (https://lkml.org/lkml/2011/11/10/408) patchset applies on top of
> commit cfcfc9eca2b
> 
> This patchset depends on bulkref patch from Paul McKenney
> https://lkml.org/lkml/2011/11/2/365 and enable interrupts before
> calling do_notify_resume on i686 patch
> https://lkml.org/lkml/2011/10/25/265.
> 
> uprobes git is hosted at git://github.com/srikard/linux.git
> with branch inode_uprobes_v32rc2.
> (The previous patchset posted to lkml has been rebased to 3.2-rc2 is also
> available at branch inode_uprobes_v32rc2_prev. This is to help the
> reviewers of the previous patchsets to quickly identify the changes.)
> 
> Uprobes Patches
> This patchset implements inode based uprobes which are specified as
> <file>:<offset> where offset is the offset from start of the map.

Given that uprobes has been reviewed several times on LKML and all
comments till now have been addressed, can we push uprobes into either
-tip or -next. This will help people to test and give more feedback and
also provide a way for it to be pushed into 3.3. This also helps in
resolving and pushing fixes faster.

If you have concerns, can you please voice them?

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support
  2011-11-22  5:03 ` [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
@ 2011-11-22 14:49   ` Stephen Rothwell
  2011-11-23 13:20     ` Srikar Dronamraju
  0 siblings, 1 reply; 106+ messages in thread
From: Stephen Rothwell @ 2011-11-22 14:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Thomas Gleixner, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Jim Keniston, H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 2633 bytes --]

Hi Srikar,

On Tue, 22 Nov 2011 10:33:30 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>
> > uprobes git is hosted at git://github.com/srikard/linux.git
> > with branch inode_uprobes_v32rc2.
>
> Given that uprobes has been reviewed several times on LKML and all
> comments till now have been addressed, can we push uprobes into either
> -tip or -next. This will help people to test and give more feedback and
> also provide a way for it to be pushed into 3.3. This also helps in
> resolving and pushing fixes faster.

OK, I have added that to linux-next with you as the contact,

> If you have concerns, can you please voice them?

You should tidy up the commit messages (they almost all have really bad
short descriptions) and make sure that the authorship is correct in all
cases.

Also, I would prefer a less version specific branch name (like "for-next"
or something) that way you won't have to keep asking me to change it over
time.  If there is any way you can host this on kernel.org, that will
make the merging into Linus' tree a bit smoother.

Thanks for adding your subsystem tree as a participant of linux-next.  As
you may know, this is not a judgment of your code.  The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window. 

You will need to ensure that the patches/commits in your tree/series have
been:
     * submitted under GPL v2 (or later) and include the Contributor's
	Signed-off-by,
     * posted to the relevant mailing list,
     * reviewed by you (or another maintainer of your subsystem tree),
     * successfully unit tested, and 
     * destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch).  It is allowed to be rebased if you deem it necessary.

-- 
Cheers,
Stephen Rothwell 
sfr@canb.auug.org.au

Legal Stuff:
By participating in linux-next, your subsystem tree contributions are
public and will be included in the linux-next trees.  You may be sent
e-mail messages indicating errors or other issues when the
patches/commits from your subsystem tree are merged and tested in
linux-next.  These messages may also be cross-posted to the linux-next
mailing list, the linux-kernel mailing list, etc.  The linux-next tree
project and IBM (my employer) make no warranties regarding the linux-next
project, the testing procedures, the results, the e-mails, etc.  If you
don't agree to these ground rules, let me know and I'll remove your tree
from participation in linux-next.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support
  2011-11-22 14:49   ` Stephen Rothwell
@ 2011-11-23 13:20     ` Srikar Dronamraju
  2011-11-23 13:38       ` Stephen Rothwell
  0 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-23 13:20 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Peter Zijlstra, Linus Torvalds, Thomas Gleixner, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Jim Keniston, mailsrikar,
	H. Peter Anvin

Hi Stephen, 

> > > uprobes git is hosted at git://github.com/srikard/linux.git
> > > with branch inode_uprobes_v32rc2.
> >
> > Given that uprobes has been reviewed several times on LKML and all
> > comments till now have been addressed, can we push uprobes into either
> > -tip or -next. This will help people to test and give more feedback and
> > also provide a way for it to be pushed into 3.3. This also helps in
> > resolving and pushing fixes faster.
> 
> OK, I have added that to linux-next with you as the contact,

Thanks a lot for adding uprobes to linux-next.

I am already getting feedback on things to improve.

> 
> > If you have concerns, can you please voice them?
> 
> You should tidy up the commit messages (they almost all have really bad
> short descriptions) and make sure that the authorship is correct in all
> cases.
> 

I have relooked at the commit messages. 
Have also resolved Dan Carpenter's comments on git log --oneline
not showing properly.

> Also, I would prefer a less version specific branch name (like "for-next"
> or something) that way you won't have to keep asking me to change it over
> time.  If there is any way you can host this on kernel.org, that will
> make the merging into Linus' tree a bit smoother.


I have created a for-next branch at git://github.com/srikard/linux.git.
My kernel.org account isn't re-activated yet because I still need to
complete key-signing. I will try to get that done at the earliest.
Till then, I would have to host on github.



Please do let me know if there is anything that I have missed out.
-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support
  2011-11-23 13:20     ` Srikar Dronamraju
@ 2011-11-23 13:38       ` Stephen Rothwell
  0 siblings, 0 replies; 106+ messages in thread
From: Stephen Rothwell @ 2011-11-23 13:38 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Thomas Gleixner, Oleg Nesterov,
	Ingo Molnar, Andrew Morton, LKML, Linux-mm, Andi Kleen,
	Christoph Hellwig, Steven Rostedt, Roland McGrath,
	Masami Hiramatsu, Arnaldo Carvalho de Melo,
	Ananth N Mavinakayanahalli, Jim Keniston, mailsrikar,
	H. Peter Anvin

[-- Attachment #1: Type: text/plain, Size: 691 bytes --]

Hi Srikar,

On Wed, 23 Nov 2011 18:50:51 +0530 Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
>
> I have relooked at the commit messages. 
> Have also resolved Dan Carpenter's comments on git log --oneline
> not showing properly.

Looks better, thanks.

> I have created a for-next branch at git://github.com/srikard/linux.git.

Thanks, I have switched to that.

> My kernel.org account isn't re-activated yet because I still need to
> complete key-signing. I will try to get that done at the earliest.
> Till then, I would have to host on github.

That's ok.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
@ 2011-11-23 16:09   ` Peter Zijlstra
  2011-11-23 16:11     ` Peter Zijlstra
  2011-11-24 14:39     ` Srikar Dronamraju
  2011-11-23 16:22   ` Peter Zijlstra
                     ` (4 subsequent siblings)
  5 siblings, 2 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 16:09 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +static int __register_uprobe(struct inode *inode, loff_t offset,
> +                               struct uprobe *uprobe)
> +{
> +       struct list_head try_list;
> +       struct vm_area_struct *vma;
> +       struct address_space *mapping;
> +       struct vma_info *vi, *tmpvi;
> +       struct mm_struct *mm;
> +       loff_t vaddr;
> +       int ret = 0;
> +
> +       mapping = inode->i_mapping;
> +       INIT_LIST_HEAD(&try_list);
> +       while ((vi = find_next_vma_info(&try_list, offset,
> +                                               mapping, true)) != NULL) {
> +               if (IS_ERR(vi)) {
> +                       ret = -ENOMEM;
> +                       break;
> +               }
> +               mm = vi->mm;
> +               down_read(&mm->mmap_sem);
> +               vma = find_vma(mm, (unsigned long)vi->vaddr);
> +               if (!vma || !valid_vma(vma, true)) {
> +                       list_del(&vi->probe_list);
> +                       kfree(vi);
> +                       up_read(&mm->mmap_sem);
> +                       mmput(mm);
> +                       continue;
> +               }
> +               vaddr = vma->vm_start + offset;
> +               vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +               if (vma->vm_file->f_mapping->host != inode ||
> +                                               vaddr != vi->vaddr) {
> +                       list_del(&vi->probe_list);
> +                       kfree(vi);
> +                       up_read(&mm->mmap_sem);
> +                       mmput(mm);
> +                       continue;
> +               }
> +               ret = install_breakpoint(mm);
> +               up_read(&mm->mmap_sem);
> +               mmput(mm);
> +               if (ret && ret == -EEXIST)
> +                       ret = 0;
> +               if (!ret)
> +                       break;

Shouldn't that read:
		if (ret)
			break;

So that we bail when there's a real error instead of no error?

> +       }
> +       list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
> +               list_del(&vi->probe_list);
> +               kfree(vi);
> +       }
> +       return ret;
> +} 
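
To make the loop-exit question in this message concrete, here is a small
userspace sketch of the suggested control flow: treat -EEXIST as success
and stop only on a real error. The function install_on_all() and its array
of pre-computed results are hypothetical stand-ins for calling
install_breakpoint() on each mm; they are not part of the patch.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the registration loop. results[i] plays the
 * role of the value install_breakpoint() would return for the i-th mm.
 * *visited is set to the number of entries processed before returning.
 */
static int install_on_all(const int *results, size_t n, size_t *visited)
{
	int ret = 0;
	size_t i;

	*visited = 0;
	for (i = 0; i < n; i++) {
		ret = results[i];
		(*visited)++;

		if (ret == -EEXIST)	/* already installed: not an error */
			ret = 0;

		if (ret)		/* bail out on a real error only */
			break;
	}
	return ret;
}
```

With "if (!ret) break;" instead, the loop would stop after the first
successful install and keep going after a failure, which is the behaviour
the question above is pointing at.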

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-23 16:09   ` Peter Zijlstra
@ 2011-11-23 16:11     ` Peter Zijlstra
  2011-11-24 14:39     ` Srikar Dronamraju
  1 sibling, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 16:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Wed, 2011-11-23 at 17:09 +0100, Peter Zijlstra wrote:
> > +               if (IS_ERR(vi)) {
> > +                       ret = -ENOMEM;
> > +                       break;
> > +               } 

Also, might as well use:

			ret = PTR_ERR(vi);
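
For readers unfamiliar with the idiom, here is a userspace sketch of how
an error-pointer scheme lets the caller propagate the original errno
instead of hard-coding -ENOMEM. The three helpers mimic the kernel's
ERR_PTR()/IS_ERR()/PTR_ERR() but are re-implemented here as an
approximation; lookup() and use_lookup() are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace approximations of the kernel's error-pointer helpers. */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
	/* Error values occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical lookup: returns an encoded errno when code is non-zero. */
static void *lookup(int code)
{
	static int dummy;

	return code ? ERR_PTR(code) : (void *)&dummy;
}

/* Caller passes the encoded error through instead of guessing one. */
static int use_lookup(int code)
{
	void *p = lookup(code);

	if (IS_ERR(p))
		return (int)PTR_ERR(p);
	return 0;
}
```

The caller then reports whatever the callee actually failed with, so a
future non-ENOMEM failure is not silently mislabelled.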



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
  2011-11-23 16:09   ` Peter Zijlstra
@ 2011-11-23 16:22   ` Peter Zijlstra
  2011-11-23 16:27   ` Peter Zijlstra
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 16:22 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +int register_uprobe(struct inode *inode, loff_t offset,
> +                               struct uprobe_consumer *consumer)
> +{
> +       struct uprobe *uprobe;
> +       int ret = -EINVAL;
> +
> +       if (!consumer || consumer->next)
> +               return ret;
> +
> +       inode = igrab(inode);

So why are you dealing with !consumer but not with !inode? And why does
it make sense to allow !consumer at all?

> +       if (!inode)
> +               return ret;
> +
> +       if (offset > i_size_read(inode))
> +               goto reg_out;
> +
> +       ret = 0;
> +       mutex_lock(uprobes_hash(inode));
> +       uprobe = alloc_uprobe(inode, offset);
> +       if (uprobe && !add_consumer(uprobe, consumer)) {
> +               ret = __register_uprobe(inode, offset, uprobe);
> +               if (ret) {
> +                       uprobe->consumers = NULL;
> +                       __unregister_uprobe(inode, offset, uprobe);
> +               }
> +       }
> +
> +       mutex_unlock(uprobes_hash(inode));
> +       put_uprobe(uprobe);
> +
> +reg_out:
> +       iput(inode);
> +       return ret;
> +} 

So if this function returns an error the caller is responsible for
cleaning up consumer, otherwise we take responsibility.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
  2011-11-23 16:09   ` Peter Zijlstra
  2011-11-23 16:22   ` Peter Zijlstra
@ 2011-11-23 16:27   ` Peter Zijlstra
  2011-11-23 16:35   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 16:27 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +void unregister_uprobe(struct inode *inode, loff_t offset,
> +                               struct uprobe_consumer *consumer)
> +{
> +       struct uprobe *uprobe = NULL;
> +
> +       inode = igrab(inode);
> +       if (!inode || !consumer)
> +               goto unreg_out;

Why do you take a reference on the inode here? Surely inode is already
made stable by whoever calls us?

> +       uprobe = find_uprobe(inode, offset);
> +       if (!uprobe)
> +               goto unreg_out;
> +
> +       mutex_lock(uprobes_hash(inode));
> +       if (!del_consumer(uprobe, consumer)) {
> +               mutex_unlock(uprobes_hash(inode));
> +               goto unreg_out;
> +       }
> +
> +       if (!uprobe->consumers)
> +               __unregister_uprobe(inode, offset, uprobe);
> +
> +       mutex_unlock(uprobes_hash(inode));
> +
> +unreg_out:
> +       if (uprobe)
> +               put_uprobe(uprobe);
> +       if (inode)
> +               iput(inode);
> +} 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
                     ` (2 preceding siblings ...)
  2011-11-23 16:27   ` Peter Zijlstra
@ 2011-11-23 16:35   ` Peter Zijlstra
  2011-11-28 15:29   ` Peter Zijlstra
  2011-12-01 13:20   ` Peter Zijlstra
  5 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 16:35 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +#define UPROBES_HASH_SZ        13
> +/* serialize (un)register */
> +static struct mutex uprobes_mutex[UPROBES_HASH_SZ];
> +#define uprobes_hash(v)        (&uprobes_mutex[((unsigned long)(v)) %\
> +                                               UPROBES_HASH_SZ]) 

Was there any reason for using this hashing scheme, say over hash.h?
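
A rough userspace sketch of the two approaches being compared: picking a
mutex slot by taking the pointer value modulo the table size (as the
quoted macro does), versus a multiplicative hash that keeps the top bits
of a product. The function names, the 16-slot table, and the multiplier
(a commonly used 64-bit golden-ratio constant) are assumptions of this
sketch and are not taken from linux/hash.h.

```c
#include <stdint.h>

#define MUTEX_TABLE_MOD		13	/* size used by the quoted macro */
#define MUTEX_TABLE_BITS	4	/* assumed: 16 slots for the hashed variant */

/* Same idea as the quoted uprobes_hash(): pointer value modulo table size. */
static unsigned int slot_by_modulo(const void *p)
{
	return (unsigned int)((uintptr_t)p % MUTEX_TABLE_MOD);
}

/*
 * Multiplicative hash: multiply by a large odd constant and keep the top
 * MUTEX_TABLE_BITS bits. The constant is an assumption of this sketch.
 */
static unsigned int slot_by_multiply(const void *p)
{
	uint64_t h = (uint64_t)(uintptr_t)p * 0x9E3779B97F4A7C15ull;

	return (unsigned int)(h >> (64 - MUTEX_TABLE_BITS));
}
```

Either function maps an inode pointer to a slot index; the multiplicative
form needs a power-of-two table, while the modulo form works with any
size.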

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap Srikar Dronamraju
@ 2011-11-23 17:13   ` Peter Zijlstra
  2011-11-23 18:10   ` Peter Zijlstra
  2011-11-23 18:15   ` Peter Zijlstra
  2 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 17:13 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +int mmap_uprobe(struct vm_area_struct *vma)
> +{
> +       struct list_head tmp_list;
> +       struct uprobe *uprobe, *u;
> +       struct inode *inode;
> +       int ret = 0, count = 0;
> +
> +       if (!atomic_read(&uprobe_events) || !valid_vma(vma, true))
> +               return ret;     /* Bail-out */
> +
> +       inode = igrab(vma->vm_file->f_mapping->host);
> +       if (!inode)
> +               return ret; 

Since we hold mmap_sem, vma is pinned, which should pin the inode, why
take out another ref?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap Srikar Dronamraju
  2011-11-23 17:13   ` Peter Zijlstra
@ 2011-11-23 18:10   ` Peter Zijlstra
  2011-11-24 13:47     ` Srikar Dronamraju
  2011-11-23 18:15   ` Peter Zijlstra
  2 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 18:10 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +                       ret = install_breakpoint(vma->vm_mm, uprobe);
> +                       if (ret == -EEXIST) {
> +                               atomic_inc(&vma->vm_mm->mm_uprobes_count);
> +                               ret = 0;
> +                       } 

Aren't you double counting that probe position here? The one that raced
you to inserting it will also have incremented that counter, no?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap Srikar Dronamraju
  2011-11-23 17:13   ` Peter Zijlstra
  2011-11-23 18:10   ` Peter Zijlstra
@ 2011-11-23 18:15   ` Peter Zijlstra
  2011-11-23 19:50     ` Steven Rostedt
  2011-11-24 13:37     ` Srikar Dronamraju
  2 siblings, 2 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 18:15 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> @@ -545,8 +547,14 @@ again:                     remove_next = 1 + (end > next->vm_end);

I'm not sure if you use quilt or git to produce these patches but can
you either add:

QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"

to your .quiltrc, or:

[diff "default"]
                xfuncname = "^[[:alpha:]$_].*[^:]$"

to your .gitconfig ?



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 1/30] uprobes: Auxillary routines to insert, find, delete uprobes
  2011-11-18 11:06 ` [PATCH v7 3.2-rc2 1/30] uprobes: Auxillary routines to insert, find, delete uprobes Srikar Dronamraju
@ 2011-11-23 18:23   ` Peter Zijlstra
  0 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 18:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:36 +0530, Srikar Dronamraju wrote:
> +static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
> +{
> +       struct uprobe *uprobe, *cur_uprobe;
> +
> +       uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
> +       if (!uprobe)
> +               return NULL;
> +
> +       uprobe->inode = igrab(inode);
> +       uprobe->offset = offset;
> +
> +       /* add to uprobes_tree, sorted on inode:offset */
> +       cur_uprobe = insert_uprobe(uprobe);
> +
> +       /* a uprobe exists for this inode:offset combination */
> +       if (cur_uprobe) {
> +               kfree(uprobe);
> +               uprobe = cur_uprobe;
> +               iput(inode);
> +       }
> +       return uprobe;
> +} 

A function called alloc that actually publishes the object is weird.
Usually those things are separated. Alloc does the memory allocation and
sometimes initialization like things, but it never publishes the thing.

This leads to slightly weird code later on. It's not wrong, just weird
and makes reading this stuff slightly more challenging than needed.
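
One way to picture the separation described here is the following
userspace sketch, which splits allocation from publication and combines
them only in a wrapper. It uses a plain linked list in place of the
rb-tree and leaves out locking and reference counting; struct item and
the item_*() functions are hypothetical names, not code from the patch.

```c
#include <stdlib.h>

struct item {
	unsigned long key;
	struct item *next;
};

static struct item *item_list;	/* stand-in for the uprobes tree */

/* Allocation only: nothing is visible to anyone else yet. */
static struct item *item_alloc(unsigned long key)
{
	struct item *it = calloc(1, sizeof(*it));

	if (it)
		it->key = key;
	return it;
}

/*
 * Publication only: returns the existing entry with the same key,
 * or NULL if "new" was inserted.
 */
static struct item *item_insert(struct item *new)
{
	struct item *cur;

	for (cur = item_list; cur; cur = cur->next)
		if (cur->key == new->key)
			return cur;

	new->next = item_list;
	item_list = new;
	return NULL;
}

/* Wrapper whose name says what it does: allocate, then publish. */
static struct item *item_get_or_create(unsigned long key)
{
	struct item *new = item_alloc(key);
	struct item *cur;

	if (!new)
		return NULL;

	cur = item_insert(new);
	if (cur) {
		free(new);	/* lost the race: reuse the existing entry */
		return cur;
	}
	return new;
}
```

Callers that only need memory use item_alloc(); callers that want the
published object use item_get_or_create(), so the name matches the side
effect.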



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction Srikar Dronamraju
@ 2011-11-23 18:26   ` Peter Zijlstra
  2011-11-23 18:40   ` Peter Zijlstra
  2011-11-28 14:23   ` Peter Zijlstra
  2 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 18:26 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +       if (IS_ERR(page))
> +               return -ENOMEM; 

again, why not use PTR_ERR() if you're there already?

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction Srikar Dronamraju
  2011-11-23 18:26   ` Peter Zijlstra
@ 2011-11-23 18:40   ` Peter Zijlstra
  2011-11-23 19:49     ` Steven Rostedt
  2011-11-24 12:50     ` Srikar Dronamraju
  2011-11-28 14:23   ` Peter Zijlstra
  2 siblings, 2 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 18:40 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +               /* TODO : Analysis and verification of instruction */

As in refuse to set a breakpoint on an instruction we can't deal with?

Do we care? The worst case we'll crash the program, but if we're allowed
setting uprobes we already have enough privileges to do that anyway,
right?



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 19/30] tracing: modify is_delete, is_return from ints to bool.
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 19/30] tracing: modify is_delete, is_return from ints to bool Srikar Dronamraju
@ 2011-11-23 19:24   ` Steven Rostedt
  0 siblings, 0 replies; 106+ messages in thread
From: Steven Rostedt @ 2011-11-23 19:24 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Andi Kleen, Christoph Hellwig,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:40 +0530, Srikar Dronamraju wrote:
> is_delete and is_return can take at most 2 values and
> are better off being a boolean than an int.

I'm fine with this.

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve

> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 20/30] tracing: Extract out common code for kprobes/uprobes traceevents.
  2011-11-18 11:10 ` [PATCH v7 3.2-rc2 20/30] tracing: Extract out common code for kprobes/uprobes traceevents Srikar Dronamraju
@ 2011-11-23 19:32   ` Steven Rostedt
  2011-11-24 13:12     ` Srikar Dronamraju
  0 siblings, 1 reply; 106+ messages in thread
From: Steven Rostedt @ 2011-11-23 19:32 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Andi Kleen, Christoph Hellwig,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:40 +0530, Srikar Dronamraju wrote:
> --- /dev/null
> +++ b/kernel/trace/trace_probe.h
> @@ -0,0 +1,160 @@
> +/*
> + * Common header file for probe-based Dynamic events.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA
> + *
> + * Copyright (C) IBM Corporation, 2010
> + * Author:     Srikar Dronamraju
> + *
> + * Derived from kernel/trace/trace_kprobe.c written by

Shouldn't the above be:

 include/linux/trace_kprobe.h ?

Although, I would think both of these files are a bit more than derived
from. I would have been a bit stronger on the wording and say: This code
was copied from trace_kprobe.[ch] written by Masami ...

Then say,

Updates to make this generic:

 * Copyright (C) IBM Corporation, 2010
 * Author:     Srikar Dronamraju

-- Steve

> + * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
> + */
> + 




^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction.
  2011-11-23 18:40   ` Peter Zijlstra
@ 2011-11-23 19:49     ` Steven Rostedt
  2011-11-23 20:52       ` Peter Zijlstra
  2011-11-24 12:50     ` Srikar Dronamraju
  1 sibling, 1 reply; 106+ messages in thread
From: Steven Rostedt @ 2011-11-23 19:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Andi Kleen, Christoph Hellwig,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Wed, 2011-11-23 at 19:40 +0100, Peter Zijlstra wrote:
> On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > +               /* TODO : Analysis and verification of instruction */
> 
> As in refuse to set a breakpoint on an instruction we can't deal with?
> 
> Do we care? The worst case we'll crash the program, but if we're allowed
> setting uprobes we already have enough privileges to do that anyway,
> right?

Well, I wouldn't be happy if I was running a server, and needed to
analyze something it was doing, and because I screwed up the location of
my probe, I crash the server, made lots of people unhappy and lose my
job over it.

I think we do care, but it can be a TODO item.

-- Steve



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-23 18:15   ` Peter Zijlstra
@ 2011-11-23 19:50     ` Steven Rostedt
  2011-11-24 13:37     ` Srikar Dronamraju
  1 sibling, 0 replies; 106+ messages in thread
From: Steven Rostedt @ 2011-11-23 19:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Andi Kleen, Christoph Hellwig,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Wed, 2011-11-23 at 19:15 +0100, Peter Zijlstra wrote:
> On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > @@ -545,8 +547,14 @@ again:                     remove_next = 1 + (end > next->vm_end);
> 
> I'm not sure if you use quilt or git to produce these patches but can
> you either add:
> 
> QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
> 
> to your .quiltrc, or:
> 
> [diff "default"]
>                 xfuncname = "^[[:alpha:]$_].*[^:]$"
> 
> to your .gitconfig ?
> 

or just place a space in front of "again:"

/me runs!
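
To see what the quoted pattern accepts, here is a minimal sketch. It
assumes the pattern is evaluated as a POSIX extended regex (compiled
with regcomp and REG_EXTENDED); is_funcname_line is a hypothetical
helper name, not something from git or quilt.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Pattern quoted in the mail, treated here as a POSIX extended regex. */
#define FUNCNAME_PATTERN "^[[:alpha:]$_].*[^:]$"

/* Hypothetical helper: true if `line` matches FUNCNAME_PATTERN. */
bool is_funcname_line(const char *line)
{
	regex_t re;
	bool hit;
	int rc = regcomp(&re, FUNCNAME_PATTERN, REG_EXTENDED | REG_NOSUB);

	assert(rc == 0);	/* constant pattern: failure would be a bug */
	(void)rc;
	hit = regexec(&re, line, 0, NULL, 0) == 0;
	regfree(&re);
	return hit;
}
```

Under that assumption, a label alone on its line ("again:") is rejected
because it ends in ':', but a label that shares its line with a
statement ends in ';' and is still accepted. A leading space makes the
line fail the first character class.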

-- Steve



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction.
  2011-11-23 19:49     ` Steven Rostedt
@ 2011-11-23 20:52       ` Peter Zijlstra
  0 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-23 20:52 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Srikar Dronamraju, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Andi Kleen, Christoph Hellwig,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Wed, 2011-11-23 at 14:49 -0500, Steven Rostedt wrote:
> On Wed, 2011-11-23 at 19:40 +0100, Peter Zijlstra wrote:
> > On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > > +               /* TODO : Analysis and verification of instruction */
> > 
> > As in refuse to set a breakpoint on an instruction we can't deal with?
> > 
> > Do we care? The worst case we'll crash the program, but if we're allowed
> > setting uprobes we already have enough privileges to do that anyway,
> > right?
> 
> Well, I wouldn't be happy if I was running a server, and needed to
> analyze something it was doing, and because I screwed up the location of
> my probe, I crash the server, made lots of people unhappy and lose my
> job over it.
> 
> I think we do care, but it can be a TODO item.

But but but, why not let userspace sort it? And if you're going to
provide the kernel with inode:offset data yourself, you're already well
aware of wtf you're doing.



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction.
  2011-11-23 18:40   ` Peter Zijlstra
  2011-11-23 19:49     ` Steven Rostedt
@ 2011-11-24 12:50     ` Srikar Dronamraju
  1 sibling, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-24 12:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Wed, 23 Nov 2011 19:40:16 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > +               /* TODO : Analysis and verification of instruction
> > */
> 
> As in refuse to set a breakpoint on an instruction we can't deal with?
> 
> Do we care? The worst case we'll crash the program, but if we're
> allowed setting uprobes we already have enough privileges to do that
> anyway, right?
> 

I think we should and we do care. 
That's already implemented in the subsequent patches too.
For example: we don't trace a breakpoint instruction.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 20/30] tracing: Extract out common code for kprobes/uprobes traceevents.
  2011-11-23 19:32   ` Steven Rostedt
@ 2011-11-24 13:12     ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-24 13:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Peter Zijlstra, Linus Torvalds, Oleg Nesterov, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Andi Kleen, Christoph Hellwig,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

> > + *
> > + * Copyright (C) IBM Corporation, 2010
> > + * Author:     Srikar Dronamraju
> > + *
> > + * Derived from kernel/trace/trace_kprobe.c written by
> 
> Shouldn't the above be:
> 
>  include/linux/trace_kprobe.h ?
> 
> Although, I would think both of these files are a bit more than derived
> from. I would have been a bit stronger on the wording and say: This code
> was copied from trace_kprobe.[ch] written by Masami ...
> 
> Then say,
> 
> Updates to make this generic:
> 
>  * Copyright (C) IBM Corporation, 2010
>  * Author:     Srikar Dronamraju
> 
> -- Steve
> 

Okay, will do as suggested.
Thanks for reporting/suggesting.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-23 18:15   ` Peter Zijlstra
  2011-11-23 19:50     ` Steven Rostedt
@ 2011-11-24 13:37     ` Srikar Dronamraju
  2011-11-24 13:47       ` Peter Zijlstra
  1 sibling, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-24 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, tulasidhard

* Peter Zijlstra <peterz@infradead.org> [2011-11-23 19:15:49]:

> On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > @@ -545,8 +547,14 @@ again:                     remove_next = 1 + (end > next->vm_end);
> 
> I'm not sure if you use quilt or git to produce these patches but can
> you either add:
> 
> QUILT_DIFF_OPTS="-F ^[[:alpha:]\$_].*[^:]\$"
> 
> to your .quiltrc, or:
> 
> [diff "default"]
>                 xfuncname = "^[[:alpha:]$_].*[^:]$"

I use stgit.
You had suggested this to me earlier, and I have it in my ~/.gitconfig:


[diff "default"]
        xfuncname = "^[[:alpha:]$_].*[^:]$"


 stg version 
 Stacked GIT 0.15
 git version 1.7.1
 Python version 2.6.6 (r266:84292, Apr 11 2011, 15:50:32) 
 [GCC 4.4.4 20100726 (Red Hat 4.4.4-13)]

One thing that I might be doing differently is that I do a "stg export"
before using sendpatchset to mail the patches.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-24 13:37     ` Srikar Dronamraju
@ 2011-11-24 13:47       ` Peter Zijlstra
  0 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-24 13:47 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, tulasidhard

On Thu, 2011-11-24 at 19:07 +0530, Srikar Dronamraju wrote:
> I do a "stg export" before using sendpatchset to mail the patches.

hrm, weird stuff. I've no idea what stg does, but it looks like it's not
working right. Ah well, something to figure out on a rainy afternoon or
so.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-23 18:10   ` Peter Zijlstra
@ 2011-11-24 13:47     ` Srikar Dronamraju
  2011-11-24 14:13       ` Peter Zijlstra
  2011-11-28 14:59       ` Peter Zijlstra
  0 siblings, 2 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-24 13:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

* Peter Zijlstra <peterz@infradead.org> [2011-11-23 19:10:12]:

> On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > +                       ret = install_breakpoint(vma->vm_mm, uprobe);
> > +                       if (ret == -EEXIST) {
> > +                               atomic_inc(&vma->vm_mm->mm_uprobes_count);
> > +                               ret = 0;
> > +                       } 
> 
> Aren't you double counting that probe position here? The one that raced
> you to inserting it will also have incremented that counter, no?
> 

No, we aren't.
That is because register_uprobe can never race with mmap_uprobe and
register before mmap_uprobe registers. (Once we start mmap_region,
register_uprobe waits for the read_lock of mmap_sem.)

And we badly need this for the mmap_uprobe case, because when we do mremap
or vma_adjust(), we do a munmap_uprobe() followed by mmap_uprobe(). The
munmap_uprobe() would have decremented the count but not removed the
breakpoint, so when we do a mmap_uprobe, we need to increment the count.

-- 
Thanks and regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-24 13:47     ` Srikar Dronamraju
@ 2011-11-24 14:13       ` Peter Zijlstra
  2011-11-24 14:25         ` Srikar Dronamraju
  2011-11-28 14:59       ` Peter Zijlstra
  1 sibling, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-24 14:13 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

On Thu, 2011-11-24 at 19:17 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2011-11-23 19:10:12]:
> 
> > On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > > +                       ret = install_breakpoint(vma->vm_mm, uprobe);
> > > +                       if (ret == -EEXIST) {
> > > +                               atomic_inc(&vma->vm_mm->mm_uprobes_count);
> > > +                               ret = 0;
> > > +                       } 
> > 
> > Aren't you double counting that probe position here? The one that raced
> > you to inserting it will also have incremented that counter, no?
> > 
> 
> No we arent.
> Because register_uprobe can never race with mmap_uprobe and register
> before mmap_uprobe registers .(Once we start mmap_region,
> register_uprobe waits for the read_lock of mmap_sem.)

Still doesn't make any sense. Since you don't increment on success, one
has to assume install_breakpoint() will cause an increment. Therefore,
when we encounter -EEXIST we'll already have accounted for this
mm,inode,offset combination.

But I'll have another look at it, maybe I'm missing something
obvious :-)

> And we badly need this for mmap_uprobe case.  Because when we do mremap,
> or vma_adjust(), we do a munmap_uprobe() followed by mmap_uprobe() which
> would have decremented the count but not removed it. So when we do a
> mmap_uprobe, we need to increment the count. 

Well I see why the count needs to be correct, that's not the issue.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-24 14:13       ` Peter Zijlstra
@ 2011-11-24 14:25         ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-24 14:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

* Peter Zijlstra <peterz@infradead.org> [2011-11-24 15:13:37]:

> On Thu, 2011-11-24 at 19:17 +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2011-11-23 19:10:12]:
> > 
> > > On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > > > +                       ret = install_breakpoint(vma->vm_mm, uprobe);
> > > > +                       if (ret == -EEXIST) {
> > > > +                               atomic_inc(&vma->vm_mm->mm_uprobes_count);
> > > > +                               ret = 0;
> > > > +                       } 
> > > 
> > > Aren't you double counting that probe position here? The one that raced
> > > you to inserting it will also have incremented that counter, no?
> > > 
> > 
> > No we arent.
> > Because register_uprobe can never race with mmap_uprobe and register
> > before mmap_uprobe registers .(Once we start mmap_region,
> > register_uprobe waits for the read_lock of mmap_sem.)
> 
> Still doesn't make any sense. Since you don't increment on success, one
> has to assume install_breakpoint() will cause an increment. Therefore,
> when we encounter -EEXIST we'll already have accounted for this
> mm,inode,offset combination.
> 

In the success case, install_breakpoint itself does the increment.
We can't let install_breakpoint always increment in the -EEXIST case,
because doing that in register_uprobe context would also increment the
count, which is wrong.

> But I'll have another look at it, maybe I'm missing something
> obvious :-)
> 
> > And we badly need this for mmap_uprobe case.  Because when we do mremap,
> > or vma_adjust(), we do a munmap_uprobe() followed by mmap_uprobe() which
> > would have decremented the count but not removed it. So when we do a
> > mmap_uprobe, we need to increment the count. 
> 
> Well I see why the count needs to be correct, that's not the issue.

Okay .. 

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-23 16:09   ` Peter Zijlstra
  2011-11-23 16:11     ` Peter Zijlstra
@ 2011-11-24 14:39     ` Srikar Dronamraju
  1 sibling, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-24 14:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

> > +                       ret = 0;
> > +               if (!ret)
> > +                       break;
> 
> Shouldn't that read:
> 		if (ret)
> 			break;
> 
> So that we bail when there's a real error instead of no error?
> 

Right, will fix this and also do the PTR_ERR changes that you suggested.

-- 
thanks and regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement.
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement Srikar Dronamraju
@ 2011-11-25 14:29   ` Peter Zijlstra
  2011-11-25 14:54   ` Peter Zijlstra
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-25 14:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:38 +0530, Srikar Dronamraju wrote:
> +       /* poke the new insn in, ASSUMES we don't cross page boundary */
> +       vaddr &= ~PAGE_MASK;
> +       memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);

I still don't get why you don't simply write something like:

BUG_ON(vaddr + uprobe_opcode_sz >= PAGE_SIZE);

That's as descriptive as the comment and actually does something if
someone got it wrong, instead of silently corrupting crap.
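
A minimal userspace sketch of that check, assuming a 4096-byte page;
the SKETCH_* macros and both function names are hypothetical and only
mirror the arithmetic, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL			/* assumed page size */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

static_assert((SKETCH_PAGE_SIZE & (SKETCH_PAGE_SIZE - 1)) == 0,
	      "page size must be a power of two");

/* True if opcode_sz bytes starting at vaddr stay inside one page. */
bool opcode_fits_in_page(unsigned long vaddr, size_t opcode_sz)
{
	unsigned long off = vaddr & ~SKETCH_PAGE_MASK;	/* offset in page */

	return off + opcode_sz <= SKETCH_PAGE_SIZE;
}

/* The condition as written in the mail, which uses >=. */
bool mail_bug_on_would_fire(unsigned long vaddr, size_t opcode_sz)
{
	unsigned long off = vaddr & ~SKETCH_PAGE_MASK;

	return off + opcode_sz >= SKETCH_PAGE_SIZE;
}
```

Note that the >= form is stricter by one position: it also fires for an
opcode that ends exactly on the last byte of the page, which does not
actually cross the boundary.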


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement.
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement Srikar Dronamraju
  2011-11-25 14:29   ` Peter Zijlstra
@ 2011-11-25 14:54   ` Peter Zijlstra
  2011-11-26  2:25     ` Srikar Dronamraju
  2011-11-28 14:13   ` Peter Zijlstra
  2011-11-28 15:01   ` Peter Zijlstra
  3 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-25 14:54 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:38 +0530, Srikar Dronamraju wrote:
> +static int read_opcode(struct mm_struct *mm, unsigned long vaddr,
> +                                               uprobe_opcode_t *opcode)
> +{
> +       struct page *page;
> +       void *vaddr_new;
> +       int ret;
> +
> +       ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &page, NULL);
> +       if (ret <= 0)
> +               return ret;
> +
> +       lock_page(page);
> +       vaddr_new = kmap_atomic(page);
> +       vaddr &= ~PAGE_MASK;

BUG_ON(vaddr + uprobe_opcode_sz >= PAGE_SIZE);

> +       memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
> +       kunmap_atomic(vaddr_new);
> +       unlock_page(page);
> +       put_page(page);         /* we did a get_user_pages in the beginning */
> +       return 0;
> +} 


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 12/30] uprobes: Handle breakpoint and Singlestep
  2011-11-18 11:09 ` [PATCH v7 3.2-rc2 12/30] uprobes: Handle breakpoint and Singlestep Srikar Dronamraju
@ 2011-11-25 15:24   ` Peter Zijlstra
  2011-11-26  2:22     ` Srikar Dronamraju
  0 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-25 15:24 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:39 +0530, Srikar Dronamraju wrote:

> +       consumer = uprobe->consumers;
> +       for (consumer = uprobe->consumers; consumer;
> +                                       consumer = consumer->next) { 

that first expression seems redundant..


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 12/30] uprobes: Handle breakpoint and Singlestep
  2011-11-25 15:24   ` Peter Zijlstra
@ 2011-11-26  2:22     ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-26  2:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

* Peter Zijlstra <peterz@infradead.org> [2011-11-25 16:24:22]:

> On Fri, 2011-11-18 at 16:39 +0530, Srikar Dronamraju wrote:
> 
> > +       consumer = uprobe->consumers;
> > +       for (consumer = uprobe->consumers; consumer;
> > +                                       consumer = consumer->next) { 
> 
> that first expression seems redundant..

Yes,  will remove.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement.
  2011-11-25 14:54   ` Peter Zijlstra
@ 2011-11-26  2:25     ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-26  2:25 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

* Peter Zijlstra <peterz@infradead.org> [2011-11-25 15:54:46]:

> On Fri, 2011-11-18 at 16:38 +0530, Srikar Dronamraju wrote:
> > +static int read_opcode(struct mm_struct *mm, unsigned long vaddr,
> > +                                               uprobe_opcode_t *opcode)
> > +{
> > +       struct page *page;
> > +       void *vaddr_new;
> > +       int ret;
> > +
> > +       ret = get_user_pages(NULL, mm, vaddr, 1, 0, 0, &page, NULL);
> > +       if (ret <= 0)
> > +               return ret;
> > +
> > +       lock_page(page);
> > +       vaddr_new = kmap_atomic(page);
> > +       vaddr &= ~PAGE_MASK;
> 
> BUG_ON(vaddr + uprobe_opcode_sz >= PAGE_SIZE);
> 

Okay, will add BUG_ON.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement.
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement Srikar Dronamraju
  2011-11-25 14:29   ` Peter Zijlstra
  2011-11-25 14:54   ` Peter Zijlstra
@ 2011-11-28 14:13   ` Peter Zijlstra
  2011-11-29  7:49     ` Srikar Dronamraju
  2011-11-28 15:01   ` Peter Zijlstra
  3 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 14:13 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:38 +0530, Srikar Dronamraju wrote:
> +/**
> + * is_bkpt_insn - check if instruction is breakpoint instruction.
> + * @insn: instruction to be checked.
> + * Default implementation of is_bkpt_insn
> + * Returns true if @insn is a breakpoint instruction.
> + */
> +bool __weak is_bkpt_insn(u8 *insn)
> +{
> +       return (insn[0] == UPROBES_BKPT_INSN);
>  } 

This seems wrong, UPROBES_BKPT_INSN is basically defined to be of
uprobe_opcode_t type, not u8.

So:

bool __weak is_bkpt_insn(uprobe_opcode_t *insn)
{
	return *insn == UPROBES_BKPT_INSN;
}

seems like the right way to write this.
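
A small sketch of why the comparison width matters, assuming a
hypothetical architecture with a 4-byte opcode. sketch_opcode_t,
SKETCH_BKPT_INSN and its value are placeholders, not real encodings.
The byte-wide variant shows one way such a check can go wrong: it
accepts any instruction that merely shares its first byte with the
breakpoint.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t sketch_opcode_t;	/* hypothetical 4-byte opcode */
#define SKETCH_BKPT_INSN ((sketch_opcode_t)0xdeadbeefu)	/* placeholder */

static_assert(sizeof(sketch_opcode_t) > 1,
	      "sketch assumes a multi-byte opcode");

/* Compare at full opcode width, as suggested in the mail. */
bool sketch_is_bkpt_insn(const sketch_opcode_t *insn)
{
	return *insn == SKETCH_BKPT_INSN;
}

/* Byte-wide variant: only the first byte in memory is examined. */
bool sketch_is_bkpt_first_byte(const uint8_t *insn)
{
	const sketch_opcode_t bkpt = SKETCH_BKPT_INSN;
	const uint8_t *b = (const uint8_t *)&bkpt;

	return insn[0] == b[0];
}
```

On an architecture where the opcode type is a single byte, the two
checks coincide, which is presumably why the u8 form looks fine there.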


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction Srikar Dronamraju
  2011-11-23 18:26   ` Peter Zijlstra
  2011-11-23 18:40   ` Peter Zijlstra
@ 2011-11-28 14:23   ` Peter Zijlstra
  2 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 14:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +static int __copy_insn(struct address_space *mapping,
> +                       struct vm_area_struct *vma, char *insn,
> +                       unsigned long nbytes, unsigned long offset)
> +{
> +       struct file *filp = vma->vm_file;
> +       struct page *page;
> +       void *vaddr;
> +       unsigned long off1;
> +       unsigned long idx;
> +
> +       if (!filp)
> +               return -EINVAL;
> +
> +       idx = (unsigned long)(offset >> PAGE_CACHE_SHIFT);
> +       off1 = offset &= ~PAGE_MASK;
> +
> +       /*
> +        * Ensure that the page that has the original instruction is
> +        * populated and in page-cache.
> +        */
> +       page = read_mapping_page(mapping, idx, filp);
> +       if (IS_ERR(page))
> +               return -ENOMEM;
> +
> +       vaddr = kmap_atomic(page);
> +       memcpy(insn, vaddr + off1, nbytes);
> +       kunmap_atomic(vaddr);
> +       page_cache_release(page);
> +       return 0;
> +}
> +
> +static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma,
> +                                       unsigned long addr)
> +{
> +       struct address_space *mapping;
> +       int bytes;
> +       unsigned long nbytes;
> +
> +       addr &= ~PAGE_MASK;
> +       nbytes = PAGE_SIZE - addr;
> +       mapping = uprobe->inode->i_mapping;
> +
> +       /* Instruction at end of binary; copy only available bytes */
> +       if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size)
> +               bytes = uprobe->inode->i_size - uprobe->offset;
> +       else
> +               bytes = MAX_UINSN_BYTES;
> +
> +       /* Instruction at the page-boundary; copy bytes in second page */
> +       if (nbytes < bytes) {
> +               if (__copy_insn(mapping, vma, uprobe->insn + nbytes,
> +                               bytes - nbytes, uprobe->offset + nbytes))
> +                       return -ENOMEM;

You just lost your possible -EINVAL return value.

> +
> +               bytes = nbytes;
> +       }
> +       return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset);
> +} 
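
For illustration, a userspace sketch of the same two-part copy that
passes the helper's error code through instead of replacing it. It
assumes a tiny 16-byte "page" and a plain buffer standing in for the
page cache; sk_copy_chunk, sk_copy_insn and the -EIO stand-in are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SK_PAGE_SIZE 16UL	/* tiny page so the split is easy to see */

/* Stand-in for __copy_insn: copy nbytes at offset from a flat buffer. */
int sk_copy_chunk(const unsigned char *file, size_t file_sz,
		  unsigned char *dst, size_t nbytes, size_t offset)
{
	if (!file)
		return -EINVAL;		/* no backing file */
	if (offset > file_sz || nbytes > file_sz - offset)
		return -EIO;		/* stand-in for a failed page read */
	memcpy(dst, file + offset, nbytes);
	return 0;
}

/* Copy an instruction that may straddle a page boundary. */
int sk_copy_insn(const unsigned char *file, size_t file_sz,
		 unsigned char *insn, size_t bytes, size_t offset)
{
	size_t in_page = SK_PAGE_SIZE - (offset % SK_PAGE_SIZE);
	int ret;

	assert(insn != NULL);
	if (in_page < bytes) {
		/* Tail lives in the next page: copy it first. */
		ret = sk_copy_chunk(file, file_sz, insn + in_page,
				    bytes - in_page, offset + in_page);
		if (ret)
			return ret;	/* keep -EINVAL distinct from -EIO */
		bytes = in_page;
	}
	return sk_copy_chunk(file, file_sz, insn, bytes, offset);
}
```

Returning ret unchanged lets a caller tell "no file" apart from a read
failure, which is the information the hard-coded -ENOMEM discards.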

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-24 13:47     ` Srikar Dronamraju
  2011-11-24 14:13       ` Peter Zijlstra
@ 2011-11-28 14:59       ` Peter Zijlstra
  2011-11-29  8:33         ` Srikar Dronamraju
  1 sibling, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 14:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

On Thu, 2011-11-24 at 19:17 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2011-11-23 19:10:12]:
> 
> > On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > > +                       ret = install_breakpoint(vma->vm_mm, uprobe);
> > > +                       if (ret == -EEXIST) {
> > > +                               atomic_inc(&vma->vm_mm->mm_uprobes_count);
> > > +                               ret = 0;
> > > +                       } 
> > 
> > Aren't you double counting that probe position here? The one that raced
> > you to inserting it will also have incremented that counter, no?
> > 
> 
> No we arent.
> Because register_uprobe can never race with mmap_uprobe and register
> before mmap_uprobe registers .(Once we start mmap_region,
> register_uprobe waits for the read_lock of mmap_sem.)
> 
> And we badly need this for mmap_uprobe case.  Because when we do mremap,
> or vma_adjust(), we do a munmap_uprobe() followed by mmap_uprobe() which
> would have decremented the count but not removed it. So when we do a
> mmap_uprobe, we need to increment the count. 

Ok, so I didn't parse that properly last time around.. but it still
doesn't make sense, why would munmap_uprobe() decrement the count but
not uninstall the probe?

install_breakpoint() returning -EEXIST on two different conditions
doesn't help either.

So what I think you're doing is optimizing the unmap case: since the
memory is going to be thrown out, fixing up the instruction is a waste
of time. But this leads to the asymmetry observed above, and you fail to
mention it in either the changelog or a comment near that
-EEXIST branch in mmap_uprobe.
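
A toy model of that asymmetry as described in this thread. It is not
the kernel code: struct toy_mm and the toy_* functions are hypothetical
and track a single probe position only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_mm {
	bool bkpt_present;	/* breakpoint bytes are in the text page */
	int uprobes_count;	/* stand-in for mm_uprobes_count */
};

/* Installs the breakpoint and counts it; -EEXIST if already there. */
int toy_install_breakpoint(struct toy_mm *mm)
{
	if (mm->bkpt_present)
		return -EEXIST;
	mm->bkpt_present = true;
	mm->uprobes_count++;
	return 0;
}

/* Unmap path: drop the count but leave the instruction untouched. */
void toy_munmap_uprobe(struct toy_mm *mm)
{
	assert(mm->uprobes_count > 0);
	mm->uprobes_count--;
}

/* Map path: an existing breakpoint still has to be counted again. */
int toy_mmap_uprobe(struct toy_mm *mm)
{
	int ret = toy_install_breakpoint(mm);

	if (ret == -EEXIST) {
		mm->uprobes_count++;
		ret = 0;
	}
	return ret;
}
```

In this model a munmap followed by an mmap of the same position ends
with the count back at 1, which only works because the -EEXIST branch
re-increments it.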

Worse, you don't explain how the other -EEXIST (!consumers) thing
interacts here, and I just gave up trying to figure that out since it
made my head hurt.

Also, your whole series of patches is still utter crap, the splitup
doesn't work at all, I need to constantly search back and forth between
patches in order to figure out wtf is happening, and your changelogs
only seem to add confusion if anything at all.

Also, you seem to have stuck a whole bunch of random patches at the end
that fix various things without folding them back in to make the series
saner/smaller.

I've now reverted to simply applying all patches and reading the end
result and using git-blame to figure out what patch something came
from :-(


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement.
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement Srikar Dronamraju
                     ` (2 preceding siblings ...)
  2011-11-28 14:13   ` Peter Zijlstra
@ 2011-11-28 15:01   ` Peter Zijlstra
  3 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 15:01 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:38 +0530, Srikar Dronamraju wrote:
> 
> Provides Background page replacement by
>  - cow the page that needs replacement.
>  - modify a copy of the cowed page.
>  - replace the cow page with the modified page
>  - flush the page tables.
> 
> Also provides additional routines to read an opcode from a given virtual
> address and for verifying if an instruction is a breakpoint instruction.

You again/still lost the reason why we duplicate bits of mm/ksm.c here.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
                     ` (3 preceding siblings ...)
  2011-11-23 16:35   ` Peter Zijlstra
@ 2011-11-28 15:29   ` Peter Zijlstra
  2011-11-29  7:48     ` Srikar Dronamraju
  2011-12-01 13:20   ` Peter Zijlstra
  5 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 15:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +static void __unregister_uprobe(struct inode *inode, loff_t offset,
> +                                               struct uprobe *uprobe)
> +{
> +       struct list_head try_list;
> +       struct address_space *mapping;
> +       struct vma_info *vi, *tmpvi;
> +       struct vm_area_struct *vma;
> +       struct mm_struct *mm;
> +       loff_t vaddr;
> +
> +       mapping = inode->i_mapping;
> +       INIT_LIST_HEAD(&try_list);
> +       while ((vi = find_next_vma_info(&try_list, offset,
> +                                               mapping, false)) != NULL) {
> +               if (IS_ERR(vi))
> +                       break;

So what kind of half-assed state are we left in if we try an unregister
under memory pressure and how do we deal with that?

> +               mm = vi->mm;
> +               down_read(&mm->mmap_sem);
> +               vma = find_vma(mm, (unsigned long)vi->vaddr);
> +               if (!vma || !valid_vma(vma, false)) {
> +                       list_del(&vi->probe_list);
> +                       kfree(vi);
> +                       up_read(&mm->mmap_sem);
> +                       mmput(mm);
> +                       continue;
> +               }
> +               vaddr = vma->vm_start + offset;
> +               vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +               if (vma->vm_file->f_mapping->host != inode ||
> +                                               vaddr != vi->vaddr) {
> +                       list_del(&vi->probe_list);
> +                       kfree(vi);
> +                       up_read(&mm->mmap_sem);
> +                       mmput(mm);
> +                       continue;
> +               }
> +               remove_breakpoint(mm);
> +               up_read(&mm->mmap_sem);
> +               mmput(mm);
> +       }
> +
> +       list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
> +               list_del(&vi->probe_list);
> +               kfree(vi);
> +       }
> +       delete_uprobe(uprobe);
> +} 


* [PATCH RFC 0/5] uprobes: kill xol vma
  2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
                   ` (30 preceding siblings ...)
  2011-11-22  5:03 ` [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
@ 2011-11-28 19:06 ` Oleg Nesterov
  2011-11-28 19:06   ` [PATCH 1/5] uprobes: kill pre_ssout(), introduce set_xol_ip() Oleg Nesterov
                     ` (7 more replies)
  31 siblings, 8 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-28 19:06 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

Hello.

On top of this series, not for inclusion yet, just to explain what
I mean. Maybe someone can test it ;)

This series kills xol_vma. Instead we use the per_cpu-like xol slots.

This is much simpler and more efficient. And of course this solves
many problems we currently have with xol_vma.

For example, we simply cannot trust it. We do not know what we are
actually going to execute in UTASK_SSTEP mode. An application can unmap
this area and then do mmap(PROT_EXEC|PROT_WRITE, MAP_FIXED) to fool
uprobes.
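
To make the trust problem concrete, here is a minimal user-space sketch
(not part of the series) of a process swapping out a mapping at a fixed
address. The helper names map_marked_page() and replace_mapping() are
hypothetical, and the page is mapped read/write rather than
PROT_EXEC|PROT_WRITE so the sketch also runs on systems that refuse
writable-executable mappings.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for a kernel-installed xol page: map one
 * anonymous region and fill it with a marker byte.
 * Returns NULL if the mapping fails.
 */
static void *map_marked_page(size_t len, unsigned char marker)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	memset(p, marker, len);
	return p;
}

/*
 * Drop whatever is mapped at addr and map fresh memory at exactly the
 * same address with contents chosen by the caller.
 * Returns 0 on success, -1 on failure.
 */
static int replace_mapping(void *addr, size_t len, unsigned char marker)
{
	void *p;

	if (munmap(addr, len))
		return -1;

	p = mmap(addr, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (p != addr)		/* also covers MAP_FAILED */
		return -1;

	memset(p, marker, len);
	return 0;
}
```

After replace_mapping() succeeds, the same virtual address holds bytes
the process chose itself, which is all that is needed to change what a
vma-backed xol area contains.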

The only disadvantage is that this adds a bit more arch-dependent
code.

The main question: can this work? I know very little about this area.
And I am not sure whether this can be ported to other architectures.

Please comment.

Oleg.

 arch/x86/include/asm/fixmap.h      |    9 +
 arch/x86/include/asm/thread_info.h |    4 
 arch/x86/kernel/process.c          |    6 
 arch/x86/kernel/uprobes.c          |   26 +++-
 include/linux/mm_types.h           |    1 
 include/linux/uprobes.h            |   27 ----
 kernel/fork.c                      |    2 
 kernel/uprobes.c                   |  239 +++----------------------------------
 8 files changed, 71 insertions(+), 243 deletions(-)



* [PATCH 1/5] uprobes: kill pre_ssout(), introduce set_xol_ip()
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
@ 2011-11-28 19:06   ` Oleg Nesterov
  2011-11-28 19:06   ` [PATCH 2/5] uprobes: introduce uprobe_switch_to() Oleg Nesterov
                     ` (6 subsequent siblings)
  7 siblings, 0 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-28 19:06 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

No functional changes, preparation.

- Do not change regs->ip in pre_xol().

- Kill pre_ssout(), move its code into the single caller.

- Add the new __weak helper, set_xol_ip(). Currently it simply does
  regs->ip = utask->xol_vaddr.

- Change uprobe_notify_resume() to do set_xol_ip() after pre_xol().

IOW, before this patch uprobe_notify_resume() does:

	utask->state = UTASK_SSTEP;
	pre_ssout:
		xol_get_insn_slot();
		pre_xol();		// <----- sets regs->ip
	user_enable_single_step(current);

after:

	xol_get_insn_slot();
	pre_xol();		// <------ doesn't change regs->ip
	user_enable_single_step(current);
	utask->state = UTASK_SSTEP;
	set_xol_ip();		// <----- sets regs->ip
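
The "after" ordering can be sketched in user space as follows. This is
only an illustration under assumptions: the sketch_* types and
sketch_prepare_sstep() are made up, the helper takes the utask as an
explicit argument instead of using current, and
__attribute__((weak)) is the GCC/Clang spelling of what the kernel
calls __weak.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct sketch_regs {
	unsigned long ip;
};

enum sketch_utask_state {
	SKETCH_UTASK_RUNNING,
	SKETCH_UTASK_SSTEP,
};

struct sketch_utask {
	unsigned long xol_vaddr;
	enum sketch_utask_state state;
};

/*
 * Generic default: point the instruction pointer at the xol slot.
 * Being weak, it can be replaced by a strong definition elsewhere.
 */
__attribute__((weak))
void sketch_set_xol_ip(struct sketch_regs *regs, struct sketch_utask *utask)
{
	regs->ip = utask->xol_vaddr;
}

/*
 * Mirrors the "after" sequence: the pre_xol() step no longer touches
 * ip, and ip is only redirected once the state is UTASK_SSTEP.
 * pre_xol_ret models the return value of pre_xol().
 */
int sketch_prepare_sstep(struct sketch_regs *regs,
			 struct sketch_utask *utask, int pre_xol_ret)
{
	assert(regs != NULL && utask != NULL);

	if (pre_xol_ret)
		return -1;	/* cleanup path: ip and state untouched */

	utask->state = SKETCH_UTASK_SSTEP;
	sketch_set_xol_ip(regs, utask);
	return 0;
}
```

The point of the split is that the failure path leaves regs->ip alone,
so the later patches can change where ip is pointed without touching
pre_xol().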

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/kernel/uprobes.c |    2 --
 include/linux/uprobes.h   |    1 +
 kernel/uprobes.c          |   27 +++++++++++++--------------
 3 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 40f9f75..cd086be 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -429,7 +429,6 @@ int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 	tskinfo->saved_trap_no = current->thread.trap_no;
 	current->thread.trap_no = UPROBE_TRAP_NO;
 
-	regs->ip = current->utask->xol_vaddr;
 	if (uprobe->fixups & UPROBES_FIX_RIP_AX) {
 		tskinfo->saved_scratch_register = regs->ax;
 		regs->ax = current->utask->vaddr;
@@ -449,7 +448,6 @@ int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 	tskinfo->saved_trap_no = current->thread.trap_no;
 	current->thread.trap_no = UPROBE_TRAP_NO;
 
-	regs->ip = current->utask->xol_vaddr;
 	return 0;
 }
 #endif
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 20bdd0a..c9ff67a 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -140,6 +140,7 @@ extern int uprobe_bkpt_notifier(struct pt_regs *regs);
 extern void uprobe_notify_resume(struct pt_regs *regs);
 extern bool uprobe_deny_signal(void);
 extern bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u);
+extern void __weak set_xol_ip(struct pt_regs *regs);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 2493191..b596432 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1311,15 +1311,6 @@ static struct uprobe_task *add_utask(void)
 	return utask;
 }
 
-/* Prepare to single-step probed instruction out of line. */
-static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
-				unsigned long vaddr)
-{
-	if (xol_get_insn_slot(uprobe, vaddr) && !pre_xol(uprobe, regs))
-		return 0;
-	return -EFAULT;
-}
-
 bool uprobe_deny_signal(void)
 {
 	struct task_struct *tsk = current;
@@ -1351,6 +1342,11 @@ bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u)
 	return false;
 }
 
+void __weak set_xol_ip(struct pt_regs *regs)
+{
+	set_instruction_pointer(regs, current->utask->xol_vaddr);
+}
+
 /*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
@@ -1396,12 +1392,15 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		if (u->flags & UPROBES_SKIP_SSTEP && can_skip_xol(regs, u))
 			goto cleanup_ret;
 
-		utask->state = UTASK_SSTEP;
-		if (!pre_ssout(u, regs, probept))
-			user_enable_single_step(current);
-		else
-			/* Cannot Singlestep; re-execute the instruction. */
+		if (!xol_get_insn_slot(u, probept))
+			goto cleanup_ret;
+
+		if (pre_xol(u, regs))
 			goto cleanup_ret;
+
+		user_enable_single_step(current);
+		utask->state = UTASK_SSTEP;
+		set_xol_ip(regs);
 	} else {
 		u = utask->active_uprobe;
 		if (utask->state == UTASK_SSTEP_ACK)
-- 
1.5.5.1




* [PATCH 2/5] uprobes: introduce uprobe_switch_to()
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
  2011-11-28 19:06   ` [PATCH 1/5] uprobes: kill pre_ssout(), introduce set_xol_ip() Oleg Nesterov
@ 2011-11-28 19:06   ` Oleg Nesterov
  2011-11-28 19:53     ` Peter Zijlstra
  2011-11-28 19:07   ` [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS] Oleg Nesterov
                     ` (5 subsequent siblings)
  7 siblings, 1 reply; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-28 19:06 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

Introduce uprobe_switch_to(). It is called by the switch_to() paths if
the "next" task is going to execute the xol insn.

Currently we use TIF_SINGLESTEP (added to _TIF_WORK_CTXSW_NEXT) to
detect this case in __switch_to_xtra() and call uprobe_switch_to();
maybe we can add another flag.

uprobe_switch_to() verifies that this task is actually UTASK_SSTEP
and X86_EFLAGS_TF is set.

Finally uprobe_switch_to() does set_xol_ip(). Currently this is not
needed, but this means that set_xol_ip() is called every time the
UTASK_SSTEP task migrates to another CPU.

To ensure set_xol_ip() can't race with itself we add preempt_disable()
to the other caller, uprobe_notify_resume().

Note! This patch assumes we can trust X86_EFLAGS_TF. I mean, AFAIU,
even if the single-stepped insn races with an irq/exception, this flag
will be cleared if and only if the instruction was already executed.
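
The two checks described above can be modelled in user space as a
single predicate. This is a sketch, not the patch itself: the sketch_*
names are hypothetical, the task's saved flags are stored directly in
the task structure, and only the trap-flag value (bit 8 of EFLAGS,
0x100) is taken from the x86 architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_EFLAGS_TF	0x100UL	/* x86 trap flag, bit 8 */

enum sketch_utask_state {
	SKETCH_UTASK_RUNNING,
	SKETCH_UTASK_SSTEP,
};

/* Hypothetical stand-ins for struct uprobe_task and the task itself. */
struct sketch_utask {
	enum sketch_utask_state state;
};

struct sketch_task {
	struct sketch_utask *utask;	/* NULL if the task never hit a probe */
	unsigned long flags;		/* saved user-mode EFLAGS */
};

/*
 * Returns true only when the incoming task is in UTASK_SSTEP and the
 * trap flag is still set, i.e. the xol insn has not been executed yet.
 */
bool sketch_needs_xol_ip_reset(const struct sketch_task *next)
{
	assert(next != NULL);

	if (!next->utask || next->utask->state != SKETCH_UTASK_SSTEP)
		return false;

	if (!(next->flags & SKETCH_EFLAGS_TF))
		return false;

	return true;
}
```

Only when both conditions hold would the switch path go on to call
set_xol_ip() for the incoming task.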

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/thread_info.h |    4 ++++
 arch/x86/kernel/process.c          |    6 ++++++
 arch/x86/kernel/uprobes.c          |   14 ++++++++++++++
 include/linux/uprobes.h            |    1 +
 kernel/uprobes.c                   |    2 ++
 5 files changed, 27 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index aeb3e04..af711a1 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -150,7 +150,11 @@ struct thread_info {
 	(_TIF_IO_BITMAP|_TIF_NOTSC|_TIF_BLOCKSTEP)
 
 #define _TIF_WORK_CTXSW_PREV (_TIF_WORK_CTXSW|_TIF_USER_RETURN_NOTIFY)
+#ifdef CONFIG_UPROBES
+#define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW|_TIF_DEBUG|_TIF_SINGLESTEP)
+#else
 #define _TIF_WORK_CTXSW_NEXT (_TIF_WORK_CTXSW|_TIF_DEBUG)
+#endif
 
 #define PREEMPT_ACTIVE		0x10000000
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index b9b3b1a..233bf20 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -229,6 +229,12 @@ void __switch_to_xtra(struct task_struct *prev_p, struct task_struct *next_p,
 		 */
 		memset(tss->io_bitmap, 0xff, prev->io_bitmap_max);
 	}
+
+#ifdef CONFIG_UPROBES
+	if (test_tsk_thread_flag(next_p, TIF_SINGLESTEP))
+		uprobe_switch_to(next_p);
+#endif
+
 	propagate_user_return_notify(prev_p, next_p);
 }
 
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index cd086be..4140137 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -452,6 +452,20 @@ int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 }
 #endif
 
+void uprobe_switch_to(struct task_struct *curr)
+{
+	struct uprobe_task *utask = curr->utask;
+	struct pt_regs *regs = task_pt_regs(curr);
+
+	if (!utask || utask->state != UTASK_SSTEP)
+		return;
+
+	if (!(regs->flags & X86_EFLAGS_TF))
+		return;
+
+	set_xol_ip(regs);
+}
+
 /*
  * Called by post_xol() to adjust the return address pushed by a call
  * instruction executed out of line.
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index c9ff67a..d590d66 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -141,6 +141,7 @@ extern void uprobe_notify_resume(struct pt_regs *regs);
 extern bool uprobe_deny_signal(void);
 extern bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u);
 extern void __weak set_xol_ip(struct pt_regs *regs);
+extern void uprobe_switch_to(struct task_struct *);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index b596432..9c509dc 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1399,8 +1399,10 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			goto cleanup_ret;
 
 		user_enable_single_step(current);
+		preempt_disable();
 		utask->state = UTASK_SSTEP;
 		set_xol_ip(regs);
+		preempt_enable();
 	} else {
 		u = utask->active_uprobe;
 		if (utask->state == UTASK_SSTEP_ACK)
-- 
1.5.5.1




* [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS]
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
  2011-11-28 19:06   ` [PATCH 1/5] uprobes: kill pre_ssout(), introduce set_xol_ip() Oleg Nesterov
  2011-11-28 19:06   ` [PATCH 2/5] uprobes: introduce uprobe_switch_to() Oleg Nesterov
@ 2011-11-28 19:07   ` Oleg Nesterov
  2011-11-28 19:48     ` Peter Zijlstra
  2011-11-29 18:24     ` Oleg Nesterov
  2011-11-28 19:07   ` [PATCH 4/5] uprobes: teach set_xol_ip() to use uprobe_xol_slots[] Oleg Nesterov
                     ` (4 subsequent siblings)
  7 siblings, 2 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-28 19:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

This patch adds the uprobe_xol_slots[UPROBES_XOL_SLOT_BYTES][NR_CPUS] array.
Each CPU has its own slot for xol (used in the next patch).

We "export" this data to user space via __set_fixmap(PAGE_KERNEL_VSYSCALL).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/fixmap.h |    9 +++++++++
 arch/x86/kernel/uprobes.c     |   10 ++++++++++
 include/linux/uprobes.h       |    1 +
 kernel/uprobes.c              |    4 ++++
 4 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 460c74e..a902e19 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -81,6 +81,15 @@ enum fixed_addresses {
 	VVAR_PAGE,
 	VSYSCALL_HPET,
 #endif
+
+#ifdef CONFIG_UPROBES
+	#define UPROBES_XOL_SLOT_BYTES  128
+
+	UPROBE_XOL_LAST_PAGE,
+	UPROBE_XOL_FIRST_PAGE = UPROBE_XOL_LAST_PAGE
+			      + NR_CPUS * UPROBES_XOL_SLOT_BYTES / PAGE_SIZE,
+#endif
+
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 4140137..ebb280c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -664,3 +664,13 @@ bool can_skip_xol(struct pt_regs *regs, struct uprobe *u)
 	u->flags &= ~UPROBES_SKIP_SSTEP;
 	return false;
 }
+
+void __init map_uprobe_xol_slots(void *pages)
+{
+	int idx = UPROBE_XOL_FIRST_PAGE;
+
+	do {
+		__set_fixmap(idx, __pa(pages), PAGE_KERNEL_VSYSCALL);
+		pages += PAGE_SIZE;
+	} while (idx-- != UPROBE_XOL_LAST_PAGE);
+}
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index d590d66..bb59a66 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -142,6 +142,7 @@ extern bool uprobe_deny_signal(void);
 extern bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u);
 extern void __weak set_xol_ip(struct pt_regs *regs);
 extern void uprobe_switch_to(struct task_struct *);
+extern void map_uprobe_xol_slots(void *);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 9c509dc..20007da 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1342,6 +1342,9 @@ bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u)
 	return false;
 }
 
+static unsigned char
+uprobe_xol_slots[UPROBES_XOL_SLOT_BYTES][NR_CPUS] __page_aligned_bss;
+
 void __weak set_xol_ip(struct pt_regs *regs)
 {
 	set_instruction_pointer(regs, current->utask->xol_vaddr);
@@ -1490,6 +1493,7 @@ static int __init init_uprobes(void)
 		mutex_init(&uprobes_mmap_mutex[i]);
 	}
 	init_bulkref(&uprobes_srcu);
+	map_uprobe_xol_slots(uprobe_xol_slots);
 	return register_die_notifier(&uprobe_exception_nb);
 }
 
-- 
1.5.5.1




* [PATCH 4/5] uprobes: teach set_xol_ip() to use uprobe_xol_slots[]
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
                     ` (2 preceding siblings ...)
  2011-11-28 19:07   ` [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS] Oleg Nesterov
@ 2011-11-28 19:07   ` Oleg Nesterov
  2011-11-28 19:07   ` [PATCH 5/5] uprobes: remove the uprobes_xol_area code Oleg Nesterov
                     ` (3 subsequent siblings)
  7 siblings, 0 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-28 19:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

Change set_xol_ip() to use the uprobe_xol_slots[] per-cpu array to
"allocate" the insn slot. We do not care if the task migrates to
another CPU before executing the xol insn; set_xol_ip() will be called
again by uprobe_switch_to(). Likewise, we do not care if the task
is simply preempted or sleeps.

IOW, uprobe_xol_slots[CPU] is "owned" by cpu_curr(CPU).

This makes xol_get_insn_slot/xol_free_insn_slot unnecessary, but
uprobe_notify_resume() must now set utask->vaddr itself. To simplify
the review, the patch updates the callers but doesn't remove this code.
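
A user-space model of the per-cpu slot idea, for illustration only.
Assumptions: the sketch_* names are made up, the CPU number is passed
in explicitly instead of coming from smp_processor_id(), the slot is
addressed by a plain pointer rather than a fixmap address, and the
values 4 CPUs and 16 instruction bytes are picked for the example.
Note the sketch declares the array CPU-major so that
sketch_xol_slots[cpu] is one whole slot.

```c
#include <assert.h>
#include <string.h>

#define SKETCH_NR_CPUS		4	/* assumed for the example */
#define SKETCH_SLOT_BYTES	128	/* UPROBES_XOL_SLOT_BYTES in 3/5 */
#define SKETCH_MAX_UINSN_BYTES	16	/* assumed instruction copy size */

/* One slot per CPU; whoever runs on a CPU owns that CPU's slot. */
static unsigned char sketch_xol_slots[SKETCH_NR_CPUS][SKETCH_SLOT_BYTES];

/*
 * "Allocate" the slot of the given CPU by overwriting it with the
 * probed instruction, and return the address to single-step from.
 * Calling it again after a migration simply claims the new CPU's slot.
 */
unsigned char *sketch_claim_slot(int cpu, const unsigned char *insn)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	assert(insn != NULL);

	memcpy(sketch_xol_slots[cpu], insn, SKETCH_MAX_UINSN_BYTES);
	return sketch_xol_slots[cpu];
}
```

There is no free step: a slot is never released, it is just
overwritten by the next task that needs it on that CPU, which is why
the bitmap and wait queue become unnecessary.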

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 kernel/uprobes.c |   16 ++++++++++------
 1 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 20007da..c9e2f65 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1285,7 +1285,6 @@ void free_uprobe_utask(struct task_struct *tsk)
 	if (utask->active_uprobe)
 		put_uprobe(utask->active_uprobe);
 
-	xol_free_insn_slot(tsk);
 	kfree(utask);
 	tsk->utask = NULL;
 }
@@ -1347,7 +1346,15 @@ uprobe_xol_slots[UPROBES_XOL_SLOT_BYTES][NR_CPUS] __page_aligned_bss;
 
 void __weak set_xol_ip(struct pt_regs *regs)
 {
-	set_instruction_pointer(regs, current->utask->xol_vaddr);
+	int cpu = smp_processor_id();
+	struct uprobe_task *utask = current->utask;
+	struct uprobe *uprobe = utask->active_uprobe;
+
+	memcpy(uprobe_xol_slots[cpu], uprobe->insn, MAX_UINSN_BYTES);
+
+	utask->xol_vaddr = fix_to_virt(UPROBE_XOL_FIRST_PAGE)
+				+ UPROBES_XOL_SLOT_BYTES * cpu;
+	set_instruction_pointer(regs, utask->xol_vaddr);
 }
 
 /*
@@ -1390,14 +1397,12 @@ void uprobe_notify_resume(struct pt_regs *regs)
 				goto cleanup_ret;
 		}
 		utask->active_uprobe = u;
+		utask->vaddr = probept;
 		handler_chain(u, regs);
 
 		if (u->flags & UPROBES_SKIP_SSTEP && can_skip_xol(regs, u))
 			goto cleanup_ret;
 
-		if (!xol_get_insn_slot(u, probept))
-			goto cleanup_ret;
-
 		if (pre_xol(u, regs))
 			goto cleanup_ret;
 
@@ -1419,7 +1424,6 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		utask->active_uprobe = NULL;
 		utask->state = UTASK_RUNNING;
 		user_disable_single_step(current);
-		xol_free_insn_slot(current);
 
 		spin_lock_irq(&current->sighand->siglock);
 		recalc_sigpending(); /* see uprobe_deny_signal() */
-- 
1.5.5.1




* [PATCH 5/5] uprobes: remove the uprobes_xol_area code
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
                     ` (3 preceding siblings ...)
  2011-11-28 19:07   ` [PATCH 4/5] uprobes: teach set_xol_ip() to use uprobe_xol_slots[] Oleg Nesterov
@ 2011-11-28 19:07   ` Oleg Nesterov
  2011-11-28 19:57   ` [PATCH RFC 0/5] uprobes: kill xol vma Peter Zijlstra
                     ` (2 subsequent siblings)
  7 siblings, 0 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-28 19:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

Remove the no longer needed uprobes_xol_area code.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 include/linux/mm_types.h |    1 -
 include/linux/uprobes.h  |   24 ------
 kernel/fork.c            |    2 -
 kernel/uprobes.c         |  198 ----------------------------------------------
 4 files changed, 0 insertions(+), 225 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2595c9c..b3f1ece 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -392,7 +392,6 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_UPROBES
 	atomic_t mm_uprobes_count;
-	struct uprobes_xol_area *uprobes_xol_area;
 #endif
 };
 
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bb59a66..4f92272 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -100,26 +100,6 @@ struct uprobe_task {
 	struct uprobe *active_uprobe;
 };
 
-/*
- * On a breakpoint hit, thread contests for a slot.  It free the
- * slot after singlestep.  Only definite number of slots are
- * allocated.
- */
-
-struct uprobes_xol_area {
-	wait_queue_head_t wq;	/* if all slots are busy */
-	atomic_t slot_count;	/* currently in use slots */
-	unsigned long *bitmap;	/* 0 = free slot */
-	struct page *page;
-
-	/*
-	 * We keep the vma's vm_start rather than a pointer to the vma
-	 * itself.  The probed process or a naughty kernel module could make
-	 * the vma go away, and we must handle that reasonably gracefully.
-	 */
-	unsigned long vaddr;		/* Page(s) of instruction slots */
-};
-
 #ifdef CONFIG_UPROBES
 extern int __weak set_bkpt(struct mm_struct *mm, struct uprobe *uprobe,
 							unsigned long vaddr);
@@ -131,7 +111,6 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void free_uprobe_utask(struct task_struct *tsk);
-extern void free_uprobes_xol_area(struct mm_struct *mm);
 extern int mmap_uprobe(struct vm_area_struct *vma);
 extern void munmap_uprobe(struct vm_area_struct *vma);
 extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
@@ -174,8 +153,5 @@ static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
 static inline void free_uprobe_utask(struct task_struct *tsk)
 {
 }
-static inline void free_uprobes_xol_area(struct mm_struct *mm)
-{
-}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 166ee1b..a6b1757 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -553,7 +553,6 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
-		free_uprobes_xol_area(mm);
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
@@ -742,7 +741,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #endif
 #ifdef CONFIG_UPROBES
 	atomic_set(&mm->mm_uprobes_count, 0);
-	mm->uprobes_xol_area = NULL;
 #endif
 
 	if (!mm_init(mm, tsk))
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index c9e2f65..aaab607 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -33,9 +33,6 @@
 #include <linux/kdebug.h>	/* notifier mechanism */
 #include <linux/uprobes.h>
 
-#define UINSNS_PER_PAGE	(PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
-#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
-
 static bulkref_t uprobes_srcu;
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize rbtree access */
@@ -1062,201 +1059,6 @@ void munmap_uprobe(struct vm_area_struct *vma)
 	return;
 }
 
-/* Slot allocation for XOL */
-static int xol_add_vma(struct uprobes_xol_area *area)
-{
-	struct mm_struct *mm;
-	int ret;
-
-	area->page = alloc_page(GFP_HIGHUSER);
-	if (!area->page)
-		return -ENOMEM;
-
-	mm = current->mm;
-	down_write(&mm->mmap_sem);
-	ret = -EALREADY;
-	if (mm->uprobes_xol_area)
-		goto fail;
-
-	ret = -ENOMEM;
-
-	/* Try to map as high as possible, this is only a hint. */
-	area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
-							PAGE_SIZE, 0, 0);
-	if (area->vaddr & ~PAGE_MASK) {
-		ret = area->vaddr;
-		goto fail;
-	}
-
-	ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
-				VM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO,
-				&area->page);
-	if (ret)
-		goto fail;
-
-	smp_wmb();	/* pairs with get_uprobes_xol_area() */
-	mm->uprobes_xol_area = area;
-	ret = 0;
-
-fail:
-	up_write(&mm->mmap_sem);
-	if (ret)
-		__free_page(area->page);
-
-	return ret;
-}
-
-static struct uprobes_xol_area *get_uprobes_xol_area(struct mm_struct *mm)
-{
-	struct uprobes_xol_area *area = mm->uprobes_xol_area;
-	smp_read_barrier_depends();/* pairs with wmb in xol_add_vma() */
-	return area;
-}
-
-/*
- * xol_alloc_area - Allocate process's uprobes_xol_area.
- * This area will be used for storing instructions for execution out of
- * line.
- *
- * Returns the allocated area or NULL.
- */
-static struct uprobes_xol_area *xol_alloc_area(void)
-{
-	struct uprobes_xol_area *area;
-
-	area = kzalloc(sizeof(*area), GFP_KERNEL);
-	if (unlikely(!area))
-		return NULL;
-
-	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
-								GFP_KERNEL);
-
-	if (!area->bitmap)
-		goto fail;
-
-	init_waitqueue_head(&area->wq);
-	if (!xol_add_vma(area))
-		return area;
-
-fail:
-	kfree(area->bitmap);
-	kfree(area);
-	return get_uprobes_xol_area(current->mm);
-}
-
-/*
- * free_uprobes_xol_area - Free the area allocated for slots.
- */
-void free_uprobes_xol_area(struct mm_struct *mm)
-{
-	struct uprobes_xol_area *area = mm->uprobes_xol_area;
-
-	if (!area)
-		return;
-
-	put_page(area->page);
-	kfree(area->bitmap);
-	kfree(area);
-}
-
-/*
- *  - search for a free slot.
- */
-static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
-{
-	unsigned long slot_addr;
-	int slot_nr;
-
-	do {
-		slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
-		if (slot_nr < UINSNS_PER_PAGE) {
-			if (!test_and_set_bit(slot_nr, area->bitmap))
-				break;
-
-			slot_nr = UINSNS_PER_PAGE;
-			continue;
-		}
-		wait_event(area->wq,
-			(atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
-	} while (slot_nr >= UINSNS_PER_PAGE);
-
-	slot_addr = area->vaddr + (slot_nr * UPROBES_XOL_SLOT_BYTES);
-	atomic_inc(&area->slot_count);
-	return slot_addr;
-}
-
-/*
- * xol_get_insn_slot - If was not allocated a slot, then
- * allocate a slot.
- * Returns the allocated slot address or 0.
- */
-static unsigned long xol_get_insn_slot(struct uprobe *uprobe,
-					unsigned long slot_addr)
-{
-	struct uprobes_xol_area *area;
-	unsigned long offset;
-	void *vaddr;
-
-	area = get_uprobes_xol_area(current->mm);
-	if (!area) {
-		area = xol_alloc_area();
-		if (!area)
-			return 0;
-	}
-	current->utask->xol_vaddr = xol_take_insn_slot(area);
-
-	/*
-	 * Initialize the slot if xol_vaddr points to valid
-	 * instruction slot.
-	 */
-	if (unlikely(!current->utask->xol_vaddr))
-		return 0;
-
-	current->utask->vaddr = slot_addr;
-	offset = current->utask->xol_vaddr & ~PAGE_MASK;
-	vaddr = kmap_atomic(area->page);
-	memcpy(vaddr + offset, uprobe->insn, MAX_UINSN_BYTES);
-	kunmap_atomic(vaddr);
-	return current->utask->xol_vaddr;
-}
-
-/*
- * xol_free_insn_slot - If slot was earlier allocated by
- * @xol_get_insn_slot(), make the slot available for
- * subsequent requests.
- */
-static void xol_free_insn_slot(struct task_struct *tsk)
-{
-	struct uprobes_xol_area *area;
-	unsigned long vma_end;
-	unsigned long slot_addr;
-
-	if (!tsk->mm || !tsk->mm->uprobes_xol_area || !tsk->utask)
-		return;
-
-	slot_addr = tsk->utask->xol_vaddr;
-
-	if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
-		return;
-
-	area = tsk->mm->uprobes_xol_area;
-	vma_end = area->vaddr + PAGE_SIZE;
-	if (area->vaddr <= slot_addr && slot_addr < vma_end) {
-		int slot_nr;
-		unsigned long offset = slot_addr - area->vaddr;
-
-		slot_nr = offset / UPROBES_XOL_SLOT_BYTES;
-		if (slot_nr >= UINSNS_PER_PAGE)
-			return;
-
-		clear_bit(slot_nr, area->bitmap);
-		atomic_dec(&area->slot_count);
-		if (waitqueue_active(&area->wq))
-			wake_up(&area->wq);
-		tsk->utask->xol_vaddr = 0;
-	}
-}
-
 /**
  * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
  * @regs: Reflects the saved state of the task after it has hit a breakpoint
-- 
1.5.5.1




* Re: [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS]
  2011-11-28 19:07   ` [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS] Oleg Nesterov
@ 2011-11-28 19:48     ` Peter Zijlstra
  2011-11-28 19:52       ` Peter Zijlstra
  2011-11-29 18:24     ` Oleg Nesterov
  1 sibling, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 19:48 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Mon, 2011-11-28 at 20:07 +0100, Oleg Nesterov wrote:
> +       UPROBE_XOL_FIRST_PAGE = UPROBE_XOL_LAST_PAGE
> +                             + NR_CPUS * UPROBES_XOL_SLOT_BYTES / PAGE_SIZE, 

I think that wants to be: 
	+ DIV_ROUND_UP(NR_CPUS * UPROBES_XOL_SLOT_BYTES, PAGE_SIZE);

otherwise you'll end up with 0 pages for UP and the sort.


* Re: [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS]
  2011-11-28 19:48     ` Peter Zijlstra
@ 2011-11-28 19:52       ` Peter Zijlstra
  0 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 19:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Mon, 2011-11-28 at 20:48 +0100, Peter Zijlstra wrote:
> On Mon, 2011-11-28 at 20:07 +0100, Oleg Nesterov wrote:
> > +       UPROBE_XOL_FIRST_PAGE = UPROBE_XOL_LAST_PAGE
> > +                             + NR_CPUS * UPROBES_XOL_SLOT_BYTES / PAGE_SIZE, 
> 
> I think that wants to be: 
> 	+ DIV_ROUND_UP(NR_CPUS * UPROBES_XOL_SLOT_BYTES, PAGE_SIZE);
> 
> otherwise you'll end up with 0 pages for UP and the sort.

Ah, no I see, you'll already have the one LAST_PAGE thing.
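
To check the arithmetic: the enum range as posted spans
UPROBE_XOL_LAST_PAGE itself plus NR_CPUS * UPROBES_XOL_SLOT_BYTES /
PAGE_SIZE further indices. A small sketch, assuming PAGE_SIZE is 4096
(the helper and macro names are made up):

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE	4096UL	/* assumed page size */
#define SKETCH_SLOT_BYTES	128UL	/* UPROBES_XOL_SLOT_BYTES */
#define SKETCH_DIV_ROUND_UP(n, d)	(((n) + (d) - 1) / (d))

/*
 * Fixmap pages covered by LAST_PAGE..FIRST_PAGE as posted:
 * LAST_PAGE itself plus the truncating division.
 */
unsigned long sketch_pages_reserved(unsigned long nr_cpus)
{
	assert(nr_cpus > 0);
	return 1 + nr_cpus * SKETCH_SLOT_BYTES / SKETCH_PAGE_SIZE;
}

/* Pages actually needed to hold one slot per CPU. */
unsigned long sketch_pages_needed(unsigned long nr_cpus)
{
	assert(nr_cpus > 0);
	return SKETCH_DIV_ROUND_UP(nr_cpus * SKETCH_SLOT_BYTES,
				   SKETCH_PAGE_SIZE);
}
```

For NR_CPUS = 1 both give one page. For NR_CPUS = 32 the posted formula
reserves two pages where one would do, so it never reserves too few
pages but can reserve one extra when the slots exactly fill whole
pages.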


* Re: [PATCH 2/5] uprobes: introduce uprobe_switch_to()
  2011-11-28 19:06   ` [PATCH 2/5] uprobes: introduce uprobe_switch_to() Oleg Nesterov
@ 2011-11-28 19:53     ` Peter Zijlstra
  2011-11-29 17:18       ` Oleg Nesterov
  0 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 19:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Mon, 2011-11-28 at 20:06 +0100, Oleg Nesterov wrote:
> +void uprobe_switch_to(struct task_struct *curr)
> +{
> +       struct uprobe_task *utask = curr->utask;
> +       struct pt_regs *regs = task_pt_regs(curr);
> +
> +       if (!utask || utask->state != UTASK_SSTEP)
> +               return;
> +
> +       if (!(regs->flags & X86_EFLAGS_TF))
> +               return;
> +
> +       set_xol_ip(regs);
> +} 

> void __weak set_xol_ip(struct pt_regs *regs)
>  {
> +       int cpu = smp_processor_id();
> +       struct uprobe_task *utask = current->utask;
> +       struct uprobe *uprobe = utask->active_uprobe;
> +
> +       memcpy(uprobe_xol_slots[cpu], uprobe->insn, MAX_UINSN_BYTES);
> +
> +       utask->xol_vaddr = fix_to_virt(UPROBE_XOL_FIRST_PAGE)
> +                               + UPROBES_XOL_SLOT_BYTES * cpu;
> +       set_instruction_pointer(regs, utask->xol_vaddr);
>  }

So uprobe_switch_to() will always reset the IP to the start of the slot?
That sounds wrong, things like the RIP relative stuff needs multiple
instructions.
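
For reference, the slot address arithmetic in the quoted set_xol_ip()
can be sketched in user space as below. This is only an illustration:
xol_base stands in for fix_to_virt(UPROBE_XOL_FIRST_PAGE), and
xol_slot_offset() is a hypothetical helper that does not exist in the
patch.

```c
#include <stdint.h>

#define UPROBES_XOL_SLOT_BYTES 128u  /* value used in the patch */

/* Start of the slot owned by a given CPU. */
static uintptr_t xol_slot_vaddr(uintptr_t xol_base, unsigned int cpu)
{
	return xol_base + (uintptr_t)UPROBES_XOL_SLOT_BYTES * cpu;
}

/* Offset of an instruction pointer inside its slot; 0 means slot start. */
static uintptr_t xol_slot_offset(uintptr_t xol_base, uintptr_t ip)
{
	return (ip - xol_base) % UPROBES_XOL_SLOT_BYTES;
}
```

After set_xol_ip() the offset is always 0, which is what the question
above is about: any progress a task had made inside its slot before
being switched out is not carried over.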



^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH RFC 0/5] uprobes: kill xol vma
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
                     ` (4 preceding siblings ...)
  2011-11-28 19:07   ` [PATCH 5/5] uprobes: remove the uprobes_xol_area code Oleg Nesterov
@ 2011-11-28 19:57   ` Peter Zijlstra
  2011-11-29 10:30   ` Srikar Dronamraju
  2011-12-12 17:30   ` Oleg Nesterov
  7 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-28 19:57 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Mon, 2011-11-28 at 20:06 +0100, Oleg Nesterov wrote:
> 
> On top of this series, not for inclusion yet, just to explain what
> I mean. May be someone can test it ;)
> 
> This series kills xol_vma. Instead we use the per_cpu-like xol slots.
> 
> This is much more simple and efficient. And this of course solves
> many problems we currently have with xol_vma.
> 
> For example, we simply can not trust it. We do not know what actually
> we are going to execute in UTASK_SSTEP mode. An application can unmap
> this area and then do mmap(PROT_EXEC|PROT_WRITE, MAP_FIXED) to fool
> uprobes.
> 
> The only disadvantage is that this adds a bit more arch-dependant
> code.
> 
> The main question, can this work? I know very little in this area.
> And I am not sure if this can be ported to other architectures.

I very much like this approach! I think the provided implementation
might have some issues, but yeah, using fixmaps and a __switch_to_xtra
hook to provide per task slots seems very nice indeed!

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-28 15:29   ` Peter Zijlstra
@ 2011-11-29  7:48     ` Srikar Dronamraju
  2011-11-29 10:52       ` Peter Zijlstra
  0 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-29  7:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

* Peter Zijlstra <peterz@infradead.org> [2011-11-28 16:29:54]:

> On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > +static void __unregister_uprobe(struct inode *inode, loff_t offset,
> > +                                               struct uprobe *uprobe)
> > +{
> > +       struct list_head try_list;
> > +       struct address_space *mapping;
> > +       struct vma_info *vi, *tmpvi;
> > +       struct vm_area_struct *vma;
> > +       struct mm_struct *mm;
> > +       loff_t vaddr;
> > +
> > +       mapping = inode->i_mapping;
> > +       INIT_LIST_HEAD(&try_list);
> > +       while ((vi = find_next_vma_info(&try_list, offset,
> > +                                               mapping, false)) != NULL) {
> > +               if (IS_ERR(vi))
> > +                       break;
> 
> So what kind of half-assed state are we left in if we try an unregister
> under memory pressure and how do we deal with that?
> 

Agreed. I had this concern too and wanted to see if there are ways to
deal with it.

- One approach would be to pass extra GFP flags when we do allocations,
  at least in unregister_uprobe.

Drawback of this approach: if the system is already under memory
pressure, we shouldn't exert more pressure by asking it to repeat the
allocation.

- The other approach would be to cache these temporary objects while we
  insert probes, i.e. keep this metadata around.

I am sure you wouldn't want to add additional metadata.

- Third approach would be to have a completion/worker routine kick in if
  unregister_uprobe fails due to memory allocations.

This looks better than the rest.

Do you have any other approaches that we could try?

-- 
thanks and regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement.
  2011-11-28 14:13   ` Peter Zijlstra
@ 2011-11-29  7:49     ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-29  7:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

* Peter Zijlstra <peterz@infradead.org> [2011-11-28 15:13:29]:

> On Fri, 2011-11-18 at 16:38 +0530, Srikar Dronamraju wrote:
> > +/**
> > + * is_bkpt_insn - check if instruction is breakpoint instruction.
> > + * @insn: instruction to be checked.
> > + * Default implementation of is_bkpt_insn
> > + * Returns true if @insn is a breakpoint instruction.
> > + */
> > +bool __weak is_bkpt_insn(u8 *insn)
> > +{
> > +       return (insn[0] == UPROBES_BKPT_INSN);
> >  } 
> 
> This seems wrong, UPROBES_BKPT_INSN basically defined to be of
> uprobe_opcode_t type, not u8.
> 
> So:
> 
> bool __weak is_bkpt_insn(uprobe_opcode_t *insn)
> {
> 	return *insn == UPROBE_BKPT_INSN;
> }
> 
> seems like the right way to write this.
> 

Agree, will fix this. 
Thanks for bringing this up.
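
For illustration, a user-space sketch of why the element type matters.
It assumes a hypothetical architecture whose breakpoint is a 4-byte
opcode; the typedef and the value 0x7fe00008 are assumptions made for
this example, not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t uprobe_opcode_t;        /* assumed 4-byte opcode */
#define UPROBES_BKPT_INSN 0x7fe00008u    /* assumed breakpoint encoding */

/* Byte-wise check as in the posted patch: only looks at the first byte. */
static bool is_bkpt_insn_u8(const uint8_t *insn)
{
	return insn[0] == UPROBES_BKPT_INSN;
}

/* Opcode-wide check as suggested in the review. */
static bool is_bkpt_insn_opcode(const uprobe_opcode_t *insn)
{
	return *insn == UPROBES_BKPT_INSN;
}

/* Returns true only if both checks recognise the same breakpoint. */
static bool checks_agree_on_bkpt(void)
{
	uprobe_opcode_t op = UPROBES_BKPT_INSN;
	uint8_t buf[sizeof(op)];

	memcpy(buf, &op, sizeof(op));
	return is_bkpt_insn_u8(buf) && is_bkpt_insn_opcode(&op);
}
```

A single byte can never equal a value wider than 8 bits, so the u8
version reports "not a breakpoint" here while the opcode-wide version
matches. With a one-byte breakpoint opcode the two versions behave the
same.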

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-28 14:59       ` Peter Zijlstra
@ 2011-11-29  8:33         ` Srikar Dronamraju
  2011-11-29 11:48           ` Peter Zijlstra
  2011-11-30  5:30           ` Srikar Dronamraju
  0 siblings, 2 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-29  8:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

> > > > +                       ret = install_breakpoint(vma->vm_mm, uprobe);
> > > > +                       if (ret == -EEXIST) {
> > > > +                               atomic_inc(&vma->vm_mm->mm_uprobes_count);
> > > > +                               ret = 0;
> > > > +                       } 
> > > 
> > > Aren't you double counting that probe position here? The one that raced
> > > you to inserting it will also have incremented that counter, no?
> > > 
> > 
> > No we arent.
> > Because register_uprobe can never race with mmap_uprobe and register
> > before mmap_uprobe registers .(Once we start mmap_region,
> > register_uprobe waits for the read_lock of mmap_sem.)
> > 
> > And we badly need this for mmap_uprobe case.  Because when we do mremap,
> > or vma_adjust(), we do a munmap_uprobe() followed by mmap_uprobe() which
> > would have decremented the count but not removed it. So when we do a
> > mmap_uprobe, we need to increment the count. 
> 
> Ok, so I didn't parse that properly last time around.. but it still
> doesn't make sense, why would munmap_uprobe() decrement the count but
> not uninstall the probe?
> 
> install_breakpoint() returning -EEXIST on two different conditions
> doesn't help either.
> 
> So what I think you're doing is that you're optimizing the unmap case
> since the memory is going to be thrown out fixing up the instruction is
> a waste of time, but this leads to the asymmetry observed above. But you

Yes, we are optimizing the unmap case, because we expect the memory to
be thrown out.

> fail to mention this in both the changelog or a comment near that
> -EEXIST branch in mmap_uprobe.
> 
> Worse, you don't explain how the other -EEXIST (!consumers) thing
> interacts here, and I just gave up trying to figure that out since it
> made my head hurt.
> 

install_breakpoint() cannot see !consumers when called from
register_uprobe(). (unregister_uprobe(), which removes the consumer,
cannot race with register_uprobe().)

Now let's consider mmap_uprobe() being called from vma_adjust(): the
preceding munmap_uprobe() has already decremented the count but left the
breakpoint intact.

If consumers is NULL, unregister_uprobe() has already kicked in, so
there is no point in inserting the probe; hence we return -EEXIST. The
following unregister_uprobe() (or the munmap_uprobe() which might race
before unregister_uprobe()) is also going to decrement the count. So we
have a case where the same breakpoint is accounted as removed twice. To
offset this, we pretend that the breakpoint is still around by
incrementing the count.

Would it help if I add an extra check in mmap_uprobe?

int mmap_uprobe(...) {
....
	       ret = install_breakpoint(vma->vm_mm, uprobe);
	       if (ret == -EEXIST) {
			if (!read_opcode(vma->vm_mm, vaddr, &opcode) &&
					(opcode == UPROBES_BKPT_INSN))
			       atomic_inc(&vma->vm_mm->mm_uprobes_count);
		       ret = 0;
	       } 
....
}


The extra read_opcode check will tell us if the breakpoint is still
around, and only then increment the count. (That is, it will distinguish
whether the mmap_uprobe is from vma_adjust.)

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH RFC 0/5] uprobes: kill xol vma
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
                     ` (5 preceding siblings ...)
  2011-11-28 19:57   ` [PATCH RFC 0/5] uprobes: kill xol vma Peter Zijlstra
@ 2011-11-29 10:30   ` Srikar Dronamraju
  2011-11-29 18:26     ` Oleg Nesterov
  2011-12-12 17:30   ` Oleg Nesterov
  7 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-29 10:30 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

> 
> On top of this series, not for inclusion yet, just to explain what
> I mean. May be someone can test it ;)
> 
> This series kills xol_vma. Instead we use the per_cpu-like xol slots.
> 
> This is much more simple and efficient. And this of course solves
> many problems we currently have with xol_vma.
> 
> For example, we simply can not trust it. We do not know what actually
> we are going to execute in UTASK_SSTEP mode. An application can unmap
> this area and then do mmap(PROT_EXEC|PROT_WRITE, MAP_FIXED) to fool
> uprobes.
> 
> The only disadvantage is that this adds a bit more arch-dependant
> code.
> 
> The main question, can this work? I know very little in this area.
> And I am not sure if this can be ported to other architectures.

Nice idea. I think this will help us implement boosted uprobes if we
tweak it a bit (i.e. having a jump after the actual instruction that gets
us back to the actual instruction stream). The current method of
first-come-first-served slot reservation doesn't work for boosting
because we have to clear the slot in the post-processing.

I will apply your patches and test and let you know how it goes. (in a day
or two).

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-29  7:48     ` Srikar Dronamraju
@ 2011-11-29 10:52       ` Peter Zijlstra
  2011-12-01 13:41         ` Srikar Dronamraju
  0 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-29 10:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Tue, 2011-11-29 at 13:18 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2011-11-28 16:29:54]:
> 
> > On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> > > +static void __unregister_uprobe(struct inode *inode, loff_t offset,
> > > +                                               struct uprobe *uprobe)
> > > +{
> > > +       struct list_head try_list;
> > > +       struct address_space *mapping;
> > > +       struct vma_info *vi, *tmpvi;
> > > +       struct vm_area_struct *vma;
> > > +       struct mm_struct *mm;
> > > +       loff_t vaddr;
> > > +
> > > +       mapping = inode->i_mapping;
> > > +       INIT_LIST_HEAD(&try_list);
> > > +       while ((vi = find_next_vma_info(&try_list, offset,
> > > +                                               mapping, false)) != NULL) {
> > > +               if (IS_ERR(vi))
> > > +                       break;
> > 
> > So what kind of half-assed state are we left in if we try an unregister
> > under memory pressure and how do we deal with that?
> > 
> 
> Agree, Even I had this concern and wanted to see if there are ways to
> deal with this.

If you do have this, please mention it in the Changelog and/or put /*
XXX */ in the code or so to point it out that there's a problem here.

> Do you have any other approaches that we could try?

You could use the stuff from patch 29 to effectively disable the uprobe
and return -ENOMEM to whoever is unregistering. Basically failing the
unreg.

That way you can leave the uprobe in existence and half installed but
functionally fully disabled. Userspace (assuming we go back that far)
can then either re-try the removal later, or even reinstate it by doing
a register again or so.

It's still not pretty, but it's better than pretending the unreg
completed.
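
A minimal user-space sketch of that idea, under the assumption that
"disabled" and "registered" can be tracked as two separate flags. The
struct, the function and the alloc_ok parameter are all hypothetical
and only model the control flow, not the real uprobes code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-probe state. */
struct probe_state {
	bool enabled;     /* handlers run only while this is true */
	bool registered;  /* probe still exists and may be half installed */
};

/*
 * Model of an unregister that can fail. alloc_ok simulates whether the
 * temporary allocations needed to walk the mappings succeeded.
 */
static int try_unregister(struct probe_state *p, bool alloc_ok)
{
	p->enabled = false;        /* probe stops firing in either case */
	if (!alloc_ok)
		return -ENOMEM;    /* still registered; caller may retry */
	p->registered = false;
	return 0;
}
```

The caller sees the failure and can retry later, while the probe stays
functionally off in the meantime.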

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-29  8:33         ` Srikar Dronamraju
@ 2011-11-29 11:48           ` Peter Zijlstra
  2011-11-29 15:05             ` Peter Zijlstra
  2011-11-29 16:22             ` Srikar Dronamraju
  2011-11-30  5:30           ` Srikar Dronamraju
  1 sibling, 2 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-29 11:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

On Tue, 2011-11-29 at 14:03 +0530, Srikar Dronamraju wrote:


> install_breakpoints cannot have !consumers to be true when called from
> register_uprobe. (Since unregister_uprobe() which does the removal of
> consumer cannot race with register_uprobe().)

Right, that's the easy case ;-)

> Now lets consider mmap_uprobe() being called from vma_adjust(), the
> preceding munmap_uprobe() has already decremented the count but left the
> breakpoint intact.
> 
> if consumers is NULL, unregister_uprobes() has kicked already in, so
> there is no point in inserting the probe, Hence we return EEXIST. The
> following unregister_uprobe() (or the munmap_uprobe() which might race
> before unregister_uprobe) is also going to decrement the count.  So we
> have a case where the same breakpoint is accounted as removed twice. To
> offset this, we pretend as if the breakpoint is around by incrementing
> the count.

There's 2 main cases, 
	A) vma_adjust() vs unregister_uprobe() and 
	B) mmap() vs unregister_uprobe().

The result of A should be -1 reference in total, since we're removing
the one probe. The result of B should be 0 since we're removing the
probe and we shouldn't be installing new ones.

A1)
	vma_adjust()
	  munmap_uprobe()
				unregister_uprobe()
	  mmap_uprobe()
				  delete_uprobe()


	munmap will do -1, mmap will do +1, __unregister_uprobe() which is
serialized against vma_adjust() will do -1 on either the old or new vma,
resulting in a grand total of: -1+1-1=-1, OK

A2) breakpoint is in old, not in new, again two cases:

A2a) __unregister_uprobe() sees old

	munmap -1, __unregister_uprobe -1, mmap 0: -2 FAIL

A2b) __unregister_uprobe() sees new

	munmap -1, __unregister_uprobe 0, mmap 0: -1 OK

A3) breakpoint is in new, not in old, again two cases:

A3a) __unregister_uprobe() sees old

	munmap 0, __unregister_uprobe 0, mmap: 1: 1 FAIL

A3b) __unregister_uprobe() sees new

	munmap 0, __unregister_uprobe -1, mmap: 1: 0 FAIL

B1)
				unregister_uprobe()
	mmap()
	  mmap_uprobe()
				  __unregister_uprobe()
				  delete_uprobe()

	mmap +1, __unregister_uprobe() -1: 0 OK

B2)
				unregister_uprobe()
	mmap()
				  __unregister_uprobe()
	  mmap_uprobe()
				  delete_uprobe()

	mmap +1, __unregister_uprobe() 0: +1 FAIL


> Would it help if I add an extra check in mmap_uprobe?
> 
> int mmap_uprobe(...) {
> ....
> 	       ret = install_breakpoint(vma->vm_mm, uprobe);
> 	       if (ret == -EEXIST) {
> 			if (!read_opcode(vma->vm_mm, vaddr, &opcode) &&
> 					(opcode == UPROBES_BKPT_INSN))
> 			       atomic_inc(&vma->vm_mm->mm_uprobes_count);
> 		       ret = 0;
> 	       } 
> ....
> }

> The extra read_opcode check will tell us if the breakpoint is still
> around and then only increment the count. (As in it will distinguish if
> the mmap_uprobe is from vm_adjust).

No, I don't see that fixing A2a for example.

Could be I confused myself above, but like said, this stuff hurt brain.

It might just be easiest not to optimize munmap and leave fancy stuff
for later.

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-29 11:48           ` Peter Zijlstra
@ 2011-11-29 15:05             ` Peter Zijlstra
  2011-11-30  5:50               ` Srikar Dronamraju
  2011-11-29 16:22             ` Srikar Dronamraju
  1 sibling, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-29 15:05 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

On Tue, 2011-11-29 at 12:48 +0100, Peter Zijlstra wrote:
> There's 2 main cases, 
>         A) vma_adjust() vs unregister_uprobe() and 
>         B) mmap() vs unregister_uprobe().
> 
> The result of A should be -1 reference in total, since we're removing
> the one probe.

This might not be correct for A[23], please double check.

>  The result of B should be 0 since we're removing the
> probe and we shouldn't be installing new ones.
> 
> A1)
>         vma_adjust()
>           munmap_uprobe()
>                                 unregister_uprobe()
>           mmap_uprobe()
>                                   delete_uprobe()
> 
> 
>         munmap will do -1, mmap will do +1, __unregister_uprobe() which is
> serialized against vma_adjust() will do -1 on either the old or new vma,
> resulting in a grand total of: -1+1-1=-1, OK
> 
> A2) breakpoint is in old, not in new, again two cases:
> 
> A2a) __unregister_uprobe() sees old
> 
>         munmap -1, __unregister_uprobe -1, mmap 0: -2 FAIL
> 
> A2b) __unregister_uprobe() sees new
> 
>         munmap -1, __unregister_uprobe 0, mmap 0: -1 OK
> 
> A3) breakpoint is in new, not in old, again two cases:
> 
> A3a) __unregister_uprobe() sees old
> 
>         munmap 0, __unregister_uprobe 0, mmap: 1: 1 FAIL
> 
> A3b) __unregister_uprobe() sees new
> 
>         munmap 0, __unregister_uprobe -1, mmap: 1: 0 FAIL

There are more cases. I forgot the details of how the prio_tree stuff
works, so please consider whether it's possible to also have:

  __unregister_uprobe() will observe neither old nor new

This could happen if we first munmap, __unregister_uprobe() will iterate
past where mmap() will insert the new vma, mmap will insert the new vma,
and __unregister_uprobe() will now not observe it.

and

  __unregister_uprobe() will observe both old _and_ new

This latter could happen by favourably interleaving the prio_tree
iteration with the munmap and mmap operations, so that we first observe
the old vma, do the munmap, do the mmap, and then have the
find_next_vma_info() thing find the new vma.

> B1)
>                                 unregister_uprobe()
>         mmap()
>           mmap_uprobe()
>                                   __unregister_uprobe()
>                                   delete_uprobe()
> 
>         mmap +1, __unregister_uprobe() -1: 0 OK
> 
> B2)
>                                 unregister_uprobe()
>         mmap()
>                                   __unregister_uprobe()
>           mmap_uprobe()
>                                   delete_uprobe()
> 
>         mmap +1, __unregister_uprobe() 0: +1 FAIL 

^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-29 11:48           ` Peter Zijlstra
  2011-11-29 15:05             ` Peter Zijlstra
@ 2011-11-29 16:22             ` Srikar Dronamraju
  2011-11-30 12:25               ` Peter Zijlstra
  1 sibling, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-29 16:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

The rules that I am using are: 

mmap_uprobe() increments the count if:
	- it successfully adds a breakpoint.
	- it does not add a breakpoint, but sees that there is an
	  underlying breakpoint (via a read_opcode call).

munmap_uprobe() decrements the count if:
	- it sees an underlying breakpoint (via a read_opcode call).
	- A subsequent unregister_uprobe wouldn't find the breakpoint
	  unless a mmap_uprobe kicks in, since the old vma would be
	  dropped just after munmap_uprobe.

register_uprobe increments the count if:
	- it successfully adds a breakpoint.

unregister_uprobe decrements the count if:
	- it sees an underlying breakpoint (via a read_opcode call)
	  and removes it successfully.
	- A subsequent munmap_uprobe wouldn't find the breakpoint
	  since there is no underlying breakpoint after the
	  breakpoint removal.
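
The rules above can be sketched as a tiny user-space model. It is a
deliberate simplification: one probe, one mm, and a single flag standing
in for what read_opcode() would report. All names are illustrative and
none of this is code from the series.

```c
#include <stdbool.h>

struct probe_model {
	int  count;          /* stands in for mm->mm_uprobes_count */
	bool bkpt_present;   /* what a read_opcode() check would report */
	bool has_consumers;  /* false once unregister has kicked in */
};

static void model_register(struct probe_model *m)
{
	m->has_consumers = true;
	if (!m->bkpt_present) {
		m->bkpt_present = true;   /* breakpoint added */
		m->count++;
	}
}

static void model_mmap(struct probe_model *m)
{
	if (m->bkpt_present) {
		m->count++;               /* -EEXIST, breakpoint seen */
	} else if (m->has_consumers) {
		m->bkpt_present = true;   /* breakpoint added */
		m->count++;
	}
}

static void model_munmap(struct probe_model *m)
{
	if (m->bkpt_present)
		m->count--;               /* breakpoint is left in place */
}

static void model_unregister(struct probe_model *m)
{
	m->has_consumers = false;
	if (m->bkpt_present) {
		m->bkpt_present = false;  /* breakpoint removed */
		m->count--;
	}
}

/* A1: vma_adjust() (munmap + mmap) runs before unregister removes the probe. */
static int scenario_a1(void)
{
	struct probe_model m = { 0, false, false };

	model_register(&m);    /* count = 1 */
	model_munmap(&m);      /* count = 0, breakpoint left in place */
	model_mmap(&m);        /* count = 1, breakpoint seen */
	model_unregister(&m);  /* count = 0, breakpoint removed */
	return m.count;
}

/* A2a: unregister sees the old vma first, then vma_adjust() runs. */
static int scenario_a2a(void)
{
	struct probe_model m = { 0, false, false };

	model_register(&m);    /* count = 1 */
	model_unregister(&m);  /* count = 0, breakpoint removed */
	model_munmap(&m);      /* no breakpoint, no change */
	model_mmap(&m);        /* no consumers, no breakpoint, no change */
	return m.count;
}
```

In this model both orderings end with a count of 0, i.e. a net change
of -1 from the registered state. The model does not cover the prio_tree
iteration races, so it says nothing about those cases.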

> > 
> > if consumers is NULL, unregister_uprobes() has kicked already in, so
> > there is no point in inserting the probe, Hence we return EEXIST. The
> > following unregister_uprobe() (or the munmap_uprobe() which might race
> > before unregister_uprobe) is also going to decrement the count.  So we
> > have a case where the same breakpoint is accounted as removed twice. To
> > offset this, we pretend as if the breakpoint is around by incrementing
> > the count.
> 
> There's 2 main cases, 
> 	A) vma_adjust() vs unregister_uprobe() and 
> 	B) mmap() vs unregister_uprobe().
> 
> The result of A should be -1 reference in total, since we're removing
> the one probe. 

If the breakpoint was never there, then a value of 0 should also be
correct.  See case A3a and A3b.

> The result of B should be 0 since we're removing the
> probe and we shouldn't be installing new ones.
> 
> A1)
> 	vma_adjust()
> 	  munmap_uprobe()
> 				unregister_uprobe()
> 	  mmap_uprobe()
> 				  delete_uprobe()
> 
> 
> 	munmap will to -1, mmap will do +1, __unregister_uprobe() which is
> serialized against vma_adjust() will do -1 on either the old or new vma,
> resulting in a grand total of: -1+1-1=-1, OK

Right.

> 
> A2) breakpoint is in old, not in new, again two cases:
> 
> A2a) __unregister_uprobe() sees old

So  unregister_uprobe is called on the vma before vma_adjust.

> 
> 	munmap -1, __unregister_uprobe -1, mmap 0: -2 FAIL
> 

So munmap wouldn't decrement, because munmap_uprobe checks to see if the
breakpoint is still around before it decrements.

unregister, unlike munmap, removes the breakpoint too.

> A2b) __unregister_uprobe() sees new
> 

So the order would be munmap(), mmap() and unregister_uprobe()

> 	munmap -1, __unregister_uprobe 0, mmap 0: -1 OK

Right. Since the old vma is gone, the new vma doesn't have the
breakpoint.

> 
> A3) breakpoint is in new, not in old, again two cases:
> 

> A3a) __unregister_uprobe() sees old
> 
So  unregister_uprobe is called on the vma before vma_adjust.

> 	munmap 0, __unregister_uprobe 0, mmap: 1: 1 FAIL


If mmap_uprobe() increments, it means that the breakpoint was already
there (-EEXIST + read_opcode). Since there was no breakpoint, it will
not increment.

0 is the correct value here, not -1, because no probe was inserted
or removed.

> 
> A3b) __unregister_uprobe() sees new
So the order would be munmap(), mmap() and unregister_uprobe()
> 
> 	munmap 0, __unregister_uprobe -1, mmap: 1: 0 FAIL
> 

If mmap_uprobe() increments it would mean that breakpoint was already
there.  __unregister_uprobe will decrement.  Since we added a new probe
and deleted it, the value 0 is correct here.

> B1)
> 				unregister_uprobe()
> 	mmap()
> 	  mmap_uprobe()
> 				  __unregister_uprobe()
> 				  delete_uprobe()
> 
> 	mmap +1, __unregister_uprobe() -1: 0 OK
> 
> B2)
> 				unregister_uprobe()
> 	mmap()
> 				  __unregister_uprobe()
> 	  mmap_uprobe()
> 				  delete_uprobe()
> 
> 	mmap +1, __unregister_uprobe() 0: +1 FAIL

I think you meant __unregister_uprobe happened before mmap_uprobe.

If mmap_uprobe() increments, it means that the breakpoint was already
there (-EEXIST + read_opcode). Since there was no breakpoint, it will
not increment.
> 
> 
> > Would it help if I add an extra check in mmap_uprobe?
> > 
> > int mmap_uprobe(...) {
> > ....
> > 	       ret = install_breakpoint(vma->vm_mm, uprobe);
> > 	       if (ret == -EEXIST) {
> > 			if (!read_opcode(vma->vm_mm, vaddr, &opcode) &&
> > 					(opcode == UPROBES_BKPT_INSN))
> > 			       atomic_inc(&vma->vm_mm->mm_uprobes_count);
> > 		       ret = 0;
> > 	       } 
> > ....
> > }
> 
> > The extra read_opcode check will tell us if the breakpoint is still
> > around and then only increment the count. (As in it will distinguish if
> > the mmap_uprobe is from vm_adjust).
> 
> No, I don't see that fixing A2a for example.

This check should help A3a and B2 cases.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 2/5] uprobes: introduce uprobe_switch_to()
  2011-11-28 19:53     ` Peter Zijlstra
@ 2011-11-29 17:18       ` Oleg Nesterov
  2011-11-30 12:11         ` Peter Zijlstra
  0 siblings, 1 reply; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-29 17:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On 11/28, Peter Zijlstra wrote:
>
> On Mon, 2011-11-28 at 20:06 +0100, Oleg Nesterov wrote:
> > +void uprobe_switch_to(struct task_struct *curr)
> > +{
> > +       struct uprobe_task *utask = curr->utask;
> > +       struct pt_regs *regs = task_pt_regs(curr);
> > +
> > +       if (!utask || utask->state != UTASK_SSTEP)
> > +               return;
> > +
> > +       if (!(regs->flags & X86_EFLAGS_TF))
> > +               return;
> > +
> > +       set_xol_ip(regs);
> > +}
>
> > void __weak set_xol_ip(struct pt_regs *regs)
> >  {
> > +       int cpu = smp_processor_id();
> > +       struct uprobe_task *utask = current->utask;
> > +       struct uprobe *uprobe = utask->active_uprobe;
> > +
> > +       memcpy(uprobe_xol_slots[cpu], uprobe->insn, MAX_UINSN_BYTES);
> > +
> > +       utask->xol_vaddr = fix_to_virt(UPROBE_XOL_FIRST_PAGE)
> > +                               + UPROBES_XOL_SLOT_BYTES * cpu;
> > +       set_instruction_pointer(regs, utask->xol_vaddr);
> >  }
>
> So uprobe_switch_to() will always reset the IP to the start of the slot?
> That sounds wrong, things like the RIP relative stuff needs multiple
> instructions.

Hmm. Could you explain? Especially the "multiple instructions" part.

In any case we should reset the IP to the start of the slot.

But yes, I'm afraid this is too simple. Before these patches, pre_xol()
is called when we already know ->xol_vaddr. But afaics x86 doesn't use
this info (post_xol() does). So this looks equally correct or wrong.

But perhaps we need another arch-dependent hook which takes ->xol_vaddr
into account instead of simple memcpy(), to handle the RIP relative
case.

Or I misunderstood?


Peter, all, I apologize in advance, I can't be responsive today.

Oleg.


^ permalink raw reply	[flat|nested] 106+ messages in thread

* Re: [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS]
  2011-11-28 19:07   ` [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS] Oleg Nesterov
  2011-11-28 19:48     ` Peter Zijlstra
@ 2011-11-29 18:24     ` Oleg Nesterov
  1 sibling, 0 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-29 18:24 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On 11/28, Oleg Nesterov wrote:
>
> This patch adds uprobe_xol_slots[UPROBES_XOL_SLOT_BYTES][NR_CPUS] array.

Typo, it should be uprobe_xol_slots[NR_CPUS][UPROBES_XOL_SLOT_BYTES].


-------------------------------------------------------------------------
[PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS]

This patch adds uprobe_xol_slots[NR_CPUS][UPROBES_XOL_SLOT_BYTES] array.
Each CPU has its own slot for xol (used in the next patch).

We "export" this data to the user-space via set_fixmap(PAGE_KERNEL_VSYSCALL).

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
 arch/x86/include/asm/fixmap.h |    9 +++++++++
 arch/x86/kernel/uprobes.c     |   10 ++++++++++
 include/linux/uprobes.h       |    1 +
 kernel/uprobes.c              |    4 ++++
 4 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 460c74e..a902e19 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -81,6 +81,15 @@ enum fixed_addresses {
 	VVAR_PAGE,
 	VSYSCALL_HPET,
 #endif
+
+#ifdef CONFIG_UPROBES
+	#define UPROBES_XOL_SLOT_BYTES  128
+
+	UPROBE_XOL_LAST_PAGE,
+	UPROBE_XOL_FIRST_PAGE = UPROBE_XOL_LAST_PAGE
+			      + NR_CPUS * UPROBES_XOL_SLOT_BYTES / PAGE_SIZE,
+#endif
+
 	FIX_DBGP_BASE,
 	FIX_EARLYCON_MEM_BASE,
 #ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 4140137..ebb280c 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -664,3 +664,13 @@ bool can_skip_xol(struct pt_regs *regs, struct uprobe *u)
 	u->flags &= ~UPROBES_SKIP_SSTEP;
 	return false;
 }
+
+void __init map_uprobe_xol_slots(void *pages)
+{
+	int idx = UPROBE_XOL_FIRST_PAGE;
+
+	do {
+		__set_fixmap(idx, __pa(pages), PAGE_KERNEL_VSYSCALL);
+		pages += PAGE_SIZE;
+	} while (idx-- != UPROBE_XOL_LAST_PAGE);
+}
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index d590d66..bb59a66 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -142,6 +142,7 @@ extern bool uprobe_deny_signal(void);
 extern bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u);
 extern void __weak set_xol_ip(struct pt_regs *regs);
 extern void uprobe_switch_to(struct task_struct *);
+extern void map_uprobe_xol_slots(void *);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 9c509dc..20007da 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1342,6 +1342,9 @@ bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u)
 	return false;
 }
 
+static unsigned char
+uprobe_xol_slots[NR_CPUS][UPROBES_XOL_SLOT_BYTES] __page_aligned_bss;
+
 void __weak set_xol_ip(struct pt_regs *regs)
 {
 	set_instruction_pointer(regs, current->utask->xol_vaddr);
@@ -1490,6 +1493,7 @@ static int __init init_uprobes(void)
 		mutex_init(&uprobes_mmap_mutex[i]);
 	}
 	init_bulkref(&uprobes_srcu);
+	map_uprobe_xol_slots(uprobe_xol_slots);
 	return register_die_notifier(&uprobe_exception_nb);
 }
 
-- 
1.5.5.1
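
As a rough check of the fixmap arithmetic in the patch above, the following
user-space sketch counts how many fixmap indices the do/while loop in
map_uprobe_xol_slots() visits for a given CPU count, next to the number of
pages the slot array itself occupies. The 4096-byte page size and the helper
names are assumptions made for illustration; they are not part of the patch.

```c
#include <stddef.h>

#define UPROBES_XOL_SLOT_BYTES 128   /* value taken from the patch */
#define SKETCH_PAGE_SIZE       4096  /* assumption: 4 KiB pages */

/* Pages occupied by the slot array, rounded up to whole pages. */
static size_t xol_pages_needed(size_t nr_cpus)
{
	size_t bytes = nr_cpus * UPROBES_XOL_SLOT_BYTES;

	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/*
 * Fixmap indices visited by the loop: it starts at FIRST_PAGE and keeps
 * going until it has handled LAST_PAGE as well, where FIRST - LAST is the
 * truncating division used in the enum.
 */
static size_t xol_pages_walked(size_t nr_cpus)
{
	size_t span = nr_cpus * UPROBES_XOL_SLOT_BYTES / SKETCH_PAGE_SIZE;

	return span + 1;
}
```

Under these assumptions, 8 CPUs give one page either way, while 32 CPUs
occupy exactly one page but the loop walks two indices. Whether that extra
index matters depends on what follows the array in .bss.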




* Re: [PATCH RFC 0/5] uprobes: kill xol vma
  2011-11-29 10:30   ` Srikar Dronamraju
@ 2011-11-29 18:26     ` Oleg Nesterov
  2011-11-30 16:15       ` Andi Kleen
  0 siblings, 1 reply; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-29 18:26 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On 11/29, Srikar Dronamraju wrote:
>
> I will apply your patches and test and let you know how it goes. (in a day
> or two).

Thanks! Please note that 3/5 is wrong, I sent the updated version.
Or you can add the fix below.

Oleg.

--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1144,7 +1144,7 @@ bool __weak can_skip_xol(struct pt_regs *regs, struct uprobe *u)
 }
 
 static unsigned char
-uprobe_xol_slots[UPROBES_XOL_SLOT_BYTES][NR_CPUS] __page_aligned_bss;
+uprobe_xol_slots[NR_CPUS][UPROBES_XOL_SLOT_BYTES] __page_aligned_bss;
 
 void __weak set_xol_ip(struct pt_regs *regs)
 {
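
To see why the index order matters, here is a small user-space sketch
comparing the two declarations. SKETCH_NR_CPUS is an arbitrary small value
chosen for illustration and the helper names are hypothetical; only the
128-byte slot size comes from patch 3/5.

```c
#include <stddef.h>

#define SKETCH_NR_CPUS          4    /* assumption: small value for illustration */
#define UPROBES_XOL_SLOT_BYTES  128  /* value from patch 3/5 */

/* Fixed layout: one contiguous 128-byte slot per CPU. */
static unsigned char slots_fixed[SKETCH_NR_CPUS][UPROBES_XOL_SLOT_BYTES];
/* Original layout: each row is only SKETCH_NR_CPUS bytes long. */
static unsigned char slots_swapped[UPROBES_XOL_SLOT_BYTES][SKETCH_NR_CPUS];

/* Size of the region that slots[cpu] names in each layout. */
static size_t fixed_row_bytes(void)   { return sizeof(slots_fixed[0]); }
static size_t swapped_row_bytes(void) { return sizeof(slots_swapped[0]); }

/* Byte offset of slots[cpu] from the start of the array in each layout. */
static size_t fixed_slot_offset(size_t cpu)
{
	return cpu * sizeof(slots_fixed[0]);
}

static size_t swapped_slot_offset(size_t cpu)
{
	return cpu * sizeof(slots_swapped[0]);
}
```

With the swapped declaration, uprobe_xol_slots[cpu] names a row of NR_CPUS
bytes starting at byte cpu * NR_CPUS, so a memcpy() of a whole instruction
into it would not land in a private 128-byte slot.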



* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-29  8:33         ` Srikar Dronamraju
  2011-11-29 11:48           ` Peter Zijlstra
@ 2011-11-30  5:30           ` Srikar Dronamraju
  1 sibling, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-30  5:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

> 
> int mmap_uprobe(...) {
> ....
> 	       ret = install_breakpoint(vma->vm_mm, uprobe);
> 	       if (ret == -EEXIST) {
> 			if (!read_opcode(vma->vm_mm, vaddr, &opcode) &&
> 					(opcode == UPROBES_BKPT_INSN))
> 			       atomic_inc(&vma->vm_mm->mm_uprobes_count);
> 		       ret = 0;
> 	       } 
> ....
> }
> 

In fact, the check for EEXIST and read_opcode in mmap_uprobe() is needed
for another reason too.

Let's say that while unregister_uprobe was in progress, a thread that's
being probed forked a child and the child called mmap_uprobe.

Now mmap_uprobe might find that the breakpoint is already inserted
since the pages are shared with the parent. But before
unregister_uprobe can come around and clean up, the child can run and hit
the breakpoint. Since the breakpoint count is 0 for the child, we don't
expect the child to have hit a breakpoint placed by uprobes, and the
child gets a SIGTRAP.

With this check for read_opcode on EEXIST from install_breakpoint, we
will know that there is a valid breakpoint underneath and increment
the count. So on a breakpoint hit, the uprobes notifier does the right
thing.

If unregister_uprobe() had already cleaned up the breakpoint in the
parent, the child's copy would also be clean, so read_opcode won't find
the breakpoint and hence we won't increment the count.
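
The decision described above can be sketched as a small user-space
predicate. This is only a model of the rule, not the kernel code:
should_count_breakpoint() and its read_ok parameter are hypothetical names,
and 0xcc (x86 int3) stands in for UPROBES_BKPT_INSN.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_BKPT_INSN 0xcc  /* x86 int3; stands in for UPROBES_BKPT_INSN */

/*
 * Returns true when mmap_uprobe() should bump mm_uprobes_count:
 * either the breakpoint was newly installed, or the install reported
 * -EEXIST and the byte read back really is the breakpoint opcode.
 * read_ok models a successful read_opcode() call.
 */
static bool should_count_breakpoint(int install_ret, bool read_ok,
				    unsigned char opcode)
{
	if (install_ret == 0)
		return true;
	if (install_ret == -EEXIST)
		return read_ok && opcode == SKETCH_BKPT_INSN;
	return false;
}
```

In the fork case, the child sees -EEXIST together with a 0xcc byte and so
gets counted; if the parent's breakpoint was already removed, the opcode
read back is the original byte and the count stays untouched.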

-- 
Thanks and Regards
Srikar



* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-29 15:05             ` Peter Zijlstra
@ 2011-11-30  5:50               ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-11-30  5:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

> 
> There's more cases, I forgot the details of how the prio_tree stuff
> works, so please consider if its possible to also have:
> 
>   __unregister_uprobe() will observe neither old nor new
> 
> This could happen if we first munmap, __unregister_uprobe() will iterate
> past where mmap() will insert the new vma, mmap will insert the new vma,
> and __unregister_uprobe() will now not observe it.
> 

- When we iterate through __unregister_uprobe(), we always walk from the
  root of the prio tree and do not depend on the last found node. So
  __unregister_uprobe being able to iterate through the rmap without
  finding the old or the new vma would mean that the exclusive mmap_sem
  was dropped for at least a brief period and munmap/mmap are disjoint.

Here munmap_uprobe would have reduced the count followed by the pages
being cleared.
__unregister_uprobe maintains the status quo.
mmap_uprobe would load a new set of pages without any breakpoint; since
there are no consumers and no underlying breakpoints, it also maintains
the status quo.

> and
> 
>   __unregister_uprobe() will observe both old _and_ new
> 
> This latter could happen by favourably interleaving the prio_tree
> iteration with the munmap and mmap operations, so that we first observe
> the old vma, do the munmap, do the mmap, and then have the
> find_next_vma_info() thing find the new vma.

If __unregister_uprobe() can observe both old _and_ new, then it means
mmap has occurred. So it's correct that probes are removed from
the old and new. The munmap_uprobe of the old vma wouldn't see the
breakpoint (via read_opcode) so won't decrement the count. If the
munmap_uprobe had seen the breakpoint before unregister_uprobe, then
unregister_uprobe can't decrement the count.

-- 
Thanks and Regards
Srikar



* Re: [PATCH 2/5] uprobes: introduce uprobe_switch_to()
  2011-11-29 17:18       ` Oleg Nesterov
@ 2011-11-30 12:11         ` Peter Zijlstra
  2011-11-30 17:10           ` Oleg Nesterov
  0 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-30 12:11 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Tue, 2011-11-29 at 18:18 +0100, Oleg Nesterov wrote:
> On 11/28, Peter Zijlstra wrote:
> >
> > On Mon, 2011-11-28 at 20:06 +0100, Oleg Nesterov wrote:
> > > +void uprobe_switch_to(struct task_struct *curr)
> > > +{
> > > +       struct uprobe_task *utask = curr->utask;
> > > +       struct pt_regs *regs = task_pt_regs(curr);
> > > +
> > > +       if (!utask || utask->state != UTASK_SSTEP)
> > > +               return;
> > > +
> > > +       if (!(regs->flags & X86_EFLAGS_TF))
> > > +               return;
> > > +
> > > +       set_xol_ip(regs);
> > > +}
> >
> > > void __weak set_xol_ip(struct pt_regs *regs)
> > >  {
> > > +       int cpu = smp_processor_id();
> > > +       struct uprobe_task *utask = current->utask;
> > > +       struct uprobe *uprobe = utask->active_uprobe;
> > > +
> > > +       memcpy(uprobe_xol_slots[cpu], uprobe->insn, MAX_UINSN_BYTES);
> > > +
> > > +       utask->xol_vaddr = fix_to_virt(UPROBE_XOL_FIRST_PAGE)
> > > +                               + UPROBES_XOL_SLOT_BYTES * cpu;
> > > +       set_instruction_pointer(regs, utask->xol_vaddr);
> > >  }
> >
> > So uprobe_switch_to() will always reset the IP to the start of the slot?
> > That sounds wrong, things like the RIP relative stuff needs multiple
> > instructions.
> 
> Hmm. Could you explain? Especially the "multiple instructions" part.
> 
> In any case we should reset the IP to the start of the slot.
> 
> But yes, I'm afraid this is too simple. Before this patches pre_xol()
> is called when we already know ->xol_vaddr. But afaics x86 doesn't use
> this info (post_xol() does). So this looks equally correct or wrong.
> 
> But perhaps we need another arch-dependent hook which takes ->xol_vaddr
> into account instead of simple memcpy(), to handle the RIP relative
> case.
> 
> Or I misunderstood?

Suppose you need multiple instructions to replace the one you patched
out, for example because the instruction was RIP relative (the effect
relied on the IP the instruction is at, eg. short jumps instead of
absolute jumps).

One way to translate these instructions is something like

  push eax
  mov eax, $previous_ip
  $ins eax+offset
  pop eax

Also, the thing Srikar mentioned is boosted probes, in that case you
forgo the whole single step thing and rewrite the probe as:

  $ins
  jmp $next_insn

Now in the former case you still single step so the context switch hook
can function as proposed (triggered off of TIF_SINGLESTEP). However if
you get preempted after the mov you want to continue with the $ins, not
restart at push. So uprobe_switch_to() will have to preserve the
relative offset within the slot.

On the second example there's no singlestepping left, so we need to
create a new TIF flag, when you first set up the probe you toggle that
flag and on the first context switch where the IP is outside of the slot
you clear it. But still you need to maintain relative offset within the
slot when you move it around.
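
A minimal sketch of the flag handling for the boosted case, written as
user-space C. struct sketch_task and its in_xol_slot field are hypothetical
stand-ins for the proposed TIF flag, and slot_base stands for the address of
the slot the task was last running in; none of these names exist in the
patches.

```c
#include <stdbool.h>
#include <stdint.h>

#define UPROBES_XOL_SLOT_BYTES 128  /* value from patch 3/5 */

/* Hypothetical per-task state standing in for the proposed TIF flag. */
struct sketch_task {
	uintptr_t ip;        /* user-space instruction pointer */
	bool in_xol_slot;    /* set when the probe is first set up */
};

/* True if ip lies within the slot starting at slot_base. */
static bool ip_in_slot(uintptr_t ip, uintptr_t slot_base)
{
	return ip >= slot_base && ip - slot_base < UPROBES_XOL_SLOT_BYTES;
}

/*
 * Context-switch check: clear the flag on the first switch where the IP
 * has left the slot. Returns true while the IP still has to be moved
 * along with the slot.
 */
static bool sketch_switch_check(struct sketch_task *t, uintptr_t slot_base)
{
	if (!t->in_xol_slot)
		return false;
	if (!ip_in_slot(t->ip, slot_base)) {
		t->in_xol_slot = false;
		return false;
	}
	return true;
}
```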




* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-29 16:22             ` Srikar Dronamraju
@ 2011-11-30 12:25               ` Peter Zijlstra
  2011-12-01  5:40                 ` Srikar Dronamraju
  0 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-30 12:25 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

On Tue, 2011-11-29 at 21:52 +0530, Srikar Dronamraju wrote:
> The rules that I am using are: 
> 
> mmap_uprobe() increments the count if 
>         - it successfully adds a breakpoint.
>         - it not add a breakpoint, but sees that there is a underlying
>           breakpoint (via a read_opcode call).
> 
> munmap_uprobe() decrements the count if 
>         - it sees a underlying breakpoint,  (via  a read_opcode call)
>         - Subsequent unregister_uprobe wouldnt find the breakpoint
>           unless a mmap_uprobe kicks in, since the old vma would be
>           dropped just after munmap_uprobe.
> 
> register_uprobe increments the count if:
>         - it successfully adds a breakpoint.
> 
> unregister_uprobe decrements the count if:
>         - it sees a underlying breakpoint and removes successfully. 
>                         (via a read_opcode call)
>         - Subsequent munmap_uprobe wouldnt find the breakpoint
>           since there is no underlying breakpoint after the
>           breakpoint removal. 

The problem I'm having is that such stuff isn't included in the patch
set.

We've got both comments in the C language and Changelog in our patch
system, yet you consistently fail to use either to convey useful
information on non-trivial bits like this.

This leaves the reviewer wondering if you've actually considered stuff
properly, then me actually finding bugs in there does of course
undermine that even further.

What I really would like is for this patch set not to have such subtle
stuff at all, esp. at first. Once it's in and it's been used a bit we can
start optimizing and add subtle crap like this.




* Re: [PATCH RFC 0/5] uprobes: kill xol vma
  2011-11-29 18:26     ` Oleg Nesterov
@ 2011-11-30 16:15       ` Andi Kleen
  2011-11-30 16:20         ` Peter Zijlstra
  0 siblings, 1 reply; 106+ messages in thread
From: Andi Kleen @ 2011-11-30 16:15 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Peter Zijlstra, Linus Torvalds, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Andi Kleen, Christoph Hellwig,
	Steven Rostedt, Roland McGrath, Thomas Gleixner,
	Masami Hiramatsu, Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson


>  static unsigned char
> -uprobe_xol_slots[UPROBES_XOL_SLOT_BYTES][NR_CPUS] __page_aligned_bss;
> +uprobe_xol_slots[NR_CPUS][UPROBES_XOL_SLOT_BYTES] __page_aligned_bss;

NR_CPUS arrays are basically always wrong.

Use per cpu data.

-Andi



* Re: [PATCH RFC 0/5] uprobes: kill xol vma
  2011-11-30 16:15       ` Andi Kleen
@ 2011-11-30 16:20         ` Peter Zijlstra
  2011-11-30 18:47           ` Oleg Nesterov
  0 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-11-30 16:20 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Oleg Nesterov, Srikar Dronamraju, Linus Torvalds, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Wed, 2011-11-30 at 17:15 +0100, Andi Kleen wrote:
> >  static unsigned char
> > -uprobe_xol_slots[UPROBES_XOL_SLOT_BYTES][NR_CPUS] __page_aligned_bss;
> > +uprobe_xol_slots[NR_CPUS][UPROBES_XOL_SLOT_BYTES] __page_aligned_bss;
> 
> NR_CPUS arrays are basically always wrong.
> 
> Use per cpu data.

Doesn't really work here, you'd know if you'd read the patches. What we
could do though is do a UPROBES_XOL_SLOT_BYTES * nr_cpu_ids bootmem
allocation or so.
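
For a sense of scale, here is a user-space sketch of the two sizings.
NR_CPUS is the compile-time maximum while nr_cpu_ids is only known at boot;
the CPU counts used below (4096 and 8), the 4096-byte page size and the
helper name are assumptions for illustration.

```c
#include <stddef.h>

#define UPROBES_XOL_SLOT_BYTES 128   /* value from patch 3/5 */
#define SKETCH_PAGE_SIZE       4096  /* assumption: 4 KiB pages */

/*
 * Bytes needed for one slot per CPU, rounded up to whole pages because
 * the area is mapped into user space page by page.
 */
static size_t xol_area_bytes(size_t cpus)
{
	size_t bytes = cpus * UPROBES_XOL_SLOT_BYTES;

	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE
		* SKETCH_PAGE_SIZE;
}
```

Under these assumptions, xol_area_bytes(4096) is 512 KiB for the static
NR_CPUS array, while xol_area_bytes(8) is a single page for an allocation
sized by nr_cpu_ids on an 8-CPU machine.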


* Re: [PATCH 2/5] uprobes: introduce uprobe_switch_to()
  2011-11-30 12:11         ` Peter Zijlstra
@ 2011-11-30 17:10           ` Oleg Nesterov
  0 siblings, 0 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-30 17:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On 11/30, Peter Zijlstra wrote:
>
> On Tue, 2011-11-29 at 18:18 +0100, Oleg Nesterov wrote:
> > On 11/28, Peter Zijlstra wrote:
> > >
> > > So uprobe_switch_to() will always reset the IP to the start of the slot?
> > > That sounds wrong, things like the RIP relative stuff needs multiple
> > > instructions.
> >
> > Hmm. Could you explain? Especially the "multiple instructions" part.
> >
> > In any case we should reset the IP to the start of the slot.
> >
> > But yes, I'm afraid this is too simple. Before this patches pre_xol()
> > is called when we already know ->xol_vaddr. But afaics x86 doesn't use
> > this info (post_xol() does). So this looks equally correct or wrong.
> >
> > But perhaps we need another arch-dependent hook which takes ->xol_vaddr
> > into account instead of simple memcpy(), to handle the RIP relative
> > case.
> >
> > Or I misunderstood?
>
> Suppose you need multiple instructions to replace the one you patched
> out,

Ah, I see, thanks...

Yes, in this case set_xol_ip() should add the offset,
regs->ip % UPROBES_XOL_SLOT_BYTES.

But the current code doesn't use multiple instructions and it relies
on the single-stepping, so I think currently this is correct.
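
A user-space sketch of that offset-preserving variant follows. It assumes
the slot area starts at an address that is a multiple of
UPROBES_XOL_SLOT_BYTES (true for a page-aligned base), which is what makes
the modulo trick valid; slot_vaddr() and relocate_ip() are hypothetical
helper names.

```c
#include <stdint.h>

#define UPROBES_XOL_SLOT_BYTES 128  /* value from patch 3/5 */

/* Slot address for a CPU, given the page-aligned base of the slot area. */
static uintptr_t slot_vaddr(uintptr_t area_base, unsigned int cpu)
{
	return area_base + (uintptr_t)UPROBES_XOL_SLOT_BYTES * cpu;
}

/*
 * New IP after the task moves to new_cpu: keep the offset the old IP
 * had within its slot and apply it to the new CPU's slot.
 */
static uintptr_t relocate_ip(uintptr_t area_base, uintptr_t old_ip,
			     unsigned int new_cpu)
{
	uintptr_t offset = old_ip % UPROBES_XOL_SLOT_BYTES;

	return slot_vaddr(area_base, new_cpu) + offset;
}
```

A task that was 7 bytes into the slot of CPU 2 would then resume 7 bytes
into the slot of whichever CPU it lands on, rather than at the slot start.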

> for example because the instruction was RIP relative (the effect
> relied on the IP the instruction is at, eg. short jumps instead of
> absolute jumps).
>
> One way to translate these instructions is something like
>
>   push eax
>   mov eax, $previous_ip
>   $ins eax+offset
>   pop eax

I can be easily wrong, but afaics this particular case is covered by
pre_xol/post_xol. But I guess this doesn't matter.

Yes, I thought about multiple insns in xol slot too.

> Also, the thing Srikar mentioned is boosted probes, in that case you
> forgo the whole single step thing and rewrite the probe as:
>
>   $ins
>   jmp $next_insn

Yes! it would be nice to avoid the stepping if possible. But so far
I am not sure how/when this can work...

> Now in the former case you still single step so the context switch hook
> can function as proposed (triggered off of TIF_SINGLESTEP). However if
> you get preempted after the mov you want to continue with the $ins, not
> restart at push.

This is not clear to me. Single step with multiple insns?

> So uprobe_switch_to() will have to preserve the
> relative offset within the slot.

Yes, agreed.

> On the second example there's no singlestepping left, so we need to
> create a new TIF flag, when you first set up the probe you toggle that
> flag and on the first context switch where the IP is outside of the slot
> you clear it. But still you need to maintain relative offset within the
> slot when you move it around.

Yes. Currently uprobe_switch_to() checks X86_EFLAGS_TF to verify that
it is correct to change regs->ip. But if we know that, say, this insn
can't jump/call/rep we can simply check regs->ip. And in this case we
can avoid the stepping.

Thanks,

Oleg.



* Re: [PATCH RFC 0/5] uprobes: kill xol vma
  2011-11-30 16:20         ` Peter Zijlstra
@ 2011-11-30 18:47           ` Oleg Nesterov
  0 siblings, 0 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-30 18:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andi Kleen, Srikar Dronamraju, Linus Torvalds, Andrew Morton,
	LKML, Linux-mm, Ingo Molnar, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On 11/30, Peter Zijlstra wrote:
>
> What we
> could do though is do a UPROBES_XOL_SLOT_BYTES * nr_cpu_ids bootmem
> allocation or so.

Agreed, this looks much better.

I'd prefer to do this in a separate patch to keep this change as simple
as possible.

Oleg.



* Re: [PATCH v7 3.2-rc2 8/30] x86: analyze instruction and determine fixups.
  2011-11-18 11:08 ` [PATCH v7 3.2-rc2 8/30] x86: analyze instruction and determine fixups Srikar Dronamraju
@ 2011-11-30 18:57   ` Oleg Nesterov
  2011-12-01  5:52     ` Srikar Dronamraju
  0 siblings, 1 reply; 106+ messages in thread
From: Oleg Nesterov @ 2011-11-30 18:57 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On 11/18, Srikar Dronamraju wrote:
>
> +static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
> +							struct insn *insn)
> +{
> [...snip...]
> +	if (insn->immediate.nbytes) {
> +		cursor++;
> +		memmove(cursor, cursor + insn->displacement.nbytes,
> +						insn->immediate.nbytes);
> +	}
> +	return;
> +}

Of course I do not understand this code. But it seems that it can
rewrite uprobe->insn ?

If yes, don't we need to save the original insn for unregister_uprobe?

Oleg.



* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-11-30 12:25               ` Peter Zijlstra
@ 2011-12-01  5:40                 ` Srikar Dronamraju
  2011-12-01 11:36                   ` Peter Zijlstra
  0 siblings, 1 reply; 106+ messages in thread
From: Srikar Dronamraju @ 2011-12-01  5:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

> > The rules that I am using are: 
> > 
> > mmap_uprobe() increments the count if 
> >         - it successfully adds a breakpoint.
> >         - it not add a breakpoint, but sees that there is a underlying
> >           breakpoint (via a read_opcode call).
> > 
> > munmap_uprobe() decrements the count if 
> >         - it sees a underlying breakpoint,  (via  a read_opcode call)
> >         - Subsequent unregister_uprobe wouldnt find the breakpoint
> >           unless a mmap_uprobe kicks in, since the old vma would be
> >           dropped just after munmap_uprobe.
> > 
> > register_uprobe increments the count if:
> >         - it successfully adds a breakpoint.
> > 
> > unregister_uprobe decrements the count if:
> >         - it sees a underlying breakpoint and removes successfully. 
> >                         (via a read_opcode call)
> >         - Subsequent munmap_uprobe wouldnt find the breakpoint
> >           since there is no underlying breakpoint after the
> >           breakpoint removal. 
> 
> The problem I'm having is that such stuff isn't included in the patch
> set.
> 
> We've got both comments in the C language and Changelog in our patch
> system, yet you consistently fail to use either to convey useful
> information on non-trivial bits like this.
> 

Agreed, I will put this in as part of the comments.

> This leaves the reviewer wondering if you've actually considered stuff
> properly, then me actually finding bugs in there does of course
> undermine that even further.
> 
> What I really would like is for this patch set not to have such subtle
> stuff at all, esp. at first. Once its in and its been used a bit we can
> start optimizing and add subtle crap like this.

We actually started the discussion with why we increment the count in
mmap_uprobe() in the EEXIST case (and read_opcode()). It exists for two
reasons.
	- To handle the fork case (that I wrote about in another mail).
	- To handle mremap (the case we are discussing now).

I would contend that removing the breakpoint in munmap doesn't amount to
optimization. Since the start of unmap(), there cannot be another
remove_breakpoint called for the vma,vaddr tuple, until the vma is
cleaned up, or the subsequent mmap() is done. So the case of accounting
for an already decremented count should never occur.

I was following the general convention used within the kernel of not
bothering about the area that we are going to unmap. For example: if a
ptraced area were to be unmapped or remapped, I don't see the breakpoint
being removed and added back. Also, if a ptraced process is exiting, we
don't go about removing the installed breakpoints.

Also we would still need the check for EEXIST and read_opcode for handling
the fork() case. So even if we add an extra line to remove the actual
breakpoint in munmap, it doesn't make the code any simpler.

-- 
Thanks and Regards
Srikar



* Re: [PATCH v7 3.2-rc2 8/30] x86: analyze instruction and determine fixups.
  2011-11-30 18:57   ` Oleg Nesterov
@ 2011-12-01  5:52     ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-12-01  5:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

* Oleg Nesterov <oleg@redhat.com> [2011-11-30 19:57:51]:

> On 11/18, Srikar Dronamraju wrote:
> >
> > +static void handle_riprel_insn(struct mm_struct *mm, struct uprobe *uprobe,
> > +							struct insn *insn)
> > +{
> > [...snip...]
> > +	if (insn->immediate.nbytes) {
> > +		cursor++;
> > +		memmove(cursor, cursor + insn->displacement.nbytes,
> > +						insn->immediate.nbytes);
> > +	}
> > +	return;
> > +}
> 
> Of course I don not understand this code. But it seems that it can
> rewrite uprobe->insn ?
> 

Yes, we do rewrite the instruction for the RIP relative instructions. 
But the first byte is still intact.

> If yes, don't we need to save the original insn for unregister_uprobe?

When we unregister, we just put back the least opcode size which
happens to be the first byte for x86.
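
A toy user-space model of why restoring one byte is enough: only the first
byte of the probed text is ever overwritten, so later rewrites of the saved
copy do not affect what gets put back. The struct, the helper names and
SKETCH_MAX_INSN (a stand-in for MAX_UINSN_BYTES, whose real value is not
shown here) are assumptions; 0xcc is the x86 int3 opcode.

```c
#include <string.h>

#define SKETCH_BKPT_INSN 0xcc  /* x86 int3 */
#define SKETCH_MAX_INSN  16    /* stand-in for MAX_UINSN_BYTES */

struct sketch_uprobe {
	/* Saved copy of the probed instruction; may be rewritten later. */
	unsigned char insn[SKETCH_MAX_INSN];
};

/* Save the instruction bytes, then patch only the first byte of the text. */
static void sketch_install(struct sketch_uprobe *u, unsigned char *text)
{
	memcpy(u->insn, text, SKETCH_MAX_INSN);
	text[0] = SKETCH_BKPT_INSN;
}

/* Put back the first byte; the rest of the text was never modified. */
static void sketch_remove(const struct sketch_uprobe *u, unsigned char *text)
{
	text[0] = u->insn[0];
}
```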

-- 
Thanks and Regards
Srikar



* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-12-01  5:40                 ` Srikar Dronamraju
@ 2011-12-01 11:36                   ` Peter Zijlstra
  2011-12-01 13:24                     ` Srikar Dronamraju
  0 siblings, 1 reply; 106+ messages in thread
From: Peter Zijlstra @ 2011-12-01 11:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

On Thu, 2011-12-01 at 11:10 +0530, Srikar Dronamraju wrote:

> > What I really would like is for this patch set not to have such subtle
> > stuff at all, esp. at first. Once its in and its been used a bit we can
> > start optimizing and add subtle crap like this.
> 
> We actually started the discussion of why we increment the count in
> mmap_uprobe() in EEXIST case (and read_opcode()). It exists for two
> reasons.
> 	- To handle fork case (that I wrote in another mail).
> 	- To handle mremap.(the case where we are discussing now)
> 
> I would contend that removing the breakpoint in munmap doesnt amount to
> optimization. Since the start of unmap(), there cannot be another
> remove_breakpoint called for the vma,vaddr tuple, until the vma is
> cleaned up, or the subsequent mmap() is done. So the case of accounting
> for an already decremented count should never occur.
> 
> I was following the general convention being used within the kernel to not
> bother about the area that we are going to unmap. For example: If a ptraced
> area were to be unmapped or remapped, I dont see the breakpoint being
> removed and added back. Also if a ptrace process is exitting, we dont go
> about removing the installed breakpoints.
> 
> Also we would still need the check for EEXIST and read_opcode for handling
> the fork() case. So even if we add extra line to remove the actual
> breakpoint in munmap, It doesnt make the code any more simpler.

Not adding the counter now does though. The whole mm->mm_uprobes_count
thing itself is basically an optimization.

Without it we'll get to uprobe_notify_resume() too often, but who cares.
And not having to worry about it removes a lot of this complexity.

Then in the patch where you introduce this optimization you can list all
the nitty gritty details of mremap/fork and counter balancing.

Another point, maybe add some comments on how the generic bits of
uprobe_notify_resume()/uprobe_bkpt_notifier()/uprobe_post_notifier() etc
hang together and what the arch stuff should do. 

Currently I have to flip back and forth between those to figure out what
happens.

Having that information also helps validate that x86 does indeed do what
is expected and helps other arch maintainers write their code without
having to grok wtf x86 does.


* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
                     ` (4 preceding siblings ...)
  2011-11-28 15:29   ` Peter Zijlstra
@ 2011-12-01 13:20   ` Peter Zijlstra
  5 siblings, 0 replies; 106+ messages in thread
From: Peter Zijlstra @ 2011-12-01 13:20 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

On Fri, 2011-11-18 at 16:37 +0530, Srikar Dronamraju wrote:
> +static int __register_uprobe(struct inode *inode, loff_t offset,
> +                               struct uprobe *uprobe)
> +{
> +       struct list_head try_list;
> +       struct vm_area_struct *vma;
> +       struct address_space *mapping;
> +       struct vma_info *vi, *tmpvi;
> +       struct mm_struct *mm;
> +       loff_t vaddr;
> +       int ret = 0;
> +
> +       mapping = inode->i_mapping;
> +       INIT_LIST_HEAD(&try_list);
> +       while ((vi = find_next_vma_info(&try_list, offset,
> +                                               mapping, true)) != NULL) {
> +               if (IS_ERR(vi)) {
> +                       ret = -ENOMEM;
> +                       break;
> +               }
> +               mm = vi->mm;
> +               down_read(&mm->mmap_sem);
> +               vma = find_vma(mm, (unsigned long)vi->vaddr);
> +               if (!vma || !valid_vma(vma, true)) {
> +                       list_del(&vi->probe_list);
> +                       kfree(vi);
> +                       up_read(&mm->mmap_sem);
> +                       mmput(mm);
> +                       continue;
> +               }
> +               vaddr = vma->vm_start + offset;
> +               vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +               if (vma->vm_file->f_mapping->host != inode ||
> +                                               vaddr != vi->vaddr) {
> +                       list_del(&vi->probe_list);
> +                       kfree(vi);
> +                       up_read(&mm->mmap_sem);
> +                       mmput(mm);
> +                       continue;
> +               }
> +               ret = install_breakpoint(mm);
> +               up_read(&mm->mmap_sem);
> +               mmput(mm);
> +               if (ret && ret == -EEXIST)
> +                       ret = 0;
> +               if (!ret)
> +                       break;
> +       }
> +       list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
> +               list_del(&vi->probe_list);
> +               kfree(vi);
> +       }
> +       return ret;
> +}
> +
> +static void __unregister_uprobe(struct inode *inode, loff_t offset,
> +                                               struct uprobe *uprobe)
> +{
> +       struct list_head try_list;
> +       struct address_space *mapping;
> +       struct vma_info *vi, *tmpvi;
> +       struct vm_area_struct *vma;
> +       struct mm_struct *mm;
> +       loff_t vaddr;
> +
> +       mapping = inode->i_mapping;
> +       INIT_LIST_HEAD(&try_list);
> +       while ((vi = find_next_vma_info(&try_list, offset,
> +                                               mapping, false)) != NULL) {
> +               if (IS_ERR(vi))
> +                       break;
> +               mm = vi->mm;
> +               down_read(&mm->mmap_sem);
> +               vma = find_vma(mm, (unsigned long)vi->vaddr);
> +               if (!vma || !valid_vma(vma, false)) {
> +                       list_del(&vi->probe_list);
> +                       kfree(vi);
> +                       up_read(&mm->mmap_sem);
> +                       mmput(mm);
> +                       continue;
> +               }
> +               vaddr = vma->vm_start + offset;
> +               vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +               if (vma->vm_file->f_mapping->host != inode ||
> +                                               vaddr != vi->vaddr) {
> +                       list_del(&vi->probe_list);
> +                       kfree(vi);
> +                       up_read(&mm->mmap_sem);
> +                       mmput(mm);
> +                       continue;
> +               }
> +               remove_breakpoint(mm);
> +               up_read(&mm->mmap_sem);
> +               mmput(mm);
> +       }
> +
> +       list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
> +               list_del(&vi->probe_list);
> +               kfree(vi);
> +       }
> +       delete_uprobe(uprobe);
> +} 

I already mentioned on IRC that there's a lot of duplication here and
how to 'solve that'...

Something like the below. It loses the delete_uprobe() bit and adds a
few XXX marks where we have to deal with -ENOMEM. Also, it's not been
near a compiler.

---
 kernel/uprobes.c |   78 ++++++++++++++---------------------------------------
 1 files changed, 21 insertions(+), 57 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 2493191..c57284a 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -622,7 +622,7 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 }
 
 static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
-							loff_t vaddr)
+			      struct vm_area_struct *vma, loff_t vaddr)
 {
 	if (!set_orig_insn(mm, uprobe, (unsigned long)vaddr, true))
 		atomic_dec(&mm->mm_uprobes_count);
@@ -713,8 +713,10 @@ static struct vma_info *find_next_vma_info(struct list_head *head,
 	return retvi;
 }
 
-static int __register_uprobe(struct inode *inode, loff_t offset,
-				struct uprobe *uprobe)
+typedef int (*vma_func_t)(struct mm_struct *mm, struct uprobe *uprobe,
+			  struct vm_area_struct *vma, unsigned long addr);
+
+static int __for_each_vma(struct uprobe *uprobe, vma_func_t func)
 {
 	struct list_head try_list;
 	struct vm_area_struct *vma;
@@ -724,12 +726,12 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 	loff_t vaddr;
 	int ret = 0;
 
-	mapping = inode->i_mapping;
+	mapping = uprobe->inode->i_mapping;
 	INIT_LIST_HEAD(&try_list);
-	while ((vi = find_next_vma_info(&try_list, offset,
+	while ((vi = find_next_vma_info(&try_list, uprobe->offset,
 						mapping, true)) != NULL) {
 		if (IS_ERR(vi)) {
-			ret = -ENOMEM;
+			ret = PTR_ERR(vi);
 			break;
 		}
 		mm = vi->mm;
@@ -742,9 +744,9 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		vaddr = vma->vm_start + offset;
+		vaddr = vma->vm_start + uprobe->offset;
 		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
-		if (vma->vm_file->f_mapping->host != inode ||
+		if (vma->vm_file->f_mapping->host != uprobe->inode ||
 						vaddr != vi->vaddr) {
 			list_del(&vi->probe_list);
 			kfree(vi);
@@ -752,12 +754,12 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
+		ret = func(mm, uprobe, vma, vi->vaddr);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 		if (ret && ret == -EEXIST)
 			ret = 0;
-		if (!ret)
+		if (ret)
 			break;
 	}
 	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
@@ -767,52 +769,14 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 	return ret;
 }
 
-static void __unregister_uprobe(struct inode *inode, loff_t offset,
-						struct uprobe *uprobe)
+static int __register_uprobe(struct uprobe *uprobe)
 {
-	struct list_head try_list;
-	struct address_space *mapping;
-	struct vma_info *vi, *tmpvi;
-	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	loff_t vaddr;
-
-	mapping = inode->i_mapping;
-	INIT_LIST_HEAD(&try_list);
-	while ((vi = find_next_vma_info(&try_list, offset,
-						mapping, false)) != NULL) {
-		if (IS_ERR(vi))
-			break;
-		mm = vi->mm;
-		down_read(&mm->mmap_sem);
-		vma = find_vma(mm, (unsigned long)vi->vaddr);
-		if (!vma || !valid_vma(vma, false)) {
-			list_del(&vi->probe_list);
-			kfree(vi);
-			up_read(&mm->mmap_sem);
-			mmput(mm);
-			continue;
-		}
-		vaddr = vma->vm_start + offset;
-		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
-		if (vma->vm_file->f_mapping->host != inode ||
-						vaddr != vi->vaddr) {
-			list_del(&vi->probe_list);
-			kfree(vi);
-			up_read(&mm->mmap_sem);
-			mmput(mm);
-			continue;
-		}
-		remove_breakpoint(mm, uprobe, vi->vaddr);
-		up_read(&mm->mmap_sem);
-		mmput(mm);
-	}
+	return __for_each_vma(uprobe, install_breakpoint);
+}
 
-	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
-		list_del(&vi->probe_list);
-		kfree(vi);
-	}
-	delete_uprobe(uprobe);
+static int __unregister_uprobe(struct uprobe *uprobe)
+{
+	return __for_each_vma(uprobe, remove_breakpoint);
 }
 
 /*
@@ -852,10 +816,10 @@ int register_uprobe(struct inode *inode, loff_t offset,
 	mutex_lock(uprobes_hash(inode));
 	uprobe = alloc_uprobe(inode, offset);
 	if (uprobe && !add_consumer(uprobe, consumer)) {
-		ret = __register_uprobe(inode, offset, uprobe);
+		ret = __register_uprobe(uprobe);
 		if (ret) {
 			uprobe->consumers = NULL;
-			__unregister_uprobe(inode, offset, uprobe);
+			__unregister_uprobe(uprobe); // -ENOMEM
 		} else
 			uprobe->flags |= UPROBES_RUN_HANDLER;
 	}
@@ -894,7 +858,7 @@ void unregister_uprobe(struct inode *inode, loff_t offset,
 	}
 
 	if (!uprobe->consumers) {
-		__unregister_uprobe(inode, offset, uprobe);
+		__unregister_uprobe(uprobe); // XXX -ENOMEM
 		uprobe->flags &= ~UPROBES_RUN_HANDLER;
 	}
 	mutex_unlock(uprobes_hash(inode));



* Re: [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap.
  2011-12-01 11:36                   ` Peter Zijlstra
@ 2011-12-01 13:24                     ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-12-01 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	tulasidhard

> > I was following the general convention used within the kernel of not
> > bothering about an area that we are going to unmap. For example, if a
> > ptraced area were to be unmapped or remapped, I don't see the breakpoint
> > being removed and added back. Also, if a ptraced process is exiting, we
> > don't go about removing the installed breakpoints.
> > 
> > Also, we would still need the check for EEXIST and read_opcode to handle
> > the fork() case. So even if we add an extra line to remove the actual
> > breakpoint in munmap, it doesn't make the code any simpler.
> 
> Not adding the counter now does though. The whole mm->mm_uprobes_count
> thing itself is basically an optimization.
> 
> Without it we'll get to uprobe_notify_resume() too often, but who cares.
> And not having to worry about it removes a lot of this complexity.
> 
> Then in the patch where you introduce this optimization you can list all
> the nitty gritty details of mremap/fork and counter balancing.
> 

Okay, I will move the optimization parts into a separate patch and keep
it at the end of the patchset.

> Another point, maybe add some comments on how the generic bits of
> uprobe_notify_resume()/uprobe_bkpt_notifier()/uprobe_post_notifier() etc
> hang together and what the arch stuff should do. 
> 
> Currently I have to flip back and forth between those to figure out what
> happens.
> 
> Having that information also helps validate that x86 does indeed do what
> is expected and helps other arch maintainers write their code without
> having to grok wtf x86 does.
> 

Okay, will work towards this.

-- 
Thanks and Regards
Srikar



* Re: [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes.
  2011-11-29 10:52       ` Peter Zijlstra
@ 2011-12-01 13:41         ` Srikar Dronamraju
  0 siblings, 0 replies; 106+ messages in thread
From: Srikar Dronamraju @ 2011-12-01 13:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Oleg Nesterov, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson

> 
> You could use the stuff from patch 29 to effectively disable the uprobe
> and return -ENOMEM to whoever is unregistering. Basically failing the
> unreg.
> 
> That way you can leave the uprobe in existence and half installed but
> functionally fully disabled. Userspace (assuming we go back that far)
> can then either re-try the removal later, or even reinstate it by doing
> a register again or so.
> 
> It's still not pretty, but it's better than pretending the unreg
> completed.
> 

This approach has its own disadvantages. perf record, which calls
unregister_uprobe(), might get stuck under low-memory conditions while
it tries to complete unregistration. Also, the user would be confused if
the tracer were still collecting information after unregister_uprobe()
had returned an error.

So I would still think that using a kworker thread to complete
unregistration under a low-memory condition might be a better solution.

While I work on getting the kworker thread implementation ready, we
could delay deleting the probe, set the not_run_handler flag, and
also see if we can remove the breakpoint when the breakpoint is hit.

This way, the worst that can happen is that the probed processes
still take a hit.

If the kworker thread were to face a low-memory situation, it would
try to schedule another kworker thread, or itself again, at a later
point in time. I still need to investigate this some more.

-- 
Thanks and Regards
Srikar



* Re: [PATCH RFC 0/5] uprobes: kill xol vma
  2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
                     ` (6 preceding siblings ...)
  2011-11-29 10:30   ` Srikar Dronamraju
@ 2011-12-12 17:30   ` Oleg Nesterov
  7 siblings, 0 replies; 106+ messages in thread
From: Oleg Nesterov @ 2011-12-12 17:30 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Linus Torvalds, Andrew Morton, LKML, Linux-mm,
	Ingo Molnar, Andi Kleen, Christoph Hellwig, Steven Rostedt,
	Roland McGrath, Thomas Gleixner, Masami Hiramatsu,
	Arnaldo Carvalho de Melo, Anton Arapov,
	Ananth N Mavinakayanahalli, Jim Keniston, Stephen Wilson,
	Josh Stone

On 11/28, Oleg Nesterov wrote:
>
> On top of this series, not for inclusion yet, just to explain what
> I mean. Maybe someone can test it ;)
>
> This series kills xol_vma. Instead we use the per_cpu-like xol slots.
>
> This is much simpler and more efficient. And this of course solves
> many problems we currently have with xol_vma.
>
> For example, we simply can not trust it. We do not know what actually
> we are going to execute in UTASK_SSTEP mode. An application can unmap
> this area and then do mmap(PROT_EXEC|PROT_WRITE, MAP_FIXED) to fool
> uprobes.
>
> The only disadvantage is that this adds a bit more arch-dependent
> code.
>
> The main question, can this work?

OK, it almost works.

But, this way we can't probe the compat tasks. A __USER32_CS task can't
access the fix_to_virt() area, so it can't use uprobe_xol_slots[].

Many thanks to Josh who noticed this.

I'll try to think more, but so far I do not see any simple solution.

Oleg.



Thread overview: 106+ messages
2011-11-18 11:06 [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
2011-11-18 11:06 ` [PATCH v7 3.2-rc2 1/30] uprobes: Auxillary routines to insert, find, delete uprobes Srikar Dronamraju
2011-11-23 18:23   ` Peter Zijlstra
2011-11-18 11:07 ` [PATCH v7 3.2-rc2 2/30] uprobes: Allow multiple consumers for an uprobe Srikar Dronamraju
2011-11-18 11:07 ` [PATCH v7 3.2-rc2 3/30] uprobes: register/unregister probes Srikar Dronamraju
2011-11-23 16:09   ` Peter Zijlstra
2011-11-23 16:11     ` Peter Zijlstra
2011-11-24 14:39     ` Srikar Dronamraju
2011-11-23 16:22   ` Peter Zijlstra
2011-11-23 16:27   ` Peter Zijlstra
2011-11-23 16:35   ` Peter Zijlstra
2011-11-28 15:29   ` Peter Zijlstra
2011-11-29  7:48     ` Srikar Dronamraju
2011-11-29 10:52       ` Peter Zijlstra
2011-12-01 13:41         ` Srikar Dronamraju
2011-12-01 13:20   ` Peter Zijlstra
2011-11-18 11:07 ` [PATCH v7 3.2-rc2 4/30] uprobes: Define hooks for mmap/munmap Srikar Dronamraju
2011-11-23 17:13   ` Peter Zijlstra
2011-11-23 18:10   ` Peter Zijlstra
2011-11-24 13:47     ` Srikar Dronamraju
2011-11-24 14:13       ` Peter Zijlstra
2011-11-24 14:25         ` Srikar Dronamraju
2011-11-28 14:59       ` Peter Zijlstra
2011-11-29  8:33         ` Srikar Dronamraju
2011-11-29 11:48           ` Peter Zijlstra
2011-11-29 15:05             ` Peter Zijlstra
2011-11-30  5:50               ` Srikar Dronamraju
2011-11-29 16:22             ` Srikar Dronamraju
2011-11-30 12:25               ` Peter Zijlstra
2011-12-01  5:40                 ` Srikar Dronamraju
2011-12-01 11:36                   ` Peter Zijlstra
2011-12-01 13:24                     ` Srikar Dronamraju
2011-11-30  5:30           ` Srikar Dronamraju
2011-11-23 18:15   ` Peter Zijlstra
2011-11-23 19:50     ` Steven Rostedt
2011-11-24 13:37     ` Srikar Dronamraju
2011-11-24 13:47       ` Peter Zijlstra
2011-11-18 11:07 ` [PATCH v7 3.2-rc2 5/30] uprobes: copy of the original instruction Srikar Dronamraju
2011-11-23 18:26   ` Peter Zijlstra
2011-11-23 18:40   ` Peter Zijlstra
2011-11-23 19:49     ` Steven Rostedt
2011-11-23 20:52       ` Peter Zijlstra
2011-11-24 12:50     ` Srikar Dronamraju
2011-11-28 14:23   ` Peter Zijlstra
2011-11-18 11:07 ` [PATCH v7 3.2-rc2 6/30] uprobes: define fixups Srikar Dronamraju
2011-11-18 11:07 ` [PATCH v7 3.2-rc2 7/30] uprobes: uprobes arch info Srikar Dronamraju
2011-11-18 11:08 ` [PATCH v7 3.2-rc2 8/30] x86: analyze instruction and determine fixups Srikar Dronamraju
2011-11-30 18:57   ` Oleg Nesterov
2011-12-01  5:52     ` Srikar Dronamraju
2011-11-18 11:08 ` [PATCH v7 3.2-rc2 9/30] uprobes: Background page replacement Srikar Dronamraju
2011-11-25 14:29   ` Peter Zijlstra
2011-11-25 14:54   ` Peter Zijlstra
2011-11-26  2:25     ` Srikar Dronamraju
2011-11-28 14:13   ` Peter Zijlstra
2011-11-29  7:49     ` Srikar Dronamraju
2011-11-28 15:01   ` Peter Zijlstra
2011-11-18 11:08 ` [PATCH v7 3.2-rc2 10/30] x86: Set instruction pointer Srikar Dronamraju
2011-11-18 11:08 ` [PATCH v7 3.2-rc2 11/30] x86: Introduce TIF_UPROBE FLAG Srikar Dronamraju
2011-11-18 11:09 ` [PATCH v7 3.2-rc2 12/30] uprobes: Handle breakpoint and Singlestep Srikar Dronamraju
2011-11-25 15:24   ` Peter Zijlstra
2011-11-26  2:22     ` Srikar Dronamraju
2011-11-18 11:09 ` [PATCH v7 3.2-rc2 13/30] x86: define a x86 specific exception notifier Srikar Dronamraju
2011-11-18 11:09 ` [PATCH v7 3.2-rc2 14/30] uprobe: register " Srikar Dronamraju
2011-11-18 11:09 ` [PATCH v7 3.2-rc2 15/30] x86: Define x86_64 specific uprobe_task_arch_info structure Srikar Dronamraju
2011-11-18 11:09 ` [PATCH v7 3.2-rc2 16/30] uprobes: Introduce " Srikar Dronamraju
2011-11-18 11:09 ` [PATCH v7 3.2-rc2 17/30] x86: arch specific hooks for pre/post singlestep handling Srikar Dronamraju
2011-11-18 11:10 ` [PATCH v7 3.2-rc2 18/30] uprobes: slot allocation Srikar Dronamraju
2011-11-18 11:10 ` [PATCH v7 3.2-rc2 19/30] tracing: modify is_delete, is_return from ints to bool Srikar Dronamraju
2011-11-23 19:24   ` Steven Rostedt
2011-11-18 11:10 ` [PATCH v7 3.2-rc2 20/30] tracing: Extract out common code for kprobes/uprobes traceevents Srikar Dronamraju
2011-11-23 19:32   ` Steven Rostedt
2011-11-24 13:12     ` Srikar Dronamraju
2011-11-18 11:10 ` [PATCH v7 3.2-rc2 21/30] tracing: uprobes trace_event interface Srikar Dronamraju
2011-11-18 11:10 ` [PATCH v7 3.2-rc2 22/30] perf: rename target_module to target Srikar Dronamraju
2011-11-18 11:11 ` [PATCH v7 3.2-rc2 23/30] perf: perf interface for uprobes Srikar Dronamraju
2011-11-18 11:11 ` [PATCH v7 3.2-rc2 24/30] perf: show possible probes in a given executable file or library Srikar Dronamraju
2011-11-18 11:11 ` [PATCH v7 3.2-rc2 25/30] uprobes: call post_xol() unconditionally Srikar Dronamraju
2011-11-18 11:11 ` [PATCH v7 3.2-rc2 26/30] uprobes: introduce uprobe_deny_signal() Srikar Dronamraju
2011-11-18 11:12 ` [PATCH v7 3.2-rc2 27/30] uprobes: x86: introduce xol_was_trapped() Srikar Dronamraju
2011-11-18 11:12 ` [PATCH v7 3.2-rc2 28/30] uprobes: introduce UTASK_SSTEP_TRAPPED logic Srikar Dronamraju
2011-11-18 11:12 ` [PATCH v7 3.2-rc2 29/30] uprobes: Introduce uprobe flags Srikar Dronamraju
2011-11-18 11:12 ` [PATCH v7 3.2-rc2 30/30] x86: skip singlestep where possible Srikar Dronamraju
2011-11-22  5:03 ` [PATCH v7 3.2-rc2 0/30] uprobes patchset with perf probe support Srikar Dronamraju
2011-11-22 14:49   ` Stephen Rothwell
2011-11-23 13:20     ` Srikar Dronamraju
2011-11-23 13:38       ` Stephen Rothwell
2011-11-28 19:06 ` [PATCH RFC 0/5] uprobes: kill xol vma Oleg Nesterov
2011-11-28 19:06   ` [PATCH 1/5] uprobes: kill pre_ssout(), introduce set_xol_ip() Oleg Nesterov
2011-11-28 19:06   ` [PATCH 2/5] uprobes: introduce uprobe_switch_to() Oleg Nesterov
2011-11-28 19:53     ` Peter Zijlstra
2011-11-29 17:18       ` Oleg Nesterov
2011-11-30 12:11         ` Peter Zijlstra
2011-11-30 17:10           ` Oleg Nesterov
2011-11-28 19:07   ` [PATCH 3/5] uprobes: introduce uprobe_xol_slots[NR_CPUS] Oleg Nesterov
2011-11-28 19:48     ` Peter Zijlstra
2011-11-28 19:52       ` Peter Zijlstra
2011-11-29 18:24     ` Oleg Nesterov
2011-11-28 19:07   ` [PATCH 4/5] uprobes: teach set_xol_ip() to use uprobe_xol_slots[] Oleg Nesterov
2011-11-28 19:07   ` [PATCH 5/5] uprobes: remove the uprobes_xol_area code Oleg Nesterov
2011-11-28 19:57   ` [PATCH RFC 0/5] uprobes: kill xol vma Peter Zijlstra
2011-11-29 10:30   ` Srikar Dronamraju
2011-11-29 18:26     ` Oleg Nesterov
2011-11-30 16:15       ` Andi Kleen
2011-11-30 16:20         ` Peter Zijlstra
2011-11-30 18:47           ` Oleg Nesterov
2011-12-12 17:30   ` Oleg Nesterov
