* [PATCH v5 3.1.0-rc4-tip 0/26]   Uprobes patchset with perf probe support
@ 2011-09-20 11:59 ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 11:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath, LKML


This patchset implements Uprobes which enables you to dynamically break
into any routine in a user space application and collect information
non-disruptively.

This patchset resolves most of the comments on the previous posting
(https://lkml.org/lkml/2011/6/7/232). The patchset applies on top of tip
commit e467f18f945c83e66.

Uprobes Patches
This patchset implements inode-based uprobes, which are specified as
<file>:<offset> where offset is the offset from the start of the map.

When a uprobe is registered, Uprobes makes a copy of the probed
instruction and replaces the first byte(s) of the probed instruction with
a breakpoint instruction. (Uprobes uses a background page replacement
mechanism and ensures that the breakpoint affects only that process.)

When a CPU hits the breakpoint instruction, Uprobes is notified of the
trap and finds the associated uprobe. It then executes the associated
handler. Uprobes single-steps its copy of the probed instruction and
resumes execution of the probed process at the instruction following the
probepoint. Instruction copies to be single-stepped are stored in a
per-mm "execution out of line (XOL) area". Currently the XOL area is
allocated as a one-page vma.

For previous postings, please refer to: https://lkml.org/lkml/2011/4/1/176
http://lkml.org/lkml/2011/3/14/171/ http://lkml.org/lkml/2010/12/16/65
http://lkml.org/lkml/2010/8/25/165 http://lkml.org/lkml/2010/7/27/121
http://lkml.org/lkml/2010/7/12/67 http://lkml.org/lkml/2010/7/8/239
http://lkml.org/lkml/2010/6/29/299 http://lkml.org/lkml/2010/6/14/41
http://lkml.org/lkml/2010/3/20/107 and http://lkml.org/lkml/2010/5/18/307

This patchset is a rework based on suggestions from discussions on lkml
in September, March and January 2010 (http://lkml.org/lkml/2010/1/11/92,
http://lkml.org/lkml/2010/1/27/19, http://lkml.org/lkml/2010/3/20/107
and http://lkml.org/lkml/2010/3/31/199). This implementation of uprobes
doesn't depend on utrace.

Advantages of uprobes over conventional debugging include:

1. Non-disruptive.
Unlike current ptrace-based mechanisms, uprobes tracing doesn't
involve signals, stopping threads, or context switching between the
tracer and tracee.

2. Much better handling of multithreaded programs because of XOL.
Current ptrace-based mechanisms use inline single-stepping, i.e. they
copy back the original instruction on hitting a breakpoint. With such
mechanisms, tracers have to stop all threads on a breakpoint hit, or
they will not be able to handle all hits to the location of
interest. Uprobes uses execution out of line: the instruction to
be traced is analysed at the time of breakpoint insertion, and a copy
of the instruction is stored at a different location. On a breakpoint
hit, uprobes jumps to that copied location, single-steps the
instruction there, and does the necessary fixups after single-stepping.

3. Multiple tracers for an application.
Multiple uprobes-based tracers could work in unison to trace an
application. One tracer could be interested in generic events for a
particular set of processes, while another tracer is interested only
in one specific event of a particular process that is part of the
previous set of processes.

4. Correlating events from the kernel and userspace.
Uprobes could be used with other tools like kprobes and tracepoints, or
as part of higher-level tools like perf, to give a consolidated set of
events from the kernel and userspace. In future we could look at a
single backtrace showing application, library and kernel calls.

Changes from the last patchset:
- mmap_uprobe doesn't drop mmap_sem anymore.
- Introduced uprobes_mmap_mutex to serialize mmap_uprobes.
- Uses i_mutex instead of uprobes_mutex.
- Introduces munmap_uprobes.
- Change in the perf probe interface as recommended by Masami.
- Doesn't depend on get_user_pages to do the COW.
- Slot allocation changed from per-task to a shared slot allocation mechanism.

Here is the list of TODO items:

- Prefiltering (i.e. filtering at the time of probe insertion).
- Return probes.
- Support for other architectures.
- Uprobes booster.
- Replace macro W with bits in the inat table.

Please refer to "[RFC] [PATCH 3.1-rc4-tip 21/26] tracing: Uprobe
tracer documentation" on how to use the uprobe tracer.

Please refer to "[RFC] [PATCH 3.1-rc4-tip 25/26] perf: Documentation for
perf uprobes" on how to use perf probe with uprobes.

Please do provide your valuable comments.

Thanks in advance.
Srikar
---
 0 files changed, 0 insertions(+), 0 deletions(-)


Srikar Dronamraju (26)
 0: Uprobes patchset with perf probe support
 1: uprobes: Auxillary routines to insert, find, delete uprobes
 2: Uprobes: Allow multiple consumers for an uprobe.
 3: Uprobes: register/unregister probes.
 4: uprobes: Define hooks for mmap/munmap.
 5: Uprobes: copy of the original instruction.
 6: Uprobes: define fixups.
 7: Uprobes: uprobes arch info
 8: x86: analyze instruction and determine fixups.
 9: Uprobes: Background page replacement.
10: x86: Set instruction pointer.
11: x86: Introduce TIF_UPROBE FLAG.
12: Uprobes: Handle breakpoint and Singlestep
13: x86: define a x86 specific exception notifier.
14: uprobe: register exception notifier
15: x86: Define x86_64 specific uprobe_task_arch_info structure
16: uprobes: Introduce uprobe_task_arch_info structure.
17: x86: arch specific hooks for pre/post singlestep handling.
18: uprobes: slot allocation.
19: tracing: Extract out common code for kprobes/uprobes traceevents.
20: tracing: uprobes trace_event interface
21: tracing: uprobes Documentation
22: perf: rename target_module to target
23: perf: perf interface for uprobes
24: perf: show possible probes in a given executable file or library.
25: perf: Documentation for perf uprobes
26: uprobes: queue signals while thread is singlestepping.


 Documentation/trace/uprobetracer.txt    |   94 ++
 arch/Kconfig                            |    4 +
 arch/x86/Kconfig                        |    3 +
 arch/x86/include/asm/thread_info.h      |    2 +
 arch/x86/include/asm/uprobes.h          |   54 ++
 arch/x86/kernel/Makefile                |    1 +
 arch/x86/kernel/signal.c                |   14 +
 arch/x86/kernel/uprobes.c               |  562 ++++++++++++
 include/linux/mm_types.h                |    5 +
 include/linux/sched.h                   |    3 +
 include/linux/uprobes.h                 |  165 ++++
 kernel/Makefile                         |    1 +
 kernel/fork.c                           |   11 +
 kernel/signal.c                         |   22 +-
 kernel/trace/Kconfig                    |   20 +
 kernel/trace/Makefile                   |    2 +
 kernel/trace/trace.h                    |    5 +
 kernel/trace/trace_kprobe.c             |  894 +------------------
 kernel/trace/trace_probe.c              |  784 ++++++++++++++++
 kernel/trace/trace_probe.h              |  162 ++++
 kernel/trace/trace_uprobe.c             |  770 ++++++++++++++++
 kernel/uprobes.c                        | 1475 +++++++++++++++++++++++++++++++
 mm/memory.c                             |    4 +
 mm/mmap.c                               |    6 +
 tools/perf/Documentation/perf-probe.txt |   14 +
 tools/perf/builtin-probe.c              |   49 +-
 tools/perf/util/probe-event.c           |  410 +++++++--
 tools/perf/util/probe-event.h           |   12 +-
 tools/perf/util/symbol.c                |   10 +-
 tools/perf/util/symbol.h                |    1 +
 30 files changed, 4581 insertions(+), 978 deletions(-)



* [PATCH v5 3.1.0-rc4-tip 1/26]   uprobes: Auxillary routines to insert, find, delete uprobes
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 11:59   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 11:59 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton


Uprobes are maintained in an rb-tree indexed by inode and offset (the
offset from the start of the map). For a given inode:offset combination,
there can be at most one uprobe in the rbtree. Provide routines that
insert a given uprobe and find a uprobe given an inode and offset.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |   35 +++++++++
 kernel/uprobes.c        |  174 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 209 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/uprobes.h
 create mode 100644 kernel/uprobes.c

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
new file mode 100644
index 0000000..bfb85c4
--- /dev/null
+++ b/include/linux/uprobes.h
@@ -0,0 +1,35 @@
+#ifndef _LINUX_UPROBES_H
+#define _LINUX_UPROBES_H
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/rbtree.h>
+
+struct uprobe {
+	struct rb_node		rb_node;	/* node in the rb tree */
+	atomic_t		ref;
+	struct inode		*inode;		/* Also hold a ref to inode */
+	loff_t			offset;
+};
+
+#endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
new file mode 100644
index 0000000..e452147
--- /dev/null
+++ b/kernel/uprobes.c
@@ -0,0 +1,174 @@
+/*
+ * Userspace Probes (UProbes)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/highmem.h>
+#include <linux/slab.h>
+#include <linux/uprobes.h>
+
+static struct rb_root uprobes_tree = RB_ROOT;
+static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize (un)register */
+
+static int match_uprobe(struct uprobe *l, struct uprobe *r)
+{
+	if (l->inode < r->inode)
+		return -1;
+	if (l->inode > r->inode)
+		return 1;
+	else {
+		if (l->offset < r->offset)
+			return -1;
+
+		if (l->offset > r->offset)
+			return 1;
+	}
+
+	return 0;
+}
+
+static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset)
+{
+	struct uprobe u = { .inode = inode, .offset = offset };
+	struct rb_node *n = uprobes_tree.rb_node;
+	struct uprobe *uprobe;
+	int match;
+
+	while (n) {
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		match = match_uprobe(&u, uprobe);
+		if (!match) {
+			atomic_inc(&uprobe->ref);
+			return uprobe;
+		}
+		if (match < 0)
+			n = n->rb_left;
+		else
+			n = n->rb_right;
+
+	}
+	return NULL;
+}
+
+/*
+ * Find a uprobe corresponding to a given inode:offset
+ * Acquires uprobes_treelock
+ */
+static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
+{
+	struct uprobe *uprobe;
+	unsigned long flags;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	uprobe = __find_uprobe(inode, offset);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	return uprobe;
+}
+
+static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
+{
+	struct rb_node **p = &uprobes_tree.rb_node;
+	struct rb_node *parent = NULL;
+	struct uprobe *u;
+	int match;
+
+	while (*p) {
+		parent = *p;
+		u = rb_entry(parent, struct uprobe, rb_node);
+		match = match_uprobe(uprobe, u);
+		if (!match) {
+			atomic_inc(&u->ref);
+			return u;
+		}
+
+		if (match < 0)
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+
+	}
+	u = NULL;
+	rb_link_node(&uprobe->rb_node, parent, p);
+	rb_insert_color(&uprobe->rb_node, &uprobes_tree);
+	/* get access + drop ref */
+	atomic_set(&uprobe->ref, 2);
+	return u;
+}
+
+/*
+ * Acquires uprobes_treelock.
+ * Matching uprobe already exists in rbtree;
+ *	increment (access refcount) and return the matching uprobe.
+ *
+ * No matching uprobe; insert the uprobe in rb_tree;
+ *	get a double refcount (access + creation) and return NULL.
+ */
+static struct uprobe *insert_uprobe(struct uprobe *uprobe)
+{
+	unsigned long flags;
+	struct uprobe *u;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	u = __insert_uprobe(uprobe);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	return u;
+}
+
+static void put_uprobe(struct uprobe *uprobe)
+{
+	if (atomic_dec_and_test(&uprobe->ref))
+		kfree(uprobe);
+}
+
+static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
+{
+	struct uprobe *uprobe, *cur_uprobe;
+
+	uprobe = kzalloc(sizeof(struct uprobe), GFP_KERNEL);
+	if (!uprobe)
+		return NULL;
+
+	uprobe->inode = igrab(inode);
+	uprobe->offset = offset;
+
+	/* add to uprobes_tree, sorted on inode:offset */
+	cur_uprobe = insert_uprobe(uprobe);
+
+	/* a uprobe exists for this inode:offset combination*/
+	if (cur_uprobe) {
+		kfree(uprobe);
+		uprobe = cur_uprobe;
+		iput(inode);
+	}
+	return uprobe;
+}
+
+static void delete_uprobe(struct uprobe *uprobe)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	rb_erase(&uprobe->rb_node, &uprobes_tree);
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+	put_uprobe(uprobe);
+	iput(uprobe->inode);
+}



* [PATCH v5 3.1.0-rc4-tip 2/26]   Uprobes: Allow multiple consumers for an uprobe.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:00   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML


Since there is a unique uprobe for an inode:offset combination, provide
the ability for users to have more than one consumer for a uprobe.

Each consumer defines a handler and a filter. The handler specifies the
routine to run on hitting a probepoint. The filter allows the handler to
be run selectively on hitting the probepoint. The handler and filter
become relevant when we handle a probe hit.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |   13 +++++++++++++
 kernel/uprobes.c        |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bfb85c4..bf31f7c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -25,9 +25,22 @@
 
 #include <linux/rbtree.h>
 
+struct uprobe_consumer {
+	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
+	/*
+	 * filter is optional; if a filter exists, the handler is
+	 * run if and only if the filter returns true.
+	 */
+	bool (*filter)(struct uprobe_consumer *self, struct task_struct *task);
+
+	struct uprobe_consumer *next;
+};
+
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	atomic_t		ref;
+	struct rw_semaphore	consumer_rwsem;
+	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
 };
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index e452147..ba9fd55 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -149,6 +149,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 
 	uprobe->inode = igrab(inode);
 	uprobe->offset = offset;
+	init_rwsem(&uprobe->consumer_rwsem);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
@@ -162,6 +163,46 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+/* Returns the previous consumer */
+static struct uprobe_consumer *add_consumer(struct uprobe *uprobe,
+				struct uprobe_consumer *consumer)
+{
+	down_write(&uprobe->consumer_rwsem);
+	consumer->next = uprobe->consumers;
+	uprobe->consumers = consumer;
+	up_write(&uprobe->consumer_rwsem);
+	return consumer->next;
+}
+
+/*
+ * For uprobe @uprobe, delete the consumer @consumer.
+ * Return true if the @consumer is deleted successfully
+ * or return false.
+ */
+static bool del_consumer(struct uprobe *uprobe,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe_consumer *con;
+	bool ret = false;
+
+	down_write(&uprobe->consumer_rwsem);
+	con = uprobe->consumers;
+	if (consumer == con) {
+		uprobe->consumers = con->next;
+		ret = true;
+	} else {
+		for (; con; con = con->next) {
+			if (con->next == consumer) {
+				con->next = consumer->next;
+				ret = true;
+				break;
+			}
+		}
+	}
+	up_write(&uprobe->consumer_rwsem);
+	return ret;
+}
+
 static void delete_uprobe(struct uprobe *uprobe)
 {
 	unsigned long flags;

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:00   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton


A probe is specified by a file:offset.  On registration, a breakpoint
is inserted for the first consumer; on subsequent registrations, the
consumer is appended to the existing consumers.  On unregistration, the
breakpoint is removed if the consumer happens to be the last one; for
all other unregistrations, the consumer is simply deleted from the list
of consumers.

Probe specifications are maintained in an rb tree.  A probe
specification is converted into a uprobe before being stored in the rb
tree.  A uprobe can be shared by many consumers.

Given an inode, we get the list of mm's that have mapped the inode.
However, we want to limit the probes to certain processes/threads, and
the filtering should be at thread level.  To limit the probes that way,
we would walk through the list of threads whose mm member refers to a
given mm.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/Kconfig            |    9 ++
 include/linux/uprobes.h |   16 +++
 kernel/Makefile         |    1 
 kernel/uprobes.c        |  263 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 289 insertions(+), 0 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 4b0669c..dedd489 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -61,6 +61,15 @@ config OPTPROBES
 	depends on KPROBES && HAVE_OPTPROBES
 	depends on !PREEMPT
 
+config UPROBES
+	bool "User-space probes (EXPERIMENTAL)"
+	help
+	  Uprobes enables kernel subsystems to establish probepoints
+	  in user applications and execute handler functions when
+	  the probepoints are hit.
+
+	  If in doubt, say "N".
+
 config HAVE_EFFICIENT_UNALIGNED_ACCESS
 	bool
 	help
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bf31f7c..6d5a3fe 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -45,4 +45,20 @@ struct uprobe {
 	loff_t			offset;
 };
 
+#ifdef CONFIG_UPROBES
+extern int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer);
+extern void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer);
+#else /* CONFIG_UPROBES is not defined */
+static inline int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	return -ENOSYS;
+}
+static inline void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+}
+#endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/Makefile b/kernel/Makefile
index eca595e..aa810c8 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -108,6 +108,7 @@ obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
 obj-$(CONFIG_CRASH_DUMP) += crash_dump.o
 obj-$(CONFIG_JUMP_LABEL) += jump_label.o
+obj-$(CONFIG_UPROBES) += uprobes.o
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index ba9fd55..eeb6ed5 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -24,11 +24,40 @@
 #include <linux/kernel.h>
 #include <linux/highmem.h>
 #include <linux/slab.h>
+#include <linux/sched.h>
 #include <linux/uprobes.h>
 
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize (un)register */
 
+/*
+ * Maintain temporary per-vma info that can be used to check if a vma
+ * has already been handled. This structure is introduced since
+ * extending vm_area_struct wasn't recommended.
+ */
+struct vma_info {
+	struct list_head probe_list;
+	struct mm_struct *mm;
+	loff_t vaddr;
+};
+
+/*
+ * valid_vma: Verify if the specified vma is an executable vma
+ *	- Return true if the vma is file-backed and mapped
+ *	  readable and executable, but not writable or shared.
+ */
+static bool valid_vma(struct vm_area_struct *vma)
+{
+	if (!vma->vm_file)
+		return false;
+
+	if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
+						(VM_READ|VM_EXEC))
+		return true;
+
+	return false;
+}
+
 static int match_uprobe(struct uprobe *l, struct uprobe *r)
 {
 	if (l->inode < r->inode)
@@ -203,6 +232,18 @@ static bool del_consumer(struct uprobe *uprobe,
 	return ret;
 }
 
+static int install_breakpoint(struct mm_struct *mm)
+{
+	/* Placeholder: Yet to be implemented */
+	return 0;
+}
+
+static void remove_breakpoint(struct mm_struct *mm)
+{
+	/* Placeholder: Yet to be implemented */
+	return;
+}
+
 static void delete_uprobe(struct uprobe *uprobe)
 {
 	unsigned long flags;
@@ -213,3 +254,225 @@ static void delete_uprobe(struct uprobe *uprobe)
 	put_uprobe(uprobe);
 	iput(uprobe->inode);
 }
+
+static struct vma_info *__find_next_vma_info(struct list_head *head,
+			loff_t offset, struct address_space *mapping,
+			struct vma_info *vi)
+{
+	struct prio_tree_iter iter;
+	struct vm_area_struct *vma;
+	struct vma_info *tmpvi;
+	loff_t vaddr;
+	unsigned long pgoff = offset >> PAGE_SHIFT;
+	int existing_vma;
+
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
+		if (!vma || !valid_vma(vma))
+			return NULL;
+
+		existing_vma = 0;
+		vaddr = vma->vm_start + offset;
+		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+		list_for_each_entry(tmpvi, head, probe_list) {
+			if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
+				existing_vma = 1;
+				break;
+			}
+		}
+		if (!existing_vma &&
+				atomic_inc_not_zero(&vma->vm_mm->mm_users)) {
+			vi->mm = vma->vm_mm;
+			vi->vaddr = vaddr;
+			list_add(&vi->probe_list, head);
+			return vi;
+		}
+	}
+	return NULL;
+}
+
+/*
+ * Iterate in the rmap prio tree  and find a vma where a probe has not
+ * yet been inserted.
+ */
+static struct vma_info *find_next_vma_info(struct list_head *head,
+			loff_t offset, struct address_space *mapping)
+{
+	struct vma_info *vi, *retvi;
+	vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
+	if (!vi)
+		return ERR_PTR(-ENOMEM);
+
+	INIT_LIST_HEAD(&vi->probe_list);
+	mutex_lock(&mapping->i_mmap_mutex);
+	retvi = __find_next_vma_info(head, offset, mapping, vi);
+	mutex_unlock(&mapping->i_mmap_mutex);
+
+	if (!retvi)
+		kfree(vi);
+	return retvi;
+}
+
+static int __register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe *uprobe)
+{
+	struct list_head try_list;
+	struct vm_area_struct *vma;
+	struct address_space *mapping;
+	struct vma_info *vi, *tmpvi;
+	struct mm_struct *mm;
+	int ret = 0;
+
+	mapping = inode->i_mapping;
+	INIT_LIST_HEAD(&try_list);
+	while ((vi = find_next_vma_info(&try_list, offset,
+							mapping)) != NULL) {
+		if (IS_ERR(vi)) {
+			ret = -ENOMEM;
+			break;
+		}
+		mm = vi->mm;
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, (unsigned long) vi->vaddr);
+		if (!vma || !valid_vma(vma)) {
+			list_del(&vi->probe_list);
+			kfree(vi);
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			continue;
+		}
+		ret = install_breakpoint(mm);
+		if (ret && ret != -ESRCH && ret != -EEXIST) {
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			break;
+		}
+		ret = 0;
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+	}
+	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
+		list_del(&vi->probe_list);
+		kfree(vi);
+	}
+	return ret;
+}
+
+static void __unregister_uprobe(struct inode *inode, loff_t offset,
+						struct uprobe *uprobe)
+{
+	struct list_head try_list;
+	struct address_space *mapping;
+	struct vma_info *vi, *tmpvi;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+
+	mapping = inode->i_mapping;
+	INIT_LIST_HEAD(&try_list);
+	while ((vi = find_next_vma_info(&try_list, offset,
+							mapping)) != NULL) {
+		if (IS_ERR(vi))
+			break;
+		mm = vi->mm;
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, (unsigned long) vi->vaddr);
+		if (!vma || !valid_vma(vma)) {
+			list_del(&vi->probe_list);
+			kfree(vi);
+			up_read(&mm->mmap_sem);
+			mmput(mm);
+			continue;
+		}
+		remove_breakpoint(mm);
+		up_read(&mm->mmap_sem);
+		mmput(mm);
+	}
+
+	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
+		list_del(&vi->probe_list);
+		kfree(vi);
+	}
+	delete_uprobe(uprobe);
+}
+
+/*
+ * register_uprobe - register a probe
+ * @inode: the file in which the probe has to be placed.
+ * @offset: offset from the start of the file.
+ * @consumer: information on how to handle the probe.
+ *
+ * Apart from the access refcount, register_uprobe() takes a creation
+ * refcount (through alloc_uprobe) if and only if this @uprobe is getting
+ * inserted into the rbtree (i.e. it is the first consumer for the
+ * @inode:@offset tuple).  The creation refcount stops unregister_uprobe
+ * from freeing the @uprobe even before the register operation is
+ * complete.  The creation refcount is released when the last @consumer
+ * for the @uprobe unregisters.
+ *
+ * Return an errno if probes cannot be successfully installed,
+ * else return 0 (success).
+ */
+int register_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe *uprobe;
+	int ret = 0;
+
+	inode = igrab(inode);
+	if (!inode || !consumer || consumer->next)
+		return -EINVAL;
+
+	if (offset > inode->i_size)
+		return -EINVAL;
+
+	mutex_lock(&inode->i_mutex);
+	uprobe = alloc_uprobe(inode, offset);
+	if (!uprobe)
+		return -ENOMEM;
+
+	if (!add_consumer(uprobe, consumer)) {
+		ret = __register_uprobe(inode, offset, uprobe);
+		if (ret)
+			__unregister_uprobe(inode, offset, uprobe);
+	}
+
+	mutex_unlock(&inode->i_mutex);
+	put_uprobe(uprobe);
+	iput(inode);
+	return ret;
+}
+
+/*
+ * unregister_uprobe - unregister an already registered probe.
+ * @inode: the file from which the probe has to be removed.
+ * @offset: offset from the start of the file.
+ * @consumer: identifies which consumer to remove when multiple
+ *	consumers share the probepoint.
+ */
+void unregister_uprobe(struct inode *inode, loff_t offset,
+				struct uprobe_consumer *consumer)
+{
+	struct uprobe *uprobe;
+
+	inode = igrab(inode);
+	if (!inode || !consumer)
+		return;
+
+	if (offset > inode->i_size)
+		return;
+
+	uprobe = find_uprobe(inode, offset);
+	if (!uprobe)
+		return;
+
+	if (!del_consumer(uprobe, consumer)) {
+		put_uprobe(uprobe);
+		return;
+	}
+
+	mutex_lock(&inode->i_mutex);
+	if (!uprobe->consumers)
+		__unregister_uprobe(inode, offset, uprobe);
+
+	mutex_unlock(&inode->i_mutex);
+	put_uprobe(uprobe);
+	iput(inode);
+}

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:00   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML


If an executable vma is being mapped, search for and insert the
corresponding probes.  On unmap, make sure the per-mm count is
decremented by the appropriate amount.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |    3 +
 include/linux/uprobes.h  |   12 +++
 kernel/fork.c            |    5 +
 kernel/uprobes.c         |  174 +++++++++++++++++++++++++++++++++++++++++++---
 mm/memory.c              |    4 +
 mm/mmap.c                |    6 ++
 6 files changed, 194 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 774b895..9aeb64f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -349,6 +349,9 @@ struct mm_struct {
 #ifdef CONFIG_CPUMASK_OFFSTACK
 	struct cpumask cpumask_allocation;
 #endif
+#ifdef CONFIG_UPROBES
+	atomic_t mm_uprobes_count;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 6d5a3fe..b4de058 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -25,6 +25,8 @@
 
 #include <linux/rbtree.h>
 
+struct vm_area_struct;
+
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	/*
@@ -40,6 +42,7 @@ struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
 	atomic_t		ref;
 	struct rw_semaphore	consumer_rwsem;
+	struct list_head	pending_list;
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
@@ -50,6 +53,8 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
+extern int mmap_uprobe(struct vm_area_struct *vma);
+extern void munmap_uprobe(struct vm_area_struct *vma);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
@@ -60,5 +65,12 @@ static inline void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
 {
 }
+static inline int mmap_uprobe(struct vm_area_struct *vma)
+{
+	return 0;
+}
+static inline void munmap_uprobe(struct vm_area_struct *vma)
+{
+}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 8e6b6f4..7cc0b51 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -66,6 +66,7 @@
 #include <linux/user-return-notifier.h>
 #include <linux/oom.h>
 #include <linux/khugepaged.h>
+#include <linux/uprobes.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
@@ -739,6 +740,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	mm->pmd_huge_pte = NULL;
 #endif
+#ifdef CONFIG_UPROBES
+	atomic_set(&mm->mm_uprobes_count,
+			atomic_read(&oldmm->mm_uprobes_count));
+#endif
 
 	if (!mm_init(mm, tsk))
 		goto fail_nomem;
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index eeb6ed5..5bc3f90 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -29,6 +29,7 @@
 
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize (un)register */
+static DEFINE_MUTEX(uprobes_mmap_mutex);	/* uprobe->pending_list */
 
 /*
  * Maintain a temporary per vma info that can be used to search if a vma
@@ -58,13 +59,23 @@ static bool valid_vma(struct vm_area_struct *vma)
 	return false;
 }
 
-static int match_uprobe(struct uprobe *l, struct uprobe *r)
+static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
 {
+	/*
+	 * If match_inode is non-NULL, report whether the
+	 * inodes at least match.
+	 */
+	if (match_inode)
+		*match_inode = 0;
+
 	if (l->inode < r->inode)
 		return -1;
 	if (l->inode > r->inode)
 		return 1;
 	else {
+		if (match_inode)
+			*match_inode = 1;
+
 		if (l->offset < r->offset)
 			return -1;
 
@@ -75,16 +86,20 @@ static int match_uprobe(struct uprobe *l, struct uprobe *r)
 	return 0;
 }
 
-static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset)
+static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset,
+					struct rb_node **close_match)
 {
 	struct uprobe u = { .inode = inode, .offset = offset };
 	struct rb_node *n = uprobes_tree.rb_node;
 	struct uprobe *uprobe;
-	int match;
+	int match, match_inode;
 
 	while (n) {
 		uprobe = rb_entry(n, struct uprobe, rb_node);
-		match = match_uprobe(&u, uprobe);
+		match = match_uprobe(&u, uprobe, &match_inode);
+		if (close_match && match_inode)
+			*close_match = n;
+
 		if (!match) {
 			atomic_inc(&uprobe->ref);
 			return uprobe;
@@ -108,7 +123,7 @@ static struct uprobe *find_uprobe(struct inode * inode, loff_t offset)
 	unsigned long flags;
 
 	spin_lock_irqsave(&uprobes_treelock, flags);
-	uprobe = __find_uprobe(inode, offset);
+	uprobe = __find_uprobe(inode, offset, NULL);
 	spin_unlock_irqrestore(&uprobes_treelock, flags);
 	return uprobe;
 }
@@ -123,7 +138,7 @@ static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
 	while (*p) {
 		parent = *p;
 		u = rb_entry(parent, struct uprobe, rb_node);
-		match = match_uprobe(uprobe, u);
+		match = match_uprobe(uprobe, u, NULL);
 		if (!match) {
 			atomic_inc(&u->ref);
 			return u;
@@ -179,6 +194,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 	uprobe->inode = igrab(inode);
 	uprobe->offset = offset;
 	init_rwsem(&uprobe->consumer_rwsem);
+	INIT_LIST_HEAD(&uprobe->pending_list);
 
 	/* add to uprobes_tree, sorted on inode:offset */
 	cur_uprobe = insert_uprobe(uprobe);
@@ -232,15 +248,21 @@ static bool del_consumer(struct uprobe *uprobe,
 	return ret;
 }
 
-static int install_breakpoint(struct mm_struct *mm)
+
+static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
 {
 	/* Placeholder: Yet to be implemented */
+	if (!uprobe->consumers)
+		return 0;
+
+	atomic_inc(&mm->mm_uprobes_count);
 	return 0;
 }
 
-static void remove_breakpoint(struct mm_struct *mm)
+static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
 {
 	/* Placeholder: Yet to be implemented */
+	atomic_dec(&mm->mm_uprobes_count);
 	return;
 }
 
@@ -340,7 +362,7 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		ret = install_breakpoint(mm);
+		ret = install_breakpoint(mm, uprobe);
 		if (ret && ret != -ESRCH && ret != -EEXIST) {
 			up_read(&mm->mmap_sem);
 			mmput(mm);
@@ -382,7 +404,7 @@ static void __unregister_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		remove_breakpoint(mm);
+		remove_breakpoint(mm, uprobe);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
@@ -476,3 +498,135 @@ void unregister_uprobe(struct inode *inode, loff_t offset,
 	put_uprobe(uprobe);
 	iput(inode);
 }
+
+/*
+ * For a given inode, build a list of probes that need to be inserted.
+ */
+static void build_probe_list(struct inode *inode, struct list_head *head)
+{
+	struct uprobe *uprobe;
+	struct rb_node *n;
+	unsigned long flags;
+
+	n = uprobes_tree.rb_node;
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	uprobe = __find_uprobe(inode, 0, &n);
+	/*
+	 * If there is a probe for this inode at offset zero,
+	 * release the reference taken in __find_uprobe().
+	 */
+	if (uprobe)
+		put_uprobe(uprobe);
+	for (; n; n = rb_next(n)) {
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		if (uprobe->inode != inode)
+			break;
+		list_add(&uprobe->pending_list, head);
+		atomic_inc(&uprobe->ref);
+	}
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+}
+
+/*
+ * Called from mmap_region.
+ * Called with mm->mmap_sem held.
+ *
+ * Return a negative errno if we fail to insert probes
+ * and cannot bail out.
+ * Return 0 otherwise, i.e.:
+ *	- probes were inserted successfully,
+ *	- (or) there were no probes to insert,
+ *	- (or) insertion failed but we could bail out.
+ */
+int mmap_uprobe(struct vm_area_struct *vma)
+{
+	struct list_head tmp_list;
+	struct uprobe *uprobe, *u;
+	struct inode *inode;
+	int ret = 0;
+
+	if (!valid_vma(vma))
+		return ret;	/* Bail-out */
+
+	inode = igrab(vma->vm_file->f_mapping->host);
+	if (!inode)
+		return ret;
+
+	INIT_LIST_HEAD(&tmp_list);
+	mutex_lock(&uprobes_mmap_mutex);
+	build_probe_list(inode, &tmp_list);
+	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
+		loff_t vaddr;
+
+		list_del(&uprobe->pending_list);
+		if (!ret && uprobe->consumers) {
+			vaddr = vma->vm_start + uprobe->offset;
+			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
+				continue;
+			ret = install_breakpoint(vma->vm_mm, uprobe);
+
+			if (ret && (ret == -ESRCH || ret == -EEXIST))
+				ret = 0;
+		}
+		put_uprobe(uprobe);
+	}
+
+	mutex_unlock(&uprobes_mmap_mutex);
+	iput(inode);
+	return ret;
+}
+
+static void dec_mm_uprobes_count(struct vm_area_struct *vma,
+		struct inode *inode)
+{
+	struct uprobe *uprobe;
+	struct rb_node *n;
+	unsigned long flags;
+
+	n = uprobes_tree.rb_node;
+	spin_lock_irqsave(&uprobes_treelock, flags);
+	uprobe = __find_uprobe(inode, 0, &n);
+
+	/*
+	 * If there is a probe for this inode at offset zero,
+	 * release the reference taken in __find_uprobe().
+	 */
+	if (uprobe)
+		put_uprobe(uprobe);
+	for (; n; n = rb_next(n)) {
+		loff_t vaddr;
+
+		uprobe = rb_entry(n, struct uprobe, rb_node);
+		if (uprobe->inode != inode)
+			break;
+		vaddr = vma->vm_start + uprobe->offset;
+		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
+		if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
+			continue;
+		atomic_dec(&vma->vm_mm->mm_uprobes_count);
+	}
+	spin_unlock_irqrestore(&uprobes_treelock, flags);
+}
+
+/*
+ * Called in context of a munmap of a vma.
+ */
+void munmap_uprobe(struct vm_area_struct *vma)
+{
+	struct inode *inode;
+
+	if (!valid_vma(vma))
+		return;		/* Bail-out */
+
+	if (!atomic_read(&vma->vm_mm->mm_uprobes_count))
+		return;
+
+	inode = igrab(vma->vm_file->f_mapping->host);
+	if (!inode)
+		return;
+
+	dec_mm_uprobes_count(vma, inode);
+	iput(inode);
+	return;
+}
diff --git a/mm/memory.c b/mm/memory.c
index a56e3ba..a65fd1f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
 #include <linux/swapops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
+#include <linux/uprobes.h>
 
 #include <asm/io.h>
 #include <asm/pgalloc.h>
@@ -1337,6 +1338,9 @@ unsigned long unmap_vmas(struct mmu_gather *tlb,
 		if (unlikely(is_pfn_mapping(vma)))
 			untrack_pfn_vma(vma, 0, 0);
 
+		if (vma->vm_file)
+			munmap_uprobe(vma);
+
 		while (start != end) {
 			if (unlikely(is_vm_hugetlb_page(vma))) {
 				/*
diff --git a/mm/mmap.c b/mm/mmap.c
index a65efd4..f51d482 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -30,6 +30,7 @@
 #include <linux/perf_event.h>
 #include <linux/audit.h>
 #include <linux/khugepaged.h>
+#include <linux/uprobes.h>
 
 #include <asm/uaccess.h>
 #include <asm/cacheflush.h>
@@ -1329,6 +1330,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			mm->locked_vm += (len >> PAGE_SHIFT);
 	} else if ((flags & MAP_POPULATE) && !(flags & MAP_NONBLOCK))
 		make_pages_present(addr, addr + len);
+
+	if (file && mmap_uprobe(vma))
+		/* matching probes but cannot insert */
+		goto unmap_and_free_vma;
+
 	return addr;
 
 unmap_and_free_vma:


^ permalink raw reply related	[flat|nested] 330+ messages in thread


* [PATCH v5 3.1.0-rc4-tip 5/26]   Uprobes: copy of the original instruction.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:00   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:00 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Andi Kleen, Andrew Morton


When inserting the first probepoint, save a copy of the original
instruction.  This copy is later used for fixup analysis, for copying to
the slot on probe-hit, and for restoring the original instruction.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/Kconfig            |    1 
 include/linux/uprobes.h |   12 ++++
 kernel/uprobes.c        |  142 +++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 143 insertions(+), 12 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index dedd489..d6a4e1d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -63,6 +63,7 @@ config OPTPROBES
 
 config UPROBES
 	bool "User-space probes (EXPERIMENTAL)"
+	select MM_OWNER
 	help
 	  Uprobes enables kernel subsystems to establish probepoints
 	  in user applications and execute handler functions when
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index b4de058..50a8c67 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -26,6 +26,12 @@
 #include <linux/rbtree.h>
 
 struct vm_area_struct;
+#ifdef CONFIG_ARCH_SUPPORTS_UPROBES
+#include <asm/uprobes.h>
+#else
+
+#define MAX_UINSN_BYTES 4
+#endif
 
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
@@ -46,9 +52,15 @@ struct uprobe {
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
+	int			copy;
+	u8			insn[MAX_UINSN_BYTES];
 };
 
 #ifdef CONFIG_UPROBES
+extern int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
+							unsigned long vaddr);
+extern int __weak set_orig_insn(struct task_struct *tsk,
+		struct uprobe *uprobe, unsigned long vaddr, bool verify);
 extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 5bc3f90..e0e10dd 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -23,6 +23,7 @@
 
 #include <linux/kernel.h>
 #include <linux/highmem.h>
+#include <linux/pagemap.h>	/* grab_cache_page */
 #include <linux/slab.h>
 #include <linux/sched.h>
 #include <linux/uprobes.h>
@@ -59,6 +60,20 @@ static bool valid_vma(struct vm_area_struct *vma)
 	return false;
 }
 
+int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
+						unsigned long vaddr)
+{
+	/* placeholder: yet to be implemented */
+	return 0;
+}
+
+int __weak set_orig_insn(struct task_struct *tsk, struct uprobe *uprobe,
+					unsigned long vaddr, bool verify)
+{
+	/* placeholder: yet to be implemented */
+	return 0;
+}
+
 static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
 {
 	/*
@@ -248,22 +263,125 @@ static bool del_consumer(struct uprobe *uprobe,
 	return ret;
 }
 
+static int __copy_insn(struct address_space *mapping,
+			struct vm_area_struct *vma, char *insn,
+			unsigned long nbytes, unsigned long offset)
+{
+	struct file *filp = vma->vm_file;
+	struct page *page;
+	void *vaddr;
+	unsigned long off1;
+	unsigned long idx;
+
+	if (!filp)
+		return -EINVAL;
+
+	idx = (unsigned long) (offset >> PAGE_CACHE_SHIFT);
+	off1 = offset &= ~PAGE_MASK;
+
+	/*
+	 * Ensure that the page that has the original instruction is
+	 * populated and in page-cache.
+	 */
+	page_cache_sync_readahead(mapping, &filp->f_ra, filp, idx, 1);
+	page = grab_cache_page(mapping, idx);
+	if (!page)
+		return -ENOMEM;
+
+	vaddr = kmap_atomic(page);
+	memcpy(insn, vaddr + off1, nbytes);
+	kunmap_atomic(vaddr);
+	unlock_page(page);
+	page_cache_release(page);
+	return 0;
+}
+
+static int copy_insn(struct uprobe *uprobe, struct vm_area_struct *vma,
+					unsigned long addr)
+{
+	struct address_space *mapping;
+	int bytes;
+	unsigned long nbytes;
+
+	addr &= ~PAGE_MASK;
+	nbytes = PAGE_SIZE - addr;
+	mapping = uprobe->inode->i_mapping;
+
+	/* Instruction at end of binary; copy only available bytes */
+	if (uprobe->offset + MAX_UINSN_BYTES > uprobe->inode->i_size)
+		bytes = uprobe->inode->i_size - uprobe->offset;
+	else
+		bytes = MAX_UINSN_BYTES;
+
+	/* Instruction at the page-boundary; copy bytes in second page */
+	if (nbytes < bytes) {
+		if (__copy_insn(mapping, vma, uprobe->insn + nbytes,
+				bytes - nbytes, uprobe->offset + nbytes))
+			return -ENOMEM;
+		bytes = nbytes;
+	}
+	return __copy_insn(mapping, vma, uprobe->insn, bytes, uprobe->offset);
+}
+
+static struct task_struct *get_mm_owner(struct mm_struct *mm)
+{
+	struct task_struct *tsk;
+
+	rcu_read_lock();
+	tsk = rcu_dereference(mm->owner);
+	if (tsk)
+		get_task_struct(tsk);
+	rcu_read_unlock();
+	return tsk;
+}
 
-static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
+static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
+				struct vm_area_struct *vma, loff_t vaddr)
 {
-	/* Placeholder: Yet to be implemented */
+	struct task_struct *tsk;
+	unsigned long addr;
+	int ret = -EINVAL;
+
 	if (!uprobe->consumers)
 		return 0;
 
-	atomic_inc(&mm->mm_uprobes_count);
-	return 0;
+	tsk = get_mm_owner(mm);
+	if (!tsk)	/* task is probably exiting; bail-out */
+		return -ESRCH;
+
+	if (vaddr > TASK_SIZE_OF(tsk))
+		goto put_return;
+
+	addr = (unsigned long) vaddr;
+	if (!uprobe->copy) {
+		ret = copy_insn(uprobe, vma, addr);
+		if (ret)
+			goto put_return;
+		/* TODO : Analysis and verification of instruction */
+		uprobe->copy = 1;
+	}
+
+	ret = set_bkpt(tsk, uprobe, addr);
+	if (!ret)
+		atomic_inc(&mm->mm_uprobes_count);
+
+put_return:
+	put_task_struct(tsk);
+	return ret;
 }
 
-static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
+static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
+							loff_t vaddr)
 {
-	/* Placeholder: Yet to be implemented */
-	atomic_dec(&mm->mm_uprobes_count);
-	return;
+	struct task_struct *tsk = get_mm_owner(mm);
+
+	if (!tsk)	/* task is probably exiting; bail-out */
+		return;
+
+	if (!set_orig_insn(tsk, uprobe, (unsigned long) vaddr, true))
+		atomic_dec(&mm->mm_uprobes_count);
+
+	put_task_struct(tsk);
 }
 
 static void delete_uprobe(struct uprobe *uprobe)
@@ -362,7 +480,7 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		ret = install_breakpoint(mm, uprobe);
+		ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
 		if (ret && ret != -ESRCH && ret != -EEXIST) {
 			up_read(&mm->mmap_sem);
 			mmput(mm);
@@ -404,7 +522,7 @@ static void __unregister_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		remove_breakpoint(mm, uprobe);
+		remove_breakpoint(mm, uprobe, vi->vaddr);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
@@ -564,8 +682,8 @@ int mmap_uprobe(struct vm_area_struct *vma)
 			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
 			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
 				continue;
-			ret = install_breakpoint(vma->vm_mm, uprobe);
-
+			ret = install_breakpoint(vma->vm_mm, uprobe, vma,
+								vaddr);
 			if (ret && (ret == -ESRCH || ret == -EEXIST))
 				ret = 0;
 		}

^ permalink raw reply related	[flat|nested] 330+ messages in thread

+	if (!ret)
+		atomic_inc(&mm->mm_uprobes_count);
+
+put_return:
+	put_task_struct(tsk);
+	return ret;
 }
 
-static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
+static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
+							loff_t vaddr)
 {
-	/* Placeholder: Yet to be implemented */
-	atomic_dec(&mm->mm_uprobes_count);
-	return;
+	struct task_struct *tsk = get_mm_owner(mm);
+
+	if (!tsk)	/* task is probably exiting; bail-out */
+		return;
+
+	if (!set_orig_insn(tsk, uprobe, (unsigned long) vaddr, true))
+		atomic_dec(&mm->mm_uprobes_count);
+
+	put_task_struct(tsk);
 }
 
 static void delete_uprobe(struct uprobe *uprobe)
@@ -362,7 +480,7 @@ static int __register_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		ret = install_breakpoint(mm, uprobe);
+		ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
 		if (ret && ret != -ESRCH && ret != -EEXIST) {
 			up_read(&mm->mmap_sem);
 			mmput(mm);
@@ -404,7 +522,7 @@ static void __unregister_uprobe(struct inode *inode, loff_t offset,
 			mmput(mm);
 			continue;
 		}
-		remove_breakpoint(mm, uprobe);
+		remove_breakpoint(mm, uprobe, vi->vaddr);
 		up_read(&mm->mmap_sem);
 		mmput(mm);
 	}
@@ -564,8 +682,8 @@ int mmap_uprobe(struct vm_area_struct *vma)
 			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
 			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
 				continue;
-			ret = install_breakpoint(vma->vm_mm, uprobe);
-
+			ret = install_breakpoint(vma->vm_mm, uprobe, vma,
+								vaddr);
 			if (ret && (ret == -ESRCH || ret == -EEXIST))
 				ret = 0;
 		}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 6/26]   Uprobes: define fixups.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:01   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Oleg Nesterov,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML


During the first insertion of a probepoint, the instruction is analyzed
for fixups and the result is cached in the per-uprobe struct. On a
probe hit, the cached fixups are used. Fixup analysis and caching are
done in arch-specific code.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 50a8c67..074c4e9 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -33,6 +33,17 @@ struct vm_area_struct;
 #define MAX_UINSN_BYTES 4
 #endif
 
+#define uprobe_opcode_sz sizeof(uprobe_opcode_t)
+
+/* Post-execution fixups.  Some architectures may define others. */
+
+/* No fixup needed */
+#define UPROBES_FIX_NONE	0x0
+/* Adjust IP back to vicinity of actual insn */
+#define UPROBES_FIX_IP	0x1
+/* Adjust the return address of a call insn */
+#define UPROBES_FIX_CALL	0x2
+
 struct uprobe_consumer {
 	int (*handler)(struct uprobe_consumer *self, struct pt_regs *regs);
 	/*
@@ -53,6 +64,7 @@ struct uprobe {
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
 	int			copy;
+	u16			fixups;
 	u8			insn[MAX_UINSN_BYTES];
 };
 

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 7/26]   Uprobes: uprobes arch info
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:01   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton


Introduce a per-uprobe arch info structure, used to store
architecture-specific details: for example, the information needed to
handle RIP-relative instructions on x86_64.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 074c4e9..2548b94 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -29,7 +29,7 @@ struct vm_area_struct;
 #ifdef CONFIG_ARCH_SUPPORTS_UPROBES
 #include <asm/uprobes.h>
 #else
-
+struct uprobe_arch_info {};
 #define MAX_UINSN_BYTES 4
 #endif
 
@@ -60,6 +60,7 @@ struct uprobe {
 	atomic_t		ref;
 	struct rw_semaphore	consumer_rwsem;
 	struct list_head	pending_list;
+	struct uprobe_arch_info arch_info;
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:01   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML


The instruction analysis is based on the x86 instruction decoder; it
determines whether an instruction can be probed and which fixups are
necessary after single-stepping.  The analysis is done at probe
insertion time so that we avoid repeating it every time the probe is
hit.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/Kconfig               |    3 
 arch/x86/include/asm/uprobes.h |   42 ++++
 arch/x86/kernel/Makefile       |    1 
 arch/x86/kernel/uprobes.c      |  385 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 431 insertions(+), 0 deletions(-)
 create mode 100644 arch/x86/include/asm/uprobes.h
 create mode 100644 arch/x86/kernel/uprobes.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f1833e3..39fd62b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -250,6 +250,9 @@ config ARCH_CPU_PROBE_RELEASE
 	def_bool y
 	depends on HOTPLUG_CPU
 
+config ARCH_SUPPORTS_UPROBES
+	def_bool y
+
 source "init/Kconfig"
 source "kernel/Kconfig.freezer"
 
diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
new file mode 100644
index 0000000..4295ce0
--- /dev/null
+++ b/arch/x86/include/asm/uprobes.h
@@ -0,0 +1,42 @@
+#ifndef _ASM_UPROBES_H
+#define _ASM_UPROBES_H
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+typedef u8 uprobe_opcode_t;
+#define MAX_UINSN_BYTES 16
+#define UPROBES_XOL_SLOT_BYTES	128	/* to keep it cache aligned */
+
+#define UPROBES_BKPT_INSN 0xcc
+#define UPROBES_BKPT_INSN_SIZE 1
+
+#ifdef CONFIG_X86_64
+struct uprobe_arch_info {
+	unsigned long rip_rela_target_address;
+};
+#else
+struct uprobe_arch_info {};
+#endif
+struct uprobe;
+extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
+#endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 82f2912..e95cc9d 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -98,6 +98,7 @@ obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
 
 obj-$(CONFIG_SWIOTLB)			+= pci-swiotlb.o
 obj-$(CONFIG_OF)			+= devicetree.o
+obj-$(CONFIG_UPROBES)			+= uprobes.o
 
 ###
 # 64 bit specific files
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
new file mode 100644
index 0000000..e4fd077
--- /dev/null
+++ b/arch/x86/kernel/uprobes.c
@@ -0,0 +1,385 @@
+/*
+ * Userspace Probes (UProbes) for x86
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2008-2011
+ * Authors:
+ *	Srikar Dronamraju
+ *	Jim Keniston
+ */
+
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/uprobes.h>
+
+#include <linux/kdebug.h>
+#include <asm/insn.h>
+
+#ifdef CONFIG_X86_32
+#define is_32bit_app(tsk) 1
+#else
+#define is_32bit_app(tsk) (test_tsk_thread_flag(tsk, TIF_IA32))
+#endif
+
+#define UPROBES_FIX_RIP_AX	0x8000
+#define UPROBES_FIX_RIP_CX	0x4000
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG(insn->modrm.value)
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
+	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
+	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
+	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
+	 << (row % 32))
+
+
+static const u32 good_insns_64[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 20 */
+	W(0x30, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 30 */
+	W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+	W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+	W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+
+/* Good-instruction tables for 32-bit apps */
+
+static const u32 good_insns_32[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* 20 */
+	W(0x30, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) , /* 30 */
+	W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+	W(0xd0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+	W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+
+/* Using this for both 64-bit and 32-bit apps */
+static const u32 good_2byte_insns[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+	W(0x30, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+	W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+	W(0xd0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+	W(0xf0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+#undef W
+
+/*
+ * opcodes we'll probably never support:
+ * 6c-6d, e4-e5, ec-ed - in
+ * 6e-6f, e6-e7, ee-ef - out
+ * cc, cd - int3, int
+ * cf - iret
+ * d6 - illegal instruction
+ * f1 - int1/icebp
+ * f4 - hlt
+ * fa, fb - cli, sti
+ * 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2
+ *
+ * invalid opcodes in 64-bit mode:
+ * 06, 0e, 16, 1e, 27, 2f, 37, 3f, 60-62, 82, c4-c5, d4-d5
+ *
+ * 63 - we support this opcode in x86_64 but not in i386.
+ *
+ * opcodes we may need to refine support for:
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field.  On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes.  These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ *	but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+	int i;
+
+	for (i = 0; i < insn->prefixes.nbytes; i++) {
+		switch (insn->prefixes.bytes[i]) {
+		case 0x26:	 /*INAT_PFX_ES   */
+		case 0x2E:	 /*INAT_PFX_CS   */
+		case 0x36:	 /*INAT_PFX_DS   */
+		case 0x3E:	 /*INAT_PFX_SS   */
+		case 0xF0:	 /*INAT_PFX_LOCK */
+			return true;
+		}
+	}
+	return false;
+}
+
+static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
+{
+	insn_init(insn, uprobe->insn, false);
+
+	/* Skip good instruction prefixes; reject "bad" ones. */
+	insn_get_opcode(insn);
+	if (is_prefix_bad(insn))
+		return -ENOTSUPP;
+	if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_32))
+		return 0;
+	if (insn->opcode.nbytes == 2) {
+		if (test_bit(OPCODE2(insn),
+					(unsigned long *) good_2byte_insns))
+			return 0;
+	}
+	return -ENOTSUPP;
+}
+
+static int validate_insn_64bits(struct uprobe *uprobe, struct insn *insn)
+{
+	insn_init(insn, uprobe->insn, true);
+
+	/* Skip good instruction prefixes; reject "bad" ones. */
+	insn_get_opcode(insn);
+	if (is_prefix_bad(insn))
+		return -ENOTSUPP;
+	if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_64))
+		return 0;
+	if (insn->opcode.nbytes == 2) {
+		if (test_bit(OPCODE2(insn),
+					(unsigned long *) good_2byte_insns))
+			return 0;
+	}
+	return -ENOTSUPP;
+}
+
+/*
+ * Figure out which fixups post_xol() will need to perform, and annotate
+ * uprobe->fixups accordingly.  To start with, uprobe->fixups is
+ * either zero or it reflects rip-related fixups.
+ */
+static void prepare_fixups(struct uprobe *uprobe, struct insn *insn)
+{
+	bool fix_ip = true, fix_call = false;	/* defaults */
+	insn_get_opcode(insn);	/* should be a nop */
+
+	switch (OPCODE1(insn)) {
+	case 0xc3:		/* ret/lret */
+	case 0xcb:
+	case 0xc2:
+	case 0xca:
+		/* ip is correct */
+		fix_ip = false;
+		break;
+	case 0xe8:		/* call relative - Fix return addr */
+		fix_call = true;
+		break;
+	case 0x9a:		/* call absolute - Fix return addr, not ip */
+		fix_call = true;
+		fix_ip = false;
+		break;
+	case 0xff:
+	    {
+		int reg;
+		insn_get_modrm(insn);
+		reg = MODRM_REG(insn);
+		if (reg == 2 || reg == 3) {
+			/* call or lcall, indirect */
+			/* Fix return addr; ip is correct. */
+			fix_call = true;
+			fix_ip = false;
+		} else if (reg == 4 || reg == 5) {
+			/* jmp or ljmp, indirect */
+			/* ip is correct. */
+			fix_ip = false;
+		}
+		break;
+	    }
+	case 0xea:		/* jmp absolute -- ip is correct */
+		fix_ip = false;
+		break;
+	default:
+		break;
+	}
+	if (fix_ip)
+		uprobe->fixups |= UPROBES_FIX_IP;
+	if (fix_call)
+		uprobe->fixups |= UPROBES_FIX_CALL;
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * If uprobe->insn doesn't use rip-relative addressing, return
+ * immediately.  Otherwise, rewrite the instruction so that it accesses
+ * its memory operand indirectly through a scratch register.  Set
+ * uprobe->fixups and uprobe->arch_info.rip_rela_target_address
+ * accordingly.  (The contents of the scratch register will be saved
+ * before we single-step the modified instruction, and restored
+ * afterward.)
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area.  At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ * - There's always a modrm byte.
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ */
+static void handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+	u8 *cursor;
+	u8 reg;
+
+	uprobe->arch_info.rip_rela_target_address = 0x0;
+	if (!insn_rip_relative(insn))
+		return;
+
+	/*
+	 * Point cursor at the modrm byte.  The next 4 bytes are the
+	 * displacement.  Beyond the displacement, for some instructions,
+	 * is the immediate operand.
+	 */
+	cursor = uprobe->insn + insn->prefixes.nbytes
+			+ insn->rex_prefix.nbytes + insn->opcode.nbytes;
+	insn_get_length(insn);
+
+	/*
+	 * Convert from rip-relative addressing to indirect addressing
+	 * via a scratch register.  Change the r/m field from 0x5 (%rip)
+	 * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
+	 */
+	reg = MODRM_REG(insn);
+	if (reg == 0) {
+		/*
+		 * The register operand (if any) is either the A register
+		 * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
+		 * REX prefix) %r8.  In any case, we know the C register
+		 * is NOT the register operand, so we use %rcx (register
+		 * #1) for the scratch register.
+		 */
+		uprobe->fixups = UPROBES_FIX_RIP_CX;
+		/* Change modrm from 00 000 101 to 00 000 001. */
+		*cursor = 0x1;
+	} else {
+		/* Use %rax (register #0) for the scratch register. */
+		uprobe->fixups = UPROBES_FIX_RIP_AX;
+		/* Change modrm from 00 xxx 101 to 00 xxx 000 */
+		*cursor = (reg << 3);
+	}
+
+	/* Target address = address of next instruction + (signed) offset */
+	uprobe->arch_info.rip_rela_target_address = (long) insn->length
+					+ insn->displacement.value;
+	/* Displacement field is gone; slide immediate field (if any) over. */
+	if (insn->immediate.nbytes) {
+		cursor++;
+		memmove(cursor, cursor + insn->displacement.nbytes,
+						insn->immediate.nbytes);
+	}
+	return;
+}
+#else
+static void handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+	return;
+}
+#endif /* CONFIG_X86_64 */
+
+/**
+ * analyze_insn - instruction analysis including validity and fixups.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * Return 0 on success or a negative errno on error.
+ */
+int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
+{
+	int ret;
+	struct insn insn;
+
+	uprobe->fixups = 0;
+	if (is_32bit_app(tsk))
+		ret = validate_insn_32bits(uprobe, &insn);
+	else
+		ret = validate_insn_64bits(uprobe, &insn);
+	if (ret != 0)
+		return ret;
+	if (!is_32bit_app(tsk))
+		handle_riprel_insn(uprobe, &insn);
+	prepare_fixups(uprobe, &insn);
+	return 0;
+}

^ permalink raw reply related	[flat|nested] 330+ messages in thread

+#include <linux/sched.h>
+#include <linux/ptrace.h>
+#include <linux/uprobes.h>
+
+#include <linux/kdebug.h>
+#include <asm/insn.h>
+
+#ifdef CONFIG_X86_32
+#define is_32bit_app(tsk) 1
+#else
+#define is_32bit_app(tsk) (test_tsk_thread_flag(tsk, TIF_IA32))
+#endif
+
+#define UPROBES_FIX_RIP_AX	0x8000
+#define UPROBES_FIX_RIP_CX	0x4000
+
+/* Adaptations for mhiramat x86 decoder v14. */
+#define OPCODE1(insn) ((insn)->opcode.bytes[0])
+#define OPCODE2(insn) ((insn)->opcode.bytes[1])
+#define OPCODE3(insn) ((insn)->opcode.bytes[2])
+#define MODRM_REG(insn) X86_MODRM_REG(insn->modrm.value)
+
+#define W(row, b0, b1, b2, b3, b4, b5, b6, b7, b8, b9, ba, bb, bc, bd, be, bf)\
+	(((b0##UL << 0x0)|(b1##UL << 0x1)|(b2##UL << 0x2)|(b3##UL << 0x3) |   \
+	  (b4##UL << 0x4)|(b5##UL << 0x5)|(b6##UL << 0x6)|(b7##UL << 0x7) |   \
+	  (b8##UL << 0x8)|(b9##UL << 0x9)|(ba##UL << 0xa)|(bb##UL << 0xb) |   \
+	  (bc##UL << 0xc)|(bd##UL << 0xd)|(be##UL << 0xe)|(bf##UL << 0xf))    \
+	 << (row % 32))
+
+
+static const u32 good_insns_64[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) | /* 20 */
+	W(0x30, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0) , /* 30 */
+	W(0x40, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+	W(0xd0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+	W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+
+/* Good-instruction tables for 32-bit apps */
+
+static const u32 good_insns_32[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) | /* 20 */
+	W(0x30, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1) , /* 30 */
+	W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0) | /* c0 */
+	W(0xd0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0) | /* e0 */
+	W(0xf0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+
+/* Using this for both 64-bit and 32-bit apps */
+static const u32 good_2byte_insns[256 / 32] = {
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+	/*      ----------------------------------------------         */
+	W(0x00, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1) | /* 00 */
+	W(0x10, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* 10 */
+	W(0x20, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1) | /* 20 */
+	W(0x30, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) , /* 30 */
+	W(0x40, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 40 */
+	W(0x50, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 50 */
+	W(0x60, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 60 */
+	W(0x70, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1) , /* 70 */
+	W(0x80, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* 80 */
+	W(0x90, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* 90 */
+	W(0xa0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1) | /* a0 */
+	W(0xb0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1) , /* b0 */
+	W(0xc0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* c0 */
+	W(0xd0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) , /* d0 */
+	W(0xe0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1) | /* e0 */
+	W(0xf0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0)   /* f0 */
+	/*      ----------------------------------------------         */
+	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f         */
+};
+#undef W
+
+/*
+ * opcodes we'll probably never support:
+ * 6c-6d, e4-e5, ec-ed - in
+ * 6e-6f, e6-e7, ee-ef - out
+ * cc, cd - int3, int
+ * cf - iret
+ * d6 - illegal instruction
+ * f1 - int1/icebp
+ * f4 - hlt
+ * fa, fb - cli, sti
+ * 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2
+ *
+ * invalid opcodes in 64-bit mode:
+ * 06, 0e, 16, 1e, 27, 2f, 37, 3f, 60-62, 82, c4-c5, d4-d5
+ *
+ * 63 - we support this opcode in x86_64 but not in i386.
+ *
+ * opcodes we may need to refine support for:
+ * 0f - 2-byte instructions: For many of these instructions, the validity
+ * depends on the prefix and/or the reg field.  On such instructions, we
+ * just consider the opcode combination valid if it corresponds to any
+ * valid instruction.
+ * 8f - Group 1 - only reg = 0 is OK
+ * c6-c7 - Group 11 - only reg = 0 is OK
+ * d9-df - fpu insns with some illegal encodings
+ * f2, f3 - repnz, repz prefixes.  These are also the first byte for
+ * certain floating-point instructions, such as addsd.
+ * fe - Group 4 - only reg = 0 or 1 is OK
+ * ff - Group 5 - only reg = 0-6 is OK
+ *
+ * others -- Do we need to support these?
+ * 0f - (floating-point?) prefetch instructions
+ * 07, 17, 1f - pop es, pop ss, pop ds
+ * 26, 2e, 36, 3e - es:, cs:, ss:, ds: segment prefixes --
+ *	but 64 and 65 (fs: and gs:) seem to be used, so we support them
+ * 67 - addr16 prefix
+ * ce - into
+ * f0 - lock prefix
+ */
+
+/*
+ * TODO:
+ * - Where necessary, examine the modrm byte and allow only valid instructions
+ * in the different Groups and fpu instructions.
+ */
+
+static bool is_prefix_bad(struct insn *insn)
+{
+	int i;
+
+	for (i = 0; i < insn->prefixes.nbytes; i++) {
+		switch (insn->prefixes.bytes[i]) {
+		case 0x26:	 /*INAT_PFX_ES   */
+		case 0x2E:	 /*INAT_PFX_CS   */
+		case 0x36:	 /*INAT_PFX_DS   */
+		case 0x3E:	 /*INAT_PFX_SS   */
+		case 0xF0:	 /*INAT_PFX_LOCK */
+			return true;
+		}
+	}
+	return false;
+}
+
+static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
+{
+	insn_init(insn, uprobe->insn, false);
+
+	/* Skip good instruction prefixes; reject "bad" ones. */
+	insn_get_opcode(insn);
+	if (is_prefix_bad(insn))
+		return -ENOTSUPP;
+	if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_32))
+		return 0;
+	if (insn->opcode.nbytes == 2) {
+		if (test_bit(OPCODE2(insn),
+					(unsigned long *) good_2byte_insns))
+			return 0;
+	}
+	return -ENOTSUPP;
+}
+
+static int validate_insn_64bits(struct uprobe *uprobe, struct insn *insn)
+{
+	insn_init(insn, uprobe->insn, true);
+
+	/* Skip good instruction prefixes; reject "bad" ones. */
+	insn_get_opcode(insn);
+	if (is_prefix_bad(insn))
+		return -ENOTSUPP;
+	if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_64))
+		return 0;
+	if (insn->opcode.nbytes == 2) {
+		if (test_bit(OPCODE2(insn),
+					(unsigned long *) good_2byte_insns))
+			return 0;
+	}
+	return -ENOTSUPP;
+}
+
+/*
+ * Figure out which fixups post_xol() will need to perform, and annotate
+ * uprobe->fixups accordingly.  To start with, uprobe->fixups is
+ * either zero or it reflects rip-related fixups.
+ */
+static void prepare_fixups(struct uprobe *uprobe, struct insn *insn)
+{
+	bool fix_ip = true, fix_call = false;	/* defaults */
+	insn_get_opcode(insn);	/* should be a nop */
+
+	switch (OPCODE1(insn)) {
+	case 0xc3:		/* ret/lret */
+	case 0xcb:
+	case 0xc2:
+	case 0xca:
+		/* ip is correct */
+		fix_ip = false;
+		break;
+	case 0xe8:		/* call relative - Fix return addr */
+		fix_call = true;
+		break;
+	case 0x9a:		/* call absolute - Fix return addr, not ip */
+		fix_call = true;
+		fix_ip = false;
+		break;
+	case 0xff:
+	    {
+		int reg;
+		insn_get_modrm(insn);
+		reg = MODRM_REG(insn);
+		if (reg == 2 || reg == 3) {
+			/* call or lcall, indirect */
+			/* Fix return addr; ip is correct. */
+			fix_call = true;
+			fix_ip = false;
+		} else if (reg == 4 || reg == 5) {
+			/* jmp or ljmp, indirect */
+			/* ip is correct. */
+			fix_ip = false;
+		}
+		break;
+	    }
+	case 0xea:		/* jmp absolute -- ip is correct */
+		fix_ip = false;
+		break;
+	default:
+		break;
+	}
+	if (fix_ip)
+		uprobe->fixups |= UPROBES_FIX_IP;
+	if (fix_call)
+		uprobe->fixups |= UPROBES_FIX_CALL;
+}
+
+#ifdef CONFIG_X86_64
+/*
+ * If uprobe->insn doesn't use rip-relative addressing, return
+ * immediately.  Otherwise, rewrite the instruction so that it accesses
+ * its memory operand indirectly through a scratch register.  Set
+ * uprobe->fixups and uprobe->arch_info.rip_rela_target_address
+ * accordingly.  (The contents of the scratch register will be saved
+ * before we single-step the modified instruction, and restored
+ * afterward.)
+ *
+ * We do this because a rip-relative instruction can access only a
+ * relatively small area (+/- 2 GB from the instruction), and the XOL
+ * area typically lies beyond that area.  At least for instructions
+ * that store to memory, we can't execute the original instruction
+ * and "fix things up" later, because the misdirected store could be
+ * disastrous.
+ *
+ * Some useful facts about rip-relative instructions:
+ * - There's always a modrm byte.
+ * - There's never a SIB byte.
+ * - The displacement is always 4 bytes.
+ */
+static void handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+	u8 *cursor;
+	u8 reg;
+
+	uprobe->arch_info.rip_rela_target_address = 0x0;
+	if (!insn_rip_relative(insn))
+		return;
+
+	/*
+	 * Point cursor at the modrm byte.  The next 4 bytes are the
+	 * displacement.  Beyond the displacement, for some instructions,
+	 * is the immediate operand.
+	 */
+	cursor = uprobe->insn + insn->prefixes.nbytes
+			+ insn->rex_prefix.nbytes + insn->opcode.nbytes;
+	insn_get_length(insn);
+
+	/*
+	 * Convert from rip-relative addressing to indirect addressing
+	 * via a scratch register.  Change the r/m field from 0x5 (%rip)
+	 * to 0x0 (%rax) or 0x1 (%rcx), and squeeze out the offset field.
+	 */
+	reg = MODRM_REG(insn);
+	if (reg == 0) {
+		/*
+		 * The register operand (if any) is either the A register
+		 * (%rax, %eax, etc.) or (if the 0x4 bit is set in the
+		 * REX prefix) %r8.  In any case, we know the C register
+		 * is NOT the register operand, so we use %rcx (register
+		 * #1) for the scratch register.
+		 */
+		uprobe->fixups = UPROBES_FIX_RIP_CX;
+		/* Change modrm from 00 000 101 to 00 000 001. */
+		*cursor = 0x1;
+	} else {
+		/* Use %rax (register #0) for the scratch register. */
+		uprobe->fixups = UPROBES_FIX_RIP_AX;
+		/* Change modrm from 00 xxx 101 to 00 xxx 000 */
+		*cursor = (reg << 3);
+	}
+
+	/* Target address = address of next instruction + (signed) offset */
+	uprobe->arch_info.rip_rela_target_address = (long) insn->length
+					+ insn->displacement.value;
+	/* Displacement field is gone; slide immediate field (if any) over. */
+	if (insn->immediate.nbytes) {
+		cursor++;
+		memmove(cursor, cursor + insn->displacement.nbytes,
+						insn->immediate.nbytes);
+	}
+	return;
+}
+#else
+static void handle_riprel_insn(struct uprobe *uprobe, struct insn *insn)
+{
+	return;
+}
+#endif /* CONFIG_X86_64 */
+
+/**
+ * analyze_insn - instruction analysis including validity and fixups.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * Return 0 on success or a negative errno on error.
+ */
+int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
+{
+	int ret;
+	struct insn insn;
+
+	uprobe->fixups = 0;
+	if (is_32bit_app(tsk))
+		ret = validate_insn_32bits(uprobe, &insn);
+	else
+		ret = validate_insn_64bits(uprobe, &insn);
+	if (ret != 0)
+		return ret;
+	if (!is_32bit_app(tsk))
+		handle_riprel_insn(uprobe, &insn);
+	prepare_fixups(uprobe, &insn);
+	return 0;
+}


* [PATCH v5 3.1.0-rc4-tip 9/26]   Uprobes: Background page replacement.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:01   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton


Provides background page replacement by:
 - COWing the page that needs replacement.
 - modifying a copy of the COWed page.
 - replacing the COWed page with the modified page.
 - flushing the page tables.

Also provides routines to read an opcode from a given virtual address
and to verify whether an instruction is a breakpoint instruction.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    4 +
 kernel/uprobes.c        |  268 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 266 insertions(+), 6 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2548b94..2c139f3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -29,6 +29,7 @@ struct vm_area_struct;
 #ifdef CONFIG_ARCH_SUPPORTS_UPROBES
 #include <asm/uprobes.h>
 #else
+typedef u8 uprobe_opcode_t;
 struct uprobe_arch_info {};
 #define MAX_UINSN_BYTES 4
 #endif
@@ -74,6 +75,9 @@ extern int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
 							unsigned long vaddr);
 extern int __weak set_orig_insn(struct task_struct *tsk,
 		struct uprobe *uprobe, unsigned long vaddr, bool verify);
+extern bool __weak is_bkpt_insn(u8 *insn);
+extern int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
+						uprobe_opcode_t *opcode);
 extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index e0e10dd..9adc3aa 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -26,6 +26,9 @@
 #include <linux/pagemap.h>	/* grab_cache_page */
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/rmap.h>		/* anon_vma_prepare */
+#include <linux/mmu_notifier.h>	/* set_pte_at_notify */
+#include <linux/swap.h>		/* try_to_free_swap */
 #include <linux/uprobes.h>
 
 static struct rb_root uprobes_tree = RB_ROOT;
@@ -60,18 +63,265 @@ static bool valid_vma(struct vm_area_struct *vma)
 	return false;
 }
 
+/**
+ * __replace_page - replace page in vma by new page.
+ * based on replace_page in mm/ksm.c
+ *
+ * @vma:      vma that holds the pte pointing to page
+ * @page:     the cowed page we are replacing by kpage
+ * @kpage:    the modified page we replace page by
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int __replace_page(struct vm_area_struct *vma, struct page *page,
+					struct page *kpage)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int err = -EFAULT;
+
+	addr = page_address_in_vma(page, vma);
+	if (addr == -EFAULT)
+		goto out;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out;
+
+	get_page(kpage);
+	page_add_new_anon_rmap(kpage, vma, addr);
+
+	flush_cache_page(vma, addr, pte_pfn(*ptep));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+
+	page_remove_rmap(page);
+	if (!page_mapped(page))
+		try_to_free_swap(page);
+	put_page(page);
+	pte_unmap_unlock(ptep, ptl);
+	err = 0;
+
+out:
+	return err;
+}
+
+/*
+ * NOTE:
+ * Expect the breakpoint instruction to be the smallest-size instruction for
+ * the architecture. If an arch has variable-length instructions and the
+ * breakpoint instruction is not the smallest instruction the architecture
+ * supports, then read_opcode / write_opcode need to be modified
+ * accordingly. This is never a problem for archs with fixed-length
+ * instructions.
+ */
+
+/*
+ * write_opcode - write the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @uprobe: the breakpointing information.
+ * @vaddr: the virtual address to store the opcode.
+ * @opcode: opcode to be written at @vaddr.
+ *
+ * Called with tsk->mm->mmap_sem held (for read and with a reference to
+ * tsk->mm).
+ *
+ * For task @tsk, write the opcode at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+static int write_opcode(struct task_struct *tsk, struct uprobe *uprobe,
+			unsigned long vaddr, uprobe_opcode_t opcode)
+{
+	struct page *old_page, *new_page;
+	struct address_space *mapping;
+	void *vaddr_old, *vaddr_new;
+	struct vm_area_struct *vma;
+	unsigned long addr;
+	int ret;
+
+	/* Read the page with vaddr into memory */
+	ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &old_page, &vma);
+	if (ret <= 0)
+		return ret;
+	ret = -EINVAL;
+
+	/*
+	 * We are interested in text pages only. Our pages of interest
+	 * should be mapped for read and execute only. We desist from
+	 * adding probes in write mapped pages since the breakpoints
+	 * might end up in the file copy.
+	 */
+	if (!valid_vma(vma))
+		goto put_out;
+
+	mapping = uprobe->inode->i_mapping;
+	if (mapping != vma->vm_file->f_mapping)
+		goto put_out;
+
+	addr = vma->vm_start + uprobe->offset;
+	addr -= vma->vm_pgoff << PAGE_SHIFT;
+	if (vaddr != (unsigned long) addr)
+		goto put_out;
+
+	/* Allocate a page */
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
+	if (!new_page) {
+		ret = -ENOMEM;
+		goto put_out;
+	}
+
+	/*
+	 * lock page will serialize against do_wp_page()'s
+	 * PageAnon() handling
+	 */
+	lock_page(old_page);
+	/* copy the page now that we've got it stable */
+	vaddr_old = kmap_atomic(old_page);
+	vaddr_new = kmap_atomic(new_page);
+
+	memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
+	/* poke the new insn in, ASSUMES we don't cross page boundary */
+	vaddr &= ~PAGE_MASK;
+	memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
+
+	kunmap_atomic(vaddr_new);
+	kunmap_atomic(vaddr_old);
+
+	ret = anon_vma_prepare(vma);
+	if (ret) {
+		page_cache_release(new_page);
+		goto unlock_out;
+	}
+
+	lock_page(new_page);
+	ret = __replace_page(vma, old_page, new_page);
+	unlock_page(new_page);
+	if (ret != 0)
+		page_cache_release(new_page);
+unlock_out:
+	unlock_page(old_page);
+
+put_out:
+	put_page(old_page); /* we did a get_page in the beginning */
+	return ret;
+}
+
+/**
+ * read_opcode - read the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @vaddr: the virtual address to read the opcode.
+ * @opcode: location to store the read opcode.
+ *
+ * Called with tsk->mm->mmap_sem held (for read and with a reference to
+ * tsk->mm).
+ *
+ * For task @tsk, read the opcode at @vaddr and store it in @opcode.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
+						uprobe_opcode_t *opcode)
+{
+	struct vm_area_struct *vma;
+	struct page *page;
+	void *vaddr_new;
+	int ret;
+
+	ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
+	if (ret <= 0)
+		return ret;
+	ret = -EINVAL;
+
+	/*
+	 * We are interested in text pages only. Our pages of interest
+	 * should be mapped for read and execute only. We desist from
+	 * adding probes in write mapped pages since the breakpoints
+	 * might end up in the file copy.
+	 */
+	if (!valid_vma(vma))
+		goto put_out;
+
+	lock_page(page);
+	vaddr_new = kmap_atomic(page);
+	vaddr &= ~PAGE_MASK;
+	memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
+	kunmap_atomic(vaddr_new);
+	unlock_page(page);
+	ret =  0;
+
+put_out:
+	put_page(page); /* we did a get_user_pages in the beginning */
+	return ret;
+}
+
+/**
+ * set_bkpt - store breakpoint at a given address.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ *
+ * For task @tsk, store the breakpoint instruction at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
 int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
 						unsigned long vaddr)
 {
-	/* placeholder: yet to be implemented */
-	return 0;
+	return write_opcode(tsk, uprobe, vaddr, UPROBES_BKPT_INSN);
 }
 
+/**
+ * set_orig_insn - Restore the original instruction.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ * @verify: if true, verify existence of breakpoint instruction.
+ *
+ * For task @tsk, restore the original opcode (opcode) at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
 int __weak set_orig_insn(struct task_struct *tsk, struct uprobe *uprobe,
-					unsigned long vaddr, bool verify)
+				unsigned long vaddr, bool verify)
 {
-	/* placeholder: yet to be implemented */
-	return 0;
+	if (verify) {
+		uprobe_opcode_t opcode;
+		int result = read_opcode(tsk, vaddr, &opcode);
+		if (result)
+			return result;
+		if (opcode != UPROBES_BKPT_INSN)
+			return -EINVAL;
+	}
+	return write_opcode(tsk, uprobe, vaddr,
+			*(uprobe_opcode_t *) uprobe->insn);
+}
+
+/**
+ * is_bkpt_insn - check if an instruction is a breakpoint instruction.
+ * @insn: instruction to be checked.
+ * Default (__weak) implementation of is_bkpt_insn.
+ * Returns true if @insn is a breakpoint instruction.
+ */
+bool __weak is_bkpt_insn(u8 *insn)
+{
+	uprobe_opcode_t opcode;
+
+	memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
+	return (opcode == UPROBES_BKPT_INSN);
 }
 
 static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
@@ -357,7 +607,13 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 		ret = copy_insn(uprobe, vma, addr);
 		if (ret)
 			goto put_return;
-		/* TODO : Analysis and verification of instruction */
+		if (is_bkpt_insn(uprobe->insn)) {
+			ret = -EEXIST;
+			goto put_return;
+		}
+		ret = analyze_insn(tsk, uprobe);
+		if (ret)
+			goto put_return;
 		uprobe->copy = 1;
 	}
 

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 9/26]   Uprobes: Background page replacement.
@ 2011-09-20 12:01   ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton


Provides Background page replacement by
 - cow the page that needs replacement.
 - modify a copy of the cowed page.
 - replace the cow page with the modified page
 - flush the page tables.

Also provides additional routines to read an opcode from a given virtual
address and for verifying if a instruction is a breakpoint instruction.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    4 +
 kernel/uprobes.c        |  268 ++++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 266 insertions(+), 6 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2548b94..2c139f3 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -29,6 +29,7 @@ struct vm_area_struct;
 #ifdef CONFIG_ARCH_SUPPORTS_UPROBES
 #include <asm/uprobes.h>
 #else
+typedef u8 uprobe_opcode_t;
 struct uprobe_arch_info {};
 #define MAX_UINSN_BYTES 4
 #endif
@@ -74,6 +75,9 @@ extern int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
 							unsigned long vaddr);
 extern int __weak set_orig_insn(struct task_struct *tsk,
 		struct uprobe *uprobe, unsigned long vaddr, bool verify);
+extern bool __weak is_bkpt_insn(u8 *insn);
+extern int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
+						uprobe_opcode_t *opcode);
 extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index e0e10dd..9adc3aa 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -26,6 +26,9 @@
 #include <linux/pagemap.h>	/* grab_cache_page */
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/rmap.h>		/* anon_vma_prepare */
+#include <linux/mmu_notifier.h>	/* set_pte_at_notify */
+#include <linux/swap.h>		/* try_to_free_swap */
 #include <linux/uprobes.h>
 
 static struct rb_root uprobes_tree = RB_ROOT;
@@ -60,18 +63,265 @@ static bool valid_vma(struct vm_area_struct *vma)
 	return false;
 }
 
+/**
+ * __replace_page - replace page in vma by new page.
+ * based on replace_page in mm/ksm.c
+ *
+ * @vma:      vma that holds the pte pointing to page
+ * @page:     the cowed page we are replacing by kpage
+ * @kpage:    the modified page we replace page by
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ */
+static int __replace_page(struct vm_area_struct *vma, struct page *page,
+					struct page *kpage)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int err = -EFAULT;
+
+	addr = page_address_in_vma(page, vma);
+	if (addr == -EFAULT)
+		goto out;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out;
+
+	get_page(kpage);
+	page_add_new_anon_rmap(kpage, vma, addr);
+
+	flush_cache_page(vma, addr, pte_pfn(*ptep));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at_notify(mm, addr, ptep, mk_pte(kpage, vma->vm_page_prot));
+
+	page_remove_rmap(page);
+	if (!page_mapped(page))
+		try_to_free_swap(page);
+	put_page(page);
+	pte_unmap_unlock(ptep, ptl);
+	err = 0;
+
+out:
+	return err;
+}
+
+/*
+ * NOTE:
+ * Expect the breakpoint instruction to be the smallest size instruction for
+ * the architecture. If an arch has variable length instruction and the
+ * breakpoint instruction is not of the smallest length instruction
+ * supported by that architecture then we need to modify read_opcode /
+ * write_opcode accordingly. This would never be a problem for archs that
+ * have fixed length instructions.
+ */
+
+/*
+ * write_opcode - write the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @uprobe: the breakpointing information.
+ * @vaddr: the virtual address to store the opcode.
+ * @opcode: opcode to be written at @vaddr.
+ *
+ * Called with tsk->mm->mmap_sem held (for read and with a reference to
+ * tsk->mm).
+ *
+ * For task @tsk, write the opcode at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
+static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
+			unsigned long vaddr, uprobe_opcode_t opcode)
+{
+	struct page *old_page, *new_page;
+	struct address_space *mapping;
+	void *vaddr_old, *vaddr_new;
+	struct vm_area_struct *vma;
+	unsigned long addr;
+	int ret;
+
+	/* Read the page with vaddr into memory */
+	ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &old_page, &vma);
+	if (ret <= 0)
+		return ret;
+	ret = -EINVAL;
+
+	/*
+	 * We are interested in text pages only. Our pages of interest
+	 * should be mapped for read and execute only. We desist from
+	 * adding probes in write mapped pages since the breakpoints
+	 * might end up in the file copy.
+	 */
+	if (!valid_vma(vma))
+		goto put_out;
+
+	mapping = uprobe->inode->i_mapping;
+	if (mapping != vma->vm_file->f_mapping)
+		goto put_out;
+
+	addr = vma->vm_start + uprobe->offset;
+	addr -= vma->vm_pgoff << PAGE_SHIFT;
+	if (vaddr != (unsigned long) addr)
+		goto put_out;
+
+	/* Allocate a page */
+	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
+	if (!new_page) {
+		ret = -ENOMEM;
+		goto put_out;
+	}
+
+	/*
+	 * lock page will serialize against do_wp_page()'s
+	 * PageAnon() handling
+	 */
+	lock_page(old_page);
+	/* copy the page now that we've got it stable */
+	vaddr_old = kmap_atomic(old_page);
+	vaddr_new = kmap_atomic(new_page);
+
+	memcpy(vaddr_new, vaddr_old, PAGE_SIZE);
+	/* poke the new insn in, ASSUMES we don't cross page boundary */
+	vaddr &= ~PAGE_MASK;
+	memcpy(vaddr_new + vaddr, &opcode, uprobe_opcode_sz);
+
+	kunmap_atomic(vaddr_new);
+	kunmap_atomic(vaddr_old);
+
+	ret = anon_vma_prepare(vma);
+	if (ret) {
+		page_cache_release(new_page);
+		goto unlock_out;
+	}
+
+	lock_page(new_page);
+	ret = __replace_page(vma, old_page, new_page);
+	unlock_page(new_page);
+	if (ret != 0)
+		page_cache_release(new_page);
+unlock_out:
+	unlock_page(old_page);
+
+put_out:
+	put_page(old_page); /* we did a get_page in the beginning */
+	return ret;
+}
+
+/**
+ * read_opcode - read the opcode at a given virtual address.
+ * @tsk: the probed task.
+ * @vaddr: the virtual address to read the opcode.
+ * @opcode: location to store the read opcode.
+ *
+ * Called with tsk->mm->mmap_sem held (for read) and with a reference
+ * to tsk->mm.
+ *
+ * For task @tsk, read the opcode at @vaddr and store it in @opcode.
+ * Return 0 (success) or a negative errno.
+ */
+int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
+						uprobe_opcode_t *opcode)
+{
+	struct vm_area_struct *vma;
+	struct page *page;
+	void *vaddr_new;
+	int ret;
+
+	ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
+	if (ret <= 0)
+		return ret;
+	ret = -EINVAL;
+
+	/*
+	 * We are interested in text pages only. Our pages of interest
+	 * should be mapped for read and execute only. We desist from
+	 * adding probes in write mapped pages since the breakpoints
+	 * might end up in the file copy.
+	 */
+	if (!valid_vma(vma))
+		goto put_out;
+
+	lock_page(page);
+	vaddr_new = kmap_atomic(page);
+	vaddr &= ~PAGE_MASK;
+	memcpy(opcode, vaddr_new + vaddr, uprobe_opcode_sz);
+	kunmap_atomic(vaddr_new);
+	unlock_page(page);
+	ret = 0;
+
+put_out:
+	put_page(page); /* we did a get_user_pages in the beginning */
+	return ret;
+}
+
+/**
+ * set_bkpt - store breakpoint at a given address.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ *
+ * For task @tsk, store the breakpoint instruction at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
 int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
 						unsigned long vaddr)
 {
-	/* placeholder: yet to be implemented */
-	return 0;
+	return write_opcode(tsk, uprobe, vaddr, UPROBES_BKPT_INSN);
 }
 
+/**
+ * set_orig_insn - Restore the original instruction.
+ * @tsk: the probed task.
+ * @uprobe: the probepoint information.
+ * @vaddr: the virtual address to insert the opcode.
+ * @verify: if true, verify existence of the breakpoint instruction.
+ *
+ * For task @tsk, restore the original instruction at @vaddr.
+ * Return 0 (success) or a negative errno.
+ */
 int __weak set_orig_insn(struct task_struct *tsk, struct uprobe *uprobe,
-					unsigned long vaddr, bool verify)
+				unsigned long vaddr, bool verify)
 {
-	/* placeholder: yet to be implemented */
-	return 0;
+	if (verify) {
+		uprobe_opcode_t opcode;
+		int result = read_opcode(tsk, vaddr, &opcode);
+		if (result)
+			return result;
+		if (opcode != UPROBES_BKPT_INSN)
+			return -EINVAL;
+	}
+	return write_opcode(tsk, uprobe, vaddr,
+			*(uprobe_opcode_t *) uprobe->insn);
+}
+
+/**
+ * is_bkpt_insn - check if instruction is a breakpoint instruction.
+ * @insn: instruction to be checked.
+ *
+ * Default implementation of is_bkpt_insn.
+ * Returns true if @insn is a breakpoint instruction.
+ */
+bool __weak is_bkpt_insn(u8 *insn)
+{
+	uprobe_opcode_t opcode;
+
+	memcpy(&opcode, insn, UPROBES_BKPT_INSN_SIZE);
+	return (opcode == UPROBES_BKPT_INSN);
 }
 
 static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
@@ -357,7 +607,13 @@ static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 		ret = copy_insn(uprobe, vma, addr);
 		if (ret)
 			goto put_return;
-		/* TODO : Analysis and verification of instruction */
+		if (is_bkpt_insn(uprobe->insn)) {
+			ret = -EEXIST;
+			goto put_return;
+		}
+		ret = analyze_insn(tsk, uprobe);
+		if (ret)
+			goto put_return;
 		uprobe->copy = 1;
 	}
 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 10/26]   x86: Set instruction pointer.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:01   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:01 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML


Provides x86 specific routine to set the instruction pointer to the
given address.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/uprobes.h |    1 +
 arch/x86/kernel/uprobes.c      |   10 ++++++++++
 2 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 4295ce0..35ac9d7 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -39,4 +39,5 @@ struct uprobe_arch_info {};
 #endif
 struct uprobe;
 extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
+extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index e4fd077..e1c1dfe 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -383,3 +383,13 @@ int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe)
 	prepare_fixups(uprobe, &insn);
 	return 0;
 }
+
+/*
+ * @regs: reflects the saved state of the task.
+ * @vaddr: the virtual address to jump to.
+ */
+void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
+{
+	regs->ip = vaddr;
+}

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 11/26]   x86: Introduce TIF_UPROBE FLAG.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:02   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:02 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Andi Kleen, Andrew Morton


On a breakpoint or single-step exception, the notifier just sets this
thread_info flag so that do_notify_resume() is aware that a
breakpoint/single-step has occurred.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/thread_info.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index a1fe5c1..aeb3e04 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -84,6 +84,7 @@ struct thread_info {
 #define TIF_SECCOMP		8	/* secure computing */
 #define TIF_MCE_NOTIFY		10	/* notify userspace of an MCE */
 #define TIF_USER_RETURN_NOTIFY	11	/* notify kernel of userspace return */
+#define TIF_UPROBE		12	/* breakpointed or singlestepping */
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* 32bit process */
 #define TIF_FORK		18	/* ret_from_fork */
@@ -107,6 +108,7 @@ struct thread_info {
 #define _TIF_SECCOMP		(1 << TIF_SECCOMP)
 #define _TIF_MCE_NOTIFY		(1 << TIF_MCE_NOTIFY)
 #define _TIF_USER_RETURN_NOTIFY	(1 << TIF_USER_RETURN_NOTIFY)
+#define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_FORK		(1 << TIF_FORK)

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:02   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:02 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Oleg Nesterov,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML


Provides routines to create/manage and free the task specific
information.

Adds a hook in uprobe_notify_resume to handle breakpoint and singlestep
exception.

Uprobes needs to maintain some task-specific information: whether a
task has hit a probepoint, the uprobe corresponding to the probe hit,
and the slot to which the original instruction is copied before
single-stepping.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched.h   |    3 +
 include/linux/uprobes.h |   35 ++++++++
 kernel/fork.c           |    4 +
 kernel/uprobes.c        |  205 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 247 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc6f5f2..4f84980 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1569,6 +1569,9 @@ struct task_struct {
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
 #endif
+#ifdef CONFIG_UPROBES
+	struct uprobe_task *utask;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2c139f3..fa7eaba 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -70,6 +70,26 @@ struct uprobe {
 	u8			insn[MAX_UINSN_BYTES];
 };
 
+enum uprobe_task_state {
+	UTASK_RUNNING,
+	UTASK_BP_HIT,
+	UTASK_SSTEP
+};
+
+/*
+ * uprobe_task -- not a user-visible struct.
+ * Corresponds to a thread in a probed process.
+ */
+struct uprobe_task {
+	unsigned long xol_vaddr;
+	unsigned long vaddr;
+
+	enum uprobe_task_state state;
+
+	struct uprobe *active_uprobe;
+};
+
 #ifdef CONFIG_UPROBES
 extern int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
 							unsigned long vaddr);
@@ -82,8 +102,13 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
+extern void free_uprobe_utask(struct task_struct *tsk);
 extern int mmap_uprobe(struct vm_area_struct *vma);
 extern void munmap_uprobe(struct vm_area_struct *vma);
+extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
+extern int uprobe_post_notifier(struct pt_regs *regs);
+extern int uprobe_bkpt_notifier(struct pt_regs *regs);
+extern void uprobe_notify_resume(struct pt_regs *regs);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
@@ -101,5 +126,15 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
 static inline void munmap_uprobe(struct vm_area_struct *vma)
 {
 }
+static inline void uprobe_notify_resume(struct pt_regs *regs)
+{
+}
+static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+	return 0;
+}
+static inline void free_uprobe_utask(struct task_struct *tsk)
+{
+}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7cc0b51..5914bc1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -195,6 +195,7 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+	free_uprobe_utask(tsk);
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
@@ -1285,6 +1286,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	INIT_LIST_HEAD(&p->pi_state_list);
 	p->pi_state_cache = NULL;
 #endif
+#ifdef CONFIG_UPROBES
+	p->utask = NULL;
+#endif
 	/*
 	 * sigaltstack should be cleared when sharing the same VM
 	 */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 9adc3aa..8b6654e 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -29,6 +29,7 @@
 #include <linux/rmap.h>		/* anon_vma_prepare */
 #include <linux/mmu_notifier.h>	/* set_pte_at_notify */
 #include <linux/swap.h>		/* try_to_free_swap */
+#include <linux/ptrace.h>	/* user_enable_single_step */
 #include <linux/uprobes.h>
 
 static struct rb_root uprobes_tree = RB_ROOT;
@@ -473,6 +474,21 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_consumer *consumer;
+
+	down_read(&uprobe->consumer_rwsem);
+	for (consumer = uprobe->consumers; consumer;
+			consumer = consumer->next) {
+		if (!consumer->filter ||
+				consumer->filter(consumer, current))
+			consumer->handler(consumer, regs);
+	}
+	up_read(&uprobe->consumer_rwsem);
+}
+
 /* Returns the previous consumer */
 static struct uprobe_consumer *add_consumer(struct uprobe *uprobe,
 				struct uprobe_consumer *consumer)
@@ -640,10 +656,22 @@ static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 	put_task_struct(tsk);
 }
 
+/*
+ * There could be threads that have hit the breakpoint and are entering the
+ * notifier code and trying to acquire the uprobes_treelock. The thread
+ * calling delete_uprobe() that is removing the uprobe from the rb_tree can
+ * race with these threads and might acquire the uprobes_treelock before
+ * some of the breakpoint-hit threads. In that case, the breakpoint-hit
+ * threads will not find the uprobe. Finding whether a "trap" instruction
+ * was present at the interrupting address is racy. Hence provide some
+ * extra time (by way of synchronize_sched()) for the breakpoint-hit
+ * threads to acquire the uprobes_treelock before the uprobe is removed
+ * from the rbtree.
+ */
 static void delete_uprobe(struct uprobe *uprobe)
 {
 	unsigned long flags;
 
+	synchronize_sched();
 	spin_lock_irqsave(&uprobes_treelock, flags);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
 	spin_unlock_irqrestore(&uprobes_treelock, flags);
@@ -1004,3 +1032,180 @@ void munmap_uprobe(struct vm_area_struct *vma)
 	iput(inode);
 	return;
 }
+
+/**
+ * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
+ * @regs: Reflects the saved state of the task after it has hit a breakpoint
+ * instruction.
+ * Return the address of the breakpoint instruction.
+ */
+unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+	return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
+}
+
+/*
+ * Called with no locks held.
+ * Called in context of a exiting or a exec-ing thread.
+ */
+void free_uprobe_utask(struct task_struct *tsk)
+{
+	struct uprobe_task *utask = tsk->utask;
+
+	if (!utask)
+		return;
+
+	if (utask->active_uprobe)
+		put_uprobe(utask->active_uprobe);
+
+	kfree(utask);
+	tsk->utask = NULL;
+}
+
+/*
+ * Allocate a uprobe_task object for the task.
+ * Called when the thread hits a breakpoint for the first time.
+ *
+ * Returns:
+ * - pointer to new uprobe_task on success
+ * - NULL on allocation failure (callers treat this as "cannot handle
+ *   the probe" and re-execute the instruction)
+ */
+static struct uprobe_task *add_utask(void)
+{
+	struct uprobe_task *utask;
+
+	utask = kzalloc(sizeof(*utask), GFP_KERNEL);
+	if (unlikely(!utask))
+		return NULL;
+
+	utask->active_uprobe = NULL;
+	current->utask = utask;
+	return utask;
+}
+
+/* Prepare to single-step probed instruction out of line. */
+static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+				unsigned long vaddr)
+{
+	/* TODO: Yet to be implemented */
+	return -EFAULT;
+}
+
+/*
+ * Verify from the instruction pointer whether the single-step has indeed
+ * occurred. If it has, do the post-single-step fix-ups.
+ */
+static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	/* TODO: Yet to be implemented */
+	return false;
+}
+
+/*
+ * uprobe_notify_resume gets called in task context just before returning
+ * to userspace.
+ *
+ * If it is the first time the probepoint is hit, a slot is allocated here.
+ * If it is the first time the thread hits a breakpoint, a utask is
+ * allocated here.
+ */
+void uprobe_notify_resume(struct pt_regs *regs)
+{
+	struct vm_area_struct *vma;
+	struct uprobe_task *utask;
+	struct mm_struct *mm;
+	struct uprobe *u = NULL;
+	unsigned long probept;
+
+	utask = current->utask;
+	mm = current->mm;
+	if (!utask || utask->state == UTASK_BP_HIT) {
+		probept = get_uprobe_bkpt_addr(regs);
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, probept);
+		if (vma && valid_vma(vma))
+			u = find_uprobe(vma->vm_file->f_mapping->host,
+					probept - vma->vm_start +
+					(vma->vm_pgoff << PAGE_SHIFT));
+		up_read(&mm->mmap_sem);
+		if (!u)
+			/* No matching uprobe; signal SIGTRAP. */
+			goto cleanup_ret;
+		if (!utask) {
+			utask = add_utask();
+			/* Cannot allocate; re-execute the instruction. */
+			if (!utask)
+				goto cleanup_ret;
+		}
+		/* TODO Start queueing signals. */
+		utask->active_uprobe = u;
+		handler_chain(u, regs);
+		utask->state = UTASK_SSTEP;
+		if (!pre_ssout(u, regs, probept))
+			user_enable_single_step(current);
+		else
+			/* Cannot single-step; re-execute the instruction. */
+			goto cleanup_ret;
+	} else if (utask->state == UTASK_SSTEP) {
+		u = utask->active_uprobe;
+		if (sstep_complete(u, regs)) {
+			put_uprobe(u);
+			utask->active_uprobe = NULL;
+			utask->state = UTASK_RUNNING;
+			user_disable_single_step(current);
+
+			/* TODO Stop queueing signals. */
+		}
+	}
+	return;
+
+cleanup_ret:
+	if (utask) {
+		utask->active_uprobe = NULL;
+		utask->state = UTASK_RUNNING;
+	}
+	if (u) {
+		put_uprobe(u);
+		set_instruction_pointer(regs, probept);
+	} else {
+		/*TODO Return SIGTRAP signal */
+	}
+}
+
+/*
+ * uprobe_bkpt_notifier gets called from interrupt context.
+ * It sets the TIF_UPROBE flag so that the breakpoint is handled by
+ * uprobe_notify_resume() on the way back to userspace.
+ */
+int uprobe_bkpt_notifier(struct pt_regs *regs)
+{
+	struct uprobe_task *utask;
+
+	if (!current->mm || !atomic_read(&current->mm->mm_uprobes_count))
+		/* task is currently not uprobed */
+		return 0;
+
+	utask = current->utask;
+	if (utask)
+		utask->state = UTASK_BP_HIT;
+	set_thread_flag(TIF_UPROBE);
+	return 1;
+}
+
+/*
+ * uprobe_post_notifier gets called in interrupt context.
+ * It completes the single step operation.
+ */
+int uprobe_post_notifier(struct pt_regs *regs)
+{
+	if (!current->mm || !current->utask || !current->utask->active_uprobe)
+		/* task is currently not uprobed */
+		return 0;
+
+	set_thread_flag(TIF_UPROBE);
+	return 1;
+}

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
@ 2011-09-20 12:02   ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:02 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Oleg Nesterov,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML


Provides routines to create/manage and free the task specific
information.

Adds a hook in uprobe_notify_resume to handle breakpoint and singlestep
exception.

Uprobes needs to maintain some task specific information including if a
task has hit a probepoint, uprobe corresponding to the probehit,
the slot where the original instruction is copied to before
single-stepping.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched.h   |    3 +
 include/linux/uprobes.h |   35 ++++++++
 kernel/fork.c           |    4 +
 kernel/uprobes.c        |  205 +++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 247 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc6f5f2..4f84980 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1569,6 +1569,9 @@ struct task_struct {
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
 	atomic_t ptrace_bp_refcnt;
 #endif
+#ifdef CONFIG_UPROBES
+	struct uprobe_task *utask;
+#endif
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 2c139f3..fa7eaba 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -70,6 +70,26 @@ struct uprobe {
 	u8			insn[MAX_UINSN_BYTES];
 };
 
+enum uprobe_task_state {
+	UTASK_RUNNING,
+	UTASK_BP_HIT,
+	UTASK_SSTEP
+};
+
+/*
+ * uprobe_utask -- not a user-visible struct.
+ * Corresponds to a thread in a probed process.
+ * Guarded by uproc->mutex.
+ */
+struct uprobe_task {
+	unsigned long xol_vaddr;
+	unsigned long vaddr;
+
+	enum uprobe_task_state state;
+
+	struct uprobe *active_uprobe;
+};
+
 #ifdef CONFIG_UPROBES
 extern int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
 							unsigned long vaddr);
@@ -82,8 +102,13 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
+extern void free_uprobe_utask(struct task_struct *tsk);
 extern int mmap_uprobe(struct vm_area_struct *vma);
 extern void munmap_uprobe(struct vm_area_struct *vma);
+extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
+extern int uprobe_post_notifier(struct pt_regs *regs);
+extern int uprobe_bkpt_notifier(struct pt_regs *regs);
+extern void uprobe_notify_resume(struct pt_regs *regs);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
@@ -101,5 +126,15 @@ static inline int mmap_uprobe(struct vm_area_struct *vma)
 static inline void munmap_uprobe(struct vm_area_struct *vma)
 {
 }
+static inline void uprobe_notify_resume(struct pt_regs *regs)
+{
+}
+static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+	return 0;
+}
+static inline void free_uprobe_utask(struct task_struct *tsk)
+{
+}
 #endif /* CONFIG_UPROBES */
 #endif	/* _LINUX_UPROBES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 7cc0b51..5914bc1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -195,6 +195,7 @@ void __put_task_struct(struct task_struct *tsk)
 	delayacct_tsk_free(tsk);
 	put_signal_struct(tsk->signal);
 
+	free_uprobe_utask(tsk);
 	if (!profile_handoff_task(tsk))
 		free_task(tsk);
 }
@@ -1285,6 +1286,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	INIT_LIST_HEAD(&p->pi_state_list);
 	p->pi_state_cache = NULL;
 #endif
+#ifdef CONFIG_UPROBES
+	p->utask = NULL;
+#endif
 	/*
 	 * sigaltstack should be cleared when sharing the same VM
 	 */
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 9adc3aa..8b6654e 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -29,6 +29,7 @@
 #include <linux/rmap.h>		/* anon_vma_prepare */
 #include <linux/mmu_notifier.h>	/* set_pte_at_notify */
 #include <linux/swap.h>		/* try_to_free_swap */
+#include <linux/ptrace.h>	/* user_enable_single_step */
 #include <linux/uprobes.h>
 
 static struct rb_root uprobes_tree = RB_ROOT;
@@ -473,6 +474,21 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 	return uprobe;
 }
 
+static void handler_chain(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_consumer *consumer;
+
+	down_read(&uprobe->consumer_rwsem);
+	consumer = uprobe->consumers;
+	for (consumer = uprobe->consumers; consumer;
+			consumer = consumer->next) {
+		if (!consumer->filter ||
+				consumer->filter(consumer, current))
+			consumer->handler(consumer, regs);
+	}
+	up_read(&uprobe->consumer_rwsem);
+}
+
 /* Returns the previous consumer */
 static struct uprobe_consumer *add_consumer(struct uprobe *uprobe,
 				struct uprobe_consumer *consumer)
@@ -640,10 +656,22 @@ static void remove_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
 	put_task_struct(tsk);
 }
 
+/*
+ * There could be threads that have hit the breakpoint and are entering the
+ * notifier code and trying to acquire the uprobes_treelock. The thread
+ * calling delete_uprobe() that is removing the uprobe from the rb_tree can
+ * race with these threads and might acquire the uprobes_treelock compared
+ * to some of the breakpoint hit threads. In such a case, the breakpoint hit
+ * threads will not find the uprobe. Finding if a "trap" instruction was
+ * present at the interrupting address is racy. Hence provide some extra
+ * time (by way of synchronize_sched() for breakpoint hit threads to acquire
+ * the uprobes_treelock before the uprobe is removed from the rbtree.
+ */
 static void delete_uprobe(struct uprobe *uprobe)
 {
 	unsigned long flags;
 
+	synchronize_sched();
 	spin_lock_irqsave(&uprobes_treelock, flags);
 	rb_erase(&uprobe->rb_node, &uprobes_tree);
 	spin_unlock_irqrestore(&uprobes_treelock, flags);
@@ -1004,3 +1032,180 @@ void munmap_uprobe(struct vm_area_struct *vma)
 	iput(inode);
 	return;
 }
+
+/**
+ * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
+ * @regs: Reflects the saved state of the task after it has hit a breakpoint
+ * instruction.
+ * Return the address of the breakpoint instruction.
+ */
+unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
+{
+	return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
+}
+
+/*
+ * Called with no locks held.
+ * Called in context of a exiting or a exec-ing thread.
+ */
+void free_uprobe_utask(struct task_struct *tsk)
+{
+	struct uprobe_task *utask = tsk->utask;
+
+	if (!utask)
+		return;
+
+	if (utask->active_uprobe)
+		put_uprobe(utask->active_uprobe);
+
+	kfree(utask);
+	tsk->utask = NULL;
+}
+
+/*
+ * Allocate a uprobe_task object for the task.
+ * Called when the thread hits a breakpoint for the first time.
+ *
+ * Returns:
+ * - pointer to new uprobe_task on success
+ * - negative errno otherwise
+ */
+static struct uprobe_task *add_utask(void)
+{
+	struct uprobe_task *utask;
+
+	utask = kzalloc(sizeof *utask, GFP_KERNEL);
+	if (unlikely(utask == NULL))
+		return ERR_PTR(-ENOMEM);
+
+	utask->active_uprobe = NULL;
+	current->utask = utask;
+	return utask;
+}
+
+/* Prepare to single-step probed instruction out of line. */
+static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
+				unsigned long vaddr)
+{
+	/* TODO: Yet to be implemented */
+	return -EFAULT;
+}
+
+/*
+ * Verify from Instruction Pointer if singlestep has indeed occurred.
+ * If Singlestep has occurred, then do post singlestep fix-ups.
+ */
+static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	/* TODO: Yet to be implemented */
+	return false;
+}
+
+/*
+ * uprobe_notify_resume gets called in task context just before returning
+ * to userspace.
+ *
+ *  If it's the first time the probepoint is hit, a slot gets allocated here.
+ *  If it's the first time the thread has hit a breakpoint, a utask gets
+ *  allocated here.
+ */
+void uprobe_notify_resume(struct pt_regs *regs)
+{
+	struct vm_area_struct *vma;
+	struct uprobe_task *utask;
+	struct mm_struct *mm;
+	struct uprobe *u = NULL;
+	unsigned long probept;
+
+	utask = current->utask;
+	mm = current->mm;
+	if (!utask || utask->state == UTASK_BP_HIT) {
+		probept = get_uprobe_bkpt_addr(regs);
+		down_read(&mm->mmap_sem);
+		vma = find_vma(mm, probept);
+		if (vma && valid_vma(vma))
+			u = find_uprobe(vma->vm_file->f_mapping->host,
+					probept - vma->vm_start +
+					(vma->vm_pgoff << PAGE_SHIFT));
+		up_read(&mm->mmap_sem);
+		if (!u)
+			/* No matching uprobe; signal SIGTRAP. */
+			goto cleanup_ret;
+		if (!utask) {
+			utask = add_utask();
+			/* Cannot allocate; re-execute the instruction. */
+			if (!utask)
+				goto cleanup_ret;
+		}
+		/* TODO Start queueing signals. */
+		utask->active_uprobe = u;
+		handler_chain(u, regs);
+		utask->state = UTASK_SSTEP;
+		if (!pre_ssout(u, regs, probept))
+			user_enable_single_step(current);
+		else
+			/* Cannot single-step; re-execute the instruction. */
+			goto cleanup_ret;
+	} else if (utask->state == UTASK_SSTEP) {
+		u = utask->active_uprobe;
+		if (sstep_complete(u, regs)) {
+			put_uprobe(u);
+			utask->active_uprobe = NULL;
+			utask->state = UTASK_RUNNING;
+			user_disable_single_step(current);
+
+			/* TODO Stop queueing signals. */
+		}
+	}
+	return;
+
+cleanup_ret:
+	if (utask) {
+		utask->active_uprobe = NULL;
+		utask->state = UTASK_RUNNING;
+	}
+	if (u) {
+		put_uprobe(u);
+		set_instruction_pointer(regs, probept);
+	} else {
+		/*TODO Return SIGTRAP signal */
+	}
+}
+
+/*
+ * uprobe_bkpt_notifier gets called from interrupt context. It marks the
+ * utask state as UTASK_BP_HIT and sets the TIF_UPROBE flag so that the
+ * breakpoint is handled on return to userspace.
+ */
+int uprobe_bkpt_notifier(struct pt_regs *regs)
+{
+	struct uprobe_task *utask;
+
+	if (!current->mm || !atomic_read(&current->mm->mm_uprobes_count))
+		/* task is currently not uprobed */
+		return 0;
+
+	utask = current->utask;
+	if (utask)
+		utask->state = UTASK_BP_HIT;
+	set_thread_flag(TIF_UPROBE);
+	return 1;
+}
+
+/*
+ * uprobe_post_notifier gets called in interrupt context. It sets the
+ * TIF_UPROBE flag so that the single-step operation is completed on
+ * return to userspace.
+ */
+int uprobe_post_notifier(struct pt_regs *regs)
+{
+	struct uprobe *uprobe;
+	struct uprobe_task *utask;
+
+	if (!current->mm || !current->utask || !current->utask->active_uprobe)
+		/* task is currently not uprobed */
+		return 0;
+
+	utask = current->utask;
+	uprobe = utask->active_uprobe;
+	set_thread_flag(TIF_UPROBE);
+	return 1;
+}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH v5 3.1.0-rc4-tip 13/26]   x86: define an x86-specific exception notifier.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:02   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:02 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton


Uprobes uses the die-notifier mechanism to gain control when an
application encounters a breakpoint or a single-step exception.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/uprobes.h |    4 ++++
 arch/x86/kernel/signal.c       |   14 ++++++++++++++
 arch/x86/kernel/uprobes.c      |   29 +++++++++++++++++++++++++++++
 3 files changed, 47 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 35ac9d7..7d9d0a5 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -23,6 +23,8 @@
  *	Jim Keniston
  */
 
+#include <linux/notifier.h>
+
 typedef u8 uprobe_opcode_t;
 #define MAX_UINSN_BYTES 16
 #define UPROBES_XOL_SLOT_BYTES	128	/* to keep it cache aligned */
@@ -40,4 +42,6 @@ struct uprobe_arch_info {};
 struct uprobe;
 extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
 extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
+extern int uprobe_exception_notify(struct notifier_block *self,
+				       unsigned long val, void *data);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c
index 54ddaeb..97fac33 100644
--- a/arch/x86/kernel/signal.c
+++ b/arch/x86/kernel/signal.c
@@ -20,6 +20,7 @@
 #include <linux/personality.h>
 #include <linux/uaccess.h>
 #include <linux/user-return-notifier.h>
+#include <linux/uprobes.h>
 
 #include <asm/processor.h>
 #include <asm/ucontext.h>
@@ -820,6 +821,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
 		mce_notify_process();
 #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
 
+	if (thread_info_flags & _TIF_UPROBE) {
+		clear_thread_flag(TIF_UPROBE);
+#ifdef CONFIG_X86_32
+		/*
+		 * On x86_32, do_notify_resume() gets called with
+		 * interrupts disabled. Hence enable interrupts if they
+		 * are still disabled.
+		 */
+		local_irq_enable();
+#endif
+		uprobe_notify_resume(regs);
+	}
+
 	/* deal with pending signal delivery */
 	if (thread_info_flags & _TIF_SIGPENDING)
 		do_signal(regs);
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index e1c1dfe..8ec759a 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -393,3 +393,32 @@ void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
 {
 	regs->ip = vaddr;
 }
+
+/*
+ * Wrapper routine for handling exceptions.
+ */
+int uprobe_exception_notify(struct notifier_block *self,
+				       unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+	int ret = NOTIFY_DONE;
+
+	/* We are only interested in userspace traps */
+	if (regs && !user_mode_vm(regs))
+		return NOTIFY_DONE;
+
+	switch (val) {
+	case DIE_INT3:
+		/* Let uprobes handle the breakpoint */
+		if (uprobe_bkpt_notifier(regs))
+			ret = NOTIFY_STOP;
+		break;
+	case DIE_DEBUG:
+		if (uprobe_post_notifier(regs))
+			ret = NOTIFY_STOP;
+	default:
+		break;
+	}
+	return ret;
+}




* [PATCH v5 3.1.0-rc4-tip 14/26]   uprobe: register exception notifier
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:02   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:02 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML


Use the die-notifier mechanism to register the uprobes exception notifier.

Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/uprobes.c |   18 ++++++++++++++++++
 1 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 8b6654e..083c577 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -30,6 +30,7 @@
 #include <linux/mmu_notifier.h>	/* set_pte_at_notify */
 #include <linux/swap.h>		/* try_to_free_swap */
 #include <linux/ptrace.h>	/* user_enable_single_step */
+#include <linux/kdebug.h>	/* notifier mechanism */
 #include <linux/uprobes.h>
 
 static struct rb_root uprobes_tree = RB_ROOT;
@@ -1209,3 +1210,20 @@ int uprobe_post_notifier(struct pt_regs *regs)
 	set_thread_flag(TIF_UPROBE);
 	return 1;
 }
+
+struct notifier_block uprobe_exception_nb = {
+	.notifier_call = uprobe_exception_notify,
+	.priority = INT_MAX - 1,	/* notified after kprobes, kgdb */
+};
+
+static int __init init_uprobes(void)
+{
+	return register_die_notifier(&uprobe_exception_nb);
+}
+
+static void __exit exit_uprobes(void)
+{
+}
+
+module_init(init_uprobes);
+module_exit(exit_uprobes);



* [PATCH v5 3.1.0-rc4-tip 15/26]   x86: Define x86_64 specific uprobe_task_arch_info structure
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:03   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:03 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton


On x86_64, we need to handle RIP-relative instructions, which requires us
to save and restore a scratch register.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/uprobes.h |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 7d9d0a5..2ad2c71 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -36,8 +36,13 @@ typedef u8 uprobe_opcode_t;
 struct uprobe_arch_info {
 	unsigned long rip_rela_target_address;
 };
+
+struct uprobe_task_arch_info {
+	unsigned long saved_scratch_register;
+};
 #else
 struct uprobe_arch_info {};
+struct uprobe_task_arch_info {};
 #endif
 struct uprobe;
 extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);



* [PATCH v5 3.1.0-rc4-tip 16/26]   uprobes: Introduce uprobe_task_arch_info structure.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:03   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:03 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML


The uprobe_task_arch_info structure helps save and restore
architecture-specific state at probe-hit, single-step, and
original-instruction-restore time.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index fa7eaba..30576fa 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -31,6 +31,7 @@ struct vm_area_struct;
 #else
 typedef u8 uprobe_opcode_t;
 struct uprobe_arch_info {};
+struct uprobe_task_arch_info {};	/* arch specific task info */
 #define MAX_UINSN_BYTES 4
 #endif
 
@@ -86,6 +87,7 @@ struct uprobe_task {
 	unsigned long vaddr;
 
 	enum uprobe_task_state state;
+	struct uprobe_task_arch_info tskinfo;
 
 	struct uprobe *active_uprobe;
 };



* [PATCH v5 3.1.0-rc4-tip 17/26]   x86: arch specific hooks for pre/post singlestep handling.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:03   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:03 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Andi Kleen, Andrew Morton


Hooks for handling pre-single-step and post-single-step processing.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/x86/include/asm/uprobes.h |    2 +
 arch/x86/kernel/uprobes.c      |  138 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 140 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 2ad2c71..1c30cfd 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -47,6 +47,8 @@ struct uprobe_task_arch_info {};
 struct uprobe;
 extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
 extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
+extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
 extern int uprobe_exception_notify(struct notifier_block *self,
 				       unsigned long val, void *data);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index 8ec759a..da1bc12 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -25,6 +25,7 @@
 #include <linux/sched.h>
 #include <linux/ptrace.h>
 #include <linux/uprobes.h>
+#include <linux/uaccess.h>
 
 #include <linux/kdebug.h>
 #include <asm/insn.h>
@@ -395,6 +396,143 @@ void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
 }
 
 /*
+ * pre_xol - prepare to execute out of line.
+ * @uprobe: the probepoint information.
+ * @regs: reflects the saved user state of the current task.
+ *
+ * If we're emulating a rip-relative instruction, save the contents
+ * of the scratch register and store the target address in that register.
+ *
+ * Returns 0 on success.
+ */
+#ifdef CONFIG_X86_64
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
+
+	regs->ip = current->utask->xol_vaddr;
+	if (uprobe->fixups & UPROBES_FIX_RIP_AX) {
+		tskinfo->saved_scratch_register = regs->ax;
+		regs->ax = current->utask->vaddr;
+		regs->ax += uprobe->arch_info.rip_rela_target_address;
+	} else if (uprobe->fixups & UPROBES_FIX_RIP_CX) {
+		tskinfo->saved_scratch_register = regs->cx;
+		regs->cx = current->utask->vaddr;
+		regs->cx += uprobe->arch_info.rip_rela_target_address;
+	}
+	return 0;
+}
+#else
+int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	regs->ip = current->utask->xol_vaddr;
+	return 0;
+}
+#endif
+
+/*
+ * Called by post_xol() to adjust the return address pushed by a call
+ * instruction executed out of line.
+ */
+static int adjust_ret_addr(unsigned long sp, long correction)
+{
+	int rasize, ncopied;
+	long ra = 0;
+
+	if (is_32bit_app(current))
+		rasize = 4;
+	else
+		rasize = 8;
+	ncopied = copy_from_user(&ra, (void __user *) sp, rasize);
+	if (unlikely(ncopied))
+		goto fail;
+	ra += correction;
+	ncopied = copy_to_user((void __user *) sp, &ra, rasize);
+	if (unlikely(ncopied))
+		goto fail;
+	return 0;
+
+fail:
+	pr_warn_once("uprobes: Failed to adjust return address after"
+		" single-stepping call instruction;"
+		" pid=%d, sp=%#lx\n", current->pid, sp);
+	return -EFAULT;
+}
+
+#ifdef CONFIG_X86_64
+static bool is_riprel_insn(struct uprobe *uprobe)
+{
+	return ((uprobe->fixups &
+			(UPROBES_FIX_RIP_AX | UPROBES_FIX_RIP_CX)) != 0);
+}
+
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+			struct pt_regs *regs, long *correction)
+{
+	if (is_riprel_insn(uprobe)) {
+		struct uprobe_task_arch_info *tskinfo;
+		tskinfo = &current->utask->tskinfo;
+
+		if (uprobe->fixups & UPROBES_FIX_RIP_AX)
+			regs->ax = tskinfo->saved_scratch_register;
+		else
+			regs->cx = tskinfo->saved_scratch_register;
+		/*
+		 * The original instruction includes a displacement, and so
+		 * is 4 bytes longer than what we've just single-stepped.
+		 * Fall through to handle stuff like "jmpq *...(%rip)" and
+		 * "callq *...(%rip)".
+		 */
+		*correction += 4;
+	}
+}
+#else
+static void handle_riprel_post_xol(struct uprobe *uprobe,
+			struct pt_regs *regs, long *correction)
+{
+}
+#endif
+
+/*
+ * Called after single-stepping. To avoid the SMP problems that can
+ * occur when we temporarily put back the original opcode to
+ * single-step, we single-stepped a copy of the instruction.
+ *
+ * This function prepares to resume execution after the single-step.
+ * We have to fix things up as follows:
+ *
+ * Typically, the new ip is relative to the copied instruction.  We need
+ * to make it relative to the original instruction (FIX_IP).  Exceptions
+ * are return instructions and absolute or indirect jump or call instructions.
+ *
+ * If the single-stepped instruction was a call, the return address that
+ * is atop the stack is the address following the copied instruction.  We
+ * need to make it the address following the original instruction (FIX_CALL).
+ *
+ * If the original instruction was a rip-relative instruction such as
+ * "movl %edx,0xnnnn(%rip)", we have instead executed an equivalent
+ * instruction using a scratch register -- e.g., "movl %edx,(%rax)".
+ * We need to restore the contents of the scratch register and adjust
+ * the ip, keeping in mind that the instruction we executed is 4 bytes
+ * shorter than the original instruction (since we squeezed out the offset
+ * field).  (FIX_RIP_AX or FIX_RIP_CX)
+ */
+int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_task *utask = current->utask;
+	int result = 0;
+	long correction;
+
+	correction = (long)(utask->vaddr - utask->xol_vaddr);
+	handle_riprel_post_xol(uprobe, regs, &correction);
+	if (uprobe->fixups & UPROBES_FIX_IP)
+		regs->ip += correction;
+	if (uprobe->fixups & UPROBES_FIX_CALL)
+		result = adjust_ret_addr(regs->sp, correction);
+	return result;
+}
+
+/*
  * Wrapper routine for handling exceptions.
  */
 int uprobe_exception_notify(struct notifier_block *self,


+ * We need to restore the contents of the scratch register and adjust
+ * the ip, keeping in mind that the instruction we executed is 4 bytes
+ * shorter than the original instruction (since we squeezed out the offset
+ * field).  (FIX_RIP_AX or FIX_RIP_CX)
+ */
+int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
+{
+	struct uprobe_task *utask = current->utask;
+	int result = 0;
+	long correction;
+
+	correction = (long)(utask->vaddr - utask->xol_vaddr);
+	handle_riprel_post_xol(uprobe, regs, &correction);
+	if (uprobe->fixups & UPROBES_FIX_IP)
+		regs->ip += correction;
+	if (uprobe->fixups & UPROBES_FIX_CALL)
+		result = adjust_ret_addr(regs->sp, correction);
+	return result;
+}
+
+/*
  * Wrapper routine for handling exceptions.
  */
 int uprobe_exception_notify(struct notifier_block *self,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:03   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:03 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Oleg Nesterov,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML


One page of slots is allocated per mm.
On a probe hit, a free slot is acquired and then released after the
single-step operation completes.

Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/mm_types.h |    2 
 include/linux/uprobes.h  |   22 ++++
 kernel/fork.c            |    2 
 kernel/uprobes.c         |  246 +++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 267 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9aeb64f..aa2e427 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/uprobes.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -351,6 +352,7 @@ struct mm_struct {
 #endif
 #ifdef CONFIG_UPROBES
 	atomic_t mm_uprobes_count;
+	struct uprobes_xol_area *uprobes_xol_area;
 #endif
 };
 
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 30576fa..a407d17 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -92,6 +92,27 @@ struct uprobe_task {
 	struct uprobe *active_uprobe;
 };
 
+/*
+ * On a breakpoint hit, a thread contends for a slot.  It frees the
+ * slot after the single-step completes.  Only a fixed number of
+ * slots is allocated.
+ */
+
+struct uprobes_xol_area {
+	spinlock_t slot_lock;	/* protects bitmap and slot (de)allocation */
+	wait_queue_head_t wq;	/* if all slots are busy */
+	atomic_t slot_count;	/* currently in use slots */
+	unsigned long *bitmap;	/* 0 = free slot */
+	struct page *page;
+
+	/*
+	 * We keep the vma's vm_start rather than a pointer to the vma
+	 * itself.  The probed process or a naughty kernel module could make
+	 * the vma go away, and we must handle that reasonably gracefully.
+	 */
+	unsigned long vaddr;		/* Page(s) of instruction slots */
+};
+
 #ifdef CONFIG_UPROBES
 extern int __weak set_bkpt(struct task_struct *tsk, struct uprobe *uprobe,
 							unsigned long vaddr);
@@ -105,6 +126,7 @@ extern int register_uprobe(struct inode *inode, loff_t offset,
 extern void unregister_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer);
 extern void free_uprobe_utask(struct task_struct *tsk);
+extern void free_uprobes_xol_area(struct mm_struct *mm);
 extern int mmap_uprobe(struct vm_area_struct *vma);
 extern void munmap_uprobe(struct vm_area_struct *vma);
 extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
diff --git a/kernel/fork.c b/kernel/fork.c
index 5914bc1..088a27c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -557,6 +557,7 @@ void mmput(struct mm_struct *mm)
 	might_sleep();
 
 	if (atomic_dec_and_test(&mm->mm_users)) {
+		free_uprobes_xol_area(mm);
 		exit_aio(mm);
 		ksm_exit(mm);
 		khugepaged_exit(mm); /* must run before exit_mmap */
@@ -744,6 +745,7 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
 #ifdef CONFIG_UPROBES
 	atomic_set(&mm->mm_uprobes_count,
 			atomic_read(&oldmm->mm_uprobes_count));
+	mm->uprobes_xol_area = NULL;
 #endif
 
 	if (!mm_init(mm, tsk))
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 083c577..ca1f622 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -31,8 +31,13 @@
 #include <linux/swap.h>		/* try_to_free_swap */
 #include <linux/ptrace.h>	/* user_enable_single_step */
 #include <linux/kdebug.h>	/* notifier mechanism */
+#include <linux/mman.h>		/* PROT_EXEC, MAP_PRIVATE */
+#include <linux/init_task.h>	/* init_cred */
 #include <linux/uprobes.h>
 
+#define UINSNS_PER_PAGE	(PAGE_SIZE/UPROBES_XOL_SLOT_BYTES)
+#define MAX_UPROBES_XOL_SLOTS UINSNS_PER_PAGE
+
 static struct rb_root uprobes_tree = RB_ROOT;
 static DEFINE_SPINLOCK(uprobes_treelock);	/* serialize (un)register */
 static DEFINE_MUTEX(uprobes_mmap_mutex);	/* uprobe->pending_list */
@@ -49,15 +54,21 @@ struct vma_info {
 };
 
 /*
- * valid_vma: Verify if the specified vma is an executable vma
+ * valid_vma: Verify if the specified vma is an executable vma,
+ * but not an XOL vma.
  *	- Return 1 if the specified virtual address is in an
- *	  executable vma.
+ *	  executable vma, but not in an XOL vma.
  */
 static bool valid_vma(struct vm_area_struct *vma)
 {
+	struct uprobes_xol_area *area = vma->vm_mm->uprobes_xol_area;
+
 	if (!vma->vm_file)
 		return false;
 
+	if (area && (area->vaddr == vma->vm_start))
+		return false;
+
 	if ((vma->vm_flags & (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)) ==
 						(VM_READ|VM_EXEC))
 		return true;
@@ -1034,6 +1045,218 @@ void munmap_uprobe(struct vm_area_struct *vma)
 	return;
 }
 
+/* Slot allocation for XOL */
+static int xol_add_vma(struct uprobes_xol_area *area)
+{
+	const struct cred *curr_cred;
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	unsigned long addr;
+	int ret = -ENOMEM;
+
+	mm = get_task_mm(current);
+	if (!mm)
+		return -ESRCH;
+
+	down_write(&mm->mmap_sem);
+	if (mm->uprobes_xol_area) {
+		ret = -EALREADY;
+		goto fail;
+	}
+
+	/*
+	 * Find the end of the top mapping and skip a page.
+	 * If there is no space for PAGE_SIZE above
+	 * that, mmap will ignore our address hint.
+	 *
+	 * override credentials otherwise anonymous memory might
+	 * not be granted execute permission when the selinux
+	 * security hooks have their way.
+	 */
+	vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
+	addr = vma->vm_end + PAGE_SIZE;
+	curr_cred = override_creds(&init_cred);
+	addr = do_mmap_pgoff(NULL, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
+	revert_creds(curr_cred);
+
+	if (addr & ~PAGE_MASK)
+		goto fail;
+	vma = find_vma(mm, addr);
+
+	/* Don't expand vma on mremap(). */
+	vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
+	area->vaddr = vma->vm_start;
+	if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
+				&vma) > 0)
+		ret = 0;
+
+fail:
+	up_write(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
+
+/*
+ * xol_alloc_area - Allocate process's uprobes_xol_area.
+ * This area will be used for storing instructions for execution out of
+ * line.
+ *
+ * Returns the allocated area or NULL.
+ */
+static struct uprobes_xol_area *xol_alloc_area(void)
+{
+	struct uprobes_xol_area *area = NULL;
+
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (unlikely(!area))
+		return NULL;
+
+	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
+								GFP_KERNEL);
+
+	if (!area->bitmap)
+		goto fail;
+
+	init_waitqueue_head(&area->wq);
+	spin_lock_init(&area->slot_lock);
+	if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
+		task_lock(current);
+		if (!current->mm->uprobes_xol_area) {
+			current->mm->uprobes_xol_area = area;
+			task_unlock(current);
+			return area;
+		}
+		task_unlock(current);
+	}
+
+fail:
+	kfree(area->bitmap);
+	kfree(area);
+	return current->mm->uprobes_xol_area;
+}
+
+/*
+ * free_uprobes_xol_area - Free the area allocated for slots.
+ */
+void free_uprobes_xol_area(struct mm_struct *mm)
+{
+	struct uprobes_xol_area *area = mm->uprobes_xol_area;
+
+	if (!area)
+		return;
+
+	put_page(area->page);
+	kfree(area->bitmap);
+	kfree(area);
+}
+
+static void xol_wait_event(struct uprobes_xol_area *area)
+{
+	if (atomic_read(&area->slot_count) >= UINSNS_PER_PAGE)
+		wait_event(area->wq,
+			(atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
+}
+
+/*
+ * xol_take_insn_slot - search for a free slot, waiting if necessary.
+ */
+static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
+{
+	unsigned long slot_addr, flags;
+	int slot_nr;
+
+	do {
+		spin_lock_irqsave(&area->slot_lock, flags);
+		slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
+		if (slot_nr < UINSNS_PER_PAGE) {
+			__set_bit(slot_nr, area->bitmap);
+			slot_addr = area->vaddr +
+					(slot_nr * UPROBES_XOL_SLOT_BYTES);
+			atomic_inc(&area->slot_count);
+		}
+		spin_unlock_irqrestore(&area->slot_lock, flags);
+		if (slot_nr >= UINSNS_PER_PAGE)
+			xol_wait_event(area);
+
+	} while (slot_nr >= UINSNS_PER_PAGE);
+
+	return slot_addr;
+}
+
+/*
+ * xol_get_insn_slot - allocate an instruction slot if the current
+ * thread does not already hold one.
+ * Returns the allocated slot address or 0.
+ */
+static unsigned long xol_get_insn_slot(struct uprobe *uprobe,
+					unsigned long slot_addr)
+{
+	struct uprobes_xol_area *area = current->mm->uprobes_xol_area;
+	unsigned long offset;
+	void *vaddr;
+
+	if (!area) {
+		area = xol_alloc_area();
+		if (!area)
+			return 0;
+	}
+	current->utask->xol_vaddr = xol_take_insn_slot(area);
+
+	/*
+	 * Initialize the slot only if xol_vaddr points to a valid
+	 * instruction slot.
+	 */
+	if (unlikely(!current->utask->xol_vaddr))
+		return 0;
+
+	current->utask->vaddr = slot_addr;
+	offset = current->utask->xol_vaddr & ~PAGE_MASK;
+	vaddr = kmap_atomic(area->page);
+	memcpy(vaddr + offset, uprobe->insn, MAX_UINSN_BYTES);
+	kunmap_atomic(vaddr);
+	return current->utask->xol_vaddr;
+}
+
+/*
+ * xol_free_insn_slot - If a slot was earlier allocated by
+ * xol_get_insn_slot(), make it available for subsequent
+ * requests.
+ */
+static void xol_free_insn_slot(struct task_struct *tsk)
+{
+	struct uprobes_xol_area *area;
+	unsigned long vma_end;
+	unsigned long slot_addr;
+
+	if (!tsk->mm || !tsk->mm->uprobes_xol_area || !tsk->utask)
+		return;
+
+	slot_addr = tsk->utask->xol_vaddr;
+
+	if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
+		return;
+
+	area = tsk->mm->uprobes_xol_area;
+	vma_end = area->vaddr + PAGE_SIZE;
+	if (area->vaddr <= slot_addr && slot_addr < vma_end) {
+		int slot_nr;
+		unsigned long offset = slot_addr - area->vaddr;
+		unsigned long flags;
+
+		slot_nr = offset / UPROBES_XOL_SLOT_BYTES;
+		if (slot_nr >= UINSNS_PER_PAGE)
+			return;
+
+		spin_lock_irqsave(&area->slot_lock, flags);
+		__clear_bit(slot_nr, area->bitmap);
+		spin_unlock_irqrestore(&area->slot_lock, flags);
+		atomic_dec(&area->slot_count);
+		if (waitqueue_active(&area->wq))
+			wake_up(&area->wq);
+		tsk->utask->xol_vaddr = 0;
+	}
+}
+
 /**
  * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
  * @regs: Reflects the saved state of the task after it has hit a breakpoint
@@ -1059,6 +1282,7 @@ void free_uprobe_utask(struct task_struct *tsk)
 	if (utask->active_uprobe)
 		put_uprobe(utask->active_uprobe);
 
+	xol_free_insn_slot(tsk);
 	kfree(utask);
 	tsk->utask = NULL;
 }
@@ -1088,7 +1312,10 @@ static struct uprobe_task *add_utask(void)
 static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
 				unsigned long vaddr)
 {
-	/* TODO: Yet to be implemented */
+	if (xol_get_insn_slot(uprobe, vaddr) && !pre_xol(uprobe, regs)) {
+		set_instruction_pointer(regs, current->utask->xol_vaddr);
+		return 0;
+	}
 	return -EFAULT;
 }
 
@@ -1098,8 +1325,16 @@ static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
  */
 static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
 {
-	/* TODO: Yet to be implemented */
-	return false;
+	unsigned long vaddr = instruction_pointer(regs);
+
+	/*
+	 * If we have single-stepped out of line, the instruction
+	 * pointer can no longer equal the virtual address of the
+	 * XOL slot.
+	 */
+	if (vaddr == current->utask->xol_vaddr)
+		return false;
+	post_xol(uprobe, regs);
+	return true;
 }
 
 /*
@@ -1154,6 +1389,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			utask->active_uprobe = NULL;
 			utask->state = UTASK_RUNNING;
 			user_disable_single_step(current);
+			xol_free_insn_slot(current);
 
 			/* TODO Stop queueing signals. */
 		}

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 19/26]   tracing: Extract out common code for kprobes/uprobes traceevents.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:03   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:03 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton


Move the parts of trace_kprobe.c that can be shared with the upcoming
trace_uprobe.c into kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/trace_kprobe.c |  894 +------------------------------------------
 kernel/trace/trace_probe.c  |  778 +++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_probe.h  |  160 ++++++++
 5 files changed, 963 insertions(+), 874 deletions(-)
 create mode 100644 kernel/trace/trace_probe.c
 create mode 100644 kernel/trace/trace_probe.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..520106a 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -373,6 +373,7 @@ config KPROBE_EVENT
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
 	bool "Enable kprobes-based dynamic events"
 	select TRACING
+	select PROBE_EVENTS
 	default y
 	help
 	  This allows the user to add tracing events (similar to tracepoints)
@@ -385,6 +386,9 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config PROBE_EVENTS
+	def_bool n
+
 config DYNAMIC_FTRACE
 	bool "enable/disable ftrace tracepoints dynamically"
 	depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 761c510..692223a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -56,5 +56,6 @@ obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
+obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 5fb3697..d5f4e51 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,547 +19,15 @@
 
 #include <linux/module.h>
 #include <linux/uaccess.h>
-#include <linux/kprobes.h>
-#include <linux/seq_file.h>
-#include <linux/slab.h>
-#include <linux/smp.h>
-#include <linux/debugfs.h>
-#include <linux/types.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/ptrace.h>
-#include <linux/perf_event.h>
-#include <linux/stringify.h>
-#include <linux/limits.h>
-#include <asm/bitsperlong.h>
-
-#include "trace.h"
-#include "trace_output.h"
-
-#define MAX_TRACE_ARGS 128
-#define MAX_ARGSTR_LEN 63
-#define MAX_EVENT_NAME_LEN 64
-#define MAX_STRING_SIZE PATH_MAX
-#define KPROBE_EVENT_SYSTEM "kprobes"
-
-/* Reserved field names */
-#define FIELD_STRING_IP "__probe_ip"
-#define FIELD_STRING_RETIP "__probe_ret_ip"
-#define FIELD_STRING_FUNC "__probe_func"
-
-const char *reserved_field_names[] = {
-	"common_type",
-	"common_flags",
-	"common_preempt_count",
-	"common_pid",
-	"common_tgid",
-	FIELD_STRING_IP,
-	FIELD_STRING_RETIP,
-	FIELD_STRING_FUNC,
-};
-
-/* Printing function type */
-typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
-				 void *);
-#define PRINT_TYPE_FUNC_NAME(type)	print_type_##type
-#define PRINT_TYPE_FMT_NAME(type)	print_type_format_##type
-
-/* Printing  in basic type function template */
-#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast)			\
-static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s,	\
-						const char *name,	\
-						void *data, void *ent)\
-{									\
-	return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
-}									\
-static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
-
-DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
-
-/* data_rloc: data relative location, compatible with u32 */
-#define make_data_rloc(len, roffs)	\
-	(((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
-#define get_rloc_len(dl)	((u32)(dl) >> 16)
-#define get_rloc_offs(dl)	((u32)(dl) & 0xffff)
-
-static inline void *get_rloc_data(u32 *dl)
-{
-	return (u8 *)dl + get_rloc_offs(*dl);
-}
-
-/* For data_loc conversion */
-static inline void *get_loc_data(u32 *dl, void *ent)
-{
-	return (u8 *)ent + get_rloc_offs(*dl);
-}
-
-/*
- * Convert data_rloc to data_loc:
- *  data_rloc stores the offset from data_rloc itself, but data_loc
- *  stores the offset from event entry.
- */
-#define convert_rloc_to_loc(dl, offs)	((u32)(dl) + (offs))
-
-/* For defining macros, define string/string_size types */
-typedef u32 string;
-typedef u32 string_size;
-
-/* Print type function for string type */
-static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
-						  const char *name,
-						  void *data, void *ent)
-{
-	int len = *(u32 *)data >> 16;
-
-	if (!len)
-		return trace_seq_printf(s, " %s=(fault)", name);
-	else
-		return trace_seq_printf(s, " %s=\"%s\"", name,
-					(const char *)get_loc_data(data, ent));
-}
-static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
-
-/* Data fetch function type */
-typedef	void (*fetch_func_t)(struct pt_regs *, void *, void *);
-
-struct fetch_param {
-	fetch_func_t	fn;
-	void *data;
-};
-
-static __kprobes void call_fetch(struct fetch_param *fprm,
-				 struct pt_regs *regs, void *dest)
-{
-	return fprm->fn(regs, fprm->data, dest);
-}
-
-#define FETCH_FUNC_NAME(method, type)	fetch_##method##_##type
-/*
- * Define macro for basic types - we don't need to define s* types, because
- * we have to care only about bitwidth at recording time.
- */
-#define DEFINE_BASIC_FETCH_FUNCS(method) \
-DEFINE_FETCH_##method(u8)		\
-DEFINE_FETCH_##method(u16)		\
-DEFINE_FETCH_##method(u32)		\
-DEFINE_FETCH_##method(u64)
-
-#define CHECK_FETCH_FUNCS(method, fn)			\
-	(((FETCH_FUNC_NAME(method, u8) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u16) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u32) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u64) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, string) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, string_size) == fn)) \
-	 && (fn != NULL))
-
-/* Data fetch function templates */
-#define DEFINE_FETCH_reg(type)						\
-static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs,	\
-					void *offset, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_get_register(regs,			\
-				(unsigned int)((unsigned long)offset));	\
-}
-DEFINE_BASIC_FETCH_FUNCS(reg)
-/* No string on the register */
-#define fetch_reg_string NULL
-#define fetch_reg_string_size NULL
-
-#define DEFINE_FETCH_stack(type)					\
-static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
-					  void *offset, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_get_kernel_stack_nth(regs,		\
-				(unsigned int)((unsigned long)offset));	\
-}
-DEFINE_BASIC_FETCH_FUNCS(stack)
-/* No string on the stack entry */
-#define fetch_stack_string NULL
-#define fetch_stack_string_size NULL
-
-#define DEFINE_FETCH_retval(type)					\
-static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
-					  void *dummy, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_return_value(regs);			\
-}
-DEFINE_BASIC_FETCH_FUNCS(retval)
-/* No string on the retval */
-#define fetch_retval_string NULL
-#define fetch_retval_string_size NULL
-
-#define DEFINE_FETCH_memory(type)					\
-static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
-					  void *addr, void *dest)	\
-{									\
-	type retval;							\
-	if (probe_kernel_address(addr, retval))				\
-		*(type *)dest = 0;					\
-	else								\
-		*(type *)dest = retval;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(memory)
-/*
- * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
- * length and relative data location.
- */
-static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
-						      void *addr, void *dest)
-{
-	long ret;
-	int maxlen = get_rloc_len(*(u32 *)dest);
-	u8 *dst = get_rloc_data(dest);
-	u8 *src = addr;
-	mm_segment_t old_fs = get_fs();
-	if (!maxlen)
-		return;
-	/*
-	 * Try to get string again, since the string can be changed while
-	 * probing.
-	 */
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-	do
-		ret = __copy_from_user_inatomic(dst++, src++, 1);
-	while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
-	dst[-1] = '\0';
-	pagefault_enable();
-	set_fs(old_fs);
-
-	if (ret < 0) {	/* Failed to fetch string */
-		((u8 *)get_rloc_data(dest))[0] = '\0';
-		*(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
-	} else
-		*(u32 *)dest = make_data_rloc(src - (u8 *)addr,
-					      get_rloc_offs(*(u32 *)dest));
-}
-/* Return the length of string -- including null terminal byte */
-static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
-							void *addr, void *dest)
-{
-	int ret, len = 0;
-	u8 c;
-	mm_segment_t old_fs = get_fs();
-
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-	do {
-		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
-		len++;
-	} while (c && ret == 0 && len < MAX_STRING_SIZE);
-	pagefault_enable();
-	set_fs(old_fs);
-
-	if (ret < 0)	/* Failed to check the length */
-		*(u32 *)dest = 0;
-	else
-		*(u32 *)dest = len;
-}
-
-/* Memory fetching by symbol */
-struct symbol_cache {
-	char *symbol;
-	long offset;
-	unsigned long addr;
-};
-
-static unsigned long update_symbol_cache(struct symbol_cache *sc)
-{
-	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
-	if (sc->addr)
-		sc->addr += sc->offset;
-	return sc->addr;
-}
-
-static void free_symbol_cache(struct symbol_cache *sc)
-{
-	kfree(sc->symbol);
-	kfree(sc);
-}
-
-static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
-{
-	struct symbol_cache *sc;
-
-	if (!sym || strlen(sym) == 0)
-		return NULL;
-	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
-	if (!sc)
-		return NULL;
-
-	sc->symbol = kstrdup(sym, GFP_KERNEL);
-	if (!sc->symbol) {
-		kfree(sc);
-		return NULL;
-	}
-	sc->offset = offset;
 
-	update_symbol_cache(sc);
-	return sc;
-}
-
-#define DEFINE_FETCH_symbol(type)					\
-static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
-					  void *data, void *dest)	\
-{									\
-	struct symbol_cache *sc = data;					\
-	if (sc->addr)							\
-		fetch_memory_##type(regs, (void *)sc->addr, dest);	\
-	else								\
-		*(type *)dest = 0;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(symbol)
-DEFINE_FETCH_symbol(string)
-DEFINE_FETCH_symbol(string_size)
-
-/* Dereference memory access function */
-struct deref_fetch_param {
-	struct fetch_param orig;
-	long offset;
-};
-
-#define DEFINE_FETCH_deref(type)					\
-static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
-					    void *data, void *dest)	\
-{									\
-	struct deref_fetch_param *dprm = data;				\
-	unsigned long addr;						\
-	call_fetch(&dprm->orig, regs, &addr);				\
-	if (addr) {							\
-		addr += dprm->offset;					\
-		fetch_memory_##type(regs, (void *)addr, dest);		\
-	} else								\
-		*(type *)dest = 0;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(deref)
-DEFINE_FETCH_deref(string)
-DEFINE_FETCH_deref(string_size)
-
-static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data)
-{
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		update_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		update_symbol_cache(data->orig.data);
-}
-
-static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
-{
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		free_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		free_symbol_cache(data->orig.data);
-	kfree(data);
-}
-
-/* Bitfield fetch function */
-struct bitfield_fetch_param {
-	struct fetch_param orig;
-	unsigned char hi_shift;
-	unsigned char low_shift;
-};
+#include "trace_probe.h"
 
-#define DEFINE_FETCH_bitfield(type)					\
-static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
-					    void *data, void *dest)	\
-{									\
-	struct bitfield_fetch_param *bprm = data;			\
-	type buf = 0;							\
-	call_fetch(&bprm->orig, regs, &buf);				\
-	if (buf) {							\
-		buf <<= bprm->hi_shift;					\
-		buf >>= bprm->low_shift;				\
-	}								\
-	*(type *)dest = buf;						\
-}
-DEFINE_BASIC_FETCH_FUNCS(bitfield)
-#define fetch_bitfield_string NULL
-#define fetch_bitfield_string_size NULL
-
-static __kprobes void
-update_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
-	/*
-	 * Don't check the bitfield itself, because this must be the
-	 * last fetch function.
-	 */
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		update_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		update_symbol_cache(data->orig.data);
-}
-
-static __kprobes void
-free_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
-	/*
-	 * Don't check the bitfield itself, because this must be the
-	 * last fetch function.
-	 */
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		free_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		free_symbol_cache(data->orig.data);
-	kfree(data);
-}
-
-/* Default (unsigned long) fetch type */
-#define __DEFAULT_FETCH_TYPE(t) u##t
-#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
-#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
-#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
-
-/* Fetch types */
-enum {
-	FETCH_MTD_reg = 0,
-	FETCH_MTD_stack,
-	FETCH_MTD_retval,
-	FETCH_MTD_memory,
-	FETCH_MTD_symbol,
-	FETCH_MTD_deref,
-	FETCH_MTD_bitfield,
-	FETCH_MTD_END,
-};
-
-#define ASSIGN_FETCH_FUNC(method, type)	\
-	[FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
-
-#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype)	\
-	{.name = _name,				\
-	 .size = _size,					\
-	 .is_signed = sign,				\
-	 .print = PRINT_TYPE_FUNC_NAME(ptype),		\
-	 .fmt = PRINT_TYPE_FMT_NAME(ptype),		\
-	 .fmttype = _fmttype,				\
-	 .fetch = {					\
-ASSIGN_FETCH_FUNC(reg, ftype),				\
-ASSIGN_FETCH_FUNC(stack, ftype),			\
-ASSIGN_FETCH_FUNC(retval, ftype),			\
-ASSIGN_FETCH_FUNC(memory, ftype),			\
-ASSIGN_FETCH_FUNC(symbol, ftype),			\
-ASSIGN_FETCH_FUNC(deref, ftype),			\
-ASSIGN_FETCH_FUNC(bitfield, ftype),			\
-	  }						\
-	}
-
-#define ASSIGN_FETCH_TYPE(ptype, ftype, sign)			\
-	__ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
-
-#define FETCH_TYPE_STRING 0
-#define FETCH_TYPE_STRSIZE 1
-
-/* Fetch type information table */
-static const struct fetch_type {
-	const char	*name;		/* Name of type */
-	size_t		size;		/* Byte size of type */
-	int		is_signed;	/* Signed flag */
-	print_type_func_t	print;	/* Print functions */
-	const char	*fmt;		/* Fromat string */
-	const char	*fmttype;	/* Name in format file */
-	/* Fetch functions */
-	fetch_func_t	fetch[FETCH_MTD_END];
-} fetch_type_table[] = {
-	/* Special types */
-	[FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
-					sizeof(u32), 1, "__data_loc char[]"),
-	[FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
-					string_size, sizeof(u32), 0, "u32"),
-	/* Basic types */
-	ASSIGN_FETCH_TYPE(u8,  u8,  0),
-	ASSIGN_FETCH_TYPE(u16, u16, 0),
-	ASSIGN_FETCH_TYPE(u32, u32, 0),
-	ASSIGN_FETCH_TYPE(u64, u64, 0),
-	ASSIGN_FETCH_TYPE(s8,  u8,  1),
-	ASSIGN_FETCH_TYPE(s16, u16, 1),
-	ASSIGN_FETCH_TYPE(s32, u32, 1),
-	ASSIGN_FETCH_TYPE(s64, u64, 1),
-};
-
-static const struct fetch_type *find_fetch_type(const char *type)
-{
-	int i;
-
-	if (!type)
-		type = DEFAULT_FETCH_TYPE_STR;
-
-	/* Special case: bitfield */
-	if (*type == 'b') {
-		unsigned long bs;
-		type = strchr(type, '/');
-		if (!type)
-			goto fail;
-		type++;
-		if (strict_strtoul(type, 0, &bs))
-			goto fail;
-		switch (bs) {
-		case 8:
-			return find_fetch_type("u8");
-		case 16:
-			return find_fetch_type("u16");
-		case 32:
-			return find_fetch_type("u32");
-		case 64:
-			return find_fetch_type("u64");
-		default:
-			goto fail;
-		}
-	}
-
-	for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
-		if (strcmp(type, fetch_type_table[i].name) == 0)
-			return &fetch_type_table[i];
-fail:
-	return NULL;
-}
-
-/* Special function : only accept unsigned long */
-static __kprobes void fetch_stack_address(struct pt_regs *regs,
-					  void *dummy, void *dest)
-{
-	*(unsigned long *)dest = kernel_stack_pointer(regs);
-}
-
-static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
-					    fetch_func_t orig_fn)
-{
-	int i;
-
-	if (type != &fetch_type_table[FETCH_TYPE_STRING])
-		return NULL;	/* Only string type needs size function */
-	for (i = 0; i < FETCH_MTD_END; i++)
-		if (type->fetch[i] == orig_fn)
-			return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
-
-	WARN_ON(1);	/* This should not happen */
-	return NULL;
-}
+#define KPROBE_EVENT_SYSTEM "kprobes"
 
 /**
  * Kprobe event core functions
  */
 
-struct probe_arg {
-	struct fetch_param	fetch;
-	struct fetch_param	fetch_size;
-	unsigned int		offset;	/* Offset from argument entry */
-	const char		*name;	/* Name of this argument */
-	const char		*comm;	/* Command of this argument */
-	const struct fetch_type	*type;	/* Type of this argument */
-};
-
-/* Flags for trace_probe */
-#define TP_FLAG_TRACE	1
-#define TP_FLAG_PROFILE	2
-#define TP_FLAG_REGISTERED 4
-
 struct trace_probe {
 	struct list_head	list;
 	struct kretprobe	rp;	/* Use rp.kp for kprobe use */
@@ -631,18 +99,6 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs);
 static int kretprobe_dispatcher(struct kretprobe_instance *ri,
 				struct pt_regs *regs);
 
-/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
-{
-	if (!isalpha(*name) && *name != '_')
-		return 0;
-	while (*++name != '\0') {
-		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
-			return 0;
-	}
-	return 1;
-}
-
 /*
  * Allocate new trace_probe and initialize it (including kprobes).
  */
@@ -651,7 +107,7 @@ static struct trace_probe *alloc_trace_probe(const char *group,
 					     void *addr,
 					     const char *symbol,
 					     unsigned long offs,
-					     int nargs, int is_return)
+					     int nargs, bool is_return)
 {
 	struct trace_probe *tp;
 	int ret = -ENOMEM;
@@ -702,34 +158,12 @@ static struct trace_probe *alloc_trace_probe(const char *group,
 	return ERR_PTR(ret);
 }
 
-static void update_probe_arg(struct probe_arg *arg)
-{
-	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
-		update_bitfield_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
-		update_deref_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
-		update_symbol_cache(arg->fetch.data);
-}
-
-static void free_probe_arg(struct probe_arg *arg)
-{
-	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
-		free_bitfield_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
-		free_deref_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
-		free_symbol_cache(arg->fetch.data);
-	kfree(arg->name);
-	kfree(arg->comm);
-}
-
 static void free_trace_probe(struct trace_probe *tp)
 {
 	int i;
 
 	for (i = 0; i < tp->nr_args; i++)
-		free_probe_arg(&tp->args[i]);
+		traceprobe_free_probe_arg(&tp->args[i]);
 
 	kfree(tp->call.class->system);
 	kfree(tp->call.name);
@@ -787,7 +221,7 @@ static int __register_trace_probe(struct trace_probe *tp)
 		return -EINVAL;
 
 	for (i = 0; i < tp->nr_args; i++)
-		update_probe_arg(&tp->args[i]);
+		traceprobe_update_arg(&tp->args[i]);
 
 	/* Set/clear disabled flag according to tp->flag */
 	if (trace_probe_is_enabled(tp))
@@ -910,227 +344,6 @@ static struct notifier_block trace_probe_module_nb = {
 	.priority = 1	/* Invoked after kprobe module callback */
 };
 
-/* Split symbol and offset. */
-static int split_symbol_offset(char *symbol, unsigned long *offset)
-{
-	char *tmp;
-	int ret;
-
-	if (!offset)
-		return -EINVAL;
-
-	tmp = strchr(symbol, '+');
-	if (tmp) {
-		/* skip sign because strict_strtol doesn't accept '+' */
-		ret = strict_strtoul(tmp + 1, 0, offset);
-		if (ret)
-			return ret;
-		*tmp = '\0';
-	} else
-		*offset = 0;
-	return 0;
-}
-
-#define PARAM_MAX_ARGS 16
-#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
-
-static int parse_probe_vars(char *arg, const struct fetch_type *t,
-			    struct fetch_param *f, int is_return)
-{
-	int ret = 0;
-	unsigned long param;
-
-	if (strcmp(arg, "retval") == 0) {
-		if (is_return)
-			f->fn = t->fetch[FETCH_MTD_retval];
-		else
-			ret = -EINVAL;
-	} else if (strncmp(arg, "stack", 5) == 0) {
-		if (arg[5] == '\0') {
-			if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
-				f->fn = fetch_stack_address;
-			else
-				ret = -EINVAL;
-		} else if (isdigit(arg[5])) {
-			ret = strict_strtoul(arg + 5, 10, &param);
-			if (ret || param > PARAM_MAX_STACK)
-				ret = -EINVAL;
-			else {
-				f->fn = t->fetch[FETCH_MTD_stack];
-				f->data = (void *)param;
-			}
-		} else
-			ret = -EINVAL;
-	} else
-		ret = -EINVAL;
-	return ret;
-}
-
-/* Recursive argument parser */
-static int __parse_probe_arg(char *arg, const struct fetch_type *t,
-			     struct fetch_param *f, int is_return)
-{
-	int ret = 0;
-	unsigned long param;
-	long offset;
-	char *tmp;
-
-	switch (arg[0]) {
-	case '$':
-		ret = parse_probe_vars(arg + 1, t, f, is_return);
-		break;
-	case '%':	/* named register */
-		ret = regs_query_register_offset(arg + 1);
-		if (ret >= 0) {
-			f->fn = t->fetch[FETCH_MTD_reg];
-			f->data = (void *)(unsigned long)ret;
-			ret = 0;
-		}
-		break;
-	case '@':	/* memory or symbol */
-		if (isdigit(arg[1])) {
-			ret = strict_strtoul(arg + 1, 0, &param);
-			if (ret)
-				break;
-			f->fn = t->fetch[FETCH_MTD_memory];
-			f->data = (void *)param;
-		} else {
-			ret = split_symbol_offset(arg + 1, &offset);
-			if (ret)
-				break;
-			f->data = alloc_symbol_cache(arg + 1, offset);
-			if (f->data)
-				f->fn = t->fetch[FETCH_MTD_symbol];
-		}
-		break;
-	case '+':	/* deref memory */
-		arg++;	/* Skip '+', because strict_strtol() rejects it. */
-	case '-':
-		tmp = strchr(arg, '(');
-		if (!tmp)
-			break;
-		*tmp = '\0';
-		ret = strict_strtol(arg, 0, &offset);
-		if (ret)
-			break;
-		arg = tmp + 1;
-		tmp = strrchr(arg, ')');
-		if (tmp) {
-			struct deref_fetch_param *dprm;
-			const struct fetch_type *t2 = find_fetch_type(NULL);
-			*tmp = '\0';
-			dprm = kzalloc(sizeof(struct deref_fetch_param),
-				       GFP_KERNEL);
-			if (!dprm)
-				return -ENOMEM;
-			dprm->offset = offset;
-			ret = __parse_probe_arg(arg, t2, &dprm->orig,
-						is_return);
-			if (ret)
-				kfree(dprm);
-			else {
-				f->fn = t->fetch[FETCH_MTD_deref];
-				f->data = (void *)dprm;
-			}
-		}
-		break;
-	}
-	if (!ret && !f->fn) {	/* Parsed, but do not find fetch method */
-		pr_info("%s type has no corresponding fetch method.\n",
-			t->name);
-		ret = -EINVAL;
-	}
-	return ret;
-}
-
-#define BYTES_TO_BITS(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
-
-/* Bitfield type needs to be parsed into a fetch function */
-static int __parse_bitfield_probe_arg(const char *bf,
-				      const struct fetch_type *t,
-				      struct fetch_param *f)
-{
-	struct bitfield_fetch_param *bprm;
-	unsigned long bw, bo;
-	char *tail;
-
-	if (*bf != 'b')
-		return 0;
-
-	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
-	if (!bprm)
-		return -ENOMEM;
-	bprm->orig = *f;
-	f->fn = t->fetch[FETCH_MTD_bitfield];
-	f->data = (void *)bprm;
-
-	bw = simple_strtoul(bf + 1, &tail, 0);	/* Use simple one */
-	if (bw == 0 || *tail != '@')
-		return -EINVAL;
-
-	bf = tail + 1;
-	bo = simple_strtoul(bf, &tail, 0);
-	if (tail == bf || *tail != '/')
-		return -EINVAL;
-
-	bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
-	bprm->low_shift = bprm->hi_shift + bo;
-	return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
-}
-
-/* String length checking wrapper */
-static int parse_probe_arg(char *arg, struct trace_probe *tp,
-			   struct probe_arg *parg, int is_return)
-{
-	const char *t;
-	int ret;
-
-	if (strlen(arg) > MAX_ARGSTR_LEN) {
-		pr_info("Argument is too long.: %s\n",  arg);
-		return -ENOSPC;
-	}
-	parg->comm = kstrdup(arg, GFP_KERNEL);
-	if (!parg->comm) {
-		pr_info("Failed to allocate memory for command '%s'.\n", arg);
-		return -ENOMEM;
-	}
-	t = strchr(parg->comm, ':');
-	if (t) {
-		arg[t - parg->comm] = '\0';
-		t++;
-	}
-	parg->type = find_fetch_type(t);
-	if (!parg->type) {
-		pr_info("Unsupported type: %s\n", t);
-		return -EINVAL;
-	}
-	parg->offset = tp->size;
-	tp->size += parg->type->size;
-	ret = __parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
-	if (ret >= 0 && t != NULL)
-		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
-	if (ret >= 0) {
-		parg->fetch_size.fn = get_fetch_size_function(parg->type,
-							      parg->fetch.fn);
-		parg->fetch_size.data = parg->fetch.data;
-	}
-	return ret;
-}
-
-/* Return 1 if name is reserved or already used by another argument */
-static int conflict_field_name(const char *name,
-			       struct probe_arg *args, int narg)
-{
-	int i;
-	for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
-		if (strcmp(reserved_field_names[i], name) == 0)
-			return 1;
-	for (i = 0; i < narg; i++)
-		if (strcmp(args[i].name, name) == 0)
-			return 1;
-	return 0;
-}
-
 static int create_trace_probe(int argc, char **argv)
 {
 	/*
@@ -1153,7 +366,7 @@ static int create_trace_probe(int argc, char **argv)
 	 */
 	struct trace_probe *tp;
 	int i, ret = 0;
-	int is_return = 0, is_delete = 0;
+	bool is_return = false, is_delete = false;
 	char *symbol = NULL, *event = NULL, *group = NULL;
 	char *arg;
 	unsigned long offset = 0;
@@ -1162,11 +375,11 @@ static int create_trace_probe(int argc, char **argv)
 
 	/* argc must be >= 1 */
 	if (argv[0][0] == 'p')
-		is_return = 0;
+		is_return = false;
 	else if (argv[0][0] == 'r')
-		is_return = 1;
+		is_return = true;
 	else if (argv[0][0] == '-')
-		is_delete = 1;
+		is_delete = true;
 	else {
 		pr_info("Probe definition must be started with 'p', 'r' or"
 			" '-'.\n");
@@ -1230,7 +443,7 @@ static int create_trace_probe(int argc, char **argv)
 		/* a symbol specified */
 		symbol = argv[1];
 		/* TODO: support .init module functions */
-		ret = split_symbol_offset(symbol, &offset);
+		ret = traceprobe_split_symbol_offset(symbol, &offset);
 		if (ret) {
 			pr_info("Failed to parse symbol.\n");
 			return ret;
@@ -1292,7 +505,8 @@ static int create_trace_probe(int argc, char **argv)
 			goto error;
 		}
 
-		if (conflict_field_name(tp->args[i].name, tp->args, i)) {
+		if (traceprobe_conflict_field_name(tp->args[i].name,
+							tp->args, i)) {
 			pr_info("Argument[%d] name '%s' conflicts with "
 				"another field.\n", i, argv[i]);
 			ret = -EINVAL;
@@ -1300,7 +514,8 @@ static int create_trace_probe(int argc, char **argv)
 		}
 
 		/* Parse fetch argument */
-		ret = parse_probe_arg(arg, tp, &tp->args[i], is_return);
+		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+								is_return);
 		if (ret) {
 			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
 			goto error;
@@ -1387,70 +602,11 @@ static int probes_open(struct inode *inode, struct file *file)
 	return seq_open(file, &probes_seq_op);
 }
 
-static int command_trace_probe(const char *buf)
-{
-	char **argv;
-	int argc = 0, ret = 0;
-
-	argv = argv_split(GFP_KERNEL, buf, &argc);
-	if (!argv)
-		return -ENOMEM;
-
-	if (argc)
-		ret = create_trace_probe(argc, argv);
-
-	argv_free(argv);
-	return ret;
-}
-
-#define WRITE_BUFSIZE 4096
-
 static ssize_t probes_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
-	char *kbuf, *tmp;
-	int ret;
-	size_t done;
-	size_t size;
-
-	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
-	if (!kbuf)
-		return -ENOMEM;
-
-	ret = done = 0;
-	while (done < count) {
-		size = count - done;
-		if (size >= WRITE_BUFSIZE)
-			size = WRITE_BUFSIZE - 1;
-		if (copy_from_user(kbuf, buffer + done, size)) {
-			ret = -EFAULT;
-			goto out;
-		}
-		kbuf[size] = '\0';
-		tmp = strchr(kbuf, '\n');
-		if (tmp) {
-			*tmp = '\0';
-			size = tmp - kbuf + 1;
-		} else if (done + size < count) {
-			pr_warning("Line length is too long: "
-				   "Should be less than %d.", WRITE_BUFSIZE);
-			ret = -EINVAL;
-			goto out;
-		}
-		done += size;
-		/* Remove comments */
-		tmp = strchr(kbuf, '#');
-		if (tmp)
-			*tmp = '\0';
-
-		ret = command_trace_probe(kbuf);
-		if (ret)
-			goto out;
-	}
-	ret = done;
-out:
-	kfree(kbuf);
-	return ret;
+	return traceprobe_probes_write(file, buffer, count, ppos,
+			create_trace_probe);
 }
 
 static const struct file_operations kprobe_events_ops = {
@@ -1686,16 +842,6 @@ print_kretprobe_event(struct trace_iterator *iter, int flags,
 	return TRACE_TYPE_PARTIAL_LINE;
 }
 
-#undef DEFINE_FIELD
-#define DEFINE_FIELD(type, item, name, is_signed)			\
-	do {								\
-		ret = trace_define_field(event_call, #type, name,	\
-					 offsetof(typeof(field), item),	\
-					 sizeof(field.item), is_signed, \
-					 FILTER_OTHER);			\
-		if (ret)						\
-			return ret;					\
-	} while (0)
 
 static int kprobe_event_define_fields(struct ftrace_event_call *event_call)
 {
@@ -2020,7 +1166,7 @@ static __init int kprobe_trace_self_tests_init(void)
 
 	pr_info("Testing kprobe tracing: ");
 
-	ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target "
+	ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target "
 				  "$stack $stack0 +0($stack)");
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on probing function entry.\n");
@@ -2035,7 +1181,7 @@ static __init int kprobe_trace_self_tests_init(void)
 			enable_trace_probe(tp, TP_FLAG_TRACE);
 	}
 
-	ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target "
+	ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target "
 				  "$retval");
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on probing function return.\n");
@@ -2055,13 +1201,13 @@ static __init int kprobe_trace_self_tests_init(void)
 
 	ret = target(1, 2, 3, 4, 5, 6);
 
-	ret = command_trace_probe("-:testprobe");
+	ret = traceprobe_command("-:testprobe");
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on deleting a probe.\n");
 		warn++;
 	}
 
-	ret = command_trace_probe("-:testprobe2");
+	ret = traceprobe_command("-:testprobe2");
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on deleting a probe.\n");
 		warn++;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
new file mode 100644
index 0000000..52580b5
--- /dev/null
+++ b/kernel/trace/trace_probe.c
@@ -0,0 +1,778 @@
+/*
+ * Common code for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:     Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
+ */
+
+#include "trace_probe.h"
+
+const char *reserved_field_names[] = {
+	"common_type",
+	"common_flags",
+	"common_preempt_count",
+	"common_pid",
+	"common_tgid",
+	FIELD_STRING_IP,
+	FIELD_STRING_RETIP,
+	FIELD_STRING_FUNC,
+};
+
+/* Printing function type */
+#define PRINT_TYPE_FUNC_NAME(type)	print_type_##type
+#define PRINT_TYPE_FMT_NAME(type)	print_type_format_##type
+
+/* Printing  in basic type function template */
+#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast)			\
+static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s,	\
+						const char *name,	\
+						void *data, void *ent)\
+{									\
+	return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
+}									\
+static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
+
+DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
+
+static inline void *get_rloc_data(u32 *dl)
+{
+	return (u8 *)dl + get_rloc_offs(*dl);
+}
+
+/* For data_loc conversion */
+static inline void *get_loc_data(u32 *dl, void *ent)
+{
+	return (u8 *)ent + get_rloc_offs(*dl);
+}
+
+/* For defining macros, define string/string_size types */
+typedef u32 string;
+typedef u32 string_size;
+
+/* Print type function for string type */
+static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
+						  const char *name,
+						  void *data, void *ent)
+{
+	int len = *(u32 *)data >> 16;
+
+	if (!len)
+		return trace_seq_printf(s, " %s=(fault)", name);
+	else
+		return trace_seq_printf(s, " %s=\"%s\"", name,
+					(const char *)get_loc_data(data, ent));
+}
+static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
+
+#define FETCH_FUNC_NAME(method, type)	fetch_##method##_##type
+/*
+ * Define macro for basic types - we don't need to define s* types, because
+ * we have to care only about bitwidth at recording time.
+ */
+#define DEFINE_BASIC_FETCH_FUNCS(method) \
+DEFINE_FETCH_##method(u8)		\
+DEFINE_FETCH_##method(u16)		\
+DEFINE_FETCH_##method(u32)		\
+DEFINE_FETCH_##method(u64)
+
+#define CHECK_FETCH_FUNCS(method, fn)			\
+	(((FETCH_FUNC_NAME(method, u8) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u16) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u32) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u64) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, string) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, string_size) == fn)) \
+	 && (fn != NULL))
+
+/* Data fetch function templates */
+#define DEFINE_FETCH_reg(type)						\
+static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs,	\
+					void *offset, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_get_register(regs,			\
+				(unsigned int)((unsigned long)offset));	\
+}
+DEFINE_BASIC_FETCH_FUNCS(reg)
+/* No string on the register */
+#define fetch_reg_string NULL
+#define fetch_reg_string_size NULL
+
+#define DEFINE_FETCH_stack(type)					\
+static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
+					  void *offset, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_get_kernel_stack_nth(regs,		\
+				(unsigned int)((unsigned long)offset));	\
+}
+DEFINE_BASIC_FETCH_FUNCS(stack)
+/* No string on the stack entry */
+#define fetch_stack_string NULL
+#define fetch_stack_string_size NULL
+
+#define DEFINE_FETCH_retval(type)					\
+static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
+					  void *dummy, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_return_value(regs);			\
+}
+DEFINE_BASIC_FETCH_FUNCS(retval)
+/* No string on the retval */
+#define fetch_retval_string NULL
+#define fetch_retval_string_size NULL
+
+#define DEFINE_FETCH_memory(type)					\
+static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
+					  void *addr, void *dest)	\
+{									\
+	type retval;							\
+	if (probe_kernel_address(addr, retval))				\
+		*(type *)dest = 0;					\
+	else								\
+		*(type *)dest = retval;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(memory)
+/*
+ * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
+ * length and relative data location.
+ */
+static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
+						      void *addr, void *dest)
+{
+	long ret;
+	int maxlen = get_rloc_len(*(u32 *)dest);
+	u8 *dst = get_rloc_data(dest);
+	u8 *src = addr;
+	mm_segment_t old_fs = get_fs();
+	if (!maxlen)
+		return;
+	/*
+	 * Try to get string again, since the string can be changed while
+	 * probing.
+	 */
+	set_fs(KERNEL_DS);
+	pagefault_disable();
+	do
+		ret = __copy_from_user_inatomic(dst++, src++, 1);
+	while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
+	dst[-1] = '\0';
+	pagefault_enable();
+	set_fs(old_fs);
+
+	if (ret < 0) {	/* Failed to fetch string */
+		((u8 *)get_rloc_data(dest))[0] = '\0';
+		*(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
+	} else
+		*(u32 *)dest = make_data_rloc(src - (u8 *)addr,
+					      get_rloc_offs(*(u32 *)dest));
+}
+/* Return the length of the string -- including the terminating NUL byte */
+static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
+							void *addr, void *dest)
+{
+	int ret, len = 0;
+	u8 c;
+	mm_segment_t old_fs = get_fs();
+
+	set_fs(KERNEL_DS);
+	pagefault_disable();
+	do {
+		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+		len++;
+	} while (c && ret == 0 && len < MAX_STRING_SIZE);
+	pagefault_enable();
+	set_fs(old_fs);
+
+	if (ret < 0)	/* Failed to check the length */
+		*(u32 *)dest = 0;
+	else
+		*(u32 *)dest = len;
+}
+
+/* Memory fetching by symbol */
+struct symbol_cache {
+	char *symbol;
+	long offset;
+	unsigned long addr;
+};
+
+static unsigned long update_symbol_cache(struct symbol_cache *sc)
+{
+	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
+	if (sc->addr)
+		sc->addr += sc->offset;
+	return sc->addr;
+}
+
+static void free_symbol_cache(struct symbol_cache *sc)
+{
+	kfree(sc->symbol);
+	kfree(sc);
+}
+
+static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
+{
+	struct symbol_cache *sc;
+
+	if (!sym || strlen(sym) == 0)
+		return NULL;
+	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
+	if (!sc)
+		return NULL;
+
+	sc->symbol = kstrdup(sym, GFP_KERNEL);
+	if (!sc->symbol) {
+		kfree(sc);
+		return NULL;
+	}
+	sc->offset = offset;
+
+	update_symbol_cache(sc);
+	return sc;
+}
+
+#define DEFINE_FETCH_symbol(type)					\
+static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
+					  void *data, void *dest)	\
+{									\
+	struct symbol_cache *sc = data;					\
+	if (sc->addr)							\
+		fetch_memory_##type(regs, (void *)sc->addr, dest);	\
+	else								\
+		*(type *)dest = 0;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(symbol)
+DEFINE_FETCH_symbol(string)
+DEFINE_FETCH_symbol(string_size)
+
+/* Dereference memory access function */
+struct deref_fetch_param {
+	struct fetch_param orig;
+	long offset;
+};
+
+#define DEFINE_FETCH_deref(type)					\
+static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
+					    void *data, void *dest)	\
+{									\
+	struct deref_fetch_param *dprm = data;				\
+	unsigned long addr;						\
+	call_fetch(&dprm->orig, regs, &addr);				\
+	if (addr) {							\
+		addr += dprm->offset;					\
+		fetch_memory_##type(regs, (void *)addr, dest);		\
+	} else								\
+		*(type *)dest = 0;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(deref)
+DEFINE_FETCH_deref(string)
+DEFINE_FETCH_deref(string_size)
+
+static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data)
+{
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		update_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		update_symbol_cache(data->orig.data);
+}
+
+static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
+{
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		free_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		free_symbol_cache(data->orig.data);
+	kfree(data);
+}
+
+/* Bitfield fetch function */
+struct bitfield_fetch_param {
+	struct fetch_param orig;
+	unsigned char hi_shift;
+	unsigned char low_shift;
+};
+
+#define DEFINE_FETCH_bitfield(type)					\
+static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
+					    void *data, void *dest)	\
+{									\
+	struct bitfield_fetch_param *bprm = data;			\
+	type buf = 0;							\
+	call_fetch(&bprm->orig, regs, &buf);				\
+	if (buf) {							\
+		buf <<= bprm->hi_shift;					\
+		buf >>= bprm->low_shift;				\
+	}								\
+	*(type *)dest = buf;						\
+}
+
+DEFINE_BASIC_FETCH_FUNCS(bitfield)
+#define fetch_bitfield_string NULL
+#define fetch_bitfield_string_size NULL
+
+static __kprobes void
+update_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+	/*
+	 * Don't check the bitfield itself, because this must be the
+	 * last fetch function.
+	 */
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		update_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		update_symbol_cache(data->orig.data);
+}
+
+static __kprobes void
+free_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+	/*
+	 * Don't check the bitfield itself, because this must be the
+	 * last fetch function.
+	 */
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		free_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		free_symbol_cache(data->orig.data);
+	kfree(data);
+}
+
+/* Default (unsigned long) fetch type */
+#define __DEFAULT_FETCH_TYPE(t) u##t
+#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
+#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
+#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
+
+#define ASSIGN_FETCH_FUNC(method, type)	\
+	[FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
+
+#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype)	\
+	{.name = _name,				\
+	 .size = _size,					\
+	 .is_signed = sign,				\
+	 .print = PRINT_TYPE_FUNC_NAME(ptype),		\
+	 .fmt = PRINT_TYPE_FMT_NAME(ptype),		\
+	 .fmttype = _fmttype,				\
+	 .fetch = {					\
+ASSIGN_FETCH_FUNC(reg, ftype),				\
+ASSIGN_FETCH_FUNC(stack, ftype),			\
+ASSIGN_FETCH_FUNC(retval, ftype),			\
+ASSIGN_FETCH_FUNC(memory, ftype),			\
+ASSIGN_FETCH_FUNC(symbol, ftype),			\
+ASSIGN_FETCH_FUNC(deref, ftype),			\
+ASSIGN_FETCH_FUNC(bitfield, ftype),			\
+	  }						\
+	}
+
+#define ASSIGN_FETCH_TYPE(ptype, ftype, sign)			\
+	__ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
+
+#define FETCH_TYPE_STRING 0
+#define FETCH_TYPE_STRSIZE 1
+
+/* Fetch type information table */
+static const struct fetch_type fetch_type_table[] = {
+	/* Special types */
+	[FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
+					sizeof(u32), 1, "__data_loc char[]"),
+	[FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
+					string_size, sizeof(u32), 0, "u32"),
+	/* Basic types */
+	ASSIGN_FETCH_TYPE(u8,  u8,  0),
+	ASSIGN_FETCH_TYPE(u16, u16, 0),
+	ASSIGN_FETCH_TYPE(u32, u32, 0),
+	ASSIGN_FETCH_TYPE(u64, u64, 0),
+	ASSIGN_FETCH_TYPE(s8,  u8,  1),
+	ASSIGN_FETCH_TYPE(s16, u16, 1),
+	ASSIGN_FETCH_TYPE(s32, u32, 1),
+	ASSIGN_FETCH_TYPE(s64, u64, 1),
+};
+
+static const struct fetch_type *find_fetch_type(const char *type)
+{
+	int i;
+
+	if (!type)
+		type = DEFAULT_FETCH_TYPE_STR;
+
+	/* Special case: bitfield */
+	if (*type == 'b') {
+		unsigned long bs;
+		type = strchr(type, '/');
+		if (!type)
+			goto fail;
+		type++;
+		if (strict_strtoul(type, 0, &bs))
+			goto fail;
+		switch (bs) {
+		case 8:
+			return find_fetch_type("u8");
+		case 16:
+			return find_fetch_type("u16");
+		case 32:
+			return find_fetch_type("u32");
+		case 64:
+			return find_fetch_type("u64");
+		default:
+			goto fail;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
+		if (strcmp(type, fetch_type_table[i].name) == 0)
+			return &fetch_type_table[i];
+fail:
+	return NULL;
+}
+
+/* Special function: only accepts unsigned long */
+static __kprobes void fetch_stack_address(struct pt_regs *regs,
+					void *dummy, void *dest)
+{
+	*(unsigned long *)dest = kernel_stack_pointer(regs);
+}
+
+static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
+					fetch_func_t orig_fn)
+{
+	int i;
+
+	if (type != &fetch_type_table[FETCH_TYPE_STRING])
+		return NULL;	/* Only string type needs size function */
+	for (i = 0; i < FETCH_MTD_END; i++)
+		if (type->fetch[i] == orig_fn)
+			return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
+
+	WARN_ON(1);	/* This should not happen */
+	return NULL;
+}
+
+
+/* Split symbol and offset. */
+int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset)
+{
+	char *tmp;
+	int ret;
+
+	if (!offset)
+		return -EINVAL;
+
+	tmp = strchr(symbol, '+');
+	if (tmp) {
+		/* skip the sign because strict_strtoul() doesn't accept '+' */
+		ret = strict_strtoul(tmp + 1, 0, offset);
+		if (ret)
+			return ret;
+		*tmp = '\0';
+	} else
+		*offset = 0;
+	return 0;
+}
+
+
+#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
+
+static int parse_probe_vars(char *arg, const struct fetch_type *t,
+			    struct fetch_param *f, bool is_return)
+{
+	int ret = 0;
+	unsigned long param;
+
+	if (strcmp(arg, "retval") == 0) {
+		if (is_return)
+			f->fn = t->fetch[FETCH_MTD_retval];
+		else
+			ret = -EINVAL;
+	} else if (strncmp(arg, "stack", 5) == 0) {
+		if (arg[5] == '\0') {
+			if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
+				f->fn = fetch_stack_address;
+			else
+				ret = -EINVAL;
+		} else if (isdigit(arg[5])) {
+			ret = strict_strtoul(arg + 5, 10, &param);
+			if (ret || param > PARAM_MAX_STACK)
+				ret = -EINVAL;
+			else {
+				f->fn = t->fetch[FETCH_MTD_stack];
+				f->data = (void *)param;
+			}
+		} else
+			ret = -EINVAL;
+	} else
+		ret = -EINVAL;
+	return ret;
+}
+
+/* Recursive argument parser */
+static int parse_probe_arg(char *arg, const struct fetch_type *t,
+		     struct fetch_param *f, bool is_return)
+{
+	int ret = 0;
+	unsigned long param;
+	long offset;
+	char *tmp;
+
+	switch (arg[0]) {
+	case '$':
+		ret = parse_probe_vars(arg + 1, t, f, is_return);
+		break;
+	case '%':	/* named register */
+		ret = regs_query_register_offset(arg + 1);
+		if (ret >= 0) {
+			f->fn = t->fetch[FETCH_MTD_reg];
+			f->data = (void *)(unsigned long)ret;
+			ret = 0;
+		}
+		break;
+	case '@':	/* memory or symbol */
+		if (isdigit(arg[1])) {
+			ret = strict_strtoul(arg + 1, 0, &param);
+			if (ret)
+				break;
+			f->fn = t->fetch[FETCH_MTD_memory];
+			f->data = (void *)param;
+		} else {
+			ret = traceprobe_split_symbol_offset(arg + 1, &offset);
+			if (ret)
+				break;
+			f->data = alloc_symbol_cache(arg + 1, offset);
+			if (f->data)
+				f->fn = t->fetch[FETCH_MTD_symbol];
+		}
+		break;
+	case '+':	/* deref memory */
+		arg++;	/* Skip '+', because strict_strtol() rejects it. */
+	case '-':
+		tmp = strchr(arg, '(');
+		if (!tmp)
+			break;
+		*tmp = '\0';
+		ret = strict_strtol(arg, 0, &offset);
+		if (ret)
+			break;
+		arg = tmp + 1;
+		tmp = strrchr(arg, ')');
+		if (tmp) {
+			struct deref_fetch_param *dprm;
+			const struct fetch_type *t2 = find_fetch_type(NULL);
+			*tmp = '\0';
+			dprm = kzalloc(sizeof(struct deref_fetch_param),
+				       GFP_KERNEL);
+			if (!dprm)
+				return -ENOMEM;
+			dprm->offset = offset;
+			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+			if (ret)
+				kfree(dprm);
+			else {
+				f->fn = t->fetch[FETCH_MTD_deref];
+				f->data = (void *)dprm;
+			}
+		}
+		break;
+	}
+	if (!ret && !f->fn) {	/* Parsed, but no fetch method found */
+		pr_info("%s type has no corresponding fetch method.\n",
+			t->name);
+		ret = -EINVAL;
+	}
+	return ret;
+}
+#define BYTES_TO_BITS(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
+
+/* Bitfield type needs to be parsed into a fetch function */
+static int __parse_bitfield_probe_arg(const char *bf,
+				      const struct fetch_type *t,
+				      struct fetch_param *f)
+{
+	struct bitfield_fetch_param *bprm;
+	unsigned long bw, bo;
+	char *tail;
+
+	if (*bf != 'b')
+		return 0;
+
+	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+	if (!bprm)
+		return -ENOMEM;
+	bprm->orig = *f;
+	f->fn = t->fetch[FETCH_MTD_bitfield];
+	f->data = (void *)bprm;
+
+	bw = simple_strtoul(bf + 1, &tail, 0);	/* Use simple one */
+	if (bw == 0 || *tail != '@')
+		return -EINVAL;
+
+	bf = tail + 1;
+	bo = simple_strtoul(bf, &tail, 0);
+	if (tail == bf || *tail != '/')
+		return -EINVAL;
+
+	bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
+	bprm->low_shift = bprm->hi_shift + bo;
+	return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
+}
+
+/* String length checking wrapper */
+int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+		struct probe_arg *parg, bool is_return)
+{
+	const char *t;
+	int ret;
+
+	if (strlen(arg) > MAX_ARGSTR_LEN) {
+		pr_info("Argument is too long: %s\n", arg);
+		return -ENOSPC;
+	}
+	parg->comm = kstrdup(arg, GFP_KERNEL);
+	if (!parg->comm) {
+		pr_info("Failed to allocate memory for command '%s'.\n", arg);
+		return -ENOMEM;
+	}
+	t = strchr(parg->comm, ':');
+	if (t) {
+		arg[t - parg->comm] = '\0';
+		t++;
+	}
+	parg->type = find_fetch_type(t);
+	if (!parg->type) {
+		pr_info("Unsupported type: %s\n", t);
+		return -EINVAL;
+	}
+	parg->offset = *size;
+	*size += parg->type->size;
+	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+	if (ret >= 0 && t != NULL)
+		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
+	if (ret >= 0) {
+		parg->fetch_size.fn = get_fetch_size_function(parg->type,
+							      parg->fetch.fn);
+		parg->fetch_size.data = parg->fetch.data;
+	}
+	return ret;
+}
+
+/* Return 1 if name is reserved or already used by another argument */
+int traceprobe_conflict_field_name(const char *name,
+			       struct probe_arg *args, int narg)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
+		if (strcmp(reserved_field_names[i], name) == 0)
+			return 1;
+	for (i = 0; i < narg; i++)
+		if (strcmp(args[i].name, name) == 0)
+			return 1;
+	return 0;
+}
+
+void traceprobe_update_arg(struct probe_arg *arg)
+{
+	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+		update_bitfield_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+		update_deref_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+		update_symbol_cache(arg->fetch.data);
+}
+
+
+void traceprobe_free_probe_arg(struct probe_arg *arg)
+{
+	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+		free_bitfield_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+		free_deref_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+		free_symbol_cache(arg->fetch.data);
+	kfree(arg->name);
+	kfree(arg->comm);
+}
+
+int traceprobe_command(const char *buf, int (*createfn)(int, char**))
+{
+	char **argv;
+	int argc = 0, ret = 0;
+
+	argv = argv_split(GFP_KERNEL, buf, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	if (argc)
+		ret = createfn(argc, argv);
+
+	argv_free(argv);
+	return ret;
+}
+
+#define WRITE_BUFSIZE 128
+
+ssize_t traceprobe_probes_write(struct file *file, const char __user *buffer,
+	    size_t count, loff_t *ppos, int (*createfn)(int, char**))
+{
+	char *kbuf, *tmp;
+	int ret = 0;
+	size_t done = 0;
+	size_t size;
+
+	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	while (done < count) {
+		size = count - done;
+		if (size >= WRITE_BUFSIZE)
+			size = WRITE_BUFSIZE - 1;
+		if (copy_from_user(kbuf, buffer + done, size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		kbuf[size] = '\0';
+		tmp = strchr(kbuf, '\n');
+		if (tmp) {
+			*tmp = '\0';
+			size = tmp - kbuf + 1;
+		} else if (done + size < count) {
+			pr_warning("Line length is too long: "
+				   "should be less than %d.", WRITE_BUFSIZE);
+			ret = -EINVAL;
+			goto out;
+		}
+		done += size;
+		/* Remove comments */
+		tmp = strchr(kbuf, '#');
+		if (tmp)
+			*tmp = '\0';
+
+		ret = traceprobe_command(kbuf, createfn);
+		if (ret)
+			goto out;
+	}
+	ret = done;
+out:
+	kfree(kbuf);
+	return ret;
+}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
new file mode 100644
index 0000000..500a08f
--- /dev/null
+++ b/kernel/trace/trace_probe.h
@@ -0,0 +1,160 @@
+/*
+ * Common header file for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:     Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
+ */
+
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/debugfs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/ptrace.h>
+#include <linux/perf_event.h>
+#include <linux/kprobes.h>
+#include <linux/stringify.h>
+#include <linux/limits.h>
+#include <linux/uaccess.h>
+#include <asm/bitsperlong.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+#define MAX_TRACE_ARGS 128
+#define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
+#define MAX_STRING_SIZE PATH_MAX
+
+/* Reserved field names */
+#define FIELD_STRING_IP "__probe_ip"
+#define FIELD_STRING_RETIP "__probe_ret_ip"
+#define FIELD_STRING_FUNC "__probe_func"
+
+#undef DEFINE_FIELD
+#define DEFINE_FIELD(type, item, name, is_signed)			\
+	do {								\
+		ret = trace_define_field(event_call, #type, name,	\
+					 offsetof(typeof(field), item),	\
+					 sizeof(field.item), is_signed, \
+					 FILTER_OTHER);			\
+		if (ret)						\
+			return ret;					\
+	} while (0)
+
+
+/* Flags for trace_probe */
+#define TP_FLAG_TRACE	1
+#define TP_FLAG_PROFILE	2
+#define TP_FLAG_REGISTERED 4
+
+
+/* data_rloc: data relative location, compatible with u32 */
+#define make_data_rloc(len, roffs)	\
+	(((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
+#define get_rloc_len(dl)	((u32)(dl) >> 16)
+#define get_rloc_offs(dl)	((u32)(dl) & 0xffff)
+
+/*
+ * Convert data_rloc to data_loc:
+ *  data_rloc stores the offset from data_rloc itself, but data_loc
+ *  stores the offset from event entry.
+ */
+#define convert_rloc_to_loc(dl, offs)	((u32)(dl) + (offs))
+
+/* Data fetch function type */
+typedef	void (*fetch_func_t)(struct pt_regs *, void *, void *);
+/* Printing function type */
+typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
+				 void *);
+
+/* Fetch types */
+enum {
+	FETCH_MTD_reg = 0,
+	FETCH_MTD_stack,
+	FETCH_MTD_retval,
+	FETCH_MTD_memory,
+	FETCH_MTD_symbol,
+	FETCH_MTD_deref,
+	FETCH_MTD_bitfield,
+	FETCH_MTD_END,
+};
+
+/* Fetch type information table */
+struct fetch_type {
+	const char	*name;		/* Name of type */
+	size_t		size;		/* Byte size of type */
+	int		is_signed;	/* Signed flag */
+	print_type_func_t	print;	/* Print functions */
+	const char	*fmt;		/* Format string */
+	const char	*fmttype;	/* Name in format file */
+	/* Fetch functions */
+	fetch_func_t	fetch[FETCH_MTD_END];
+};
+
+struct fetch_param {
+	fetch_func_t	fn;
+	void *data;
+};
+
+struct probe_arg {
+	struct fetch_param	fetch;
+	struct fetch_param	fetch_size;
+	unsigned int		offset;	/* Offset from argument entry */
+	const char		*name;	/* Name of this argument */
+	const char		*comm;	/* Command of this argument */
+	const struct fetch_type	*type;	/* Type of this argument */
+};
+
+static inline __kprobes void call_fetch(struct fetch_param *fprm,
+				 struct pt_regs *regs, void *dest)
+{
+	return fprm->fn(regs, fprm->data, dest);
+}
+
+/* Check the name is good for event/group/fields */
+static int is_good_name(const char *name)
+{
+	if (!isalpha(*name) && *name != '_')
+		return 0;
+	while (*++name != '\0') {
+		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
+			return 0;
+	}
+	return 1;
+}
+
+extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+		   struct probe_arg *parg, bool is_return);
+
+extern int traceprobe_conflict_field_name(const char *name,
+			       struct probe_arg *args, int narg);
+
+extern void traceprobe_update_arg(struct probe_arg *arg);
+extern void traceprobe_free_probe_arg(struct probe_arg *arg);
+
+extern int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset);
+
+extern ssize_t traceprobe_probes_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *ppos,
+		int (*createfn)(int, char**));
+
+extern int traceprobe_command(const char *buf, int (*createfn)(int, char**));


* [PATCH v5 3.1.0-rc4-tip 19/26]   tracing: Extract out common code for kprobes/uprobes traceevents.
@ 2011-09-20 12:03   ` Srikar Dronamraju
From: Srikar Dronamraju @ 2011-09-20 12:03 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton


Move the parts of trace_kprobe.c that can be shared with the upcoming
trace_uprobe.c into kernel/trace/trace_probe.h and
kernel/trace/trace_probe.c.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/trace/Kconfig        |    4 
 kernel/trace/Makefile       |    1 
 kernel/trace/trace_kprobe.c |  894 +------------------------------------------
 kernel/trace/trace_probe.c  |  778 +++++++++++++++++++++++++++++++++++++
 kernel/trace/trace_probe.h  |  160 ++++++++
 5 files changed, 963 insertions(+), 874 deletions(-)
 create mode 100644 kernel/trace/trace_probe.c
 create mode 100644 kernel/trace/trace_probe.h

diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index cd31345..520106a 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -373,6 +373,7 @@ config KPROBE_EVENT
 	depends on HAVE_REGS_AND_STACK_ACCESS_API
 	bool "Enable kprobes-based dynamic events"
 	select TRACING
+	select PROBE_EVENTS
 	default y
 	help
 	  This allows the user to add tracing events (similar to tracepoints)
@@ -385,6 +386,9 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config PROBE_EVENTS
+	def_bool n
+
 config DYNAMIC_FTRACE
 	bool "enable/disable ftrace tracepoints dynamically"
 	depends on FUNCTION_TRACER
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 761c510..692223a 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -56,5 +56,6 @@ obj-$(CONFIG_TRACEPOINTS) += power-traces.o
 ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
+obj-$(CONFIG_PROBE_EVENTS) += trace_probe.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 5fb3697..d5f4e51 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -19,547 +19,15 @@
 
 #include <linux/module.h>
 #include <linux/uaccess.h>
-#include <linux/kprobes.h>
-#include <linux/seq_file.h>
-#include <linux/slab.h>
-#include <linux/smp.h>
-#include <linux/debugfs.h>
-#include <linux/types.h>
-#include <linux/string.h>
-#include <linux/ctype.h>
-#include <linux/ptrace.h>
-#include <linux/perf_event.h>
-#include <linux/stringify.h>
-#include <linux/limits.h>
-#include <asm/bitsperlong.h>
-
-#include "trace.h"
-#include "trace_output.h"
-
-#define MAX_TRACE_ARGS 128
-#define MAX_ARGSTR_LEN 63
-#define MAX_EVENT_NAME_LEN 64
-#define MAX_STRING_SIZE PATH_MAX
-#define KPROBE_EVENT_SYSTEM "kprobes"
-
-/* Reserved field names */
-#define FIELD_STRING_IP "__probe_ip"
-#define FIELD_STRING_RETIP "__probe_ret_ip"
-#define FIELD_STRING_FUNC "__probe_func"
-
-const char *reserved_field_names[] = {
-	"common_type",
-	"common_flags",
-	"common_preempt_count",
-	"common_pid",
-	"common_tgid",
-	FIELD_STRING_IP,
-	FIELD_STRING_RETIP,
-	FIELD_STRING_FUNC,
-};
-
-/* Printing function type */
-typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
-				 void *);
-#define PRINT_TYPE_FUNC_NAME(type)	print_type_##type
-#define PRINT_TYPE_FMT_NAME(type)	print_type_format_##type
-
-/* Printing  in basic type function template */
-#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast)			\
-static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s,	\
-						const char *name,	\
-						void *data, void *ent)\
-{									\
-	return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
-}									\
-static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
-
-DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
-DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
-
-/* data_rloc: data relative location, compatible with u32 */
-#define make_data_rloc(len, roffs)	\
-	(((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
-#define get_rloc_len(dl)	((u32)(dl) >> 16)
-#define get_rloc_offs(dl)	((u32)(dl) & 0xffff)
-
-static inline void *get_rloc_data(u32 *dl)
-{
-	return (u8 *)dl + get_rloc_offs(*dl);
-}
-
-/* For data_loc conversion */
-static inline void *get_loc_data(u32 *dl, void *ent)
-{
-	return (u8 *)ent + get_rloc_offs(*dl);
-}
-
-/*
- * Convert data_rloc to data_loc:
- *  data_rloc stores the offset from data_rloc itself, but data_loc
- *  stores the offset from event entry.
- */
-#define convert_rloc_to_loc(dl, offs)	((u32)(dl) + (offs))
-
-/* For defining macros, define string/string_size types */
-typedef u32 string;
-typedef u32 string_size;
-
-/* Print type function for string type */
-static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
-						  const char *name,
-						  void *data, void *ent)
-{
-	int len = *(u32 *)data >> 16;
-
-	if (!len)
-		return trace_seq_printf(s, " %s=(fault)", name);
-	else
-		return trace_seq_printf(s, " %s=\"%s\"", name,
-					(const char *)get_loc_data(data, ent));
-}
-static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
-
-/* Data fetch function type */
-typedef	void (*fetch_func_t)(struct pt_regs *, void *, void *);
-
-struct fetch_param {
-	fetch_func_t	fn;
-	void *data;
-};
-
-static __kprobes void call_fetch(struct fetch_param *fprm,
-				 struct pt_regs *regs, void *dest)
-{
-	return fprm->fn(regs, fprm->data, dest);
-}
-
-#define FETCH_FUNC_NAME(method, type)	fetch_##method##_##type
-/*
- * Define macro for basic types - we don't need to define s* types, because
- * we have to care only about bitwidth at recording time.
- */
-#define DEFINE_BASIC_FETCH_FUNCS(method) \
-DEFINE_FETCH_##method(u8)		\
-DEFINE_FETCH_##method(u16)		\
-DEFINE_FETCH_##method(u32)		\
-DEFINE_FETCH_##method(u64)
-
-#define CHECK_FETCH_FUNCS(method, fn)			\
-	(((FETCH_FUNC_NAME(method, u8) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u16) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u32) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, u64) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, string) == fn) ||	\
-	  (FETCH_FUNC_NAME(method, string_size) == fn)) \
-	 && (fn != NULL))
-
-/* Data fetch function templates */
-#define DEFINE_FETCH_reg(type)						\
-static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs,	\
-					void *offset, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_get_register(regs,			\
-				(unsigned int)((unsigned long)offset));	\
-}
-DEFINE_BASIC_FETCH_FUNCS(reg)
-/* No string on the register */
-#define fetch_reg_string NULL
-#define fetch_reg_string_size NULL
-
-#define DEFINE_FETCH_stack(type)					\
-static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
-					  void *offset, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_get_kernel_stack_nth(regs,		\
-				(unsigned int)((unsigned long)offset));	\
-}
-DEFINE_BASIC_FETCH_FUNCS(stack)
-/* No string on the stack entry */
-#define fetch_stack_string NULL
-#define fetch_stack_string_size NULL
-
-#define DEFINE_FETCH_retval(type)					\
-static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
-					  void *dummy, void *dest)	\
-{									\
-	*(type *)dest = (type)regs_return_value(regs);			\
-}
-DEFINE_BASIC_FETCH_FUNCS(retval)
-/* No string on the retval */
-#define fetch_retval_string NULL
-#define fetch_retval_string_size NULL
-
-#define DEFINE_FETCH_memory(type)					\
-static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
-					  void *addr, void *dest)	\
-{									\
-	type retval;							\
-	if (probe_kernel_address(addr, retval))				\
-		*(type *)dest = 0;					\
-	else								\
-		*(type *)dest = retval;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(memory)
-/*
- * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
- * length and relative data location.
- */
-static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
-						      void *addr, void *dest)
-{
-	long ret;
-	int maxlen = get_rloc_len(*(u32 *)dest);
-	u8 *dst = get_rloc_data(dest);
-	u8 *src = addr;
-	mm_segment_t old_fs = get_fs();
-	if (!maxlen)
-		return;
-	/*
-	 * Try to get string again, since the string can be changed while
-	 * probing.
-	 */
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-	do
-		ret = __copy_from_user_inatomic(dst++, src++, 1);
-	while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
-	dst[-1] = '\0';
-	pagefault_enable();
-	set_fs(old_fs);
-
-	if (ret < 0) {	/* Failed to fetch string */
-		((u8 *)get_rloc_data(dest))[0] = '\0';
-		*(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
-	} else
-		*(u32 *)dest = make_data_rloc(src - (u8 *)addr,
-					      get_rloc_offs(*(u32 *)dest));
-}
-/* Return the length of string -- including null terminal byte */
-static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
-							void *addr, void *dest)
-{
-	int ret, len = 0;
-	u8 c;
-	mm_segment_t old_fs = get_fs();
-
-	set_fs(KERNEL_DS);
-	pagefault_disable();
-	do {
-		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
-		len++;
-	} while (c && ret == 0 && len < MAX_STRING_SIZE);
-	pagefault_enable();
-	set_fs(old_fs);
-
-	if (ret < 0)	/* Failed to check the length */
-		*(u32 *)dest = 0;
-	else
-		*(u32 *)dest = len;
-}
-
-/* Memory fetching by symbol */
-struct symbol_cache {
-	char *symbol;
-	long offset;
-	unsigned long addr;
-};
-
-static unsigned long update_symbol_cache(struct symbol_cache *sc)
-{
-	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
-	if (sc->addr)
-		sc->addr += sc->offset;
-	return sc->addr;
-}
-
-static void free_symbol_cache(struct symbol_cache *sc)
-{
-	kfree(sc->symbol);
-	kfree(sc);
-}
-
-static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
-{
-	struct symbol_cache *sc;
-
-	if (!sym || strlen(sym) == 0)
-		return NULL;
-	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
-	if (!sc)
-		return NULL;
-
-	sc->symbol = kstrdup(sym, GFP_KERNEL);
-	if (!sc->symbol) {
-		kfree(sc);
-		return NULL;
-	}
-	sc->offset = offset;
 
-	update_symbol_cache(sc);
-	return sc;
-}
-
-#define DEFINE_FETCH_symbol(type)					\
-static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
-					  void *data, void *dest)	\
-{									\
-	struct symbol_cache *sc = data;					\
-	if (sc->addr)							\
-		fetch_memory_##type(regs, (void *)sc->addr, dest);	\
-	else								\
-		*(type *)dest = 0;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(symbol)
-DEFINE_FETCH_symbol(string)
-DEFINE_FETCH_symbol(string_size)
-
-/* Dereference memory access function */
-struct deref_fetch_param {
-	struct fetch_param orig;
-	long offset;
-};
-
-#define DEFINE_FETCH_deref(type)					\
-static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
-					    void *data, void *dest)	\
-{									\
-	struct deref_fetch_param *dprm = data;				\
-	unsigned long addr;						\
-	call_fetch(&dprm->orig, regs, &addr);				\
-	if (addr) {							\
-		addr += dprm->offset;					\
-		fetch_memory_##type(regs, (void *)addr, dest);		\
-	} else								\
-		*(type *)dest = 0;					\
-}
-DEFINE_BASIC_FETCH_FUNCS(deref)
-DEFINE_FETCH_deref(string)
-DEFINE_FETCH_deref(string_size)
-
-static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data)
-{
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		update_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		update_symbol_cache(data->orig.data);
-}
-
-static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
-{
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		free_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		free_symbol_cache(data->orig.data);
-	kfree(data);
-}
-
-/* Bitfield fetch function */
-struct bitfield_fetch_param {
-	struct fetch_param orig;
-	unsigned char hi_shift;
-	unsigned char low_shift;
-};
+#include "trace_probe.h"
 
-#define DEFINE_FETCH_bitfield(type)					\
-static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
-					    void *data, void *dest)	\
-{									\
-	struct bitfield_fetch_param *bprm = data;			\
-	type buf = 0;							\
-	call_fetch(&bprm->orig, regs, &buf);				\
-	if (buf) {							\
-		buf <<= bprm->hi_shift;					\
-		buf >>= bprm->low_shift;				\
-	}								\
-	*(type *)dest = buf;						\
-}
-DEFINE_BASIC_FETCH_FUNCS(bitfield)
-#define fetch_bitfield_string NULL
-#define fetch_bitfield_string_size NULL
-
-static __kprobes void
-update_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
-	/*
-	 * Don't check the bitfield itself, because this must be the
-	 * last fetch function.
-	 */
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		update_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		update_symbol_cache(data->orig.data);
-}
-
-static __kprobes void
-free_bitfield_fetch_param(struct bitfield_fetch_param *data)
-{
-	/*
-	 * Don't check the bitfield itself, because this must be the
-	 * last fetch function.
-	 */
-	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
-		free_deref_fetch_param(data->orig.data);
-	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
-		free_symbol_cache(data->orig.data);
-	kfree(data);
-}
-
-/* Default (unsigned long) fetch type */
-#define __DEFAULT_FETCH_TYPE(t) u##t
-#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
-#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
-#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
-
-/* Fetch types */
-enum {
-	FETCH_MTD_reg = 0,
-	FETCH_MTD_stack,
-	FETCH_MTD_retval,
-	FETCH_MTD_memory,
-	FETCH_MTD_symbol,
-	FETCH_MTD_deref,
-	FETCH_MTD_bitfield,
-	FETCH_MTD_END,
-};
-
-#define ASSIGN_FETCH_FUNC(method, type)	\
-	[FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
-
-#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype)	\
-	{.name = _name,				\
-	 .size = _size,					\
-	 .is_signed = sign,				\
-	 .print = PRINT_TYPE_FUNC_NAME(ptype),		\
-	 .fmt = PRINT_TYPE_FMT_NAME(ptype),		\
-	 .fmttype = _fmttype,				\
-	 .fetch = {					\
-ASSIGN_FETCH_FUNC(reg, ftype),				\
-ASSIGN_FETCH_FUNC(stack, ftype),			\
-ASSIGN_FETCH_FUNC(retval, ftype),			\
-ASSIGN_FETCH_FUNC(memory, ftype),			\
-ASSIGN_FETCH_FUNC(symbol, ftype),			\
-ASSIGN_FETCH_FUNC(deref, ftype),			\
-ASSIGN_FETCH_FUNC(bitfield, ftype),			\
-	  }						\
-	}
-
-#define ASSIGN_FETCH_TYPE(ptype, ftype, sign)			\
-	__ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
-
-#define FETCH_TYPE_STRING 0
-#define FETCH_TYPE_STRSIZE 1
-
-/* Fetch type information table */
-static const struct fetch_type {
-	const char	*name;		/* Name of type */
-	size_t		size;		/* Byte size of type */
-	int		is_signed;	/* Signed flag */
-	print_type_func_t	print;	/* Print functions */
-	const char	*fmt;		/* Fromat string */
-	const char	*fmttype;	/* Name in format file */
-	/* Fetch functions */
-	fetch_func_t	fetch[FETCH_MTD_END];
-} fetch_type_table[] = {
-	/* Special types */
-	[FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
-					sizeof(u32), 1, "__data_loc char[]"),
-	[FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
-					string_size, sizeof(u32), 0, "u32"),
-	/* Basic types */
-	ASSIGN_FETCH_TYPE(u8,  u8,  0),
-	ASSIGN_FETCH_TYPE(u16, u16, 0),
-	ASSIGN_FETCH_TYPE(u32, u32, 0),
-	ASSIGN_FETCH_TYPE(u64, u64, 0),
-	ASSIGN_FETCH_TYPE(s8,  u8,  1),
-	ASSIGN_FETCH_TYPE(s16, u16, 1),
-	ASSIGN_FETCH_TYPE(s32, u32, 1),
-	ASSIGN_FETCH_TYPE(s64, u64, 1),
-};
-
-static const struct fetch_type *find_fetch_type(const char *type)
-{
-	int i;
-
-	if (!type)
-		type = DEFAULT_FETCH_TYPE_STR;
-
-	/* Special case: bitfield */
-	if (*type == 'b') {
-		unsigned long bs;
-		type = strchr(type, '/');
-		if (!type)
-			goto fail;
-		type++;
-		if (strict_strtoul(type, 0, &bs))
-			goto fail;
-		switch (bs) {
-		case 8:
-			return find_fetch_type("u8");
-		case 16:
-			return find_fetch_type("u16");
-		case 32:
-			return find_fetch_type("u32");
-		case 64:
-			return find_fetch_type("u64");
-		default:
-			goto fail;
-		}
-	}
-
-	for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
-		if (strcmp(type, fetch_type_table[i].name) == 0)
-			return &fetch_type_table[i];
-fail:
-	return NULL;
-}
-
-/* Special function : only accept unsigned long */
-static __kprobes void fetch_stack_address(struct pt_regs *regs,
-					  void *dummy, void *dest)
-{
-	*(unsigned long *)dest = kernel_stack_pointer(regs);
-}
-
-static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
-					    fetch_func_t orig_fn)
-{
-	int i;
-
-	if (type != &fetch_type_table[FETCH_TYPE_STRING])
-		return NULL;	/* Only string type needs size function */
-	for (i = 0; i < FETCH_MTD_END; i++)
-		if (type->fetch[i] == orig_fn)
-			return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
-
-	WARN_ON(1);	/* This should not happen */
-	return NULL;
-}
+#define KPROBE_EVENT_SYSTEM "kprobes"
 
 /**
  * Kprobe event core functions
  */
 
-struct probe_arg {
-	struct fetch_param	fetch;
-	struct fetch_param	fetch_size;
-	unsigned int		offset;	/* Offset from argument entry */
-	const char		*name;	/* Name of this argument */
-	const char		*comm;	/* Command of this argument */
-	const struct fetch_type	*type;	/* Type of this argument */
-};
-
-/* Flags for trace_probe */
-#define TP_FLAG_TRACE	1
-#define TP_FLAG_PROFILE	2
-#define TP_FLAG_REGISTERED 4
-
 struct trace_probe {
 	struct list_head	list;
 	struct kretprobe	rp;	/* Use rp.kp for kprobe use */
@@ -631,18 +99,6 @@ static int kprobe_dispatcher(struct kprobe *kp, struct pt_regs *regs);
 static int kretprobe_dispatcher(struct kretprobe_instance *ri,
 				struct pt_regs *regs);
 
-/* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
-{
-	if (!isalpha(*name) && *name != '_')
-		return 0;
-	while (*++name != '\0') {
-		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
-			return 0;
-	}
-	return 1;
-}
-
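The name check being removed above (it moves into shared trace_probe code) amounts to a C-identifier test: `[A-Za-z_][A-Za-z0-9_]*`. A user-space sketch of the same logic (plain C with the standard ctype functions, not the kernel's):

```c
#include <assert.h>
#include <ctype.h>

/* User-space model of is_good_name(): event/group/field names
 * must be valid C identifiers.                                 */
static int is_good_name(const char *name)
{
	if (!isalpha((unsigned char)*name) && *name != '_')
		return 0;
	while (*++name != '\0') {
		if (!isalpha((unsigned char)*name) &&
		    !isdigit((unsigned char)*name) && *name != '_')
			return 0;
	}
	return 1;
}
```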
 /*
  * Allocate new trace_probe and initialize it (including kprobes).
  */
@@ -651,7 +107,7 @@ static struct trace_probe *alloc_trace_probe(const char *group,
 					     void *addr,
 					     const char *symbol,
 					     unsigned long offs,
-					     int nargs, int is_return)
+					     int nargs, bool is_return)
 {
 	struct trace_probe *tp;
 	int ret = -ENOMEM;
@@ -702,34 +158,12 @@ static struct trace_probe *alloc_trace_probe(const char *group,
 	return ERR_PTR(ret);
 }
 
-static void update_probe_arg(struct probe_arg *arg)
-{
-	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
-		update_bitfield_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
-		update_deref_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
-		update_symbol_cache(arg->fetch.data);
-}
-
-static void free_probe_arg(struct probe_arg *arg)
-{
-	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
-		free_bitfield_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
-		free_deref_fetch_param(arg->fetch.data);
-	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
-		free_symbol_cache(arg->fetch.data);
-	kfree(arg->name);
-	kfree(arg->comm);
-}
-
 static void free_trace_probe(struct trace_probe *tp)
 {
 	int i;
 
 	for (i = 0; i < tp->nr_args; i++)
-		free_probe_arg(&tp->args[i]);
+		traceprobe_free_probe_arg(&tp->args[i]);
 
 	kfree(tp->call.class->system);
 	kfree(tp->call.name);
@@ -787,7 +221,7 @@ static int __register_trace_probe(struct trace_probe *tp)
 		return -EINVAL;
 
 	for (i = 0; i < tp->nr_args; i++)
-		update_probe_arg(&tp->args[i]);
+		traceprobe_update_arg(&tp->args[i]);
 
 	/* Set/clear disabled flag according to tp->flag */
 	if (trace_probe_is_enabled(tp))
@@ -910,227 +344,6 @@ static struct notifier_block trace_probe_module_nb = {
 	.priority = 1	/* Invoked after kprobe module callback */
 };
 
-/* Split symbol and offset. */
-static int split_symbol_offset(char *symbol, unsigned long *offset)
-{
-	char *tmp;
-	int ret;
-
-	if (!offset)
-		return -EINVAL;
-
-	tmp = strchr(symbol, '+');
-	if (tmp) {
-		/* skip sign because strict_strtol doesn't accept '+' */
-		ret = strict_strtoul(tmp + 1, 0, offset);
-		if (ret)
-			return ret;
-		*tmp = '\0';
-	} else
-		*offset = 0;
-	return 0;
-}
-
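The splitter above (now exported as traceprobe_split_symbol_offset()) takes a probe point of the form "symbol+offset" and separates the two halves, with the offset defaulting to 0. A pure user-space sketch of the parsing — the kernel version truncates the string in place and uses strict_strtoul(), while this helper only returns the offset:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the "symbol+offset" split: everything after '+' is
 * parsed as a number (base 0, so 0x... works); no '+' means 0. */
static unsigned long symbol_offset(const char *spec)
{
	const char *plus = strchr(spec, '+');

	return plus ? strtoul(plus + 1, NULL, 0) : 0;
}
```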
-#define PARAM_MAX_ARGS 16
-#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
-
-static int parse_probe_vars(char *arg, const struct fetch_type *t,
-			    struct fetch_param *f, int is_return)
-{
-	int ret = 0;
-	unsigned long param;
-
-	if (strcmp(arg, "retval") == 0) {
-		if (is_return)
-			f->fn = t->fetch[FETCH_MTD_retval];
-		else
-			ret = -EINVAL;
-	} else if (strncmp(arg, "stack", 5) == 0) {
-		if (arg[5] == '\0') {
-			if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
-				f->fn = fetch_stack_address;
-			else
-				ret = -EINVAL;
-		} else if (isdigit(arg[5])) {
-			ret = strict_strtoul(arg + 5, 10, &param);
-			if (ret || param > PARAM_MAX_STACK)
-				ret = -EINVAL;
-			else {
-				f->fn = t->fetch[FETCH_MTD_stack];
-				f->data = (void *)param;
-			}
-		} else
-			ret = -EINVAL;
-	} else
-		ret = -EINVAL;
-	return ret;
-}
-
-/* Recursive argument parser */
-static int __parse_probe_arg(char *arg, const struct fetch_type *t,
-			     struct fetch_param *f, int is_return)
-{
-	int ret = 0;
-	unsigned long param;
-	long offset;
-	char *tmp;
-
-	switch (arg[0]) {
-	case '$':
-		ret = parse_probe_vars(arg + 1, t, f, is_return);
-		break;
-	case '%':	/* named register */
-		ret = regs_query_register_offset(arg + 1);
-		if (ret >= 0) {
-			f->fn = t->fetch[FETCH_MTD_reg];
-			f->data = (void *)(unsigned long)ret;
-			ret = 0;
-		}
-		break;
-	case '@':	/* memory or symbol */
-		if (isdigit(arg[1])) {
-			ret = strict_strtoul(arg + 1, 0, &param);
-			if (ret)
-				break;
-			f->fn = t->fetch[FETCH_MTD_memory];
-			f->data = (void *)param;
-		} else {
-			ret = split_symbol_offset(arg + 1, &offset);
-			if (ret)
-				break;
-			f->data = alloc_symbol_cache(arg + 1, offset);
-			if (f->data)
-				f->fn = t->fetch[FETCH_MTD_symbol];
-		}
-		break;
-	case '+':	/* deref memory */
-		arg++;	/* Skip '+', because strict_strtol() rejects it. */
-	case '-':
-		tmp = strchr(arg, '(');
-		if (!tmp)
-			break;
-		*tmp = '\0';
-		ret = strict_strtol(arg, 0, &offset);
-		if (ret)
-			break;
-		arg = tmp + 1;
-		tmp = strrchr(arg, ')');
-		if (tmp) {
-			struct deref_fetch_param *dprm;
-			const struct fetch_type *t2 = find_fetch_type(NULL);
-			*tmp = '\0';
-			dprm = kzalloc(sizeof(struct deref_fetch_param),
-				       GFP_KERNEL);
-			if (!dprm)
-				return -ENOMEM;
-			dprm->offset = offset;
-			ret = __parse_probe_arg(arg, t2, &dprm->orig,
-						is_return);
-			if (ret)
-				kfree(dprm);
-			else {
-				f->fn = t->fetch[FETCH_MTD_deref];
-				f->data = (void *)dprm;
-			}
-		}
-		break;
-	}
-	if (!ret && !f->fn) {	/* Parsed, but do not find fetch method */
-		pr_info("%s type has no corresponding fetch method.\n",
-			t->name);
-		ret = -EINVAL;
-	}
-	return ret;
-}
-
-#define BYTES_TO_BITS(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
-
-/* Bitfield type needs to be parsed into a fetch function */
-static int __parse_bitfield_probe_arg(const char *bf,
-				      const struct fetch_type *t,
-				      struct fetch_param *f)
-{
-	struct bitfield_fetch_param *bprm;
-	unsigned long bw, bo;
-	char *tail;
-
-	if (*bf != 'b')
-		return 0;
-
-	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
-	if (!bprm)
-		return -ENOMEM;
-	bprm->orig = *f;
-	f->fn = t->fetch[FETCH_MTD_bitfield];
-	f->data = (void *)bprm;
-
-	bw = simple_strtoul(bf + 1, &tail, 0);	/* Use simple one */
-	if (bw == 0 || *tail != '@')
-		return -EINVAL;
-
-	bf = tail + 1;
-	bo = simple_strtoul(bf, &tail, 0);
-	if (tail == bf || *tail != '/')
-		return -EINVAL;
-
-	bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
-	bprm->low_shift = bprm->hi_shift + bo;
-	return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
-}
-
-/* String length checking wrapper */
-static int parse_probe_arg(char *arg, struct trace_probe *tp,
-			   struct probe_arg *parg, int is_return)
-{
-	const char *t;
-	int ret;
-
-	if (strlen(arg) > MAX_ARGSTR_LEN) {
-		pr_info("Argument is too long.: %s\n",  arg);
-		return -ENOSPC;
-	}
-	parg->comm = kstrdup(arg, GFP_KERNEL);
-	if (!parg->comm) {
-		pr_info("Failed to allocate memory for command '%s'.\n", arg);
-		return -ENOMEM;
-	}
-	t = strchr(parg->comm, ':');
-	if (t) {
-		arg[t - parg->comm] = '\0';
-		t++;
-	}
-	parg->type = find_fetch_type(t);
-	if (!parg->type) {
-		pr_info("Unsupported type: %s\n", t);
-		return -EINVAL;
-	}
-	parg->offset = tp->size;
-	tp->size += parg->type->size;
-	ret = __parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
-	if (ret >= 0 && t != NULL)
-		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
-	if (ret >= 0) {
-		parg->fetch_size.fn = get_fetch_size_function(parg->type,
-							      parg->fetch.fn);
-		parg->fetch_size.data = parg->fetch.data;
-	}
-	return ret;
-}
-
-/* Return 1 if name is reserved or already used by another argument */
-static int conflict_field_name(const char *name,
-			       struct probe_arg *args, int narg)
-{
-	int i;
-	for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
-		if (strcmp(reserved_field_names[i], name) == 0)
-			return 1;
-	for (i = 0; i < narg; i++)
-		if (strcmp(args[i].name, name) == 0)
-			return 1;
-	return 0;
-}
-
 static int create_trace_probe(int argc, char **argv)
 {
 	/*
@@ -1153,7 +366,7 @@ static int create_trace_probe(int argc, char **argv)
 	 */
 	struct trace_probe *tp;
 	int i, ret = 0;
-	int is_return = 0, is_delete = 0;
+	bool is_return = false, is_delete = false;
 	char *symbol = NULL, *event = NULL, *group = NULL;
 	char *arg;
 	unsigned long offset = 0;
@@ -1162,11 +375,11 @@ static int create_trace_probe(int argc, char **argv)
 
 	/* argc must be >= 1 */
 	if (argv[0][0] == 'p')
-		is_return = 0;
+		is_return = false;
 	else if (argv[0][0] == 'r')
-		is_return = 1;
+		is_return = true;
 	else if (argv[0][0] == '-')
-		is_delete = 1;
+		is_delete = true;
 	else {
 		pr_info("Probe definition must be started with 'p', 'r' or"
 			" '-'.\n");
@@ -1230,7 +443,7 @@ static int create_trace_probe(int argc, char **argv)
 		/* a symbol specified */
 		symbol = argv[1];
 		/* TODO: support .init module functions */
-		ret = split_symbol_offset(symbol, &offset);
+		ret = traceprobe_split_symbol_offset(symbol, &offset);
 		if (ret) {
 			pr_info("Failed to parse symbol.\n");
 			return ret;
@@ -1292,7 +505,8 @@ static int create_trace_probe(int argc, char **argv)
 			goto error;
 		}
 
-		if (conflict_field_name(tp->args[i].name, tp->args, i)) {
+		if (traceprobe_conflict_field_name(tp->args[i].name,
+							tp->args, i)) {
 			pr_info("Argument[%d] name '%s' conflicts with "
 				"another field.\n", i, argv[i]);
 			ret = -EINVAL;
@@ -1300,7 +514,8 @@ static int create_trace_probe(int argc, char **argv)
 		}
 
 		/* Parse fetch argument */
-		ret = parse_probe_arg(arg, tp, &tp->args[i], is_return);
+		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+								is_return);
 		if (ret) {
 			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
 			goto error;
@@ -1387,70 +602,11 @@ static int probes_open(struct inode *inode, struct file *file)
 	return seq_open(file, &probes_seq_op);
 }
 
-static int command_trace_probe(const char *buf)
-{
-	char **argv;
-	int argc = 0, ret = 0;
-
-	argv = argv_split(GFP_KERNEL, buf, &argc);
-	if (!argv)
-		return -ENOMEM;
-
-	if (argc)
-		ret = create_trace_probe(argc, argv);
-
-	argv_free(argv);
-	return ret;
-}
-
-#define WRITE_BUFSIZE 4096
-
 static ssize_t probes_write(struct file *file, const char __user *buffer,
 			    size_t count, loff_t *ppos)
 {
-	char *kbuf, *tmp;
-	int ret;
-	size_t done;
-	size_t size;
-
-	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
-	if (!kbuf)
-		return -ENOMEM;
-
-	ret = done = 0;
-	while (done < count) {
-		size = count - done;
-		if (size >= WRITE_BUFSIZE)
-			size = WRITE_BUFSIZE - 1;
-		if (copy_from_user(kbuf, buffer + done, size)) {
-			ret = -EFAULT;
-			goto out;
-		}
-		kbuf[size] = '\0';
-		tmp = strchr(kbuf, '\n');
-		if (tmp) {
-			*tmp = '\0';
-			size = tmp - kbuf + 1;
-		} else if (done + size < count) {
-			pr_warning("Line length is too long: "
-				   "Should be less than %d.", WRITE_BUFSIZE);
-			ret = -EINVAL;
-			goto out;
-		}
-		done += size;
-		/* Remove comments */
-		tmp = strchr(kbuf, '#');
-		if (tmp)
-			*tmp = '\0';
-
-		ret = command_trace_probe(kbuf);
-		if (ret)
-			goto out;
-	}
-	ret = done;
-out:
-	kfree(kbuf);
-	return ret;
+	return traceprobe_probes_write(file, buffer, count, ppos,
+			create_trace_probe);
 }
 
 static const struct file_operations kprobe_events_ops = {
@@ -1686,16 +842,6 @@ print_kretprobe_event(struct trace_iterator *iter, int flags,
 	return TRACE_TYPE_PARTIAL_LINE;
 }
 
-#undef DEFINE_FIELD
-#define DEFINE_FIELD(type, item, name, is_signed)			\
-	do {								\
-		ret = trace_define_field(event_call, #type, name,	\
-					 offsetof(typeof(field), item),	\
-					 sizeof(field.item), is_signed, \
-					 FILTER_OTHER);			\
-		if (ret)						\
-			return ret;					\
-	} while (0)
 
 static int kprobe_event_define_fields(struct ftrace_event_call *event_call)
 {
@@ -2020,7 +1166,7 @@ static __init int kprobe_trace_self_tests_init(void)
 
 	pr_info("Testing kprobe tracing: ");
 
-	ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target "
+	ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target "
 				  "$stack $stack0 +0($stack)");
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on probing function entry.\n");
@@ -2035,7 +1181,7 @@ static __init int kprobe_trace_self_tests_init(void)
 			enable_trace_probe(tp, TP_FLAG_TRACE);
 	}
 
-	ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target "
+	ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target "
 				  "$retval");
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on probing function return.\n");
@@ -2055,13 +1201,13 @@ static __init int kprobe_trace_self_tests_init(void)
 
 	ret = target(1, 2, 3, 4, 5, 6);
 
-	ret = command_trace_probe("-:testprobe");
+	ret = traceprobe_command_trace_probe("-:testprobe");
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on deleting a probe.\n");
 		warn++;
 	}
 
-	ret = command_trace_probe("-:testprobe2");
+	ret = traceprobe_command_trace_probe("-:testprobe2");
 	if (WARN_ON_ONCE(ret)) {
 		pr_warning("error on deleting a probe.\n");
 		warn++;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
new file mode 100644
index 0000000..52580b5
--- /dev/null
+++ b/kernel/trace/trace_probe.c
@@ -0,0 +1,778 @@
+/*
+ * Common code for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:     Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
+ */
+
+#include "trace_probe.h"
+
+const char *reserved_field_names[] = {
+	"common_type",
+	"common_flags",
+	"common_preempt_count",
+	"common_pid",
+	"common_tgid",
+	FIELD_STRING_IP,
+	FIELD_STRING_RETIP,
+	FIELD_STRING_FUNC,
+};
+
+/* Printing function type */
+#define PRINT_TYPE_FUNC_NAME(type)	print_type_##type
+#define PRINT_TYPE_FMT_NAME(type)	print_type_format_##type
+
+/* Printing in basic type function template */
+#define DEFINE_BASIC_PRINT_TYPE_FUNC(type, fmt, cast)			\
+static __kprobes int PRINT_TYPE_FUNC_NAME(type)(struct trace_seq *s,	\
+						const char *name,	\
+						void *data, void *ent)\
+{									\
+	return trace_seq_printf(s, " %s=" fmt, name, (cast)*(type *)data);\
+}									\
+static const char PRINT_TYPE_FMT_NAME(type)[] = fmt;
+
+DEFINE_BASIC_PRINT_TYPE_FUNC(u8, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u16, "%x", unsigned int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u32, "%lx", unsigned long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(u64, "%llx", unsigned long long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s8, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s16, "%d", int)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s32, "%ld", long)
+DEFINE_BASIC_PRINT_TYPE_FUNC(s64, "%lld", long long)
+
+static inline void *get_rloc_data(u32 *dl)
+{
+	return (u8 *)dl + get_rloc_offs(*dl);
+}
+
+/* For data_loc conversion */
+static inline void *get_loc_data(u32 *dl, void *ent)
+{
+	return (u8 *)ent + get_rloc_offs(*dl);
+}
+
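The u32 "data location" word that get_rloc_data()/get_loc_data() decode packs the dynamic data's length in the high 16 bits and its offset from the entry in the low 16 bits (the packing the make_data_rloc/get_rloc_* helpers in trace_probe.h assume). A user-space model:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the data_loc packing: len in bits 31..16, offset in
 * bits 15..0, as used by the string fetch/print functions.     */
static uint32_t make_data_rloc(uint16_t len, uint16_t offs)
{
	return ((uint32_t)len << 16) | offs;
}

static uint16_t get_rloc_len(uint32_t dl)  { return dl >> 16; }
static uint16_t get_rloc_offs(uint32_t dl) { return dl & 0xffff; }
```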
+/* For defining macros, define string/string_size types */
+typedef u32 string;
+typedef u32 string_size;
+
+/* Print type function for string type */
+static __kprobes int PRINT_TYPE_FUNC_NAME(string)(struct trace_seq *s,
+						  const char *name,
+						  void *data, void *ent)
+{
+	int len = *(u32 *)data >> 16;
+
+	if (!len)
+		return trace_seq_printf(s, " %s=(fault)", name);
+	else
+		return trace_seq_printf(s, " %s=\"%s\"", name,
+					(const char *)get_loc_data(data, ent));
+}
+static const char PRINT_TYPE_FMT_NAME(string)[] = "\\\"%s\\\"";
+
+#define FETCH_FUNC_NAME(method, type)	fetch_##method##_##type
+/*
+ * Define macro for basic types - we don't need to define s* types, because
+ * we have to care only about bitwidth at recording time.
+ */
+#define DEFINE_BASIC_FETCH_FUNCS(method) \
+DEFINE_FETCH_##method(u8)		\
+DEFINE_FETCH_##method(u16)		\
+DEFINE_FETCH_##method(u32)		\
+DEFINE_FETCH_##method(u64)
+
+#define CHECK_FETCH_FUNCS(method, fn)			\
+	(((FETCH_FUNC_NAME(method, u8) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u16) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u32) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, u64) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, string) == fn) ||	\
+	  (FETCH_FUNC_NAME(method, string_size) == fn)) \
+	 && (fn != NULL))
+
+/* Data fetch function templates */
+#define DEFINE_FETCH_reg(type)						\
+static __kprobes void FETCH_FUNC_NAME(reg, type)(struct pt_regs *regs,	\
+					void *offset, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_get_register(regs,			\
+				(unsigned int)((unsigned long)offset));	\
+}
+DEFINE_BASIC_FETCH_FUNCS(reg)
+/* No string on the register */
+#define fetch_reg_string NULL
+#define fetch_reg_string_size NULL
+
+#define DEFINE_FETCH_stack(type)					\
+static __kprobes void FETCH_FUNC_NAME(stack, type)(struct pt_regs *regs,\
+					  void *offset, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_get_kernel_stack_nth(regs,		\
+				(unsigned int)((unsigned long)offset));	\
+}
+DEFINE_BASIC_FETCH_FUNCS(stack)
+/* No string on the stack entry */
+#define fetch_stack_string NULL
+#define fetch_stack_string_size NULL
+
+#define DEFINE_FETCH_retval(type)					\
+static __kprobes void FETCH_FUNC_NAME(retval, type)(struct pt_regs *regs,\
+					  void *dummy, void *dest)	\
+{									\
+	*(type *)dest = (type)regs_return_value(regs);			\
+}
+DEFINE_BASIC_FETCH_FUNCS(retval)
+/* No string on the retval */
+#define fetch_retval_string NULL
+#define fetch_retval_string_size NULL
+
+#define DEFINE_FETCH_memory(type)					\
+static __kprobes void FETCH_FUNC_NAME(memory, type)(struct pt_regs *regs,\
+					  void *addr, void *dest)	\
+{									\
+	type retval;							\
+	if (probe_kernel_address(addr, retval))				\
+		*(type *)dest = 0;					\
+	else								\
+		*(type *)dest = retval;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(memory)
+/*
+ * Fetch a null-terminated string. Caller MUST set *(u32 *)dest with max
+ * length and relative data location.
+ */
+static __kprobes void FETCH_FUNC_NAME(memory, string)(struct pt_regs *regs,
+						      void *addr, void *dest)
+{
+	long ret;
+	int maxlen = get_rloc_len(*(u32 *)dest);
+	u8 *dst = get_rloc_data(dest);
+	u8 *src = addr;
+	mm_segment_t old_fs = get_fs();
+	if (!maxlen)
+		return;
+	/*
+	 * Try to get the string again, since it can change while we
+	 * are probing.
+	 */
+	set_fs(KERNEL_DS);
+	pagefault_disable();
+	do
+		ret = __copy_from_user_inatomic(dst++, src++, 1);
+	while (dst[-1] && ret == 0 && src - (u8 *)addr < maxlen);
+	dst[-1] = '\0';
+	pagefault_enable();
+	set_fs(old_fs);
+
+	if (ret < 0) {	/* Failed to fetch string */
+		((u8 *)get_rloc_data(dest))[0] = '\0';
+		*(u32 *)dest = make_data_rloc(0, get_rloc_offs(*(u32 *)dest));
+	} else
+		*(u32 *)dest = make_data_rloc(src - (u8 *)addr,
+					      get_rloc_offs(*(u32 *)dest));
+}
+/* Return the length of the string -- including the terminating null byte */
+static __kprobes void FETCH_FUNC_NAME(memory, string_size)(struct pt_regs *regs,
+							void *addr, void *dest)
+{
+	int ret, len = 0;
+	u8 c;
+	mm_segment_t old_fs = get_fs();
+
+	set_fs(KERNEL_DS);
+	pagefault_disable();
+	do {
+		ret = __copy_from_user_inatomic(&c, (u8 *)addr + len, 1);
+		len++;
+	} while (c && ret == 0 && len < MAX_STRING_SIZE);
+	pagefault_enable();
+	set_fs(old_fs);
+
+	if (ret < 0)	/* Failed to check the length */
+		*(u32 *)dest = 0;
+	else
+		*(u32 *)dest = len;
+}
+
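Note that the length reported by fetch_memory_string_size() above includes the terminating null byte and is capped by the maximum. A user-space model of the same loop, reading from an ordinary string instead of probed memory:

```c
#include <assert.h>

/* Model of the string_size loop: count bytes up to and including
 * the NUL, never exceeding maxlen (assumed > 0).                */
static int string_size_model(const char *s, int maxlen)
{
	int len = 0;
	char c;

	do {
		c = s[len];
		len++;
	} while (c && len < maxlen);
	return len;
}
```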
+/* Memory fetching by symbol */
+struct symbol_cache {
+	char *symbol;
+	long offset;
+	unsigned long addr;
+};
+
+static unsigned long update_symbol_cache(struct symbol_cache *sc)
+{
+	sc->addr = (unsigned long)kallsyms_lookup_name(sc->symbol);
+	if (sc->addr)
+		sc->addr += sc->offset;
+	return sc->addr;
+}
+
+static void free_symbol_cache(struct symbol_cache *sc)
+{
+	kfree(sc->symbol);
+	kfree(sc);
+}
+
+static struct symbol_cache *alloc_symbol_cache(const char *sym, long offset)
+{
+	struct symbol_cache *sc;
+
+	if (!sym || strlen(sym) == 0)
+		return NULL;
+	sc = kzalloc(sizeof(struct symbol_cache), GFP_KERNEL);
+	if (!sc)
+		return NULL;
+
+	sc->symbol = kstrdup(sym, GFP_KERNEL);
+	if (!sc->symbol) {
+		kfree(sc);
+		return NULL;
+	}
+	sc->offset = offset;
+
+	update_symbol_cache(sc);
+	return sc;
+}
+
+#define DEFINE_FETCH_symbol(type)					\
+static __kprobes void FETCH_FUNC_NAME(symbol, type)(struct pt_regs *regs,\
+					  void *data, void *dest)	\
+{									\
+	struct symbol_cache *sc = data;					\
+	if (sc->addr)							\
+		fetch_memory_##type(regs, (void *)sc->addr, dest);	\
+	else								\
+		*(type *)dest = 0;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(symbol)
+DEFINE_FETCH_symbol(string)
+DEFINE_FETCH_symbol(string_size)
+
+/* Dereference memory access function */
+struct deref_fetch_param {
+	struct fetch_param orig;
+	long offset;
+};
+
+#define DEFINE_FETCH_deref(type)					\
+static __kprobes void FETCH_FUNC_NAME(deref, type)(struct pt_regs *regs,\
+					    void *data, void *dest)	\
+{									\
+	struct deref_fetch_param *dprm = data;				\
+	unsigned long addr;						\
+	call_fetch(&dprm->orig, regs, &addr);				\
+	if (addr) {							\
+		addr += dprm->offset;					\
+		fetch_memory_##type(regs, (void *)addr, dest);		\
+	} else								\
+		*(type *)dest = 0;					\
+}
+DEFINE_BASIC_FETCH_FUNCS(deref)
+DEFINE_FETCH_deref(string)
+DEFINE_FETCH_deref(string_size)
+
+static __kprobes void update_deref_fetch_param(struct deref_fetch_param *data)
+{
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		update_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		update_symbol_cache(data->orig.data);
+}
+
+static __kprobes void free_deref_fetch_param(struct deref_fetch_param *data)
+{
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		free_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		free_symbol_cache(data->orig.data);
+	kfree(data);
+}
+
+/* Bitfield fetch function */
+struct bitfield_fetch_param {
+	struct fetch_param orig;
+	unsigned char hi_shift;
+	unsigned char low_shift;
+};
+
+#define DEFINE_FETCH_bitfield(type)					\
+static __kprobes void FETCH_FUNC_NAME(bitfield, type)(struct pt_regs *regs,\
+					    void *data, void *dest)	\
+{									\
+	struct bitfield_fetch_param *bprm = data;			\
+	type buf = 0;							\
+	call_fetch(&bprm->orig, regs, &buf);				\
+	if (buf) {							\
+		buf <<= bprm->hi_shift;					\
+		buf >>= bprm->low_shift;				\
+	}								\
+	*(type *)dest = buf;						\
+}
+DEFINE_BASIC_FETCH_FUNCS(bitfield)
+#define fetch_bitfield_string NULL
+#define fetch_bitfield_string_size NULL
+
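The hi_shift/low_shift pair consumed by the bitfield fetch macro is set up later by __parse_bitfield_probe_arg() so that a left shift followed by a right shift isolates a field of `bw` bits starting at bit `bo` of its container. A user-space sketch of that arithmetic for a 32-bit container (plain C, not kernel code; assumes bw + bo <= 32):

```c
#include <assert.h>
#include <stdint.h>

/* Model of fetch_bitfield_*: shift out the bits above the field,
 * then shift out the bits below it, exactly as hi_shift/low_shift
 * are computed for a type string like "b<bw>@<bo>/32".           */
static uint32_t extract_bitfield32(uint32_t buf, unsigned bw, unsigned bo)
{
	unsigned hi_shift  = 32 - (bw + bo);	/* drop bits above the field */
	unsigned low_shift = hi_shift + bo;	/* then drop bits below it   */

	return (buf << hi_shift) >> low_shift;
}
```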
+static __kprobes void
+update_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+	/*
+	 * Don't check the bitfield itself, because this must be the
+	 * last fetch function.
+	 */
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		update_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		update_symbol_cache(data->orig.data);
+}
+
+static __kprobes void
+free_bitfield_fetch_param(struct bitfield_fetch_param *data)
+{
+	/*
+	 * Don't check the bitfield itself, because this must be the
+	 * last fetch function.
+	 */
+	if (CHECK_FETCH_FUNCS(deref, data->orig.fn))
+		free_deref_fetch_param(data->orig.data);
+	else if (CHECK_FETCH_FUNCS(symbol, data->orig.fn))
+		free_symbol_cache(data->orig.data);
+	kfree(data);
+}
+
+/* Default (unsigned long) fetch type */
+#define __DEFAULT_FETCH_TYPE(t) u##t
+#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
+#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
+#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
+
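The two-level macro indirection above exists so that BITS_PER_LONG is expanded *before* token pasting, yielding "u32" or "u64" rather than the literal "uBITS_PER_LONG". A user-space demonstration with BITS_PER_LONG and __stringify defined locally for illustration:

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for kernel definitions, for demonstration only */
#define BITS_PER_LONG 64
#define __stringify_1(x) #x
#define __stringify(x) __stringify_1(x)

/* Same indirection as in the patch: expand t, then paste */
#define __DEFAULT_FETCH_TYPE(t) u##t
#define _DEFAULT_FETCH_TYPE(t) __DEFAULT_FETCH_TYPE(t)
#define DEFAULT_FETCH_TYPE _DEFAULT_FETCH_TYPE(BITS_PER_LONG)
#define DEFAULT_FETCH_TYPE_STR __stringify(DEFAULT_FETCH_TYPE)
```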
+#define ASSIGN_FETCH_FUNC(method, type)	\
+	[FETCH_MTD_##method] = FETCH_FUNC_NAME(method, type)
+
+#define __ASSIGN_FETCH_TYPE(_name, ptype, ftype, _size, sign, _fmttype)	\
+	{.name = _name,				\
+	 .size = _size,					\
+	 .is_signed = sign,				\
+	 .print = PRINT_TYPE_FUNC_NAME(ptype),		\
+	 .fmt = PRINT_TYPE_FMT_NAME(ptype),		\
+	 .fmttype = _fmttype,				\
+	 .fetch = {					\
+ASSIGN_FETCH_FUNC(reg, ftype),				\
+ASSIGN_FETCH_FUNC(stack, ftype),			\
+ASSIGN_FETCH_FUNC(retval, ftype),			\
+ASSIGN_FETCH_FUNC(memory, ftype),			\
+ASSIGN_FETCH_FUNC(symbol, ftype),			\
+ASSIGN_FETCH_FUNC(deref, ftype),			\
+ASSIGN_FETCH_FUNC(bitfield, ftype),			\
+	  }						\
+	}
+
+#define ASSIGN_FETCH_TYPE(ptype, ftype, sign)			\
+	__ASSIGN_FETCH_TYPE(#ptype, ptype, ftype, sizeof(ftype), sign, #ptype)
+
+#define FETCH_TYPE_STRING 0
+#define FETCH_TYPE_STRSIZE 1
+
+/* Fetch type information table */
+static const struct fetch_type fetch_type_table[] = {
+	/* Special types */
+	[FETCH_TYPE_STRING] = __ASSIGN_FETCH_TYPE("string", string, string,
+					sizeof(u32), 1, "__data_loc char[]"),
+	[FETCH_TYPE_STRSIZE] = __ASSIGN_FETCH_TYPE("string_size", u32,
+					string_size, sizeof(u32), 0, "u32"),
+	/* Basic types */
+	ASSIGN_FETCH_TYPE(u8,  u8,  0),
+	ASSIGN_FETCH_TYPE(u16, u16, 0),
+	ASSIGN_FETCH_TYPE(u32, u32, 0),
+	ASSIGN_FETCH_TYPE(u64, u64, 0),
+	ASSIGN_FETCH_TYPE(s8,  u8,  1),
+	ASSIGN_FETCH_TYPE(s16, u16, 1),
+	ASSIGN_FETCH_TYPE(s32, u32, 1),
+	ASSIGN_FETCH_TYPE(s64, u64, 1),
+};
+
+static const struct fetch_type *find_fetch_type(const char *type)
+{
+	int i;
+
+	if (!type)
+		type = DEFAULT_FETCH_TYPE_STR;
+
+	/* Special case: bitfield */
+	if (*type == 'b') {
+		unsigned long bs;
+		type = strchr(type, '/');
+		if (!type)
+			goto fail;
+		type++;
+		if (strict_strtoul(type, 0, &bs))
+			goto fail;
+		switch (bs) {
+		case 8:
+			return find_fetch_type("u8");
+		case 16:
+			return find_fetch_type("u16");
+		case 32:
+			return find_fetch_type("u32");
+		case 64:
+			return find_fetch_type("u64");
+		default:
+			goto fail;
+		}
+	}
+
+	for (i = 0; i < ARRAY_SIZE(fetch_type_table); i++)
+		if (strcmp(type, fetch_type_table[i].name) == 0)
+			return &fetch_type_table[i];
+fail:
+	return NULL;
+}
+
+/* Special function: only accepts unsigned long */
+static __kprobes void fetch_stack_address(struct pt_regs *regs,
+					void *dummy, void *dest)
+{
+	*(unsigned long *)dest = kernel_stack_pointer(regs);
+}
+
+static fetch_func_t get_fetch_size_function(const struct fetch_type *type,
+					fetch_func_t orig_fn)
+{
+	int i;
+
+	if (type != &fetch_type_table[FETCH_TYPE_STRING])
+		return NULL;	/* Only string type needs size function */
+	for (i = 0; i < FETCH_MTD_END; i++)
+		if (type->fetch[i] == orig_fn)
+			return fetch_type_table[FETCH_TYPE_STRSIZE].fetch[i];
+
+	WARN_ON(1);	/* This should not happen */
+	return NULL;
+}
+
+
+/* Split symbol and offset. */
+int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset)
+{
+	char *tmp;
+	int ret;
+
+	if (!offset)
+		return -EINVAL;
+
+	tmp = strchr(symbol, '+');
+	if (tmp) {
+		/* skip the sign because strict_strtoul() doesn't accept '+' */
+		ret = strict_strtoul(tmp + 1, 0, offset);
+		if (ret)
+			return ret;
+		*tmp = '\0';
+	} else
+		*offset = 0;
+	return 0;
+}
+
+
+#define PARAM_MAX_STACK (THREAD_SIZE / sizeof(unsigned long))
+
+static int parse_probe_vars(char *arg, const struct fetch_type *t,
+			    struct fetch_param *f, bool is_return)
+{
+	int ret = 0;
+	unsigned long param;
+
+	if (strcmp(arg, "retval") == 0) {
+		if (is_return)
+			f->fn = t->fetch[FETCH_MTD_retval];
+		else
+			ret = -EINVAL;
+	} else if (strncmp(arg, "stack", 5) == 0) {
+		if (arg[5] == '\0') {
+			if (strcmp(t->name, DEFAULT_FETCH_TYPE_STR) == 0)
+				f->fn = fetch_stack_address;
+			else
+				ret = -EINVAL;
+		} else if (isdigit(arg[5])) {
+			ret = strict_strtoul(arg + 5, 10, &param);
+			if (ret || param > PARAM_MAX_STACK)
+				ret = -EINVAL;
+			else {
+				f->fn = t->fetch[FETCH_MTD_stack];
+				f->data = (void *)param;
+			}
+		} else
+			ret = -EINVAL;
+	} else
+		ret = -EINVAL;
+	return ret;
+}
+
+/* Recursive argument parser */
+static int parse_probe_arg(char *arg, const struct fetch_type *t,
+		     struct fetch_param *f, bool is_return)
+{
+	int ret = 0;
+	unsigned long param;
+	long offset;
+	char *tmp;
+
+	switch (arg[0]) {
+	case '$':
+		ret = parse_probe_vars(arg + 1, t, f, is_return);
+		break;
+	case '%':	/* named register */
+		ret = regs_query_register_offset(arg + 1);
+		if (ret >= 0) {
+			f->fn = t->fetch[FETCH_MTD_reg];
+			f->data = (void *)(unsigned long)ret;
+			ret = 0;
+		}
+		break;
+	case '@':	/* memory or symbol */
+		if (isdigit(arg[1])) {
+			ret = strict_strtoul(arg + 1, 0, &param);
+			if (ret)
+				break;
+			f->fn = t->fetch[FETCH_MTD_memory];
+			f->data = (void *)param;
+		} else {
+			ret = traceprobe_split_symbol_offset(arg + 1, &offset);
+			if (ret)
+				break;
+			f->data = alloc_symbol_cache(arg + 1, offset);
+			if (f->data)
+				f->fn = t->fetch[FETCH_MTD_symbol];
+		}
+		break;
+	case '+':	/* deref memory */
+		arg++;	/* Skip '+', because strict_strtol() rejects it. */
+	case '-':
+		tmp = strchr(arg, '(');
+		if (!tmp)
+			break;
+		*tmp = '\0';
+		ret = strict_strtol(arg, 0, &offset);
+		if (ret)
+			break;
+		arg = tmp + 1;
+		tmp = strrchr(arg, ')');
+		if (tmp) {
+			struct deref_fetch_param *dprm;
+			const struct fetch_type *t2 = find_fetch_type(NULL);
+			*tmp = '\0';
+			dprm = kzalloc(sizeof(struct deref_fetch_param),
+				       GFP_KERNEL);
+			if (!dprm)
+				return -ENOMEM;
+			dprm->offset = offset;
+			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+			if (ret)
+				kfree(dprm);
+			else {
+				f->fn = t->fetch[FETCH_MTD_deref];
+				f->data = (void *)dprm;
+			}
+		}
+		break;
+	}
+	if (!ret && !f->fn) {	/* Parsed, but no fetch method found */
+		pr_info("%s type has no corresponding fetch method.\n",
+			t->name);
+		ret = -EINVAL;
+	}
+	return ret;
+}
+#define BYTES_TO_BITS(nb)	((BITS_PER_LONG * (nb)) / sizeof(long))
+
+/* Bitfield type needs to be parsed into a fetch function */
+static int __parse_bitfield_probe_arg(const char *bf,
+				      const struct fetch_type *t,
+				      struct fetch_param *f)
+{
+	struct bitfield_fetch_param *bprm;
+	unsigned long bw, bo;
+	char *tail;
+
+	if (*bf != 'b')
+		return 0;
+
+	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
+	if (!bprm)
+		return -ENOMEM;
+	bprm->orig = *f;
+	f->fn = t->fetch[FETCH_MTD_bitfield];
+	f->data = (void *)bprm;
+
+	bw = simple_strtoul(bf + 1, &tail, 0);	/* Use simple one */
+	if (bw == 0 || *tail != '@')
+		return -EINVAL;
+
+	bf = tail + 1;
+	bo = simple_strtoul(bf, &tail, 0);
+	if (tail == bf || *tail != '/')
+		return -EINVAL;
+
+	bprm->hi_shift = BYTES_TO_BITS(t->size) - (bw + bo);
+	bprm->low_shift = bprm->hi_shift + bo;
+	return (BYTES_TO_BITS(t->size) < (bw + bo)) ? -EINVAL : 0;
+}
+
+/* String length checking wrapper */
+int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+		struct probe_arg *parg, bool is_return)
+{
+	const char *t;
+	int ret;
+
+	if (strlen(arg) > MAX_ARGSTR_LEN) {
+		pr_info("Argument is too long: %s\n", arg);
+		return -ENOSPC;
+	}
+	parg->comm = kstrdup(arg, GFP_KERNEL);
+	if (!parg->comm) {
+		pr_info("Failed to allocate memory for command '%s'.\n", arg);
+		return -ENOMEM;
+	}
+	t = strchr(parg->comm, ':');
+	if (t) {
+		arg[t - parg->comm] = '\0';
+		t++;
+	}
+	parg->type = find_fetch_type(t);
+	if (!parg->type) {
+		pr_info("Unsupported type: %s\n", t);
+		return -EINVAL;
+	}
+	parg->offset = *size;
+	*size += parg->type->size;
+	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+	if (ret >= 0 && t != NULL)
+		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
+	if (ret >= 0) {
+		parg->fetch_size.fn = get_fetch_size_function(parg->type,
+							      parg->fetch.fn);
+		parg->fetch_size.data = parg->fetch.data;
+	}
+	return ret;
+}
+
+/* Return 1 if name is reserved or already used by another argument */
+int traceprobe_conflict_field_name(const char *name,
+			       struct probe_arg *args, int narg)
+{
+	int i;
+	for (i = 0; i < ARRAY_SIZE(reserved_field_names); i++)
+		if (strcmp(reserved_field_names[i], name) == 0)
+			return 1;
+	for (i = 0; i < narg; i++)
+		if (strcmp(args[i].name, name) == 0)
+			return 1;
+	return 0;
+}
+
+void traceprobe_update_arg(struct probe_arg *arg)
+{
+	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+		update_bitfield_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+		update_deref_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+		update_symbol_cache(arg->fetch.data);
+}
+
+
+void traceprobe_free_probe_arg(struct probe_arg *arg)
+{
+	if (CHECK_FETCH_FUNCS(bitfield, arg->fetch.fn))
+		free_bitfield_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(deref, arg->fetch.fn))
+		free_deref_fetch_param(arg->fetch.data);
+	else if (CHECK_FETCH_FUNCS(symbol, arg->fetch.fn))
+		free_symbol_cache(arg->fetch.data);
+	kfree(arg->name);
+	kfree(arg->comm);
+}
+
+int traceprobe_command(const char *buf, int (*createfn)(int, char**))
+{
+	char **argv;
+	int argc = 0, ret = 0;
+
+	argv = argv_split(GFP_KERNEL, buf, &argc);
+	if (!argv)
+		return -ENOMEM;
+
+	if (argc)
+		ret = createfn(argc, argv);
+
+	argv_free(argv);
+	return ret;
+}
+
+#define WRITE_BUFSIZE 128
+
+ssize_t traceprobe_probes_write(struct file *file, const char __user *buffer,
+	    size_t count, loff_t *ppos, int (*createfn)(int, char**))
+{
+	char *kbuf, *tmp;
+	int ret = 0;
+	size_t done = 0;
+	size_t size;
+
+	kbuf = kmalloc(WRITE_BUFSIZE, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	while (done < count) {
+		size = count - done;
+		if (size >= WRITE_BUFSIZE)
+			size = WRITE_BUFSIZE - 1;
+		if (copy_from_user(kbuf, buffer + done, size)) {
+			ret = -EFAULT;
+			goto out;
+		}
+		kbuf[size] = '\0';
+		tmp = strchr(kbuf, '\n');
+		if (tmp) {
+			*tmp = '\0';
+			size = tmp - kbuf + 1;
+		} else if (done + size < count) {
+			pr_warning("Line is too long: should be "
+				   "less than %d characters.", WRITE_BUFSIZE);
+			ret = -EINVAL;
+			goto out;
+		}
+		done += size;
+		/* Remove comments */
+		tmp = strchr(kbuf, '#');
+		if (tmp)
+			*tmp = '\0';
+
+		ret = traceprobe_command(kbuf, createfn);
+		if (ret)
+			goto out;
+	}
+	ret = done;
+out:
+	kfree(kbuf);
+	return ret;
+}
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
new file mode 100644
index 0000000..500a08f
--- /dev/null
+++ b/kernel/trace/trace_probe.h
@@ -0,0 +1,160 @@
+/*
+ * Common header file for probe-based Dynamic events.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:     Srikar Dronamraju
+ *
+ * Derived from kernel/trace/trace_kprobe.c written by
+ * Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
+ */
+
+#include <linux/seq_file.h>
+#include <linux/slab.h>
+#include <linux/smp.h>
+#include <linux/debugfs.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/ctype.h>
+#include <linux/ptrace.h>
+#include <linux/perf_event.h>
+#include <linux/kprobes.h>
+#include <linux/stringify.h>
+#include <linux/limits.h>
+#include <linux/uaccess.h>
+#include <asm/bitsperlong.h>
+
+#include "trace.h"
+#include "trace_output.h"
+
+#define MAX_TRACE_ARGS 128
+#define MAX_ARGSTR_LEN 63
+#define MAX_EVENT_NAME_LEN 64
+#define MAX_STRING_SIZE PATH_MAX
+
+/* Reserved field names */
+#define FIELD_STRING_IP "__probe_ip"
+#define FIELD_STRING_RETIP "__probe_ret_ip"
+#define FIELD_STRING_FUNC "__probe_func"
+
+#undef DEFINE_FIELD
+#define DEFINE_FIELD(type, item, name, is_signed)			\
+	do {								\
+		ret = trace_define_field(event_call, #type, name,	\
+					 offsetof(typeof(field), item),	\
+					 sizeof(field.item), is_signed, \
+					 FILTER_OTHER);			\
+		if (ret)						\
+			return ret;					\
+	} while (0)
+
+
+/* Flags for trace_probe */
+#define TP_FLAG_TRACE	1
+#define TP_FLAG_PROFILE	2
+#define TP_FLAG_REGISTERED 4
+
+
+/* data_rloc: data relative location, compatible with u32 */
+#define make_data_rloc(len, roffs)	\
+	(((u32)(len) << 16) | ((u32)(roffs) & 0xffff))
+#define get_rloc_len(dl)	((u32)(dl) >> 16)
+#define get_rloc_offs(dl)	((u32)(dl) & 0xffff)
+
+/*
+ * Convert data_rloc to data_loc:
+ *  data_rloc stores the offset from data_rloc itself, but data_loc
+ *  stores the offset from event entry.
+ */
+#define convert_rloc_to_loc(dl, offs)	((u32)(dl) + (offs))
+
+/* Data fetch function type */
+typedef	void (*fetch_func_t)(struct pt_regs *, void *, void *);
+/* Printing function type */
+typedef int (*print_type_func_t)(struct trace_seq *, const char *, void *,
+				 void *);
+
+/* Fetch types */
+enum {
+	FETCH_MTD_reg = 0,
+	FETCH_MTD_stack,
+	FETCH_MTD_retval,
+	FETCH_MTD_memory,
+	FETCH_MTD_symbol,
+	FETCH_MTD_deref,
+	FETCH_MTD_bitfield,
+	FETCH_MTD_END,
+};
+
+/* Fetch type information table */
+struct fetch_type {
+	const char	*name;		/* Name of type */
+	size_t		size;		/* Byte size of type */
+	int		is_signed;	/* Signed flag */
+	print_type_func_t	print;	/* Print functions */
+	const char	*fmt;		/* Format string */
+	const char	*fmttype;	/* Name in format file */
+	/* Fetch functions */
+	fetch_func_t	fetch[FETCH_MTD_END];
+};
+
+struct fetch_param {
+	fetch_func_t	fn;
+	void *data;
+};
+
+struct probe_arg {
+	struct fetch_param	fetch;
+	struct fetch_param	fetch_size;
+	unsigned int		offset;	/* Offset from argument entry */
+	const char		*name;	/* Name of this argument */
+	const char		*comm;	/* Command of this argument */
+	const struct fetch_type	*type;	/* Type of this argument */
+};
+
+static inline __kprobes void call_fetch(struct fetch_param *fprm,
+				 struct pt_regs *regs, void *dest)
+{
+	return fprm->fn(regs, fprm->data, dest);
+}
+
+/* Check the name is good for event/group/fields */
+static int is_good_name(const char *name)
+{
+	if (!isalpha(*name) && *name != '_')
+		return 0;
+	while (*++name != '\0') {
+		if (!isalpha(*name) && !isdigit(*name) && *name != '_')
+			return 0;
+	}
+	return 1;
+}
+
+extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
+		   struct probe_arg *parg, bool is_return);
+
+extern int traceprobe_conflict_field_name(const char *name,
+			       struct probe_arg *args, int narg);
+
+extern void traceprobe_update_arg(struct probe_arg *arg);
+extern void traceprobe_free_probe_arg(struct probe_arg *arg);
+
+extern int traceprobe_split_symbol_offset(char *symbol, unsigned long *offset);
+
+extern ssize_t traceprobe_probes_write(struct file *file,
+		const char __user *buffer, size_t count, loff_t *ppos,
+		int (*createfn)(int, char**));
+
+extern int traceprobe_command(const char *buf, int (*createfn)(int, char**));



* [PATCH v5 3.1.0-rc4-tip 20/26]   tracing: uprobes trace_event interface
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:04   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML


Implements trace_event support for uprobes. In its current form it can
be used to put probes at a specified offset in a file and dump the
required registers when the code flow reaches the probed address.

The following example shows how to dump the instruction pointer and the
%ax register at the probed text address. Here we are probing zfree in
/bin/zsh:

# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g    DF .text  0000000000000012  Base        zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
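For reference, the 0x46420 offset passed to uprobe_events above is just the
symbol's virtual address reported by objdump minus the start of the r-xp
mapping from /proc/PID/maps, plus that mapping's file offset (zero here):

```shell
# 0x446420 (zfree per objdump) - 0x400000 (map start) + 0x0 (map file offset)
# yields the <file>:<offset> value that uprobe_events expects.
printf '0x%x\n' $(( 0x446420 - 0x400000 + 0x0 ))   # -> 0x46420
```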

TODO: Connect a filter to a consumer.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/Kconfig                |    8 
 kernel/trace/Kconfig        |   16 +
 kernel/trace/Makefile       |    1 
 kernel/trace/trace.h        |    5 
 kernel/trace/trace_kprobe.c |    4 
 kernel/trace/trace_probe.c  |   14 +
 kernel/trace/trace_probe.h  |    6 
 kernel/trace/trace_uprobe.c |  770 +++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 809 insertions(+), 15 deletions(-)
 create mode 100644 kernel/trace/trace_uprobe.c

diff --git a/arch/Kconfig b/arch/Kconfig
index d6a4e1d..53ce702 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -62,14 +62,8 @@ config OPTPROBES
 	depends on !PREEMPT
 
 config UPROBES
-	bool "User-space probes (EXPERIMENTAL)"
 	select MM_OWNER
-	help
-	  Uprobes enables kernel subsystems to establish probepoints
-	  in user applications and execute handler functions when
-	  the probepoints are hit.
-
-	  If in doubt, say "N".
+	def_bool n
 
 config HAVE_EFFICIENT_UNALIGNED_ACCESS
 	bool
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 520106a..b001fb1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -386,6 +386,22 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config UPROBE_EVENT
+	bool "Enable uprobes-based dynamic events"
+	depends on ARCH_SUPPORTS_UPROBES
+	depends on MMU
+	select UPROBES
+	select PROBE_EVENTS
+	select TRACING
+	default n
+	help
+	  This allows the user to add tracing events (similar to tracepoints)
+	  on top of user space dynamic events, on the fly, via the trace
+	  events interface. Those events can be inserted wherever uprobes
+	  can probe, and record various registers.
+	  This option is required if you plan to use the perf-probe
+	  subcommand of perf tools on user space applications.
+
 config PROBE_EVENTS
 	def_bool n
 
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 692223a..bb3d3ff 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -57,5 +57,6 @@ ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
 obj-$(CONFIG_PROBE_EVENTS) +=trace_probe.o
+obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 616846b..c9b737c 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -97,6 +97,11 @@ struct kretprobe_trace_entry_head {
 	unsigned long		ret_ip;
 };
 
+struct uprobe_trace_entry_head {
+	struct trace_entry	ent;
+	unsigned long		ip;
+};
+
 /*
  * trace_flag_type is an enumeration that holds different
  * states when a trace occurs. These are:
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d5f4e51..b156d8f 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -514,8 +514,8 @@ static int create_trace_probe(int argc, char **argv)
 		}
 
 		/* Parse fetch argument */
-		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
-								is_return);
+		ret = traceprobe_parse_probe_arg(arg, &tp->size,
+					&tp->args[i], is_return, true);
 		if (ret) {
 			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
 			goto error;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 52580b5..d8f71ef 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -528,13 +528,17 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,
 
 /* Recursive argument parser */
 static int parse_probe_arg(char *arg, const struct fetch_type *t,
-		     struct fetch_param *f, bool is_return)
+		     struct fetch_param *f, bool is_return, bool is_kprobe)
 {
 	int ret = 0;
 	unsigned long param;
 	long offset;
 	char *tmp;
 
+	/* For now, uprobe_events accepts only register arguments */
+	if (!is_kprobe && arg[0] != '%')
+		return -EINVAL;
+
 	switch (arg[0]) {
 	case '$':
 		ret = parse_probe_vars(arg + 1, t, f, is_return);
@@ -584,7 +588,8 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t,
 			if (!dprm)
 				return -ENOMEM;
 			dprm->offset = offset;
-			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return,
+							is_kprobe);
 			if (ret)
 				kfree(dprm);
 			else {
@@ -638,7 +643,7 @@ static int __parse_bitfield_probe_arg(const char *bf,
 
 /* String length checking wrapper */
 int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
-		struct probe_arg *parg, bool is_return)
+		struct probe_arg *parg, bool is_return, bool is_kprobe)
 {
 	const char *t;
 	int ret;
@@ -664,7 +669,8 @@ int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
 	}
 	parg->offset = *size;
 	*size += parg->type->size;
-	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return,
+							is_kprobe);
 	if (ret >= 0 && t != NULL)
 		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
 	if (ret >= 0) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 500a08f..0cab89a 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -48,6 +48,7 @@
 #define FIELD_STRING_IP "__probe_ip"
 #define FIELD_STRING_RETIP "__probe_ret_ip"
 #define FIELD_STRING_FUNC "__probe_func"
+#define FIELD_STRING_PID "__probe_pid"
 
 #undef DEFINE_FIELD
 #define DEFINE_FIELD(type, item, name, is_signed)			\
@@ -65,6 +66,7 @@
 #define TP_FLAG_TRACE	1
 #define TP_FLAG_PROFILE	2
 #define TP_FLAG_REGISTERED 4
+#define TP_FLAG_UPROBE	8
 
 
 /* data_rloc: data relative location, compatible with u32 */
@@ -131,7 +133,7 @@ static inline __kprobes void call_fetch(struct fetch_param *fprm,
 }
 
 /* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
+static inline int is_good_name(const char *name)
 {
 	if (!isalpha(*name) && *name != '_')
 		return 0;
@@ -143,7 +145,7 @@ static int is_good_name(const char *name)
 }
 
 extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
-		   struct probe_arg *parg, bool is_return);
+		   struct probe_arg *parg, bool is_return, bool is_kprobe);
 
 extern int traceprobe_conflict_field_name(const char *name,
 			       struct probe_arg *args, int narg);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
new file mode 100644
index 0000000..d274207
--- /dev/null
+++ b/kernel/trace/trace_uprobe.c
@@ -0,0 +1,770 @@
+/*
+ * uprobes-based tracing events
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:	Srikar Dronamraju
+ */
+
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/uprobes.h>
+#include <linux/namei.h>
+
+#include "trace_probe.h"
+
+#define UPROBE_EVENT_SYSTEM "uprobes"
+
+/*
+ * uprobe event core functions
+ */
+struct trace_uprobe;
+struct uprobe_trace_consumer {
+	struct uprobe_consumer cons;
+	struct trace_uprobe *tp;
+};
+
+struct trace_uprobe {
+	struct list_head	list;
+	struct ftrace_event_class	class;
+	struct ftrace_event_call	call;
+	struct uprobe_trace_consumer	*consumer;
+	struct inode		*inode;
+	char			*filename;
+	unsigned long		offset;
+	unsigned long		nhit;
+	unsigned int		flags;	/* For TP_FLAG_* */
+	ssize_t			size;		/* trace entry size */
+	unsigned int		nr_args;
+	struct probe_arg	args[];
+};
+
+#define SIZEOF_TRACE_UPROBE(n)			\
+	(offsetof(struct trace_uprobe, args) +	\
+	(sizeof(struct probe_arg) * (n)))
+
+static int register_uprobe_event(struct trace_uprobe *tp);
+static void unregister_uprobe_event(struct trace_uprobe *tp);
+
+static DEFINE_MUTEX(uprobe_lock);
+static LIST_HEAD(uprobe_list);
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs);
+
+/*
+ * Allocate new trace_uprobe and initialize it (including uprobes).
+ */
+static struct trace_uprobe *alloc_trace_uprobe(const char *group,
+				const char *event, int nargs)
+{
+	struct trace_uprobe *tp;
+
+	if (!event || !is_good_name(event))
+		return ERR_PTR(-EINVAL);
+
+	if (!group || !is_good_name(group))
+		return ERR_PTR(-EINVAL);
+
+	tp = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
+	if (!tp)
+		return ERR_PTR(-ENOMEM);
+
+	tp->call.class = &tp->class;
+	tp->call.name = kstrdup(event, GFP_KERNEL);
+	if (!tp->call.name)
+		goto error;
+
+	tp->class.system = kstrdup(group, GFP_KERNEL);
+	if (!tp->class.system)
+		goto error;
+
+	INIT_LIST_HEAD(&tp->list);
+	return tp;
+error:
+	kfree(tp->call.name);
+	kfree(tp);
+	return ERR_PTR(-ENOMEM);
+}
+
+static void free_trace_uprobe(struct trace_uprobe *tp)
+{
+	int i;
+
+	for (i = 0; i < tp->nr_args; i++)
+		traceprobe_free_probe_arg(&tp->args[i]);
+
+	iput(tp->inode);
+	kfree(tp->call.class->system);
+	kfree(tp->call.name);
+	kfree(tp->filename);
+	kfree(tp);
+}
+
+static struct trace_uprobe *find_probe_event(const char *event,
+					const char *group)
+{
+	struct trace_uprobe *tp;
+
+	list_for_each_entry(tp, &uprobe_list, list)
+		if (strcmp(tp->call.name, event) == 0 &&
+		    strcmp(tp->call.class->system, group) == 0)
+			return tp;
+	return NULL;
+}
+
+/* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock */
+static void unregister_trace_uprobe(struct trace_uprobe *tp)
+{
+	list_del(&tp->list);
+	unregister_uprobe_event(tp);
+	free_trace_uprobe(tp);
+}
+
+/* Register a trace_uprobe and probe_event */
+static int register_trace_uprobe(struct trace_uprobe *tp)
+{
+	struct trace_uprobe *old_tp;
+	int ret;
+
+	mutex_lock(&uprobe_lock);
+
+	/* register as an event */
+	old_tp = find_probe_event(tp->call.name, tp->call.class->system);
+	if (old_tp)
+		/* delete old event */
+		unregister_trace_uprobe(old_tp);
+
+	ret = register_uprobe_event(tp);
+	if (ret) {
+		pr_warning("Failed to register probe event(%d)\n", ret);
+		goto end;
+	}
+
+	list_add_tail(&tp->list, &uprobe_list);
+end:
+	mutex_unlock(&uprobe_lock);
+	return ret;
+}
+
+static int create_trace_uprobe(int argc, char **argv)
+{
+	/*
+	 * Argument syntax:
+	 *  - Add uprobe: p[:[GRP/]EVENT] FILE:OFFSET [%REG]
+	 *
+	 *  - Remove uprobe: -:[GRP/]EVENT
+	 */
+	struct path path;
+	struct inode *inode = NULL;
+	struct trace_uprobe *tp;
+	int i, ret = 0;
+	int is_delete = 0;
+	char *arg = NULL, *event = NULL, *group = NULL;
+	unsigned long offset;
+	char buf[MAX_EVENT_NAME_LEN];
+	char *filename;
+
+	/* argc must be >= 1 */
+	if (argv[0][0] == '-')
+		is_delete = 1;
+	else if (argv[0][0] != 'p') {
+		pr_info("Probe definition must start with "
+			"'p' or '-'.\n");
+		return -EINVAL;
+	}
+
+	if (argv[0][1] == ':') {
+		event = &argv[0][2];
+		if (strchr(event, '/')) {
+			group = event;
+			event = strchr(group, '/') + 1;
+			event[-1] = '\0';
+			if (strlen(group) == 0) {
+				pr_info("Group name is not specified\n");
+				return -EINVAL;
+			}
+		}
+		if (strlen(event) == 0) {
+			pr_info("Event name is not specified\n");
+			return -EINVAL;
+		}
+	}
+	if (!group)
+		group = UPROBE_EVENT_SYSTEM;
+
+	if (is_delete) {
+		if (!event) {
+			pr_info("Delete command needs an event name.\n");
+			return -EINVAL;
+		}
+		mutex_lock(&uprobe_lock);
+		tp = find_probe_event(event, group);
+		if (!tp) {
+			mutex_unlock(&uprobe_lock);
+			pr_info("Event %s/%s doesn't exist.\n", group, event);
+			return -ENOENT;
+		}
+		/* delete an event */
+		unregister_trace_uprobe(tp);
+		mutex_unlock(&uprobe_lock);
+		return 0;
+	}
+
+	if (argc < 2) {
+		pr_info("Probe point is not specified.\n");
+		return -EINVAL;
+	}
+	if (isdigit(argv[1][0])) {
+		pr_info("probe point must have a filename.\n");
+		return -EINVAL;
+	}
+	arg = strchr(argv[1], ':');
+	if (!arg)
+		goto fail_address_parse;
+
+	*arg++ = '\0';
+	filename = argv[1];
+	ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+	if (ret)
+		goto fail_address_parse;
+	inode = igrab(path.dentry->d_inode);
+
+	ret = strict_strtoul(arg, 0, &offset);
+	if (ret)
+		goto fail_address_parse;
+	argc -= 2; argv += 2;
+
+	/* setup a probe */
+	if (!event) {
+		char *tail = strrchr(filename, '/');
+		char *ptr;
+
+		ptr = kstrdup((tail ? tail + 1 : filename), GFP_KERNEL);
+		if (!ptr) {
+			ret = -ENOMEM;
+			goto fail_address_parse;
+		}
+
+		tail = ptr;
+		ptr = strpbrk(tail, ".-_");
+		if (ptr)
+			*ptr = '\0';
+
+		snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_0x%lx", 'p', tail,
+				offset);
+		event = buf;
+		kfree(tail);
+	}
+	tp = alloc_trace_uprobe(group, event, argc);
+	if (IS_ERR(tp)) {
+		pr_info("Failed to allocate trace_uprobe.(%d)\n",
+			(int)PTR_ERR(tp));
+		iput(inode);
+		return PTR_ERR(tp);
+	}
+	tp->offset = offset;
+	tp->inode = inode;
+	tp->filename = kstrdup(filename, GFP_KERNEL);
+	if (!tp->filename) {
+		pr_info("Failed to allocate filename.\n");
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	/* parse arguments */
+	ret = 0;
+	for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
+		/* Increment count for freeing args in error case */
+		tp->nr_args++;
+
+		/* Parse argument name */
+		arg = strchr(argv[i], '=');
+		if (arg) {
+			*arg++ = '\0';
+			tp->args[i].name = kstrdup(argv[i], GFP_KERNEL);
+		} else {
+			arg = argv[i];
+			/* If argument name is omitted, set "argN" */
+			snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1);
+			tp->args[i].name = kstrdup(buf, GFP_KERNEL);
+		}
+
+		if (!tp->args[i].name) {
+			pr_info("Failed to allocate argument[%d] name.\n", i);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		if (!is_good_name(tp->args[i].name)) {
+			pr_info("Invalid argument[%d] name: %s\n",
+				i, tp->args[i].name);
+			ret = -EINVAL;
+			goto error;
+		}
+
+		if (traceprobe_conflict_field_name(tp->args[i].name,
+							tp->args, i)) {
+			pr_info("Argument[%d] name '%s' conflicts with "
+				"another field.\n", i, argv[i]);
+			ret = -EINVAL;
+			goto error;
+		}
+
+		/* Parse fetch argument */
+		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+								false, false);
+		if (ret) {
+			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
+			goto error;
+		}
+	}
+
+	ret = register_trace_uprobe(tp);
+	if (ret)
+		goto error;
+	return 0;
+
+error:
+	free_trace_uprobe(tp);
+	return ret;
+
+fail_address_parse:
+	if (inode)
+		iput(inode);
+	pr_info("Failed to parse address.\n");
+	return ret;
+}
+
+static void cleanup_all_probes(void)
+{
+	struct trace_uprobe *tp;
+
+	mutex_lock(&uprobe_lock);
+	while (!list_empty(&uprobe_list)) {
+		tp = list_entry(uprobe_list.next, struct trace_uprobe, list);
+		unregister_trace_uprobe(tp);
+	}
+	mutex_unlock(&uprobe_lock);
+}
+
+
+/* Probes listing interfaces */
+static void *probes_seq_start(struct seq_file *m, loff_t *pos)
+{
+	mutex_lock(&uprobe_lock);
+	return seq_list_start(&uprobe_list, *pos);
+}
+
+static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	return seq_list_next(v, &uprobe_list, pos);
+}
+
+static void probes_seq_stop(struct seq_file *m, void *v)
+{
+	mutex_unlock(&uprobe_lock);
+}
+
+static int probes_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_uprobe *tp = v;
+	int i;
+
+	seq_printf(m, "p:%s/%s", tp->call.class->system, tp->call.name);
+	seq_printf(m, " %s:0x%p", tp->filename, (void *)tp->offset);
+
+	for (i = 0; i < tp->nr_args; i++)
+		seq_printf(m, " %s=%s", tp->args[i].name, tp->args[i].comm);
+	seq_printf(m, "\n");
+	return 0;
+}
+
+static const struct seq_operations probes_seq_op = {
+	.start  = probes_seq_start,
+	.next   = probes_seq_next,
+	.stop   = probes_seq_stop,
+	.show   = probes_seq_show
+};
+
+static int probes_open(struct inode *inode, struct file *file)
+{
+	if ((file->f_mode & FMODE_WRITE) &&
+	    (file->f_flags & O_TRUNC))
+		cleanup_all_probes();
+
+	return seq_open(file, &probes_seq_op);
+}
+
+static ssize_t probes_write(struct file *file, const char __user *buffer,
+			    size_t count, loff_t *ppos)
+{
+	return traceprobe_probes_write(file, buffer, count, ppos,
+			create_trace_uprobe);
+}
+
+static const struct file_operations uprobe_events_ops = {
+	.owner          = THIS_MODULE,
+	.open           = probes_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = seq_release,
+	.write		= probes_write,
+};
+
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_uprobe *tp = v;
+
+	seq_printf(m, "  %s %-44s %15lu\n", tp->filename, tp->call.name,
+								tp->nhit);
+	return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+	.start  = probes_seq_start,
+	.next   = probes_seq_next,
+	.stop   = probes_seq_stop,
+	.show   = probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations uprobe_profile_ops = {
+	.owner          = THIS_MODULE,
+	.open           = profile_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = seq_release,
+};
+
+/* uprobe handler */
+static void uprobe_trace_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+	struct uprobe_trace_entry_head *entry;
+	struct ring_buffer_event *event;
+	struct ring_buffer *buffer;
+	u8 *data;
+	int size, i, pc;
+	unsigned long irq_flags;
+	struct ftrace_event_call *call = &tp->call;
+
+	tp->nhit++;
+
+	local_save_flags(irq_flags);
+	pc = preempt_count();
+
+	size = sizeof(*entry) + tp->size;
+
+	event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
+						  size, irq_flags, pc);
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+	data = (u8 *)&entry[1];
+	for (i = 0; i < tp->nr_args; i++)
+		call_fetch(&tp->args[i].fetch, regs,
+						data + tp->args[i].offset);
+
+	if (!filter_current_check_discard(buffer, call, entry, event))
+		trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
+}
+
+/* Event entry printers */
+enum print_line_t
+print_uprobe_event(struct trace_iterator *iter, int flags,
+		   struct trace_event *event)
+{
+	struct uprobe_trace_entry_head *field;
+	struct trace_seq *s = &iter->seq;
+	struct trace_uprobe *tp;
+	u8 *data;
+	int i;
+
+	field = (struct uprobe_trace_entry_head *)iter->ent;
+	tp = container_of(event, struct trace_uprobe, call.event);
+
+	if (!trace_seq_printf(s, "%s: (", tp->call.name))
+		goto partial;
+
+	if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
+		goto partial;
+
+	if (!trace_seq_puts(s, ")"))
+		goto partial;
+
+	data = (u8 *)&field[1];
+	for (i = 0; i < tp->nr_args; i++)
+		if (!tp->args[i].type->print(s, tp->args[i].name,
+					     data + tp->args[i].offset, field))
+			goto partial;
+
+	if (!trace_seq_puts(s, "\n"))
+		goto partial;
+
+	return TRACE_TYPE_HANDLED;
+partial:
+	return TRACE_TYPE_PARTIAL_LINE;
+}
+
+
+static int probe_event_enable(struct trace_uprobe *tp, int flag)
+{
+	struct uprobe_trace_consumer *utc;
+	int ret = 0;
+
+	if (!tp->inode || tp->consumer)
+		return -EINTR;
+
+	utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+	if (!utc)
+		return -ENOMEM;
+
+	utc->cons.handler = uprobe_dispatcher;
+	utc->cons.filter = NULL;
+	ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+	if (ret) {
+		kfree(utc);
+		return ret;
+	}
+
+	tp->flags |= flag;
+	utc->tp = tp;
+	tp->consumer = utc;
+	return 0;
+}
+
+static void probe_event_disable(struct trace_uprobe *tp, int flag)
+{
+	if (!tp->inode || !tp->consumer)
+		return;
+
+	unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+	tp->flags &= ~flag;
+	kfree(tp->consumer);
+	tp->consumer = NULL;
+}
+
+static int uprobe_event_define_fields(struct ftrace_event_call *event_call)
+{
+	int ret, i;
+	struct uprobe_trace_entry_head field;
+	struct trace_uprobe *tp = (struct trace_uprobe *)event_call->data;
+
+	DEFINE_FIELD(unsigned long, ip, FIELD_STRING_IP, 0);
+	/* Set argument names as fields */
+	for (i = 0; i < tp->nr_args; i++) {
+		ret = trace_define_field(event_call, tp->args[i].type->fmttype,
+					 tp->args[i].name,
+					 sizeof(field) + tp->args[i].offset,
+					 tp->args[i].type->size,
+					 tp->args[i].type->is_signed,
+					 FILTER_OTHER);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int __set_print_fmt(struct trace_uprobe *tp, char *buf, int len)
+{
+	int i;
+	int pos = 0;
+
+	const char *fmt, *arg;
+
+	fmt = "(%lx)";
+	arg = "REC->" FIELD_STRING_IP;
+
+	/* When len=0, we just calculate the needed length */
+#define LEN_OR_ZERO (len ? len - pos : 0)
+
+	pos += snprintf(buf + pos, LEN_OR_ZERO, "\"%s", fmt);
+
+	for (i = 0; i < tp->nr_args; i++) {
+		pos += snprintf(buf + pos, LEN_OR_ZERO, " %s=%s",
+				tp->args[i].name, tp->args[i].type->fmt);
+	}
+
+	pos += snprintf(buf + pos, LEN_OR_ZERO, "\", %s", arg);
+
+	for (i = 0; i < tp->nr_args; i++) {
+		pos += snprintf(buf + pos, LEN_OR_ZERO, ", REC->%s",
+				tp->args[i].name);
+	}
+
+#undef LEN_OR_ZERO
+
+	/* return the length of print_fmt */
+	return pos;
+}
+
+static int set_print_fmt(struct trace_uprobe *tp)
+{
+	int len;
+	char *print_fmt;
+
+	/* First: called with 0 length to calculate the needed length */
+	len = __set_print_fmt(tp, NULL, 0);
+	print_fmt = kmalloc(len + 1, GFP_KERNEL);
+	if (!print_fmt)
+		return -ENOMEM;
+
+	/* Second: actually write the @print_fmt */
+	__set_print_fmt(tp, print_fmt, len + 1);
+	tp->call.print_fmt = print_fmt;
+
+	return 0;
+}
+
+#ifdef CONFIG_PERF_EVENTS
+
+/* uprobe profile handler */
+static void uprobe_perf_func(struct trace_uprobe *tp,
+					 struct pt_regs *regs)
+{
+	struct ftrace_event_call *call = &tp->call;
+	struct uprobe_trace_entry_head *entry;
+	struct hlist_head *head;
+	u8 *data;
+	int size, __size, i;
+	int rctx;
+
+	__size = sizeof(*entry) + tp->size;
+	size = ALIGN(__size + sizeof(u32), sizeof(u64));
+	size -= sizeof(u32);
+	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
+		     "profile buffer not large enough"))
+		return;
+
+	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+	if (!entry)
+		return;
+
+	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+	data = (u8 *)&entry[1];
+	for (i = 0; i < tp->nr_args; i++)
+		call_fetch(&tp->args[i].fetch, regs,
+						data + tp->args[i].offset);
+
+	head = this_cpu_ptr(call->perf_events);
+	perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+}
+#endif	/* CONFIG_PERF_EVENTS */
+
+static
+int uprobe_register(struct ftrace_event_call *event, enum trace_reg type)
+{
+	switch (type) {
+	case TRACE_REG_REGISTER:
+		return probe_event_enable(event->data, TP_FLAG_TRACE);
+	case TRACE_REG_UNREGISTER:
+		probe_event_disable(event->data, TP_FLAG_TRACE);
+		return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+	case TRACE_REG_PERF_REGISTER:
+		return probe_event_enable(event->data, TP_FLAG_PROFILE);
+	case TRACE_REG_PERF_UNREGISTER:
+		probe_event_disable(event->data, TP_FLAG_PROFILE);
+		return 0;
+#endif
+	}
+	return 0;
+}
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
+{
+	struct uprobe_trace_consumer *utc;
+	struct trace_uprobe *tp;
+
+	utc = container_of(con, struct uprobe_trace_consumer, cons);
+	tp = utc->tp;
+	if (!tp || tp->consumer != utc)
+		return 0;
+
+	if (tp->flags & TP_FLAG_TRACE)
+		uprobe_trace_func(tp, regs);
+#ifdef CONFIG_PERF_EVENTS
+	if (tp->flags & TP_FLAG_PROFILE)
+		uprobe_perf_func(tp, regs);
+#endif
+	return 0;
+}
+
+
+static struct trace_event_functions uprobe_funcs = {
+	.trace		= print_uprobe_event
+};
+
+static int register_uprobe_event(struct trace_uprobe *tp)
+{
+	struct ftrace_event_call *call = &tp->call;
+	int ret;
+
+	/* Initialize ftrace_event_call */
+	INIT_LIST_HEAD(&call->class->fields);
+	call->event.funcs = &uprobe_funcs;
+	call->class->define_fields = uprobe_event_define_fields;
+	if (set_print_fmt(tp) < 0)
+		return -ENOMEM;
+	ret = register_ftrace_event(&call->event);
+	if (!ret) {
+		kfree(call->print_fmt);
+		return -ENODEV;
+	}
+	call->flags = 0;
+	call->class->reg = uprobe_register;
+	call->data = tp;
+	ret = trace_add_event_call(call);
+	if (ret) {
+		pr_info("Failed to register uprobe event: %s\n", call->name);
+		kfree(call->print_fmt);
+		unregister_ftrace_event(&call->event);
+	}
+	return ret;
+}
+
+static void unregister_uprobe_event(struct trace_uprobe *tp)
+{
+	/* tp->event is unregistered in trace_remove_event_call() */
+	trace_remove_event_call(&tp->call);
+	kfree(tp->call.print_fmt);
+	tp->call.print_fmt = NULL;
+}
+
+/* Make a trace interface for controlling probe points */
+static __init int init_uprobe_trace(void)
+{
+	struct dentry *d_tracer;
+
+	d_tracer = tracing_init_dentry();
+	if (!d_tracer)
+		return 0;
+
+	trace_create_file("uprobe_events", 0644, d_tracer,
+			  NULL, &uprobe_events_ops);
+	/* Profile interface */
+	trace_create_file("uprobe_profile", 0444, d_tracer,
+			  NULL, &uprobe_profile_ops);
+	return 0;
+}
+fs_initcall(init_uprobe_trace);


^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 20/26]   tracing: uprobes trace_event interface
@ 2011-09-20 12:04   ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML


Implements trace_event support for uprobes. In its current form it can
be used to put probes at a specified offset in a file and dump the
required registers when the code flow reaches the probed address.

The following example shows how to dump the instruction pointer and the
%ax register when the probed text address is hit. Here we probe
zfree in /bin/zsh:

# cd /sys/kernel/debug/tracing/
# cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
# objdump -T /bin/zsh | grep -w zfree
0000000000446420 g    DF .text  0000000000000012  Base        zfree
# echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
# cat uprobe_events
p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
# echo 1 > events/uprobes/enable
# sleep 20
# echo 0 > events/uprobes/enable
# cat trace
# tracer: nop
#
#           TASK-PID    CPU#    TIMESTAMP  FUNCTION
#              | |       |          |         |
             zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
             zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
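The 0x46420 offset written to uprobe_events above is the symbol's virtual
address (from objdump) minus the start of the executable r-xp mapping (from
/proc/PID/maps); a quick sketch of that arithmetic, with the addresses copied
from the session above:

```shell
# file offset = symbol vaddr - start of the r-xp mapping of /bin/zsh
printf 'offset: 0x%x\n' $((0x446420 - 0x400000))
```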

TODO: Connect a filter to a consumer.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/Kconfig                |    8 
 kernel/trace/Kconfig        |   16 +
 kernel/trace/Makefile       |    1 
 kernel/trace/trace.h        |    5 
 kernel/trace/trace_kprobe.c |    4 
 kernel/trace/trace_probe.c  |   14 +
 kernel/trace/trace_probe.h  |    6 
 kernel/trace/trace_uprobe.c |  770 +++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 809 insertions(+), 15 deletions(-)
 create mode 100644 kernel/trace/trace_uprobe.c

diff --git a/arch/Kconfig b/arch/Kconfig
index d6a4e1d..53ce702 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -62,14 +62,8 @@ config OPTPROBES
 	depends on !PREEMPT
 
 config UPROBES
-	bool "User-space probes (EXPERIMENTAL)"
 	select MM_OWNER
-	help
-	  Uprobes enables kernel subsystems to establish probepoints
-	  in user applications and execute handler functions when
-	  the probepoints are hit.
-
-	  If in doubt, say "N".
+	def_bool n
 
 config HAVE_EFFICIENT_UNALIGNED_ACCESS
 	bool
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 520106a..b001fb1 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -386,6 +386,22 @@ config KPROBE_EVENT
 	  This option is also required by perf-probe subcommand of perf tools.
 	  If you want to use perf tools, this option is strongly recommended.
 
+config UPROBE_EVENT
+	bool "Enable uprobes-based dynamic events"
+	depends on ARCH_SUPPORTS_UPROBES
+	depends on MMU
+	select UPROBES
+	select PROBE_EVENTS
+	select TRACING
+	default n
+	help
+	  This allows the user to add tracing events on top of userspace
+	  dynamic events (similar to tracepoints) on the fly via the trace
+	  events interface. Those events can be inserted wherever uprobes
+	  can probe, and record various registers.
+	  This option is required if you plan to use the perf-probe
+	  subcommand of perf tools on user space applications.
+
 config PROBE_EVENTS
 	def_bool n
 
diff --git a/kernel/trace/Makefile b/kernel/trace/Makefile
index 692223a..bb3d3ff 100644
--- a/kernel/trace/Makefile
+++ b/kernel/trace/Makefile
@@ -57,5 +57,6 @@ ifeq ($(CONFIG_TRACING),y)
 obj-$(CONFIG_KGDB_KDB) += trace_kdb.o
 endif
 obj-$(CONFIG_PROBE_EVENTS) +=trace_probe.o
+obj-$(CONFIG_UPROBE_EVENT) += trace_uprobe.o
 
 libftrace-y := ftrace.o
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index 616846b..c9b737c 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -97,6 +97,11 @@ struct kretprobe_trace_entry_head {
 	unsigned long		ret_ip;
 };
 
+struct uprobe_trace_entry_head {
+	struct trace_entry	ent;
+	unsigned long		ip;
+};
+
 /*
  * trace_flag_type is an enumeration that holds different
  * states when a trace occurs. These are:
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d5f4e51..b156d8f 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -514,8 +514,8 @@ static int create_trace_probe(int argc, char **argv)
 		}
 
 		/* Parse fetch argument */
-		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
-								is_return);
+		ret = traceprobe_parse_probe_arg(arg, &tp->size,
+					&tp->args[i], is_return, true);
 		if (ret) {
 			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
 			goto error;
diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 52580b5..d8f71ef 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -528,13 +528,17 @@ static int parse_probe_vars(char *arg, const struct fetch_type *t,
 
 /* Recursive argument parser */
 static int parse_probe_arg(char *arg, const struct fetch_type *t,
-		     struct fetch_param *f, bool is_return)
+		     struct fetch_param *f, bool is_return, bool is_kprobe)
 {
 	int ret = 0;
 	unsigned long param;
 	long offset;
 	char *tmp;
 
+	/* For now, uprobe_events supports only register arguments */
+	if (!is_kprobe && arg[0] != '%')
+		return -EINVAL;
+
 	switch (arg[0]) {
 	case '$':
 		ret = parse_probe_vars(arg + 1, t, f, is_return);
@@ -584,7 +588,8 @@ static int parse_probe_arg(char *arg, const struct fetch_type *t,
 			if (!dprm)
 				return -ENOMEM;
 			dprm->offset = offset;
-			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return);
+			ret = parse_probe_arg(arg, t2, &dprm->orig, is_return,
+							is_kprobe);
 			if (ret)
 				kfree(dprm);
 			else {
@@ -638,7 +643,7 @@ static int __parse_bitfield_probe_arg(const char *bf,
 
 /* String length checking wrapper */
 int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
-		struct probe_arg *parg, bool is_return)
+		struct probe_arg *parg, bool is_return, bool is_kprobe)
 {
 	const char *t;
 	int ret;
@@ -664,7 +669,8 @@ int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
 	}
 	parg->offset = *size;
 	*size += parg->type->size;
-	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return);
+	ret = parse_probe_arg(arg, parg->type, &parg->fetch, is_return,
+							is_kprobe);
 	if (ret >= 0 && t != NULL)
 		ret = __parse_bitfield_probe_arg(t, parg->type, &parg->fetch);
 	if (ret >= 0) {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 500a08f..0cab89a 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -48,6 +48,7 @@
 #define FIELD_STRING_IP "__probe_ip"
 #define FIELD_STRING_RETIP "__probe_ret_ip"
 #define FIELD_STRING_FUNC "__probe_func"
+#define FIELD_STRING_PID "__probe_pid"
 
 #undef DEFINE_FIELD
 #define DEFINE_FIELD(type, item, name, is_signed)			\
@@ -65,6 +66,7 @@
 #define TP_FLAG_TRACE	1
 #define TP_FLAG_PROFILE	2
 #define TP_FLAG_REGISTERED 4
+#define TP_FLAG_UPROBE	8
 
 
 /* data_rloc: data relative location, compatible with u32 */
@@ -131,7 +133,7 @@ static inline __kprobes void call_fetch(struct fetch_param *fprm,
 }
 
 /* Check the name is good for event/group/fields */
-static int is_good_name(const char *name)
+static inline int is_good_name(const char *name)
 {
 	if (!isalpha(*name) && *name != '_')
 		return 0;
@@ -143,7 +145,7 @@ static int is_good_name(const char *name)
 }
 
 extern int traceprobe_parse_probe_arg(char *arg, ssize_t *size,
-		   struct probe_arg *parg, bool is_return);
+		   struct probe_arg *parg, bool is_return, bool is_kprobe);
 
 extern int traceprobe_conflict_field_name(const char *name,
 			       struct probe_arg *args, int narg);
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
new file mode 100644
index 0000000..d274207
--- /dev/null
+++ b/kernel/trace/trace_uprobe.c
@@ -0,0 +1,770 @@
+/*
+ * uprobes-based tracing events
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ * Copyright (C) IBM Corporation, 2010
+ * Author:	Srikar Dronamraju
+ */
+
+#include <linux/module.h>
+#include <linux/uaccess.h>
+#include <linux/uprobes.h>
+#include <linux/namei.h>
+
+#include "trace_probe.h"
+
+#define UPROBE_EVENT_SYSTEM "uprobes"
+
+/**
+ * uprobe event core functions
+ */
+struct trace_uprobe;
+struct uprobe_trace_consumer {
+	struct uprobe_consumer cons;
+	struct trace_uprobe *tp;
+};
+
+struct trace_uprobe {
+	struct list_head	list;
+	struct ftrace_event_class	class;
+	struct ftrace_event_call	call;
+	struct uprobe_trace_consumer	*consumer;
+	struct inode		*inode;
+	char			*filename;
+	unsigned long		offset;
+	unsigned long		nhit;
+	unsigned int		flags;	/* For TP_FLAG_* */
+	ssize_t			size;		/* trace entry size */
+	unsigned int		nr_args;
+	struct probe_arg	args[];
+};
+
+#define SIZEOF_TRACE_UPROBE(n)			\
+	(offsetof(struct trace_uprobe, args) +	\
+	(sizeof(struct probe_arg) * (n)))
+
+static int register_uprobe_event(struct trace_uprobe *tp);
+static void unregister_uprobe_event(struct trace_uprobe *tp);
+
+static DEFINE_MUTEX(uprobe_lock);
+static LIST_HEAD(uprobe_list);
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs);
+
+/*
+ * Allocate new trace_uprobe and initialize it (including uprobes).
+ */
+static struct trace_uprobe *alloc_trace_uprobe(const char *group,
+				const char *event, int nargs)
+{
+	struct trace_uprobe *tp;
+
+	if (!event || !is_good_name(event))
+		return ERR_PTR(-EINVAL);
+
+	if (!group || !is_good_name(group))
+		return ERR_PTR(-EINVAL);
+
+	tp = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
+	if (!tp)
+		return ERR_PTR(-ENOMEM);
+
+	tp->call.class = &tp->class;
+	tp->call.name = kstrdup(event, GFP_KERNEL);
+	if (!tp->call.name)
+		goto error;
+
+	tp->class.system = kstrdup(group, GFP_KERNEL);
+	if (!tp->class.system)
+		goto error;
+
+	INIT_LIST_HEAD(&tp->list);
+	return tp;
+error:
+	kfree(tp->call.name);
+	kfree(tp);
+	return ERR_PTR(-ENOMEM);
+}
+
+static void free_trace_uprobe(struct trace_uprobe *tp)
+{
+	int i;
+
+	for (i = 0; i < tp->nr_args; i++)
+		traceprobe_free_probe_arg(&tp->args[i]);
+
+	iput(tp->inode);
+	kfree(tp->call.class->system);
+	kfree(tp->call.name);
+	kfree(tp->filename);
+	kfree(tp);
+}
+
+static struct trace_uprobe *find_probe_event(const char *event,
+					const char *group)
+{
+	struct trace_uprobe *tp;
+
+	list_for_each_entry(tp, &uprobe_list, list)
+		if (strcmp(tp->call.name, event) == 0 &&
+		    strcmp(tp->call.class->system, group) == 0)
+			return tp;
+	return NULL;
+}
+
+/* Unregister a trace_uprobe and probe_event: call with locking uprobe_lock */
+static void unregister_trace_uprobe(struct trace_uprobe *tp)
+{
+	list_del(&tp->list);
+	unregister_uprobe_event(tp);
+	free_trace_uprobe(tp);
+}
+
+/* Register a trace_uprobe and probe_event */
+static int register_trace_uprobe(struct trace_uprobe *tp)
+{
+	struct trace_uprobe *old_tp;
+	int ret;
+
+	mutex_lock(&uprobe_lock);
+
+	/* register as an event */
+	old_tp = find_probe_event(tp->call.name, tp->call.class->system);
+	if (old_tp)
+		/* delete old event */
+		unregister_trace_uprobe(old_tp);
+
+	ret = register_uprobe_event(tp);
+	if (ret) {
+		pr_warning("Failed to register probe event(%d)\n", ret);
+		goto end;
+	}
+
+	list_add_tail(&tp->list, &uprobe_list);
+end:
+	mutex_unlock(&uprobe_lock);
+	return ret;
+}
+
+static int create_trace_uprobe(int argc, char **argv)
+{
+	/*
+	 * Argument syntax:
+	 *  - Add uprobe: p[:[GRP/]EVENT] FILE:OFFSET [%REG]
+	 *
+	 *  - Remove uprobe: -:[GRP/]EVENT
+	 */
+	struct path path;
+	struct inode *inode = NULL;
+	struct trace_uprobe *tp;
+	int i, ret = 0;
+	int is_delete = 0;
+	char *arg = NULL, *event = NULL, *group = NULL;
+	unsigned long offset;
+	char buf[MAX_EVENT_NAME_LEN];
+	char *filename;
+
+	/* argc must be >= 1 */
+	if (argv[0][0] == '-')
+		is_delete = 1;
+	else if (argv[0][0] != 'p') {
+		pr_info("Probe definition must start with 'p' or '-'.\n");
+		return -EINVAL;
+	}
+
+	if (argv[0][1] == ':') {
+		event = &argv[0][2];
+		if (strchr(event, '/')) {
+			group = event;
+			event = strchr(group, '/') + 1;
+			event[-1] = '\0';
+			if (strlen(group) == 0) {
+				pr_info("Group name is not specified\n");
+				return -EINVAL;
+			}
+		}
+		if (strlen(event) == 0) {
+			pr_info("Event name is not specified\n");
+			return -EINVAL;
+		}
+	}
+	if (!group)
+		group = UPROBE_EVENT_SYSTEM;
+
+	if (is_delete) {
+		if (!event) {
+			pr_info("Delete command needs an event name.\n");
+			return -EINVAL;
+		}
+		mutex_lock(&uprobe_lock);
+		tp = find_probe_event(event, group);
+		if (!tp) {
+			mutex_unlock(&uprobe_lock);
+			pr_info("Event %s/%s doesn't exist.\n", group, event);
+			return -ENOENT;
+		}
+		/* delete an event */
+		unregister_trace_uprobe(tp);
+		mutex_unlock(&uprobe_lock);
+		return 0;
+	}
+
+	if (argc < 2) {
+		pr_info("Probe point is not specified.\n");
+		return -EINVAL;
+	}
+	if (isdigit(argv[1][0])) {
+		pr_info("probe point must have a filename.\n");
+		return -EINVAL;
+	}
+	arg = strchr(argv[1], ':');
+	if (!arg)
+		goto fail_address_parse;
+
+	*arg++ = '\0';
+	filename = argv[1];
+	ret = kern_path(filename, LOOKUP_FOLLOW, &path);
+	if (ret)
+		goto fail_address_parse;
+	inode = igrab(path.dentry->d_inode);
+	path_put(&path);
+
+	ret = strict_strtoul(arg, 0, &offset);
+	if (ret)
+		goto fail_address_parse;
+	argc -= 2; argv += 2;
+
+	/* setup a probe */
+	if (!event) {
+		char *tail = strrchr(filename, '/');
+		char *ptr;
+
+		ptr = kstrdup((tail ? tail + 1 : filename), GFP_KERNEL);
+		if (!ptr) {
+			ret = -ENOMEM;
+			goto fail_address_parse;
+		}
+
+		tail = ptr;
+		ptr = strpbrk(tail, ".-_");
+		if (ptr)
+			*ptr = '\0';
+
+		snprintf(buf, MAX_EVENT_NAME_LEN, "%c_%s_0x%lx", 'p', tail,
+				offset);
+		event = buf;
+		kfree(tail);
+	}
+	tp = alloc_trace_uprobe(group, event, argc);
+	if (IS_ERR(tp)) {
+		pr_info("Failed to allocate trace_uprobe.(%d)\n",
+			(int)PTR_ERR(tp));
+		iput(inode);
+		return PTR_ERR(tp);
+	}
+	tp->offset = offset;
+	tp->inode = inode;
+	tp->filename = kstrdup(filename, GFP_KERNEL);
+	if (!tp->filename) {
+		pr_info("Failed to allocate filename.\n");
+		ret = -ENOMEM;
+		goto error;
+	}
+
+	/* parse arguments */
+	ret = 0;
+	for (i = 0; i < argc && i < MAX_TRACE_ARGS; i++) {
+		/* Increment count for freeing args in error case */
+		tp->nr_args++;
+
+		/* Parse argument name */
+		arg = strchr(argv[i], '=');
+		if (arg) {
+			*arg++ = '\0';
+			tp->args[i].name = kstrdup(argv[i], GFP_KERNEL);
+		} else {
+			arg = argv[i];
+			/* If argument name is omitted, set "argN" */
+			snprintf(buf, MAX_EVENT_NAME_LEN, "arg%d", i + 1);
+			tp->args[i].name = kstrdup(buf, GFP_KERNEL);
+		}
+
+		if (!tp->args[i].name) {
+			pr_info("Failed to allocate argument[%d] name.\n", i);
+			ret = -ENOMEM;
+			goto error;
+		}
+
+		if (!is_good_name(tp->args[i].name)) {
+			pr_info("Invalid argument[%d] name: %s\n",
+				i, tp->args[i].name);
+			ret = -EINVAL;
+			goto error;
+		}
+
+		if (traceprobe_conflict_field_name(tp->args[i].name,
+							tp->args, i)) {
+			pr_info("Argument[%d] name '%s' conflicts with "
+				"another field.\n", i, argv[i]);
+			ret = -EINVAL;
+			goto error;
+		}
+
+		/* Parse fetch argument */
+		ret = traceprobe_parse_probe_arg(arg, &tp->size, &tp->args[i],
+								false, false);
+		if (ret) {
+			pr_info("Parse error at argument[%d]. (%d)\n", i, ret);
+			goto error;
+		}
+	}
+
+	ret = register_trace_uprobe(tp);
+	if (ret)
+		goto error;
+	return 0;
+
+error:
+	free_trace_uprobe(tp);
+	return ret;
+
+fail_address_parse:
+	if (inode)
+		iput(inode);
+	pr_info("Failed to parse address.\n");
+	return ret;
+}
+
+static void cleanup_all_probes(void)
+{
+	struct trace_uprobe *tp;
+
+	mutex_lock(&uprobe_lock);
+	while (!list_empty(&uprobe_list)) {
+		tp = list_entry(uprobe_list.next, struct trace_uprobe, list);
+		unregister_trace_uprobe(tp);
+	}
+	mutex_unlock(&uprobe_lock);
+}
+
+
+/* Probes listing interfaces */
+static void *probes_seq_start(struct seq_file *m, loff_t *pos)
+{
+	mutex_lock(&uprobe_lock);
+	return seq_list_start(&uprobe_list, *pos);
+}
+
+static void *probes_seq_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	return seq_list_next(v, &uprobe_list, pos);
+}
+
+static void probes_seq_stop(struct seq_file *m, void *v)
+{
+	mutex_unlock(&uprobe_lock);
+}
+
+static int probes_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_uprobe *tp = v;
+	int i;
+
+	seq_printf(m, "p:%s/%s", tp->call.class->system, tp->call.name);
+	seq_printf(m, " %s:0x%p", tp->filename, (void *)tp->offset);
+
+	for (i = 0; i < tp->nr_args; i++)
+		seq_printf(m, " %s=%s", tp->args[i].name, tp->args[i].comm);
+	seq_printf(m, "\n");
+	return 0;
+}
+
+static const struct seq_operations probes_seq_op = {
+	.start  = probes_seq_start,
+	.next   = probes_seq_next,
+	.stop   = probes_seq_stop,
+	.show   = probes_seq_show
+};
+
+static int probes_open(struct inode *inode, struct file *file)
+{
+	if ((file->f_mode & FMODE_WRITE) &&
+	    (file->f_flags & O_TRUNC))
+		cleanup_all_probes();
+
+	return seq_open(file, &probes_seq_op);
+}
+
+static ssize_t probes_write(struct file *file, const char __user *buffer,
+			    size_t count, loff_t *ppos)
+{
+	return traceprobe_probes_write(file, buffer, count, ppos,
+			create_trace_uprobe);
+}
+
+static const struct file_operations uprobe_events_ops = {
+	.owner          = THIS_MODULE,
+	.open           = probes_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = seq_release,
+	.write		= probes_write,
+};
+
+/* Probes profiling interfaces */
+static int probes_profile_seq_show(struct seq_file *m, void *v)
+{
+	struct trace_uprobe *tp = v;
+
+	seq_printf(m, "  %s %-44s %15lu\n", tp->filename, tp->call.name,
+								tp->nhit);
+	return 0;
+}
+
+static const struct seq_operations profile_seq_op = {
+	.start  = probes_seq_start,
+	.next   = probes_seq_next,
+	.stop   = probes_seq_stop,
+	.show   = probes_profile_seq_show
+};
+
+static int profile_open(struct inode *inode, struct file *file)
+{
+	return seq_open(file, &profile_seq_op);
+}
+
+static const struct file_operations uprobe_profile_ops = {
+	.owner          = THIS_MODULE,
+	.open           = profile_open,
+	.read           = seq_read,
+	.llseek         = seq_lseek,
+	.release        = seq_release,
+};
+
+/* uprobe handler */
+static void uprobe_trace_func(struct trace_uprobe *tp, struct pt_regs *regs)
+{
+	struct uprobe_trace_entry_head *entry;
+	struct ring_buffer_event *event;
+	struct ring_buffer *buffer;
+	u8 *data;
+	int size, i, pc;
+	unsigned long irq_flags;
+	struct ftrace_event_call *call = &tp->call;
+
+	tp->nhit++;
+
+	local_save_flags(irq_flags);
+	pc = preempt_count();
+
+	size = sizeof(*entry) + tp->size;
+
+	event = trace_current_buffer_lock_reserve(&buffer, call->event.type,
+						  size, irq_flags, pc);
+	if (!event)
+		return;
+
+	entry = ring_buffer_event_data(event);
+	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+	data = (u8 *)&entry[1];
+	for (i = 0; i < tp->nr_args; i++)
+		call_fetch(&tp->args[i].fetch, regs,
+						data + tp->args[i].offset);
+
+	if (!filter_current_check_discard(buffer, call, entry, event))
+		trace_buffer_unlock_commit(buffer, event, irq_flags, pc);
+}
+
+/* Event entry printers */
+enum print_line_t
+print_uprobe_event(struct trace_iterator *iter, int flags,
+		   struct trace_event *event)
+{
+	struct uprobe_trace_entry_head *field;
+	struct trace_seq *s = &iter->seq;
+	struct trace_uprobe *tp;
+	u8 *data;
+	int i;
+
+	field = (struct uprobe_trace_entry_head *)iter->ent;
+	tp = container_of(event, struct trace_uprobe, call.event);
+
+	if (!trace_seq_printf(s, "%s: (", tp->call.name))
+		goto partial;
+
+	if (!seq_print_ip_sym(s, field->ip, flags | TRACE_ITER_SYM_OFFSET))
+		goto partial;
+
+	if (!trace_seq_puts(s, ")"))
+		goto partial;
+
+	data = (u8 *)&field[1];
+	for (i = 0; i < tp->nr_args; i++)
+		if (!tp->args[i].type->print(s, tp->args[i].name,
+					     data + tp->args[i].offset, field))
+			goto partial;
+
+	if (!trace_seq_puts(s, "\n"))
+		goto partial;
+
+	return TRACE_TYPE_HANDLED;
+partial:
+	return TRACE_TYPE_PARTIAL_LINE;
+}
+
+
+static int probe_event_enable(struct trace_uprobe *tp, int flag)
+{
+	struct uprobe_trace_consumer *utc;
+	int ret = 0;
+
+	if (!tp->inode || tp->consumer)
+		return -EINTR;
+
+	utc = kzalloc(sizeof(struct uprobe_trace_consumer), GFP_KERNEL);
+	if (!utc)
+		return -ENOMEM;
+
+	utc->cons.handler = uprobe_dispatcher;
+	utc->cons.filter = NULL;
+	ret = register_uprobe(tp->inode, tp->offset, &utc->cons);
+	if (ret) {
+		kfree(utc);
+		return ret;
+	}
+
+	tp->flags |= flag;
+	utc->tp = tp;
+	tp->consumer = utc;
+	return 0;
+}
+
+static void probe_event_disable(struct trace_uprobe *tp, int flag)
+{
+	if (!tp->inode || !tp->consumer)
+		return;
+
+	unregister_uprobe(tp->inode, tp->offset, &tp->consumer->cons);
+	tp->flags &= ~flag;
+	kfree(tp->consumer);
+	tp->consumer = NULL;
+}
+
+static int uprobe_event_define_fields(struct ftrace_event_call *event_call)
+{
+	int ret, i;
+	struct uprobe_trace_entry_head field;
+	struct trace_uprobe *tp = (struct trace_uprobe *)event_call->data;
+
+	DEFINE_FIELD(unsigned long, ip, FIELD_STRING_IP, 0);
+	/* Set argument names as fields */
+	for (i = 0; i < tp->nr_args; i++) {
+		ret = trace_define_field(event_call, tp->args[i].type->fmttype,
+					 tp->args[i].name,
+					 sizeof(field) + tp->args[i].offset,
+					 tp->args[i].type->size,
+					 tp->args[i].type->is_signed,
+					 FILTER_OTHER);
+		if (ret)
+			return ret;
+	}
+	return 0;
+}
+
+static int __set_print_fmt(struct trace_uprobe *tp, char *buf, int len)
+{
+	int i;
+	int pos = 0;
+
+	const char *fmt, *arg;
+
+	fmt = "(%lx)";
+	arg = "REC->" FIELD_STRING_IP;
+
+	/* When len=0, we just calculate the needed length */
+#define LEN_OR_ZERO (len ? len - pos : 0)
+
+	pos += snprintf(buf + pos, LEN_OR_ZERO, "\"%s", fmt);
+
+	for (i = 0; i < tp->nr_args; i++) {
+		pos += snprintf(buf + pos, LEN_OR_ZERO, " %s=%s",
+				tp->args[i].name, tp->args[i].type->fmt);
+	}
+
+	pos += snprintf(buf + pos, LEN_OR_ZERO, "\", %s", arg);
+
+	for (i = 0; i < tp->nr_args; i++) {
+		pos += snprintf(buf + pos, LEN_OR_ZERO, ", REC->%s",
+				tp->args[i].name);
+	}
+
+#undef LEN_OR_ZERO
+
+	/* return the length of print_fmt */
+	return pos;
+}
+
+static int set_print_fmt(struct trace_uprobe *tp)
+{
+	int len;
+	char *print_fmt;
+
+	/* First: called with 0 length to calculate the needed length */
+	len = __set_print_fmt(tp, NULL, 0);
+	print_fmt = kmalloc(len + 1, GFP_KERNEL);
+	if (!print_fmt)
+		return -ENOMEM;
+
+	/* Second: actually write the @print_fmt */
+	__set_print_fmt(tp, print_fmt, len + 1);
+	tp->call.print_fmt = print_fmt;
+
+	return 0;
+}
+
+#ifdef CONFIG_PERF_EVENTS
+
+/* uprobe profile handler */
+static void uprobe_perf_func(struct trace_uprobe *tp,
+					 struct pt_regs *regs)
+{
+	struct ftrace_event_call *call = &tp->call;
+	struct uprobe_trace_entry_head *entry;
+	struct hlist_head *head;
+	u8 *data;
+	int size, __size, i;
+	int rctx;
+
+	__size = sizeof(*entry) + tp->size;
+	size = ALIGN(__size + sizeof(u32), sizeof(u64));
+	size -= sizeof(u32);
+	if (WARN_ONCE(size > PERF_MAX_TRACE_SIZE,
+		     "profile buffer not large enough"))
+		return;
+
+	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+	if (!entry)
+		return;
+
+	entry->ip = get_uprobe_bkpt_addr(task_pt_regs(current));
+	data = (u8 *)&entry[1];
+	for (i = 0; i < tp->nr_args; i++)
+		call_fetch(&tp->args[i].fetch, regs,
+						data + tp->args[i].offset);
+
+	head = this_cpu_ptr(call->perf_events);
+	perf_trace_buf_submit(entry, size, rctx, entry->ip, 1, regs, head);
+}
+#endif	/* CONFIG_PERF_EVENTS */
+
+static
+int uprobe_register(struct ftrace_event_call *event, enum trace_reg type)
+{
+	switch (type) {
+	case TRACE_REG_REGISTER:
+		return probe_event_enable(event->data, TP_FLAG_TRACE);
+	case TRACE_REG_UNREGISTER:
+		probe_event_disable(event->data, TP_FLAG_TRACE);
+		return 0;
+
+#ifdef CONFIG_PERF_EVENTS
+	case TRACE_REG_PERF_REGISTER:
+		return probe_event_enable(event->data, TP_FLAG_PROFILE);
+	case TRACE_REG_PERF_UNREGISTER:
+		probe_event_disable(event->data, TP_FLAG_PROFILE);
+		return 0;
+#endif
+	}
+	return 0;
+}
+
+static int uprobe_dispatcher(struct uprobe_consumer *con, struct pt_regs *regs)
+{
+	struct uprobe_trace_consumer *utc;
+	struct trace_uprobe *tp;
+
+	utc = container_of(con, struct uprobe_trace_consumer, cons);
+	tp = utc->tp;
+	if (!tp || tp->consumer != utc)
+		return 0;
+
+	if (tp->flags & TP_FLAG_TRACE)
+		uprobe_trace_func(tp, regs);
+#ifdef CONFIG_PERF_EVENTS
+	if (tp->flags & TP_FLAG_PROFILE)
+		uprobe_perf_func(tp, regs);
+#endif
+	return 0;
+}
+
+
+static struct trace_event_functions uprobe_funcs = {
+	.trace		= print_uprobe_event
+};
+
+static int register_uprobe_event(struct trace_uprobe *tp)
+{
+	struct ftrace_event_call *call = &tp->call;
+	int ret;
+
+	/* Initialize ftrace_event_call */
+	INIT_LIST_HEAD(&call->class->fields);
+	call->event.funcs = &uprobe_funcs;
+	call->class->define_fields = uprobe_event_define_fields;
+	if (set_print_fmt(tp) < 0)
+		return -ENOMEM;
+	ret = register_ftrace_event(&call->event);
+	if (!ret) {
+		kfree(call->print_fmt);
+		return -ENODEV;
+	}
+	call->flags = 0;
+	call->class->reg = uprobe_register;
+	call->data = tp;
+	ret = trace_add_event_call(call);
+	if (ret) {
+		pr_info("Failed to register uprobe event: %s\n", call->name);
+		kfree(call->print_fmt);
+		unregister_ftrace_event(&call->event);
+	}
+	return ret;
+}
+
+static void unregister_uprobe_event(struct trace_uprobe *tp)
+{
+	/* tp->event is unregistered in trace_remove_event_call() */
+	trace_remove_event_call(&tp->call);
+	kfree(tp->call.print_fmt);
+	tp->call.print_fmt = NULL;
+}
+
+/* Make a trace interface for controlling probe points */
+static __init int init_uprobe_trace(void)
+{
+	struct dentry *d_tracer;
+	struct dentry *entry;
+
+	d_tracer = tracing_init_dentry();
+	if (!d_tracer)
+		return 0;
+
+	entry = trace_create_file("uprobe_events", 0644, d_tracer,
+				    NULL, &uprobe_events_ops);
+	/* Profile interface */
+	entry = trace_create_file("uprobe_profile", 0444, d_tracer,
+				    NULL, &uprobe_profile_ops);
+	return 0;
+}
+fs_initcall(init_uprobe_trace);

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* [PATCH v5 3.1.0-rc4-tip 21/26]   tracing: uprobes Documentation
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:04   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton


Documents how to trace user space applications using uprobe tracer.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 Documentation/trace/uprobetracer.txt |   94 ++++++++++++++++++++++++++++++++++
 1 files changed, 94 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/trace/uprobetracer.txt

diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt
new file mode 100644
index 0000000..6c18ffe
--- /dev/null
+++ b/Documentation/trace/uprobetracer.txt
@@ -0,0 +1,94 @@
+		Uprobe-tracer: Uprobe-based Event Tracing
+		=========================================
+                 Documentation written by Srikar Dronamraju
+
+Overview
+--------
+These events are similar to kprobe based events.
+To enable this feature, build your kernel with CONFIG_UPROBE_EVENTS=y.
+
+Similar to the kprobe-event tracer, this doesn't need to be activated via
+current_tracer. Instead, add probe points via
+/sys/kernel/debug/tracing/uprobe_events, and enable them via
+/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled.
+
+
+Synopsis of uprobe_tracer
+-------------------------
+  p[:[GRP/]EVENT] PATH:SYMBOL[+offs] [FETCHARGS]	: Set a probe
+
+ GRP		: Group name. If omitted, use "uprobes" for it.
+ EVENT		: Event name. If omitted, the event name is generated
+		  based on SYMBOL+offs.
+ PATH		: path to an executable or a library.
+ SYMBOL[+offs]	: Symbol+offset where the probe is inserted.
+
+ FETCHARGS	: Arguments. Each probe can have up to 128 args.
+  %REG		: Fetch register REG
+
+Event Profiling
+---------------
+ You can check the total number of probe hits and probe miss-hits via
+/sys/kernel/debug/tracing/uprobe_profile.
+ The first column is event name, the second is the number of probe hits,
+the third is the number of probe miss-hits.
+
+Usage examples
+--------------
+To add a probe as a new event, write a new definition to uprobe_events
+as below.
+
+  echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events
+
+ This sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash
+
+
+  echo > /sys/kernel/debug/tracing/uprobe_events
+
+ This clears all probe points.
+
+The following example shows how to dump the instruction pointer and the
+%ax register at the probed text address.  Here we are trying to probe
+the function zfree in /bin/zsh:
+
+    # cd /sys/kernel/debug/tracing/
+    # cat /proc/`pgrep  zsh`/maps | grep /bin/zsh | grep r-xp
+    00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh
+    # objdump -T /bin/zsh | grep -w zfree
+    0000000000446420 g    DF .text  0000000000000012  Base        zfree
+
+0x46420 is the offset of zfree in the object /bin/zsh, which is loaded at
+0x00400000. Hence the command to probe would be:
+
+    # echo 'p /bin/zsh:0x46420 %ip %ax' > uprobe_events
+
+We can see the events that are registered by looking at the uprobe_events
+file.
+
+    # cat uprobe_events
+    p:uprobes/p_zsh_0x46420 /bin/zsh:0x0000000000046420
+
+Right after definition, each event is disabled by default. To trace these
+events, you need to enable them:
+
+    # echo 1 > events/uprobes/enable
+
+Let's disable the event after sleeping for some time:
+    # sleep 20
+    # echo 0 > events/uprobes/enable
+
+And you can see the traced information via /sys/kernel/debug/tracing/trace.
+
+    # cat trace
+    # tracer: nop
+    #
+    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
+    #              | |       |          |         |
+                 zsh-24842 [006] 258544.995456: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [007] 258545.000270: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [002] 258545.043929: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+                 zsh-24842 [004] 258547.046129: p_zsh_0x46420: (0x446420) arg1=446421 arg2=79
+
+Each line shows that the probe was triggered in the task with pid 24842,
+with ip being 0x446421 and the contents of the ax register being 79.
+


* [PATCH v5 3.1.0-rc4-tip 22/26]   perf: rename target_module to target
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:04   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML


This is a precursor patch that renames identifiers which currently refer
only to kernel/module targets so that they can also refer to user space
targets.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 tools/perf/builtin-probe.c    |   12 ++++++------
 tools/perf/util/probe-event.c |   26 +++++++++++++-------------
 2 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 710ae3d..93d5171 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -61,7 +61,7 @@ static struct {
 	struct perf_probe_event events[MAX_PROBES];
 	struct strlist *dellist;
 	struct line_range line_range;
-	const char *target_module;
+	const char *target;
 	int max_probe_points;
 	struct strfilter *filter;
 } params;
@@ -249,7 +249,7 @@ static const struct option options[] = {
 		   "file", "vmlinux pathname"),
 	OPT_STRING('s', "source", &symbol_conf.source_prefix,
 		   "directory", "path to kernel source"),
-	OPT_STRING('m', "module", &params.target_module,
+	OPT_STRING('m', "module", &params.target,
 		   "modname|path",
 		   "target module name (for online) or path (for offline)"),
 #endif
@@ -336,7 +336,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 		if (!params.filter)
 			params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
 						       NULL);
-		ret = show_available_funcs(params.target_module,
+		ret = show_available_funcs(params.target,
 					   params.filter);
 		strfilter__delete(params.filter);
 		if (ret < 0)
@@ -357,7 +357,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 			usage_with_options(probe_usage, options);
 		}
 
-		ret = show_line_range(&params.line_range, params.target_module);
+		ret = show_line_range(&params.line_range, params.target);
 		if (ret < 0)
 			pr_err("  Error: Failed to show lines. (%d)\n", ret);
 		return ret;
@@ -374,7 +374,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 
 		ret = show_available_vars(params.events, params.nevents,
 					  params.max_probe_points,
-					  params.target_module,
+					  params.target,
 					  params.filter,
 					  params.show_ext_vars);
 		strfilter__delete(params.filter);
@@ -396,7 +396,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 	if (params.nevents) {
 		ret = add_perf_probe_events(params.events, params.nevents,
 					    params.max_probe_points,
-					    params.target_module,
+					    params.target,
 					    params.force_add);
 		if (ret < 0) {
 			pr_err("  Error: Failed to add events. (%d)\n", ret);
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 1c7bfa5..3ee7c39 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -275,10 +275,10 @@ static int add_module_to_probe_trace_events(struct probe_trace_event *tevs,
 /* Try to find perf_probe_event with debuginfo */
 static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 					  struct probe_trace_event **tevs,
-					  int max_tevs, const char *module)
+					  int max_tevs, const char *target)
 {
 	bool need_dwarf = perf_probe_event_need_dwarf(pev);
-	struct debuginfo *dinfo = open_debuginfo(module);
+	struct debuginfo *dinfo = open_debuginfo(target);
 	int ntevs, ret = 0;
 
 	if (!dinfo) {
@@ -297,9 +297,9 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 
 	if (ntevs > 0) {	/* Succeeded to find trace events */
 		pr_debug("find %d probe_trace_events.\n", ntevs);
-		if (module)
+		if (target)
 			ret = add_module_to_probe_trace_events(*tevs, ntevs,
-							       module);
+							       target);
 		return ret < 0 ? ret : ntevs;
 	}
 
@@ -1798,14 +1798,14 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
 
 static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 					  struct probe_trace_event **tevs,
-					  int max_tevs, const char *module)
+					  int max_tevs, const char *target)
 {
 	struct symbol *sym;
 	int ret = 0, i;
 	struct probe_trace_event *tev;
 
 	/* Convert perf_probe_event with debuginfo */
-	ret = try_to_find_probe_trace_events(pev, tevs, max_tevs, module);
+	ret = try_to_find_probe_trace_events(pev, tevs, max_tevs, target);
 	if (ret != 0)
 		return ret;	/* Found in debuginfo or got an error */
 
@@ -1821,8 +1821,8 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 		goto error;
 	}
 
-	if (module) {
-		tev->point.module = strdup(module);
+	if (target) {
+		tev->point.module = strdup(target);
 		if (tev->point.module == NULL) {
 			ret = -ENOMEM;
 			goto error;
@@ -1886,7 +1886,7 @@ struct __event_package {
 };
 
 int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
-			  int max_tevs, const char *module, bool force_add)
+			  int max_tevs, const char *target, bool force_add)
 {
 	int i, j, ret;
 	struct __event_package *pkgs;
@@ -1909,7 +1909,7 @@ int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
 		ret  = convert_to_probe_trace_events(pkgs[i].pev,
 						     &pkgs[i].tevs,
 						     max_tevs,
-						     module);
+						     target);
 		if (ret < 0)
 			goto end;
 		pkgs[i].ntevs = ret;
@@ -2063,7 +2063,7 @@ static int filter_available_functions(struct map *map __unused,
 	return 1;
 }
 
-int show_available_funcs(const char *module, struct strfilter *_filter)
+int show_available_funcs(const char *target, struct strfilter *_filter)
 {
 	struct map *map;
 	int ret;
@@ -2074,9 +2074,9 @@ int show_available_funcs(const char *module, struct strfilter *_filter)
 	if (ret < 0)
 		return ret;
 
-	map = kernel_get_module_map(module);
+	map = kernel_get_module_map(target);
 	if (!map) {
-		pr_err("Failed to find %s map.\n", (module) ? : "kernel");
+		pr_err("Failed to find %s map.\n", (target) ? : "kernel");
 		return -EINVAL;
 	}
 	available_func_filter = _filter;


* [PATCH v5 3.1.0-rc4-tip 23/26]   perf: perf interface for uprobes
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:04   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Andi Kleen, Andrew Morton


Enhances perf probe to support user space executables and libraries,
providing very basic support for uprobes.

[ Probing a function in the executable using function name  ]
-------------------------------------------------------------
[root@localhost ~]# perf probe -x /bin/zsh zfree
Add new event:
  probe_zsh:zfree      (on /bin/zsh:0x45400)

You can now use it on all perf tools, such as:

	perf record -e probe_zsh:zfree -aR sleep 1

[root@localhost ~]# perf record -e probe_zsh:zfree -aR sleep 15
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.314 MB perf.data (~13715 samples) ]
[root@localhost ~]# perf report --stdio
# Events: 3K probe_zsh:zfree
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
   100.00%              zsh  zsh            [.] zfree


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
[root@localhost ~]

[ Probing a library function using function name ]
--------------------------------------------------
[root@localhost]#
[root@localhost]# perf probe -x /lib64/libc.so.6 malloc
Add new event:
  probe_libc:malloc    (on /lib64/libc-2.5.so:0x74dc0)

You can now use it on all perf tools, such as:

	perf record -e probe_libc:malloc -aR sleep 1

[root@localhost]#
[root@localhost]# perf probe --list
  probe_libc:malloc    (on /lib64/libc-2.5.so:0x0000000000074dc0)

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 tools/perf/builtin-probe.c    |   37 ++++
 tools/perf/util/probe-event.c |  346 +++++++++++++++++++++++++++++++++--------
 tools/perf/util/probe-event.h |    8 +
 tools/perf/util/symbol.c      |   10 +
 tools/perf/util/symbol.h      |    1 
 5 files changed, 324 insertions(+), 78 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 93d5171..43e6321 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -57,6 +57,7 @@ static struct {
 	bool show_ext_vars;
 	bool show_funcs;
 	bool mod_events;
+	bool uprobes;
 	int nevents;
 	struct perf_probe_event events[MAX_PROBES];
 	struct strlist *dellist;
@@ -78,6 +79,7 @@ static int parse_probe_event(const char *str)
 		return -1;
 	}
 
+	pev->uprobes = params.uprobes;
 	/* Parse a perf-probe command into event */
 	ret = parse_perf_probe_command(str, pev);
 	pr_debug("%d arguments\n", pev->nargs);
@@ -128,6 +130,27 @@ static int opt_del_probe_event(const struct option *opt __used,
 	return 0;
 }
 
+static int opt_set_target(const struct option *opt, const char *str,
+			int unset __used)
+{
+	int ret = -ENOENT;
+
+	if  (str && !params.target) {
+		if (!strcmp(opt->long_name, "exec"))
+			params.uprobes = true;
+#ifdef DWARF_SUPPORT
+		else if (!strcmp(opt->long_name, "module"))
+			params.uprobes = false;
+#endif
+		else
+			return ret;
+
+		params.target = str;
+		ret = 0;
+	}
+	return ret;
+}
+
 #ifdef DWARF_SUPPORT
 static int opt_show_lines(const struct option *opt __used,
 			  const char *str, int unset __used)
@@ -249,9 +272,9 @@ static const struct option options[] = {
 		   "file", "vmlinux pathname"),
 	OPT_STRING('s', "source", &symbol_conf.source_prefix,
 		   "directory", "path to kernel source"),
-	OPT_STRING('m', "module", &params.target,
-		   "modname|path",
-		   "target module name (for online) or path (for offline)"),
+	OPT_CALLBACK('m', "module", NULL, "modname|path",
+		"target module name (for online) or path (for offline)",
+		opt_set_target),
 #endif
 	OPT__DRY_RUN(&probe_event_dry_run),
 	OPT_INTEGER('\0', "max-probes", &params.max_probe_points,
@@ -263,6 +286,8 @@ static const struct option options[] = {
 		     "\t\t\t(default: \"" DEFAULT_VAR_FILTER "\" for --vars,\n"
 		     "\t\t\t \"" DEFAULT_FUNC_FILTER "\" for --funcs)",
 		     opt_set_filter),
+	OPT_CALLBACK('x', "exec", NULL, "executable|path",
+			"target executable name or path", opt_set_target),
 	OPT_END()
 };
 
@@ -313,6 +338,10 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 			pr_err("  Error: Don't use --list with --funcs.\n");
 			usage_with_options(probe_usage, options);
 		}
+		if (params.uprobes) {
+			pr_warning("  Error: Don't use --list with --exec.\n");
+			usage_with_options(probe_usage, options);
+		}
 		ret = show_perf_probe_events();
 		if (ret < 0)
 			pr_err("  Error: Failed to show event list. (%d)\n",
@@ -346,7 +375,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 	}
 
 #ifdef DWARF_SUPPORT
-	if (params.show_lines) {
+	if (params.show_lines && !params.uprobes) {
 		if (params.mod_events) {
 			pr_err("  Error: Don't use --line with"
 			       " --add/--del.\n");
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 3ee7c39..a501486 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -73,6 +73,8 @@ static int e_snprintf(char *str, size_t size, const char *format, ...)
 }
 
 static char *synthesize_perf_probe_point(struct perf_probe_point *pp);
+static int convert_name_to_addr(struct perf_probe_event *pev,
+				const char *exec);
 static struct machine machine;
 
 /* Initialize symbol maps and path of vmlinux/modules */
@@ -173,6 +175,31 @@ const char *kernel_get_module_path(const char *module)
 	return (dso) ? dso->long_name : NULL;
 }
 
+static int init_perf_uprobes(void)
+{
+	int ret = 0;
+
+	symbol_conf.try_vmlinux_path = false;
+	symbol_conf.sort_by_name = true;
+	ret = symbol__init();
+	if (ret < 0)
+		pr_debug("Failed to init symbol map.\n");
+
+	return ret;
+}
+
+static int convert_to_perf_probe_point(struct probe_trace_point *tp,
+					struct perf_probe_point *pp)
+{
+	pp->function = strdup(tp->symbol);
+	if (pp->function == NULL)
+		return -ENOMEM;
+	pp->offset = tp->offset;
+	pp->retprobe = tp->retprobe;
+
+	return 0;
+}
+
 #ifdef DWARF_SUPPORT
 /* Open new debuginfo of given module */
 static struct debuginfo *open_debuginfo(const char *module)
@@ -281,6 +308,15 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 	struct debuginfo *dinfo = open_debuginfo(target);
 	int ntevs, ret = 0;
 
+	if (pev->uprobes) {
+		if (need_dwarf) {
+			pr_warning("Debuginfo-analysis is not yet supported"
+					" with -x/--exec option.\n");
+			return -ENOSYS;
+		}
+		return convert_name_to_addr(pev, target);
+	}
+
 	if (!dinfo) {
 		if (need_dwarf) {
 			pr_warning("Failed to open debuginfo file.\n");
@@ -606,23 +642,20 @@ static int kprobe_convert_to_perf_probe(struct probe_trace_point *tp,
 		pr_err("Failed to find symbol %s in kernel.\n", tp->symbol);
 		return -ENOENT;
 	}
-	pp->function = strdup(tp->symbol);
-	if (pp->function == NULL)
-		return -ENOMEM;
-	pp->offset = tp->offset;
-	pp->retprobe = tp->retprobe;
-
-	return 0;
+	return convert_to_perf_probe_point(tp, pp);
 }
 
 static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 				struct probe_trace_event **tevs __unused,
-				int max_tevs __unused, const char *mod __unused)
+				int max_tevs __unused, const char *target)
 {
 	if (perf_probe_event_need_dwarf(pev)) {
 		pr_warning("Debuginfo-analysis is not supported.\n");
 		return -ENOSYS;
 	}
+	if (pev->uprobes)
+		return convert_name_to_addr(pev, target);
+
 	return 0;
 }
 
@@ -887,6 +920,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
 		return -EINVAL;
 	}
 
+	if (pev->uprobes && !pp->function) {
+		semantic_error("No function specified for uprobes");
+		return -EINVAL;
+	}
+
 	if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
 		semantic_error("Offset/Line/Lazy pattern can't be used with "
 			       "return probe.\n");
@@ -896,6 +934,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
 	pr_debug("symbol:%s file:%s line:%d offset:%lu return:%d lazy:%s\n",
 		 pp->function, pp->file, pp->line, pp->offset, pp->retprobe,
 		 pp->lazy_line);
+
+	if (pev->uprobes && perf_probe_event_need_dwarf(pev)) {
+		semantic_error("no dwarf based probes for uprobes.");
+		return -EINVAL;
+	}
 	return 0;
 }
 
@@ -1047,7 +1090,8 @@ bool perf_probe_event_need_dwarf(struct perf_probe_event *pev)
 {
 	int i;
 
-	if (pev->point.file || pev->point.line || pev->point.lazy_line)
+	if ((pev->point.file && !pev->uprobes) || pev->point.line ||
+					pev->point.lazy_line)
 		return true;
 
 	for (i = 0; i < pev->nargs; i++)
@@ -1344,11 +1388,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 	if (buf == NULL)
 		return NULL;
 
-	len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s%s%s+%lu",
-			 tp->retprobe ? 'r' : 'p',
-			 tev->group, tev->event,
-			 tp->module ?: "", tp->module ? ":" : "",
-			 tp->symbol, tp->offset);
+	if (tev->uprobes)
+		len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s",
+				 tp->retprobe ? 'r' : 'p',
+				 tev->group, tev->event, tp->symbol);
+	else
+		len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s%s%s+%lu",
+				 tp->retprobe ? 'r' : 'p',
+				 tev->group, tev->event,
+				 tp->module ?: "", tp->module ? ":" : "",
+				 tp->symbol, tp->offset);
+
 	if (len <= 0)
 		goto error;
 
@@ -1367,7 +1417,7 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 }
 
 static int convert_to_perf_probe_event(struct probe_trace_event *tev,
-				       struct perf_probe_event *pev)
+			       struct perf_probe_event *pev, bool is_kprobe)
 {
 	char buf[64] = "";
 	int i, ret;
@@ -1379,7 +1429,11 @@ static int convert_to_perf_probe_event(struct probe_trace_event *tev,
 		return -ENOMEM;
 
 	/* Convert trace_point to probe_point */
-	ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+	if (is_kprobe)
+		ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+	else
+		ret = convert_to_perf_probe_point(&tev->point, &pev->point);
+
 	if (ret < 0)
 		return ret;
 
@@ -1475,7 +1529,7 @@ static void clear_probe_trace_event(struct probe_trace_event *tev)
 	memset(tev, 0, sizeof(*tev));
 }
 
-static int open_kprobe_events(bool readwrite)
+static int open_probe_events(bool readwrite, bool is_kprobe)
 {
 	char buf[PATH_MAX];
 	const char *__debugfs;
@@ -1486,8 +1540,13 @@ static int open_kprobe_events(bool readwrite)
 		pr_warning("Debugfs is not mounted.\n");
 		return -ENOENT;
 	}
+	if (is_kprobe)
+		ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events",
+							__debugfs);
+	else
+		ret = e_snprintf(buf, PATH_MAX, "%stracing/uprobe_events",
+							__debugfs);
 
-	ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events", __debugfs);
 	if (ret >= 0) {
 		pr_debug("Opening %s write=%d\n", buf, readwrite);
 		if (readwrite && !probe_event_dry_run)
@@ -1498,16 +1557,29 @@ static int open_kprobe_events(bool readwrite)
 
 	if (ret < 0) {
 		if (errno == ENOENT)
-			pr_warning("kprobe_events file does not exist - please"
-				 " rebuild kernel with CONFIG_KPROBE_EVENT.\n");
+			pr_warning("%s file does not exist - please"
+				" rebuild kernel with CONFIG_%s_EVENT.\n",
+				is_kprobe ? "kprobe_events" : "uprobe_events",
+				is_kprobe ? "KPROBE" : "UPROBE");
 		else
-			pr_warning("Failed to open kprobe_events file: %s\n",
-				   strerror(errno));
+			pr_warning("Failed to open %s file: %s\n",
+				is_kprobe ? "kprobe_events" : "uprobe_events",
+				strerror(errno));
 	}
 	return ret;
 }
 
-/* Get raw string list of current kprobe_events */
+static int open_kprobe_events(bool readwrite)
+{
+	return open_probe_events(readwrite, 1);
+}
+
+static int open_uprobe_events(bool readwrite)
+{
+	return open_probe_events(readwrite, 0);
+}
+
+/* Get raw string list of current kprobe_events or uprobe_events */
 static struct strlist *get_probe_trace_command_rawlist(int fd)
 {
 	int ret, idx;
@@ -1572,36 +1644,26 @@ static int show_perf_probe_event(struct perf_probe_event *pev)
 	return ret;
 }
 
-/* List up current perf-probe events */
-int show_perf_probe_events(void)
+static int __show_perf_probe_events(int fd, bool is_kprobe)
 {
-	int fd, ret;
+	int ret = 0;
 	struct probe_trace_event tev;
 	struct perf_probe_event pev;
 	struct strlist *rawlist;
 	struct str_node *ent;
 
-	setup_pager();
-	ret = init_vmlinux();
-	if (ret < 0)
-		return ret;
-
 	memset(&tev, 0, sizeof(tev));
 	memset(&pev, 0, sizeof(pev));
 
-	fd = open_kprobe_events(false);
-	if (fd < 0)
-		return fd;
-
 	rawlist = get_probe_trace_command_rawlist(fd);
-	close(fd);
 	if (!rawlist)
 		return -ENOENT;
 
 	strlist__for_each(ent, rawlist) {
 		ret = parse_probe_trace_command(ent->s, &tev);
 		if (ret >= 0) {
-			ret = convert_to_perf_probe_event(&tev, &pev);
+			ret = convert_to_perf_probe_event(&tev, &pev,
+								is_kprobe);
 			if (ret >= 0)
 				ret = show_perf_probe_event(&pev);
 		}
@@ -1611,6 +1673,31 @@ int show_perf_probe_events(void)
 			break;
 	}
 	strlist__delete(rawlist);
+	return ret;
+}
+
+/* List up current perf-probe events */
+int show_perf_probe_events(void)
+{
+	int fd, ret;
+
+	setup_pager();
+	fd = open_kprobe_events(false);
+	if (fd < 0)
+		return fd;
+
+	ret = init_vmlinux();
+	if (ret < 0)
+		return ret;
+
+	ret = __show_perf_probe_events(fd, true);
+	close(fd);
+
+	fd = open_uprobe_events(false);
+	if (fd >= 0) {
+		ret = __show_perf_probe_events(fd, false);
+		close(fd);
+	}
 
 	return ret;
 }
@@ -1720,7 +1807,10 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
 	const char *event, *group;
 	struct strlist *namelist;
 
-	fd = open_kprobe_events(true);
+	if (pev->uprobes)
+		fd = open_uprobe_events(true);
+	else
+		fd = open_kprobe_events(true);
 	if (fd < 0)
 		return fd;
 	/* Get current event names */
@@ -1832,6 +1922,7 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 	tev->point.offset = pev->point.offset;
 	tev->point.retprobe = pev->point.retprobe;
 	tev->nargs = pev->nargs;
+	tev->uprobes = pev->uprobes;
 	if (tev->nargs) {
 		tev->args = zalloc(sizeof(struct probe_trace_arg)
 				   * tev->nargs);
@@ -1862,6 +1953,9 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 		}
 	}
 
+	if (pev->uprobes)
+		return 1;
+
 	/* Currently just checking function name from symbol map */
 	sym = __find_kernel_function_by_name(tev->point.symbol, NULL);
 	if (!sym) {
@@ -1888,15 +1982,19 @@ struct __event_package {
 int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
 			  int max_tevs, const char *target, bool force_add)
 {
-	int i, j, ret;
+	int i, j, ret = 0;
 	struct __event_package *pkgs;
 
 	pkgs = zalloc(sizeof(struct __event_package) * npevs);
 	if (pkgs == NULL)
 		return -ENOMEM;
 
-	/* Init vmlinux path */
-	ret = init_vmlinux();
+	if (!pevs->uprobes)
+		/* Init vmlinux path */
+		ret = init_vmlinux();
+	else
+		ret = init_perf_uprobes();
+
 	if (ret < 0) {
 		free(pkgs);
 		return ret;
@@ -1966,23 +2064,15 @@ static int __del_trace_probe_event(int fd, struct str_node *ent)
 	return ret;
 }
 
-static int del_trace_probe_event(int fd, const char *group,
-				  const char *event, struct strlist *namelist)
+static int del_trace_probe_event(int fd, const char *buf,
+						  struct strlist *namelist)
 {
-	char buf[128];
 	struct str_node *ent, *n;
-	int found = 0, ret = 0;
-
-	ret = e_snprintf(buf, 128, "%s:%s", group, event);
-	if (ret < 0) {
-		pr_err("Failed to copy event.\n");
-		return ret;
-	}
+	int ret = -1;
 
 	if (strpbrk(buf, "*?")) { /* Glob-exp */
 		strlist__for_each_safe(ent, n, namelist)
 			if (strglobmatch(ent->s, buf)) {
-				found++;
 				ret = __del_trace_probe_event(fd, ent);
 				if (ret < 0)
 					break;
@@ -1991,40 +2081,42 @@ static int del_trace_probe_event(int fd, const char *group,
 	} else {
 		ent = strlist__find(namelist, buf);
 		if (ent) {
-			found++;
 			ret = __del_trace_probe_event(fd, ent);
 			if (ret >= 0)
 				strlist__remove(namelist, ent);
 		}
 	}
-	if (found == 0 && ret >= 0)
-		pr_info("Info: Event \"%s\" does not exist.\n", buf);
-
 	return ret;
 }
 
 int del_perf_probe_events(struct strlist *dellist)
 {
-	int fd, ret = 0;
+	int ret = -1, ufd = -1, kfd = -1;
+	char buf[128];
 	const char *group, *event;
 	char *p, *str;
 	struct str_node *ent;
-	struct strlist *namelist;
+	struct strlist *namelist = NULL, *unamelist = NULL;
 
-	fd = open_kprobe_events(true);
-	if (fd < 0)
-		return fd;
 
 	/* Get current event names */
-	namelist = get_probe_trace_event_names(fd, true);
-	if (namelist == NULL)
-		return -EINVAL;
+	kfd = open_kprobe_events(true);
+	if (kfd < 0)
+		return kfd;
+	namelist = get_probe_trace_event_names(kfd, true);
+
+	ufd = open_uprobe_events(true);
+	if (ufd >= 0)
+		unamelist = get_probe_trace_event_names(ufd, true);
+
+	if (namelist == NULL && unamelist == NULL)
+		goto error;
 
 	strlist__for_each(ent, dellist) {
 		str = strdup(ent->s);
 		if (str == NULL) {
 			ret = -ENOMEM;
-			break;
+			goto error;
 		}
 		pr_debug("Parsing: %s\n", str);
 		p = strchr(str, ':');
@@ -2036,15 +2128,37 @@ int del_perf_probe_events(struct strlist *dellist)
 			group = "*";
 			event = str;
 		}
+
+		ret = e_snprintf(buf, 128, "%s:%s", group, event);
+		if (ret < 0) {
+			pr_err("Failed to copy event.");
+			free(str);
+			goto error;
+		}
+
 		pr_debug("Group: %s, Event: %s\n", group, event);
-		ret = del_trace_probe_event(fd, group, event, namelist);
+		if (namelist)
+			ret = del_trace_probe_event(kfd, buf, namelist);
+		if (unamelist && ret != 0)
+			ret = del_trace_probe_event(ufd, buf, unamelist);
+
 		free(str);
-		if (ret < 0)
-			break;
+		if (ret != 0)
+			pr_info("Info: Event \"%s\" does not exist.\n", buf);
 	}
-	strlist__delete(namelist);
-	close(fd);
 
+error:
+	if (kfd >= 0) {
+		if (namelist)
+			strlist__delete(namelist);
+		close(kfd);
+	}
+
+	if (ufd >= 0) {
+		if (unamelist)
+			strlist__delete(unamelist);
+		close(ufd);
+	}
 	return ret;
 }
 /* TODO: don't use a global variable for filter ... */
@@ -2090,3 +2204,95 @@ int show_available_funcs(const char *target, struct strfilter *_filter)
 	dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
 	return 0;
 }
+
+#define DEFAULT_FUNC_FILTER "!_*"
+
+/*
+ * uprobe_events only accepts address:
+ * Convert function and any offset to address
+ */
+static int convert_name_to_addr(struct perf_probe_event *pev, const char *exec)
+{
+	struct perf_probe_point *pp = &pev->point;
+	struct symbol *sym;
+	struct map *map = NULL;
+	char *function = NULL, *name = NULL;
+	int ret = -EINVAL;
+	unsigned long long vaddr = 0;
+
+	if (!pp->function)
+		goto out;
+
+	function = strdup(pp->function);
+	if (!function) {
+		pr_warning("Failed to allocate memory by strdup.\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	name = realpath(exec, NULL);
+	if (!name) {
+		pr_warning("Cannot find realpath for %s.\n", exec);
+		goto out;
+	}
+	map = dso__new_map(name);
+	if (!map) {
+		pr_warning("Cannot find appropriate DSO for %s.\n", name);
+		goto out;
+	}
+	available_func_filter = strfilter__new(DEFAULT_FUNC_FILTER, NULL);
+	if (map__load(map, filter_available_functions)) {
+		pr_err("Failed to load map.\n");
+		return -EINVAL;
+	}
+
+	sym = map__find_symbol_by_name(map, function, NULL);
+	if (!sym) {
+		pr_warning("Cannot find %s in DSO %s\n", function, name);
+		goto out;
+	}
+
+	if (map->start > sym->start)
+		vaddr = map->start;
+	vaddr += sym->start + pp->offset + map->pgoff;
+	pp->offset = 0;
+
+	if (!pev->event) {
+		pev->event = function;
+		function = NULL;
+	}
+	if (!pev->group) {
+		char *ptr1, *ptr2;
+
+		pev->group = zalloc(sizeof(char *) * 64);
+		ptr1 = strdup(basename(exec));
+		if (ptr1) {
+			ptr2 = strpbrk(ptr1, "-._");
+			if (ptr2)
+				*ptr2 = '\0';
+			e_snprintf(pev->group, 64, "%s_%s", PERFPROBE_GROUP,
+					ptr1);
+			free(ptr1);
+		}
+	}
+	free(pp->function);
+	pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+	if (!pp->function) {
+		ret = -ENOMEM;
+		pr_warning("Failed to allocate memory by zalloc.\n");
+		goto out;
+	}
+	e_snprintf(pp->function, MAX_PROBE_ARGS, "%s:0x%llx", name, vaddr);
+	ret = 0;
+
+out:
+	if (map) {
+		dso__delete(map->dso);
+		map__delete(map);
+	}
+	if (function)
+		free(function);
+	if (name)
+		free(name);
+	return ret;
+}
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index a7dee83..9e8c846 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -7,7 +7,7 @@
 
 extern bool probe_event_dry_run;
 
-/* kprobe-tracer tracing point */
+/* kprobe-tracer and uprobe-tracer tracing point */
 struct probe_trace_point {
 	char		*symbol;	/* Base symbol */
 	char		*module;	/* Module name */
@@ -21,7 +21,7 @@ struct probe_trace_arg_ref {
 	long				offset;	/* Offset value */
 };
 
-/* kprobe-tracer tracing argument */
+/* kprobe-tracer and uprobe-tracer tracing argument */
 struct probe_trace_arg {
 	char				*name;	/* Argument name */
 	char				*value;	/* Base value */
@@ -29,12 +29,13 @@ struct probe_trace_arg {
 	struct probe_trace_arg_ref	*ref;	/* Referencing offset */
 };
 
-/* kprobe-tracer tracing event (point + arg) */
+/* kprobe-tracer and uprobe-tracer tracing event (point + arg) */
 struct probe_trace_event {
 	char				*event;	/* Event name */
 	char				*group;	/* Group name */
 	struct probe_trace_point	point;	/* Trace point */
 	int				nargs;	/* Number of args */
+	bool				uprobes;	/* uprobes only */
 	struct probe_trace_arg		*args;	/* Arguments */
 };
 
@@ -70,6 +71,7 @@ struct perf_probe_event {
 	char			*group;	/* Group name */
 	struct perf_probe_point	point;	/* Probe point */
 	int			nargs;	/* Number of arguments */
+	bool			uprobes;
 	struct perf_probe_arg	*args;	/* Arguments */
 };
 
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 245e60d..7bc468f 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -567,7 +567,7 @@ static int dso__split_kallsyms(struct dso *dso, struct map *map,
 	struct machine *machine = kmaps->machine;
 	struct map *curr_map = map;
 	struct symbol *pos;
-	int count = 0, moved = 0;	
+	int count = 0, moved = 0;
 	struct rb_root *root = &dso->symbols[map->type];
 	struct rb_node *next = rb_first(root);
 	int kernel_range = 0;
@@ -2685,3 +2685,11 @@ int machine__load_vmlinux_path(struct machine *machine, enum map_type type,
 
 	return ret;
 }
+
+struct map *dso__new_map(const char *name)
+{
+	struct dso *dso = dso__new(name);
+	struct map *map = map__new2(0, dso, MAP__FUNCTION);
+
+	return map;
+}
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 7733f0b..cf567d2 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -216,6 +216,7 @@ void dso__set_long_name(struct dso *dso, char *name);
 void dso__set_build_id(struct dso *dso, void *build_id);
 void dso__read_running_kernel_build_id(struct dso *dso,
 				       struct machine *machine);
+struct map *dso__new_map(const char *name);
 struct symbol *dso__find_symbol(struct dso *dso, enum map_type type,
 				u64 addr);
 struct symbol *dso__find_symbol_by_name(struct dso *dso, enum map_type type,

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 23/26]   perf: perf interface for uprobes
@ 2011-09-20 12:04   ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Andi Kleen, Andrew Morton


Enhances perf probe to support probing user space executables and
libraries. Provides very basic support for uprobes.

[ Probing a function in the executable using function name  ]
-------------------------------------------------------------
[root@localhost ~]# perf probe -x /bin/zsh zfree
Add new event:
  probe_zsh:zfree      (on /bin/zsh:0x45400)

You can now use it on all perf tools, such as:

	perf record -e probe_zsh:zfree -aR sleep 1

[root@localhost ~]# perf record -e probe_zsh:zfree -aR sleep 15
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.314 MB perf.data (~13715 samples) ]
[root@localhost ~]# perf report --stdio
# Events: 3K probe_zsh:zfree
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
   100.00%              zsh  zsh            [.] zfree


#
# (For a higher level overview, try: perf report --sort comm,dso)
#
[root@localhost ~]#
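
The 0x45400 shown above is computed by `convert_name_to_addr()` from the DSO's symbol table rather than from DWARF. A sketch of that arithmetic with illustrative numbers (a real run takes `sym->start` from the symtab of `/bin/zsh`; the values here are assumptions matching the session above):

```shell
# Sketch of the address computation in convert_name_to_addr().
map_start=0x0        # map->start; typically 0 for a freshly created map here
sym_start=0x45400    # st_value of zfree in /bin/zsh (illustrative)
probe_offset=0x0     # user-supplied +offset; none given in the example
pgoff=0x0            # map->pgoff

vaddr=$(( sym_start + probe_offset + pgoff ))
# When map->start exceeds sym->start, the map base is added in as well:
if [ $(( map_start )) -gt $(( sym_start )) ]; then
	vaddr=$(( map_start + vaddr ))
fi
printf 'probe_zsh:zfree (on /bin/zsh:0x%x)\n' "$vaddr"
```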

[ Probing a library function using function name ]
--------------------------------------------------
[root@localhost]#
[root@localhost]# perf probe -x /lib64/libc.so.6 malloc
Add new event:
  probe_libc:malloc    (on /lib64/libc-2.5.so:0x74dc0)

You can now use it on all perf tools, such as:

	perf record -e probe_libc:malloc -aR sleep 1

[root@localhost]#
[root@localhost]# perf probe --list
  probe_libc:malloc    (on /lib64/libc-2.5.so:0x0000000000074dc0)
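
The listed events live under debugfs; `open_probe_events()` in this patch selects the tracing file by probe type, falling back to an error suggesting CONFIG_UPROBE_EVENT if the file is absent. A small sketch of that path selection (assuming the conventional `/sys/kernel/debug` mount point; the real code discovers the mount via `debugfs_find_mountpoint()`):

```shell
# Sketch of the debugfs path selection in open_probe_events().
# Assumes debugfs is mounted at /sys/kernel/debug; adjust for your system.
debugfs=/sys/kernel/debug/
is_kprobe=0          # 0: uprobe event (-x/--exec), 1: kprobe event

if [ "$is_kprobe" -eq 1 ]; then
	path="${debugfs}tracing/kprobe_events"
else
	path="${debugfs}tracing/uprobe_events"
fi
echo "$path"
```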

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 tools/perf/builtin-probe.c    |   37 ++++
 tools/perf/util/probe-event.c |  346 +++++++++++++++++++++++++++++++++--------
 tools/perf/util/probe-event.h |    8 +
 tools/perf/util/symbol.c      |   10 +
 tools/perf/util/symbol.h      |    1 
 5 files changed, 324 insertions(+), 78 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 93d5171..43e6321 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -57,6 +57,7 @@ static struct {
 	bool show_ext_vars;
 	bool show_funcs;
 	bool mod_events;
+	bool uprobes;
 	int nevents;
 	struct perf_probe_event events[MAX_PROBES];
 	struct strlist *dellist;
@@ -78,6 +79,7 @@ static int parse_probe_event(const char *str)
 		return -1;
 	}
 
+	pev->uprobes = params.uprobes;
 	/* Parse a perf-probe command into event */
 	ret = parse_perf_probe_command(str, pev);
 	pr_debug("%d arguments\n", pev->nargs);
@@ -128,6 +130,27 @@ static int opt_del_probe_event(const struct option *opt __used,
 	return 0;
 }
 
+static int opt_set_target(const struct option *opt, const char *str,
+			int unset __used)
+{
+	int ret = -ENOENT;
+
+	if  (str && !params.target) {
+		if (!strcmp(opt->long_name, "exec"))
+			params.uprobes = true;
+#ifdef DWARF_SUPPORT
+		else if (!strcmp(opt->long_name, "module"))
+			params.uprobes = false;
+#endif
+		else
+			return ret;
+
+		params.target = str;
+		ret = 0;
+	}
+	return ret;
+}
+
 #ifdef DWARF_SUPPORT
 static int opt_show_lines(const struct option *opt __used,
 			  const char *str, int unset __used)
@@ -249,9 +272,9 @@ static const struct option options[] = {
 		   "file", "vmlinux pathname"),
 	OPT_STRING('s', "source", &symbol_conf.source_prefix,
 		   "directory", "path to kernel source"),
-	OPT_STRING('m', "module", &params.target,
-		   "modname|path",
-		   "target module name (for online) or path (for offline)"),
+	OPT_CALLBACK('m', "module", NULL, "modname|path",
+		"target module name (for online) or path (for offline)",
+		opt_set_target),
 #endif
 	OPT__DRY_RUN(&probe_event_dry_run),
 	OPT_INTEGER('\0', "max-probes", &params.max_probe_points,
@@ -263,6 +286,8 @@ static const struct option options[] = {
 		     "\t\t\t(default: \"" DEFAULT_VAR_FILTER "\" for --vars,\n"
 		     "\t\t\t \"" DEFAULT_FUNC_FILTER "\" for --funcs)",
 		     opt_set_filter),
+	OPT_CALLBACK('x', "exec", NULL, "executable|path",
+			"target executable name or path", opt_set_target),
 	OPT_END()
 };
 
@@ -313,6 +338,10 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 			pr_err("  Error: Don't use --list with --funcs.\n");
 			usage_with_options(probe_usage, options);
 		}
+		if (params.uprobes) {
+			pr_warning("  Error: Don't use --list with --exec.\n");
+			usage_with_options(probe_usage, options);
+		}
 		ret = show_perf_probe_events();
 		if (ret < 0)
 			pr_err("  Error: Failed to show event list. (%d)\n",
@@ -346,7 +375,7 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 	}
 
 #ifdef DWARF_SUPPORT
-	if (params.show_lines) {
+	if (params.show_lines && !params.uprobes) {
 		if (params.mod_events) {
 			pr_err("  Error: Don't use --line with"
 			       " --add/--del.\n");
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index 3ee7c39..a501486 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -73,6 +73,8 @@ static int e_snprintf(char *str, size_t size, const char *format, ...)
 }
 
 static char *synthesize_perf_probe_point(struct perf_probe_point *pp);
+static int convert_name_to_addr(struct perf_probe_event *pev,
+				const char *exec);
 static struct machine machine;
 
 /* Initialize symbol maps and path of vmlinux/modules */
@@ -173,6 +175,31 @@ const char *kernel_get_module_path(const char *module)
 	return (dso) ? dso->long_name : NULL;
 }
 
+static int init_perf_uprobes(void)
+{
+	int ret = 0;
+
+	symbol_conf.try_vmlinux_path = false;
+	symbol_conf.sort_by_name = true;
+	ret = symbol__init();
+	if (ret < 0)
+		pr_debug("Failed to init symbol map.\n");
+
+	return ret;
+}
+
+static int convert_to_perf_probe_point(struct probe_trace_point *tp,
+					struct perf_probe_point *pp)
+{
+	pp->function = strdup(tp->symbol);
+	if (pp->function == NULL)
+		return -ENOMEM;
+	pp->offset = tp->offset;
+	pp->retprobe = tp->retprobe;
+
+	return 0;
+}
+
 #ifdef DWARF_SUPPORT
 /* Open new debuginfo of given module */
 static struct debuginfo *open_debuginfo(const char *module)
@@ -281,6 +308,15 @@ static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 	struct debuginfo *dinfo = open_debuginfo(target);
 	int ntevs, ret = 0;
 
+	if (pev->uprobes) {
+		if (need_dwarf) {
+			pr_warning("Debuginfo-analysis is not yet supported"
+					" with -x/--exec option.\n");
+			return -ENOSYS;
+		}
+		return convert_name_to_addr(pev, target);
+	}
+
 	if (!dinfo) {
 		if (need_dwarf) {
 			pr_warning("Failed to open debuginfo file.\n");
@@ -606,23 +642,20 @@ static int kprobe_convert_to_perf_probe(struct probe_trace_point *tp,
 		pr_err("Failed to find symbol %s in kernel.\n", tp->symbol);
 		return -ENOENT;
 	}
-	pp->function = strdup(tp->symbol);
-	if (pp->function == NULL)
-		return -ENOMEM;
-	pp->offset = tp->offset;
-	pp->retprobe = tp->retprobe;
-
-	return 0;
+	return convert_to_perf_probe_point(tp, pp);
 }
 
 static int try_to_find_probe_trace_events(struct perf_probe_event *pev,
 				struct probe_trace_event **tevs __unused,
-				int max_tevs __unused, const char *mod __unused)
+				int max_tevs __unused, const char *target)
 {
 	if (perf_probe_event_need_dwarf(pev)) {
 		pr_warning("Debuginfo-analysis is not supported.\n");
 		return -ENOSYS;
 	}
+	if (pev->uprobes)
+		return convert_name_to_addr(pev, target);
+
 	return 0;
 }
 
@@ -887,6 +920,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
 		return -EINVAL;
 	}
 
+	if (pev->uprobes && !pp->function) {
+		semantic_error("No function specified for uprobes");
+		return -EINVAL;
+	}
+
 	if ((pp->offset || pp->line || pp->lazy_line) && pp->retprobe) {
 		semantic_error("Offset/Line/Lazy pattern can't be used with "
 			       "return probe.\n");
@@ -896,6 +934,11 @@ static int parse_perf_probe_point(char *arg, struct perf_probe_event *pev)
 	pr_debug("symbol:%s file:%s line:%d offset:%lu return:%d lazy:%s\n",
 		 pp->function, pp->file, pp->line, pp->offset, pp->retprobe,
 		 pp->lazy_line);
+
+	if (pev->uprobes && perf_probe_event_need_dwarf(pev)) {
+		semantic_error("no dwarf based probes for uprobes.");
+		return -EINVAL;
+	}
 	return 0;
 }
 
@@ -1047,7 +1090,8 @@ bool perf_probe_event_need_dwarf(struct perf_probe_event *pev)
 {
 	int i;
 
-	if (pev->point.file || pev->point.line || pev->point.lazy_line)
+	if ((pev->point.file && !pev->uprobes) || pev->point.line ||
+					pev->point.lazy_line)
 		return true;
 
 	for (i = 0; i < pev->nargs; i++)
@@ -1344,11 +1388,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 	if (buf == NULL)
 		return NULL;
 
-	len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s%s%s+%lu",
-			 tp->retprobe ? 'r' : 'p',
-			 tev->group, tev->event,
-			 tp->module ?: "", tp->module ? ":" : "",
-			 tp->symbol, tp->offset);
+	if (tev->uprobes)
+		len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s",
+				 tp->retprobe ? 'r' : 'p',
+				 tev->group, tev->event, tp->symbol);
+	else
+		len = e_snprintf(buf, MAX_CMDLEN, "%c:%s/%s %s%s%s+%lu",
+				 tp->retprobe ? 'r' : 'p',
+				 tev->group, tev->event,
+				 tp->module ?: "", tp->module ? ":" : "",
+				 tp->symbol, tp->offset);
+
 	if (len <= 0)
 		goto error;
 
@@ -1367,7 +1417,7 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 }
 
 static int convert_to_perf_probe_event(struct probe_trace_event *tev,
-				       struct perf_probe_event *pev)
+			       struct perf_probe_event *pev, bool is_kprobe)
 {
 	char buf[64] = "";
 	int i, ret;
@@ -1379,7 +1429,11 @@ static int convert_to_perf_probe_event(struct probe_trace_event *tev,
 		return -ENOMEM;
 
 	/* Convert trace_point to probe_point */
-	ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+	if (is_kprobe)
+		ret = kprobe_convert_to_perf_probe(&tev->point, &pev->point);
+	else
+		ret = convert_to_perf_probe_point(&tev->point, &pev->point);
+
 	if (ret < 0)
 		return ret;
 
@@ -1475,7 +1529,7 @@ static void clear_probe_trace_event(struct probe_trace_event *tev)
 	memset(tev, 0, sizeof(*tev));
 }
 
-static int open_kprobe_events(bool readwrite)
+static int open_probe_events(bool readwrite, bool is_kprobe)
 {
 	char buf[PATH_MAX];
 	const char *__debugfs;
@@ -1486,8 +1540,13 @@ static int open_kprobe_events(bool readwrite)
 		pr_warning("Debugfs is not mounted.\n");
 		return -ENOENT;
 	}
+	if (is_kprobe)
+		ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events",
+							__debugfs);
+	else
+		ret = e_snprintf(buf, PATH_MAX, "%stracing/uprobe_events",
+							__debugfs);
 
-	ret = e_snprintf(buf, PATH_MAX, "%stracing/kprobe_events", __debugfs);
 	if (ret >= 0) {
 		pr_debug("Opening %s write=%d\n", buf, readwrite);
 		if (readwrite && !probe_event_dry_run)
@@ -1498,16 +1557,29 @@ static int open_kprobe_events(bool readwrite)
 
 	if (ret < 0) {
 		if (errno == ENOENT)
-			pr_warning("kprobe_events file does not exist - please"
-				 " rebuild kernel with CONFIG_KPROBE_EVENT.\n");
+			pr_warning("%s file does not exist - please"
+				" rebuild kernel with CONFIG_%s_EVENT.\n",
+				is_kprobe ? "kprobe_events" : "uprobe_events",
+				is_kprobe ? "KPROBE" : "UPROBE");
 		else
-			pr_warning("Failed to open kprobe_events file: %s\n",
-				   strerror(errno));
+			pr_warning("Failed to open %s file: %s\n",
+				is_kprobe ? "kprobe_events" : "uprobe_events",
+				strerror(errno));
 	}
 	return ret;
 }
 
-/* Get raw string list of current kprobe_events */
+static int open_kprobe_events(bool readwrite)
+{
+	return open_probe_events(readwrite, 1);
+}
+
+static int open_uprobe_events(bool readwrite)
+{
+	return open_probe_events(readwrite, 0);
+}
+
+/* Get raw string list of current kprobe_events or uprobe_events */
 static struct strlist *get_probe_trace_command_rawlist(int fd)
 {
 	int ret, idx;
@@ -1572,36 +1644,26 @@ static int show_perf_probe_event(struct perf_probe_event *pev)
 	return ret;
 }
 
-/* List up current perf-probe events */
-int show_perf_probe_events(void)
+static int __show_perf_probe_events(int fd, bool is_kprobe)
 {
-	int fd, ret;
+	int ret = 0;
 	struct probe_trace_event tev;
 	struct perf_probe_event pev;
 	struct strlist *rawlist;
 	struct str_node *ent;
 
-	setup_pager();
-	ret = init_vmlinux();
-	if (ret < 0)
-		return ret;
-
 	memset(&tev, 0, sizeof(tev));
 	memset(&pev, 0, sizeof(pev));
 
-	fd = open_kprobe_events(false);
-	if (fd < 0)
-		return fd;
-
 	rawlist = get_probe_trace_command_rawlist(fd);
-	close(fd);
 	if (!rawlist)
 		return -ENOENT;
 
 	strlist__for_each(ent, rawlist) {
 		ret = parse_probe_trace_command(ent->s, &tev);
 		if (ret >= 0) {
-			ret = convert_to_perf_probe_event(&tev, &pev);
+			ret = convert_to_perf_probe_event(&tev, &pev,
+								is_kprobe);
 			if (ret >= 0)
 				ret = show_perf_probe_event(&pev);
 		}
@@ -1611,6 +1673,31 @@ int show_perf_probe_events(void)
 			break;
 	}
 	strlist__delete(rawlist);
+	return ret;
+}
+
+/* List up current perf-probe events */
+int show_perf_probe_events(void)
+{
+	int fd, ret;
+
+	setup_pager();
+	fd = open_kprobe_events(false);
+	if (fd < 0)
+		return fd;
+
+	ret = init_vmlinux();
+	if (ret < 0)
+		return ret;
+
+	ret = __show_perf_probe_events(fd, true);
+	close(fd);
+
+	fd = open_uprobe_events(false);
+	if (fd >= 0) {
+		ret = __show_perf_probe_events(fd, false);
+		close(fd);
+	}
 
 	return ret;
 }
@@ -1720,7 +1807,10 @@ static int __add_probe_trace_events(struct perf_probe_event *pev,
 	const char *event, *group;
 	struct strlist *namelist;
 
-	fd = open_kprobe_events(true);
+	if (pev->uprobes)
+		fd = open_uprobe_events(true);
+	else
+		fd = open_kprobe_events(true);
 	if (fd < 0)
 		return fd;
 	/* Get current event names */
@@ -1832,6 +1922,7 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 	tev->point.offset = pev->point.offset;
 	tev->point.retprobe = pev->point.retprobe;
 	tev->nargs = pev->nargs;
+	tev->uprobes = pev->uprobes;
 	if (tev->nargs) {
 		tev->args = zalloc(sizeof(struct probe_trace_arg)
 				   * tev->nargs);
@@ -1862,6 +1953,9 @@ static int convert_to_probe_trace_events(struct perf_probe_event *pev,
 		}
 	}
 
+	if (pev->uprobes)
+		return 1;
+
 	/* Currently just checking function name from symbol map */
 	sym = __find_kernel_function_by_name(tev->point.symbol, NULL);
 	if (!sym) {
@@ -1888,15 +1982,19 @@ struct __event_package {
 int add_perf_probe_events(struct perf_probe_event *pevs, int npevs,
 			  int max_tevs, const char *target, bool force_add)
 {
-	int i, j, ret;
+	int i, j, ret = 0;
 	struct __event_package *pkgs;
 
 	pkgs = zalloc(sizeof(struct __event_package) * npevs);
 	if (pkgs == NULL)
 		return -ENOMEM;
 
-	/* Init vmlinux path */
-	ret = init_vmlinux();
+	if (!pevs->uprobes)
+		/* Init vmlinux path */
+		ret = init_vmlinux();
+	else
+		ret = init_perf_uprobes();
+
 	if (ret < 0) {
 		free(pkgs);
 		return ret;
@@ -1966,23 +2064,15 @@ static int __del_trace_probe_event(int fd, struct str_node *ent)
 	return ret;
 }
 
-static int del_trace_probe_event(int fd, const char *group,
-				  const char *event, struct strlist *namelist)
+static int del_trace_probe_event(int fd, const char *buf,
+						  struct strlist *namelist)
 {
-	char buf[128];
 	struct str_node *ent, *n;
-	int found = 0, ret = 0;
-
-	ret = e_snprintf(buf, 128, "%s:%s", group, event);
-	if (ret < 0) {
-		pr_err("Failed to copy event.\n");
-		return ret;
-	}
+	int ret = -1;
 
 	if (strpbrk(buf, "*?")) { /* Glob-exp */
 		strlist__for_each_safe(ent, n, namelist)
 			if (strglobmatch(ent->s, buf)) {
-				found++;
 				ret = __del_trace_probe_event(fd, ent);
 				if (ret < 0)
 					break;
@@ -1991,40 +2081,42 @@ static int del_trace_probe_event(int fd, const char *group,
 	} else {
 		ent = strlist__find(namelist, buf);
 		if (ent) {
-			found++;
 			ret = __del_trace_probe_event(fd, ent);
 			if (ret >= 0)
 				strlist__remove(namelist, ent);
 		}
 	}
-	if (found == 0 && ret >= 0)
-		pr_info("Info: Event \"%s\" does not exist.\n", buf);
-
 	return ret;
 }
 
 int del_perf_probe_events(struct strlist *dellist)
 {
-	int fd, ret = 0;
+	int ret = -1, ufd = -1, kfd = -1;
+	char buf[128];
 	const char *group, *event;
 	char *p, *str;
 	struct str_node *ent;
-	struct strlist *namelist;
+	struct strlist *namelist = NULL, *unamelist = NULL;
 
-	fd = open_kprobe_events(true);
-	if (fd < 0)
-		return fd;
 
 	/* Get current event names */
-	namelist = get_probe_trace_event_names(fd, true);
-	if (namelist == NULL)
-		return -EINVAL;
+	kfd = open_kprobe_events(true);
+	if (kfd < 0)
+		return kfd;
+	namelist = get_probe_trace_event_names(kfd, true);
+
+	ufd = open_uprobe_events(true);
+	if (ufd >= 0)
+		unamelist = get_probe_trace_event_names(ufd, true);
+
+	if (namelist == NULL && unamelist == NULL)
+		goto error;
 
 	strlist__for_each(ent, dellist) {
 		str = strdup(ent->s);
 		if (str == NULL) {
 			ret = -ENOMEM;
-			break;
+			goto error;
 		}
 		pr_debug("Parsing: %s\n", str);
 		p = strchr(str, ':');
@@ -2036,15 +2128,37 @@ int del_perf_probe_events(struct strlist *dellist)
 			group = "*";
 			event = str;
 		}
+
+		ret = e_snprintf(buf, 128, "%s:%s", group, event);
+		if (ret < 0) {
+			pr_err("Failed to copy event.\n");
+			free(str);
+			goto error;
+		}
+
 		pr_debug("Group: %s, Event: %s\n", group, event);
-		ret = del_trace_probe_event(fd, group, event, namelist);
+		if (namelist)
+			ret = del_trace_probe_event(kfd, buf, namelist);
+		if (unamelist && ret != 0)
+			ret = del_trace_probe_event(ufd, buf, unamelist);
+
 		free(str);
-		if (ret < 0)
-			break;
+		if (ret != 0)
+			pr_info("Info: Event \"%s\" does not exist.\n", buf);
 	}
-	strlist__delete(namelist);
-	close(fd);
 
+error:
+	if (kfd >= 0) {
+		if (namelist)
+			strlist__delete(namelist);
+		close(kfd);
+	}
+
+	if (ufd >= 0) {
+		if (unamelist)
+			strlist__delete(unamelist);
+		close(ufd);
+	}
 	return ret;
 }
 /* TODO: don't use a global variable for filter ... */
@@ -2090,3 +2204,95 @@ int show_available_funcs(const char *target, struct strfilter *_filter)
 	dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
 	return 0;
 }
+
+#define DEFAULT_FUNC_FILTER "!_*"
+
+/*
+ * uprobe_events only accepts an address:
+ * convert a function name and any offset to an address
+ */
+static int convert_name_to_addr(struct perf_probe_event *pev, const char *exec)
+{
+	struct perf_probe_point *pp = &pev->point;
+	struct symbol *sym;
+	struct map *map = NULL;
+	char *function = NULL, *name = NULL;
+	int ret = -EINVAL;
+	unsigned long long vaddr = 0;
+
+	if (!pp->function)
+		goto out;
+
+	function = strdup(pp->function);
+	if (!function) {
+		pr_warning("Failed to allocate memory by strdup.\n");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	name = realpath(exec, NULL);
+	if (!name) {
+		pr_warning("Cannot find realpath for %s.\n", exec);
+		goto out;
+	}
+	map = dso__new_map(name);
+	if (!map) {
+		pr_warning("Cannot find appropriate DSO for %s.\n", name);
+		goto out;
+	}
+	available_func_filter = strfilter__new(DEFAULT_FUNC_FILTER, NULL);
+	if (map__load(map, filter_available_functions)) {
+		pr_err("Failed to load map.\n");
+		goto out;
+	}
+
+	sym = map__find_symbol_by_name(map, function, NULL);
+	if (!sym) {
+		pr_warning("Cannot find %s in DSO %s\n", function, name);
+		goto out;
+	}
+
+	if (map->start > sym->start)
+		vaddr = map->start;
+	vaddr += sym->start + pp->offset + map->pgoff;
+	pp->offset = 0;
+
+	if (!pev->event) {
+		pev->event = function;
+		function = NULL;
+	}
+	if (!pev->group) {
+		char *ptr1, *ptr2;
+
+		pev->group = zalloc(sizeof(char *) * 64);
+		ptr1 = strdup(basename(exec));
+		if (ptr1) {
+			ptr2 = strpbrk(ptr1, "-._");
+			if (ptr2)
+				*ptr2 = '\0';
+			e_snprintf(pev->group, 64, "%s_%s", PERFPROBE_GROUP,
+					ptr1);
+			free(ptr1);
+		}
+	}
+	free(pp->function);
+	pp->function = zalloc(sizeof(char *) * MAX_PROBE_ARGS);
+	if (!pp->function) {
+		ret = -ENOMEM;
+		pr_warning("Failed to allocate memory by zalloc.\n");
+		goto out;
+	}
+	e_snprintf(pp->function, MAX_PROBE_ARGS, "%s:0x%llx", name, vaddr);
+	ret = 0;
+
+out:
+	if (map) {
+		dso__delete(map->dso);
+		map__delete(map);
+	}
+	if (function)
+		free(function);
+	if (name)
+		free(name);
+	return ret;
+}
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index a7dee83..9e8c846 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -7,7 +7,7 @@
 
 extern bool probe_event_dry_run;
 
-/* kprobe-tracer tracing point */
+/* kprobe-tracer and uprobe-tracer tracing point */
 struct probe_trace_point {
 	char		*symbol;	/* Base symbol */
 	char		*module;	/* Module name */
@@ -21,7 +21,7 @@ struct probe_trace_arg_ref {
 	long				offset;	/* Offset value */
 };
 
-/* kprobe-tracer tracing argument */
+/* kprobe-tracer and uprobe-tracer tracing argument */
 struct probe_trace_arg {
 	char				*name;	/* Argument name */
 	char				*value;	/* Base value */
@@ -29,12 +29,13 @@ struct probe_trace_arg {
 	struct probe_trace_arg_ref	*ref;	/* Referencing offset */
 };
 
-/* kprobe-tracer tracing event (point + arg) */
+/* kprobe-tracer and uprobe-tracer tracing event (point + arg) */
 struct probe_trace_event {
 	char				*event;	/* Event name */
 	char				*group;	/* Group name */
 	struct probe_trace_point	point;	/* Trace point */
 	int				nargs;	/* Number of args */
+	bool				uprobes;	/* uprobes only */
 	struct probe_trace_arg		*args;	/* Arguments */
 };
 
@@ -70,6 +71,7 @@ struct perf_probe_event {
 	char			*group;	/* Group name */
 	struct perf_probe_point	point;	/* Probe point */
 	int			nargs;	/* Number of arguments */
+	bool			uprobes;
 	struct perf_probe_arg	*args;	/* Arguments */
 };
 
diff --git a/tools/perf/util/symbol.c b/tools/perf/util/symbol.c
index 245e60d..7bc468f 100644
--- a/tools/perf/util/symbol.c
+++ b/tools/perf/util/symbol.c
@@ -567,7 +567,7 @@ static int dso__split_kallsyms(struct dso *dso, struct map *map,
 	struct machine *machine = kmaps->machine;
 	struct map *curr_map = map;
 	struct symbol *pos;
-	int count = 0, moved = 0;	
+	int count = 0, moved = 0;
 	struct rb_root *root = &dso->symbols[map->type];
 	struct rb_node *next = rb_first(root);
 	int kernel_range = 0;
@@ -2685,3 +2685,11 @@ int machine__load_vmlinux_path(struct machine *machine, enum map_type type,
 
 	return ret;
 }
+
+struct map *dso__new_map(const char *name)
+{
+	struct dso *dso = dso__new(name);
+	struct map *map = map__new2(0, dso, MAP__FUNCTION);
+
+	return map;
+}
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index 7733f0b..cf567d2 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -216,6 +216,7 @@ void dso__set_long_name(struct dso *dso, char *name);
 void dso__set_build_id(struct dso *dso, void *build_id);
 void dso__read_running_kernel_build_id(struct dso *dso,
 				       struct machine *machine);
+struct map *dso__new_map(const char *name);
 struct symbol *dso__find_symbol(struct dso *dso, enum map_type type,
 				u64 addr);
 struct symbol *dso__find_symbol_by_name(struct dso *dso, enum map_type type,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 24/26]   perf: show possible probes in a given executable file or library.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:04   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:04 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Oleg Nesterov,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML


Enhance the -F/--funcs option of "perf probe" to list possible probe
points in an executable file or shared library.

Show the last 10 functions in /bin/zsh:

# perf probe -F -x /bin/zsh | tail
zstrtol
ztrcmp
ztrdup
ztrduppfx
ztrftime
ztrlen
ztrncpy
ztrsub
zwarn
zwarnnam

Show the first 10 functions in /lib/libc.so.6:

# perf probe -F -x /lib/libc.so.6 | head
_IO_adjust_column
_IO_adjust_wcolumn
_IO_default_doallocate
_IO_default_finish
_IO_default_pbackfail
_IO_default_uflow
_IO_default_xsgetn
_IO_default_xsputn
_IO_do_write@@GLIBC_2.2.5
_IO_doallocbuf

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 tools/perf/builtin-probe.c    |    4 +--
 tools/perf/util/probe-event.c |   56 +++++++++++++++++++++++++++++++----------
 tools/perf/util/probe-event.h |    4 +--
 3 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/tools/perf/builtin-probe.c b/tools/perf/builtin-probe.c
index 43e6321..5e7622c 100644
--- a/tools/perf/builtin-probe.c
+++ b/tools/perf/builtin-probe.c
@@ -365,8 +365,8 @@ int cmd_probe(int argc, const char **argv, const char *prefix __used)
 		if (!params.filter)
 			params.filter = strfilter__new(DEFAULT_FUNC_FILTER,
 						       NULL);
-		ret = show_available_funcs(params.target,
-					   params.filter);
+		ret = show_available_funcs(params.target, params.filter,
+					params.uprobes);
 		strfilter__delete(params.filter);
 		if (ret < 0)
 			pr_err("  Error: Failed to show functions."
diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index a501486..659fecb 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -47,6 +47,7 @@
 #include "trace-event.h"	/* For __unused */
 #include "probe-event.h"
 #include "probe-finder.h"
+#include "session.h"
 
 #define MAX_CMDLEN 256
 #define MAX_PROBE_ARGS 128
@@ -2161,6 +2162,7 @@ int del_perf_probe_events(struct strlist *dellist)
 	}
 	return ret;
 }
+
 /* TODO: don't use a global variable for filter ... */
 static struct strfilter *available_func_filter;
 
@@ -2177,32 +2179,60 @@ static int filter_available_functions(struct map *map __unused,
 	return 1;
 }
 
-int show_available_funcs(const char *target, struct strfilter *_filter)
+static int __show_available_funcs(struct map *map)
+{
+	if (map__load(map, filter_available_functions)) {
+		pr_err("Failed to load map.\n");
+		return -EINVAL;
+	}
+	if (!dso__sorted_by_name(map->dso, map->type))
+		dso__sort_by_name(map->dso, map->type);
+
+	dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
+	return 0;
+}
+
+static int available_kernel_funcs(const char *module)
 {
 	struct map *map;
 	int ret;
 
-	setup_pager();
-
 	ret = init_vmlinux();
 	if (ret < 0)
 		return ret;
 
-	map = kernel_get_module_map(target);
+	map = kernel_get_module_map(module);
 	if (!map) {
-		pr_err("Failed to find %s map.\n", (target) ? : "kernel");
+		pr_err("Failed to find %s map.\n", (module) ? : "kernel");
 		return -EINVAL;
 	}
+	return __show_available_funcs(map);
+}
+
+int show_available_funcs(const char *target, struct strfilter *_filter,
+					bool user)
+{
+	struct map *map;
+	int ret;
+
+	setup_pager();
 	available_func_filter = _filter;
-	if (map__load(map, filter_available_functions)) {
-		pr_err("Failed to load map.\n");
-		return -EINVAL;
-	}
-	if (!dso__sorted_by_name(map->dso, map->type))
-		dso__sort_by_name(map->dso, map->type);
 
-	dso__fprintf_symbols_by_name(map->dso, map->type, stdout);
-	return 0;
+	if (!user)
+		return available_kernel_funcs(target);
+
+	symbol_conf.try_vmlinux_path = false;
+	symbol_conf.sort_by_name = true;
+	ret = symbol__init();
+	if (ret < 0) {
+		pr_err("Failed to init symbol map.\n");
+		return ret;
+	}
+	map = dso__new_map(target);
+	ret = __show_available_funcs(map);
+	dso__delete(map->dso);
+	map__delete(map);
+	return ret;
 }
 
 #define DEFAULT_FUNC_FILTER "!_*"
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 9e8c846..f9f3de8 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -131,8 +131,8 @@ extern int show_line_range(struct line_range *lr, const char *module);
 extern int show_available_vars(struct perf_probe_event *pevs, int npevs,
 			       int max_probe_points, const char *module,
 			       struct strfilter *filter, bool externs);
-extern int show_available_funcs(const char *module, struct strfilter *filter);
-
+extern int show_available_funcs(const char *module, struct strfilter *filter,
+				bool user);
 
 /* Maximum index number of event-name postfix */
 #define MAX_EVENT_INDEX	1024

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 25/26]   perf: Documentation for perf uprobes
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:05   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton


Modify perf-probe.txt to include documentation for uprobe-based probing.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 tools/perf/Documentation/perf-probe.txt |   14 ++++++++++++++
 1 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt
index 800775e..3c98a54 100644
--- a/tools/perf/Documentation/perf-probe.txt
+++ b/tools/perf/Documentation/perf-probe.txt
@@ -78,6 +78,8 @@ OPTIONS
 -F::
 --funcs::
 	Show available functions in given module or kernel.
+	With -x/--exec, can also list functions in a user space executable
+	/ shared library.
 
 --filter=FILTER::
 	(Only for --vars and --funcs) Set filter. FILTER is a combination of glob
@@ -98,6 +100,11 @@ OPTIONS
 --max-probes::
 	Set the maximum number of probe points for an event. Default is 128.
 
+-x::
+--exec=PATH::
+	Specify path to the executable or shared library file for user
+	space tracing. Can also be used with --funcs option.
+
 PROBE SYNTAX
 ------------
 Probe points are defined by following syntax.
@@ -182,6 +189,13 @@ Delete all probes on schedule().
 
  ./perf probe --del='schedule*'
 
+Add probes at zfree() function on /bin/zsh
+
+ ./perf probe -x /bin/zsh zfree
+
+Add probes at malloc() function on libc
+
+ ./perf probe -x /lib/libc.so.6 malloc
 
 SEE ALSO
 --------

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 12:05   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 12:05 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Steven Rostedt, Srikar Dronamraju, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML


- Queue signals delivered from the time we start single-stepping until
  post-processing completes. The queueing is done on a per-task basis.
- After single-step completion, dequeue and redeliver the queued signals.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/uprobes.h |    3 ++-
 kernel/signal.c         |   22 +++++++++++++++++++++-
 kernel/uprobes.c        |   22 ++++++++++++++++------
 3 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a407d17..189cdce 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -24,7 +24,7 @@
  */
 
 #include <linux/rbtree.h>
-
+#include <linux/signal.h>	/* sigpending */
 struct vm_area_struct;
 #ifdef CONFIG_ARCH_SUPPORTS_UPROBES
 #include <asm/uprobes.h>
@@ -90,6 +90,7 @@ struct uprobe_task {
 	struct uprobe_task_arch_info tskinfo;
 
 	struct uprobe *active_uprobe;
+	struct sigpending delayed;
 };
 
 /*
diff --git a/kernel/signal.c b/kernel/signal.c
index 291c970..48b8c7c 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1034,6 +1034,11 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 		return 0;
 
 	pending = group ? &t->signal->shared_pending : &t->pending;
+#ifdef CONFIG_UPROBES
+	if (!group && t->utask && t->utask->active_uprobe)
+		pending = &t->utask->delayed;
+#endif
+
 	/*
 	 * Short-circuit ignored signals and support queuing
 	 * exactly one non-rt signal, so that we can get more
@@ -1106,6 +1111,11 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
 		}
 	}
 
+#ifdef CONFIG_UPROBES
+	if (!group && t->utask && t->utask->active_uprobe)
+		return 0;
+#endif
+
 out_set:
 	signalfd_notify(t, sig);
 	sigaddset(&pending->signal, sig);
@@ -1569,6 +1579,13 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
 	}
 	q->info.si_overrun = 0;
 
+#ifdef CONFIG_UPROBES
+	if (!group && t->utask && t->utask->active_uprobe) {
+		pending = &t->utask->delayed;
+		list_add_tail(&q->list, &pending->list);
+		goto out;
+	}
+#endif
 	signalfd_notify(t, sig);
 	pending = group ? &t->signal->shared_pending : &t->pending;
 	list_add_tail(&q->list, &pending->list);
@@ -2199,7 +2216,10 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
 			spin_unlock_irq(&sighand->siglock);
 			goto relock;
 		}
-
+#ifdef CONFIG_UPROBES
+		if (current->utask && current->utask->active_uprobe)
+			break;
+#endif
 		signr = dequeue_signal(current, &current->blocked, info);
 
 		if (!signr)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index ca1f622..d065fa7 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1298,11 +1298,14 @@ void free_uprobe_utask(struct task_struct *tsk)
 static struct uprobe_task *add_utask(void)
 {
 	struct uprobe_task *utask;
+	struct sigpending *delayed;
 
 	utask = kzalloc(sizeof *utask, GFP_KERNEL);
 	if (unlikely(utask == NULL))
 		return ERR_PTR(-ENOMEM);
 
+	delayed = &utask->delayed;
+	INIT_LIST_HEAD(&delayed->list);
 	utask->active_uprobe = NULL;
 	current->utask = utask;
 	return utask;
@@ -1337,6 +1340,16 @@ static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
 	return true;
 }
 
+static void pushback_signals(struct sigpending *pending)
+{
+	struct sigqueue *q, *tmpq;
+
+	list_for_each_entry_safe(q, tmpq, &pending->list, list) {
+		list_del(&q->list);
+		send_sigqueue(q, current, 0);
+	}
+}
+
 /*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
@@ -1373,7 +1386,6 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			if (!utask)
 				goto cleanup_ret;
 		}
-		/* TODO Start queueing signals. */
 		utask->active_uprobe = u;
 		handler_chain(u, regs);
 		utask->state = UTASK_SSTEP;
@@ -1390,8 +1402,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			utask->state = UTASK_RUNNING;
 			user_disable_single_step(current);
 			xol_free_insn_slot(current);
-
-			/* TODO Stop queueing signals. */
+			pushback_signals(&current->utask->delayed);
 		}
 	}
 	return;
@@ -1404,9 +1415,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
 	if (u) {
 		put_uprobe(u);
 		set_instruction_pointer(regs, probept);
-	} else {
-		/*TODO Return SIGTRAP signal */
-	}
+	} else
+		send_sig(SIGTRAP, current, 0);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 330+ messages in thread

 	list_add_tail(&q->list, &pending->list);
@@ -2199,7 +2216,10 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
 			spin_unlock_irq(&sighand->siglock);
 			goto relock;
 		}
-
+#ifdef CONFIG_UPROBES
+		if (current->utask && current->utask->active_uprobe)
+			break;
+#endif
 		signr = dequeue_signal(current, &current->blocked, info);
 
 		if (!signr)
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index ca1f622..d065fa7 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1298,11 +1298,14 @@ void free_uprobe_utask(struct task_struct *tsk)
 static struct uprobe_task *add_utask(void)
 {
 	struct uprobe_task *utask;
+	struct sigpending *delayed;
 
 	utask = kzalloc(sizeof *utask, GFP_KERNEL);
 	if (unlikely(utask == NULL))
 		return ERR_PTR(-ENOMEM);
 
+	delayed = &utask->delayed;
+	INIT_LIST_HEAD(&delayed->list);
 	utask->active_uprobe = NULL;
 	current->utask = utask;
 	return utask;
@@ -1337,6 +1340,16 @@ static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
 	return true;
 }
 
+static void pushback_signals(struct sigpending *pending)
+{
+	struct sigqueue *q, *tmpq;
+
+	list_for_each_entry_safe(q, tmpq, &pending->list, list) {
+		list_del(&q->list);
+		send_sigqueue(q, current, 0);
+	}
+}
+
 /*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
@@ -1373,7 +1386,6 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			if (!utask)
 				goto cleanup_ret;
 		}
-		/* TODO Start queueing signals. */
 		utask->active_uprobe = u;
 		handler_chain(u, regs);
 		utask->state = UTASK_SSTEP;
@@ -1390,8 +1402,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			utask->state = UTASK_RUNNING;
 			user_disable_single_step(current);
 			xol_free_insn_slot(current);
-
-			/* TODO Stop queueing signals. */
+			pushback_signals(&current->utask->delayed);
 		}
 	}
 	return;
@@ -1404,9 +1415,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
 	if (u) {
 		put_uprobe(u);
 		set_instruction_pointer(regs, probept);
-	} else {
-		/*TODO Return SIGTRAP signal */
-	}
+	} else
+		send_sig(SIGTRAP, current, 0);
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 0/26]   Uprobes patchset with perf probe support
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-09-20 13:34   ` Christoph Hellwig
  -1 siblings, 0 replies; 330+ messages in thread
From: Christoph Hellwig @ 2011-09-20 13:34 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath, LKML

On Tue, Sep 20, 2011 at 05:29:38PM +0530, Srikar Dronamraju wrote:
> - Uses i_mutex instead of uprobes_mutex.

What for exactly?  I'm pretty strongly against introducing even more
uses for i_mutex; it's already way too overloaded with different
meanings.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 0/26]   Uprobes patchset with perf probe support
  2011-09-20 13:34   ` Christoph Hellwig
@ 2011-09-20 14:12     ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 14:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

* Christoph Hellwig <hch@infradead.org> [2011-09-20 09:34:01]:

> On Tue, Sep 20, 2011 at 05:29:38PM +0530, Srikar Dronamraju wrote:
> > - Uses i_mutex instead of uprobes_mutex.
> 
> What for exactly?  I'm pretty strict against introducing even more
> uses for i_mutex, it's already way to overloaded with different
> meanings.
> 


There could be multiple simultaneous requests for adding/removing a
probe at the same location, i.e. the same inode + offset. These
requests have to be serialized.

To serialize them, the last patchset used a uprobes-specific mutex
(uprobes_mutex). However, using uprobes_mutex means serializing
requests for unrelated files: if a request to probe libpthread arrives
while we are inserting/deleting a probe in libc, the libpthread
request would wait unnecessarily.

With i_mutex, these two requests can run in parallel. It also means I
don't need to introduce yet another lock.

I had proposed this while answering one of the comments on the last
patchset. Since I didn't hear any complaints, I went ahead and
implemented it.

I could use any other inode/file/mapping based sleepable lock that
ranks above mmap_sem. Please let me know if we have alternatives.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 0/26]   Uprobes patchset with perf probe support
  2011-09-20 14:12     ` Srikar Dronamraju
@ 2011-09-20 14:28       ` Christoph Hellwig
  -1 siblings, 0 replies; 330+ messages in thread
From: Christoph Hellwig @ 2011-09-20 14:28 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Christoph Hellwig, Peter Zijlstra, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On Tue, Sep 20, 2011 at 07:42:04PM +0530, Srikar Dronamraju wrote:
> I could use any other inode/file/mapping based sleepable lock that is of
> higher order than mmap_sem. Can you please let me know if we have
> alternatives.

Please do not overload unrelated locks for this, but add a specific one.

There's two options:

 (a) add it to the inode (conditionally)
 (b) use global, hashed locks

I think (b) is good enough as adding/removing probes isn't exactly the
most critical fast path.
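For reference, option (b) could look something like the following userspace sketch: a fixed table of locks indexed by a hash of the inode pointer, so that unrelated files almost always take different locks while the same inode always maps to the same one. All names here are illustrative stand-ins, not code from the patchset.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative sketch of option (b): a small global table of locks,
 * selected by hashing the inode pointer. Unrelated inodes usually map
 * to different slots, so register/unregister requests on different
 * files can run in parallel without overloading i_mutex. */
#define UPROBES_HASH_SZ 13

/* Statically allocated; a zero-filled pthread mutex is assumed to be a
 * valid initializer here (true on glibc). */
static pthread_mutex_t uprobes_mutex[UPROBES_HASH_SZ];

/* Pick a lock slot for an inode; divide by a cacheline-ish stride so
 * that neighbouring allocations spread across slots. */
static unsigned int uprobes_hash_slot(const void *inode)
{
	return (unsigned int)(((unsigned long)inode / 64) % UPROBES_HASH_SZ);
}

static pthread_mutex_t *uprobes_hash(const void *inode)
{
	return &uprobes_mutex[uprobes_hash_slot(inode)];
}
```

Two register requests for the same inode would then contend on `uprobes_hash(inode)` instead of i_mutex, which matches the "not a critical fast path" observation: a few hash collisions between unrelated files are harmless here.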


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 0/26]   Uprobes patchset with perf probe support
  2011-09-20 14:28       ` Christoph Hellwig
@ 2011-09-20 15:19         ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-20 15:19 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Peter Zijlstra, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

* Christoph Hellwig <hch@infradead.org> [2011-09-20 10:28:43]:

> On Tue, Sep 20, 2011 at 07:42:04PM +0530, Srikar Dronamraju wrote:
> > I could use any other inode/file/mapping based sleepable lock that is of
> > higher order than mmap_sem. Can you please let me know if we have
> > alternatives.
> 
> Please do not overload unrelated locks for this, but add a specific one.
> 
> There's two options:
> 
>  (a) add it to the inode (conditionally)
>  (b) use global, hashed locks
> 
> I think (b) is good enough as adding/removing probes isn't exactly the
> most critical fast path.
> 

Agreed, I will replace i_mutex with uprobes-specific hashed locks.
I will make this change as part of the next patchset.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 1/26]   uprobes: Auxillary routines to insert, find, delete uprobes
  2011-09-20 11:59   ` Srikar Dronamraju
@ 2011-09-20 15:42     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 330+ messages in thread
From: Stefan Hajnoczi @ 2011-09-20 15:42 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton

On Tue, Sep 20, 2011 at 05:29:49PM +0530, Srikar Dronamraju wrote:
> +static void delete_uprobe(struct uprobe *uprobe)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&uprobes_treelock, flags);
> +	rb_erase(&uprobe->rb_node, &uprobes_tree);
> +	spin_unlock_irqrestore(&uprobes_treelock, flags);
> +	put_uprobe(uprobe);
> +	iput(uprobe->inode);

Use-after-free when put_uprobe() kfrees() the uprobe?

Stefan
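The hazard Stefan points out can be illustrated in miniature: once `put_uprobe()` drops the final reference, the object may be freed, so the following `iput(uprobe->inode)` dereferences freed memory. The usual fix is to cache the field while the reference is still held (or reorder the two puts). A hedged userspace sketch of the safe ordering; all names are stand-ins, not the patchset's code:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for a refcounted uprobe that pins an inode reference. */
struct obj {
	int refcount;
	void *inode;
};

static void *last_released_inode; /* records what the iput stand-in saw */

static void release_inode(void *inode) { last_released_inode = inode; }

static void put_obj(struct obj *o)
{
	if (--o->refcount == 0)
		free(o); /* after this, any o->field access is use-after-free */
}

/* Safe ordering: read the field before the last reference can drop. */
static void delete_obj(struct obj *o)
{
	void *inode = o->inode;	/* cache while o is still guaranteed alive */

	put_obj(o);		/* may free o */
	release_inode(inode);	/* uses the cached pointer, not o->inode */
}
```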

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-09-20 12:00   ` Srikar Dronamraju
@ 2011-09-20 16:50     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 330+ messages in thread
From: Stefan Hajnoczi @ 2011-09-20 16:50 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton

On Tue, Sep 20, 2011 at 05:30:22PM +0530, Srikar Dronamraju wrote:
> +int register_uprobe(struct inode *inode, loff_t offset,
> +				struct uprobe_consumer *consumer)
> +{
> +	struct uprobe *uprobe;
> +	int ret = 0;
> +
> +	inode = igrab(inode);
> +	if (!inode || !consumer || consumer->next)
> +		return -EINVAL;
> +
> +	if (offset > inode->i_size)
> +		return -EINVAL;
> +
> +	mutex_lock(&inode->i_mutex);
> +	uprobe = alloc_uprobe(inode, offset);
> +	if (!uprobe)
> +		return -ENOMEM;

The error returns above don't iput(inode).  And inode->i_mutex stays
locked on this return.

> +void unregister_uprobe(struct inode *inode, loff_t offset,
> +				struct uprobe_consumer *consumer)
> +{
> +	struct uprobe *uprobe;
> +
> +	inode = igrab(inode);
> +	if (!inode || !consumer)
> +		return;
> +
> +	if (offset > inode->i_size)
> +		return;
> +
> +	uprobe = find_uprobe(inode, offset);
> +	if (!uprobe)
> +		return;
> +
> +	if (!del_consumer(uprobe, consumer)) {
> +		put_uprobe(uprobe);
> +		return;
> +	}

More returns that do not iput(inode).

Stefan
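The conventional fix for such error paths is a single exit ladder that undoes acquisitions in reverse order, so no early return can leak the igrab() reference or leave i_mutex held. A hedged userspace sketch of the shape, with stub lock/ref functions standing in for i_mutex and igrab()/iput() (none of this is the patchset's code):

```c
#include <assert.h>

static int inode_refs;	/* stand-in for the igrab()/iput() balance */
static int mutex_held;	/* stand-in for i_mutex being held */

static void *igrab_stub(void *inode) { inode_refs++; return inode; }
static void iput_stub(void *inode)   { (void)inode; inode_refs--; }
static void lock_stub(void)          { mutex_held = 1; }
static void unlock_stub(void)        { mutex_held = 0; }

/* Shape of a register path where every failure funnels through one
 * unwind ladder, releasing in reverse acquisition order. */
static int register_probe_stub(void *inode, int alloc_fails)
{
	int ret = -22;			/* -EINVAL */

	inode = igrab_stub(inode);
	if (!inode)
		return ret;		/* nothing acquired yet */

	lock_stub();
	if (alloc_fails) {		/* e.g. alloc_uprobe() returning NULL */
		ret = -12;		/* -ENOMEM */
		goto out_unlock;
	}
	ret = 0;

out_unlock:
	unlock_stub();			/* always released */
	iput_stub(inode);		/* igrab() always balanced */
	return ret;
}
```

Whatever path is taken, the lock is dropped and the inode reference is put exactly once.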

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-20 12:00   ` Srikar Dronamraju
@ 2011-09-20 17:03     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 330+ messages in thread
From: Stefan Hajnoczi @ 2011-09-20 17:03 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Tue, Sep 20, 2011 at 05:30:40PM +0530, Srikar Dronamraju wrote:
> +static void build_probe_list(struct inode *inode, struct list_head *head)
> +{
> +	struct uprobe *uprobe;
> +	struct rb_node *n;
> +	unsigned long flags;
> +
> +	n = uprobes_tree.rb_node;
> +	spin_lock_irqsave(&uprobes_treelock, flags);

Not sure whether grabbing root.rb_node outside the spinlock is safe?  If
the tree is rotated on another CPU you could catch an out-of-date node?

> +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> +		struct inode *inode)
> +{
> +	struct uprobe *uprobe;
> +	struct rb_node *n;
> +	unsigned long flags;
> +
> +	n = uprobes_tree.rb_node;
> +	spin_lock_irqsave(&uprobes_treelock, flags);

Same here.

Stefan
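The concern can be addressed mechanically by deferring the root read until after the lock is taken. A minimal sketch of the reordering, using a pthread mutex and stand-in types rather than the kernel's rbtree and spinlock:

```c
#include <assert.h>
#include <pthread.h>

struct node { int key; };
struct tree { struct node *root; };

static pthread_mutex_t treelock = PTHREAD_MUTEX_INITIALIZER;

/* Read the root only while holding the lock, so a concurrent rotation
 * cannot hand us a stale interior node to start walking from. */
static struct node *tree_root_locked(struct tree *t)
{
	struct node *n;

	pthread_mutex_lock(&treelock);
	n = t->root;		/* moved inside the critical section */
	pthread_mutex_unlock(&treelock);
	return n;
}
```

In the patch's terms this just means moving the `n = uprobes_tree.rb_node;` assignment below the `spin_lock_irqsave()` call.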

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-20 12:01   ` Srikar Dronamraju
@ 2011-09-20 17:13     ` Stefan Hajnoczi
  -1 siblings, 0 replies; 330+ messages in thread
From: Stefan Hajnoczi @ 2011-09-20 17:13 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML

On Tue, Sep 20, 2011 at 05:31:27PM +0530, Srikar Dronamraju wrote:
> 
> The instruction analysis is based on x86 instruction decoder and
> determines if an instruction can be probed and determines the necessary
> fixups after singlestep.  Instruction analysis is done at probe
> insertion time so that we avoid having to repeat the same analysis every
> time a probe is hit.
> 
> Signed-off-by: Jim Keniston <jkenisto@us.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/x86/Kconfig               |    3 
>  arch/x86/include/asm/uprobes.h |   42 ++++
>  arch/x86/kernel/Makefile       |    1 
>  arch/x86/kernel/uprobes.c      |  385 ++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 431 insertions(+), 0 deletions(-)
>  create mode 100644 arch/x86/include/asm/uprobes.h
>  create mode 100644 arch/x86/kernel/uprobes.c

You've probably thought of this but it would be nice to skip XOL for
nops.  This would be a common case with static probes (e.g. sdt.h) where
the probe template includes a nop where we can easily plant int $0x3.

Perhaps a check can be added to the analysis so that after calling the
filter/handler we can immediately continue the process instead of
executing the (useless) nop out-of-line.

Stefan
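The suggested fast path amounts to: when the original instruction is a plain nop, skip the out-of-line single-step after the handler and simply advance the instruction pointer past the probepoint. A hedged sketch of that check, with a stand-in regs type and only the single-byte 0x90 nop handled (not the patchset's code, and real code would also want the multi-byte nop forms):

```c
#include <assert.h>

struct regs { unsigned long ip; };

/* If the probed instruction is a one-byte nop (0x90), there is nothing
 * to execute out of line: resume right after the probepoint instead of
 * setting up an XOL slot and single-stepping. */
static int skip_xol_for_nop(struct regs *r, const unsigned char *orig_insn,
			    unsigned long probept)
{
	if (orig_insn[0] == 0x90) {	/* single-byte nop */
		r->ip = probept + 1;	/* resume after the probed nop */
		return 1;		/* handled, no single-step needed */
	}
	return 0;			/* fall back to XOL single-step */
}
```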

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-20 17:13     ` Stefan Hajnoczi
@ 2011-09-20 18:12       ` Christoph Hellwig
  -1 siblings, 0 replies; 330+ messages in thread
From: Christoph Hellwig @ 2011-09-20 18:12 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig, Andi Kleen,
	Thomas Gleixner, Jonathan Corbet, Oleg Nesterov, Andrew Morton,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli, LKML

On Tue, Sep 20, 2011 at 06:13:10PM +0100, Stefan Hajnoczi wrote:
> You've probably thought of this but it would be nice to skip XOL for
> nops.  This would be a common case with static probes (e.g. sdt.h) where
> the probe template includes a nop where we can easily plant int $0x3.

Do we now have sdt.h support for uprobes?  That's one of the killer
features that always seemed to get postponed.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-20 18:12       ` Christoph Hellwig
@ 2011-09-20 20:53         ` Stefan Hajnoczi
  -1 siblings, 0 replies; 330+ messages in thread
From: Stefan Hajnoczi @ 2011-09-20 20:53 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Masami Hiramatsu, Hugh Dickins, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML

On Tue, Sep 20, 2011 at 02:12:25PM -0400, Christoph Hellwig wrote:
> On Tue, Sep 20, 2011 at 06:13:10PM +0100, Stefan Hajnoczi wrote:
> > You've probably thought of this but it would be nice to skip XOL for
> > nops.  This would be a common case with static probes (e.g. sdt.h) where
> > the probe template includes a nop where we can easily plant int $0x3.
> 
> Do we now have sdt.h support for uprobes?  That's one of the killer
> features that always seemed to get postponed.

Not yet, but it's a question of doing roughly what SystemTap does to
parse the appropriate ELF sections, and then putting those probes into
uprobes.

Masami looked at this and found that SystemTap sdt.h currently requires
an extra userspace memory store in order to activate probes.  Each probe
has a "semaphore" 16-bit counter which applications may test before
hitting the probe itself.  This is used to avoid overhead in
applications that do expensive argument processing (e.g. creating
strings) for probes.
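The semaphore mechanism described above can be sketched in user space roughly as follows. This is a hypothetical simplification, not SystemTap's actual sdt.h macros: the counter name, the probe function, and the single global hit counter are all stand-ins.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical sketch of the sdt.h "semaphore" described above: a
 * 16-bit per-probe counter that the application tests before doing
 * expensive argument processing.  A tracer increments the counter
 * when a consumer attaches and decrements it on detach. */
static unsigned short my_probe_semaphore; /* sdt.h places this in a dedicated ELF section */
static int probe_hits;

static void trace_request(int pid)
{
	if (my_probe_semaphore != 0) {	/* any consumer attached? */
		char args[64];

		/* stands in for costly argument setup, e.g. building strings */
		snprintf(args, sizeof(args), "pid=%d", pid);
		probe_hits++;		/* stands in for the probe site itself */
	}
}
```

With no consumer attached, the expensive path is skipped entirely; a tracer flips the counter (e.g. via ptrace or a uprobe) to activate it.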

But this should be solvable, so it would be possible to use perf-probe(1)
on an sdt.h-enabled binary.  Some distros already ship such binaries!

Stefan

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-20 17:03     ` Stefan Hajnoczi
@ 2011-09-21  4:03       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-21  4:03 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Peter Zijlstra, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

* Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> [2011-09-20 18:03:10]:

> On Tue, Sep 20, 2011 at 05:30:40PM +0530, Srikar Dronamraju wrote:
> > +static void build_probe_list(struct inode *inode, struct list_head *head)
> > +{
> > +	struct uprobe *uprobe;
> > +	struct rb_node *n;
> > +	unsigned long flags;
> > +
> > +	n = uprobes_tree.rb_node;
> > +	spin_lock_irqsave(&uprobes_treelock, flags);
> 
> Not sure whether grabbing root.rb_node outside the spinlock is safe?  If
> the tree is rotated on another CPU you could catch an out-of-date node?


Agreed, it's better to access the node under the spinlock.
I shall correct this.
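The agreed fix can be sketched in user space as follows, with the lock modeled as a plain counter so the ordering can be checked. All names and types here are hypothetical simplifications of the patch's rb-tree walk:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the fix agreed above: the tree root must be read only
 * AFTER the treelock is taken, so a concurrent rotation cannot hand
 * the walker a stale node. */
struct node { long key; struct node *left, *right; };

static struct node *tree_root;
static int treelock_held;

static void treelock_acquire(void) { treelock_held = 1; } /* like spin_lock_irqsave() */
static void treelock_release(void) { treelock_held = 0; } /* like spin_unlock_irqrestore() */

static struct node *find_node(long key)
{
	struct node *n;

	treelock_acquire();
	assert(treelock_held);	/* the root snapshot happens under the lock */
	n = tree_root;		/* moved after the lock, per the review comment */
	while (n && n->key != key)
		n = (key < n->key) ? n->left : n->right;
	treelock_release();
	return n;
}
```

In the kernel patch the same move applies to both build_probe_list() and dec_mm_uprobes_count(): the `n = uprobes_tree.rb_node` line drops below the spin_lock_irqsave() call.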
 
> > +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> > +		struct inode *inode)
> > +{
> > +	struct uprobe *uprobe;
> > +	struct rb_node *n;
> > +	unsigned long flags;
> > +
> > +	n = uprobes_tree.rb_node;
> > +	spin_lock_irqsave(&uprobes_treelock, flags);
> 
> Same here.

Okay.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-09-20 16:50     ` Stefan Hajnoczi
@ 2011-09-21  4:07       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-21  4:07 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Peter Zijlstra, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton

* Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> [2011-09-20 17:50:19]:

> On Tue, Sep 20, 2011 at 05:30:22PM +0530, Srikar Dronamraju wrote:
> > +int register_uprobe(struct inode *inode, loff_t offset,
> > +				struct uprobe_consumer *consumer)
> > +{
> > +	struct uprobe *uprobe;
> > +	int ret = 0;
> > +
> > +	inode = igrab(inode);
> > +	if (!inode || !consumer || consumer->next)
> > +		return -EINVAL;
> > +
> > +	if (offset > inode->i_size)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&inode->i_mutex);
> > +	uprobe = alloc_uprobe(inode, offset);
> > +	if (!uprobe)
> > +		return -ENOMEM;
> 
> The error returns above don't iput(inode).  And inode->i_mutex stays
> locked on this return.

Yes, I will fix this by clubbing the !uprobe check with the next condition.
Thanks for pointing this out.

> 
> > +void unregister_uprobe(struct inode *inode, loff_t offset,
> > +				struct uprobe_consumer *consumer)
> > +{
> > +	struct uprobe *uprobe;
> > +
> > +	inode = igrab(inode);
> > +	if (!inode || !consumer)
> > +		return;
> > +
> > +	if (offset > inode->i_size)
> > +		return;
> > +
> > +	uprobe = find_uprobe(inode, offset);
> > +	if (!uprobe)
> > +		return;
> > +
> > +	if (!del_consumer(uprobe, consumer)) {
> > +		put_uprobe(uprobe);
> > +		return;
> > +	}
> 
> More returns that do not iput(inode).

Yes, I will fix these too.
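The cleanup discipline the review asks for can be sketched with the usual kernel goto-unwind pattern. The grab/put/lock/unlock helpers below are hypothetical counters standing in for igrab()/iput() and i_mutex, so the balance on every exit path can be checked:

```c
#include <assert.h>

/* Sketch of the fix discussed above: every error return must drop the
 * reference taken at entry, and must never leave the mutex held. */
static int refs, lock_depth;

static void grab(void)   { refs++; }		/* like igrab(inode) */
static void put(void)    { refs--; }		/* like iput(inode) */
static void lock(void)   { lock_depth++; }	/* like mutex_lock(&inode->i_mutex) */
static void unlock(void) { lock_depth--; }	/* like mutex_unlock(&inode->i_mutex) */

static int register_probe(long offset, long size, int alloc_ok)
{
	int ret = 0;

	grab();
	if (offset > size) {
		ret = -22;		/* -EINVAL */
		goto put_ref;		/* drop the reference on this path too */
	}
	lock();
	if (!alloc_ok) {
		ret = -12;		/* -ENOMEM */
		goto unlock_out;	/* never return with the mutex held */
	}
	/* ... install the probe ... */
unlock_out:
	unlock();
put_ref:
	put();
	return ret;
}
```

Each early return funnels through the unwind labels, so refs and lock_depth return to zero on success and failure alike.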

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-20 12:01   ` Srikar Dronamraju
@ 2011-09-22  1:05     ` Josh Stone
  -1 siblings, 0 replies; 330+ messages in thread
From: Josh Stone @ 2011-09-22  1:05 UTC (permalink / raw)
  To: Srikar Dronamraju, Masami Hiramatsu
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

Hi Srikar,

I noticed that this produces a compiler warning on i686 from test_bit ->
variable_test_bit, that I think must be addressed.  It is similar to
something I fixed earlier in SystemTap's uprobes, but at that time I
didn't fully analyze the issue, so I'll attempt it now.

Masami, please note that what I describe below also happens on kprobes'
bitvector twobyte_is_boostable.


On 09/20/2011 05:01 AM, Srikar Dronamraju wrote:
[...]
> +static const u32 good_insns_64[256 / 32] = {
[...]
> +static const u32 good_insns_32[256 / 32] = {
[...]
> +static const u32 good_2byte_insns[256 / 32] = {
[...]
> +static int validate_insn_32bits(struct uprobe *uprobe, struct insn *insn)
> +{
> +	insn_init(insn, uprobe->insn, false);
> +
> +	/* Skip good instruction prefixes; reject "bad" ones. */
> +	insn_get_opcode(insn);
> +	if (is_prefix_bad(insn))
> +		return -ENOTSUPP;
> +	if (test_bit(OPCODE1(insn), (unsigned long *) good_insns_32))
> +		return 0;
> +	if (insn->opcode.nbytes == 2) {
> +		if (test_bit(OPCODE2(insn),
> +					(unsigned long *) good_2byte_insns))
> +			return 0;
> +	}
> +	return -ENOTSUPP;
> +}

gcc version 4.6.0 20110603 (Red Hat 4.6.0-10) (GCC) says:
>   CC      arch/x86/kernel/uprobes.o
> In file included from include/linux/bitops.h:22:0,
>                  from include/linux/kernel.h:17,
>                  from arch/x86/kernel/uprobes.c:24:
> /home/jistone/linux-2.6/arch/x86/include/asm/bitops.h: In function ‘analyze_insn’:
> /home/jistone/linux-2.6/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default]
> /home/jistone/linux-2.6/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default]

That's from variable_test_bit, whose second argument is volatile const
unsigned long *, then referenced with asm "m" (*(unsigned long *)addr).

The fix I used in SystemTap's case was to make the bitvectors volatile
as well.  But now I want to better understand *why* this fix works.  There's
no real difference in the generated code:

> @@ -91,8 +91,8 @@ Disassembly of section .text:
>    63:	90                   	nop
>    64:	8d 74 26 00          	lea    0x0(%esi,%eiz,1),%esi
>    68:	0f b6 45 bc          	movzbl -0x44(%ebp),%eax
> -  6c:	0f a3 05 00 00 00 00 	bt     %eax,0x0
> -			6f: R_386_32	.rodata.cst4
> +  6c:	0f a3 05 20 00 00 00 	bt     %eax,0x20
> +			6f: R_386_32	.data
>    73:	19 c0                	sbb    %eax,%eax
>    75:	85 c0                	test   %eax,%eax
>    77:	75 1c                	jne    95 <analyze_insn+0x95>
> @@ -100,8 +100,8 @@ Disassembly of section .text:
>    7d:	b8 f4 fd ff ff       	mov    $0xfffffdf4,%eax
>    82:	75 d9                	jne    5d <analyze_insn+0x5d>
>    84:	0f b6 55 bd          	movzbl -0x43(%ebp),%edx
> -  88:	0f a3 15 04 00 00 00 	bt     %edx,0x4
> -			8b: R_386_32	.rodata.cst4
> +  88:	0f a3 15 00 00 00 00 	bt     %edx,0x0
> +			8b: R_386_32	.data
>    8f:	19 d2                	sbb    %edx,%edx
>    91:	85 d2                	test   %edx,%edx
>    93:	74 c8                	je     5d <analyze_insn+0x5d>

The volatile makes the bitvectors move from .rodata to .data, not all
that surprising, I guess.  Then I figured out that the former case only
has the first word of good_insns_32 and good_2byte_insns, whereas the
latter volatile case keeps the whole thing.

as-is:
> Contents of section .rodata.cst4:
>  0000 7f7f7f7f 00c0fffe                    ........

with volatile:
> Contents of section .data:
>  0000 00c0fffe 0fff0e00 ffffffff ffffffc0  ................
>  0010 ffffffff 3fbffffe fffffeff fffffe7f  ....?...........
>  0020 7f7f7f7f bfbfbfbf ffffffff 370fffff  ............7...
>  0030 ffffffff ffffffff ff0fbfff 0f0fecf3  ................
>  0040 3f3f3f3f 3f3f3f3f 0000ffff 380fffff  ????????....8...
>  0050 fbffffff ffffffff cf0f8fff 0f0fecf3  ................

So declaring the bitvectors volatile makes gcc keep them around in
full.  Otherwise it apparently looks to gcc like only the first word is
used by variable_test_bit's asm statement.

On x86_64, the warning doesn't appear, and even in .rodata the
bitvectors appear in full.  I'm guessing that the pointer aliasing from
u32* to a 64-bit unsigned long* makes gcc forgo the data elision.

I wonder if variable_test_bit() could/should force aliasing to fix this
for all callers.  But for now, marking volatile does the trick.
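For comparison, a portable lookup that indexes the u32 words directly sidesteps the u32* to unsigned long* cast that provokes the warning. This is a sketch, not the kernel's test_bit(); the table contents below are hypothetical, marking only 0x00 and 0x90 (nop) as "good":

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the 256-entry opcode bitmap discussed above: eight u32
 * words, one bit per opcode.  Indexing the u32 words directly avoids
 * aliasing them as unsigned long for variable_test_bit(). */
static const uint32_t good_insns[256 / 32] = {
	[0x00 / 32] = 1u << (0x00 % 32),	/* hypothetical: 0x00 is "good" */
	[0x90 / 32] = 1u << (0x90 % 32),	/* hypothetical: 0x90 (nop) is "good" */
};

static int opcode_is_good(uint8_t opcode)
{
	/* word = opcode / 32, bit = opcode % 32 */
	return (good_insns[opcode / 32] >> (opcode % 32)) & 1;
}
```

Because every word of the array is read through ordinary C expressions, gcc cannot conclude that only the first word is used, regardless of volatility.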

Thanks,
Josh

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-20 20:53         ` Stefan Hajnoczi
@ 2011-09-23 11:53           ` Masami Hiramatsu
  -1 siblings, 0 replies; 330+ messages in thread
From: Masami Hiramatsu @ 2011-09-23 11:53 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Christoph Hellwig, Srikar Dronamraju, Peter Zijlstra,
	Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Hugh Dickins, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML

(2011/09/21 5:53), Stefan Hajnoczi wrote:
> On Tue, Sep 20, 2011 at 02:12:25PM -0400, Christoph Hellwig wrote:
>> On Tue, Sep 20, 2011 at 06:13:10PM +0100, Stefan Hajnoczi wrote:
>>> You've probably thought of this but it would be nice to skip XOL for
>>> nops.  This would be a common case with static probes (e.g. sdt.h) where
>>> the probe template includes a nop where we can easily plant int $0x3.
>>
>> Do we now have sdt.h support for uprobes?  That's one of the killer
>> features that always seemed to get postponed.
> 
> Not yet but it's a question of doing roughly what SystemTap does to
> parse the appropriate ELF sections and then putting those probes into
> uprobes.
> 
> Masami looked at this and found that SystemTap sdt.h currently requires
> an extra userspace memory store in order to activate probes.  Each probe
> has a "semaphore" 16-bit counter which applications may test before
> hitting the probe itself.  This is used to avoid overhead in
> applications that do expensive argument processing (e.g. creating
> strings) for probes.

Indeed, originally those semaphores were designed for such use cases.
However, some applications *always* use them (e.g. qemu-kvm).

> 
> But this should be solvable so it would be possible to use perf-probe(1)
> on a std.h-enabled binary.  Some distros already ship such binaries!

I'm not sure that we should stick with the current implementation
of sdt.h. I think we'd better modify sdt.h to replace such semaphores
with a check of whether the tracepoint has been changed from a nop.
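The alternative suggested here can be sketched as follows. The single-byte probe site and the function name are hypothetical simplifications (real SDT sites use a multi-byte nop):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the nop-check alternative: instead of testing a separate
 * semaphore, the application checks whether its probe site still holds
 * the original nop, or has been patched by a tracer (on x86, the int3
 * breakpoint byte is 0xCC). */
static uint8_t probe_site = 0x90;	/* 0x90 = one-byte nop */

static int probe_is_armed(void)
{
	return probe_site != 0x90;	/* changed from nop => a tracer is attached */
}
```

This removes the semaphore store entirely: arming the probe (planting the breakpoint) is itself the signal that argument processing is worthwhile.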

Or, we can introduce ad-hoc ptrace code in perf tools for modifying
those semaphores. However, this means that the user always has to use
perf to trace applications, and it's hard to trace multiple applications
at a time (can we attach to all of them?)...

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-23 11:53           ` Masami Hiramatsu
@ 2011-09-23 16:51             ` Stefan Hajnoczi
  -1 siblings, 0 replies; 330+ messages in thread
From: Stefan Hajnoczi @ 2011-09-23 16:51 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Christoph Hellwig, Srikar Dronamraju, Peter Zijlstra,
	Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Hugh Dickins, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, LKML

On Fri, Sep 23, 2011 at 08:53:55PM +0900, Masami Hiramatsu wrote:
> (2011/09/21 5:53), Stefan Hajnoczi wrote:
> > On Tue, Sep 20, 2011 at 02:12:25PM -0400, Christoph Hellwig wrote:
> >> On Tue, Sep 20, 2011 at 06:13:10PM +0100, Stefan Hajnoczi wrote:
> > But this should be solvable so it would be possible to use perf-probe(1)
> > on a std.h-enabled binary.  Some distros already ship such binaries!
> 
> I'm not sure that we should stick on the current implementation
> of the sdt.h. I think we'd better modify the sdt.h to replace
> such semaphores with checking whether the tracepoint is changed from nop.

I like this option.  The only implication is that all userspace tracing
needs to go through uprobes if we want to support multiple consumers
tracing the same address.

> Or, we can introduce an add-hoc ptrace code to perftools for modifying
> those semaphores. However, this means that user always has to use
> perf to trace applications, and it's hard to trace multiple applications
> at a time (can we attach all of them?)...

I don't think perf needs to stay attached to the processes.  It just
needs to increment the semaphores on startup and decrement them on
shutdown.

Are you going to attempt either of these implementations?

Stefan

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 1/26]   uprobes: Auxillary routines to insert, find, delete uprobes
  2011-09-20 15:42     ` Stefan Hajnoczi
@ 2011-09-26 11:18       ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 11:18 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Srikar Dronamraju, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton

On Tue, 2011-09-20 at 16:42 +0100, Stefan Hajnoczi wrote:
> On Tue, Sep 20, 2011 at 05:29:49PM +0530, Srikar Dronamraju wrote:
> > +static void delete_uprobe(struct uprobe *uprobe)
> > +{
> > +	unsigned long flags;
> > +
> > +	spin_lock_irqsave(&uprobes_treelock, flags);
> > +	rb_erase(&uprobe->rb_node, &uprobes_tree);
> > +	spin_unlock_irqrestore(&uprobes_treelock, flags);
> > +	put_uprobe(uprobe);
> > +	iput(uprobe->inode);
> 
> Use-after-free when put_uprobe() kfrees() the uprobe?

I suspect the caller still has one, and this was the reference for being
part of the tree. But yes, that could do with a comment.

The comment near atomic_set() in __insert_uprobe() isn't too clear
either: /* get access + drop ref */ would naively seem +1 -1 = 0,
instead of +1 +1 = 2.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 1/26]   uprobes: Auxillary routines to insert, find, delete uprobes
  2011-09-20 11:59   ` Srikar Dronamraju
@ 2011-09-26 11:18     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 11:18 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Andi Kleen, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

On Tue, 2011-09-20 at 17:29 +0530, Srikar Dronamraju wrote:
> +static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> +{
> +       struct rb_node **p = &uprobes_tree.rb_node;
> +       struct rb_node *parent = NULL;
> +       struct uprobe *u;
> +       int match;
> +
> +       while (*p) {
> +               parent = *p;
> +               u = rb_entry(parent, struct uprobe, rb_node);
> +               match = match_uprobe(uprobe, u);
> +               if (!match) {
> +                       atomic_inc(&u->ref);
> +                       return u;
> +               }
> +
> +               if (match < 0)
> +                       p = &parent->rb_left;
> +               else
> +                       p = &parent->rb_right;
> +
> +       }
> +       u = NULL;
> +       rb_link_node(&uprobe->rb_node, parent, p);
> +       rb_insert_color(&uprobe->rb_node, &uprobes_tree);
> +       /* get access + drop ref */
> +       atomic_set(&uprobe->ref, 2);
> +       return u;
> +} 

If you ever want to make a 'lockless' lookup work you need to set the
refcount of the new object before it's fully visible, instead of after.

Not much of a problem now since it's fully serialized by that
uprobes_treelock thing.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 1/26]   uprobes: Auxillary routines to insert, find, delete uprobes
  2011-09-26 11:18       ` Peter Zijlstra
@ 2011-09-26 11:59         ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 11:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Stefan Hajnoczi, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, Oleg Nesterov, LKML,
	Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton

* Peter Zijlstra <peterz@infradead.org> [2011-09-26 13:18:38]:

> On Tue, 2011-09-20 at 16:42 +0100, Stefan Hajnoczi wrote:
> > On Tue, Sep 20, 2011 at 05:29:49PM +0530, Srikar Dronamraju wrote:
> > > +static void delete_uprobe(struct uprobe *uprobe)
> > > +{
> > > +	unsigned long flags;
> > > +
> > > +	spin_lock_irqsave(&uprobes_treelock, flags);
> > > +	rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > +	spin_unlock_irqrestore(&uprobes_treelock, flags);
> > > +	put_uprobe(uprobe);
> > > +	iput(uprobe->inode);
> > 
> > Use-after-free when put_uprobe() kfrees() the uprobe?
> 
> I suspect the caller still has one, and this was the reference for being
> part of the tree. But yes, that could do with a comment.
> 

Yes, the caller has a reference. However, I went ahead and changed the
order of the last two statements.

> The comment near atomic_set() in __insert_uprobe() isn't too clear
> either: /* get access + drop ref */ would naively seem +1 -1 = 0,
> instead of +1 +1 = 2.
> 

Okay, I have modified the comment to /* get access + creation ref */.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 1/26]   uprobes: Auxillary routines to insert, find, delete uprobes
  2011-09-26 11:18     ` Peter Zijlstra
@ 2011-09-26 12:02       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 12:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Andi Kleen, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

* Peter Zijlstra <peterz@infradead.org> [2011-09-26 13:18:40]:

> On Tue, 2011-09-20 at 17:29 +0530, Srikar Dronamraju wrote:
> > +static struct uprobe *__insert_uprobe(struct uprobe *uprobe)
> > +{
> > +       struct rb_node **p = &uprobes_tree.rb_node;
> > +       struct rb_node *parent = NULL;
> > +       struct uprobe *u;
> > +       int match;
> > +
> > +       while (*p) {
> > +               parent = *p;
> > +               u = rb_entry(parent, struct uprobe, rb_node);
> > +               match = match_uprobe(uprobe, u);
> > +               if (!match) {
> > +                       atomic_inc(&u->ref);
> > +                       return u;
> > +               }
> > +
> > +               if (match < 0)
> > +                       p = &parent->rb_left;
> > +               else
> > +                       p = &parent->rb_right;
> > +
> > +       }
> > +       u = NULL;
> > +       rb_link_node(&uprobe->rb_node, parent, p);
> > +       rb_insert_color(&uprobe->rb_node, &uprobes_tree);
> > +       /* get access + drop ref */
> > +       atomic_set(&uprobe->ref, 2);
> > +       return u;
> > +} 
> 
> If you ever want to make a 'lockless' lookup work you need to set the
> refcount of the new object before it's fully visible, instead of after.
> 

Agreed.

> Not much of a problem now since it's fully serialized by that
> uprobes_treelock thing.
> 

Will stick with this for now; if and when we do a lockless lookup we
can fix this.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 2/26]   Uprobes: Allow multiple consumers for an uprobe.
  2011-09-20 12:00   ` Srikar Dronamraju
@ 2011-09-26 12:29     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 12:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On Tue, 2011-09-20 at 17:30 +0530, Srikar Dronamraju wrote:
> +       con = uprobe->consumers;
> +       if (consumer == con) {
> +               uprobe->consumers = con->next;
> +               ret = true;
> +       } else {
> +               for (; con; con = con->next) {
> +                       if (con->next == consumer) {
> +                               con->next = consumer->next;
> +                               ret = true;
> +                               break;
> +                       }
> +               }
> +       } 

	struct uprobe_consumer **next = &uprobe->consumers;

	for (; *next; next = &(*next)->next) {
		if (*next == consumer) {
			*next = (*next)->next;
			ret = true;
			break;
		}
	}

Wouldn't something like that work?

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-09-20 12:00   ` Srikar Dronamraju
@ 2011-09-26 13:15     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 13:15 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Andi Kleen, Oleg Nesterov,
	LKML, Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton

On Tue, 2011-09-20 at 17:30 +0530, Srikar Dronamraju wrote:

> +static struct vma_info *__find_next_vma_info(struct list_head *head,
> +			loff_t offset, struct address_space *mapping,
> +			struct vma_info *vi)
> +{
> +	struct prio_tree_iter iter;
> +	struct vm_area_struct *vma;
> +	struct vma_info *tmpvi;
> +	loff_t vaddr;
> +	unsigned long pgoff = offset >> PAGE_SHIFT;
> +	int existing_vma;
> +
> +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> +		if (!vma || !valid_vma(vma))
> +			return NULL;
> +
> +		existing_vma = 0;
> +		vaddr = vma->vm_start + offset;
> +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +		list_for_each_entry(tmpvi, head, probe_list) {
> +			if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
> +				existing_vma = 1;
> +				break;
> +			}
> +		}
> +		if (!existing_vma &&
> +				atomic_inc_not_zero(&vma->vm_mm->mm_users)) {
> +			vi->mm = vma->vm_mm;
> +			vi->vaddr = vaddr;
> +			list_add(&vi->probe_list, head);
> +			return vi;

So the sole purpose of having that list is the linear scan above,
to test whether we've already handled this one?

Does that really matter? After all, if the probe is already installed
installing it again will return with -EEXIST, which should be easy
enough to deal with.

> +		}
> +	}
> +	return NULL;
> +}
> +
> +/*
> + * Iterate in the rmap prio tree  and find a vma where a probe has not
> + * yet been inserted.
> + */
> +static struct vma_info *find_next_vma_info(struct list_head *head,
> +			loff_t offset, struct address_space *mapping)
> +{
> +	struct vma_info *vi, *retvi;
> +	vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
> +	if (!vi)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&vi->probe_list);

Weird place for the INIT_LIST_HEAD; I would have expected it near where
the rest of vi is initialized, although it looks to be superfluous
anyway, since list_add() can handle an uninitialized entry.


> +	mutex_lock(&mapping->i_mmap_mutex);
> +	retvi = __find_next_vma_info(head, offset, mapping, vi);
> +	mutex_unlock(&mapping->i_mmap_mutex);
> +
> +	if (!retvi)
> +		kfree(vi);
> +	return retvi;
> +}
> +
> +static int __register_uprobe(struct inode *inode, loff_t offset,
> +				struct uprobe *uprobe)
> +{
> +	struct list_head try_list;
> +	struct vm_area_struct *vma;
> +	struct address_space *mapping;
> +	struct vma_info *vi, *tmpvi;
> +	struct mm_struct *mm;
> +	int ret = 0;
> +
> +	mapping = inode->i_mapping;
> +	INIT_LIST_HEAD(&try_list);
> +	while ((vi = find_next_vma_info(&try_list, offset,
> +							mapping)) != NULL) {
> +		if (IS_ERR(vi)) {
> +			ret = -ENOMEM;
> +			break;
> +		}

Here we hold neither i_mmap_mutex nor mmap_sem, so everything can change
under our feet. See below.

> +		mm = vi->mm;
> +		down_read(&mm->mmap_sem);
> +		vma = find_vma(mm, (unsigned long) vi->vaddr);
> +		if (!vma || !valid_vma(vma)) {

No validation that it's indeed the same vma you found earlier? At the very
least we should validate the vma returned from find_vma() is indeed a
mapping of the inode we're after and that the offset is still to be
found at vaddr.

> +			list_del(&vi->probe_list);
> +			kfree(vi);
> +			up_read(&mm->mmap_sem);
> +			mmput(mm);
> +			continue;
> +		}
> +		ret = install_breakpoint(mm);
> +		if (ret && (ret != -ESRCH || ret != -EEXIST)) {
> +			up_read(&mm->mmap_sem);
> +			mmput(mm);
> +			break;
> +		}

Right, so you already deal with -EEXIST, so why do we need that list at
all then?

Aah, it's to make forward progress; without it we would keep retrying the
same vma over and over... hmm?

> +		ret = 0;
> +		up_read(&mm->mmap_sem);
> +		mmput(mm);
> +	}
> +	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
> +		list_del(&vi->probe_list);
> +		kfree(vi);
> +	}
> +	return ret;
> +}


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-09-26 13:15     ` Peter Zijlstra
@ 2011-09-26 13:23       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Andi Kleen, Oleg Nesterov,
	LKML, Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton

* Peter Zijlstra <peterz@infradead.org> [2011-09-26 15:15:00]:

> On Tue, 2011-09-20 at 17:30 +0530, Srikar Dronamraju wrote:
> 
> > +static struct vma_info *__find_next_vma_info(struct list_head *head,
> > +			loff_t offset, struct address_space *mapping,
> > +			struct vma_info *vi)
> > +{
> > +	struct prio_tree_iter iter;
> > +	struct vm_area_struct *vma;
> > +	struct vma_info *tmpvi;
> > +	loff_t vaddr;
> > +	unsigned long pgoff = offset >> PAGE_SHIFT;
> > +	int existing_vma;
> > +
> > +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> > +		if (!vma || !valid_vma(vma))
> > +			return NULL;
> > +
> > +		existing_vma = 0;
> > +		vaddr = vma->vm_start + offset;
> > +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +		list_for_each_entry(tmpvi, head, probe_list) {
> > +			if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
> > +				existing_vma = 1;
> > +				break;
> > +			}
> > +		}
> > +		if (!existing_vma &&
> > +				atomic_inc_not_zero(&vma->vm_mm->mm_users)) {
> > +			vi->mm = vma->vm_mm;
> > +			vi->vaddr = vaddr;
> > +			list_add(&vi->probe_list, head);
> > +			return vi;
> 
> So the sole purpose of having that list is the linear scan above,
> to test whether we've already handled this one?
> 
> Does that really matter? After all, if the probe is already installed
> installing it again will return with -EEXIST, which should be easy
> enough to deal with.
> 

No, there is a possibility of looping forever.
Since the prio tree can change when we drop mapping->i_mmap_mutex, we
don't pass a hint to vma_prio_tree_foreach(), so we might keep getting
the same vma again and again.

> > +		}
> > +	}
> > +	return NULL;
> > +}
> > +
> > +/*
> > + * Iterate in the rmap prio tree  and find a vma where a probe has not
> > + * yet been inserted.
> > + */
> > +static struct vma_info *find_next_vma_info(struct list_head *head,
> > +			loff_t offset, struct address_space *mapping)
> > +{
> > +	struct vma_info *vi, *retvi;
> > +	vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
> > +	if (!vi)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	INIT_LIST_HEAD(&vi->probe_list);
> 
> weird place for the INIT_LIST_HEAD, I would have expected it near where
> the rest of vi is initialized, although it looks to be superfluous
> anyway, since list_add() can handle an uninitialized entry.
> 
> 
> > +	mutex_lock(&mapping->i_mmap_mutex);
> > +	retvi = __find_next_vma_info(head, offset, mapping, vi);
> > +	mutex_unlock(&mapping->i_mmap_mutex);
> > +
> > +	if (!retvi)
> > +		kfree(vi);
> > +	return retvi;
> > +}
> > +
> > +static int __register_uprobe(struct inode *inode, loff_t offset,
> > +				struct uprobe *uprobe)
> > +{
> > +	struct list_head try_list;
> > +	struct vm_area_struct *vma;
> > +	struct address_space *mapping;
> > +	struct vma_info *vi, *tmpvi;
> > +	struct mm_struct *mm;
> > +	int ret = 0;
> > +
> > +	mapping = inode->i_mapping;
> > +	INIT_LIST_HEAD(&try_list);
> > +	while ((vi = find_next_vma_info(&try_list, offset,
> > +							mapping)) != NULL) {
> > +		if (IS_ERR(vi)) {
> > +			ret = -ENOMEM;
> > +			break;
> > +		}
> 
> Here we hold neither i_mmap_mutex nor mmap_sem, so everything can change
> under our feet. See below..
> 
> > +		mm = vi->mm;
> > +		down_read(&mm->mmap_sem);
> > +		vma = find_vma(mm, (unsigned long) vi->vaddr);
> > +		if (!vma || !valid_vma(vma)) {
> 
> No validation that it's indeed the same vma you found earlier? At the very
> least we should validate the vma returned from find_vma() is indeed a
> mapping of the inode we're after and that the offset is still to be
> found at vaddr.
> 

Yes, this can be done.

> > +			list_del(&vi->probe_list);
> > +			kfree(vi);
> > +			up_read(&mm->mmap_sem);
> > +			mmput(mm);
> > +			continue;
> > +		}
> > +		ret = install_breakpoint(mm);
> > +		if (ret && (ret != -ESRCH || ret != -EEXIST)) {
> > +			up_read(&mm->mmap_sem);
> > +			mmput(mm);
> > +			break;
> > +		}
> 
> Right, so you already deal with -EEXIST, so why do we need that list at
> all then?
> 
> Aah, it's to make forward progress; without it we would keep retrying the
> same vma over and over... hmm?
> 

Yes.

> > +		ret = 0;
> > +		up_read(&mm->mmap_sem);
> > +		mmput(mm);
> > +	}
> > +	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
> > +		list_del(&vi->probe_list);
> > +		kfree(vi);
> > +	}
> > +	return ret;
> > +}
> 

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
@ 2011-09-26 13:23       ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 13:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Andi Kleen, Oleg Nesterov,
	LKML, Jim Keniston, Roland McGrath, Ananth N Mavinakayanahalli,
	Andrew Morton

* Peter Zijlstra <peterz@infradead.org> [2011-09-26 15:15:00]:

> On Tue, 2011-09-20 at 17:30 +0530, Srikar Dronamraju wrote:
> 
> > +static struct vma_info *__find_next_vma_info(struct list_head *head,
> > +			loff_t offset, struct address_space *mapping,
> > +			struct vma_info *vi)
> > +{
> > +	struct prio_tree_iter iter;
> > +	struct vm_area_struct *vma;
> > +	struct vma_info *tmpvi;
> > +	loff_t vaddr;
> > +	unsigned long pgoff = offset >> PAGE_SHIFT;
> > +	int existing_vma;
> > +
> > +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> > +		if (!vma || !valid_vma(vma))
> > +			return NULL;
> > +
> > +		existing_vma = 0;
> > +		vaddr = vma->vm_start + offset;
> > +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +		list_for_each_entry(tmpvi, head, probe_list) {
> > +			if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
> > +				existing_vma = 1;
> > +				break;
> > +			}
> > +		}
> > +		if (!existing_vma &&
> > +				atomic_inc_not_zero(&vma->vm_mm->mm_users)) {
> > +			vi->mm = vma->vm_mm;
> > +			vi->vaddr = vaddr;
> > +			list_add(&vi->probe_list, head);
> > +			return vi;
> 
> So the sole purpose of actually having that list is the above linear
> search, to test whether we've already seen this one?
> 
> Does that really matter? After all, if the probe is already installed
> installing it again will return with -EEXIST, which should be easy
> enough to deal with.
> 

No, there is a possibility of looping forever.
Since the prio tree can change when we drop mapping->i_mmap_mutex, we
don't pass a hint to vma_prio_tree_foreach().
So we might keep getting the same vma again and again.

> > +		}
> > +	}
> > +	return NULL;
> > +}
> > +
> > +/*
> > + * Iterate in the rmap prio tree  and find a vma where a probe has not
> > + * yet been inserted.
> > + */
> > +static struct vma_info *find_next_vma_info(struct list_head *head,
> > +			loff_t offset, struct address_space *mapping)
> > +{
> > +	struct vma_info *vi, *retvi;
> > +	vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
> > +	if (!vi)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	INIT_LIST_HEAD(&vi->probe_list);
> 
> weird place for the INIT_LIST_HEAD, I would have expected it near where
> the rest of vi is initialized, although it looks to be superfluous
> anyway, since list_add() can handle an uninitialized entry.
> 
> 
> > +	mutex_lock(&mapping->i_mmap_mutex);
> > +	retvi = __find_next_vma_info(head, offset, mapping, vi);
> > +	mutex_unlock(&mapping->i_mmap_mutex);
> > +
> > +	if (!retvi)
> > +		kfree(vi);
> > +	return retvi;
> > +}
> > +
> > +static int __register_uprobe(struct inode *inode, loff_t offset,
> > +				struct uprobe *uprobe)
> > +{
> > +	struct list_head try_list;
> > +	struct vm_area_struct *vma;
> > +	struct address_space *mapping;
> > +	struct vma_info *vi, *tmpvi;
> > +	struct mm_struct *mm;
> > +	int ret = 0;
> > +
> > +	mapping = inode->i_mapping;
> > +	INIT_LIST_HEAD(&try_list);
> > +	while ((vi = find_next_vma_info(&try_list, offset,
> > +							mapping)) != NULL) {
> > +		if (IS_ERR(vi)) {
> > +			ret = -ENOMEM;
> > +			break;
> > +		}
> 
> Here we hold neither i_mmap_mutex nor mmap_sem, so everything can change
> under our feet. See below..
> 
> > +		mm = vi->mm;
> > +		down_read(&mm->mmap_sem);
> > +		vma = find_vma(mm, (unsigned long) vi->vaddr);
> > +		if (!vma || !valid_vma(vma)) {
> 
> No validation that it's indeed the same vma you found earlier? At the very
> least we should validate the vma returned from find_vma() is indeed a
> mapping of the inode we're after and that the offset is still to be
> found at vaddr.
> 

Yes, this can be done.

> > +			list_del(&vi->probe_list);
> > +			kfree(vi);
> > +			up_read(&mm->mmap_sem);
> > +			mmput(mm);
> > +			continue;
> > +		}
> > +		ret = install_breakpoint(mm);
> > +		if (ret && (ret != -ESRCH || ret != -EEXIST)) {
> > +			up_read(&mm->mmap_sem);
> > +			mmput(mm);
> > +			break;
> > +		}
> 
> Right, so you already deal with -EEXIST, so why do we need that list at
> all then?
> 
> Aah, it's to make forward progress; without it we would keep retrying the
> same vma over and over... hmm?
> 

Yes.

> > +		ret = 0;
> > +		up_read(&mm->mmap_sem);
> > +		mmput(mm);
> > +	}
> > +	list_for_each_entry_safe(vi, tmpvi, &try_list, probe_list) {
> > +		list_del(&vi->probe_list);
> > +		kfree(vi);
> > +	}
> > +	return ret;
> > +}
> 

-- 
Thanks and Regards
Srikar

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 1/26]   uprobes: Auxillary routines to insert, find, delete uprobes
  2011-09-20 11:59   ` Srikar Dronamraju
@ 2011-09-26 13:35     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 13:35 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Andi Kleen, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

On Tue, 2011-09-20 at 17:29 +0530, Srikar Dronamraju wrote:
> +static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset)

Here and elsewhere, your whitespace is off; it should read:

	struct inode *inode

I think checkpatch will inform you of this, but I didn't check.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-20 12:00   ` Srikar Dronamraju
@ 2011-09-26 13:53     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 13:53 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Ananth N Mavinakayanahalli, Hugh Dickins,
	Christoph Hellwig, Jonathan Corbet, Thomas Gleixner,
	Masami Hiramatsu, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Andi Kleen, LKML


> -static int match_uprobe(struct uprobe *l, struct uprobe *r)
> +static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
>  {
> +	/*
> +	 * if match_inode is non NULL then indicate if the
> +	 * inode atleast match.
> +	 */
> +	if (match_inode)
> +		*match_inode = 0;
> +
>  	if (l->inode < r->inode)
>  		return -1;
>  	if (l->inode > r->inode)
>  		return 1;
>  	else {
> +		if (match_inode)
> +			*match_inode = 1;
> +
>  		if (l->offset < r->offset)
>  			return -1;
>  
> @@ -75,16 +86,20 @@ static int match_uprobe(struct uprobe *l, struct uprobe *r)
>  	return 0;
>  }
>  
> -static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset)
> +static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset,
> +					struct rb_node **close_match)
>  {
>  	struct uprobe u = { .inode = inode, .offset = offset };
>  	struct rb_node *n = uprobes_tree.rb_node;
>  	struct uprobe *uprobe;
> -	int match;
> +	int match, match_inode;
>  
>  	while (n) {
>  		uprobe = rb_entry(n, struct uprobe, rb_node);
> -		match = match_uprobe(&u, uprobe);
> +		match = match_uprobe(&u, uprobe, &match_inode);
> +		if (close_match && match_inode)
> +			*close_match = n;

Because:

		if (close_match && uprobe->inode == inode)

Isn't that good enough? Also, returning an rb_node just seems iffy...

>  		if (!match) {
>  			atomic_inc(&uprobe->ref);
>  			return uprobe;


Why not something like:


+static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset,
					bool inode_only)
+{
        struct uprobe u = { .inode = inode, .offset = inode_only ? 0 : offset };
+       struct rb_node *n = uprobes_tree.rb_node;
+       struct uprobe *uprobe;
	struct uprobe *ret = NULL;
+       int match;
+
+       while (n) {
+               uprobe = rb_entry(n, struct uprobe, rb_node);
+               match = match_uprobe(&u, uprobe);
+               if (!match) {
			if (!inode_only)
	                       atomic_inc(&uprobe->ref);
+                       return uprobe;
+               }
		if (inode_only && uprobe->inode == inode)
			ret = uprobe;
+               if (match < 0)
+                       n = n->rb_left;
+               else
+                       n = n->rb_right;
+
+       }
        return ret;
+}


> +/*
> + * For a given inode, build a list of probes that need to be inserted.
> + */
> +static void build_probe_list(struct inode *inode, struct list_head *head)
> +{
> +	struct uprobe *uprobe;
> +	struct rb_node *n;
> +	unsigned long flags;
> +
> +	n = uprobes_tree.rb_node;
> +	spin_lock_irqsave(&uprobes_treelock, flags);
> +	uprobe = __find_uprobe(inode, 0, &n);


> +	/*
> +	 * If indeed there is a probe for the inode and with offset zero,
> +	 * then lets release its reference. (ref got thro __find_uprobe)
> +	 */
> +	if (uprobe)
> +		put_uprobe(uprobe);

The above would make this ^ unneeded.

	n = &uprobe->rb_node;

> +	for (; n; n = rb_next(n)) {
> +		uprobe = rb_entry(n, struct uprobe, rb_node);
> +		if (uprobe->inode != inode)
> +			break;
> +		list_add(&uprobe->pending_list, head);
> +		atomic_inc(&uprobe->ref);
> +	}
> +	spin_unlock_irqrestore(&uprobes_treelock, flags);
> +}

If this ever gets to be a latency issue (linear lookup under spinlock)
you can use a double lock (mutex+spinlock) and require that modification
acquires both but lookups can get away with either.

That way you can do the linear search using a mutex instead of the
spinlock.

> +
> +/*
> + * Called from mmap_region.
> + * called with mm->mmap_sem acquired.
> + *
> + * Return -ve no if we fail to insert probes and we cannot
> + * bail-out.
> + * Return 0 otherwise. i.e :
> + *	- successful insertion of probes
> + *	- (or) no possible probes to be inserted.
> + *	- (or) insertion of probes failed but we can bail-out.
> + */
> +int mmap_uprobe(struct vm_area_struct *vma)
> +{
> +	struct list_head tmp_list;
> +	struct uprobe *uprobe, *u;
> +	struct inode *inode;
> +	int ret = 0;
> +
> +	if (!valid_vma(vma))
> +		return ret;	/* Bail-out */
> +
> +	inode = igrab(vma->vm_file->f_mapping->host);
> +	if (!inode)
> +		return ret;
> +
> +	INIT_LIST_HEAD(&tmp_list);
> +	mutex_lock(&uprobes_mmap_mutex);
> +	build_probe_list(inode, &tmp_list);
> +	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> +		loff_t vaddr;
> +
> +		list_del(&uprobe->pending_list);
> +		if (!ret && uprobe->consumers) {
> +			vaddr = vma->vm_start + uprobe->offset;
> +			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> +				continue;
> +			ret = install_breakpoint(vma->vm_mm, uprobe);
> +
> +			if (ret && (ret == -ESRCH || ret == -EEXIST))
> +				ret = 0;
> +		}
> +		put_uprobe(uprobe);
> +	}
> +
> +	mutex_unlock(&uprobes_mmap_mutex);
> +	iput(inode);
> +	return ret;
> +}
> +
> +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> +		struct inode *inode)
> +{
> +	struct uprobe *uprobe;
> +	struct rb_node *n;
> +	unsigned long flags;
> +
> +	n = uprobes_tree.rb_node;
> +	spin_lock_irqsave(&uprobes_treelock, flags);
> +	uprobe = __find_uprobe(inode, 0, &n);
> +
> +	/*
> +	 * If indeed there is a probe for the inode and with offset zero,
> +	 * then lets release its reference. (ref got thro __find_uprobe)
> +	 */
> +	if (uprobe)
> +		put_uprobe(uprobe);
> +	for (; n; n = rb_next(n)) {
> +		loff_t vaddr;
> +
> +		uprobe = rb_entry(n, struct uprobe, rb_node);
> +		if (uprobe->inode != inode)
> +			break;
> +		vaddr = vma->vm_start + uprobe->offset;
> +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +		if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> +			continue;
> +		atomic_dec(&vma->vm_mm->mm_uprobes_count);
> +	}
> +	spin_unlock_irqrestore(&uprobes_treelock, flags);
> +}
> +
> +/*
> + * Called in context of a munmap of a vma.
> + */
> +void munmap_uprobe(struct vm_area_struct *vma)
> +{
> +	struct inode *inode;
> +
> +	if (!valid_vma(vma))
> +		return;		/* Bail-out */
> +
> +	if (!atomic_read(&vma->vm_mm->mm_uprobes_count))
> +		return;
> +
> +	inode = igrab(vma->vm_file->f_mapping->host);
> +	if (!inode)
> +		return;
> +
> +	dec_mm_uprobes_count(vma, inode);
> +	iput(inode);
> +	return;
> +}

One has to wonder why mmap_uprobe() can be one function but
munmap_uprobe() cannot.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
  2011-09-20 12:02   ` Srikar Dronamraju
@ 2011-09-26 13:59     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 13:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Tue, 2011-09-20 at 17:32 +0530, Srikar Dronamraju wrote:
> 						Hence provide some extra
> + * time (by way of synchronize_sched() for breakpoint hit threads to acquire
> + * the uprobes_treelock before the uprobe is removed from the rbtree. 

'Some extra time' doesn't make me all warm and fuzzy inside; instead it
screams that we're fudging around a race condition.

ISTR raising this before ;-)

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
  2011-09-20 12:02   ` Srikar Dronamraju
@ 2011-09-26 14:02     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 14:02 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Tue, 2011-09-20 at 17:32 +0530, Srikar Dronamraju wrote:
> + * Guarded by uproc->mutex.

That seems to be the only reference to this thing here... what's a uproc
and where's this mutex?

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 13/26]   x86: define a x86 specific exception notifier.
  2011-09-20 12:02   ` Srikar Dronamraju
@ 2011-09-26 14:19     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 14:19 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Andi Kleen, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

On Tue, 2011-09-20 at 17:32 +0530, Srikar Dronamraju wrote:
> @@ -820,6 +821,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
>                 mce_notify_process();
>  #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
>  
> +       if (thread_info_flags & _TIF_UPROBE) {
> +               clear_thread_flag(TIF_UPROBE);
> +#ifdef CONFIG_X86_32
> +               /*
> +                * On x86_32, do_notify_resume() gets called with
> +                * interrupts disabled. Hence enable interrupts if they
> +                * are still disabled.
> +                */
> +               local_irq_enable();
> +#endif
> +               uprobe_notify_resume(regs);
> +       }
> +
>         /* deal with pending signal delivery */
>         if (thread_info_flags & _TIF_SIGPENDING)
>                 do_signal(regs); 

It would be good to remove this difference between i386 and x86_64.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 17/26]   x86: arch specific hooks for pre/post singlestep handling.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-09-26 14:23     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 14:23 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Ananth N Mavinakayanahalli, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Andi Kleen, Andrew Morton

On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> +fail:
> +       pr_warn_once("uprobes: Failed to adjust return address after"
> +               " single-stepping call instruction;"
> +               " pid=%d, sp=%#lx\n", current->pid, sp);
> +       return -EFAULT; 

So how can that happen? Single-Step while someone unmapped the stack?

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-26 13:53     ` Peter Zijlstra
@ 2011-09-26 15:44       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 15:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Ananth N Mavinakayanahalli, Hugh Dickins,
	Christoph Hellwig, Jonathan Corbet, Thomas Gleixner,
	Masami Hiramatsu, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Andi Kleen, LKML

> >  
> > -static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset)
> > +static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset,
> > +					struct rb_node **close_match)
> >  {
> >  	struct uprobe u = { .inode = inode, .offset = offset };
> >  	struct rb_node *n = uprobes_tree.rb_node;
> >  	struct uprobe *uprobe;
> > -	int match;
> > +	int match, match_inode;
> >  
> >  	while (n) {
> >  		uprobe = rb_entry(n, struct uprobe, rb_node);
> > -		match = match_uprobe(&u, uprobe);
> > +		match = match_uprobe(&u, uprobe, &match_inode);
> > +		if (close_match && match_inode)
> > +			*close_match = n;
> 
> Because:
> 
> 		if (close_match && uprobe->inode == inode)
> 
> Isn't good enough? Also, returning an rb_node just seems iffy.. 

Yup, this can be done. Can you please elaborate on why passing back an
rb_node is an issue?

> 
> >  		if (!match) {
> >  			atomic_inc(&uprobe->ref);
> >  			return uprobe;
> 
> 
> Why not something like:
> 
> 
> +static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset,
> 					bool inode_only)
> +{
>         struct uprobe u = { .inode = inode, .offset = inode_only ? 0 : offset };
> +       struct rb_node *n = uprobes_tree.rb_node;
> +       struct uprobe *uprobe;
> 	struct uprobe *ret = NULL;
> +       int match;
> +
> +       while (n) {
> +               uprobe = rb_entry(n, struct uprobe, rb_node);
> +               match = match_uprobe(&u, uprobe);
> +               if (!match) {
> 			if (!inode_only)
> 	                       atomic_inc(&uprobe->ref);
> +                       return uprobe;
> +               }
> 		if (inode_only && uprobe->inode == inode)
> 			ret = uprobe;
> +               if (match < 0)
> +                       n = n->rb_left;
> +               else
> +                       n = n->rb_right;
> +
> +       }
>         return ret;
> +}
> 

I am not comfortable with this change.
find_uprobe() was supposed to return a uprobe if and only if the inode
and offset match. However, with your approach, we can end up returning a
uprobe that doesn't match and that isn't refcounted. Moreover, even when
we do have a matching uprobe, we end up handing back an unrefcounted
uprobe.
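For illustration, the contract being defended here, take a reference if and only if both inode and offset match exactly, can be sketched in self-contained user-space C (simplified stand-in types; this is not the kernel implementation, which walks an rb-tree under uprobes_treelock):

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel's struct uprobe and its refcount. */
struct uprobe {
	unsigned long inode;	/* stands in for struct inode *	*/
	long long offset;
	int ref;		/* stands in for atomic_t ref	*/
};

/*
 * Return a probe only on an exact (inode, offset) match, and take a
 * reference before handing it back; never leak an unreferenced
 * "close" match to the caller.
 */
static struct uprobe *find_uprobe(struct uprobe *table, size_t n,
				  unsigned long inode, long long offset)
{
	for (size_t i = 0; i < n; i++) {
		if (table[i].inode == inode && table[i].offset == offset) {
			table[i].ref++;		/* atomic_inc() in the kernel */
			return &table[i];
		}
	}
	return NULL;				/* no partial matches escape */
}
```

The point of the sketch is only the invariant: every uprobe the caller receives is both an exact match and refcounted.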

> 
> > +/*
> > + * For a given inode, build a list of probes that need to be inserted.
> > + */
> > +static void build_probe_list(struct inode *inode, struct list_head *head)
> > +{
> > +	struct uprobe *uprobe;
> > +	struct rb_node *n;
> > +	unsigned long flags;
> > +
> > +	n = uprobes_tree.rb_node;
> > +	spin_lock_irqsave(&uprobes_treelock, flags);
> > +	uprobe = __find_uprobe(inode, 0, &n);
> 
> 
> > +	/*
> > +	 * If indeed there is a probe for the inode and with offset zero,
> > +	 * then lets release its reference. (ref got thro __find_uprobe)
> > +	 */
> > +	if (uprobe)
> > +		put_uprobe(uprobe);
> 
> The above would make this ^ unneeded.
> 
> 	n = &uprobe->rb_node;
> 
> > +	for (; n; n = rb_next(n)) {
> > +		uprobe = rb_entry(n, struct uprobe, rb_node);
> > +		if (uprobe->inode != inode)
> > +			break;
> > +		list_add(&uprobe->pending_list, head);
> > +		atomic_inc(&uprobe->ref);
> > +	}
> > +	spin_unlock_irqrestore(&uprobes_treelock, flags);
> > +}
> 
> If this ever gets to be a latency issue (linear lookup under spinlock)
> you can use a double lock (mutex+spinlock) and require that modification
> acquires both but lookups can get away with either.
> 
> That way you can do the linear search using a mutex instead of the
> spinlock.
> 

Okay.

> > +
> > +/*
> > + * Called from mmap_region.
> > + * called with mm->mmap_sem acquired.
> > + *
> > + * Return -ve no if we fail to insert probes and we cannot
> > + * bail-out.
> > + * Return 0 otherwise. i.e :
> > + *	- successful insertion of probes
> > + *	- (or) no possible probes to be inserted.
> > + *	- (or) insertion of probes failed but we can bail-out.
> > + */
> > +int mmap_uprobe(struct vm_area_struct *vma)
> > +{
> > +	struct list_head tmp_list;
> > +	struct uprobe *uprobe, *u;
> > +	struct inode *inode;
> > +	int ret = 0;
> > +
> > +	if (!valid_vma(vma))
> > +		return ret;	/* Bail-out */
> > +
> > +	inode = igrab(vma->vm_file->f_mapping->host);
> > +	if (!inode)
> > +		return ret;
> > +
> > +	INIT_LIST_HEAD(&tmp_list);
> > +	mutex_lock(&uprobes_mmap_mutex);
> > +	build_probe_list(inode, &tmp_list);
> > +	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > +		loff_t vaddr;
> > +
> > +		list_del(&uprobe->pending_list);
> > +		if (!ret && uprobe->consumers) {
> > +			vaddr = vma->vm_start + uprobe->offset;
> > +			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > +				continue;
> > +			ret = install_breakpoint(vma->vm_mm, uprobe);
> > +
> > +			if (ret && (ret == -ESRCH || ret == -EEXIST))
> > +				ret = 0;
> > +		}
> > +		put_uprobe(uprobe);
> > +	}
> > +
> > +	mutex_unlock(&uprobes_mmap_mutex);
> > +	iput(inode);
> > +	return ret;
> > +}
> > +
> > +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> > +		struct inode *inode)
> > +{
> > +	struct uprobe *uprobe;
> > +	struct rb_node *n;
> > +	unsigned long flags;
> > +
> > +	n = uprobes_tree.rb_node;
> > +	spin_lock_irqsave(&uprobes_treelock, flags);
> > +	uprobe = __find_uprobe(inode, 0, &n);
> > +
> > +	/*
> > +	 * If indeed there is a probe for the inode and with offset zero,
> > +	 * then lets release its reference. (ref got thro __find_uprobe)
> > +	 */
> > +	if (uprobe)
> > +		put_uprobe(uprobe);
> > +	for (; n; n = rb_next(n)) {
> > +		loff_t vaddr;
> > +
> > +		uprobe = rb_entry(n, struct uprobe, rb_node);
> > +		if (uprobe->inode != inode)
> > +			break;
> > +		vaddr = vma->vm_start + uprobe->offset;
> > +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +		if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > +			continue;
> > +		atomic_dec(&vma->vm_mm->mm_uprobes_count);
> > +	}
> > +	spin_unlock_irqrestore(&uprobes_treelock, flags);
> > +}
> > +
> > +/*
> > + * Called in context of a munmap of a vma.
> > + */
> > +void munmap_uprobe(struct vm_area_struct *vma)
> > +{
> > +	struct inode *inode;
> > +
> > +	if (!valid_vma(vma))
> > +		return;		/* Bail-out */
> > +
> > +	if (!atomic_read(&vma->vm_mm->mm_uprobes_count))
> > +		return;
> > +
> > +	inode = igrab(vma->vm_file->f_mapping->host);
> > +	if (!inode)
> > +		return;
> > +
> > +	dec_mm_uprobes_count(vma, inode);
> > +	iput(inode);
> > +	return;
> > +}
> 
> One has to wonder why mmap_uprobe() can be one function but
> munmap_uprobe() cannot.
> 

I didn't understand this comment, can you please elaborate?
mmap_uprobe() uses build_probe_list() and munmap_uprobe() uses
dec_mm_uprobes_count().
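As an aside, mmap_uprobe() and dec_mm_uprobes_count() in the hunks above both translate a probe's file offset into a user virtual address with the same arithmetic. A self-contained sketch of that translation (the 4 KiB page size and the zero return for "outside the vma" are assumptions of the sketch):

```c
#include <assert.h>

#define PAGE_SHIFT 12	/* assume 4 KiB pages for this sketch */

/*
 * Translate a file offset into a virtual address within a mapping.
 * vm_pgoff is the file offset (in pages) at which the vma begins, so
 * the probe lands at vm_start + offset - (vm_pgoff << PAGE_SHIFT).
 * Returns 0 when the computed address falls outside [vm_start, vm_end).
 */
static unsigned long probe_vaddr(unsigned long vm_start,
				 unsigned long vm_end,
				 unsigned long vm_pgoff,
				 long long offset)
{
	long long vaddr = (long long)vm_start + offset
			- ((long long)vm_pgoff << PAGE_SHIFT);

	if (vaddr < (long long)vm_start || vaddr >= (long long)vm_end)
		return 0;			/* not mapped by this vma */
	return (unsigned long)vaddr;
}
```

The bounds check is what lets both loops skip probes whose offsets fall outside a partial mapping of the file.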

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 13/26] x86: define a x86 specific exception notifier.
  2011-09-26 14:19     ` Peter Zijlstra
@ 2011-09-26 15:52       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 15:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Andi Kleen, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

* Peter Zijlstra <peterz@infradead.org> [2011-09-26 16:19:51]:

> On Tue, 2011-09-20 at 17:32 +0530, Srikar Dronamraju wrote:
> > @@ -820,6 +821,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
> >                 mce_notify_process();
> >  #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
> >  
> > +       if (thread_info_flags & _TIF_UPROBE) {
> > +               clear_thread_flag(TIF_UPROBE);
> > +#ifdef CONFIG_X86_32
> > +               /*
> > +                * On x86_32, do_notify_resume() gets called with
> > +                * interrupts disabled. Hence enable interrupts if they
> > +                * are still disabled.
> > +                */
> > +               local_irq_enable();
> > +#endif
> > +               uprobe_notify_resume(regs);
> > +       }
> > +
> >         /* deal with pending signal delivery */
> >         if (thread_info_flags & _TIF_SIGPENDING)
> >                 do_signal(regs); 
> 
> It would be good to remove this difference between i386 and x86_64.


I think we have already discussed this. I tried to find out why we have
this difference in behaviour, but I haven't been able to find the
answer.

If you can get somebody to answer this, I would be happy to modify as
required.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
  2011-09-26 13:59     ` Peter Zijlstra
@ 2011-09-26 16:01       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 16:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

* Peter Zijlstra <peterz@infradead.org> [2011-09-26 15:59:13]:

> On Tue, 2011-09-20 at 17:32 +0530, Srikar Dronamraju wrote:
> > 						Hence provide some extra
> > + * time (by way of synchronize_sched() for breakpoint hit threads to acquire
> > + * the uprobes_treelock before the uprobe is removed from the rbtree. 
> 
> 'Some extra time' doesn't make me all warm an fuzzy inside, but instead
> screams we fudge around a race condition.

The extra time provided is sufficient to avoid the race, so I will
modify the comment to say "sufficient" instead of "some".

Would that suffice?

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 1/26]   uprobes: Auxillary routines to insert, find, delete uprobes
  2011-09-26 13:35     ` Peter Zijlstra
@ 2011-09-26 16:19       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 16:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Andi Kleen, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

* Peter Zijlstra <peterz@infradead.org> [2011-09-26 15:35:15]:

> On Tue, 2011-09-20 at 17:29 +0530, Srikar Dronamraju wrote:
> > +static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset)
> 
> Here and elsewhere, your whitespace is off, it should read:
> 
> 	struct inode *inode
> 
> I think checkpatch will inform you of this, but I didn't check.
> 

I have run checkpatch.pl --strict on all the patches and it doesn't
report them.

However, I do see this whitespace issue in three places: the
definitions of write_opcode(), __find_uprobe(), and find_uprobe().

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
  2011-09-26 16:01       ` Srikar Dronamraju
@ 2011-09-26 16:25         ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-26 16:25 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Mon, 2011-09-26 at 21:31 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2011-09-26 15:59:13]:
> 
> > On Tue, 2011-09-20 at 17:32 +0530, Srikar Dronamraju wrote:
> > > 						Hence provide some extra
> > > + * time (by way of synchronize_sched() for breakpoint hit threads to acquire
> > > + * the uprobes_treelock before the uprobe is removed from the rbtree. 
> > 
> > 'Some extra time' doesn't make me all warm an fuzzy inside, but instead
> > screams we fudge around a race condition.
> 
> The extra time provided is sufficient to avoid the race. So will modify
> it to mean "sufficient" instead of "some".   
> 
> Would that suffice?

Much better. For extra points, explain why it's sufficient as well ;-)

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 17/26]   x86: arch specific hooks for pre/post singlestep handling.
  2011-09-26 14:23     ` Peter Zijlstra
@ 2011-09-26 16:34       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-26 16:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Ananth N Mavinakayanahalli, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Andi Kleen, Andrew Morton

* Peter Zijlstra <peterz@infradead.org> [2011-09-26 16:23:53]:

> On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> > +fail:
> > +       pr_warn_once("uprobes: Failed to adjust return address after"
> > +               " single-stepping call instruction;"
> > +               " pid=%d, sp=%#lx\n", current->pid, sp);
> > +       return -EFAULT; 
> 
> So how can that happen? Single-Step while someone unmapped the stack?

We do a copy_to_user() and copy_from_user() just above this; if either
of them fails, we have no choice but to bail out. Whatever caused the
-EFAULT may not be under uprobes' control.
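For context, a rough user-space model of the fixup being discussed (hypothetical names; the real arch code reads and writes the user stack via copy_from_user()/copy_to_user(), which is where the -EFAULT can come from): after single-stepping a copied call instruction in the XOL slot, the return address pushed on the stack points into the XOL area and has to be rewritten to point just past the original call.

```c
#include <assert.h>

/*
 * After single-stepping a copied call instruction at xol_vaddr, the
 * pushed return address is xol_vaddr + insn_len; rewrite it to
 * orig_vaddr + insn_len so the probed program returns to the right
 * place.  Here the stack slot is just a pointer; in the kernel each
 * access goes through copy_from_user()/copy_to_user() and can fail.
 */
static int adjust_ret_addr(unsigned long *stack_slot,
			   unsigned long xol_vaddr,
			   unsigned long orig_vaddr,
			   unsigned long insn_len)
{
	unsigned long ra = *stack_slot;		/* copy_from_user() */

	if (ra != xol_vaddr + insn_len)
		return -1;			/* unexpected stack layout */
	*stack_slot = orig_vaddr + insn_len;	/* copy_to_user() */
	return 0;
}
```

If the stack page has been unmapped under us, either user access fails and the only sane option is to bail out with -EFAULT, as the patch does.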

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-20 20:53         ` Stefan Hajnoczi
  (?)
  (?)
@ 2011-09-26 18:30         ` Mark Wielaard
  -1 siblings, 0 replies; 330+ messages in thread
From: Mark Wielaard @ 2011-09-26 18:30 UTC (permalink / raw)
  To: linux-kernel

Stefan Hajnoczi <stefanha <at> linux.vnet.ibm.com> writes:
> On Tue, Sep 20, 2011 at 02:12:25PM -0400, Christoph Hellwig wrote:
> > Do we now have sdt.h support for uprobes?  That's one of the killer
> > features that always seemed to get postponed.
> 
> Not yet but it's a question of doing roughly what SystemTap does to
> parse the appropriate ELF sections and then putting those probes into
> uprobes.

GDB now also implements this:
http://sourceware.org/ml/gdb-patches/2011-04/msg00036.html

Roland also posted a simple C based parser when he first suggested
the new format:
http://www.sourceware.org/ml/systemtap/2010-q3/msg00145.html

There is a complete description of the elf note section, argument format
and semaphore handling at:
http://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

Cheers,

Mark


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-23 16:51             ` Stefan Hajnoczi
@ 2011-09-26 19:59               ` Josh Stone
  -1 siblings, 0 replies; 330+ messages in thread
From: Josh Stone @ 2011-09-26 19:59 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Masami Hiramatsu, Christoph Hellwig, Srikar Dronamraju,
	Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm, SystemTap

On 09/23/2011 04:53 AM, Masami Hiramatsu wrote:
>> Masami looked at this and found that SystemTap sdt.h currently requires
>> an extra userspace memory store in order to activate probes.  Each probe
>> has a "semaphore" 16-bit counter which applications may test before
>> hitting the probe itself.  This is used to avoid overhead in
>> applications that do expensive argument processing (e.g. creating
>> strings) for probes.
> Indeed, originally, those semaphores were designed for such use cases.
> However, some applications *always* use them (e.g. qemu-kvm).

I found that qemu-kvm generates its tracepoints like this:

  static inline void trace_$name($args) {
      if (QEMU_${nameupper}_ENABLED()) {
          QEMU_${nameupper}($argnames);
      }
  }

In that case, the $args are always computed to call the inline, so
you'll basically just get a memory read, jump, NOP.  There's no benefit
from checking ENABLED() here, and removing it would leave only the NOP.
 Even if you invent an improved mechanism for ENABLED(), that doesn't
change the fact that it's doing useless work here.

So in this case, it may be better to patch qemu, assuming my statements
hold for DTrace's implementation on other platforms too.  The ENABLED()
guard still does have other genuine uses though, as with the string
preparation in Python's probes.


On 09/23/2011 09:51 AM, Stefan Hajnoczi wrote:
>> I'm not sure that we should stick on the current implementation
>> of the sdt.h. I think we'd better modify the sdt.h to replace
>> such semaphores with checking whether the tracepoint is changed from nop.
> 
> I like this option.  The only implication is that all userspace tracing
> needs to go through uprobes if we want to support multiple consumers
> tracing the same address.

This limitation is practically true already, since sharing consumers
have to negotiate the breakpoint anyway.

If we can find a better way to handle semaphores, we at systemtap will
welcome sdt.h improvements.  On the face of it, checking one's own NOP
for modification sounds pretty elegant, but I'm not convinced that it's
possible in practice.

For one, it requires arch specific knowledge in sdt.h of what the NOP or
breakpoint looks like, whereas sdt.h currently only knows whether to use
NOP or NOP 0, without knowledge of how that's encoded.  And this gets
trickier with archs like IA64 where you're part of a bundle.  So this
much is hard, but not impossible.

Another issue is that there's not an easy compile-time correlation
between semaphore checks and probe locations, nor is it necessarily a
1:1 mapping.  The FOO_ENABLED() and PROBE_FOO() code blocks are
distinct, and the compiler can do many tricks with them, loop unrolling,
function specialization, etc.  And if we start placing constraints to
prevent this, then I think we'll be impacting code-gen of the
application more than we'd like.

So, I invite sdt.h prototypes of the nop check, but I'm skeptical...

>> Or, we can introduce an ad-hoc ptrace code to perf tools for modifying
>> those semaphores. However, this means that the user always has to use
>> perf to trace applications, and it's hard to trace multiple applications
>> at a time (can we attach to all of them?)...
> 
> I don't think perf needs to stay attached to the processes.  It just
> needs to increment the semaphores on startup and decrement them on
> shutdown.

You're still relying on getting in there twice, but ptrace could be busy
either or both times.  Plus you could inadvertently block any of the
other legitimate ptrace apps, especially if doing systemwide probing.

FWIW, the counter-semaphore is really only useful for the case where the
breakpoint is placed centrally (by uprobes), but the semaphore is
managed by each separate consumer.  In that case each consumer can
inc/dec their presence.  But if uprobes were to manage this itself, it
basically becomes a simple flag.  So, it would do the trick to have
uprobes take an extra inode offset as a flag to write in a 1, which is
admittedly a bit gross, but IMO the most workable.

Josh


PS - context for the CCed systemtap list:
https://lkml.org/lkml/2011/9/23/93

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-26 19:59               ` Josh Stone
  (?)
@ 2011-09-27  1:32               ` Masami Hiramatsu
  2011-09-27  2:59                   ` Josh Stone
  -1 siblings, 1 reply; 330+ messages in thread
From: Masami Hiramatsu @ 2011-09-27  1:32 UTC (permalink / raw)
  To: Josh Stone
  Cc: Stefan Hajnoczi, Christoph Hellwig, Srikar Dronamraju,
	Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm, SystemTap

(2011/09/27 4:59), Josh Stone wrote:
> On 09/23/2011 04:53 AM, Masami Hiramatsu wrote:
>>> Masami looked at this and found that SystemTap sdt.h currently requires
>>> an extra userspace memory store in order to activate probes.  Each probe
>>> has a "semaphore" 16-bit counter which applications may test before
>>> hitting the probe itself.  This is used to avoid overhead in
>>> applications that do expensive argument processing (e.g. creating
>>> strings) for probes.
>> Indeed, originally, those semaphores were designed for such use cases.
>> However, some applications *always* use them (e.g. qemu-kvm).
>
> I found that qemu-kvm generates its tracepoints like this:
>
>   static inline void trace_$name($args) {
>       if (QEMU_${nameupper}_ENABLED()) {
>           QEMU_${nameupper}($argnames);
>       }
>   }

Right, that's what I said.

> In that case, the $args are always computed to call the inline, so
> you'll basically just get a memory read, jump, NOP.  There's no benefit
> from checking ENABLED() here, and removing it would leave only the NOP.
>  Even if you invent an improved mechanism for ENABLED(), that doesn't
> change the fact that it's doing useless work here.

Yeah, this use is totally meaningless...

> So in this case, it may be better to patch qemu, assuming my statements
> hold for DTrace's implementation on other platforms too.  The ENABLED()
> guard still does have other genuine uses though, as with the string
> preparation in Python's probes.

I agree that qemu needs to be fixed. However, that only covers the
qemu case; it's not a general solution.

> On 09/23/2011 09:51 AM, Stefan Hajnoczi wrote:
>>> I'm not sure that we should stick on the current implementation
>>> of the sdt.h. I think we'd better modify the sdt.h to replace
>>> such semaphores with checking whether the tracepoint is changed from nop.
>>
>> I like this option.  The only implication is that all userspace tracing
>> needs to go through uprobes if we want to support multiple consumers
>> tracing the same address.
>
> This limitation is practically true already, since sharing consumers
> have to negotiate the breakpoint anyway.
>
> If we can find a better way to handle semaphores, we at systemtap will
> welcome sdt.h improvements.  On the face of it, checking one's own NOP
> for modification sounds pretty elegant, but I'm not convinced that it's
> possible in practice.
>
> For one, it requires arch specific knowledge in sdt.h of what the NOP or
> breakpoint looks like, whereas sdt.h currently only knows whether to use
> NOP or NOP 0, without knowledge of how that's encoded.  And this gets
> trickier with archs like IA64 where you're part of a bundle.  So this
> much is hard, but not impossible.

Even so, we can start with x86, which is currently the one and only
platform supporting uprobes :)
Maybe we can prepare asm/sdt.h for describing arch-dep code.

> Another issue is that there's not an easy compile-time correlation
> between semaphore checks and probe locations, nor is it necessarily a
> 1:1 mapping.  The FOO_ENABLED() and PROBE_FOO() code blocks are
> distinct, and the compiler can do many tricks with them, loop unrolling,
> function specialization, etc.  And if we start placing constraints to
> prevent this, then I think we'll be impacting code-gen of the
> application more than we'd like.

Perhaps we can use the constructor attribute for that purpose.

__attribute__((constructor)) static void FOO_init(void) {
	/* Search the FOO tracepoint address in a tracepoint table (like extable) */
	FOO_sem = __find_first_trace_point("FOO");
}

This sets FOO_sem to the address of the first FOO tracepoint. :)

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-27  1:32               ` Masami Hiramatsu
@ 2011-09-27  2:59                   ` Josh Stone
  0 siblings, 0 replies; 330+ messages in thread
From: Josh Stone @ 2011-09-27  2:59 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Stefan Hajnoczi, Christoph Hellwig, Srikar Dronamraju,
	Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm, SystemTap,
	LKML

On 09/26/2011 06:32 PM, Masami Hiramatsu wrote:
>> On 09/23/2011 09:51 AM, Stefan Hajnoczi wrote:
>>>> I'm not sure that we should stick on the current implementation
>>>> of the sdt.h. I think we'd better modify the sdt.h to replace
>>>> such semaphores with checking whether the tracepoint is changed from nop.
>>>
>>> I like this option.  The only implication is that all userspace tracing
>>> needs to go through uprobes if we want to support multiple consumers
>>> tracing the same address.
>>
>> This limitation is practically true already, since sharing consumers
>> have to negotiate the breakpoint anyway.
>>
>> If we can find a better way to handle semaphores, we at systemtap will
>> welcome sdt.h improvements.  On the face of it, checking one's own NOP
>> for modification sounds pretty elegant, but I'm not convinced that it's
>> possible in practice.
>>
>> For one, it requires arch specific knowledge in sdt.h of what the NOP or
>> breakpoint looks like, whereas sdt.h currently only knows whether to use
>> NOP or NOP 0, without knowledge of how that's encoded.  And this gets
>> trickier with archs like IA64 where you're part of a bundle.  So this
>> much is hard, but not impossible.
> 
> Even so, we can start with x86, which is currently the one and only
> platform supporting uprobes :)

This inode-based uprobes only supports x86 so far.  The utrace-based
uprobes also supports s390 and ppc, which you can bet Srikar will be
tasked with here before long...

But uprobes is not the only sdt.h consumer anyway.  GDB also supports
sdt.h in its targets, covering x86, ppc, s390, and arm so far IIRC.

> Maybe we can prepare asm/sdt.h for describing arch-dep code.

In sys/sdt.h itself, these would probably be short enough snippets that
we can just add each arch's check directly.  I'm just averse to adding
arch-dependent stuff unless we have to, but maybe we need it.

>> Another issue is that there's not an easy compile-time correlation
>> between semaphore checks and probe locations, nor is it necessarily a
>> 1:1 mapping.  The FOO_ENABLED() and PROBE_FOO() code blocks are
>> distinct, and the compiler can do many tricks with them, loop unrolling,
>> function specialization, etc.  And if we start placing constraints to
>> prevent this, then I think we'll be impacting code-gen of the
>> application more than we'd like.
> 
> Perhaps we can use the constructor attribute for that purpose.
> 
> __attribute__((constructor)) static void FOO_init(void) {
> 	/* Search the FOO tracepoint address in a tracepoint table (like extable) */
> 	FOO_sem = __find_first_trace_point("FOO");
> }
> 
> This sets FOO_sem to the address of the first FOO tracepoint. :)

This works as long as all FOO instances are enabled together, and as
long as FOO events are not in constructors themselves.  Those are minor
and probably reasonable limitations; they just need to be acknowledged.
And I'm not sure how complicated __find_first_trace_point will be.

This adds a bit of start-up overhead too, but the libraries that
strongly care about this (like glibc and libgcc) are not the ones who
are using ENABLED() checks.

So OK, can you flesh out or prototype what __find_first_trace_point and
its tracepoint table should look like?  If you can demonstrate these,
then I'd be willing to try modifying bin/dtrace and sys/sdt.h.

The good news is that if we figure this out, then the kernel can be
completely SDT-agnostic -- it's just like any other address to probe.  I
hope we can also make it transparent to existing sdt.h consumers, so it
just looks like there's no semaphore-counter to manipulate.


Josh

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-26 19:59               ` Josh Stone
  (?)
  (?)
@ 2011-09-27  7:08               ` Stefan Hajnoczi
  -1 siblings, 0 replies; 330+ messages in thread
From: Stefan Hajnoczi @ 2011-09-27  7:08 UTC (permalink / raw)
  To: Josh Stone
  Cc: Masami Hiramatsu, Christoph Hellwig, Srikar Dronamraju,
	Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm, SystemTap

On Mon, Sep 26, 2011 at 12:59:46PM -0700, Josh Stone wrote:
> On 09/23/2011 04:53 AM, Masami Hiramatsu wrote:
> >> Masami looked at this and found that SystemTap sdt.h currently requires
> >> an extra userspace memory store in order to activate probes.  Each probe
> >> has a "semaphore" 16-bit counter which applications may test before
> >> hitting the probe itself.  This is used to avoid overhead in
> >> applications that do expensive argument processing (e.g. creating
> >> strings) for probes.
> > Indeed, originally, those semaphores were designed for such use cases.
> > However, some applications *always* use them (e.g. qemu-kvm).
> 
> I found that qemu-kvm generates its tracepoints like this:
> 
>   static inline void trace_$name($args) {
>       if (QEMU_${nameupper}_ENABLED()) {
>           QEMU_${nameupper}($argnames);
>       }
>   }
> 
> In that case, the $args are always computed to call the inline, so
> you'll basically just get a memory read, jump, NOP.  There's no benefit
> from checking ENABLED() here, and removing it would leave only the NOP.
>  Even if you invent an improved mechanism for ENABLED(), that doesn't
> change the fact that it's doing useless work here.
> 
> So in this case, it may be better to patch qemu, assuming my statements
> hold for DTrace's implementation on other platforms too.  The ENABLED()
> guard still does have other genuine uses though, as with the string
> preparation in Python's probes.

I will get qemu fixed.

Stefan


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-26 15:44       ` Srikar Dronamraju
@ 2011-09-27 11:37         ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 11:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Ananth N Mavinakayanahalli, Hugh Dickins,
	Christoph Hellwig, Jonathan Corbet, Thomas Gleixner,
	Masami Hiramatsu, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Andi Kleen, LKML

On Mon, 2011-09-26 at 21:14 +0530, Srikar Dronamraju wrote:
> 
> > > +
> > > +/*
> > > + * Called from mmap_region.
> > > + * called with mm->mmap_sem acquired.
> > > + *
> > > + * Return -ve no if we fail to insert probes and we cannot
> > > + * bail-out.
> > > + * Return 0 otherwise. i.e :
> > > + * - successful insertion of probes
> > > + * - (or) no possible probes to be inserted.
> > > + * - (or) insertion of probes failed but we can bail-out.
> > > + */
> > > +int mmap_uprobe(struct vm_area_struct *vma)
> > > +{
> > > +   struct list_head tmp_list;
> > > +   struct uprobe *uprobe, *u;
> > > +   struct inode *inode;
> > > +   int ret = 0;
> > > +
> > > +   if (!valid_vma(vma))
> > > +           return ret;     /* Bail-out */
> > > +
> > > +   inode = igrab(vma->vm_file->f_mapping->host);
> > > +   if (!inode)
> > > +           return ret;
> > > +
> > > +   INIT_LIST_HEAD(&tmp_list);
> > > +   mutex_lock(&uprobes_mmap_mutex);
> > > +   build_probe_list(inode, &tmp_list);
> > > +   list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > > +           loff_t vaddr;
> > > +
> > > +           list_del(&uprobe->pending_list);
> > > +           if (!ret && uprobe->consumers) {
> > > +                   vaddr = vma->vm_start + uprobe->offset;
> > > +                   vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > > +                   if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > > +                           continue;
> > > +                   ret = install_breakpoint(vma->vm_mm, uprobe);
> > > +
> > > +                   if (ret && (ret == -ESRCH || ret == -EEXIST))
> > > +                           ret = 0;
> > > +           }
> > > +           put_uprobe(uprobe);
> > > +   }
> > > +
> > > +   mutex_unlock(&uprobes_mmap_mutex);
> > > +   iput(inode);
> > > +   return ret;
> > > +}
> > > +
> > > +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> > > +           struct inode *inode)
> > > +{
> > > +   struct uprobe *uprobe;
> > > +   struct rb_node *n;
> > > +   unsigned long flags;
> > > +
> > > +   n = uprobes_tree.rb_node;
> > > +   spin_lock_irqsave(&uprobes_treelock, flags);
> > > +   uprobe = __find_uprobe(inode, 0, &n);
> > > +
> > > +   /*
> > > +    * If indeed there is a probe for the inode and with offset zero,
> > > +    * then lets release its reference. (ref got thro __find_uprobe)
> > > +    */
> > > +   if (uprobe)
> > > +           put_uprobe(uprobe);
> > > +   for (; n; n = rb_next(n)) {
> > > +           loff_t vaddr;
> > > +
> > > +           uprobe = rb_entry(n, struct uprobe, rb_node);
> > > +           if (uprobe->inode != inode)
> > > +                   break;
> > > +           vaddr = vma->vm_start + uprobe->offset;
> > > +           vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > > +           if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > > +                   continue;
> > > +           atomic_dec(&vma->vm_mm->mm_uprobes_count);
> > > +   }
> > > +   spin_unlock_irqrestore(&uprobes_treelock, flags);
> > > +}
> > > +
> > > +/*
> > > + * Called in context of a munmap of a vma.
> > > + */
> > > +void munmap_uprobe(struct vm_area_struct *vma)
> > > +{
> > > +   struct inode *inode;
> > > +
> > > +   if (!valid_vma(vma))
> > > +           return;         /* Bail-out */
> > > +
> > > +   if (!atomic_read(&vma->vm_mm->mm_uprobes_count))
> > > +           return;
> > > +
> > > +   inode = igrab(vma->vm_file->f_mapping->host);
> > > +   if (!inode)
> > > +           return;
> > > +
> > > +   dec_mm_uprobes_count(vma, inode);
> > > +   iput(inode);
> > > +   return;
> > > +}
> > 
> > One has to wonder why mmap_uprobe() can be one function but
> > munmap_uprobe() cannot.
> > 
> 
> I didn't understand this comment. Can you please elaborate?
> mmap_uprobe uses build_probe_list and munmap_uprobe uses
> dec_mm_uprobes_count. 

Ah, I missed build_probe_list(), but I didn't see a reason for the
existence of dec_mm_uprobes_count(): the name doesn't make sense, and the
content is 'small' enough to just put in munmap_uprobe().

To me it looks similar to the list iteration you have in mmap_uprobe(),
you didn't split that out into another function either.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
@ 2011-09-27 11:37         ` Peter Zijlstra
  0 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 11:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Ananth N Mavinakayanahalli, Hugh Dickins,
	Christoph Hellwig, Jonathan Corbet, Thomas Gleixner,
	Masami Hiramatsu, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Andi Kleen, LKML

On Mon, 2011-09-26 at 21:14 +0530, Srikar Dronamraju wrote:
> 
> > > +
> > > +/*
> > > + * Called from mmap_region.
> > > + * called with mm->mmap_sem acquired.
> > > + *
> > > + * Return -ve no if we fail to insert probes and we cannot
> > > + * bail-out.
> > > + * Return 0 otherwise. i.e :
> > > + * - successful insertion of probes
> > > + * - (or) no possible probes to be inserted.
> > > + * - (or) insertion of probes failed but we can bail-out.
> > > + */
> > > +int mmap_uprobe(struct vm_area_struct *vma)
> > > +{
> > > +   struct list_head tmp_list;
> > > +   struct uprobe *uprobe, *u;
> > > +   struct inode *inode;
> > > +   int ret = 0;
> > > +
> > > +   if (!valid_vma(vma))
> > > +           return ret;     /* Bail-out */
> > > +
> > > +   inode = igrab(vma->vm_file->f_mapping->host);
> > > +   if (!inode)
> > > +           return ret;
> > > +
> > > +   INIT_LIST_HEAD(&tmp_list);
> > > +   mutex_lock(&uprobes_mmap_mutex);
> > > +   build_probe_list(inode, &tmp_list);
> > > +   list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > > +           loff_t vaddr;
> > > +
> > > +           list_del(&uprobe->pending_list);
> > > +           if (!ret && uprobe->consumers) {
> > > +                   vaddr = vma->vm_start + uprobe->offset;
> > > +                   vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > > +                   if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > > +                           continue;
> > > +                   ret = install_breakpoint(vma->vm_mm, uprobe);
> > > +
> > > +                   if (ret && (ret == -ESRCH || ret == -EEXIST))
> > > +                           ret = 0;
> > > +           }
> > > +           put_uprobe(uprobe);
> > > +   }
> > > +
> > > +   mutex_unlock(&uprobes_mmap_mutex);
> > > +   iput(inode);
> > > +   return ret;
> > > +}
> > > +
> > > +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> > > +           struct inode *inode)
> > > +{
> > > +   struct uprobe *uprobe;
> > > +   struct rb_node *n;
> > > +   unsigned long flags;
> > > +
> > > +   n = uprobes_tree.rb_node;
> > > +   spin_lock_irqsave(&uprobes_treelock, flags);
> > > +   uprobe = __find_uprobe(inode, 0, &n);
> > > +
> > > +   /*
> > > +    * If indeed there is a probe for the inode and with offset zero,
> > > +    * then lets release its reference. (ref got thro __find_uprobe)
> > > +    */
> > > +   if (uprobe)
> > > +           put_uprobe(uprobe);
> > > +   for (; n; n = rb_next(n)) {
> > > +           loff_t vaddr;
> > > +
> > > +           uprobe = rb_entry(n, struct uprobe, rb_node);
> > > +           if (uprobe->inode != inode)
> > > +                   break;
> > > +           vaddr = vma->vm_start + uprobe->offset;
> > > +           vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > > +           if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > > +                   continue;
> > > +           atomic_dec(&vma->vm_mm->mm_uprobes_count);
> > > +   }
> > > +   spin_unlock_irqrestore(&uprobes_treelock, flags);
> > > +}
> > > +
> > > +/*
> > > + * Called in context of a munmap of a vma.
> > > + */
> > > +void munmap_uprobe(struct vm_area_struct *vma)
> > > +{
> > > +   struct inode *inode;
> > > +
> > > +   if (!valid_vma(vma))
> > > +           return;         /* Bail-out */
> > > +
> > > +   if (!atomic_read(&vma->vm_mm->mm_uprobes_count))
> > > +           return;
> > > +
> > > +   inode = igrab(vma->vm_file->f_mapping->host);
> > > +   if (!inode)
> > > +           return;
> > > +
> > > +   dec_mm_uprobes_count(vma, inode);
> > > +   iput(inode);
> > > +   return;
> > > +}
> > 
> > One has to wonder why mmap_uprobe() can be one function but
> > munmap_uprobe() cannot.
> > 
> 
> I didn't understand this comment. Can you please elaborate?
> mmap_uprobe uses build_probe_list and munmap_uprobe uses
> dec_mm_uprobes_count. 

Ah, I missed build_probe_list(), but I didn't see a reason for the
existence of dec_mm_uprobes_count(): the name doesn't make sense and the
content is 'small' enough to just put in munmap_uprobe().

To me it looks similar to the list iteration you have in mmap_uprobe(),
you didn't split that out into another function either.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-26 15:44       ` Srikar Dronamraju
@ 2011-09-27 11:41         ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 11:41 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Ananth N Mavinakayanahalli, Hugh Dickins,
	Christoph Hellwig, Jonathan Corbet, Thomas Gleixner,
	Masami Hiramatsu, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Andi Kleen, LKML

On Mon, 2011-09-26 at 21:14 +0530, Srikar Dronamraju wrote:
> > Why not something like:
> > 
> > 
> > +static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset,
> >                                       bool inode_only)
> > +{
> >         struct uprobe u = { .inode = inode, .offset = inode_only ? 0 : offset };
> > +       struct rb_node *n = uprobes_tree.rb_node;
> > +       struct uprobe *uprobe;
> >       struct uprobe *ret = NULL;
> > +       int match;
> > +
> > +       while (n) {
> > +               uprobe = rb_entry(n, struct uprobe, rb_node);
> > +               match = match_uprobe(&u, uprobe);
> > +               if (!match) {
> >                       if (!inode_only)
> >                              atomic_inc(&uprobe->ref);
> > +                       return uprobe;
> > +               }
> >               if (inode_only && uprobe->inode == inode)
> >                       ret = uprobe;
> > +               if (match < 0)
> > +                       n = n->rb_left;
> > +               else
> > +                       n = n->rb_right;
> > +
> > +       }
> >         return ret;
> > +}
> > 
> 
> I am not comfortable with this change.
> find_uprobe() was supposed to return a uprobe if and only if
> the inode and offset match,

And it will, because find_uprobe() will never expose that third
argument.

>  However, with your approach, we end up
> returning a uprobe that isn't matching and one that isn't refcounted.
> Moreover, even if we have a matching uprobe, we end up sending an
> unrefcounted uprobe back. 

Because the matching isn't the important part, you want to return the
leftmost node matching the specified inode. Also, in that case you
explicitly don't want the ref, since the first thing you do on the
call-site is drop the ref if there was a match. You don't care about
inode:0 in particular, you want a place to start iterating all of
inode:*.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-26 15:44       ` Srikar Dronamraju
@ 2011-09-27 11:42         ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 11:42 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Ananth N Mavinakayanahalli, Hugh Dickins,
	Christoph Hellwig, Jonathan Corbet, Thomas Gleixner,
	Masami Hiramatsu, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Andi Kleen, LKML

On Mon, 2011-09-26 at 21:14 +0530, Srikar Dronamraju wrote:
> > Isn't good enough? Also, returning an rb_node just seems iffy.. 
> 
> Yup, this can be done. Can you please elaborate on why passing back an
> rb_node is an issue? 

Just seems ugly to me; why return a pointer inside the object the
function name deals with?



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 17/26]   x86: arch specific hooks for pre/post singlestep handling.
  2011-09-26 16:34       ` Srikar Dronamraju
@ 2011-09-27 11:44         ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 11:44 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Ananth N Mavinakayanahalli, Thomas Gleixner,
	Jonathan Corbet, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Andi Kleen, Andrew Morton

On Mon, 2011-09-26 at 22:04 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2011-09-26 16:23:53]:
> 
> > On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> > > +fail:
> > > +       pr_warn_once("uprobes: Failed to adjust return address after"
> > > +               " single-stepping call instruction;"
> > > +               " pid=%d, sp=%#lx\n", current->pid, sp);
> > > +       return -EFAULT; 
> > 
> > So how can that happen? Single-Step while someone unmapped the stack?
> 
> We do a copy_to_user, copy_from_user just above this,

I saw that,

>  Now if either of
> them fails, we have no choice but to bail out.

Agreed,

>  What caused this EFAULT may not be under uprobes' control.

I never said it was... All I asked is what (outside of uprobes) was done
to cause this, and why this particular error is important enough to
warrant a warning.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 13/26] x86: define a x86 specific exception notifier.
  2011-09-26 15:52       ` Srikar Dronamraju
@ 2011-09-27 11:46         ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 11:46 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Andi Kleen, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Oleg Nesterov, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

On Mon, 2011-09-26 at 21:22 +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2011-09-26 16:19:51]:
> 
> > On Tue, 2011-09-20 at 17:32 +0530, Srikar Dronamraju wrote:
> > > @@ -820,6 +821,19 @@ do_notify_resume(struct pt_regs *regs, void *unused, __u32 thread_info_flags)
> > >                 mce_notify_process();
> > >  #endif /* CONFIG_X86_64 && CONFIG_X86_MCE */
> > >  
> > > +       if (thread_info_flags & _TIF_UPROBE) {
> > > +               clear_thread_flag(TIF_UPROBE);
> > > +#ifdef CONFIG_X86_32
> > > +               /*
> > > +                * On x86_32, do_notify_resume() gets called with
> > > +                * interrupts disabled. Hence enable interrupts if they
> > > +                * are still disabled.
> > > +                */
> > > +               local_irq_enable();
> > > +#endif
> > > +               uprobe_notify_resume(regs);
> > > +       }
> > > +
> > >         /* deal with pending signal delivery */
> > >         if (thread_info_flags & _TIF_SIGPENDING)
> > >                 do_signal(regs); 
> > 
> > It would be good to remove this difference between i386 and x86_64.
> 
> 
> I think we have already discussed this. I tried to find out why we
> have this difference in behaviour. However, I haven't been able to find
> the answer.
> 
> If you can get somebody to answer this, I would be happy to modify as
> required.

The Changelog failed to mention this. Afaict there really is no reason
other than that touching entry_32.S is a pain.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-09-27 11:49     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 11:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML, Eric Paris

On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> +static int xol_add_vma(struct uprobes_xol_area *area)
> +{
> +       const struct cred *curr_cred;
> +       struct vm_area_struct *vma;
> +       struct mm_struct *mm;
> +       unsigned long addr;
> +       int ret = -ENOMEM;
> +
> +       mm = get_task_mm(current);
> +       if (!mm)
> +               return -ESRCH;
> +
> +       down_write(&mm->mmap_sem);
> +       if (mm->uprobes_xol_area) {
> +               ret = -EALREADY;
> +               goto fail;
> +       }
> +
> +       /*
> +        * Find the end of the top mapping and skip a page.
> +        * If there is no space for PAGE_SIZE above
> +        * that, mmap will ignore our address hint.
> +        *
> +        * override credentials otherwise anonymous memory might
> +        * not be granted execute permission when the selinux
> +        * security hooks have their way.
> +        */
> +       vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
> +       addr = vma->vm_end + PAGE_SIZE;
> +       curr_cred = override_creds(&init_cred);
> +       addr = do_mmap_pgoff(NULL, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
> +       revert_creds(curr_cred);
> +
> +       if (addr & ~PAGE_MASK)
> +               goto fail;
> +       vma = find_vma(mm, addr);
> +
> +       /* Don't expand vma on mremap(). */
> +       vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
> +       area->vaddr = vma->vm_start;
> +       if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
> +                               &vma) > 0)
> +               ret = 0;
> +
> +fail:
> +       up_write(&mm->mmap_sem);
> +       mmput(mm);
> +       return ret;
> +} 

So is that the right way? I looked back to the previous discussion with
Eric and couldn't really make up my mind either way. The changelog is
entirely without detail and Eric isn't CC'ed.

What's the point of having these discussions if all traces of them
disappear on the next posting?

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-09-27 12:18     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 12:18 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> +static struct uprobes_xol_area *xol_alloc_area(void)
> +{
> +       struct uprobes_xol_area *area = NULL;
> +
> +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> +       if (unlikely(!area))
> +               return NULL;
> +
> +       area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
> +                                                               GFP_KERNEL);
> +
> +       if (!area->bitmap)
> +               goto fail;
> +
> +       init_waitqueue_head(&area->wq);
> +       spin_lock_init(&area->slot_lock);
> +       if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {

So what happens if xol_add_vma() succeeds, but we find
->uprobes_xol_area set?

> +               task_lock(current);
> +               if (!current->mm->uprobes_xol_area) {

Having to re-test it under this lock seems to suggest it could.

> +                       current->mm->uprobes_xol_area = area;
> +                       task_unlock(current);
> +                       return area;

This function would be so much easier to read if the success case (this
here I presume) would not be nested 2 deep.

> +               }
> +               task_unlock(current);
> +       }

at which point you could end up with two extra vmas? Because there's no
freeing of the result of xol_add_vma().

> +fail:
> +       kfree(area->bitmap);
> +       kfree(area);
> +       return current->mm->uprobes_xol_area;
> +} 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-27 11:49     ` Peter Zijlstra
@ 2011-09-27 12:32       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-27 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Stephen Smalley, LKML, Eric Paris

* Peter Zijlstra <peterz@infradead.org> [2011-09-27 13:49:37]:

> On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> > +static int xol_add_vma(struct uprobes_xol_area *area)
> > +{
> > +       const struct cred *curr_cred;
> > +       struct vm_area_struct *vma;
> > +       struct mm_struct *mm;
> > +       unsigned long addr;
> > +       int ret = -ENOMEM;
> > +
> > +       mm = get_task_mm(current);
> > +       if (!mm)
> > +               return -ESRCH;
> > +
> > +       down_write(&mm->mmap_sem);
> > +       if (mm->uprobes_xol_area) {
> > +               ret = -EALREADY;
> > +               goto fail;
> > +       }
> > +
> > +       /*
> > +        * Find the end of the top mapping and skip a page.
> > +        * If there is no space for PAGE_SIZE above
> > +        * that, mmap will ignore our address hint.
> > +        *
> > +        * override credentials otherwise anonymous memory might
> > +        * not be granted execute permission when the selinux
> > +        * security hooks have their way.
> > +        */
> > +       vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
> > +       addr = vma->vm_end + PAGE_SIZE;
> > +       curr_cred = override_creds(&init_cred);
> > +       addr = do_mmap_pgoff(NULL, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
> > +       revert_creds(curr_cred);
> > +
> > +       if (addr & ~PAGE_MASK)
> > +               goto fail;
> > +       vma = find_vma(mm, addr);
> > +
> > +       /* Don't expand vma on mremap(). */
> > +       vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
> > +       area->vaddr = vma->vm_start;
> > +       if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
> > +                               &vma) > 0)
> > +               ret = 0;
> > +
> > +fail:
> > +       up_write(&mm->mmap_sem);
> > +       mmput(mm);
> > +       return ret;
> > +} 
> 
> So is that the right way? I looked back to the previous discussion with
> Eric and couldn't really make up my mind either way. The changelog is
> entirely without detail and Eric isn't CC'ed.

This is based on what Stephen Smalley suggested on the same thread
https://lkml.org/lkml/2011/4/20/224

I used to keep the changelog after the --- marker, as Christoph Hellwig
had suggested (https://lkml.org/lkml/2010/7/20/5).
However, "stg export" removes lines after the --- marker.

I agree that I should have copied at least Eric and Stephen on this
patch. However, if the number of to/cc addresses is greater than 20,
the LKML archive could ignore the mail.

I know that these aren't problems faced only by me, and I am open to
suggestions on how others have overcome them.

> 
> What's the point of having these discussions if all traces of them
> disappear on the next posting?

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
@ 2011-09-27 12:32       ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-27 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Stephen Smalley, LKML, Eric Paris

* Peter Zijlstra <peterz@infradead.org> [2011-09-27 13:49:37]:

> On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> > +static int xol_add_vma(struct uprobes_xol_area *area)
> > +{
> > +       const struct cred *curr_cred;
> > +       struct vm_area_struct *vma;
> > +       struct mm_struct *mm;
> > +       unsigned long addr;
> > +       int ret = -ENOMEM;
> > +
> > +       mm = get_task_mm(current);
> > +       if (!mm)
> > +               return -ESRCH;
> > +
> > +       down_write(&mm->mmap_sem);
> > +       if (mm->uprobes_xol_area) {
> > +               ret = -EALREADY;
> > +               goto fail;
> > +       }
> > +
> > +       /*
> > +        * Find the end of the top mapping and skip a page.
> > +        * If there is no space for PAGE_SIZE above
> > +        * that, mmap will ignore our address hint.
> > +        *
> > +        * override credentials otherwise anonymous memory might
> > +        * not be granted execute permission when the selinux
> > +        * security hooks have their way.
> > +        */
> > +       vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
> > +       addr = vma->vm_end + PAGE_SIZE;
> > +       curr_cred = override_creds(&init_cred);
> > +       addr = do_mmap_pgoff(NULL, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
> > +       revert_creds(curr_cred);
> > +
> > +       if (addr & ~PAGE_MASK)
> > +               goto fail;
> > +       vma = find_vma(mm, addr);
> > +
> > +       /* Don't expand vma on mremap(). */
> > +       vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
> > +       area->vaddr = vma->vm_start;
> > +       if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
> > +                               &vma) > 0)
> > +               ret = 0;
> > +
> > +fail:
> > +       up_write(&mm->mmap_sem);
> > +       mmput(mm);
> > +       return ret;
> > +} 
> 
> So is that the right way? I looked back to the previous discussion with
> Eric and couldn't really make up my mind either way. The changelog is
> entirely without detail and Eric isn't CC'ed.

This is based on what Stephen Smalley suggested on the same thread
https://lkml.org/lkml/2011/4/20/224

I used to keep the changelog after the "---" marker, as Christoph Hellwig
had suggested here: https://lkml.org/lkml/2010/7/20/5
However "stg export" removes lines after the --- marker.

I agree that I should have copied Eric and Stephen at least on this
patch. However, if the number of To/Cc recipients is greater than 20,
the LKML archive could ignore the mail.

I know that these aren't problems faced by others, and I am open to
suggestions on how they have overcome them.

> 
> What's the point of having these discussions if all traces of them
> disappear on the next posting?

-- 
Thanks and Regards
Srikar

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-09-27 12:36     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 12:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> +static void xol_wait_event(struct uprobes_xol_area *area)
> +{
> +       if (atomic_read(&area->slot_count) >= UINSNS_PER_PAGE)
> +               wait_event(area->wq,
> +                       (atomic_read(&area->slot_count) < UINSNS_PER_PAGE));
> +} 

That's mighty redundant, look up wait_event() and try again.
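
Since wait_event() evaluates its condition before sleeping and returns
immediately if it already holds, the helper can collapse to a single
call; a minimal kernel-style sketch of the reduction (untested):

```
static void xol_wait_event(struct uprobes_xol_area *area)
{
	/* wait_event() tests the condition first, so no guarding
	 * if () is needed around it. */
	wait_event(area->wq,
		   atomic_read(&area->slot_count) < UINSNS_PER_PAGE);
}
```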

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-09-27 12:37     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 12:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> +static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
> +{
> +       unsigned long slot_addr, flags;
> +       int slot_nr;
> +
> +       do {
> +               spin_lock_irqsave(&area->slot_lock, flags);
> +               slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
> +               if (slot_nr < UINSNS_PER_PAGE) {
> +                       __set_bit(slot_nr, area->bitmap);
> +                       slot_addr = area->vaddr +
> +                                       (slot_nr * UPROBES_XOL_SLOT_BYTES);
> +                       atomic_inc(&area->slot_count);
> +               }
> +               spin_unlock_irqrestore(&area->slot_lock, flags);
> +               if (slot_nr >= UINSNS_PER_PAGE)
> +                       xol_wait_event(area);
> +
> +       } while (slot_nr >= UINSNS_PER_PAGE);
> +
> +       return slot_addr;
> +} 

Why isn't find_first_bit() + test_and_set_bit() sufficient? That
is, what do you need that lock for?

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-27 12:18     ` Peter Zijlstra
@ 2011-09-27 12:45       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-27 12:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

* Peter Zijlstra <peterz@infradead.org> [2011-09-27 14:18:52]:

> On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> > +static struct uprobes_xol_area *xol_alloc_area(void)
> > +{
> > +       struct uprobes_xol_area *area = NULL;
> > +
> > +       area = kzalloc(sizeof(*area), GFP_KERNEL);
> > +       if (unlikely(!area))
> > +               return NULL;
> > +
> > +       area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
> > +                                                               GFP_KERNEL);
> > +
> > +       if (!area->bitmap)
> > +               goto fail;
> > +
> > +       init_waitqueue_head(&area->wq);
> > +       spin_lock_init(&area->slot_lock);
> > +       if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
> 
> So what happens if xol_add_vma() succeeds, but we find
> ->uprobes_xol_area set?
> 
> > +               task_lock(current);
> > +               if (!current->mm->uprobes_xol_area) {
> 
> Having to re-test it under this lock seems to suggest it could.
> 
> > +                       current->mm->uprobes_xol_area = area;
> > +                       task_unlock(current);
> > +                       return area;
> 
> This function would be so much easier to read if the success case (this
> here I presume) would not be nested 2 deep.
> 
> > +               }
> > +               task_unlock(current);
> > +       }
> 
> at which point you could end up with two extra vmas? Because there's no
> freeing of the result of xol_add_vma().
> 

Agree, we need to unmap the vma in that case.

> > +fail:
> > +       kfree(area->bitmap);
> > +       kfree(area);
> > +       return current->mm->uprobes_xol_area;
> > +} 
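
One way to restructure along these lines, with the success path
un-nested and the racing vma unmapped, could look like the following
(kernel-style pseudocode, untested; mmap_sem handling around
do_munmap() elided):

```
static struct uprobes_xol_area *xol_alloc_area(void)
{
	struct uprobes_xol_area *area;

	area = kzalloc(sizeof(*area), GFP_KERNEL);
	if (unlikely(!area))
		return NULL;

	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
								GFP_KERNEL);
	if (!area->bitmap)
		goto free_area;

	init_waitqueue_head(&area->wq);
	spin_lock_init(&area->slot_lock);
	if (xol_add_vma(area))
		goto free_bitmap;

	task_lock(current);
	if (!current->mm->uprobes_xol_area) {
		current->mm->uprobes_xol_area = area;
		task_unlock(current);
		return area;			/* success, one level deep */
	}
	task_unlock(current);

	/* Lost the race: another thread installed an area between
	 * xol_add_vma() and task_lock(), so drop our extra vma. */
	do_munmap(current->mm, area->vaddr, PAGE_SIZE);
free_bitmap:
	kfree(area->bitmap);
free_area:
	kfree(area);
	return current->mm->uprobes_xol_area;
}
```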

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-27 12:37     ` Peter Zijlstra
@ 2011-09-27 12:50       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-27 12:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

* Peter Zijlstra <peterz@infradead.org> [2011-09-27 14:37:59]:

> On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> > +static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
> > +{
> > +       unsigned long slot_addr, flags;
> > +       int slot_nr;
> > +
> > +       do {
> > +               spin_lock_irqsave(&area->slot_lock, flags);
> > +               slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
> > +               if (slot_nr < UINSNS_PER_PAGE) {
> > +                       __set_bit(slot_nr, area->bitmap);
> > +                       slot_addr = area->vaddr +
> > +                                       (slot_nr * UPROBES_XOL_SLOT_BYTES);
> > +                       atomic_inc(&area->slot_count);
> > +               }
> > +               spin_unlock_irqrestore(&area->slot_lock, flags);
> > +               if (slot_nr >= UINSNS_PER_PAGE)
> > +                       xol_wait_event(area);
> > +
> > +       } while (slot_nr >= UINSNS_PER_PAGE);
> > +
> > +       return slot_addr;
> > +} 
> 
> Why isn't find_first_bit() + test_and_set_bit() sufficient? That
> is, what do you need that lock for?

Yes, we could do without the lock too.
Will do this in the next patchset.

-- 
Thanks and Regards 
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-09-27 12:50     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 12:50 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> +               spin_lock_irqsave(&area->slot_lock, flags);
> +               __clear_bit(slot_nr, area->bitmap);
> +               spin_unlock_irqrestore(&area->slot_lock, flags); 

that so wants to be clear_bit()..

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-09-27 12:55     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 12:55 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Andi Kleen, LKML

On Tue, 2011-09-20 at 17:33 +0530, Srikar Dronamraju wrote:
> +static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
> +{
> +       unsigned long slot_addr, flags;
> +       int slot_nr;
> +
> +       do {
> +               spin_lock_irqsave(&area->slot_lock, flags);
> +               slot_nr = find_first_zero_bit(area->bitmap, UINSNS_PER_PAGE);
> +               if (slot_nr < UINSNS_PER_PAGE) {
> +                       __set_bit(slot_nr, area->bitmap);
> +                       slot_addr = area->vaddr +
> +                                       (slot_nr * UPROBES_XOL_SLOT_BYTES);
> +                       atomic_inc(&area->slot_count);
> +               }
> +               spin_unlock_irqrestore(&area->slot_lock, flags);
> +               if (slot_nr >= UINSNS_PER_PAGE)
> +                       xol_wait_event(area);
> +
> +       } while (slot_nr >= UINSNS_PER_PAGE);
> +
> +       return slot_addr;
> +}

> +static void xol_free_insn_slot(struct task_struct *tsk)
> +{
> +       struct uprobes_xol_area *area;
> +       unsigned long vma_end;
> +       unsigned long slot_addr;
> +
> +       if (!tsk->mm || !tsk->mm->uprobes_xol_area || !tsk->utask)
> +               return;
> +
> +       slot_addr = tsk->utask->xol_vaddr;
> +
> +       if (unlikely(!slot_addr || IS_ERR_VALUE(slot_addr)))
> +               return;
> +
> +       area = tsk->mm->uprobes_xol_area;
> +       vma_end = area->vaddr + PAGE_SIZE;
> +       if (area->vaddr <= slot_addr && slot_addr < vma_end) {
> +               int slot_nr;
> +               unsigned long offset = slot_addr - area->vaddr;
> +               unsigned long flags;
> +
> +               slot_nr = offset / UPROBES_XOL_SLOT_BYTES;
> +               if (slot_nr >= UINSNS_PER_PAGE)
> +                       return;
> +
> +               spin_lock_irqsave(&area->slot_lock, flags);
> +               __clear_bit(slot_nr, area->bitmap);
> +               spin_unlock_irqrestore(&area->slot_lock, flags);
> +               atomic_dec(&area->slot_count);
> +               if (waitqueue_active(&area->wq))
> +                       wake_up(&area->wq);
> +               tsk->utask->xol_vaddr = 0;
> +       }
> +} 

So if you want to keep that slot_lock, you might as well make
->slot_count a normal integer.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-27 11:41         ` Peter Zijlstra
@ 2011-09-27 12:59           ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-27 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Ananth N Mavinakayanahalli, Hugh Dickins,
	Christoph Hellwig, Jonathan Corbet, Thomas Gleixner,
	Masami Hiramatsu, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Andi Kleen, LKML

* Peter Zijlstra <peterz@infradead.org> [2011-09-27 13:41:21]:

> On Mon, 2011-09-26 at 21:14 +0530, Srikar Dronamraju wrote:
> > > Why not something like:
> > > 
> > > 
> > > +static struct uprobe *__find_uprobe(struct inode * inode, loff_t offset,
> > >                                       bool inode_only)
> > > +{
> > >         struct uprobe u = { .inode = inode, .offset = inode_only ? 0 : offset };
> > > +       struct rb_node *n = uprobes_tree.rb_node;
> > > +       struct uprobe *uprobe;
> > >       struct uprobe *ret = NULL;
> > > +       int match;
> > > +
> > > +       while (n) {
> > > +               uprobe = rb_entry(n, struct uprobe, rb_node);
> > > +               match = match_uprobe(&u, uprobe);
> > > +               if (!match) {
> > >                       if (!inode_only)
> > >                              atomic_inc(&uprobe->ref);
> > > +                       return uprobe;
> > > +               }
> > >               if (inode_only && uprobe->inode == inode)
> > >                       ret = uprobe;
> > > +               if (match < 0)
> > > +                       n = n->rb_left;
> > > +               else
> > > +                       n = n->rb_right;
> > > +
> > > +       }
> > >         return ret;
> > > +}
> > > 
> > 
> > I am not comfortable with this change.
> > find_uprobe() was suppose to return back a uprobe if and only if
> > the inode and offset match,
> 
> And it will, because find_uprobe() will never expose that third
> argument.
> 
> >  However with your approach, we end up
> > returning a uprobe that isnt matching and one that isnt refcounted.
> > Moreover if even if we have a matching uprobe, we end up sending a
> > unrefcounted uprobe back. 
> 
> Because the matching isn't the important part, you want to return the
> leftmost node matching the specified inode. Also, in that case you
> explicitly don't want the ref, since the first thing you do on the
> call-site is drop the ref if there was a match. You don't care about
> inode:0 in particular, you want a place to start iterating all of
> inode:*.
> 

The case of taking a ref and then dropping it would arise if and only
if there is a matching uprobe, i.e. one with that inode and offset 0. I
don't think that would be the common case.

If you aren't comfortable passing the rb_node as the third argument, we
could pass a reference to the uprobe itself. But that would mean a
redundant dereference every time.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-27 12:32       ` Srikar Dronamraju
@ 2011-09-27 12:59         ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 12:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Hugh Dickins, Christoph Hellwig,
	Masami Hiramatsu, Thomas Gleixner, Ananth N Mavinakayanahalli,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Stephen Smalley, LKML, Eric Paris

On Tue, 2011-09-27 at 18:02 +0530, Srikar Dronamraju wrote:
> I used to keep the changelog after the --- marker, as Christoph Hellwig
> had suggested in https://lkml.org/lkml/2010/7/20/5
> However, "stg export" removes lines after the --- marker.

That's no excuse for writing shitty changelogs. Version logs contain the
incremental changes in each version, but the changelog should be a full
and proper description of the patch, irrespective of how many iterations
and changes it has undergone.



^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-09-20 12:05   ` Srikar Dronamraju
@ 2011-09-27 13:03     ` Peter Zijlstra
  -1 siblings, 0 replies; 330+ messages in thread
From: Peter Zijlstra @ 2011-09-27 13:03 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On Tue, 2011-09-20 at 17:35 +0530, Srikar Dronamraju wrote:
> +#ifdef CONFIG_UPROBES
> +       if (!group && t->utask && t->utask->active_uprobe)
> +               pending = &t->utask->delayed;
> +#endif
> +
>         /*
>          * Short-circuit ignored signals and support queuing
>          * exactly one non-rt signal, so that we can get more
> @@ -1106,6 +1111,11 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
>                 }
>         }
>  
> +#ifdef CONFIG_UPROBES
> +       if (!group && t->utask && t->utask->active_uprobe)
> +               return 0;
> +#endif
> +
>  out_set:
>         signalfd_notify(t, sig);
>         sigaddset(&pending->signal, sig);
> @@ -1569,6 +1579,13 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
>         }
>         q->info.si_overrun = 0;
>  
> +#ifdef CONFIG_UPROBES
> +       if (!group && t->utask && t->utask->active_uprobe) {
> +               pending = &t->utask->delayed;
> +               list_add_tail(&q->list, &pending->list);
> +               goto out;
> +       }
> +#endif
>         signalfd_notify(t, sig);
>         pending = group ? &t->signal->shared_pending : &t->pending;
>         list_add_tail(&q->list, &pending->list);
> @@ -2199,7 +2216,10 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
>                         spin_unlock_irq(&sighand->siglock);
>                         goto relock;
>                 }
> -
> +#ifdef CONFIG_UPROBES
> +               if (current->utask && current->utask->active_uprobe)
> +                       break;
> +#endif 

That's just crying for something like:

#ifdef CONFIG_UPROBES
static inline bool uprobe_delay_signal(struct task_struct *p)
{
	return p->utask && p->utask->active_uprobe;
}
#else
static inline bool uprobe_delay_signal(struct task_struct *p)
{
	return false;
}
#endif

That'll instantly kill the #ifdeffery as well as describe wtf you're
actually doing.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-27 11:37         ` Peter Zijlstra
@ 2011-09-27 13:08           ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-27 13:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Ananth N Mavinakayanahalli, Hugh Dickins,
	Christoph Hellwig, Jonathan Corbet, Thomas Gleixner,
	Masami Hiramatsu, Oleg Nesterov, Andrew Morton, Jim Keniston,
	Roland McGrath, Andi Kleen, LKML

* Peter Zijlstra <peterz@infradead.org> [2011-09-27 13:37:15]:

> On Mon, 2011-09-26 at 21:14 +0530, Srikar Dronamraju wrote:
> > 
> > > > +
> > > > +/*
> > > > + * Called from mmap_region.
> > > > + * called with mm->mmap_sem acquired.
> > > > + *
> > > > + * Return -ve no if we fail to insert probes and we cannot
> > > > + * bail-out.
> > > > + * Return 0 otherwise. i.e :
> > > > + * - successful insertion of probes
> > > > + * - (or) no possible probes to be inserted.
> > > > + * - (or) insertion of probes failed but we can bail-out.
> > > > + */
> > > > +int mmap_uprobe(struct vm_area_struct *vma)
> > > > +{
> > > > +   struct list_head tmp_list;
> > > > +   struct uprobe *uprobe, *u;
> > > > +   struct inode *inode;
> > > > +   int ret = 0;
> > > > +
> > > > +   if (!valid_vma(vma))
> > > > +           return ret;     /* Bail-out */
> > > > +
> > > > +   inode = igrab(vma->vm_file->f_mapping->host);
> > > > +   if (!inode)
> > > > +           return ret;
> > > > +
> > > > +   INIT_LIST_HEAD(&tmp_list);
> > > > +   mutex_lock(&uprobes_mmap_mutex);
> > > > +   build_probe_list(inode, &tmp_list);
> > > > +   list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > > > +           loff_t vaddr;
> > > > +
> > > > +           list_del(&uprobe->pending_list);
> > > > +           if (!ret && uprobe->consumers) {
> > > > +                   vaddr = vma->vm_start + uprobe->offset;
> > > > +                   vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > > > +                   if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > > > +                           continue;
> > > > +                   ret = install_breakpoint(vma->vm_mm, uprobe);
> > > > +
> > > > +                   if (ret && (ret == -ESRCH || ret == -EEXIST))
> > > > +                           ret = 0;
> > > > +           }
> > > > +           put_uprobe(uprobe);
> > > > +   }
> > > > +
> > > > +   mutex_unlock(&uprobes_mmap_mutex);
> > > > +   iput(inode);
> > > > +   return ret;
> > > > +}
> > > > +
> > > > +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> > > > +           struct inode *inode)
> > > > +{
> > > > +   struct uprobe *uprobe;
> > > > +   struct rb_node *n;
> > > > +   unsigned long flags;
> > > > +
> > > > +   n = uprobes_tree.rb_node;
> > > > +   spin_lock_irqsave(&uprobes_treelock, flags);
> > > > +   uprobe = __find_uprobe(inode, 0, &n);
> > > > +
> > > > +   /*
> > > > +    * If indeed there is a probe for the inode and with offset zero,
> > > > +    * then lets release its reference. (ref got thro __find_uprobe)
> > > > +    */
> > > > +   if (uprobe)
> > > > +           put_uprobe(uprobe);
> > > > +   for (; n; n = rb_next(n)) {
> > > > +           loff_t vaddr;
> > > > +
> > > > +           uprobe = rb_entry(n, struct uprobe, rb_node);
> > > > +           if (uprobe->inode != inode)
> > > > +                   break;
> > > > +           vaddr = vma->vm_start + uprobe->offset;
> > > > +           vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > > > +           if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > > > +                   continue;
> > > > +           atomic_dec(&vma->vm_mm->mm_uprobes_count);
> > > > +   }
> > > > +   spin_unlock_irqrestore(&uprobes_treelock, flags);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Called in context of a munmap of a vma.
> > > > + */
> > > > +void munmap_uprobe(struct vm_area_struct *vma)
> > > > +{
> > > > +   struct inode *inode;
> > > > +
> > > > +   if (!valid_vma(vma))
> > > > +           return;         /* Bail-out */
> > > > +
> > > > +   if (!atomic_read(&vma->vm_mm->mm_uprobes_count))
> > > > +           return;
> > > > +
> > > > +   inode = igrab(vma->vm_file->f_mapping->host);
> > > > +   if (!inode)
> > > > +           return;
> > > > +
> > > > +   dec_mm_uprobes_count(vma, inode);
> > > > +   iput(inode);
> > > > +   return;
> > > > +}
> > > 
> > > One has to wonder why mmap_uprobe() can be one function but
> > > munmap_uprobe() cannot.
> > > 
> > 
> > I didn't understand this comment. Can you please elaborate?
> > mmap_uprobe() uses build_probe_list() and munmap_uprobe() uses
> > dec_mm_uprobes_count().
> 
> Ah, I missed build_probe_list(), but I didn't see a reason for the
> existence of dec_mm_uprobe_count(), the name doesn't make sense and the
> content is 'small' enough to just put in munmap_uprobe.
> 
> To me it looks similar to the list iteration you have in mmap_uprobe(),
> you didn't split that out into another function either.

Hmm, to me what dec_mm_uprobes_count() does is similar to what
build_probe_list() does, except that build_probe_list() does an
atomic_inc + list_add while dec_mm_uprobes_count() does an atomic_dec.

When I kept that code inside munmap_uprobe(), I found most of it nested
too deep, hence I carved it out as a separate function.

I'm open to suggestions for renaming dec_mm_uprobes_count().

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-09-27 13:03     ` Peter Zijlstra
@ 2011-09-27 13:12       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-09-27 13:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Oleg Nesterov, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

* Peter Zijlstra <peterz@infradead.org> [2011-09-27 15:03:46]:

> On Tue, 2011-09-20 at 17:35 +0530, Srikar Dronamraju wrote:
> > +#ifdef CONFIG_UPROBES
> > +       if (!group && t->utask && t->utask->active_uprobe)
> > +               pending = &t->utask->delayed;
> > +#endif
> > +
> >         /*
> >          * Short-circuit ignored signals and support queuing
> >          * exactly one non-rt signal, so that we can get more
> > @@ -1106,6 +1111,11 @@ static int __send_signal(int sig, struct siginfo *info, struct task_struct *t,
> >                 }
> >         }
> >  
> > +#ifdef CONFIG_UPROBES
> > +       if (!group && t->utask && t->utask->active_uprobe)
> > +               return 0;
> > +#endif
> > +
> >  out_set:
> >         signalfd_notify(t, sig);
> >         sigaddset(&pending->signal, sig);
> > @@ -1569,6 +1579,13 @@ int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group)
> >         }
> >         q->info.si_overrun = 0;
> >  
> > +#ifdef CONFIG_UPROBES
> > +       if (!group && t->utask && t->utask->active_uprobe) {
> > +               pending = &t->utask->delayed;
> > +               list_add_tail(&q->list, &pending->list);
> > +               goto out;
> > +       }
> > +#endif
> >         signalfd_notify(t, sig);
> >         pending = group ? &t->signal->shared_pending : &t->pending;
> >         list_add_tail(&q->list, &pending->list);
> > @@ -2199,7 +2216,10 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
> >                         spin_unlock_irq(&sighand->siglock);
> >                         goto relock;
> >                 }
> > -
> > +#ifdef CONFIG_UPROBES
> > +               if (current->utask && current->utask->active_uprobe)
> > +                       break;
> > +#endif 
> 
> That's just crying for something like:
> 
> #ifdef CONFIG_UPROBES
> static inline bool uprobe_delay_signal(struct task_struct *p)
> {
> 	return p->utask && p->utask->active_uprobe;
> }
> #else
> static inline bool uprobe_delay_signal(struct task_struct *p)
> {
> 	return false;
> }
> #endif
> 
> That'll instantly kill the #ifdeffery as well as describe wtf you're
> actually doing.


Okay,

I did a rethink and implemented this patch a little differently, using
block_all_signals()/unblock_all_signals(). This wouldn't need the
#ifdeffery and requires no changes in kernel/signal.c.

Will post the same in the next patchset.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 19/26]   tracing: Extract out common code for kprobes/uprobes traceevents.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-09-28  5:04     ` Masami Hiramatsu
  -1 siblings, 0 replies; 330+ messages in thread
From: Masami Hiramatsu @ 2011-09-28  5:04 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton

(2011/09/20 21:03), Srikar Dronamraju wrote:
> Move parts of trace_kprobe.c that can be shared with upcoming
> trace_uprobe.c. Common code to kernel/trace/trace_probe.h and
> kernel/trace/trace_probe.c.

This seems to include unrelated changes (see below). Please separate them.
(Maybe into a "Use boolean instead of integer" patch? :))

[...]
> @@ -651,7 +107,7 @@ static struct trace_probe *alloc_trace_probe(const char *group,
>  					     void *addr,
>  					     const char *symbol,
>  					     unsigned long offs,
> -					     int nargs, int is_return)
> +					     int nargs, bool is_return)
>  {
>  	struct trace_probe *tp;
>  	int ret = -ENOMEM;
[...]

> @@ -1153,7 +366,7 @@ static int create_trace_probe(int argc, char **argv)
>  	 */
>  	struct trace_probe *tp;
>  	int i, ret = 0;
> -	int is_return = 0, is_delete = 0;
> +	bool is_return = false, is_delete = false;
>  	char *symbol = NULL, *event = NULL, *group = NULL;
>  	char *arg;
>  	unsigned long offset = 0;
> @@ -1162,11 +375,11 @@ static int create_trace_probe(int argc, char **argv)
>  
>  	/* argc must be >= 1 */
>  	if (argv[0][0] == 'p')
> -		is_return = 0;
> +		is_return = false;
>  	else if (argv[0][0] == 'r')
> -		is_return = 1;
> +		is_return = true;
>  	else if (argv[0][0] == '-')
> -		is_delete = 1;
> +		is_delete = true;
>  	else {
>  		pr_info("Probe definition must be started with 'p', 'r' or"
>  			" '-'.\n");

Also, this has bugs in the selftest code.

[...]
> @@ -2020,7 +1166,7 @@ static __init int kprobe_trace_self_tests_init(void)
>  
>  	pr_info("Testing kprobe tracing: ");
>  
> -	ret = command_trace_probe("p:testprobe kprobe_trace_selftest_target "
> +	ret = traceprobe_command("p:testprobe kprobe_trace_selftest_target "
>  				  "$stack $stack0 +0($stack)");
>  	if (WARN_ON_ONCE(ret)) {
>  		pr_warning("error on probing function entry.\n");
> @@ -2035,7 +1181,7 @@ static __init int kprobe_trace_self_tests_init(void)
>  			enable_trace_probe(tp, TP_FLAG_TRACE);
>  	}
>  
> -	ret = command_trace_probe("r:testprobe2 kprobe_trace_selftest_target "
> +	ret = traceprobe_command("r:testprobe2 kprobe_trace_selftest_target "
>  				  "$retval");
>  	if (WARN_ON_ONCE(ret)) {
>  		pr_warning("error on probing function return.\n");
> @@ -2055,13 +1201,13 @@ static __init int kprobe_trace_self_tests_init(void)
>  
>  	ret = target(1, 2, 3, 4, 5, 6);
>  
> -	ret = command_trace_probe("-:testprobe");
> +	ret = traceprobe_command_trace_probe("-:testprobe");
>  	if (WARN_ON_ONCE(ret)) {
>  		pr_warning("error on deleting a probe.\n");
>  		warn++;
>  	}
>  
> -	ret = command_trace_probe("-:testprobe2");
> +	ret = traceprobe_command_trace_probe("-:testprobe2");
>  	if (WARN_ON_ONCE(ret)) {
>  		pr_warning("error on deleting a probe.\n");
>  		warn++;

Both traceprobe_command(str) and traceprobe_command_trace_probe(str)
should instead be traceprobe_command(str, create_trace_probe).

Thank you,


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 25/26]   perf: Documentation for perf uprobes
  2011-09-20 12:05   ` Srikar Dronamraju
@ 2011-09-28  9:20     ` Masami Hiramatsu
  -1 siblings, 0 replies; 330+ messages in thread
From: Masami Hiramatsu @ 2011-09-28  9:20 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Oleg Nesterov, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton

(2011/09/20 21:05), Srikar Dronamraju wrote:
> Modify perf-probe.txt to include uprobe documentation

This change should be included in the 23rd and 24th patches,
because the documentation should be updated along with the tool
enhancements.

Thank you,

> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  tools/perf/Documentation/perf-probe.txt |   14 ++++++++++++++
>  1 files changed, 14 insertions(+), 0 deletions(-)
> 
> diff --git a/tools/perf/Documentation/perf-probe.txt b/tools/perf/Documentation/perf-probe.txt
> index 800775e..3c98a54 100644
> --- a/tools/perf/Documentation/perf-probe.txt
> +++ b/tools/perf/Documentation/perf-probe.txt
> @@ -78,6 +78,8 @@ OPTIONS
>  -F::
>  --funcs::
>  	Show available functions in given module or kernel.
> +	With -x/--exec, can also list functions in a user space executable
> +	/ shared library.
>  
>  --filter=FILTER::
>  	(Only for --vars and --funcs) Set filter. FILTER is a combination of glob
> @@ -98,6 +100,11 @@ OPTIONS
>  --max-probes::
>  	Set the maximum number of probe points for an event. Default is 128.
>  
> +-x::
> +--exec=PATH::
> +	Specify path to the executable or shared library file for user
> +	space tracing. Can also be used with --funcs option.
> +
>  PROBE SYNTAX
>  ------------
>  Probe points are defined by following syntax.
> @@ -182,6 +189,13 @@ Delete all probes on schedule().
>  
>   ./perf probe --del='schedule*'
>  
> +Add probes at zfree() function on /bin/zsh
> +
> + ./perf probe -x /bin/zsh zfree
> +
> +Add probes at malloc() function on libc
> +
> + ./perf probe -x /lib/libc.so.6 malloc
>  
>  SEE ALSO
>  --------
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-09-20 12:00   ` Srikar Dronamraju
@ 2011-10-03 12:46     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-03 12:46 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

On 09/20, Srikar Dronamraju wrote:
>
> +static struct vma_info *__find_next_vma_info(struct list_head *head,
> +			loff_t offset, struct address_space *mapping,
> +			struct vma_info *vi)
> +{
> +	struct prio_tree_iter iter;
> +	struct vm_area_struct *vma;
> +	struct vma_info *tmpvi;
> +	loff_t vaddr;
> +	unsigned long pgoff = offset >> PAGE_SHIFT;
> +	int existing_vma;
> +
> +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> +		if (!vma || !valid_vma(vma))
> +			return NULL;

!vma is not possible.

But I can't understand the !valid_vma(vma) check... We shouldn't return,
we should ignore this vma and continue, no? Otherwise, I can't see how
this can work if someone does, say, mmap(PROT_READ).

> +		existing_vma = 0;
> +		vaddr = vma->vm_start + offset;
> +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +		list_for_each_entry(tmpvi, head, probe_list) {
> +			if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
> +				existing_vma = 1;
> +				break;
> +			}
> +		}
> +		if (!existing_vma &&
> +				atomic_inc_not_zero(&vma->vm_mm->mm_users)) {

This looks suspicious. If atomic_inc_not_zero() can fail, i.e. if we can
see ->mm_users == 0, then why is it safe to touch this counter/memory?
How can we know ->mm_count != 0 ?

I _think_ this is probably correct: ->mm_users == 0 means we are racing
with mmput(); ->i_mmap_mutex and the fact that we found this vma guarantee
that mmput() can't pass unlink_file_vma(), and thus mmdrop() is not
possible. Maybe this needs a comment...

> +static struct vma_info *find_next_vma_info(struct list_head *head,
> +			loff_t offset, struct address_space *mapping)
> +{
> +	struct vma_info *vi, *retvi;
> +	vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
> +	if (!vi)
> +		return ERR_PTR(-ENOMEM);
> +
> +	INIT_LIST_HEAD(&vi->probe_list);

Looks unneeded.

> +	mutex_lock(&mapping->i_mmap_mutex);
> +	retvi = __find_next_vma_info(head, offset, mapping, vi);
> +	mutex_unlock(&mapping->i_mmap_mutex);

It is not clear why we can't race with mmap() after find_next_vma_info()
returns NULL. I guess this is solved by the next patches.

> +static int __register_uprobe(struct inode *inode, loff_t offset,
> +				struct uprobe *uprobe)
> +{
> +	struct list_head try_list;
> +	struct vm_area_struct *vma;
> +	struct address_space *mapping;
> +	struct vma_info *vi, *tmpvi;
> +	struct mm_struct *mm;
> +	int ret = 0;
> +
> +	mapping = inode->i_mapping;
> +	INIT_LIST_HEAD(&try_list);
> +	while ((vi = find_next_vma_info(&try_list, offset,
> +							mapping)) != NULL) {
> +		if (IS_ERR(vi)) {
> +			ret = -ENOMEM;
> +			break;
> +		}
> +		mm = vi->mm;
> +		down_read(&mm->mmap_sem);
> +		vma = find_vma(mm, (unsigned long) vi->vaddr);

But can we trust find_vma()? The original vma found by find_next_vma_info()
could have gone away; at the very least we should verify vi->vaddr >= vm_start.

And worse, I do not understand how we can trust ->vaddr. Can't we race with
sys_mremap() ?

> +static void __unregister_uprobe(struct inode *inode, loff_t offset,
> +						struct uprobe *uprobe)
> +{
> +	struct list_head try_list;
> +	struct address_space *mapping;
> +	struct vma_info *vi, *tmpvi;
> +	struct vm_area_struct *vma;
> +	struct mm_struct *mm;
> +
> +	mapping = inode->i_mapping;
> +	INIT_LIST_HEAD(&try_list);
> +	while ((vi = find_next_vma_info(&try_list, offset,
> +							mapping)) != NULL) {
> +		if (IS_ERR(vi))
> +			break;
> +		mm = vi->mm;
> +		down_read(&mm->mmap_sem);
> +		vma = find_vma(mm, (unsigned long) vi->vaddr);

Same problems...

> +		if (!vma || !valid_vma(vma)) {
> +			list_del(&vi->probe_list);
> +			kfree(vi);
> +			up_read(&mm->mmap_sem);
> +			mmput(mm);
> +			continue;
> +		}

Not sure about !valid_vma() (and note that __find_next_vma_info() does this
check too).

Suppose that register_uprobe() succeeds. After that, unregister_uprobe()
should work even if user space does mprotect(), which can make
valid_vma() false, right?

> +int register_uprobe(struct inode *inode, loff_t offset,
> +				struct uprobe_consumer *consumer)
> +{
> +	struct uprobe *uprobe;
> +	int ret = 0;
> +
> +	inode = igrab(inode);
> +	if (!inode || !consumer || consumer->next)
> +		return -EINVAL;
> +
> +	if (offset > inode->i_size)
> +		return -EINVAL;

I guess this needs i_size_read().

And every "return" in register/unregister leaks something.

> +
> +	mutex_lock(&inode->i_mutex);
> +	uprobe = alloc_uprobe(inode, offset);

Looks like, alloc_uprobe() doesn't need ->i_mutex.

OTOH,

> +void unregister_uprobe(struct inode *inode, loff_t offset,
> +				struct uprobe_consumer *consumer)
> +{
> +	struct uprobe *uprobe;
> +
> +	inode = igrab(inode);
> +	if (!inode || !consumer)
> +		return;
> +
> +	if (offset > inode->i_size)
> +		return;
> +
> +	uprobe = find_uprobe(inode, offset);
> +	if (!uprobe)
> +		return;
> +
> +	if (!del_consumer(uprobe, consumer)) {
> +		put_uprobe(uprobe);
> +		return;
> +	}
> +
> +	mutex_lock(&inode->i_mutex);
> +	if (!uprobe->consumers)
> +		__unregister_uprobe(inode, offset, uprobe);

It seems that del_consumer() should be done under ->i_mutex. If it
removes the last consumer, we can race with register_uprobe(), which
takes ->i_mutex before us and does another __register_uprobe(), no?

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-09-20 12:00   ` Srikar Dronamraju
@ 2011-10-03 13:37     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-03 13:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML

On 09/20, Srikar Dronamraju wrote:
>
> @@ -739,6 +740,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  	mm->pmd_huge_pte = NULL;
>  #endif
> +#ifdef CONFIG_UPROBES
> +	atomic_set(&mm->mm_uprobes_count,
> +			atomic_read(&oldmm->mm_uprobes_count));

Hmm. Why can't this race with install_breakpoint()/remove_breakpoint()
between the _read and the _set?

What about VM_DONTCOPY vma's with breakpoints ?

> -static int match_uprobe(struct uprobe *l, struct uprobe *r)
> +static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
>  {
> +	/*
> +	 * if match_inode is non NULL then indicate if the
> +	 * inode atleast match.
> +	 */
> +	if (match_inode)
> +		*match_inode = 0;
> +
>  	if (l->inode < r->inode)
>  		return -1;
>  	if (l->inode > r->inode)
>  		return 1;
>  	else {
> +		if (match_inode)
> +			*match_inode = 1;
> +

It is very possible I missed something, but imho this looks confusing.

This close_match logic is only needed by build_probe_list() and
dec_mm_uprobes_count(), and neither actually needs the returned
uprobe.

Instead of complicating match_uprobe() and __find_uprobe(), perhaps
it makes sense to add "struct rb_node *__find_close_rb_node(inode)" ?

> +static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
>  {
>  	/* Placeholder: Yet to be implemented */
> +	if (!uprobe->consumers)
> +		return 0;

How is it possible to see ->consumers == NULL?

OK, afaics it _is_ possible, but only because unregister does del_consumer()
without ->i_mutex, and that is a bug afaics (see the previous email).

Another user is mmap_uprobe() and it checks ->consumers != NULL itself (but
see below).

> +int mmap_uprobe(struct vm_area_struct *vma)
> +{
> +	struct list_head tmp_list;
> +	struct uprobe *uprobe, *u;
> +	struct inode *inode;
> +	int ret = 0;
> +
> +	if (!valid_vma(vma))
> +		return ret;	/* Bail-out */
> +
> +	inode = igrab(vma->vm_file->f_mapping->host);
> +	if (!inode)
> +		return ret;
> +
> +	INIT_LIST_HEAD(&tmp_list);
> +	mutex_lock(&uprobes_mmap_mutex);
> +	build_probe_list(inode, &tmp_list);
> +	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> +		loff_t vaddr;
> +
> +		list_del(&uprobe->pending_list);
> +		if (!ret && uprobe->consumers) {
> +			vaddr = vma->vm_start + uprobe->offset;
> +			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> +				continue;
> +			ret = install_breakpoint(vma->vm_mm, uprobe);

So. We are adding the new mapping, we should find all breakpoints this
file has in the start/end range.

We are holding ->mmap_sem... this seems enough to protect against
races with register/unregister. Except, what if __register_uprobe()
fails? In that case __unregister_uprobe() does delete_uprobe() at the
very end. What if mmap_uprobe() is called right before delete_uprobe()?

> +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> +		struct inode *inode)
> +{
> +	struct uprobe *uprobe;
> +	struct rb_node *n;
> +	unsigned long flags;
> +
> +	n = uprobes_tree.rb_node;
> +	spin_lock_irqsave(&uprobes_treelock, flags);
> +	uprobe = __find_uprobe(inode, 0, &n);
> +
> +	/*
> +	 * If indeed there is a probe for the inode and with offset zero,
> +	 * then lets release its reference. (ref got thro __find_uprobe)
> +	 */
> +	if (uprobe)
> +		put_uprobe(uprobe);
> +	for (; n; n = rb_next(n)) {
> +		loff_t vaddr;
> +
> +		uprobe = rb_entry(n, struct uprobe, rb_node);
> +		if (uprobe->inode != inode)
> +			break;
> +		vaddr = vma->vm_start + uprobe->offset;
> +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> +		if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> +			continue;
> +		atomic_dec(&vma->vm_mm->mm_uprobes_count);

So, this does atomic_dec() for each bp in this vma?

And the caller is

> @@ -1337,6 +1338,9 @@ unsigned long unmap_vmas(struct mmu_gather *tlb,
>  		if (unlikely(is_pfn_mapping(vma)))
>  			untrack_pfn_vma(vma, 0, 0);
>
> +		if (vma->vm_file)
> +			munmap_uprobe(vma);

Doesn't look right...

munmap_uprobe() assumes that the whole region goes away. This is
true in the munmap() case afaics; it does __split_vma() if necessary.

But what about truncate()? In that case the vma is not unmapped,
but unmap_vmas() is called anyway and [start, end) can be different.
IOW, unless I missed something (which is very possible), we can do
more atomic_dec()'s than needed.

Also, truncate() obviously changes ->i_size. Doesn't this mean
unregister_uprobe() will just return if offset > i_size? We need to
free the uprobes anyway.

MADV_DONTNEED? It calls unmap_vmas() too. And an application can do
madvise(MADV_DONTNEED) in a loop.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 5/26]   Uprobes: copy of the original instruction.
  2011-09-20 12:00   ` Srikar Dronamraju
@ 2011-10-03 16:29     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-03 16:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, LKML, Jim Keniston,
	Roland McGrath, Andi Kleen, Andrew Morton

On 09/20, Srikar Dronamraju wrote:
>
> +static int __copy_insn(struct address_space *mapping,
> +			struct vm_area_struct *vma, char *insn,
> +			unsigned long nbytes, unsigned long offset)
> +{
> +	struct file *filp = vma->vm_file;
> +	struct page *page;
> +	void *vaddr;
> +	unsigned long off1;
> +	unsigned long idx;
> +
> +	if (!filp)
> +		return -EINVAL;
> +
> +	idx = (unsigned long) (offset >> PAGE_CACHE_SHIFT);
> +	off1 = offset &= ~PAGE_MASK;
> +
> +	/*
> +	 * Ensure that the page that has the original instruction is
> +	 * populated and in page-cache.
> +	 */

Hmm. But how can we ensure that?

> +	page_cache_sync_readahead(mapping, &filp->f_ra, filp, idx, 1);

This schedules the i/o,

> +	page = grab_cache_page(mapping, idx);

This finds/locks the page in the page-cache,

> +	if (!page)
> +		return -ENOMEM;
> +
> +	vaddr = kmap_atomic(page);
> +	memcpy(insn, vaddr + off1, nbytes);

What if this page is not PageUptodate() ?

Somehow this assumes that the I/O has already completed; I don't
understand this.

But I am starting to think I simply do not understand this change.
To the point: I do not understand why we need copy_insn() at all.
We are going to replace this page; can't we save/analyze ->insn later,
when we copy the content of the old page? Most probably I missed
something simple...


> +static struct task_struct *get_mm_owner(struct mm_struct *mm)
> +{
> +	struct task_struct *tsk;
> +
> +	rcu_read_lock();
> +	tsk = rcu_dereference(mm->owner);
> +	if (tsk)
> +		get_task_struct(tsk);
> +	rcu_read_unlock();
> +	return tsk;
> +}

Hmm. Do we really need task_struct?

> -static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
> +static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
> +				struct vm_area_struct *vma, loff_t vaddr)
>  {
> -	/* Placeholder: Yet to be implemented */
> +	struct task_struct *tsk;
> +	unsigned long addr;
> +	int ret = -EINVAL;
> +
>  	if (!uprobe->consumers)
>  		return 0;
>
> -	atomic_inc(&mm->mm_uprobes_count);
> -	return 0;
> +	tsk = get_mm_owner(mm);
> +	if (!tsk)	/* task is probably exiting; bail-out */
> +		return -ESRCH;
> +
> +	if (vaddr > TASK_SIZE_OF(tsk))
> +		goto put_return;

But this should not be possible, no? How can it map this vaddr above
TASK_SIZE?

get_user_pages(tsk => NULL) is fine. Why else do we need mm->owner ?

Probably used by the next patches... Say, is_32bit_app(tsk). This
can use mm->context.ia32_compat (hopefully will be replaced with
MMF_COMPAT).

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 5/26]   Uprobes: copy of the original instruction.
  2011-10-03 16:29     ` Oleg Nesterov
@ 2011-10-05 10:52       ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-05 10:52 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, LKML, Jim Keniston,
	Roland McGrath, Andi Kleen, Andrew Morton

* Oleg Nesterov <oleg@redhat.com> [2011-10-03 18:29:05]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > +static int __copy_insn(struct address_space *mapping,
> > +			struct vm_area_struct *vma, char *insn,
> > +			unsigned long nbytes, unsigned long offset)
> > +{
> > +	struct file *filp = vma->vm_file;
> > +	struct page *page;
> > +	void *vaddr;
> > +	unsigned long off1;
> > +	unsigned long idx;
> > +
> > +	if (!filp)
> > +		return -EINVAL;
> > +
> > +	idx = (unsigned long) (offset >> PAGE_CACHE_SHIFT);
> > +	off1 = offset &= ~PAGE_MASK;
> > +
> > +	/*
> > +	 * Ensure that the page that has the original instruction is
> > +	 * populated and in page-cache.
> > +	 */
> 
> Hmm. But how we can ensure?


> 
> > +	page_cache_sync_readahead(mapping, &filp->f_ra, filp, idx, 1);
> 
> This schedules the i/o,
> 
> > +	page = grab_cache_page(mapping, idx);
> 
> This finds/locks the page in the page-cache,
> 
> > +	if (!page)
> > +		return -ENOMEM;
> > +
> > +	vaddr = kmap_atomic(page);
> > +	memcpy(insn, vaddr + off1, nbytes);
> 
> What if this page is not PageUptodate() ?

Since we do a synchronous readahead, I thought the page would be
populated and up to date.

Would these two lines after grab_cache_page help?

	if (!PageUptodate(page)) 
		mapping->a_ops->readpage(filp, page);


> 
> Somehow this assumes that the i/o was already completed, I don't
> understand this.
> 
> But I am starting to think I simply do not understand this change.
> To the point, I do not understand why we need copy_insn() at all.
> We are going to replace this page, can't we save/analyze ->insn later
> when we copy the content of the old page? Most probably I missed
> something simple...
> 
> 
> > +static struct task_struct *get_mm_owner(struct mm_struct *mm)
> > +{
> > +	struct task_struct *tsk;
> > +
> > +	rcu_read_lock();
> > +	tsk = rcu_dereference(mm->owner);
> > +	if (tsk)
> > +		get_task_struct(tsk);
> > +	rcu_read_unlock();
> > +	return tsk;
> > +}
> 
> Hmm. Do we really need task_struct?
> 
> > -static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
> > +static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe,
> > +				struct vm_area_struct *vma, loff_t vaddr)
> >  {
> > -	/* Placeholder: Yet to be implemented */
> > +	struct task_struct *tsk;
> > +	unsigned long addr;
> > +	int ret = -EINVAL;
> > +
> >  	if (!uprobe->consumers)
> >  		return 0;
> >
> > -	atomic_inc(&mm->mm_uprobes_count);
> > -	return 0;
> > +	tsk = get_mm_owner(mm);
> > +	if (!tsk)	/* task is probably exiting; bail-out */
> > +		return -ESRCH;
> > +
> > +	if (vaddr > TASK_SIZE_OF(tsk))
> > +		goto put_return;
> 
> But this should not be possible, no? How it can map this vaddr above
> TASK_SIZE ?
> 
> get_user_pages(tsk => NULL) is fine. Why else do we need mm->owner ?

> 
> Probably used by the next patches... Say, is_32bit_app(tsk). This
> can use mm->context.ia32_compat (hopefully will be replaced with
> MMF_COMPAT).
> 

We used the task_struct for checking whether the application was 32-bit
and for calling get_user_pages. Since we can pass NULL to get_user_pages,
and since we can use mm->context.ia32_compat or MMF_COMPAT, I will remove
get_mm_owner; that way we don't need to depend on CONFIG_MM_OWNER.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 5/26]   Uprobes: copy of the original instruction.
  2011-10-05 10:52       ` Srikar Dronamraju
@ 2011-10-05 15:11         ` Oleg Nesterov
  0 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-05 15:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, LKML, Jim Keniston,
	Roland McGrath, Andi Kleen, Andrew Morton

Srikar, warning.

I am going to discuss the things I do not really understand ;)
Hopefully someone will correct me if I am wrong.

On 10/05, Srikar Dronamraju wrote:
>
> * Oleg Nesterov <oleg@redhat.com> [2011-10-03 18:29:05]:
>
> > > +	page_cache_sync_readahead(mapping, &filp->f_ra, filp, idx, 1);
> >
> > This schedules the i/o,
> >
> > > +	page = grab_cache_page(mapping, idx);
> >
> > This finds/locks the page in the page-cache,
> >
> > > +	if (!page)
> > > +		return -ENOMEM;
> > > +
> > > +	vaddr = kmap_atomic(page);
> > > +	memcpy(insn, vaddr + off1, nbytes);
> >
> > What if this page is not PageUptodate() ?
>
> Since we do a synchronous read ahead, I thought the page would be
> populated and upto date.

What does this "synchronous" actually mean?

First of all, page_cache_sync_readahead() can simply return. Or
__do_page_cache_readahead() can "skip" the page if it is already in the
page cache.

IOW, we do not even know if ->readpage() was called. But even if it was
called, afaics (in general) the page will be unlocked and marked Uptodate
when I/O completes, not when ->readpage() returns.

> would these two lines after grab_cache_page help?
>
> 	if (!PageUptodate(page))
> 		mapping->a_ops->readpage(filp, page);

This doesn't look right. At least you need lock_page().

Anyway, why can't you simply use read_mapping_page() or even kernel_read()?
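For reference, this is roughly what __copy_insn() looks like when rebuilt
around read_mapping_page(), which starts the read if needed, waits for it
to complete, and fails if the page never becomes Uptodate. This is an
untested sketch against the 3.1-era page-cache API, not the posted patch;
error handling is abbreviated:

```c
static int __copy_insn(struct address_space *mapping, char *insn,
			unsigned long nbytes, unsigned long offset)
{
	struct page *page;
	void *vaddr;

	/*
	 * read_mapping_page() returns an ERR_PTR (e.g. -EIO) if the
	 * page cannot be brought Uptodate, so the PageUptodate()
	 * question goes away.
	 */
	page = read_mapping_page(mapping, offset >> PAGE_CACHE_SHIFT, NULL);
	if (IS_ERR(page))
		return PTR_ERR(page);

	vaddr = kmap_atomic(page);
	memcpy(insn, vaddr + (offset & ~PAGE_MASK), nbytes);
	kunmap_atomic(vaddr);
	page_cache_release(page);
	return 0;
}
```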

But the real question is:

> > But I am starting to think I simply do not understand this change.
> > To the point, I do not understand why we need copy_insn() at all.
> > We are going to replace this page, can't we save/analyze ->insn later
> > when we copy the content of the old page? Most probably I missed
> > something simple...

Could you please explain?

> > But this should not be possible, no? How can it map this vaddr above
> > TASK_SIZE?
> >
> > get_user_pages(tsk => NULL) is fine. Why else do we need mm->owner ?
>
> >
> > Probably used by the next patches... Say, is_32bit_app(tsk). This
> > can use mm->context.ia32_compat (hopefully will be replaced with
> > MMF_COMPAT).
> >
>
> We used the tsk struct for checking if the application was 32 bit and
> for calling get_user_pages. Since we can pass NULL to get_user_pages and
> since we can use mm->context.ia32_compat or MMF_COMPAT, I will remove
> get_mm_owner; that way we don't need to depend on CONFIG_MM_OWNER.

Great!

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-09-20 12:01   ` Srikar Dronamraju
@ 2011-10-05 15:48     ` Oleg Nesterov
  0 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-05 15:48 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On 09/20, Srikar Dronamraju wrote:
>
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -250,6 +250,9 @@ config ARCH_CPU_PROBE_RELEASE
>  	def_bool y
>  	depends on HOTPLUG_CPU
>
> +config ARCH_SUPPORTS_UPROBES
> +	def_bool y
> +

It seems you should also change the INSTRUCTION_DECODER entry.
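Presumably the change he means is to let uprobes support pull in the
instruction decoder; something along these lines (the existing
dependencies are from the then-current arch/x86/Kconfig, and the added
alternative is a guess, not the posted patch):

```
config INSTRUCTION_DECODER
	def_bool (KPROBES || PERF_EVENTS || ARCH_SUPPORTS_UPROBES)
```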

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 5/26]   Uprobes: copy of the original instruction.
  2011-10-03 16:29     ` Oleg Nesterov
@ 2011-10-05 16:09       ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-05 16:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, LKML, Jim Keniston,
	Roland McGrath, Andi Kleen, Andrew Morton

* Oleg Nesterov <oleg@redhat.com> [2011-10-03 18:29:05]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > +static int __copy_insn(struct address_space *mapping,
> > +			struct vm_area_struct *vma, char *insn,
> > +			unsigned long nbytes, unsigned long offset)
> > +{
> > +	struct file *filp = vma->vm_file;
> > +	struct page *page;
> > +	void *vaddr;
> > +	unsigned long off1;
> > +	unsigned long idx;
> > +
> > +	if (!filp)
> > +		return -EINVAL;
> > +
> > +	idx = (unsigned long) (offset >> PAGE_CACHE_SHIFT);
> > +	off1 = offset &= ~PAGE_MASK;
> > +
> > +	/*
> > +	 * Ensure that the page that has the original instruction is
> > +	 * populated and in page-cache.
> > +	 */
> 
> Hmm. But how can we ensure that?
> 
> > +	page_cache_sync_readahead(mapping, &filp->f_ra, filp, idx, 1);
> 
> This schedules the i/o,
> 
> > +	page = grab_cache_page(mapping, idx);
> 
> This finds/locks the page in the page-cache,
> 
> > +	if (!page)
> > +		return -ENOMEM;
> > +
> > +	vaddr = kmap_atomic(page);
> > +	memcpy(insn, vaddr + off1, nbytes);
> 
> What if this page is not PageUptodate() ?
> 
> Somehow this assumes that the i/o was already completed, I don't
> understand this.
> 
> But I am starting to think I simply do not understand this change.
> To the point, I do not understand why we need copy_insn() at all.
> We are going to replace this page, can't we save/analyze ->insn later
> when we copy the content of the old page? Most probably I missed
> something simple...
> 

Copying the instruction at the time we replace the original instruction
would have been ideal. However, there are a few irritants to handle.

 - While inserting the breakpoint, we might find that the original
   instruction is the breakpoint instruction itself. (This could
   happen if mmap_uprobe were to race with register_uprobe(), or if
   somebody else, like gdb, inserted a breakpoint.) How do we
   distinguish whether the breakpoint instruction was already in the
   text or somebody inserted a breakpoint in that address space?
   Since we read from the page cache, we can easily resolve this.

 - On archs like x86, with variable-size instructions, the original
   instruction can span two pages. This is because we copy the
   maximum instruction size from the given vaddr into a buffer for
   subsequent analysis, so copy_insn takes care of reading two pages
   if and when required.
   Currently, breakpoint insertion and removal assume that the
   breakpoint instruction is the smallest instruction size for that
   architecture, hence reading/writing one page in write_opcode is
   good enough.

 - Again, on archs with variable-size instructions, if two adjacent
   instructions are probed, the original instruction, when copied
   using get_user_pages, might already include a breakpoint. (This
   shouldn't have any effect on uprobes, though.)

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 8/26]   x86: analyze instruction and determine fixups.
  2011-10-05 15:48     ` Oleg Nesterov
@ 2011-10-05 16:12       ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-05 16:12 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-05 17:48:38]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -250,6 +250,9 @@ config ARCH_CPU_PROBE_RELEASE
> >  	def_bool y
> >  	depends on HOTPLUG_CPU
> >
> > +config ARCH_SUPPORTS_UPROBES
> > +	def_bool y
> > +
> 
> It seems you should also change the INSTRUCTION_DECODER entry.

Okay will do.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 9/26]   Uprobes: Background page replacement.
  2011-09-20 12:01   ` Srikar Dronamraju
@ 2011-10-05 16:19     ` Oleg Nesterov
  0 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-05 16:19 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

On 09/20, Srikar Dronamraju wrote:
>
> +int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
> +						uprobe_opcode_t *opcode)
> +{
> +	struct vm_area_struct *vma;
> +	struct page *page;
> +	void *vaddr_new;
> +	int ret;
> +
> +	ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
> +	if (ret <= 0)
> +		return ret;
> +	ret = -EINVAL;
> +
> +	/*
> +	 * We are interested in text pages only. Our pages of interest
> +	 * should be mapped for read and execute only. We desist from
> +	 * adding probes in write mapped pages since the breakpoints
> +	 * might end up in the file copy.
> +	 */
> +	if (!valid_vma(vma))
> +		goto put_out;

Another case where valid_vma() looks suspicious. We are going to restore
the original instruction. We shouldn't fail (at least, we shouldn't "leak"
->mm_uprobes_count) if ->vm_flags was changed between register_uprobe()
and unregister_uprobe().

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 10/26]   x86: Set instruction pointer.
  2011-09-20 12:01   ` Srikar Dronamraju
@ 2011-10-05 16:29     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-05 16:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML

On 09/20, Srikar Dronamraju wrote:
>
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -39,4 +39,5 @@ struct uprobe_arch_info {};
>  #endif
>  struct uprobe;
>  extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
> +extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);

Well, this is minor, but we already have instruction_pointer_set().

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-10-03 12:46     ` Oleg Nesterov
@ 2011-10-05 17:04       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-05 17:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

* Oleg Nesterov <oleg@redhat.com> [2011-10-03 14:46:40]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > +static struct vma_info *__find_next_vma_info(struct list_head *head,
> > +			loff_t offset, struct address_space *mapping,
> > +			struct vma_info *vi)
> > +{
> > +	struct prio_tree_iter iter;
> > +	struct vm_area_struct *vma;
> > +	struct vma_info *tmpvi;
> > +	loff_t vaddr;
> > +	unsigned long pgoff = offset >> PAGE_SHIFT;
> > +	int existing_vma;
> > +
> > +	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff) {
> > +		if (!vma || !valid_vma(vma))
> > +			return NULL;
> 
> !vma is not possible.
> 
> But I can't understand the !valid_vma(vma) check... We shouldn't return,
> we should ignore this vma and continue, no? Otherwise, I can't see how
> this can work if someone does, say, mmap(PROT_READ).

Agree. In fact I encountered this problem last week and have fixed it.
In my case, I had mapped the file read-write while trying to insert
probes.
The changed code looks like this:

	if (!vma) 
		return NULL;

	if (!valid_vma(vma))
		continue;

> 
> > +		existing_vma = 0;
> > +		vaddr = vma->vm_start + offset;
> > +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +		list_for_each_entry(tmpvi, head, probe_list) {
> > +			if (tmpvi->mm == vma->vm_mm && tmpvi->vaddr == vaddr) {
> > +				existing_vma = 1;
> > +				break;
> > +			}
> > +		}
> > +		if (!existing_vma &&
> > +				atomic_inc_not_zero(&vma->vm_mm->mm_users)) {
> 
> This looks suspicious. If atomic_inc_not_zero() can fail, iow if we can
> see ->mm_users == 0, then why is it safe to touch this counter/memory?
> How can we know ->mm_count != 0?
> 
> I _think_ this is probably correct, ->mm_users == 0 means we are racing
> mmput(), ->i_mmap_mutex and the fact we found this vma guarantees that
> mmput() can't pass unlink_file_vma() and thus mmdrop() is not possible.
> Maybe this needs a comment...
> 
> > +static struct vma_info *find_next_vma_info(struct list_head *head,
> > +			loff_t offset, struct address_space *mapping)
> > +{
> > +	struct vma_info *vi, *retvi;
> > +	vi = kzalloc(sizeof(struct vma_info), GFP_KERNEL);
> > +	if (!vi)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	INIT_LIST_HEAD(&vi->probe_list);
> 
> Looks unneeded.
> 
> > +	mutex_lock(&mapping->i_mmap_mutex);
> > +	retvi = __find_next_vma_info(head, offset, mapping, vi);
> > +	mutex_unlock(&mapping->i_mmap_mutex);
> 
> It is not clear why we can't race with mmap() after find_next_vma_info()
> returns NULL. I guess this is solved by the next patches.

I assume mmap_uprobe() solves this.

> 
> > +static int __register_uprobe(struct inode *inode, loff_t offset,
> > +				struct uprobe *uprobe)
> > +{
> > +	struct list_head try_list;
> > +	struct vm_area_struct *vma;
> > +	struct address_space *mapping;
> > +	struct vma_info *vi, *tmpvi;
> > +	struct mm_struct *mm;
> > +	int ret = 0;
> > +
> > +	mapping = inode->i_mapping;
> > +	INIT_LIST_HEAD(&try_list);
> > +	while ((vi = find_next_vma_info(&try_list, offset,
> > +							mapping)) != NULL) {
> > +		if (IS_ERR(vi)) {
> > +			ret = -ENOMEM;
> > +			break;
> > +		}
> > +		mm = vi->mm;
> > +		down_read(&mm->mmap_sem);
> > +		vma = find_vma(mm, (unsigned long) vi->vaddr);
> 
> But we can't trust find_vma? The original vma found by find_next_vma_info()
> could go away, at least we should verify vi->vaddr >= vm_start.

Yes, Peter has already pointed this out and I have fixed this too.
Should be fixed in the next iteration.

> 
> And worse, I do not understand how we can trust ->vaddr. Can't we race with
> sys_mremap() ?
> 
> > +static void __unregister_uprobe(struct inode *inode, loff_t offset,
> > +						struct uprobe *uprobe)
> > +{
> > +	struct list_head try_list;
> > +	struct address_space *mapping;
> > +	struct vma_info *vi, *tmpvi;
> > +	struct vm_area_struct *vma;
> > +	struct mm_struct *mm;
> > +
> > +	mapping = inode->i_mapping;
> > +	INIT_LIST_HEAD(&try_list);
> > +	while ((vi = find_next_vma_info(&try_list, offset,
> > +							mapping)) != NULL) {
> > +		if (IS_ERR(vi))
> > +			break;
> > +		mm = vi->mm;
> > +		down_read(&mm->mmap_sem);
> > +		vma = find_vma(mm, (unsigned long) vi->vaddr);
> 
> Same problems...
> 
> > +		if (!vma || !valid_vma(vma)) {
> > +			list_del(&vi->probe_list);
> > +			kfree(vi);
> > +			up_read(&mm->mmap_sem);
> > +			mmput(mm);
> > +			continue;
> > +		}
> 
> Not sure about !valid_vma() (and note that __find_next_vma_info() does
> this check too).
> 
> Suppose that register_uprobe() succeeds. After that unregister_ should work
> even if user-space does mprotect() which can make valid_vma() == F, right?
> 

Agreed. If we want __find_next_vma_info() to also skip the valid_vma()
check while unregistering, then we would have to pass an additional
parameter.

> > +int register_uprobe(struct inode *inode, loff_t offset,
> > +				struct uprobe_consumer *consumer)
> > +{
> > +	struct uprobe *uprobe;
> > +	int ret = 0;
> > +
> > +	inode = igrab(inode);
> > +	if (!inode || !consumer || consumer->next)
> > +		return -EINVAL;
> > +
> > +	if (offset > inode->i_size)
> > +		return -EINVAL;
> 
> I guess this needs i_size_read().

Okay.

> 
> And every "return" in register/unregister leaks something.

Yes, this was pointed out by Stefan Hajnoczi earlier.
I have taken care of this.

> 
> > +
> > +	mutex_lock(&inode->i_mutex);
> > +	uprobe = alloc_uprobe(inode, offset);
> 
> Looks like, alloc_uprobe() doesn't need ->i_mutex.


Actually this was pointed out by you in the last review.
https://lkml.org/lkml/2011/7/24/91

So if we call alloc_uprobe() without the lock and succeed, but then
contend on the lock, unregister can erase the uprobe from the rbtree in
the meantime. We would end up with a valid uprobe that is no longer in
the rbtree, right?

> 
> OTOH,
> 
> > +void unregister_uprobe(struct inode *inode, loff_t offset,
> > +				struct uprobe_consumer *consumer)
> > +{
> > +	struct uprobe *uprobe;
> > +
> > +	inode = igrab(inode);
> > +	if (!inode || !consumer)
> > +		return;
> > +
> > +	if (offset > inode->i_size)
> > +		return;
> > +
> > +	uprobe = find_uprobe(inode, offset);
> > +	if (!uprobe)
> > +		return;
> > +
> > +	if (!del_consumer(uprobe, consumer)) {
> > +		put_uprobe(uprobe);
> > +		return;
> > +	}
> > +
> > +	mutex_lock(&inode->i_mutex);
> > +	if (!uprobe->consumers)
> > +		__unregister_uprobe(inode, offset, uprobe);
> 
> It seems that del_consumer() should be done under ->i_mutex. If it
> removes the last consumer, we can race with register_uprobe() which
> takes ->i_mutex before us and does another __register_uprobe(), no?

We should still be okay, because we check for consumers before we do the
actual unregister via __unregister_uprobe(). Since a consumer has been
added again by the time we get the lock, we don't do the actual
unregistration, and proceed as if del_consumer() deleted one consumer
but not the last.

Or am I missing something?

-- 
Thanks and Regards
Srikar

> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
  2011-09-26 16:25         ` Peter Zijlstra
@ 2011-10-05 17:48           ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-05 17:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Srikar Dronamraju, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Andrew Morton,
	Jim Keniston, Roland McGrath, Andi Kleen, LKML

On 09/26, Peter Zijlstra wrote:
>
> On Mon, 2011-09-26 at 21:31 +0530, Srikar Dronamraju wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2011-09-26 15:59:13]:
> >
> > > On Tue, 2011-09-20 at 17:32 +0530, Srikar Dronamraju wrote:
> > > > 						Hence provide some extra
> > > > + * time (by way of synchronize_sched() for breakpoint hit threads to acquire
> > > > + * the uprobes_treelock before the uprobe is removed from the rbtree.
> > >
> > > 'Some extra time' doesn't make me all warm an fuzzy inside, but instead
> > > screams we fudge around a race condition.
> >
> > The extra time provided is sufficient to avoid the race. So will modify
> > it to mean "sufficient" instead of "some".
> >
> > Would that suffice?
>
> Much better, for extra point, explain why its sufficient as well ;-)

+1 ;)

I can't understand why synchronize_sched() helps. In fact it is very
possible I simply misunderstood the problem; I'd appreciate it if you
could explain.

Just for example. Suppose that uprobe_notify_resume() sleeps in
down_read(mmap_sem). In this case synchronize_sched() can return even
before it takes this sem; how can this help the subsequent
find_uprobe()? Or the task can simply be preempted before that.

Or I missed the point completely?

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 5/26]   Uprobes: copy of the original instruction.
  2011-10-05 16:09       ` Srikar Dronamraju
@ 2011-10-05 17:53         ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-05 17:53 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Jonathan Corbet, LKML, Jim Keniston,
	Roland McGrath, Andi Kleen, Andrew Morton

On 10/05, Srikar Dronamraju wrote:
>
> * Oleg Nesterov <oleg@redhat.com> [2011-10-03 18:29:05]:
>
> > But I am starting to think I simply do not understand this change.
> > To the point, I do not understand why we need copy_insn() at all.
> > We are going to replace this page, can't we save/analyze ->insn later
> > when we copy the content of the old page? Most probably I missed
> > something simple...
> >
>
> Copying the instruction at the time we replace the original instruction
> would have been ideal. However there are a few irritants to handle.
>
> ...
>    How do we distinguish whether the
>    breakpoint instruction was already in the text or somebody inserted a
>    breakpoint in that address-space? Since we read from the page-cache,
>    we can easily resolve this.

Ah. I see.

> -  On archs like x86, with variable size instructions, the original
>    instruction can be across 2 pages.

Heh. Indeed ;)

Thanks Srikar.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-09-27 13:12       ` Srikar Dronamraju
@ 2011-10-05 18:01         ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-05 18:01 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

Srikar, I am still reading this series, need more time to read this
patch, but:

On 09/27, Srikar Dronamraju wrote:
>
> I did a rethink and implemented this patch a little differently using
> block_all_signals, unblock_all_signals. This wouldn't need the
> #ifdeffery + no changes in kernel/signal.c

No, please don't. block_all_signals() must be killed. This interface
simply does not work. At all. It is buggy as hell. I guess I should ping
David Airlie again.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-10-05 17:04       ` Srikar Dronamraju
@ 2011-10-05 18:50         ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-05 18:50 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

On 10/05, Srikar Dronamraju wrote:
>
> Agree. Infact I encountered this problem last week and had fixed it.
> In mycase, I had mapped the file read and write while trying to insert
> probes.
> The changed code looks like this
>
> 	if (!vma)
> 		return NULL;

This is unneeded, vma_prio_tree_foreach() stops when vma_prio_tree_next()
returns NULL. IOW, you can never see vma == NULL.

> 	if (!valid_vma(vma))
> 		continue;

Yes.

> > > +	mutex_lock(&inode->i_mutex);
> > > +	uprobe = alloc_uprobe(inode, offset);
> >
> > Looks like, alloc_uprobe() doesn't need ->i_mutex.
>
>
> Actually this was pointed out by you in the last review.
> https://lkml.org/lkml/2011/7/24/91

Oops ;) maybe this deserves a comment...

> > > +void unregister_uprobe(struct inode *inode, loff_t offset,
> > > +				struct uprobe_consumer *consumer)
> > > +{
> > > +	struct uprobe *uprobe;
> > > +
> > > +	inode = igrab(inode);
> > > +	if (!inode || !consumer)
> > > +		return;
> > > +
> > > +	if (offset > inode->i_size)
> > > +		return;
> > > +
> > > +	uprobe = find_uprobe(inode, offset);
> > > +	if (!uprobe)
> > > +		return;
> > > +
> > > +	if (!del_consumer(uprobe, consumer)) {
> > > +		put_uprobe(uprobe);
> > > +		return;
> > > +	}
> > > +
> > > +	mutex_lock(&inode->i_mutex);
> > > +	if (!uprobe->consumers)
> > > +		__unregister_uprobe(inode, offset, uprobe);
> >
> > It seemes that del_consumer() should be done under ->i_mutex. If it
> > removes the last consumer, we can race with register_uprobe() which
> > takes ->i_mutex before us and does another __register_uprobe(), no?
>
> We should still be okay, because we check for the consumers before we
> do the actual unregister in the form of __unregister_uprobe().
> Since the consumer is added again by the time we get the lock, we don't
> do the actual unregistration and proceed as if del_consumer() deleted one
> consumer but not the last.

Yes, but I meant in this case register_uprobe() does the unnecessary
__register_uprobe() because it sees ->consumers == NULL (add_consumer()
returns NULL).

I guess this is probably harmless because of is_bkpt_insn/-EEXIST
logic, but still.


Btw. __register_uprobe() does

		ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
		if (ret && (ret != -ESRCH || ret != -EEXIST)) {
			up_read(&mm->mmap_sem);
			mmput(mm);
			break;
		}
		ret = 0;
		up_read(&mm->mmap_sem);
		mmput(mm);

Yes, this is cosmetic, but why do we duplicate up_read/mmput ?

Up to you, but

		ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
		up_read(&mm->mmap_sem);
		mmput(mm);

		if (ret) {
			if (ret != -ESRCH && ret != -EEXIST)
				break;
			ret = 0;
		}

Looks a bit simpler.

Oh, wait. I just noticed that the original code does

	(ret != -ESRCH || ret != -EEXIST)

this expression is always true ;)

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread


* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-05 18:01         ` Oleg Nesterov
@ 2011-10-06  5:47           ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-06  5:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-05 20:01:39]:

> Srikar, I am still reading this series, need more time to read this
> patch, but:

Okay, 

> 
> On 09/27, Srikar Dronamraju wrote:
> >
> > I did a rethink and implemented this patch a little differently using
> > block_all_signals, unblock_all_signals. This wouldn't need the
> > #ifdeffery + no changes in kernel/signal.c
> 
> No, please don't. block_all_signals() must be killed. This interface
> simply does not work. At all. It is buggy as hell. I guess I should ping
> David Airlie again.
> 

I could use sigprocmask instead of block_all_signals.

The patch (that I sent out as part of the v5 patchset) uses a per-task
pending sigqueue and starts queueing signals when the task
single-steps. After the single-step completes, it walks through the
pending signals.

But I was wondering if I should block signals instead of queueing them
in a separate sigqueue. The idea is to block signals just before the
task enables single-step and unblock them after the task disables
single-step.

Instead of using block_all_signals, I could use sigprocmask to achieve
the same.

Which approach do you suggest, or do you have another approach in mind?

-- 
Thanks and Regards
Srikar




* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-10-05 18:50         ` Oleg Nesterov
@ 2011-10-06  6:51           ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-06  6:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

* Oleg Nesterov <oleg@redhat.com> [2011-10-05 20:50:08]:

> On 10/05, Srikar Dronamraju wrote:
> >
> > Agree. In fact I encountered this problem last week and had fixed it.
> > In my case, I had mapped the file read and write while trying to insert
> > probes.
> > The changed code looks like this
> >
> > 	if (!vma)
> > 		return NULL;
> 
> This is unneeded, vma_prio_tree_foreach() stops when vma_prio_tree_next()
> returns NULL. IOW, you can never see vma == NULL.

Agree.

> 
> > 	if (!valid_vma(vma))
> > 		continue;
> 
> Yes.
> 
> > > > +	mutex_lock(&inode->i_mutex);
> > > > +	uprobe = alloc_uprobe(inode, offset);
> > >
> > > Looks like, alloc_uprobe() doesn't need ->i_mutex.
> >
> >
> > Actually this was pointed out by you in the last review.
> > https://lkml.org/lkml/2011/7/24/91
> 
> OOPS ;) may be deserves a comment...

will add a comment.

> 
> > > > +void unregister_uprobe(struct inode *inode, loff_t offset,
> > > > +				struct uprobe_consumer *consumer)
> > > > +{
> > > > +	struct uprobe *uprobe;
> > > > +
> > > > +	inode = igrab(inode);
> > > > +	if (!inode || !consumer)
> > > > +		return;
> > > > +
> > > > +	if (offset > inode->i_size)
> > > > +		return;
> > > > +
> > > > +	uprobe = find_uprobe(inode, offset);
> > > > +	if (!uprobe)
> > > > +		return;
> > > > +
> > > > +	if (!del_consumer(uprobe, consumer)) {
> > > > +		put_uprobe(uprobe);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	mutex_lock(&inode->i_mutex);
> > > > +	if (!uprobe->consumers)
> > > > +		__unregister_uprobe(inode, offset, uprobe);
> > >
> > > It seems that del_consumer() should be done under ->i_mutex. If it
> > > removes the last consumer, we can race with register_uprobe() which
> > > takes ->i_mutex before us and does another __register_uprobe(), no?
> >
> > We should still be okay, because we check for the consumers before we
> > do the actual unregister in the form of __unregister_uprobe().
> > Since the consumer is added again by the time we get the lock, we don't
> > do the actual unregistration and proceed as if del_consumer() deleted one
> > consumer but not the last.
> 
> Yes, but I meant in this case register_uprobe() does the unnecessary
> __register_uprobe() because it sees ->consumers == NULL (add_consumer()
> returns NULL).

Yes, we might be doing an unnecessary __register_uprobe(), but because it
raced with unregister_uprobe() and got the lock, we would avoid doing a
__unregister_uprobe().

However, I am okay with moving the lock before del_consumer(). Please let
me know which you prefer.

> 
> I guess this is probably harmless because of is_bkpt_insn/-EEXIST
> logic, but still.
> 

Agree.

> 
> Btw. __register_uprobe() does
> 
> 		ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
> 		if (ret && (ret != -ESRCH || ret != -EEXIST)) {
> 			up_read(&mm->mmap_sem);
> 			mmput(mm);
> 			break;
> 		}
> 		ret = 0;
> 		up_read(&mm->mmap_sem);
> 		mmput(mm);
> 
> Yes, this is cosmetic, but why do we duplicate up_read/mmput ?
> 
> Up to you, but
> 
> 		ret = install_breakpoint(mm, uprobe, vma, vi->vaddr);
> 		up_read(&mm->mmap_sem);
> 		mmput(mm);
> 
> 		if (ret) {
> 			if (ret != -ESRCH && ret != -EEXIST)
> 				break;
> 			ret = 0;
> 		}
> 
> Looks a bit simpler.

Okay, will do.

> 
> Oh, wait. I just noticed that the original code does
> 
> 	(ret != -ESRCH || ret != -EEXIST)
> 
> this expression is always true ;)

Right, will correct this.
> 
> Oleg.
> 

-- 
Thanks and Regards
Srikar


* Re: [PATCH v5 3.1.0-rc4-tip 9/26]   Uprobes: Background page replacement.
  2011-10-05 16:19     ` Oleg Nesterov
@ 2011-10-06  6:53       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-06  6:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

* Oleg Nesterov <oleg@redhat.com> [2011-10-05 18:19:14]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > +int __weak read_opcode(struct task_struct *tsk, unsigned long vaddr,
> > +						uprobe_opcode_t *opcode)
> > +{
> > +	struct vm_area_struct *vma;
> > +	struct page *page;
> > +	void *vaddr_new;
> > +	int ret;
> > +
> > +	ret = get_user_pages(tsk, tsk->mm, vaddr, 1, 0, 0, &page, &vma);
> > +	if (ret <= 0)
> > +		return ret;
> > +	ret = -EINVAL;
> > +
> > +	/*
> > +	 * We are interested in text pages only. Our pages of interest
> > +	 * should be mapped for read and execute only. We desist from
> > +	 * adding probes in write mapped pages since the breakpoints
> > +	 * might end up in the file copy.
> > +	 */
> > +	if (!valid_vma(vma))
> > +		goto put_out;
> 
> Another case where valid_vma() looks suspicious. We are going to restore
> the original instruction. We shouldn't fail (at least we shouldn't "leak"
> ->mm_uprobes_count) if ->vm_flags was changed between register_uprobe()
> and unregister_uprobe().
> 

Agree.

-- 
Thanks and Regards
Srikar


* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-10-03 13:37     ` Oleg Nesterov
@ 2011-10-06 11:05       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-06 11:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-03 15:37:10]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > @@ -739,6 +740,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  	mm->pmd_huge_pte = NULL;
> >  #endif
> > +#ifdef CONFIG_UPROBES
> > +	atomic_set(&mm->mm_uprobes_count,
> > +			atomic_read(&oldmm->mm_uprobes_count));
> 
> Hmm. Why this can't race with install_breakpoint/remove_breakpoint
> between _read and _set ?

At this point the child's vmas are not yet created, so I don't see how
install_breakpoint/remove_breakpoint from the child could affect this.

However, if install_breakpoint/remove_breakpoint happens in the parent's
context between now and the vma_prio_tree_add (actually the
down_write(oldmm->mmap_sem) in dup_mmap()), then the count in the child
may not be the right one. If you are pointing to this race, then it's
probably bigger than just between the read and the set.

Or are you talking about some other issue?

> 
> What about VM_DONTCOPY vma's with breakpoints ?

Ah... I have missed this.

One solution could be to call an mmap_uprobe()-like routine just before
we release the child's mmap_sem, but after we do vma_prio_tree_add.

This should also solve the problem of install_breakpoint/remove_breakpoint
being called in the parent's context that we talked about above.

> 
> > -static int match_uprobe(struct uprobe *l, struct uprobe *r)
> > +static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
> >  {
> > +	/*
> > +	 * if match_inode is non NULL then indicate if the
> > +	 * inode atleast match.
> > +	 */
> > +	if (match_inode)
> > +		*match_inode = 0;
> > +
> >  	if (l->inode < r->inode)
> >  		return -1;
> >  	if (l->inode > r->inode)
> >  		return 1;
> >  	else {
> > +		if (match_inode)
> > +			*match_inode = 1;
> > +
> 
> It is very possible I missed something, but imho this looks confusing.
> 
> This close_match logic is only needed for build_probe_list() and
> dec_mm_uprobes_count(), and both do not actually need the returned
> uprobe.
> 
> Instead of complicating match_uprobe() and __find_uprobe(), perhaps
> it makes sense to add "struct rb_node *__find_close_rb_node(inode)" ?


Yes, we can do this too.

> 
> > +static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
> >  {
> >  	/* Placeholder: Yet to be implemented */
> > +	if (!uprobe->consumers)
> > +		return 0;
> 
> How it is possible to see ->consumers == NULL?
> 

The consumers == NULL check is mostly for the mmap_uprobe() path.
__register_uprobe() and __unregister_uprobe() use the same lock to
serialize, so they can check consumers after taking the lock.

> OK, afaics it _is_ possible, but only because unregister does del_consumer()
> without ->i_mutex, but this is bug afaics (see the previous email).

We have discussed this in the other thread.

> 
> Another user is mmap_uprobe() and it checks ->consumers != NULL itself (but
> see below).
> 
> > +int mmap_uprobe(struct vm_area_struct *vma)
> > +{
> > +	struct list_head tmp_list;
> > +	struct uprobe *uprobe, *u;
> > +	struct inode *inode;
> > +	int ret = 0;
> > +
> > +	if (!valid_vma(vma))
> > +		return ret;	/* Bail-out */
> > +
> > +	inode = igrab(vma->vm_file->f_mapping->host);
> > +	if (!inode)
> > +		return ret;
> > +
> > +	INIT_LIST_HEAD(&tmp_list);
> > +	mutex_lock(&uprobes_mmap_mutex);
> > +	build_probe_list(inode, &tmp_list);
> > +	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > +		loff_t vaddr;
> > +
> > +		list_del(&uprobe->pending_list);
> > +		if (!ret && uprobe->consumers) {
> > +			vaddr = vma->vm_start + uprobe->offset;
> > +			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > +				continue;
> > +			ret = install_breakpoint(vma->vm_mm, uprobe);
> 
> So. We are adding the new mapping, we should find all breakpoints this
> file has in the start/end range.
> 
> We are holding ->mmap_sem... this seems enough to protect against the
> races with register/unregister. Except, what if __register_uprobe()
> fails? In this case __unregister_uprobe() does delete_uprobe() at the
> very end. What if mmap_uprobe() is called right before delete_?
> 

Because consumers would be NULL before _unregister_uprobe() kicks in, we
shouldn't have a problem here.

_unregister_uprobe() and mmap_uprobe() would race for
down_read(&mm->mmap_sem). If _unregister_uprobe() gets the read lock
first, then by the time mmap_uprobe() gets to run, consumers would be
NULL and we are fine, since we don't go ahead and insert.

If mmap_uprobe() were to get the write lock, _unregister_uprobe() would
do the necessary cleanup.

We are checking consumers twice, but that's just being conservative;
we should be able to do with just one check too.

Am I missing something?

> > +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> > +		struct inode *inode)
> > +{
> > +	struct uprobe *uprobe;
> > +	struct rb_node *n;
> > +	unsigned long flags;
> > +
> > +	n = uprobes_tree.rb_node;
> > +	spin_lock_irqsave(&uprobes_treelock, flags);
> > +	uprobe = __find_uprobe(inode, 0, &n);
> > +
> > +	/*
> > +	 * If indeed there is a probe for the inode and with offset zero,
> > +	 * then lets release its reference. (ref got thro __find_uprobe)
> > +	 */
> > +	if (uprobe)
> > +		put_uprobe(uprobe);
> > +	for (; n; n = rb_next(n)) {
> > +		loff_t vaddr;
> > +
> > +		uprobe = rb_entry(n, struct uprobe, rb_node);
> > +		if (uprobe->inode != inode)
> > +			break;
> > +		vaddr = vma->vm_start + uprobe->offset;
> > +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +		if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > +			continue;
> > +		atomic_dec(&vma->vm_mm->mm_uprobes_count);
> 
> So, this does atomic_dec() for each bp in this vma?

Yes.

> 
> And the caller is
> 
> > @@ -1337,6 +1338,9 @@ unsigned long unmap_vmas(struct mmu_gather *tlb,
> >  		if (unlikely(is_pfn_mapping(vma)))
> >  			untrack_pfn_vma(vma, 0, 0);
> >
> > +		if (vma->vm_file)
> > +			munmap_uprobe(vma);
> 
> Doesn't look right...
> 
> munmap_uprobe() assumes that the whole region goes away. This is
> true in munmap() case afaics, it does __split_vma() if necessary.
> 
> But what about truncate() ? In this case this vma is not unmapped,
> but unmap_vmas() is called anyway and [start, end) can be different.
> IOW, unless I missed something (this is very possible) we can do
> more atomic_dec's than needed.
> 
Would unlink_file_vma() be a good place to call munmap_uprobe()?

The other idea could be to call munmap_uprobe() in unmap_region() just
before free_pgtables(), and to call atomic_set() to set the count to 0 in
exit_mmap() (again before free_pgtables()).


One other thing that we probably need to do in mmap_uprobe() is cache
the number of probes mmap_uprobe() installed successfully and then
subtract that from mm_uprobes_count if and only if mmap_uprobe()
were to return a negative number.

> Also, truncate() obviously changes ->i_size. Doesn't this mean
> unregister_uprobe() should return if offset > i_size ? We need to
> free uprobes anyway.

Do you mean we shouldn't check the offset in unregister_uprobe() and
should just search the rbtree for the matching uprobe?
That's also possible to do.

> 
> MADV_DONTNEED? It calls unmap_vmas() too. And application can do
> madvise(DONTNEED) in a loop.
> 

I think this would be taken care of if we move the munmap_uprobe() hook
from unmap_vmas() to unlink_file_vma().

The other thing that I need to investigate a bit more is whether I have
handled all the cases of mremap correctly.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
@ 2011-10-06 11:05       ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-06 11:05 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-03 15:37:10]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > @@ -739,6 +740,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >  	mm->pmd_huge_pte = NULL;
> >  #endif
> > +#ifdef CONFIG_UPROBES
> > +	atomic_set(&mm->mm_uprobes_count,
> > +			atomic_read(&oldmm->mm_uprobes_count));
> 
> Hmm. Why this can't race with install_breakpoint/remove_breakpoint
> between _read and _set ?

At this time the child vmas are not yet created, so I dont see a 
install_breakpoints/remove_breakpoints from child affecting.

However if install_breakpoints/remove_breakpoints happen from a parent
context, from now on till we do a vma_prio_tree_add (actually down_write(oldmm->mmap_sem) in dup_mmap()),  then the count in the child may not be the right one. If you are pointing to this race, then its probably bigger than just between read and set.

or are you talking of some other issue?

> 
> What about VM_DONTCOPY vma's with breakpoints ?

Ah... I have missed this.

One solution could be to call mmap_uprobe() like routine just before we
release the mmap_sem of the child but after we do a vma_prio_tree_add.

This should also solve the problem of install_breakpoints/remove
breakpoints called in parent context that we talked about above.

> 
> > -static int match_uprobe(struct uprobe *l, struct uprobe *r)
> > +static int match_uprobe(struct uprobe *l, struct uprobe *r, int *match_inode)
> >  {
> > +	/*
> > +	 * if match_inode is non NULL then indicate if the
> > +	 * inode atleast match.
> > +	 */
> > +	if (match_inode)
> > +		*match_inode = 0;
> > +
> >  	if (l->inode < r->inode)
> >  		return -1;
> >  	if (l->inode > r->inode)
> >  		return 1;
> >  	else {
> > +		if (match_inode)
> > +			*match_inode = 1;
> > +
> 
> It is very possible I missed something, but imho this looks confusing.
> 
> This close_match logic is only needed for build_probe_list() and
> dec_mm_uprobes_count(), and both do not actually need the returned
> uprobe.
> 
> Instead of complicating match_uprobe() and __find_uprobe(), perhaps
> it makes sense to add "struct rb_node *__find_close_rb_node(inode)" ?


Yes, we do this too.

> 
> > +static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
> >  {
> >  	/* Placeholder: Yet to be implemented */
> > +	if (!uprobe->consumers)
> > +		return 0;
> 
> How it is possible to see ->consumers == NULL?
> 

The consumers == NULL check is mostly for the mmap_uprobe() path.
_register_uprobe() and _unregister_uprobe() use the same lock to serialize,
so they can check consumers after taking the lock.

> OK, afaics it _is_ possible, but only because unregister does del_consumer()
> without ->i_mutex, but this is bug afaics (see the previous email).

We have discussed this in the other thread.

> 
> Another user is mmap_uprobe() and it checks ->consumers != NULL itself (but
> see below).
> 
> > +int mmap_uprobe(struct vm_area_struct *vma)
> > +{
> > +	struct list_head tmp_list;
> > +	struct uprobe *uprobe, *u;
> > +	struct inode *inode;
> > +	int ret = 0;
> > +
> > +	if (!valid_vma(vma))
> > +		return ret;	/* Bail-out */
> > +
> > +	inode = igrab(vma->vm_file->f_mapping->host);
> > +	if (!inode)
> > +		return ret;
> > +
> > +	INIT_LIST_HEAD(&tmp_list);
> > +	mutex_lock(&uprobes_mmap_mutex);
> > +	build_probe_list(inode, &tmp_list);
> > +	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > +		loff_t vaddr;
> > +
> > +		list_del(&uprobe->pending_list);
> > +		if (!ret && uprobe->consumers) {
> > +			vaddr = vma->vm_start + uprobe->offset;
> > +			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > +				continue;
> > +			ret = install_breakpoint(vma->vm_mm, uprobe);
> 
> So. We are adding the new mapping, we should find all breakpoints this
> file has in the start/end range.
> 
> We are holding ->mmap_sem... this seems enough to protect against the
> races with register/unregister. Except, what if __register_uprobe()
> fails? In this case __unregister_uprobe() does delete_uprobe() at the
> very end. What if mmap mmap_uprobe() is called right before delete_?
> 

Because consumers would be NULL before _unregister_uprobe() kicks in, we
shouldn't have a problem here.

_unregister_uprobe() and mmap_uprobe() would race for mm->mmap_sem. If
_unregister_uprobe() gets the read lock first, then by the time
mmap_uprobe() gets to run, consumers would be NULL and we are fine, since
we don't go ahead and insert.

If mmap_uprobe() were to get the write lock first, _unregister_uprobe()
would do the necessary cleanup.

We are checking consumers twice, but that's just being conservative;
we should be able to manage with just one check.

Am I missing something?

> > +static void dec_mm_uprobes_count(struct vm_area_struct *vma,
> > +		struct inode *inode)
> > +{
> > +	struct uprobe *uprobe;
> > +	struct rb_node *n;
> > +	unsigned long flags;
> > +
> > +	n = uprobes_tree.rb_node;
> > +	spin_lock_irqsave(&uprobes_treelock, flags);
> > +	uprobe = __find_uprobe(inode, 0, &n);
> > +
> > +	/*
> > +	 * If indeed there is a probe for the inode and with offset zero,
> > +	 * then lets release its reference. (ref got thro __find_uprobe)
> > +	 */
> > +	if (uprobe)
> > +		put_uprobe(uprobe);
> > +	for (; n; n = rb_next(n)) {
> > +		loff_t vaddr;
> > +
> > +		uprobe = rb_entry(n, struct uprobe, rb_node);
> > +		if (uprobe->inode != inode)
> > +			break;
> > +		vaddr = vma->vm_start + uprobe->offset;
> > +		vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > +		if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > +			continue;
> > +		atomic_dec(&vma->vm_mm->mm_uprobes_count);
> 
> So, this does atomic_dec() for each bp in this vma?

yes.

> 
> And the caller is
> 
> > @@ -1337,6 +1338,9 @@ unsigned long unmap_vmas(struct mmu_gather *tlb,
> >  		if (unlikely(is_pfn_mapping(vma)))
> >  			untrack_pfn_vma(vma, 0, 0);
> >
> > +		if (vma->vm_file)
> > +			munmap_uprobe(vma);
> 
> Doesn't look right...
> 
> munmap_uprobe() assumes that the whole region goes away. This is
> true in munmap() case afaics, it does __split_vma() if necessary.
> 
> But what about truncate() ? In this case this vma is not unmapped,
> but unmap_vmas() is called anyway and [start, end) can be different.
> IOW, unless I missed something (this is very possible) we can do
> more atomic_dec's then needed.
> 
Would unlink_file_vma() be a good place to call munmap_uprobe()?

The other idea could be to call munmap_uprobe() in unmap_region() just
before free_pgtables(), and to atomic_set() the count to 0 in exit_mmap()
(again before free_pgtables()).


One other thing we probably need to do in mmap_uprobe() is to remember the
number of probes mmap_uprobe() installed successfully, and then subtract
that from mm_uprobes_count if and only if mmap_uprobe() returns a negative
number.

> Also, truncate() obviously changes ->i_size. Doesn't this mean
> unregister_uprobe() should return if offset > i_size ? We need to
> free uprobes anyway.

Do you mean we shouldn't check the offset in unregister_uprobe() and should
just search the rbtree for the matching uprobe?
That's also possible to do.

> 
> MADV_DONTNEED? It calls unmap_vmas() too. And application can do
> madvise(DONTNEED) in a loop.
> 

I think this would be taken care of if we move the munmap_uprobe() hook
from unmap_vmas() to unlink_file_vma().

The other thing that I need to investigate a bit more is whether I have
handled all the cases of mremap correctly.

-- 
Thanks and Regards
Srikar

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [PATCH] x86: Make variable_test_bit reference all of *addr
  2011-09-22  1:05     ` Josh Stone
  (?)
@ 2011-10-06 23:58     ` Josh Stone
  2011-10-07  1:37       ` hpanvin@gmail.com
  -1 siblings, 1 reply; 330+ messages in thread
From: Josh Stone @ 2011-10-06 23:58 UTC (permalink / raw)
  To: linux-kernel
  Cc: Josh Stone, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Masami Hiramatsu, Srikar Dronamraju, Jakub Jelinek

This casts the addr that's fed to asm into a struct-array pointer, so
gcc knows that more than just the first long is needed.  Since there's
no fixed size for all callers, an arbitrary size is chosen just to
ensure that it's probably good enough.

I noticed this warning on i686, with gcc-4.6.1-9.fc15:

  CC      arch/x86/kernel/kprobes.o
In file included from include/linux/bitops.h:22:0,
                 from include/linux/kernel.h:17,
                 from [...]/arch/x86/include/asm/percpu.h:44,
                 from [...]/arch/x86/include/asm/current.h:5,
                 from [...]/arch/x86/include/asm/processor.h:15,
                 from [...]/arch/x86/include/asm/atomic.h:6,
                 from include/linux/atomic.h:4,
                 from include/linux/mutex.h:18,
                 from include/linux/notifier.h:13,
                 from include/linux/kprobes.h:34,
                 from arch/x86/kernel/kprobes.c:43:
[...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
[...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default]

In investigating the impact of this warning, I discovered that only the
first long of the 32-byte twobyte_is_boostable[] was making it into the
object file.

Jakub advised that variable_test_bit is incorrectly telling gcc that its
asm only uses a single long from the addr pointer, and he suggested the
struct-array cast to broaden the memory reference.

Signed-off-by: Josh Stone <jistone@redhat.com>
Cc: Jakub Jelinek <jakub@redhat.com>

---

An alternate fix would be to make kprobes' twobyte_is_boostable[]
volatile, which forces gcc to keep it around.  I feel that's treating
the symptom though, rather than the cause in variable_test_bit().

IMO this is also a good candidate for -stable, for fixing the obviously bad
data behavior, but I'll let others judge...

---
 arch/x86/include/asm/bitops.h |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/bitops.h b/arch/x86/include/asm/bitops.h
index 1775d6e..0565371 100644
--- a/arch/x86/include/asm/bitops.h
+++ b/arch/x86/include/asm/bitops.h
@@ -319,7 +319,8 @@ static inline int variable_test_bit(int nr, volatile const unsigned long *addr)
 	asm volatile("bt %2,%1\n\t"
 		     "sbb %0,%0"
 		     : "=r" (oldbit)
-		     : "m" (*(unsigned long *)addr), "Ir" (nr));
+		     : "m" (*(struct { unsigned long _[0x10000]; } *)addr),
+		       "Ir" (nr));
 
 	return oldbit;
 }
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 330+ messages in thread

* Re: [PATCH] x86: Make variable_test_bit reference all of *addr
  2011-10-06 23:58     ` [PATCH] x86: Make variable_test_bit reference all of *addr Josh Stone
@ 2011-10-07  1:37       ` hpanvin@gmail.com
  2011-10-07  2:02         ` Andi Kleen
  0 siblings, 1 reply; 330+ messages in thread
From: hpanvin@gmail.com @ 2011-10-07  1:37 UTC (permalink / raw)
  To: Josh Stone, linux-kernel
  Cc: Thomas Gleixner, Ingo Molnar, x86, Masami Hiramatsu,
	Srikar Dronamraju, Jakub Jelinek

This is concerning... the kernel relies heavily on asm volatile being a universal memory consumer.  If that is suddenly broken, we are f*** in many, many, MANY places in the kernel all of a sudden!

Josh Stone <jistone@redhat.com> wrote:

>This casts the addr that's fed to asm into a struct-array pointer, so
>gcc knows that more than just the first long is needed.  Since there's
>no fixed size for all callers, an arbitrary size is chosen just to
>ensure that it's probably good enough.
>
>I noticed this warning on i686, with gcc-4.6.1-9.fc15:
>
>  CC      arch/x86/kernel/kprobes.o
>In file included from include/linux/bitops.h:22:0,
>                 from include/linux/kernel.h:17,
>                 from [...]/arch/x86/include/asm/percpu.h:44,
>                 from [...]/arch/x86/include/asm/current.h:5,
>                 from [...]/arch/x86/include/asm/processor.h:15,
>                 from [...]/arch/x86/include/asm/atomic.h:6,
>                 from include/linux/atomic.h:4,
>                 from include/linux/mutex.h:18,
>                 from include/linux/notifier.h:13,
>                 from include/linux/kprobes.h:34,
>                 from arch/x86/kernel/kprobes.c:43:
>[...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
>[...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input
>without lvalue in asm operand 1 is deprecated [enabled by default]
>
>In investigating the impact of this warning, I discovered that only the
>first long of the 32-byte twobyte_is_boostable[] was making into the
>object file.
>
>Jakub advised that variable_test_bit is incorrectly telling gcc that
>its
>asm only uses a single long from the addr pointer, and he suggested the
>struct-array cast to broaden the memory reference.
>
>Signed-off-by: Josh Stone <jistone@redhat.com>
>Cc: Jakub Jelinek <jakub@redhat.com>
>
>---
>
>An alternate fix would be to make kprobes' twobyte_is_boostable[]
>volatile, which forces gcc to keep it around.  I feel that's treating
>the symptom though, rather than the cause in variable_test_bit().
>
>IMO this is also a good candidate for -stable, for fixing the obviously
>bad
>data behavior, but I'll let others judge...
>
>---
> arch/x86/include/asm/bitops.h |    3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
>diff --git a/arch/x86/include/asm/bitops.h
>b/arch/x86/include/asm/bitops.h
>index 1775d6e..0565371 100644
>--- a/arch/x86/include/asm/bitops.h
>+++ b/arch/x86/include/asm/bitops.h
>@@ -319,7 +319,8 @@ static inline int variable_test_bit(int nr,
>volatile const unsigned long *addr)
> 	asm volatile("bt %2,%1\n\t"
> 		     "sbb %0,%0"
> 		     : "=r" (oldbit)
>-		     : "m" (*(unsigned long *)addr), "Ir" (nr));
>+		     : "m" (*(struct { unsigned long _[0x10000]; } *)addr),
>+		       "Ir" (nr));
> 
> 	return oldbit;
> }
>-- 
>1.7.6.4

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH] x86: Make variable_test_bit reference all of *addr
  2011-10-07  1:37       ` hpanvin@gmail.com
@ 2011-10-07  2:02         ` Andi Kleen
  2011-10-07  2:50           ` Josh Stone
  2011-10-07  3:13           ` [PATCH] x86: Make variable_test_bit reference all of *addr hpanvin@gmail.com
  0 siblings, 2 replies; 330+ messages in thread
From: Andi Kleen @ 2011-10-07  2:02 UTC (permalink / raw)
  To: hpanvin@gmail.com
  Cc: Josh Stone, linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Masami Hiramatsu, Srikar Dronamraju, Jakub Jelinek

"hpanvin@gmail.com" <hpa@zytor.com> writes:


> This is concerning... the kernel relies heavily on asm volatile being a universal memory consumer.  If that is suddenly broken, we are f*** in many, many, MANY places in the kernel all of a sudden!

I don't think that's true. We generally add "memory" clobbers for this
purpose. asm volatile just means "don't move" 

Just this one doesn't have it for unknown reasons (someone overoptimizing?)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH] x86: Make variable_test_bit reference all of *addr
  2011-10-07  2:02         ` Andi Kleen
@ 2011-10-07  2:50           ` Josh Stone
  2011-10-07  3:12             ` hpanvin@gmail.com
                               ` (2 more replies)
  2011-10-07  3:13           ` [PATCH] x86: Make variable_test_bit reference all of *addr hpanvin@gmail.com
  1 sibling, 3 replies; 330+ messages in thread
From: Josh Stone @ 2011-10-07  2:50 UTC (permalink / raw)
  To: Andi Kleen, hpanvin@gmail.com
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Masami Hiramatsu, Srikar Dronamraju, Jakub Jelinek

On 10/06/2011 04:58 PM, Josh Stone wrote:
> [...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
> [...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default]

I probably should have noted that Jakub also blamed gcc's behavior, for
transforming const memory into a literal constant and then complaining
about lvalues.  He fixed that upstream, and applied to 4.6.1-10.fc16:
  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50571

I didn't figure out any automated way to detect the problem in general
(apart from the presence of that warning), but here's how I'm checking
kprobes in particular.

Using gcc-4.6.1-9.fc15.i686:
> $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
>      551:	0f a3 05 00 00 00 00 	bt     %eax,0x0
> 			554: R_386_32	.rodata.cst4
> $ objdump -s -j .rodata -j .rodata.cst4 arch/x86/kernel/kprobes.o
> 
> arch/x86/kernel/kprobes.o:     file format elf32-i386
> 
> Contents of section .rodata:
>  0000 02000000                             ....            
> Contents of section .rodata.cst4:
>  0000 4c030000                             L...            

Using gcc-4.6.1-9.fc15.i686, with my variable_test_bit patch:
> $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
>      551:	0f a3 05 20 00 00 00 	bt     %eax,0x20
> 			554: R_386_32	.rodata
> $ objdump -s -j .rodata arch/x86/kernel/kprobes.o
> 
> arch/x86/kernel/kprobes.o:     file format elf32-i386
> 
> Contents of section .rodata:
>  0000 02000000 00000000 00000000 00000000  ................
>  0010 00000000 00000000 00000000 00000000  ................
>  0020 4c030000 0f000200 ffff0000 ffcff0c0  L...............
>  0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77  ....;.......&..w

Using gcc-4.6.1-10.fc16.i686, with Jakub's fix, without my patch:
> $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
>      551:	0f a3 05 20 00 00 00 	bt     %eax,0x20
> 			554: R_386_32	.rodata
> $ objdump -s -j .rodata arch/x86/kernel/kprobes.o
> 
> arch/x86/kernel/kprobes.o:     file format elf32-i386
> 
> Contents of section .rodata:
>  0000 02000000 00000000 00000000 00000000  ................
>  0010 00000000 00000000 00000000 00000000  ................
>  0020 4c030000 0f000200 ffff0000 ffcff0c0  L...............
>  0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77  ....;.......&..w

There's some zero-padding after the previous .rodata contents, but then,
starting at 0x20, it now has the full 32 bytes of twobyte_is_boostable[].

So Jakub's gcc change fixes this issue independently of my patch, but I
got the impression from him that the way the kernel is expressing this
is still in the realm of "gcc might break your expectations here".  If
that's not the case, then my patch here is only needed if you want to
cope with prior broken versions.  Jakub, do you have an idea of the
range of gcc versions broken in this way?

On 10/06/2011 07:02 PM, Andi Kleen wrote:
> "hpanvin@gmail.com" <hpa@zytor.com> writes:
>> This is concerning... the kernel relies heavily on asm volatile being
>> a universal memory consumer.  If that is suddenly broken, we are f***
>> in many, many, MANY places in the kernel all of a sudden!
> 
> I don't think that's true. We generally add "memory" clobbers for this
> purpose. asm volatile just means "don't move" 
> 
> Just this one doesn't have it for unknown reasons (someone overoptimizing?)

Which overoptimizing part are you referring to?  The only part of
variable_test_bit that's not volatile is "m" (*(unsigned long *)addr),
and throwing volatile in that cast does nothing for the problem (at
least on gcc-4.6.1-9.fc15).

We can make twobyte_is_boostable[] volatile instead, which does the
trick, but that seems a kludge to me.


Josh

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH] x86: Make variable_test_bit reference all of *addr
  2011-10-07  2:50           ` Josh Stone
@ 2011-10-07  3:12             ` hpanvin@gmail.com
  2011-10-07  3:30               ` Andi Kleen
  2011-10-07  4:35             ` Masami Hiramatsu
  2011-10-07  4:55             ` Masami Hiramatsu
  2 siblings, 1 reply; 330+ messages in thread
From: hpanvin@gmail.com @ 2011-10-07  3:12 UTC (permalink / raw)
  To: Josh Stone, Andi Kleen
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Masami Hiramatsu, Srikar Dronamraju, Jakub Jelinek

I mean the volatile in "asm volatile".

Josh Stone <jistone@redhat.com> wrote:

>On 10/06/2011 04:58 PM, Josh Stone wrote:
>> [...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
>> [...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory
>input without lvalue in asm operand 1 is deprecated [enabled by
>default]
>
>I probably should have noted that Jakub also blamed gcc's behavior, for
>transforming const memory into a literal constant and then complaining
>about lvalues.  He fixed that upstream, and applied to 4.6.1-10.fc16:
>  http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50571
>
>I didn't figure out any automated way to detect the problem in general
>(apart from the presence of that warning), but here's how I'm checking
>kprobes' in particular.
>
>Using gcc-4.6.1-9.fc15.i686:
>> $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
>>      551:	0f a3 05 00 00 00 00 	bt     %eax,0x0
>> 			554: R_386_32	.rodata.cst4
>> $ objdump -s -j .rodata -j .rodata.cst4 arch/x86/kernel/kprobes.o
>> 
>> arch/x86/kernel/kprobes.o:     file format elf32-i386
>> 
>> Contents of section .rodata:
>>  0000 02000000                             ....            
>> Contents of section .rodata.cst4:
>>  0000 4c030000                             L...            
>
>Using gcc-4.6.1-9.fc15.i686, with my variable_test_bit patch:
>> $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
>>      551:	0f a3 05 20 00 00 00 	bt     %eax,0x20
>> 			554: R_386_32	.rodata
>> $ objdump -s -j .rodata arch/x86/kernel/kprobes.o
>> 
>> arch/x86/kernel/kprobes.o:     file format elf32-i386
>> 
>> Contents of section .rodata:
>>  0000 02000000 00000000 00000000 00000000  ................
>>  0010 00000000 00000000 00000000 00000000  ................
>>  0020 4c030000 0f000200 ffff0000 ffcff0c0  L...............
>>  0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77  ....;.......&..w
>
>Using gcc-4.6.1-10.fc16.i686, with Jakub's fix, without my patch:
>> $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
>>      551:	0f a3 05 20 00 00 00 	bt     %eax,0x20
>> 			554: R_386_32	.rodata
>> $ objdump -s -j .rodata arch/x86/kernel/kprobes.o
>> 
>> arch/x86/kernel/kprobes.o:     file format elf32-i386
>> 
>> Contents of section .rodata:
>>  0000 02000000 00000000 00000000 00000000  ................
>>  0010 00000000 00000000 00000000 00000000  ................
>>  0020 4c030000 0f000200 ffff0000 ffcff0c0  L...............
>>  0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77  ....;.......&..w
>
>There's some zero-padding on the previous .rodata contents, but then
>starting at 0x20 it now has the full 32-bytes of
>twobyte_is_boostable[].
>
>So Jakub's gcc change fixes this issue independently of my patch, but I
>got the impression from him that the way the kernel is expressing this
>is still in the realm of "gcc might break your expectations here".  If
>that's not the case, then my patch here is only needed if you want to
>cope with prior broken versions.  Jakub, do you have an idea of the
>range of gcc versions broken in this way?
>
>On 10/06/2011 07:02 PM, Andi Kleen wrote:
>> "hpanvin@gmail.com" <hpa@zytor.com> writes:
>>> This is concerning... the kernel relies heavily on asm volatile
>being
>>> a universal memory consumer.  If that is suddenly broken, we are
>f***
>>> in many, many, MANY places in the kernel all of a sudden!
>> 
>> I don't think that's true. We generally add "memory" clobbers for
>this
>> purpose. asm volatile just means "don't move" 
>> 
>> Just this one doesn't have it for unknown reasons (someone
>overoptimizing?)
>
>Which overoptimizing part are you referring to?  The only part of
>variable_test_bit that's not volatile is "m" (*(unsigned long *)addr),
>and throwing volatile in that cast does nothing for the problem (at
>least on gcc-4.6.1-9.fc15).
>
>We can make twobyte_is_boostable[] volatile instead, which does the
>trick, but that seems a kludge to me.
>
>
>Josh

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH] x86: Make variable_test_bit reference all of *addr
  2011-10-07  2:02         ` Andi Kleen
  2011-10-07  2:50           ` Josh Stone
@ 2011-10-07  3:13           ` hpanvin@gmail.com
  1 sibling, 0 replies; 330+ messages in thread
From: hpanvin@gmail.com @ 2011-10-07  3:13 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Josh Stone, linux-kernel, Thomas Gleixner, Ingo Molnar, x86,
	Masami Hiramatsu, Srikar Dronamraju, Jakub Jelinek

Memory clobbers are universal producers, not consumers.

Andi Kleen <andi@firstfloor.org> wrote:

>"hpanvin@gmail.com" <hpa@zytor.com> writes:
>
>
>> This is concerning... the kernel relies heavily on asm volatile being
>a universal memory consumer.  If that is suddenly broken, we are f***
>in many, many, MANY places in the kernel all of a sudden!
>
>I don't think that's true. We generally add "memory" clobbers for this
>purpose. asm volatile just means "don't move" 
>
>Just this one doesn't have it for unknown reasons (someone
>overoptimizing?)
>
>-Andi
>
>-- 
>ak@linux.intel.com -- Speaking for myself only

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH] x86: Make variable_test_bit reference all of *addr
  2011-10-07  3:12             ` hpanvin@gmail.com
@ 2011-10-07  3:30               ` Andi Kleen
  0 siblings, 0 replies; 330+ messages in thread
From: Andi Kleen @ 2011-10-07  3:30 UTC (permalink / raw)
  To: hpanvin@gmail.com
  Cc: Josh Stone, Andi Kleen, linux-kernel, Thomas Gleixner,
	Ingo Molnar, x86, Masami Hiramatsu, Srikar Dronamraju,
	Jakub Jelinek

On Thu, Oct 06, 2011 at 08:12:48PM -0700, hpanvin@gmail.com wrote:
> I mean the volatile in "asm volatile".

That volatile has nothing to do with memory. It just means "don't move
much". It's actually quite vague, because the rest of the function
can still move around.

-Andi

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH] x86: Make variable_test_bit reference all of *addr
  2011-10-07  2:50           ` Josh Stone
  2011-10-07  3:12             ` hpanvin@gmail.com
  2011-10-07  4:35             ` Masami Hiramatsu
@ 2011-10-07  4:55             ` Masami Hiramatsu
  2011-10-18  1:00               ` [PATCH] x86: Make kprobes' twobyte_is_boostable volatile Josh Stone
  2 siblings, 1 reply; 330+ messages in thread
From: Masami Hiramatsu @ 2011-10-07  4:55 UTC (permalink / raw)
  To: Josh Stone
  Cc: Andi Kleen, hpanvin@gmail.com, linux-kernel, Thomas Gleixner,
	Ingo Molnar, x86, Srikar Dronamraju, Jakub Jelinek

Hi Josh,

Thank you for reporting details :)

(2011/10/07 11:50), Josh Stone wrote:
> We can make twobyte_is_boostable[] volatile instead, which does the
> trick, but that seems a kludge to me.

I think this time we'd better do this (with comments), since it should not
affect most bitmap users, who use bitmaps to check dynamically allocated
bit-flags; kprobes' bitmap usage seems to be a special case.

Thank you,

-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-06  5:47           ` Srikar Dronamraju
@ 2011-10-07 16:58             ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-07 16:58 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On 10/06, Srikar Dronamraju wrote:
>
> The patch (that I sent out as part of v5 patchset) uses per task
> pending sigqueue and start queueing the signals when the task
> singlesteps. After completion of singlestep, walks thro the pending
> signals.

Yes, I see. Doesn't look very nice ;)

> But I was thinking if I should block signals instead of queueing them in
> a different sigqueue. So Idea is to block signals just before the task
> enables singlestep and unblock after task disables singlestep.

Agreed, this looks much, much better. In both cases the task is current,
it is safe to change ->blocked.

But please avoid sigprocmask(), we have set_current_blocked().

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 3/26]   Uprobes: register/unregister probes.
  2011-10-06  6:51           ` Srikar Dronamraju
@ 2011-10-07 17:03             ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-07 17:03 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Andi Kleen, LKML, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, Andrew Morton

On 10/06, Srikar Dronamraju wrote:
>
> * Oleg Nesterov <oleg@redhat.com> [2011-10-05 20:50:08]:
>
> yes we might be doing an unnecessary __register_uprobe() but because it
> raced with unregister_uprobe() and got the lock, we would avoid doing a
> __unregister_uprobe().
>
> However I am okay to move the lock before del_consumer().

To me this looks a bit "safer" even if currently __register is idempotent.

But,

> Please let me
> know how you prefer this.

No, no, Srikar. Please do what you prefer. You are the author.

And btw, I forgot to mention that initially I wrongly thought this was buggy.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-10-06 11:05       ` Srikar Dronamraju
@ 2011-10-07 17:36         ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-07 17:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML

On 10/06, Srikar Dronamraju wrote:
>
> * Oleg Nesterov <oleg@redhat.com> [2011-10-03 15:37:10]:
>
> > On 09/20, Srikar Dronamraju wrote:
> > >
> > > @@ -739,6 +740,10 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
> > >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >  	mm->pmd_huge_pte = NULL;
> > >  #endif
> > > +#ifdef CONFIG_UPROBES
> > > +	atomic_set(&mm->mm_uprobes_count,
> > > +			atomic_read(&oldmm->mm_uprobes_count));
> >
> > Hmm. Why this can't race with install_breakpoint/remove_breakpoint
> > between _read and _set ?
>
> At this time the child vmas are not yet created, so I don't see an
> install_breakpoints/remove_breakpoints from the child affecting this.

I meant oldmm.

> However if install_breakpoints/remove_breakpoints happen from a parent
> context, from now on till we do a vma_prio_tree_add (actually down_write(oldmm->mmap_sem)
> in dup_mmap()), then the count in the child may not be the right one.
> If you are pointing to this race, then it's probably bigger than just between read and set.

Yes, this too. IOW, atomic_read/set(mm_uprobes_count) looks always
wrong without down_write(mmap_sem).

> > > +static int install_breakpoint(struct mm_struct *mm, struct uprobe *uprobe)
> > >  {
> > >  	/* Placeholder: Yet to be implemented */
> > > +	if (!uprobe->consumers)
> > > +		return 0;
> >
> > How it is possible to see ->consumers == NULL?
>
> consumers == NULL check is mostly for the mmap_uprobe path.

mmap_uprobe() explicitly checks ->consumers != NULL before
install_breakpoint().

> _register_uprobe and _unregister_uprobe() use the same lock to serialize
> so they can check consumers after taking the lock.

Yes,

> > OK, afaics it _is_ possible, but only because unregister does del_consumer()
> > without ->i_mutex, but this is bug afaics (see the previous email).
>
> We have discussed this in the other thread.

Yes. So afaics we can remove this check if unregister() does del_consumer()
under mutex.

Note: I am not saying you should do this ;) I just tried to understand
this code.

> > > +int mmap_uprobe(struct vm_area_struct *vma)
> > > +{
> > > +	struct list_head tmp_list;
> > > +	struct uprobe *uprobe, *u;
> > > +	struct inode *inode;
> > > +	int ret = 0;
> > > +
> > > +	if (!valid_vma(vma))
> > > +		return ret;	/* Bail-out */
> > > +
> > > +	inode = igrab(vma->vm_file->f_mapping->host);
> > > +	if (!inode)
> > > +		return ret;
> > > +
> > > +	INIT_LIST_HEAD(&tmp_list);
> > > +	mutex_lock(&uprobes_mmap_mutex);
> > > +	build_probe_list(inode, &tmp_list);
> > > +	list_for_each_entry_safe(uprobe, u, &tmp_list, pending_list) {
> > > +		loff_t vaddr;
> > > +
> > > +		list_del(&uprobe->pending_list);
> > > +		if (!ret && uprobe->consumers) {
> > > +			vaddr = vma->vm_start + uprobe->offset;
> > > +			vaddr -= vma->vm_pgoff << PAGE_SHIFT;
> > > +			if (vaddr < vma->vm_start || vaddr >= vma->vm_end)
> > > +				continue;
> > > +			ret = install_breakpoint(vma->vm_mm, uprobe);
> >
> > So. We are adding the new mapping, we should find all breakpoints this
> > file has in the start/end range.
> >
> > We are holding ->mmap_sem... this seems enough to protect against the
> > races with register/unregister. Except, what if __register_uprobe()
> > fails? In this case __unregister_uprobe() does delete_uprobe() at the
> > very end. What if mmap_uprobe() is called right before delete_?
> >
>
> Because consumers would be NULL before _unregister_uprobe kicks in, we
> shouldn't have a problem here.

Hmm. But it is not NULL.

Once again, I didn't mean unregister_uprobe(). I meant register_uprobe().
In this case, if __register_uprobe() fails, we are doing __unregister
but uprobe->consumer != NULL.

Just suppose that the caller of register_uprobe() gets a (long) preemption
right before __unregister_uprobe()->delete_uprobe(). What if mmap() is
called at this time?

> Am I missing something?

Maybe you, maybe me. Please recheck ;)

> > Also, truncate() obviously changes ->i_size. Doesn't this mean
> > unregister_uprobe() should return if offset > i_size ? We need to
> > free uprobes anyway.

Argh, I meant "should NOT return if offset > i_size".

> Do you mean we shouldn't check for the offset in unregister_uprobe() and
> just search in the rbtree for the matching uprobe?
> That's also possible to do.

Yes, we can't trust this check afaics.

> I think this would be taken care of if we move the munmap_uprobe() hook
> from unmap_vmas to unlink_file_vma().

Probably yes, we should rely on prio_tree locking/changes.

> The other thing that I need to investigate a bit more is if I have
> handled all cases of mremap correctly.

Yes. Maybe mmap_uprobe() should be "closer" to vma_prio_tree_add/insert
too, but I am not sure.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread


* Re: [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
  2011-09-20 12:02   ` Srikar Dronamraju
@ 2011-10-07 18:28     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-07 18:28 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Andrew Morton,
	Jim Keniston, Roland McGrath, Andi Kleen, LKML

On 09/20, Srikar Dronamraju wrote:
>
> @@ -1285,6 +1286,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>  	INIT_LIST_HEAD(&p->pi_state_list);
>  	p->pi_state_cache = NULL;
>  #endif
> +#ifdef CONFIG_UPROBES
> +	p->utask = NULL;
> +#endif

I am not sure I understand this all right, but I am not sure this
is enough...

What if the forking task (current) is in UTASK_BP_HIT state?

IOW, uprobe replaces the original syscall insn with "int3", then we
enter the kernel from the xol_vma. The new child has the same
modified instruction pointer (pointing to nowhere without CLONE_VM)
and in any case it doesn't have TIF_SINGLESTEP.

No?

> +void uprobe_notify_resume(struct pt_regs *regs)
> +{
> +	struct vm_area_struct *vma;
> +	struct uprobe_task *utask;
> +	struct mm_struct *mm;
> +	struct uprobe *u = NULL;
> +	unsigned long probept;
> +
> +	utask = current->utask;
> +	mm = current->mm;
> +	if (!utask || utask->state == UTASK_BP_HIT) {
> +		probept = get_uprobe_bkpt_addr(regs);
> +		down_read(&mm->mmap_sem);
> +		vma = find_vma(mm, probept);
> +		if (vma && valid_vma(vma))
> +			u = find_uprobe(vma->vm_file->f_mapping->host,
> +					probept - vma->vm_start +
> +					(vma->vm_pgoff << PAGE_SHIFT));
> +		up_read(&mm->mmap_sem);
> +		if (!u)
> +			/* No matching uprobe; signal SIGTRAP. */
> +			goto cleanup_ret;
> +		if (!utask) {
> +			utask = add_utask();
> +			/* Cannot Allocate; re-execute the instruction. */
> +			if (!utask)
> +				goto cleanup_ret;
> +		}
> +		/* TODO Start queueing signals. */
> +		utask->active_uprobe = u;
> +		handler_chain(u, regs);
> +		utask->state = UTASK_SSTEP;
> +		if (!pre_ssout(u, regs, probept))
> +			user_enable_single_step(current);

Oooh. Playing with user_*_single_step() is obviously not very nice...
But I guess you have no choice. Although I _hope_ we can do something
else later.

And what if we step into a syscall insn? I do not understand this
low level code, but it seems that in this case we trap in kernel mode
and do_debug() doesn't clear X86_EFLAGS_TF because uprobes hook
DIE_DEBUG. IOW, the task will trap again and again inside this syscall,
no?

> +	} else if (utask->state == UTASK_SSTEP) {
> +		u = utask->active_uprobe;
> +		if (sstep_complete(u, regs)) {

It is not clear to me if it is correct to simply return if
sstep_complete() returns false... What if X86_EFLAGS_TF was "lost"
somehow?


Again, I am not saying I understand this magic. Not at all ;)
Please simply ignore my email if you think everything is fine.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread


* Re: [PATCH v5 3.1.0-rc4-tip 13/26]   x86: define a x86 specific exception notifier.
  2011-09-20 12:02   ` Srikar Dronamraju
@ 2011-10-07 18:31     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-07 18:31 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Andi Kleen,
	Hugh Dickins, Christoph Hellwig, Jonathan Corbet,
	Thomas Gleixner, Masami Hiramatsu, LKML, Jim Keniston,
	Roland McGrath, Ananth N Mavinakayanahalli, Andrew Morton

On 09/20, Srikar Dronamraju wrote:
>
> +int uprobe_exception_notify(struct notifier_block *self,
> +				       unsigned long val, void *data)
> +{
> +	struct die_args *args = data;
> +	struct pt_regs *regs = args->regs;
> +	int ret = NOTIFY_DONE;
> +
> +	/* We are only interested in userspace traps */
> +	if (regs && !user_mode_vm(regs))
> +		return NOTIFY_DONE;
> +
> +	switch (val) {
> +	case DIE_INT3:
> +		/* Run your handler here */
> +		if (uprobe_bkpt_notifier(regs))
> +			ret = NOTIFY_STOP;
> +		break;

OK, but I simply can't understand do_int3(). It uses DIE_INT3 or
DIE_TRAP depending on CONFIG_KPROBES.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread


* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-09-20 12:03   ` Srikar Dronamraju
@ 2011-10-07 18:37     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-07 18:37 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Andrew Morton,
	Jim Keniston, Roland McGrath, Andi Kleen, LKML

On 09/20, Srikar Dronamraju wrote:
>
> - * valid_vma: Verify if the specified vma is an executable vma
> + * valid_vma: Verify if the specified vma is an executable vma,
> + * but not an XOL vma.
>   *	- Return 1 if the specified virtual address is in an
> - *	  executable vma.
> + *	  executable vma, but not in an XOL vma.
>   */
>  static bool valid_vma(struct vm_area_struct *vma)
>  {
> +	struct uprobes_xol_area *area = vma->vm_mm->uprobes_xol_area;
> +
>  	if (!vma->vm_file)
>  		return false;
>
> +	if (area && (area->vaddr == vma->vm_start))
> +			return false;

Could you explain why do we need this "but not an XOL vma" check?
xol_vma->vm_file is always NULL, no?

> +static struct uprobes_xol_area *xol_alloc_area(void)
> +{
> +	struct uprobes_xol_area *area = NULL;
> +
> +	area = kzalloc(sizeof(*area), GFP_KERNEL);
> +	if (unlikely(!area))
> +		return NULL;
> +
> +	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
> +								GFP_KERNEL);
> +
> +	if (!area->bitmap)
> +		goto fail;
> +
> +	init_waitqueue_head(&area->wq);
> +	spin_lock_init(&area->slot_lock);
> +	if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
> +		task_lock(current);
> +		if (!current->mm->uprobes_xol_area) {
> +			current->mm->uprobes_xol_area = area;
> +			task_unlock(current);
> +			return area;
> +		}
> +		task_unlock(current);

But you can't rely on task_lock(); you can race with another thread
sharing the same ->mm. I guess you need mmap_sem or xchg().

>  static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
>  				unsigned long vaddr)
>  {
> -	/* TODO: Yet to be implemented */
> +	if (xol_get_insn_slot(uprobe, vaddr) && !pre_xol(uprobe, regs)) {
> +		set_instruction_pointer(regs, current->utask->xol_vaddr);

set_instruction_pointer() looks unneeded, pre_xol() has already changed
regs->ip.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread


* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
  2011-10-07 18:37     ` Oleg Nesterov
@ 2011-10-09 11:47       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-09 11:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Andrew Morton,
	Jim Keniston, Roland McGrath, Andi Kleen, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-07 20:37:40]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > - * valid_vma: Verify if the specified vma is an executable vma
> > + * valid_vma: Verify if the specified vma is an executable vma,
> > + * but not an XOL vma.
> >   *	- Return 1 if the specified virtual address is in an
> > - *	  executable vma.
> > + *	  executable vma, but not in an XOL vma.
> >   */
> >  static bool valid_vma(struct vm_area_struct *vma)
> >  {
> > +	struct uprobes_xol_area *area = vma->vm_mm->uprobes_xol_area;
> > +
> >  	if (!vma->vm_file)
> >  		return false;
> >
> > +	if (area && (area->vaddr == vma->vm_start))
> > +			return false;
> 
> Could you explain why do we need this "but not an XOL vma" check?
> xol_vma->vm_file is always NULL, no?
> 

Yes, xol_vma->vm_file is always NULL.
Previously we used shmem_file_setup() before mapping the XOL area.
However, we now use init_creds instead, so this check should also change
accordingly. Will correct this.

> > +static struct uprobes_xol_area *xol_alloc_area(void)
> > +{
> > +	struct uprobes_xol_area *area = NULL;
> > +
> > +	area = kzalloc(sizeof(*area), GFP_KERNEL);
> > +	if (unlikely(!area))
> > +		return NULL;
> > +
> > +	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
> > +								GFP_KERNEL);
> > +
> > +	if (!area->bitmap)
> > +		goto fail;
> > +
> > +	init_waitqueue_head(&area->wq);
> > +	spin_lock_init(&area->slot_lock);
> > +	if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
> > +		task_lock(current);
> > +		if (!current->mm->uprobes_xol_area) {
> > +			current->mm->uprobes_xol_area = area;
> > +			task_unlock(current);
> > +			return area;
> > +		}
> > +		task_unlock(current);
> 
> But you can't rely on task_lock(), you can race with another thread
> with the same ->mm. I guess you need mmap_sem or xchg().

Agreed.
I think it's better to use cmpxchg() instead of xchg(). With xchg() I would
set area to the new value, but the old area might already be in use, so I
can't unmap the old area.

If I use cmpxchg(), I can free the new area if the previous area is
non-NULL.

However, setting uprobes_xol_area in xol_add_vma(), where we already take
mmap_sem for write while mapping the xol_area, is the best option.

> 
> >  static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
> >  				unsigned long vaddr)
> >  {
> > -	/* TODO: Yet to be implemented */
> > +	if (xol_get_insn_slot(uprobe, vaddr) && !pre_xol(uprobe, regs)) {
> > +		set_instruction_pointer(regs, current->utask->xol_vaddr);
> 
> set_instruction_pointer() looks unneded, pre_xol() has already changed
> regs->ip.
> 

Agreed.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 18/26]   uprobes: slot allocation.
@ 2011-10-09 11:47       ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-09 11:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Andrew Morton,
	Jim Keniston, Roland McGrath, Andi Kleen, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-07 20:37:40]:

> On 09/20, Srikar Dronamraju wrote:
> >
> > - * valid_vma: Verify if the specified vma is an executable vma
> > + * valid_vma: Verify if the specified vma is an executable vma,
> > + * but not an XOL vma.
> >   *	- Return 1 if the specified virtual address is in an
> > - *	  executable vma.
> > + *	  executable vma, but not in an XOL vma.
> >   */
> >  static bool valid_vma(struct vm_area_struct *vma)
> >  {
> > +	struct uprobes_xol_area *area = vma->vm_mm->uprobes_xol_area;
> > +
> >  	if (!vma->vm_file)
> >  		return false;
> >
> > +	if (area && (area->vaddr == vma->vm_start))
> > +			return false;
> 
> Could you explain why do we need this "but not an XOL vma" check?
> xol_vma->vm_file is always NULL, no?
> 

Yes, xol_vma->vm_file is always NULL.
previously we used shmem_file_setup before we map the XOL area.
However we now use init_creds instead, so this should also change
accordingly. Will correct this.

> > +static struct uprobes_xol_area *xol_alloc_area(void)
> > +{
> > +	struct uprobes_xol_area *area = NULL;
> > +
> > +	area = kzalloc(sizeof(*area), GFP_KERNEL);
> > +	if (unlikely(!area))
> > +		return NULL;
> > +
> > +	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
> > +								GFP_KERNEL);
> > +
> > +	if (!area->bitmap)
> > +		goto fail;
> > +
> > +	init_waitqueue_head(&area->wq);
> > +	spin_lock_init(&area->slot_lock);
> > +	if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
> > +		task_lock(current);
> > +		if (!current->mm->uprobes_xol_area) {
> > +			current->mm->uprobes_xol_area = area;
> > +			task_unlock(current);
> > +			return area;
> > +		}
> > +		task_unlock(current);
> 
> But you can't rely on task_lock(), you can race with another thread
> with the same ->mm. I guess you need mmap_sem or xchg().

Agree, 
I think its better to use cmpxchg instead of xchg(). Otherwise,
(using xchg), I would set area to new value, but the old area might be in
use already. So I cant unmap the old area.

If I use cmpxchg, I can free up the new area if previous area is non
NULL.

However setting uprobes_xol_area in xol_add_vma() where we already take
mmap_sem for write while maping the xol_area is the best option.

> 
> >  static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
> >  				unsigned long vaddr)
> >  {
> > -	/* TODO: Yet to be implemented */
> > +	if (xol_get_insn_slot(uprobe, vaddr) && !pre_xol(uprobe, regs)) {
> > +		set_instruction_pointer(regs, current->utask->xol_vaddr);
> 
> set_instruction_pointer() looks unneded, pre_xol() has already changed
> regs->ip.
> 

Agree.

-- 
Thanks and Regards
Srikar

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 12/26]   Uprobes: Handle breakpoint and Singlestep
  2011-10-07 18:28     ` Oleg Nesterov
@ 2011-10-09 13:31       ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-09 13:31 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Hugh Dickins, Christoph Hellwig, Masami Hiramatsu,
	Thomas Gleixner, Ananth N Mavinakayanahalli, Andrew Morton,
	Jim Keniston, Roland McGrath, Andi Kleen, LKML

On 10/07, Oleg Nesterov wrote:
>
> What if the forking task (current) is in UTASK_BP_HIT state?
> ...
>
> And what if we step into a syscall insn?
> ...

And I guess there would be a lot more problems here. But, looking
at is_prefix_bad() I see the nice comment:

	* opcodes we'll probably never support:
	* 0f - lar, lsl, syscall, clts, sysret, sysenter, sysexit, invd, wbinvd, ud2

This answers my questions.

> Please simply ignore my email if you think everything is fine.

Yep.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-07 16:58             ` Oleg Nesterov
@ 2011-10-10 12:25               ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-10 12:25 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-07 18:58:28]:

> 
> Agreed, this looks much, much better. In both cases the task is current,
> it is safe to change ->blocked.
> 
> But please avoid sigprocmask(), we have set_current_blocked().

Sure, I will use set_current_blocked().

While we are here, do you suggest I re-use current->saved_sigmask and
hence use set_restore_sigmask() while resetting the sigmask?

I see saved_sigmask being used just before a task sleeps and restored
when the task is scheduled back, so I don't see a case where using
saved_sigmask in uprobes could conflict with its current usage.

However, if you prefer that we use a different sigmask to save and
restore, I can make it part of the utask structure.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 4/26]   uprobes: Define hooks for mmap/munmap.
  2011-10-07 17:36         ` Oleg Nesterov
@ 2011-10-10 12:31           ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-10 12:31 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds,
	Ananth N Mavinakayanahalli, Hugh Dickins, Christoph Hellwig,
	Jonathan Corbet, Thomas Gleixner, Masami Hiramatsu,
	Andrew Morton, Jim Keniston, Roland McGrath, Andi Kleen, LKML

> > >
> > > So. We are adding the new mapping, we should find all breakpoints this
> > > file has in the start/end range.
> > >
> > > We are holding ->mmap_sem... this seems enough to protect against the
> > > races with register/unregister. Except, what if __register_uprobe()
> > > fails? In this case __unregister_uprobe() does delete_uprobe() at the
> > > very end. What if mmap mmap_uprobe() is called right before delete_?
> > >
> >
> > Because consumers would be NULL before _unregister_uprobe kicks in, we
> > shouldnt have a problem here.
> 
> Hmm. But it is not NULL.
> 
> Once again, I didn't mean unregister_uprobe(). I meant register_uprobe().
> In this case, if __register_uprobe() fails, we are doing __unregister
> but uprobe->consumer != NULL.

Oh okay, I missed setting uprobe->consumer = NULL once __register_uprobe()
fails.
I shall set uprobe->consumer = NULL (the other option is calling
del_consumer(), but I don't see a need for it) just before calling
__unregister_uprobe(), and only if __register_uprobe() fails.
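
As a toy model of that error path (stand-in types and function bodies,
not the kernel's): clear ->consumers before the unregister path runs,
so a concurrent mmap walking the uprobe sees no consumer and skips it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct uprobe_consumer { struct uprobe_consumer *next; };
struct uprobe { struct uprobe_consumer *consumers; };

/* Simulates __register_uprobe() failing. */
static bool fake_register_uprobe(struct uprobe *u)
{
	(void)u;
	return false;
}

/*
 * Model of the fix: if registration fails, clear the consumer list
 * before the unregister path runs, so a racing mmap_uprobe() that
 * checks ->consumers treats this uprobe as dead.
 */
static bool add_consumer_and_register(struct uprobe *u,
				      struct uprobe_consumer *c)
{
	c->next = u->consumers;
	u->consumers = c;
	if (!fake_register_uprobe(u)) {
		u->consumers = NULL;	/* the fix under discussion */
		/* ... __unregister_uprobe(u) would run here ... */
		return false;
	}
	return true;
}
```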

> 
> Just suppose that the caller of register_uprobe() gets a (long) preemption
> right before __unregister_uprobe()->delete_uprobe(). What if mmap() is
> called at this time?
> 
> > Am I missing something?
> 
> May be you, may be me. Please recheck ;)

Rechecked and found the issue. Thanks.

> 
> > I think this would be taken care of if we move the munmap_uprobe() hook
> > from unmap_vmas to unlink_file_vma().
> 
> Probably yes, we should rely on prio_tree locking/changes.
> 
> > The other thing that I need to investigate a bit more is if I have
> > handle all cases of mremap correctly.
> 
> Yes. May be mmap_uprobe() should be "closer" to vma_prio_tree_add/insert
> too, but I am not sure.

Okay, that seems like a good idea.

> 
> Oleg.
> 

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-10 12:25               ` Srikar Dronamraju
@ 2011-10-10 18:25                 ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-10 18:25 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On 10/10, Srikar Dronamraju wrote:
>
> While we are here, do you suggest I re-use current->saved_sigmask and
> hence use set_restore_sigmask() while resetting the sigmask?
>
> I see saved_sigmask being used just before task sleeps and restored when
> task is scheduled back. So I dont see a case where using saved_sigmask
> in uprobes could conflict with its current usage.

Yes, I think this is possible, and probably you do not even need
set_restore_sigmask().

But. There are some problems with this approach too.

Firstly, even if you block all signals, there are other reasons for
TIF_SIGPENDING which you can't control. For example, the task can be
frozen or it can stop in UTASK_SSTEP state. Not good: if we have
enough threads, this can lead to a "soft" deadlock. Say, a group
stop can never finish because a thread sleeps in xol_wait_event()
"forever".

Another problem is that it is not possible to block the "implicit"
SIGKILL sent by exec/exit_group/etc. This means the task can exit
without sstep_complete/xol_free_insn_slot/etc. Mostly this is fine,
we have free_uprobe_utask()->xol_free_insn_slot(). But in theory
this can deadlock afaics. Suppose that the coredumping is in progress,
the killed UTASK_SSTEP task hangs in exit_mm() waiting for other
threads. If we have enough threads like this, we can deadlock with
another thread sleeping in xol_wait_event().

This can be fixed, we can move free_uprobe_utask() from
put_task_struct() to mm_release(). Btw, imho this makes sense anyway,
why should a zombie thread abuse a slot?

However the first problem looks nasty, even if it is not very serious.
And, otoh, it doesn't look right to block SIGKILL, the task can loop
forever executing the xol insn (see below).



What do you think about the patch below? On top of 25/26, uncompiled,
untested. With this patch the task simply refuses to react to
TIF_SIGPENDING until sstep_complete().

This relies on the fact that do_notify_resume() calls
uprobe_notify_resume() before do_signal(), I guess this is safe because
we have other reasons for this order.

And, unless I missed something, this makes
free_uprobe_utask()->xol_free_insn_slot() unnecessary.



HOWEVER! I simply do not know what we should do if the probed insn
is something like asm("1:; jmp 1b;"). IIUC, in this case sstep_complete()
never returns true. The patch also adds the fatal_signal_pending()
check to make this task killable, but the problem is: whatever we do,
I do not think it is correct to disable/delay the signals in this case.
With any approach.

What do you think? Maybe we should simply disallow probing such insns?

Once again, the change in sstep_complete() is "off-topic", this is
another problem we should solve somehow.

Oleg.

--- x/kernel/signal.c
+++ x/kernel/signal.c
@@ -2141,6 +2141,15 @@ int get_signal_to_deliver(siginfo_t *inf
 	struct signal_struct *signal = current->signal;
 	int signr;
 
+#ifdef CONFIG_UPROBES
+	if (unlikely(current->utask &&
+			current->utask->state != UTASK_RUNNING)) {
+		WARN_ON_ONCE(current->utask->state != UTASK_SSTEP);
+		clear_thread_flag(TIF_SIGPENDING);
+		return 0;
+	}
+#endif
+
 relock:
 	/*
 	 * We'll jump back here after any time we were stopped in TASK_STOPPED.
--- x/kernel/uprobes.c
+++ x/kernel/uprobes.c
@@ -1331,7 +1331,8 @@ static bool sstep_complete(struct uprobe
 	 * If we have executed out of line, Instruction pointer
 	 * cannot be same as virtual address of XOL slot.
 	 */
-	if (vaddr == current->utask->xol_vaddr)
+	if (vaddr == current->utask->xol_vaddr &&
+			!__fatal_signal_pending(current))
 		return false;
 	post_xol(uprobe, regs);
 	return true;
@@ -1390,8 +1391,7 @@ void uprobe_notify_resume(struct pt_regs
 			utask->state = UTASK_RUNNING;
 			user_disable_single_step(current);
 			xol_free_insn_slot(current);
-
-			/* TODO Stop queueing signals. */
+			recalc_sigpending();
 		}
 	}
 	return;


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-10 18:25                 ` Oleg Nesterov
@ 2011-10-11 17:24                   ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-11 17:24 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On 10/10, Oleg Nesterov wrote:
>
> HOWEVER! I simply do not know what should we do if the probed insn
> is something like asm("1:; jmp 1b;"). IIUC, in this sstep_complete()
> never returns true. The patch also adds the fatal_signal_pending()
> check to make this task killlable, but the problem is: whatever we do,
> I do not think it is correct to disable/delay the signals in this case.
> With any approach.
>
> What do you think? Maybe we should simply disallow to probe such insns?

Or. Could you explain why we can't simply remove the
"if (vaddr == current->utask->xol_vaddr)" check from sstep_complete() ?

In some sense, imho this looks more correct for "rep" or jmp/call self.
The task will trap again on the same (original) address, and
handler_chain() will be called to notify the consumers.

But. I am really, really ignorant in this area, I am almost sure this
is not that simple.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-10 18:25                 ` Oleg Nesterov
@ 2011-10-11 17:26                   ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-11 17:26 UTC (permalink / raw)
  To: Oleg Nesterov, Masami Hiramatsu
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-10 20:25:35]:

> On 10/10, Srikar Dronamraju wrote:
> >
> > While we are here, do you suggest I re-use current->saved_sigmask and
> > hence use set_restore_sigmask() while resetting the sigmask?
> >
> > I see saved_sigmask being used just before task sleeps and restored when
> > task is scheduled back. So I dont see a case where using saved_sigmask
> > in uprobes could conflict with its current usage.
> 
> Yes, I think this is possible, and probably you do not even need
> set_restore_sigmask().
> 
> But. There are some problems with this approach too.
> 
> Firstly, even if you block all signals, there are other reasons for
> TIF_SIGPENDING which you can't control. For example, the task can be
> frozen or it can stop in UTASK_SSTEP state. Not good, if we have
> enough threads, this can lead to the "soft" deadlock. Say, a group
> stop can never finish because a thread sleeps in xol_wait_event()
> "forever".
> 

My idea was to block signals only across the singlestep itself, i.e.
we don't block signals while we contend for the slot.

> Another problem is that it is not possible to block the "implicit"
> SIGKILL sent by exec/exit_group/etc. This mean the task can exit
> without sstep_complete/xol_free_insn_slot/etc. Mostly this is fine,
> we have free_uprobe_utask()->xol_free_insn_slot(). But in theory
> this can deadlock afaics. Suppose that the coredumping is in progress,
> the killed UTASK_SSTEP task hangs in exit_mm() waiting for other
> threads. If we have enough threads like this, we can deadlock with
> another thread sleeping in xol_wait_event().

Shouldn't the behaviour be the same as for threads that did a
select() or sigsuspend()?

> 
> This can be fixed, we can move free_uprobe_utask() from
> put_task_struct() to mm_release(). Btw, imho this makes sense anyway,
> why should a zombie thread abuse a slot?
> 

Yes, that makes sense. Will make this change.

> However the first problem looks nasty, even if it is not very serious.
> And, otoh, it doesn't look right to block SIGKILL, the task can loop
> forever executing the xol insn (see below).
> 
> 
> 
> What do you think about the patch below? On top of 25/26, uncompiled,
> untested. With this patch the task simply refuses to react to
> TIF_SIGPENDING until sstep_complete().
> 

Your patch looks very simple and clean.
Will test this patch and report back.

> This relies on the fact that do_notify_resume() calls
> uprobe_notify_resume() before do_signal(), I guess this is safe because
> we have other reasons for this order.
> 
> And, unless I missed something, this makes
> free_uprobe_utask()->xol_free_insn_slot() unnecessary.

What if a fatal (SIGKILL) signal was delivered only to that thread even
before it singlestepped? Or a fatal signal for a thread-group, when more
than one thread-group shares the mm?
> 
> 
> 
> HOWEVER! I simply do not know what should we do if the probed insn
> is something like asm("1:; jmp 1b;"). IIUC, in this sstep_complete()
> never returns true. The patch also adds the fatal_signal_pending()
> check to make this task killlable, but the problem is: whatever we do,
> I do not think it is correct to disable/delay the signals in this case.
> With any approach.
> 
> What do you think? Maybe we should simply disallow to probe such insns?

Yes, we should disallow such probes, but I am not sure we can detect
them with the current instruction analyzer.

Masami, can we detect them (instructions that jump back to the same
address they are executing at)?
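
For the simplest case, at least, a short self-branch is recognizable
from the instruction bytes alone: the classic two-byte x86 idiom
"eb fe" is jmp rel8 with a -2 displacement, i.e. a jump to its own
first byte. A sketch (my own helper, not part of the patchset;
rep-prefixed insns, conditional jumps, and indirect jumps that happen
to loop would still need the full decoder):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true for the two-byte x86 self-branch "eb fe" (jmp rel8 with
 * displacement -2, which targets its own first byte). Only the trivial
 * case is handled: prefixes, jcc rel8/rel32, and indirect jumps that
 * loop back are not detected here.
 */
static bool insn_is_trivial_self_jump(const uint8_t *insn)
{
	return insn[0] == 0xeb && insn[1] == 0xfe;
}
```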

> 
> Once again, the change in sstep_complete() is "off-topic", this is
> another problem we should solve somehow.
> 

Agree.

You have already commented on why blocking signals is a problem, but I
still thought I would post the patch that I had, to let you know what
I was thinking before I saw your patch.

While task is processing a singlestep due to uprobes breakpoint hit, 
block signals from the time it enables singlestep to the time it disables
singlestep.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/uprobes.c |   38 ++++++++++++++++++++++++++++++++------
 1 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 5067979..bc3e178 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1366,6 +1366,26 @@ static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
 }
 
 /*
+ * While we are handling breakpoint / singlestep, ensure that a
+ * SIGTRAP is not delivered to the task.
+ */
+static void __clear_trap_flag(void)
+{
+	sigdelset(&current->pending.signal, SIGTRAP);
+	sigdelset(&current->signal->shared_pending.signal, SIGTRAP);
+}
+
+static void clear_trap_flag(void)
+{
+	if (!test_and_clear_thread_flag(TIF_SIGPENDING))
+		return;
+
+	spin_lock_irq(&current->sighand->siglock);
+	__clear_trap_flag();
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
+/*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
  *
@@ -1380,6 +1400,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 	struct mm_struct *mm;
 	struct uprobe *u = NULL;
 	unsigned long probept;
+	sigset_t masksigs;
 
 	utask = current->utask;
 	mm = current->mm;
@@ -1401,13 +1422,18 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			if (!utask)
 				goto cleanup_ret;
 		}
-		/* TODO Start queueing signals. */
 		utask->active_uprobe = u;
 		handler_chain(u, regs);
 		utask->state = UTASK_SSTEP;
-		if (!pre_ssout(u, regs, probept))
+		if (!pre_ssout(u, regs, probept)) {
+			sigfillset(&masksigs);
+			sigdelsetmask(&masksigs,
+					sigmask(SIGKILL)|sigmask(SIGSTOP));
+			current->saved_sigmask = current->blocked;
+			set_current_blocked(&masksigs);
+			clear_trap_flag();
 			user_enable_single_step(current);
-		else
+		} else
 			/* Cannot Singlestep; re-execute the instruction. */
 			goto cleanup_ret;
 	} else if (utask->state == UTASK_SSTEP) {
@@ -1418,8 +1444,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			utask->state = UTASK_RUNNING;
 			user_disable_single_step(current);
 			xol_free_insn_slot(current);
-
-			/* TODO Stop queueing signals. */
+			clear_trap_flag();
+			set_restore_sigmask();
 		}
 	}
 	return;
@@ -1433,7 +1459,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		put_uprobe(u);
 		set_instruction_pointer(regs, probept);
 	} else
-		/*TODO Return SIGTRAP signal */
+		send_sig(SIGTRAP, current, 0);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
@ 2011-10-11 17:26                   ` Srikar Dronamraju
  0 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-11 17:26 UTC (permalink / raw)
  To: Oleg Nesterov, Masami Hiramatsu
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-10 20:25:35]:

> On 10/10, Srikar Dronamraju wrote:
> >
> > While we are here, do you suggest I re-use current->saved_sigmask and
> > hence use set_restore_sigmask() while resetting the sigmask?
> >
> > I see saved_sigmask being used just before task sleeps and restored when
> > task is scheduled back. So I dont see a case where using saved_sigmask
> > in uprobes could conflict with its current usage.
> 
> Yes, I think this is possible, and probably you do not even need
> set_restore_sigmask().
> 
> But. There are some problems with this approach too.
> 
> Firstly, even if you block all signals, there are other reasons for
> TIF_SIGPENDING which you can't control. For example, the task can be
> frozen or it can stop in UTASK_SSTEP state. Not good, if we have
> enough threads, this can lead to the "soft" deadlock. Say, a group
> stop can never finish because a thread sleeps in xol_wait_event()
> "forever".
> 

My idea was to block signals only across the singlestep itself, i.e.,
we don't block signals while we contend for the slot.

> Another problem is that it is not possible to block the "implicit"
> SIGKILL sent by exec/exit_group/etc. This mean the task can exit
> without sstep_complete/xol_free_insn_slot/etc. Mostly this is fine,
> we have free_uprobe_utask()->xol_free_insn_slot(). But in theory
> this can deadlock afaics. Suppose that the coredumping is in progress,
> the killed UTASK_SSTEP task hangs in exit_mm() waiting for other
> threads. If we have enough threads like this, we can deadlock with
> another thread sleeping in xol_wait_event().

Shouldn't the behaviour be the same as for threads that did a
select or sigsuspend?

> 
> This can be fixed, we can move free_uprobe_utask() from
> put_task_struct() to mm_release(). Btw, imho this makes sense anyway,
> why should a zombie thread abuse a slot?
> 

Yes, makes sense. Will make this change.

> However the first problem looks nasty, even if it is not very serious.
> And, otoh, it doesn't look right to block SIGKILL, the task can loop
> forever executing the xol insn (see below).
> 
> 
> 
> What do you think about the patch below? On top of 25/26, uncompiled,
> untested. With this patch the task simply refuses to react to
> TIF_SIGPENDING until sstep_complete().
> 

Your patch looks very simple and clean.
Will test this patch and revert. 

> This relies on the fact that do_notify_resume() calls
> uprobe_notify_resume() before do_signal(), I guess this is safe because
> we have other reasons for this order.
> 
> And, unless I missed something, this makes
> free_uprobe_utask()->xol_free_insn_slot() unnecessary.

What if a fatal (SIGKILL) signal was delivered only to that thread even
before it singlestepped? Or a fatal signal was sent to a thread-group,
but more than one thread-group shares the mm?
> 
> 
> 
> HOWEVER! I simply do not know what should we do if the probed insn
> is something like asm("1:; jmp 1b;"). IIUC, in this sstep_complete()
> never returns true. The patch also adds the fatal_signal_pending()
> check to make this task killlable, but the problem is: whatever we do,
> I do not think it is correct to disable/delay the signals in this case.
> With any approach.
> 
> What do you think? Maybe we should simply disallow to probe such insns?

Yes, we should disallow such probes, but I am not sure we can detect
them with the current instruction analyzer.

Masami, can we detect them (instructions that jump back to the same
address they are executing at)?

> 
> Once again, the change in sstep_complete() is "off-topic", this is
> another problem we should solve somehow.
> 

Agree.

You have already commented on why blocking signals is a problem, but I
still thought I would post the patch I had, to let you know what I was
thinking before I saw your patch.

While a task is processing a singlestep due to a uprobes breakpoint hit,
block signals from the time it enables singlestep to the time it
disables singlestep.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/uprobes.c |   38 ++++++++++++++++++++++++++++++++------
 1 files changed, 32 insertions(+), 6 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 5067979..bc3e178 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1366,6 +1366,26 @@ static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
 }
 
 /*
+ * While we are handling breakpoint / singlestep, ensure that a
+ * SIGTRAP is not delivered to the task.
+ */
+static void __clear_trap_flag(void)
+{
+	sigdelset(&current->pending.signal, SIGTRAP);
+	sigdelset(&current->signal->shared_pending.signal, SIGTRAP);
+}
+
+static void clear_trap_flag(void)
+{
+	if (!test_and_clear_thread_flag(TIF_SIGPENDING))
+		return;
+
+	spin_lock_irq(&current->sighand->siglock);
+	__clear_trap_flag();
+	spin_unlock_irq(&current->sighand->siglock);
+}
+
+/*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
  *
@@ -1380,6 +1400,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 	struct mm_struct *mm;
 	struct uprobe *u = NULL;
 	unsigned long probept;
+	sigset_t masksigs;
 
 	utask = current->utask;
 	mm = current->mm;
@@ -1401,13 +1422,18 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			if (!utask)
 				goto cleanup_ret;
 		}
-		/* TODO Start queueing signals. */
 		utask->active_uprobe = u;
 		handler_chain(u, regs);
 		utask->state = UTASK_SSTEP;
-		if (!pre_ssout(u, regs, probept))
+		if (!pre_ssout(u, regs, probept)) {
+			sigfillset(&masksigs);
+			sigdelsetmask(&masksigs,
+					sigmask(SIGKILL)|sigmask(SIGSTOP));
+			current->saved_sigmask = current->blocked;
+			set_current_blocked(&masksigs);
+			clear_trap_flag();
 			user_enable_single_step(current);
-		else
+		} else
 			/* Cannot Singlestep; re-execute the instruction. */
 			goto cleanup_ret;
 	} else if (utask->state == UTASK_SSTEP) {
@@ -1418,8 +1444,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			utask->state = UTASK_RUNNING;
 			user_disable_single_step(current);
 			xol_free_insn_slot(current);
-
-			/* TODO Stop queueing signals. */
+			clear_trap_flag();
+			set_restore_sigmask();
 		}
 	}
 	return;
@@ -1433,7 +1459,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		put_uprobe(u);
 		set_instruction_pointer(regs, probept);
 	} else
-		/*TODO Return SIGTRAP signal */
+		send_sig(SIGTRAP, current, 0);
 }
 
 /*

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-11 17:24                   ` Oleg Nesterov
@ 2011-10-11 17:38                     ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-11 17:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Andi Kleen, Thomas Gleixner,
	Jonathan Corbet, Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

> > HOWEVER! I simply do not know what should we do if the probed insn
> > is something like asm("1:; jmp 1b;"). IIUC, in this sstep_complete()
> > never returns true. The patch also adds the fatal_signal_pending()
> > check to make this task killlable, but the problem is: whatever we do,
> > I do not think it is correct to disable/delay the signals in this case.
> > With any approach.
> >
> > What do you think? Maybe we should simply disallow to probe such insns?
> 
> Or. Could you explain why we can't simply remove the
> "if (vaddr == current->utask->xol_vaddr)" check from sstep_complete() ?


Yes, we could remove the check and rely on just the DIE_DEBUG
notification to say that the singlestep has occurred. The check was
mostly needed when we were not handling signals during singlestep.

> In some sense, imho this looks more correct for "rep" or jmp/call self.
> The task will trap again on the same (original) address, and
> handler_chain() will be called to notify the consumers.
> 
> But. I am really, really ignorant in this area, I am almost sure this
> is not that simple.
> 

That's being modest.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread


* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-11 17:26                   ` Srikar Dronamraju
@ 2011-10-11 18:56                     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-11 18:56 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Masami Hiramatsu, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On 10/11, Srikar Dronamraju wrote:
>
> * Oleg Nesterov <oleg@redhat.com> [2011-10-10 20:25:35]:
>
> > Yes, I think this is possible, and probably you do not even need
> > set_restore_sigmask().
> >
> > But. There are some problems with this approach too.
> >
> > Firstly, even if you block all signals, there are other reasons for
> > TIF_SIGPENDING which you can't control. For example, the task can be
> > frozen or it can stop in UTASK_SSTEP state. Not good, if we have
> > enough threads, this can lead to the "soft" deadlock. Say, a group
> > stop can never finish because a thread sleeps in xol_wait_event()
> > "forever".
> >
>
> My idea was to block signals just across singlestep only. I.e
> we dont block signals while we contend for the slot.

Yes, yes, I see. But, once again, this can only protect from kill().

The task can stop even if you block all signals, another thread
can initiate the group stop and set JOBCTL_STOP_PENDING + TIF_SIGPENDING.
And note that it can stop _before_ it returns to user mode to step
over the xol insn.

In theory the tasks like this can consume all slots, and if we have
yet another thread waiting in xol_wait_event(), we deadlock. Although
in this case SIGCONT helps, but this group stop can never finish.

> > Another problem is that it is not possible to block the "implicit"
> > SIGKILL sent by exec/exit_group/etc. This mean the task can exit
> > without sstep_complete/xol_free_insn_slot/etc. Mostly this is fine,
> > we have free_uprobe_utask()->xol_free_insn_slot(). But in theory
> > this can deadlock afaics. Suppose that the coredumping is in progress,
> > the killed UTASK_SSTEP task hangs in exit_mm() waiting for other
> > threads. If we have enough threads like this, we can deadlock with
> > another thread sleeping in xol_wait_event().
>
> Shouldnt the behaviour be the same as threads that did a
> select,sigsuspend?

Hmm. I don't understand... Could you explain?

Firstly, select/sigsuspend can't block SIGKILL, but this doesn't matter.
My point was, the task can exit in UTASK_SSTEP state, and without
xol_free_insn_slot(). And this (in theory) can lead to the "real"
deadlock.

> > However the first problem looks nasty, even if it is not very serious.
> > And, otoh, it doesn't look right to block SIGKILL, the task can loop
> > forever executing the xol insn (see below).
> >
> >
> >
> > What do you think about the patch below? On top of 25/26, uncompiled,
> > untested. With this patch the task simply refuses to react to
> > TIF_SIGPENDING until sstep_complete().
> >
>
> Your patch looks very simple and clean.
> Will test this patch and revert.

Great. I'll think a bit more and send you the "final" version tomorrow.
Assuming we can change sstep_complete() as we discussed, it doesn't need
fatal_signal_pending().

HOWEVER. There is yet another problem. Another thread can, say, unmap()
xol_vma. In this case we should ensure that the task can't fault in an
endless loop.

> > And, unless I missed something, this makes
> > free_uprobe_utask()->xol_free_insn_slot() unnecessary.
>
> What if a fatal (SIGKILL) signal was delivered only to that thread

this is not possible, in this case all threads are killed. But,

> or a fatal signal for a thread-group but more
> than one thread-group share the mm?

Yes, this is possible.

Sorry for confusion. Yes, if we have the fatal_signal_pending() check
in sstep_complete(), then we do need
free_uprobe_utask()->xol_free_insn_slot(). But this check was added
only to illustrate another problem with the self-repeating insns.

And. With "HOWEVER" above, we probably need this xol_free anyway.

> you have already commented why blocking signals is a problem, but I
> still thought I will post the patch that I had to let you know what I
> was thinking before I saw your patch.
>
> While task is processing a singlestep due to uprobes breakpoint hit,
> block signals from the time it enables singlestep to the time it disables
> singlestep.

OK, it is too late for me today, I'll take a look tomorrow.

This approach has some advantages too, perhaps we should make something
"in between".

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-11 18:56                     ` Oleg Nesterov
@ 2011-10-12 12:01                       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-12 12:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Masami Hiramatsu, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

> 
> Yes, yes, I see. But, once again, this can only protect from kill().
> 
> The task can stop even if you block all signals, another thread
> can initiate the group stop and set JOBCTL_STOP_PENDING + TIF_SIGPENDING.
> And note that it can stop _before_ it returns to user mode to step
> over the xol insn.
> 
> In theory the tasks like this can consume all slots, and if we have
> yet another thread waiting in xol_wait_event(), we deadlock. Although
> in this case SIGCONT helps, but this group stop can never finish.
> 

Okay. 

> > > Another problem is that it is not possible to block the "implicit"
> > > SIGKILL sent by exec/exit_group/etc. This mean the task can exit
> > > without sstep_complete/xol_free_insn_slot/etc. Mostly this is fine,
> > > we have free_uprobe_utask()->xol_free_insn_slot(). But in theory
> > > this can deadlock afaics. Suppose that the coredumping is in progress,
> > > the killed UTASK_SSTEP task hangs in exit_mm() waiting for other
> > > threads. If we have enough threads like this, we can deadlock with
> > > another thread sleeping in xol_wait_event().
> >
> > Shouldnt the behaviour be the same as threads that did a
> > select,sigsuspend?
> 
> Hmm. I don't understand... Could you explain?
> 
> Firstly, select/sigsuspend can't block SIGKILL, but this doesn't matter.
> My point was, the task can exit in UTASK_SSTEP state, and without
> xol_free_insn_slot(). And this (in theory) can lead to the "real"
> deadlock.

I think we should be okay if the task exits in UTASK_SSTEP state.
All I thought we needed to do was block it from doing anything except
exit or singlestep. Our exit hook should clean up any references that
we hold.

> 
> > > However the first problem looks nasty, even if it is not very serious.
> > > And, otoh, it doesn't look right to block SIGKILL, the task can loop
> > > forever executing the xol insn (see below).
> > >
> > >
> > >
> > > What do you think about the patch below? On top of 25/26, uncompiled,
> > > untested. With this patch the task simply refuses to react to
> > > TIF_SIGPENDING until sstep_complete().
> > >
> >
> > Your patch looks very simple and clean.
> > Will test this patch and revert.
> 
> Great. I'll think a bit more and send you the "final" version tomorrow.
> Assuming we can change sstep_complete() as we discussed, it doesn't need
> fatal_signal_pending().

Okay. 

> 
> HOWEVER. There is yet another problem. Another thread can, say, unmap()
> xol_vma. In this case we should ensure that the task can't fault in an
> endless loop.
> 

Hmm, should we add a check in unmap() to see if the vma we are trying
to unmap is the xol_vma, and if so, return?
Our assumption has been that once the xol_vma has been created, it
stays around till the process gets killed.

> > > And, unless I missed something, this makes
> > > free_uprobe_utask()->xol_free_insn_slot() unnecessary.
> >
> > What if a fatal (SIGKILL) signal was delivered only to that thread
> 
> this is not possible, in this case all threads are killed. But,
> 
> > or a fatal signal for a thread-group but more
> > than one thread-group share the mm?
> 
> Yes, this is possible.
> 
> Sorry for confusion. Yes, if we have the fatal_signal_pending() check
> in sstep_complete(), then we do need
> free_uprobe_utask()->xol_free_insn_slot(). But this check was added
> only to illustrate another problem with the self-repeating insns.
> 
> And. With "HOWEVER" above, we probably need this xol_free anyway.
> 
> > you have already commented why blocking signals is a problem, but I
> > still thought I will post the patch that I had to let you know what I
> > was thinking before I saw your patch.
> >
> > While task is processing a singlestep due to uprobes breakpoint hit,
> > block signals from the time it enables singlestep to the time it disables
> > singlestep.
> 
> OK, it is too late for me today, I'll take a look tomorrow.
> 
> This approach has some advantages too, perhaps we should make something
> "in between".
> 

Okay.

-- 
Thanks and Regards
Srikar

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-12 12:01                       ` Srikar Dronamraju
@ 2011-10-12 19:34                         ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-12 19:34 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Masami Hiramatsu, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On 10/12, Srikar Dronamraju wrote:
>
> I think we should be okay if the test exits in UTASK_SSTEP state.

Yes, and afaics we can't avoid this case, at least currently.

But we should move free_uprobe_utask() to mm_release(), or somewhere
else before mm->core_state check in exit_mm().

My main concern is stop/freeze in UTASK_SSTEP state. If nothing else,
a debugger can attach to the stopped task and disable the stepping. Or
SIGKILL; it should work in this case.

> > Great. I'll think a bit more and send you the "final" version tomorrow.
> > Assuming we can change sstep_complete() as we discussed, it doesn't need
> > fatal_signal_pending().
>
> Okay.

Sorry. I was busy today. Tomorrow ;)

> > HOWEVER. There is yet another problem. Another thread can, say, unmap()
> > xol_vma. In this case we should ensure that the task can't fault in an
> > endless loop.
>
> Hmm should we add a check in unmap() to see if the vma that we are
> trying to unmap is the xol_vma and if so return?

Oh, I am not sure. You know, I _think_ that perhaps we should do something
different in the long term. In particular, this xol page should not have
a vma at all. That way we wouldn't need to worry about unmap/remap/mprotect.
But even if this is possible (I am not really sure), I do not think we
should do this right now.

> Our assumption has been that once an xol_vma has been created, it should
> be around till the process gets killed.

Yes, I see. But afaics this assumption is currently wrong. This means
that we should ensure an evil application can't exploit this fact.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH v5 3.1.0-rc4-tip 26/26]   uprobes: queue signals while thread is singlestepping.
  2011-10-11 17:26                   ` Srikar Dronamraju
@ 2011-10-12 19:59                     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-12 19:59 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Masami Hiramatsu, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds, Hugh Dickins,
	Christoph Hellwig, Andi Kleen, Thomas Gleixner, Jonathan Corbet,
	Andrew Morton, Jim Keniston, Roland McGrath,
	Ananth N Mavinakayanahalli, LKML

On 10/11, Srikar Dronamraju wrote:
>
> --- a/kernel/uprobes.c
> +++ b/kernel/uprobes.c
> @@ -1366,6 +1366,26 @@ static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
>  }
>
>  /*
> + * While we are handling breakpoint / singlestep, ensure that a
> + * SIGTRAP is not delivered to the task.
> + */
> +static void __clear_trap_flag(void)
> +{
> +	sigdelset(&current->pending.signal, SIGTRAP);
> +	sigdelset(&current->signal->shared_pending.signal, SIGTRAP);
> +}
> +
> +static void clear_trap_flag(void)
> +{
> +	if (!test_and_clear_thread_flag(TIF_SIGPENDING))
> +		return;
> +
> +	spin_lock_irq(&current->sighand->siglock);
> +	__clear_trap_flag();
> +	spin_unlock_irq(&current->sighand->siglock);
> +}

And this is called before and after the step.

Confused... For what? What makes SIGTRAP special? Where does this
signal come from? If you meant do_debug(), this seems impossible:
uprobe_exception_notify(DIE_DEBUG) returns NOTIFY_STOP.

I certainly missed something.

> @@ -1401,13 +1422,18 @@ void uprobe_notify_resume(struct pt_regs *regs)
>  			if (!utask)
>  				goto cleanup_ret;
>  		}
> -		/* TODO Start queueing signals. */
>  		utask->active_uprobe = u;
>  		handler_chain(u, regs);
>  		utask->state = UTASK_SSTEP;
> -		if (!pre_ssout(u, regs, probept))
> +		if (!pre_ssout(u, regs, probept)) {
> +			sigfillset(&masksigs);
> +			sigdelsetmask(&masksigs,
> +					sigmask(SIGKILL)|sigmask(SIGSTOP));
> +			current->saved_sigmask = current->blocked;
> +			set_current_blocked(&masksigs);

OK, we already discussed the problems with this approach.

> +			clear_trap_flag();

In any case unneeded, we already blocked SIGTRAP.

> @@ -1418,8 +1444,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
>  			utask->state = UTASK_RUNNING;
>  			user_disable_single_step(current);
>  			xol_free_insn_slot(current);
> -
> -			/* TODO Stop queueing signals. */
> +			clear_trap_flag();

This is what I can't understand.

> +			set_restore_sigmask();

No, this is not right. If we have a pending signal, the signal handler
will run with the almost-all-blocked mask we set before.

And this is overkill anyway; you could simply do
set_current_blocked(&current->saved_sigmask).

->saved_sigmask is only used when we return from a syscall, so uprobes
can (ab)use it safely.

> @@ -1433,7 +1459,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
>  		put_uprobe(u);
>  		set_instruction_pointer(regs, probept);
>  	} else
> -		/*TODO Return SIGTRAP signal */
> +		send_sig(SIGTRAP, current, 0);

This change looks "off-topic" for the problems we are discussing.

Or I missed something and this is connected to the clear_trap_flag()
somehow?

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [PATCH 0/X] (Was:  Uprobes patchset with perf probe support)
  2011-09-20 11:59 ` Srikar Dronamraju
@ 2011-10-15 19:00   ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-15 19:00 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Hello.

I was trying to make the signal patches we discussed, but I have
noticed other problems.

Please see the suggested initial fixes; more to come. Of course
they were not tested ;) Please review.

And it is not that I suggest adding this series; it is just much
simpler for me to write the patch with a changelog to explain
what I mean.

IOW, if you agree with these fixes, please incorporate them into
the next version. Feel free to redo.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [PATCH 1/X] uprobes: write_opcode: the new page needs PG_uptodate
  2011-10-15 19:00   ` Oleg Nesterov
@ 2011-10-15 19:00     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-15 19:00 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

write_opcode()->__replace_page() installs the new anonymous page;
this new_page is PageSwapBacked() and can be swapped out.

However, it forgets to do SetPageUptodate(); fix write_opcode().

For example, this is needed if do_swap_page() finds the original
page in the swap cache (and doesn't try to read it back); in
this case it returns VM_FAULT_SIGBUS.
---
 kernel/uprobes.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 3928bcc..52b20c8 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -200,6 +200,8 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
 		goto put_out;
 	}
 
+	__SetPageUptodate(new_page);
+
 	/*
 	 * lock page will serialize against do_wp_page()'s
 	 * PageAnon() handling
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 2/X] uprobes: write_opcode() needs put_page(new_page) unconditionally
  2011-10-15 19:00   ` Oleg Nesterov
@ 2011-10-15 19:00     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-15 19:00 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Every write_opcode()->__replace_page() leaks the new page on success.

We hold the reference after alloc_page_vma(), and __replace_page()
does another get_page() for the new mapping, so we need put_page(new_page)
in any case.

Alternatively we could remove __replace_page()->get_page(), but it is
better to change write_opcode(). This way it is simpler to unify the
code with ksm.c:replace_page(), and we can simplify the error handling
in write_opcode(): the patch simply adds a single page_cache_release()
under the "unlock_out" label.
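The underlying refcount arithmetic can be sketched in userspace with toy
stand-ins for get_page()/put_page() (all names here are hypothetical;
this only illustrates why the allocator's reference must be dropped
unconditionally):

```c
#include <assert.h>
#include <stdlib.h>

/* toy page with a reference count; hypothetical stand-ins for the
 * kernel's get_page()/put_page(), only to show the arithmetic */
struct toy_page { int refcount; };

struct toy_page *toy_alloc(void)        /* like alloc_page_vma(): ref = 1 */
{
    struct toy_page *p = malloc(sizeof(*p));
    if (p)
        p->refcount = 1;
    return p;
}

void toy_get(struct toy_page *p) { p->refcount++; }   /* like get_page() */
void toy_put(struct toy_page *p) { p->refcount--; }   /* like put_page() */

/* like __replace_page(): on success the new mapping takes its own ref */
int toy_replace(struct toy_page *p, int succeed)
{
    if (!succeed)
        return -1;
    toy_get(p);
    return 0;
}

/* like the fixed write_opcode(): drop the allocator's reference
 * unconditionally, whether or not the replace succeeded */
int toy_write_opcode(int replace_succeeds)
{
    struct toy_page *p = toy_alloc();
    int refs;

    if (!p)
        return -1;
    toy_replace(p, replace_succeeds);
    toy_put(p);               /* unconditional, as in the patch */
    refs = p->refcount;       /* 1 if the mapping kept it, 0 if not */
    free(p);                  /* demo cleanup only */
    return refs;
}
```

Without the unconditional put, the success path would end with refcount 2
and only the mapping's eventual put would run, leaking the page.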
---
 kernel/uprobes.c |   15 +++++----------
 1 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 52b20c8..fd9c8e3 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -193,15 +193,12 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
 	if (vaddr != (unsigned long) addr)
 		goto put_out;
 
-	/* Allocate a page */
+	ret = -ENOMEM;
 	new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vaddr);
-	if (!new_page) {
-		ret = -ENOMEM;
+	if (!new_page)
 		goto put_out;
-	}
 
 	__SetPageUptodate(new_page);
-
 	/*
 	 * lock page will serialize against do_wp_page()'s
 	 * PageAnon() handling
@@ -220,18 +217,16 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
 	kunmap_atomic(vaddr_old);
 
 	ret = anon_vma_prepare(vma);
-	if (ret) {
-		page_cache_release(new_page);
+	if (ret)
 		goto unlock_out;
-	}
 
 	lock_page(new_page);
 	ret = __replace_page(vma, old_page, new_page);
 	unlock_page(new_page);
-	if (ret != 0)
-		page_cache_release(new_page);
+
 unlock_out:
 	unlock_page(old_page);
+	page_cache_release(new_page);
 
 put_out:
 	put_page(old_page); /* we did a get_page in the beginning */
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 3/X] uprobes: xol_add_vma: fix ->uprobes_xol_area initialization
  2011-10-15 19:00   ` Oleg Nesterov
@ 2011-10-15 19:01     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-15 19:01 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

xol_add_vma() can race with another thread which sets ->uprobes_xol_area;
in this case we can't rely on the per-thread task_lock() and we should
unmap the xol vma.

Move the setting of mm->uprobes_xol_area into xol_add_vma(); it has to
take mmap_sem for writing anyway, and this also simplifies the code.

Change xol_add_vma() to do do_munmap() if it fails after do_mmap_pgoff().
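The fix boils down to the check-then-publish-under-one-lock pattern. A
userspace sketch with a pthread mutex standing in for mmap_sem (all
names here are hypothetical, not the kernel code):

```c
#include <assert.h>
#include <pthread.h>

/* hypothetical stand-ins for mm_struct fields: the mutex plays the
 * role of mmap_sem held for writing, xol_area the role of
 * mm->uprobes_xol_area */
struct toy_mm {
    pthread_mutex_t mmap_sem;
    void *xol_area;
};

/*
 * The fixed pattern: check and publish under the same write lock, so a
 * racing thread either loses inside the lock (the -EALREADY analogue)
 * or observes a fully set up area.
 */
int toy_xol_add_area(struct toy_mm *mm, void *area)
{
    int ret;

    pthread_mutex_lock(&mm->mmap_sem);
    if (mm->xol_area) {
        ret = -1;                /* like -EALREADY */
    } else {
        /* ... map the vma and pin the page here; on failure,
         * do_munmap() and bail out without publishing ... */
        mm->xol_area = area;     /* publish only after full setup */
        ret = 0;
    }
    pthread_mutex_unlock(&mm->mmap_sem);
    return ret;
}
```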
---
 kernel/uprobes.c |   34 ++++++++++++++++------------------
 1 files changed, 16 insertions(+), 18 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index fd9c8e3..6fe2b20 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1049,18 +1049,18 @@ static int xol_add_vma(struct uprobes_xol_area *area)
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
 	unsigned long addr;
-	int ret = -ENOMEM;
+	int ret;
 
 	mm = get_task_mm(current);
 	if (!mm)
 		return -ESRCH;
 
 	down_write(&mm->mmap_sem);
-	if (mm->uprobes_xol_area) {
-		ret = -EALREADY;
+	ret = -EALREADY;
+	if (mm->uprobes_xol_area)
 		goto fail;
-	}
 
+	ret = -ENOMEM;
 	/*
 	 * Find the end of the top mapping and skip a page.
 	 * If there is no space for PAGE_SIZE above
@@ -1078,15 +1078,19 @@ static int xol_add_vma(struct uprobes_xol_area *area)
 
 	if (addr & ~PAGE_MASK)
 		goto fail;
-	vma = find_vma(mm, addr);
 
+	vma = find_vma(mm, addr);
 	/* Don't expand vma on mremap(). */
 	vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
 	area->vaddr = vma->vm_start;
 	if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
-				&vma) > 0)
-		ret = 0;
+				&vma) != 1) {
+		do_munmap(mm, addr, PAGE_SIZE);
+		goto fail;
+	}
 
+	mm->uprobes_xol_area = area;
+	ret = 0;
 fail:
 	up_write(&mm->mmap_sem);
 	mmput(mm);
@@ -1102,7 +1106,7 @@ fail:
  */
 static struct uprobes_xol_area *xol_alloc_area(void)
 {
-	struct uprobes_xol_area *area = NULL;
+	struct uprobes_xol_area *area;
 
 	area = kzalloc(sizeof(*area), GFP_KERNEL);
 	if (unlikely(!area))
@@ -1110,22 +1114,16 @@ static struct uprobes_xol_area *xol_alloc_area(void)
 
 	area->bitmap = kzalloc(BITS_TO_LONGS(UINSNS_PER_PAGE) * sizeof(long),
 								GFP_KERNEL);
-
 	if (!area->bitmap)
 		goto fail;
 
 	init_waitqueue_head(&area->wq);
 	spin_lock_init(&area->slot_lock);
-	if (!xol_add_vma(area) && !current->mm->uprobes_xol_area) {
-		task_lock(current);
-		if (!current->mm->uprobes_xol_area) {
-			current->mm->uprobes_xol_area = area;
-			task_unlock(current);
-			return area;
-		}
-		task_unlock(current);
-	}
 
+	if (xol_add_vma(area))
+		goto fail;
+
+	return area;
 fail:
 	kfree(area->bitmap);
 	kfree(area);
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 4/X] uprobes: xol_add_vma: misc cleanups
  2011-10-15 19:00   ` Oleg Nesterov
@ 2011-10-15 19:01     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-15 19:01 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

1. get_task_mm(current)/mmput() is not needed; we can use ->mm directly.

   It can't be NULL or use_mm'ed(); otherwise we are buggy anyway.

2. use IS_ERR_VALUE() after do_mmap_pgoff().

3. No need to read vma->vm_start, it must be equal to addr returned
   by do_mmap_pgoff().

4. No need to pass vmas => &vma to get_user_pages().
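Point 2 relies on the convention that an mmap-style return value in the
top MAX_ERRNO (4095) values of unsigned long is a negative errno rather
than an address. A userspace sketch of that check, mirroring the
kernel's include/linux/err.h (assumed semantics, not code from this
patch):

```c
#include <assert.h>

/* mirrors the kernel's IS_ERR_VALUE() convention (include/linux/err.h,
 * assumed here): the top MAX_ERRNO values of an unsigned long are
 * reserved for negative errno codes, so one range check distinguishes
 * a valid mapping address from an error return of do_mmap_pgoff(). */
#define TOY_MAX_ERRNO 4095UL

static inline int is_err_value(unsigned long x)
{
    return x >= (unsigned long)-TOY_MAX_ERRNO;
}
```

Unlike the old `addr & ~PAGE_MASK` test, this also rejects page-aligned
values that happen to encode an error.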
---
 kernel/uprobes.c |   13 +++++--------
 1 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 6fe2b20..5c2554c 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1051,9 +1051,7 @@ static int xol_add_vma(struct uprobes_xol_area *area)
 	unsigned long addr;
 	int ret;
 
-	mm = get_task_mm(current);
-	if (!mm)
-		return -ESRCH;
+	mm = current->mm;
 
 	down_write(&mm->mmap_sem);
 	ret = -EALREADY;
@@ -1076,24 +1074,23 @@ static int xol_add_vma(struct uprobes_xol_area *area)
 	addr = do_mmap_pgoff(NULL, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
 	revert_creds(curr_cred);
 
-	if (addr & ~PAGE_MASK)
+	if (IS_ERR_VALUE(addr))
 		goto fail;
 
 	vma = find_vma(mm, addr);
 	/* Don't expand vma on mremap(). */
 	vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
-	area->vaddr = vma->vm_start;
-	if (get_user_pages(current, mm, area->vaddr, 1, 1, 1, &area->page,
-				&vma) != 1) {
+	if (get_user_pages(current, mm, addr, 1, 1, 1,
+					&area->page, NULL) != 1) {
 		do_munmap(mm, addr, PAGE_SIZE);
 		goto fail;
 	}
 
+	area->vaddr = addr;
 	mm->uprobes_xol_area = area;
 	ret = 0;
 fail:
 	up_write(&mm->mmap_sem);
-	mmput(mm);
 	return ret;
 }
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 5/X] uprobes: xol_alloc_area() needs memory barriers
  2011-10-15 19:00   ` Oleg Nesterov
@ 2011-10-15 19:01     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-15 19:01 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

If xol_get_insn_slot() or xol_alloc_area() races with another thread
doing xol_add_vma() it is not safe to dereference ->uprobes_xol_area.

Add the necessary wmb/read_barrier_depends pair; this ensures that
xol_get_insn_slot() always sees the properly initialized memory.

Other users of ->uprobes_xol_area look fine, they can't race with
xol_add_vma() this way. xol_free_insn_slot() checks utask->xol_vaddr,
and free_uprobes_xol_area() is called by mmput().

Except: valid_vma() is racy but it should not use ->uprobes_xol_area
as we discussed.
---
 kernel/uprobes.c |   15 ++++++++++++---
 1 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 5c2554c..b59af3b 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1087,6 +1087,7 @@ static int xol_add_vma(struct uprobes_xol_area *area)
 	}
 
 	area->vaddr = addr;
+	smp_wmb();	/* pairs with get_uprobes_xol_area() */
 	mm->uprobes_xol_area = area;
 	ret = 0;
 fail:
@@ -1094,6 +1095,14 @@ fail:
 	return ret;
 }
 
+static inline
+struct uprobes_xol_area *get_uprobes_xol_area(struct mm_struct *mm)
+{
+	struct uprobes_xol_area *area = mm->uprobes_xol_area;
+	smp_read_barrier_depends();	/* pairs with wmb in xol_add_vma() */
+	return area;
+}
+
 /*
  * xol_alloc_area - Allocate process's uprobes_xol_area.
  * This area will be used for storing instructions for execution out of
@@ -1124,7 +1133,7 @@ static struct uprobes_xol_area *xol_alloc_area(void)
 fail:
 	kfree(area->bitmap);
 	kfree(area);
-	return current->mm->uprobes_xol_area;
+	return get_uprobes_xol_area(current->mm);
 }
 
 /*
@@ -1183,17 +1192,17 @@ static unsigned long xol_take_insn_slot(struct uprobes_xol_area *area)
 static unsigned long xol_get_insn_slot(struct uprobe *uprobe,
 					unsigned long slot_addr)
 {
-	struct uprobes_xol_area *area = current->mm->uprobes_xol_area;
+	struct uprobes_xol_area *area;
 	unsigned long offset;
 	void *vaddr;
 
+	area = get_uprobes_xol_area(current->mm);
 	if (!area) {
 		area = xol_alloc_area();
 		if (!area)
 			return 0;
 	}
 	current->utask->xol_vaddr = xol_take_insn_slot(area);
-
 	/*
 	 * Initialize the slot if xol_vaddr points to valid
 	 * instruction slot.
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 6/X] uprobes: reimplement xol_add_vma() via install_special_mapping()
  2011-10-15 19:00   ` Oleg Nesterov
@ 2011-10-16 16:13     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-16 16:13 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

I apologize in advance if this was already discussed, but I just can't
understand why xol_add_vma() does not use install_special_mapping().
Unless I missed something this should work and this has the following
advantages:

	- we can avoid override_creds() hacks, install_special_mapping()
	  fools security_file_mmap() passing prot/flags = 0

	- no need to play with vma after do_mmap_pgoff()

	- no need for get_user_pages(FOLL_WRITE/FOLL_FORCE) hack

	- no need for do_munmap() if get_user_pages() fails

	- this protects us from mprotect(READ/WRITE)

	- this protects from MADV_DONTNEED, the page will be correctly
	  re-instantiated from area->page

	- this makes xol_vma more "cheap", swapper can't see this page
	  and we avoid the meaningless add_to_swap/pageout.

	  Note that, before this patch, area->page can't be removed
	  from the swap cache anyway (we have the reference). And it
	  must not, uprobes modifies this page directly.

Note on vm_flags:

	- we do not use VM_DONTEXPAND, install_special_mapping() adds it

	- VM_IO protects from MADV_DOFORK

	- I am not sure; maybe some archs need VM_READ along with VM_EXEC?

Anything else I have missed?
---

 kernel/uprobes.c |   42 +++++++++++++++++++-----------------------
 1 files changed, 19 insertions(+), 23 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index b59af3b..038f21c 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1045,53 +1045,49 @@ void munmap_uprobe(struct vm_area_struct *vma)
 /* Slot allocation for XOL */
 static int xol_add_vma(struct uprobes_xol_area *area)
 {
-	const struct cred *curr_cred;
 	struct vm_area_struct *vma;
 	struct mm_struct *mm;
-	unsigned long addr;
+	unsigned long addr_hint;
 	int ret;
 
+	area->page = alloc_page(GFP_HIGHUSER);
+	if (!area->page)
+		return -ENOMEM;
+
 	mm = current->mm;
 
 	down_write(&mm->mmap_sem);
 	ret = -EALREADY;
 	if (mm->uprobes_xol_area)
 		goto fail;
-
-	ret = -ENOMEM;
 	/*
 	 * Find the end of the top mapping and skip a page.
-	 * If there is no space for PAGE_SIZE above
-	 * that, mmap will ignore our address hint.
-	 *
-	 * override credentials otherwise anonymous memory might
-	 * not be granted execute permission when the selinux
-	 * security hooks have their way.
+	 * If there is no space for PAGE_SIZE above that,
+	 * this hint will be ignored.
 	 */
 	vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
-	addr = vma->vm_end + PAGE_SIZE;
-	curr_cred = override_creds(&init_cred);
-	addr = do_mmap_pgoff(NULL, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
-	revert_creds(curr_cred);
+	addr_hint = vma->vm_end + PAGE_SIZE;
 
-	if (IS_ERR_VALUE(addr))
+	area->vaddr = get_unmapped_area(NULL, addr_hint, PAGE_SIZE, 0, 0);
+	if (IS_ERR_VALUE(area->vaddr)) {
+		ret = area->vaddr;
 		goto fail;
+	}
 
-	vma = find_vma(mm, addr);
-	/* Don't expand vma on mremap(). */
-	vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
-	if (get_user_pages(current, mm, addr, 1, 1, 1,
-					&area->page, NULL) != 1) {
-		do_munmap(mm, addr, PAGE_SIZE);
+	ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
+					VM_EXEC|VM_MAYEXEC | VM_DONTCOPY|VM_IO,
+					&area->page);
+	if (ret)
 		goto fail;
-	}
 
-	area->vaddr = addr;
 	smp_wmb();	/* pairs with get_uprobes_xol_area() */
 	mm->uprobes_xol_area = area;
 	ret = 0;
 fail:
 	up_write(&mm->mmap_sem);
+	if (ret)
+		__free_page(area->page);
+
 	return ret;
 }
 


^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 7/X] uprobes: xol_add_vma: simply use TASK_SIZE as a hint
  2011-10-15 19:00   ` Oleg Nesterov
@ 2011-10-16 16:14     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-16 16:14 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

I don't understand why xol_add_vma() abuses mm->mm_rb to find the
highest mapping. We can simply use TASK_SIZE - PAGE_SIZE as a hint.

If this area is already occupied, the hint will be ignored with
or without this change. Otherwise the result is "obviously better"
and the code becomes simpler.

---

 kernel/uprobes.c |   13 ++++---------
 1 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 038f21c..b876977 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1045,9 +1045,7 @@ void munmap_uprobe(struct vm_area_struct *vma)
 /* Slot allocation for XOL */
 static int xol_add_vma(struct uprobes_xol_area *area)
 {
-	struct vm_area_struct *vma;
 	struct mm_struct *mm;
-	unsigned long addr_hint;
 	int ret;
 
 	area->page = alloc_page(GFP_HIGHUSER);
@@ -1060,15 +1058,12 @@ static int xol_add_vma(struct uprobes_xol_area *area)
 	ret = -EALREADY;
 	if (mm->uprobes_xol_area)
 		goto fail;
+
 	/*
-	 * Find the end of the top mapping and skip a page.
-	 * If there is no space for PAGE_SIZE above that,
-	 * this hint will be ignored.
+	 * Try to map as high as possible, this is only a hint.
 	 */
-	vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
-	addr_hint = vma->vm_end + PAGE_SIZE;
-
-	area->vaddr = get_unmapped_area(NULL, addr_hint, PAGE_SIZE, 0, 0);
+	area->vaddr = get_unmapped_area(NULL, TASK_SIZE - PAGE_SIZE,
+					PAGE_SIZE, 0, 0);
 	if (IS_ERR_VALUE(area->vaddr)) {
 		ret = area->vaddr;
 		goto fail;


^ permalink raw reply related	[flat|nested] 330+ messages in thread

* Re: [PATCH 6/X] uprobes: reimplement xol_add_vma() via install_special_mapping()
  2011-10-16 16:13     ` Oleg Nesterov
@ 2011-10-17 10:50       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-17 10:50 UTC (permalink / raw)
  To: Oleg Nesterov, Peter Zijlstra, Eric Paris, Stephen Smalley
  Cc: Ingo Molnar, Steven Rostedt, Linux-mm, Linus Torvalds,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

> I apologize in advance if this was already discussed, but I just can't
> understand why xol_add_vma() does not use install_special_mapping().
> Unless I missed something this should work and this has the following
> advantages:


The override_creds was based on what Stephen Smalley suggested 
https://lkml.org/lkml/2011/4/20/224    

At that time Peter had suggested install_special_mapping(). However the
consensus was to go with Stephen's suggestion of override_creds.

> 
> 	- we can avoid override_creds() hacks, install_special_mapping()
> 	  fools security_file_mmap() passing prot/flags = 0
> 
> 	- no need to play with vma after do_mmap_pgoff()
> 
> 	- no need for get_user_pages(FOLL_WRITE/FOLL_FORCE) hack
> 
> 	- no need for do_munmap() if get_user_pages() fails
> 
> 	- this protects us from mprotect(READ/WRITE)
> 
> 	- this protects from MADV_DONTNEED, the page will be correctly
> 	  re-instantiated from area->page
> 
> 	- this makes xol_vma more "cheap", swapper can't see this page
> 	  and we avoid the meaningless add_to_swap/pageout.
> 
> 	  Note that, before this patch, area->page can't be removed
> 	  from the swap cache anyway (we have the reference). And it
> 	  must not, uprobes modifies this page directly.

Stephen, Eric, 

Would you agree with Oleg's observation that we would be better off
using install_special_mapping() rather than override_creds()?

To give you some more information about the problem.

Uprobes will be an in-kernel debugging facility that provides
single-stepping out of line. To achieve this, it will create a per-mm vma
which is not mapped to any file. However, this vma has to be executable.

Slots are made in this executable vma, and one slot can be used to
single-step an original instruction.

This executable vma that we are creating is not for any particular
binary but would have to be created dynamically as and when an
application is debugged. For example, if we were to debug malloc call in
libc, we would end up adding xol vma to all the live processes in the
system.

Since SELinux wasn't happy to have an anonymous vma attached, we would
create a pseudo file using shmem_file_setup(). However, after comments from
Peter and Stephen's suggestions, we started using override_creds(). Peter and
Oleg now suggest that we use install_special_mapping(). 

Are you okay with using install_special_mapping instead of
override_creds()?

-- 
Thanks and Regards
Srikar


> 
> Note on vm_flags:
> 
> 	- we do not use VM_DONTEXPAND, install_special_mapping() adds it
> 
> 	- VM_IO protects from MADV_DOFORK
> 
> 	- I am not sure, may be some archs need VM_READ along with EXEC?
> 
> Anything else I have missed?
> ---
> 
>  kernel/uprobes.c |   42 +++++++++++++++++++-----------------------
>  1 files changed, 19 insertions(+), 23 deletions(-)
> 
> diff --git a/kernel/uprobes.c b/kernel/uprobes.c
> index b59af3b..038f21c 100644
> --- a/kernel/uprobes.c
> +++ b/kernel/uprobes.c
> @@ -1045,53 +1045,49 @@ void munmap_uprobe(struct vm_area_struct *vma)
>  /* Slot allocation for XOL */
>  static int xol_add_vma(struct uprobes_xol_area *area)
>  {
> -	const struct cred *curr_cred;
>  	struct vm_area_struct *vma;
>  	struct mm_struct *mm;
> -	unsigned long addr;
> +	unsigned long addr_hint;
>  	int ret;
> 
> +	area->page = alloc_page(GFP_HIGHUSER);
> +	if (!area->page)
> +		return -ENOMEM;
> +
>  	mm = current->mm;
> 
>  	down_write(&mm->mmap_sem);
>  	ret = -EALREADY;
>  	if (mm->uprobes_xol_area)
>  		goto fail;
> -
> -	ret = -ENOMEM;
>  	/*
>  	 * Find the end of the top mapping and skip a page.
> -	 * If there is no space for PAGE_SIZE above
> -	 * that, mmap will ignore our address hint.
> -	 *
> -	 * override credentials otherwise anonymous memory might
> -	 * not be granted execute permission when the selinux
> -	 * security hooks have their way.
> +	 * If there is no space for PAGE_SIZE above that,
> +	 * this hint will be ignored.
>  	 */
>  	vma = rb_entry(rb_last(&mm->mm_rb), struct vm_area_struct, vm_rb);
> -	addr = vma->vm_end + PAGE_SIZE;
> -	curr_cred = override_creds(&init_cred);
> -	addr = do_mmap_pgoff(NULL, addr, PAGE_SIZE, PROT_EXEC, MAP_PRIVATE, 0);
> -	revert_creds(curr_cred);
> +	addr_hint = vma->vm_end + PAGE_SIZE;
> 
> -	if (IS_ERR_VALUE(addr))
> +	area->vaddr = get_unmapped_area(NULL, addr_hint, PAGE_SIZE, 0, 0);
> +	if (IS_ERR_VALUE(area->vaddr)) {
> +		ret = area->vaddr;
>  		goto fail;
> +	}
> 
> -	vma = find_vma(mm, addr);
> -	/* Don't expand vma on mremap(). */
> -	vma->vm_flags |= VM_DONTEXPAND | VM_DONTCOPY;
> -	if (get_user_pages(current, mm, addr, 1, 1, 1,
> -					&area->page, NULL) != 1) {
> -		do_munmap(mm, addr, PAGE_SIZE);
> +	ret = install_special_mapping(mm, area->vaddr, PAGE_SIZE,
> +					VM_EXEC|VM_MAYEXEC | VM_DONTCOPY|VM_IO,
> +					&area->page);
> +	if (ret)
>  		goto fail;
> -	}
> 
> -	area->vaddr = addr;
>  	smp_wmb();	/* pairs with get_uprobes_xol_area() */
>  	mm->uprobes_xol_area = area;
>  	ret = 0;
>  fail:
>  	up_write(&mm->mmap_sem);
> +	if (ret)
> +		__free_page(area->page);
> +
>  	return ret;
>  }
> 
> 


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 1/X] uprobes: write_opcode: the new page needs PG_uptodate
  2011-10-15 19:00     ` Oleg Nesterov
@ 2011-10-17 10:59       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-17 10:59 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

* Oleg Nesterov <oleg@redhat.com> [2011-10-15 21:00:37]:

> write_opcode()->__replace_page() installs the new anonymous page,
> this new_page is PageSwapBacked() and it can be swapped out.
> 
> However it forgets to do SetPageUptodate(), fix write_opcode().
> 
> For example, this is needed if do_swap_page() finds the original
> page in the swap cache (and doesn't try to read it back), in
> this case it returns VM_FAULT_SIGBUS.
> ---
>  kernel/uprobes.c |    2 ++
>  1 files changed, 2 insertions(+), 0 deletions(-)
> 
> diff --git a/kernel/uprobes.c b/kernel/uprobes.c
> index 3928bcc..52b20c8 100644
> --- a/kernel/uprobes.c
> +++ b/kernel/uprobes.c
> @@ -200,6 +200,8 @@ static int write_opcode(struct task_struct *tsk, struct uprobe * uprobe,
>  		goto put_out;
>  	}
> 
> +	__SetPageUptodate(new_page);
> +

Agree. 

>  	/*
>  	 * lock page will serialize against do_wp_page()'s
>  	 * PageAnon() handling

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 6/X] uprobes: reimplement xol_add_vma() via install_special_mapping()
  2011-10-17 10:50       ` Srikar Dronamraju
@ 2011-10-17 13:34         ` Stephen Smalley
  -1 siblings, 0 replies; 330+ messages in thread
From: Stephen Smalley @ 2011-10-17 13:34 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Oleg Nesterov, Peter Zijlstra, Eric Paris, Ingo Molnar,
	Steven Rostedt, Linux-mm, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Andi Kleen, Andrew Morton, Jim Keniston,
	Roland McGrath, LKML

On Mon, 2011-10-17 at 16:20 +0530, Srikar Dronamraju wrote:
> > I apologize in advance if this was already discussed, but I just can't
> > understand why xol_add_vma() does not use install_special_mapping().
> > Unless I missed something this should work and this has the following
> > advantages:
> 
> 
> The override_creds was based on what Stephen Smalley suggested 
> https://lkml.org/lkml/2011/4/20/224    
> 
> At that time Peter had suggested install_special_mapping(). However the
> consensus was to go with Stephen's suggestion of override_creds.
> 
> > 
> > 	- we can avoid override_creds() hacks, install_special_mapping()
> > 	  fools security_file_mmap() passing prot/flags = 0
> > 
> > 	- no need to play with vma after do_mmap_pgoff()
> > 
> > 	- no need for get_user_pages(FOLL_WRITE/FOLL_FORCE) hack
> > 
> > 	- no need for do_munmap() if get_user_pages() fails
> > 
> > 	- this protects us from mprotect(READ/WRITE)
> > 
> > 	- this protects from MADV_DONTNEED, the page will be correctly
> > 	  re-instantiated from area->page
> > 
> > 	- this makes xol_vma more "cheap", swapper can't see this page
> > 	  and we avoid the meaningless add_to_swap/pageout.
> > 
> > 	  Note that, before this patch, area->page can't be removed
> > 	  from the swap cache anyway (we have the reference). And it
> > 	  must not, uprobes modifies this page directly.
> 
> Stephen, Eric, 
> 
> Would you agree with Oleg's observation that we would be better off
> using install_special_mapping rather than using override_creds?
> 
> To give you some more information about the problem.
> 
> Uprobes will be an in-kernel debugging facility that provides
> singlestepping out of line. To achieve this, it will create a per-mm vma
> which is not mapped to any file. However this vma has to be executable.
> 
> Slots are made in this executable vma, and one slot can be used to
> single-step an original instruction.
> 
> This executable vma that we are creating is not for any particular
> binary but would have to be created dynamically as and when an
> application is debugged. For example, if we were to debug malloc call in
> libc, we would end up adding xol vma to all the live processes in the
> system.
> 
> Since selinux wasn't happy to have an anonymous vma attached, we would
> create a pseudo file using shmem_file_setup. However, following Peter's
> comments and Stephen's suggestions, we started using override_creds.
> Peter and Oleg now suggest that we use install_special_mapping. 
> 
> Are you okay with using install_special_mapping instead of
> override_creds()?

That's fine with me.  But I'm still not clear on how you are controlling
the use of this facility from userspace, which is my primary concern.
Who gets to enable/disable this facility, and what check is applied
between the process that enables it and the target process(es) that are
affected by it?  Is it subject to the same checks as ptrace?

-- 
Stephen Smalley
National Security Agency


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 6/X] uprobes: reimplement xol_add_vma() via install_special_mapping()
  2011-10-17 13:34         ` Stephen Smalley
@ 2011-10-17 18:55           ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-17 18:55 UTC (permalink / raw)
  To: Stephen Smalley
  Cc: Srikar Dronamraju, Peter Zijlstra, Eric Paris, Ingo Molnar,
	Steven Rostedt, Linux-mm, Linus Torvalds, Masami Hiramatsu,
	Hugh Dickins, Christoph Hellwig, Ananth N Mavinakayanahalli,
	Thomas Gleixner, Andi Kleen, Andrew Morton, Jim Keniston,
	Roland McGrath, LKML

On 10/17, Stephen Smalley wrote:
>
> > Since selinux wasnt happy to have an anonymous vma attached, we would
> > create a pseudo file using shmem_file_setup. However after comments from
> > Peter and Stephan's suggestions we started using override_creds. Peter and
> > Oleg suggest that we use install_special_mapping.
> >
> > Are you okay with using install_special_mapping instead of
> > override_creds()?
>
> That's fine with me.

Good.

> But I'm still not clear on how you are controlling
> the use of this facility from userspace, which is my primary concern.

Yes, but just in case... Any security check in xol_add_vma() is pointless.
The task is already "owned" by uprobes when xol_add_vma() is called.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [PATCH] x86: Make kprobes' twobyte_is_boostable volatile
  2011-10-07  4:55             ` Masami Hiramatsu
@ 2011-10-18  1:00               ` Josh Stone
  2011-10-18  1:21                 ` Masami Hiramatsu
  0 siblings, 1 reply; 330+ messages in thread
From: Josh Stone @ 2011-10-18  1:00 UTC (permalink / raw)
  To: linux-kernel
  Cc: Josh Stone, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Srikar Dronamraju, Masami Hiramatsu, Jakub Jelinek

When compiling an i386_defconfig kernel with gcc-4.6.1-9.fc15.i686, I
noticed a warning about the asm operand for test_bit in kprobes'
can_boost.  I discovered that this caused only the first long of
twobyte_is_boostable[] to be output.

Jakub filed and fixed gcc PR50571 to correct the warning and this output
issue.  But to cover older gcc versions, we can make kprobes'
twobyte_is_boostable[] volatile so it is not optimized out.

Before:

    CC      arch/x86/kernel/kprobes.o
  In file included from include/linux/bitops.h:22:0,
                   from include/linux/kernel.h:17,
                   from [...]/arch/x86/include/asm/percpu.h:44,
                   from [...]/arch/x86/include/asm/current.h:5,
                   from [...]/arch/x86/include/asm/processor.h:15,
                   from [...]/arch/x86/include/asm/atomic.h:6,
                   from include/linux/atomic.h:4,
                   from include/linux/mutex.h:18,
                   from include/linux/notifier.h:13,
                   from include/linux/kprobes.h:34,
                   from arch/x86/kernel/kprobes.c:43:
  [...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
  [...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default]

  $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
       551:	0f a3 05 00 00 00 00 	bt     %eax,0x0
                          554: R_386_32	.rodata.cst4

  $ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o

  arch/x86/kernel/kprobes.o:     file format elf32-i386

  Contents of section .data:
   0000 48000000 00000000 00000000 00000000  H...............
  Contents of section .rodata.cst4:
   0000 4c030000                             L...

Only a single long of twobyte_is_boostable[] is in the object file.

After, with volatile:

  $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
       551:	0f a3 05 20 00 00 00 	bt     %eax,0x20
                          554: R_386_32	.data

  $ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o

  arch/x86/kernel/kprobes.o:     file format elf32-i386

  Contents of section .data:
   0000 48000000 00000000 00000000 00000000  H...............
   0010 00000000 00000000 00000000 00000000  ................
   0020 4c030000 0f000200 ffff0000 ffcff0c0  L...............
   0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77  ....;.......&..w

Now all 32 bytes are output into .data instead.

Signed-off-by: Josh Stone <jistone@redhat.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Jakub Jelinek <jakub@redhat.com>

---
 arch/x86/kernel/kprobes.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
index f1a6244..c0ed3d9 100644
--- a/arch/x86/kernel/kprobes.c
+++ b/arch/x86/kernel/kprobes.c
@@ -75,8 +75,10 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
 	/*
 	 * Undefined/reserved opcodes, conditional jump, Opcode Extension
 	 * Groups, and some special opcodes can not boost.
+	 * This is volatile to keep gcc from statically optimizing it out, as
+	 * variable_test_bit makes gcc think only *(unsigned long*) is used.
 	 */
-static const u32 twobyte_is_boostable[256 / 32] = {
+static volatile const u32 twobyte_is_boostable[256 / 32] = {
 	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f          */
 	/*      ----------------------------------------------          */
 	W(0x00, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0) | /* 00 */
-- 
1.7.6.4


^ permalink raw reply related	[flat|nested] 330+ messages in thread

* Re: [PATCH] x86: Make kprobes' twobyte_is_boostable volatile
  2011-10-18  1:00               ` [PATCH] x86: Make kprobes' twobyte_is_boostable volatile Josh Stone
@ 2011-10-18  1:21                 ` Masami Hiramatsu
  0 siblings, 0 replies; 330+ messages in thread
From: Masami Hiramatsu @ 2011-10-18  1:21 UTC (permalink / raw)
  To: Josh Stone
  Cc: linux-kernel, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86,
	Srikar Dronamraju, Jakub Jelinek

(2011/10/18 10:00), Josh Stone wrote:
> When compiling an i386_defconfig kernel with gcc-4.6.1-9.fc15.i686, I
> noticed a warning about the asm operand for test_bit in kprobes'
> can_boost.  I discovered that this caused only the first long of
> twobyte_is_boostable[] to be output.
> 
> Jakub filed and fixed gcc PR50571 to correct the warning and this output
> issue.  But to cover older gcc versions, we can make kprobes'
> twobyte_is_boostable[] volatile so it is not optimized out.

Uh, this should be an urgent fix.

Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>

Thanks a lot!

> 
> Before:
> 
>     CC      arch/x86/kernel/kprobes.o
>   In file included from include/linux/bitops.h:22:0,
>                    from include/linux/kernel.h:17,
>                    from [...]/arch/x86/include/asm/percpu.h:44,
>                    from [...]/arch/x86/include/asm/current.h:5,
>                    from [...]/arch/x86/include/asm/processor.h:15,
>                    from [...]/arch/x86/include/asm/atomic.h:6,
>                    from include/linux/atomic.h:4,
>                    from include/linux/mutex.h:18,
>                    from include/linux/notifier.h:13,
>                    from include/linux/kprobes.h:34,
>                    from arch/x86/kernel/kprobes.c:43:
>   [...]/arch/x86/include/asm/bitops.h: In function ‘can_boost.part.1’:
>   [...]/arch/x86/include/asm/bitops.h:319:2: warning: use of memory input without lvalue in asm operand 1 is deprecated [enabled by default]
> 
>   $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
>        551:	0f a3 05 00 00 00 00 	bt     %eax,0x0
>                           554: R_386_32	.rodata.cst4
> 
>   $ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
> 
>   arch/x86/kernel/kprobes.o:     file format elf32-i386
> 
>   Contents of section .data:
>    0000 48000000 00000000 00000000 00000000  H...............
>   Contents of section .rodata.cst4:
>    0000 4c030000                             L...
> 
> Only a single long of twobyte_is_boostable[] is in the object file.
> 
> After, with volatile:
> 
>   $ objdump -rd arch/x86/kernel/kprobes.o | grep -A1 -w bt
>        551:	0f a3 05 20 00 00 00 	bt     %eax,0x20
>                           554: R_386_32	.data
> 
>   $ objdump -s -j .rodata.cst4 -j .data arch/x86/kernel/kprobes.o
> 
>   arch/x86/kernel/kprobes.o:     file format elf32-i386
> 
>   Contents of section .data:
>    0000 48000000 00000000 00000000 00000000  H...............
>    0010 00000000 00000000 00000000 00000000  ................
>    0020 4c030000 0f000200 ffff0000 ffcff0c0  L...............
>    0030 0000ffff 3bbbfff8 03ff2ebb 26bb2e77  ....;.......&..w
> 
> Now all 32 bytes are output into .data instead.
> 
> Signed-off-by: Josh Stone <jistone@redhat.com>
> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
> Cc: Jakub Jelinek <jakub@redhat.com>
> 
> ---
>  arch/x86/kernel/kprobes.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/x86/kernel/kprobes.c b/arch/x86/kernel/kprobes.c
> index f1a6244..c0ed3d9 100644
> --- a/arch/x86/kernel/kprobes.c
> +++ b/arch/x86/kernel/kprobes.c
> @@ -75,8 +75,10 @@ DEFINE_PER_CPU(struct kprobe_ctlblk, kprobe_ctlblk);
>  	/*
>  	 * Undefined/reserved opcodes, conditional jump, Opcode Extension
>  	 * Groups, and some special opcodes can not boost.
> +	 * This is volatile to keep gcc from statically optimizing it out, as
> +	 * variable_test_bit makes gcc think only *(unsigned long*) is used.
>  	 */
> -static const u32 twobyte_is_boostable[256 / 32] = {
> +static volatile const u32 twobyte_is_boostable[256 / 32] = {
>  	/*      0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f          */
>  	/*      ----------------------------------------------          */
>  	W(0x00, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0) | /* 00 */


-- 
Masami HIRAMATSU
Software Platform Research Dept. Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory
E-mail: masami.hiramatsu.pt@hitachi.com

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 2/X] uprobes: write_opcode() needs put_page(new_page) unconditionally
  2011-10-15 19:00     ` Oleg Nesterov
@ 2011-10-18 16:47       ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-18 16:47 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

> Every write_opcode()->__replace_page() leaks the new page on success.
> 
> We have the reference after alloc_page_vma(), then __replace_page()
> does another get_page() for the new mapping, we need put_page(new_page)
> in any case.
> 
> Alternatively we could remove __replace_page()->get_page() but it is
> better to change write_opcode(). This way it is simpler to unify the
> code with ksm.c:replace_page() and we can simplify the error handling
> in write_opcode(), the patch simply adds a single page_cache_release()
> under "unlock_out" label.

I have folded this change and your other suggested changes into my patches.

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [PATCH 8-14/X] (Was:  Uprobes patchset with perf probe support)
  2011-10-15 19:00   ` Oleg Nesterov
@ 2011-10-19 21:51     ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:51 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Hello.

Signal patches for uprobes. Note that there are more issues than
we discussed before; we cannot simply delay/block/whatever the
signal.

This needs a few more cleanups (and perhaps minor fixes), but I
_think_ this should work. Well, I hope ;)

On top of 1-7 I sent before.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* [PATCH 8/X] uprobes: kill sstep_complete()
  2011-10-19 21:51     ` Oleg Nesterov
@ 2011-10-19 21:52       ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Kill sstep_complete(), change uprobe_notify_resume() to use post_xol()
unconditionally.

As we already discussed, it is wrong to assume that regs->ip always
changes after the step: "rep" or a jmp/call to self, for example. We know
that this task has already done the step, so we can rely on the DIE_DEBUG
notification.
---
 kernel/uprobes.c |   37 +++++++++----------------------------
 1 files changed, 9 insertions(+), 28 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index a323e0a..135b9a2 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1321,24 +1321,6 @@ static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
 }
 
 /*
- * Verify from Instruction Pointer if singlestep has indeed occurred.
- * If Singlestep has occurred, then do post singlestep fix-ups.
- */
-static bool sstep_complete(struct uprobe *uprobe, struct pt_regs *regs)
-{
-	unsigned long vaddr = instruction_pointer(regs);
-
-	/*
-	 * If we have executed out of line, Instruction pointer
-	 * cannot be same as virtual address of XOL slot.
-	 */
-	if (vaddr == current->utask->xol_vaddr)
-		return false;
-	post_xol(uprobe, regs);
-	return true;
-}
-
-/*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
  *
@@ -1374,7 +1356,7 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			if (!utask)
 				goto cleanup_ret;
 		}
-		/* TODO Start queueing signals. */
+
 		utask->active_uprobe = u;
 		handler_chain(u, regs);
 		utask->state = UTASK_SSTEP;
@@ -1385,15 +1367,14 @@ void uprobe_notify_resume(struct pt_regs *regs)
 			goto cleanup_ret;
 	} else if (utask->state == UTASK_SSTEP) {
 		u = utask->active_uprobe;
-		if (sstep_complete(u, regs)) {
-			put_uprobe(u);
-			utask->active_uprobe = NULL;
-			utask->state = UTASK_RUNNING;
-			user_disable_single_step(current);
-			xol_free_insn_slot(current);
-
-			/* TODO Stop queueing signals. */
-		}
+
+		post_xol(u, regs);	/* TODO: check result? */
+
+		put_uprobe(u);
+		utask->active_uprobe = NULL;
+		utask->state = UTASK_RUNNING;
+		user_disable_single_step(current);
+		xol_free_insn_slot(current);
 	}
 	return;
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 9/X] uprobes: introduce UTASK_SSTEP_ACK state
  2011-10-19 21:51     ` Oleg Nesterov
@ 2011-10-19 21:52       ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Introduce the new state, UTASK_SSTEP_ACK. uprobe_post_notifier()
sets this state like uprobe_bkpt_notifier() sets UTASK_BP_HIT.

Change uprobe_notify_resume() to always do the post_xol() logic
if state != UTASK_BP_HIT and WARN() if the utask->state is wrong.

This makes the state transitions more explicit. The current code
returns silently if, say, state == UTASK_RUNNING. But this must
not happen, we should complain in this case. And, with the new
state we know for sure that DIE_DEBUG was triggered.
---
 include/linux/uprobes.h |    3 ++-
 kernel/uprobes.c        |   15 ++++++++-------
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index a407d17..1591c7c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -74,7 +74,8 @@ struct uprobe {
 enum uprobe_task_state {
 	UTASK_RUNNING,
 	UTASK_BP_HIT,
-	UTASK_SSTEP
+	UTASK_SSTEP,
+	UTASK_SSTEP_ACK,
 };
 
 /*
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 135b9a2..5fd72b8 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1365,10 +1365,13 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		else
 			/* Cannot Singlestep; re-execute the instruction. */
 			goto cleanup_ret;
-	} else if (utask->state == UTASK_SSTEP) {
+	} else {
 		u = utask->active_uprobe;
 
-		post_xol(u, regs);	/* TODO: check result? */
+		if (utask->state == UTASK_SSTEP_ACK)
+			post_xol(u, regs);	/* TODO: check result? */
+		else
+			WARN_ON_ONCE(1);
 
 		put_uprobe(u);
 		utask->active_uprobe = NULL;
@@ -1416,15 +1419,13 @@ int uprobe_bkpt_notifier(struct pt_regs *regs)
  */
 int uprobe_post_notifier(struct pt_regs *regs)
 {
-	struct uprobe *uprobe;
-	struct uprobe_task *utask;
+	struct uprobe_task *utask = current->utask;
 
-	if (!current->mm || !current->utask || !current->utask->active_uprobe)
+	if (!current->mm || !utask || !utask->active_uprobe)
 		/* task is currently not uprobed */
 		return 0;
 
-	utask = current->utask;
-	uprobe = utask->active_uprobe;
+	utask->state = UTASK_SSTEP_ACK;
 	set_thread_flag(TIF_UPROBE);
 	return 1;
 }
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 10/X] uprobes: introduce uprobe_deny_signal()
  2011-10-19 21:51     ` Oleg Nesterov
@ 2011-10-19 21:52       ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

A task that is not UTASK_RUNNING obviously can't handle signals, nor
should it stop/freeze/etc. It must not even exit if it was SIGKILL'ed
(see the next changes).

This patch adds a new hook, uprobe_deny_signal(), called by
get_signal_to_deliver(). It simply clears TIF_SIGPENDING to ensure that
this thread can do nothing connected to signals until it becomes
UTASK_RUNNING again.

We also change the post_xol() path to do recalc_sigpending() before
returning to user mode; this ensures the signal can't be lost.

NOTE! Without the next changes this patch is buggy.
---
 include/linux/uprobes.h |    5 +++++
 kernel/signal.c         |    3 +++
 kernel/uprobes.c        |   23 +++++++++++++++++++++++
 3 files changed, 31 insertions(+), 0 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 1591c7c..27928e5 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -134,6 +134,7 @@ extern unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs);
 extern int uprobe_post_notifier(struct pt_regs *regs);
 extern int uprobe_bkpt_notifier(struct pt_regs *regs);
 extern void uprobe_notify_resume(struct pt_regs *regs);
+extern bool uprobe_deny_signal(void);
 #else /* CONFIG_UPROBES is not defined */
 static inline int register_uprobe(struct inode *inode, loff_t offset,
 				struct uprobe_consumer *consumer)
@@ -154,6 +155,10 @@ static inline void munmap_uprobe(struct vm_area_struct *vma)
 static inline void uprobe_notify_resume(struct pt_regs *regs)
 {
 }
+static inline bool uprobe_deny_signal(void)
+{
+	return false;
+}
 static inline unsigned long get_uprobe_bkpt_addr(struct pt_regs *regs)
 {
 	return 0;
diff --git a/kernel/signal.c b/kernel/signal.c
index 291c970..788b494 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2141,6 +2141,9 @@ int get_signal_to_deliver(siginfo_t *info, struct k_sigaction *return_ka,
 	struct signal_struct *signal = current->signal;
 	int signr;
 
+	if (unlikely(uprobe_deny_signal()))
+		return 0;
+
 relock:
 	/*
 	 * We'll jump back here after any time we were stopped in TASK_STOPPED.
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index 5fd72b8..d6f4508 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1320,6 +1320,25 @@ static int pre_ssout(struct uprobe *uprobe, struct pt_regs *regs,
 	return -EFAULT;
 }
 
+bool uprobe_deny_signal(void)
+{
+	struct task_struct *tsk = current;
+	struct uprobe_task *utask = tsk->utask;
+
+	if (likely(!utask || !utask->active_uprobe))
+		return false;
+
+	WARN_ON_ONCE(utask->state != UTASK_SSTEP);
+
+	if (signal_pending(tsk)) {
+		spin_lock_irq(&tsk->sighand->siglock);
+		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
+		spin_unlock_irq(&tsk->sighand->siglock);
+	}
+
+	return true;
+}
+
 /*
  * uprobe_notify_resume gets called in task context just before returning
  * to userspace.
@@ -1378,6 +1397,10 @@ void uprobe_notify_resume(struct pt_regs *regs)
 		utask->state = UTASK_RUNNING;
 		user_disable_single_step(current);
 		xol_free_insn_slot(current);
+
+		spin_lock_irq(&current->sighand->siglock);
+		recalc_sigpending(); /* see uprobe_deny_signal() */
+		spin_unlock_irq(&current->sighand->siglock);
 	}
 	return;
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 11/X] uprobes: x86: introduce xol_was_trapped()
  2011-10-19 21:51     ` Oleg Nesterov
@ 2011-10-19 21:53       ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:53 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

After the previous patch, we postpone signals until we have executed
the probed insn. This is simply wrong if the xol insn traps and
generates the signal itself: say, SIGILL/SIGSEGV/etc.

This patch only adds xol_was_trapped() to detect this case. It assumes
that anything like do_page_fault/do_trap/etc sets thread.trap_no != -1.

We add uprobe_task_arch_info->saved_trap_no and change pre_xol/post_xol
to save/restore thread.trap_no; xol_was_trapped() simply checks that
->trap_no is not equal to UPROBE_TRAP_NO == -1, which pre_xol() sets.
---
 arch/x86/include/asm/uprobes.h |    2 ++
 arch/x86/kernel/uprobes.c      |   20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index 1c30cfd..f0fbdab 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -39,6 +39,7 @@ struct uprobe_arch_info {
 
 struct uprobe_task_arch_info {
 	unsigned long saved_scratch_register;
+	unsigned long saved_trap_no;
 };
 #else
 struct uprobe_arch_info {};
@@ -49,6 +50,7 @@ extern int analyze_insn(struct task_struct *tsk, struct uprobe *uprobe);
 extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
 extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
 extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
+extern bool xol_was_trapped(struct task_struct *tsk);
 extern int uprobe_exception_notify(struct notifier_block *self,
 				       unsigned long val, void *data);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index e2e7882..c861c27 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -395,6 +395,8 @@ void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr)
 	regs->ip = vaddr;
 }
 
+#define	UPROBE_TRAP_NO	-1ul
+
 /*
  * pre_xol - prepare to execute out of line.
  * @uprobe: the probepoint information.
@@ -410,6 +412,9 @@ int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 {
 	struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
 
+	tskinfo->saved_trap_no = current->thread.trap_no;
+	current->thread.trap_no = UPROBE_TRAP_NO;
+
 	regs->ip = current->utask->xol_vaddr;
 	if (uprobe->fixups & UPROBES_FIX_RIP_AX) {
 		tskinfo->saved_scratch_register = regs->ax;
@@ -425,6 +430,11 @@ int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 #else
 int pre_xol(struct uprobe *uprobe, struct pt_regs *regs)
 {
+	struct uprobe_task_arch_info *tskinfo = &current->utask->tskinfo;
+
+	tskinfo->saved_trap_no = current->thread.trap_no;
+	current->thread.trap_no = UPROBE_TRAP_NO;
+
 	regs->ip = current->utask->xol_vaddr;
 	return 0;
 }
@@ -493,6 +503,14 @@ static void handle_riprel_post_xol(struct uprobe *uprobe,
 }
 #endif
 
+bool xol_was_trapped(struct task_struct *tsk)
+{
+	if (tsk->thread.trap_no != UPROBE_TRAP_NO)
+		return true;
+
+	return false;
+}
+
 /*
  * Called after single-stepping. To avoid the SMP problems that can
  * occur when we temporarily put back the original opcode to
@@ -523,6 +541,8 @@ int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
 	int result = 0;
 	long correction;
 
+	current->thread.trap_no = utask->tskinfo.saved_trap_no;
+
 	correction = (long)(utask->vaddr - utask->xol_vaddr);
 	handle_riprel_post_xol(uprobe, regs, &correction);
 	if (uprobe->fixups & UPROBES_FIX_IP)
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 12/X] uprobes: x86: introduce abort_xol()
  2011-10-19 21:51     ` Oleg Nesterov
@ 2011-10-19 21:53       ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:53 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

A separate "patch", just to emphasize that I do not know what
abort_xol() should actually do! I do not understand this asm magic.

This patch simply changes regs->ip back to the probed insn; obviously
this is not enough to handle UPROBES_FIX_*. Please take care.

If it is not clear: abort_xol() is needed when we should re-execute the
original insn (the one replaced with int3), see the next patch.
---
 arch/x86/include/asm/uprobes.h |    1 +
 arch/x86/kernel/uprobes.c      |    9 +++++++++
 2 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
index f0fbdab..6209da1 100644
--- a/arch/x86/include/asm/uprobes.h
+++ b/arch/x86/include/asm/uprobes.h
@@ -51,6 +51,7 @@ extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
 extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
 extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
 extern bool xol_was_trapped(struct task_struct *tsk);
+extern void abort_xol(struct pt_regs *regs);
 extern int uprobe_exception_notify(struct notifier_block *self,
 				       unsigned long val, void *data);
 #endif	/* _ASM_UPROBES_H */
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index c861c27..bc11a89 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -511,6 +511,15 @@ bool xol_was_trapped(struct task_struct *tsk)
 	return false;
 }
 
+void abort_xol(struct pt_regs *regs)
+{
+	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+	// !!! Dear Srikar and Ananth, please implement me !!!
+	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+	struct uprobe_task *utask = current->utask;
+	regs->ip = utask->vaddr;
+}
+
 /*
  * Called after single-stepping. To avoid the SMP problems that can
  * occur when we temporarily put back the original opcode to
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 13/X] uprobes: introduce UTASK_SSTEP_TRAPPED logic
  2011-10-19 21:51     ` Oleg Nesterov
@ 2011-10-19 21:53       ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:53 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Finally, add the UTASK_SSTEP_TRAPPED state/code to handle the case
when the xol insn itself triggers the signal.

In this case we should restart the original insn even if the task is
already SIGKILL'ed (say, the coredump should report the correct ip).
This is even more important if the task has a handler for SIGSEGV/etc:
the _same_ instruction should be repeated after return from the signal
handler, and SSTEP can never finish in this case.

So this patch changes uprobe_deny_signal() to set UTASK_SSTEP_TRAPPED
and TIF_UPROBE. It also sets TIF_NOTIFY_RESUME; _afaics_ TIF_UPROBE
alone is not enough to trigger do_notify_resume() in this case.

When uprobe_notify_resume() sees UTASK_SSTEP_TRAPPED it does abort_xol()
instead of post_xol().
---
 arch/x86/kernel/uprobes.c |    1 +
 include/linux/uprobes.h   |    1 +
 kernel/uprobes.c          |    8 ++++++++
 3 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index bc11a89..73f58ad 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -550,6 +550,7 @@ int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
 	int result = 0;
 	long correction;
 
+	WARN_ON_ONCE(current->thread.trap_no != UPROBE_TRAP_NO);
 	current->thread.trap_no = utask->tskinfo.saved_trap_no;
 
 	correction = (long)(utask->vaddr - utask->xol_vaddr);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 27928e5..2b4bc8c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,7 @@ enum uprobe_task_state {
 	UTASK_BP_HIT,
 	UTASK_SSTEP,
 	UTASK_SSTEP_ACK,
+	UTASK_SSTEP_TRAPPED,
 };
 
 /*
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index d6f4508..aa5492a 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1334,6 +1334,12 @@ bool uprobe_deny_signal(void)
 		spin_lock_irq(&tsk->sighand->siglock);
 		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 		spin_unlock_irq(&tsk->sighand->siglock);
+
+		if (xol_was_trapped(tsk)) {
+			utask->state = UTASK_SSTEP_TRAPPED;
+			set_tsk_thread_flag(tsk, TIF_UPROBE);
+			set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+		}
 	}
 
 	return true;
@@ -1389,6 +1395,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
 
 		if (utask->state == UTASK_SSTEP_ACK)
 			post_xol(u, regs);	/* TODO: check result? */
+		else if (utask->state == UTASK_SSTEP_TRAPPED)
+			abort_xol(regs);
 		else
 			WARN_ON_ONCE(1);
 
-- 
1.5.5.1



^ permalink raw reply related	[flat|nested] 330+ messages in thread

* [PATCH 13/X] uprobes: introduce UTASK_SSTEP_TRAPPED logic
@ 2011-10-19 21:53       ` Oleg Nesterov
  0 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:53 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Finally, add UTASK_SSTEP_TRAPPED state/code to handle the case when
xol insn itself triggers the signal.

In this case we should restart the original insn even if the task is
already SIGKILL'ed (say, the coredump should report the correct ip).
This is even more important if the task has a handler for SIGSEGV/etc,
The _same_ instruction should be repeated again after return from the
signal handler, and SSTEP can never finish in this case.

So this patch changes uprobe_deny_signal() to set UTASK_SSTEP_TRAPPED
and TIF_UPROBE. It also sets TIF_NOTIFY_RESUME, _afaics_ TIF_UPROBE
alone is not enough to trigger do_notify_resume() in this case.

When uprobe_notify_resume() sees UTASK_SSTEP_TRAPPED it does abort_xol()
instead of post_xol().
---
 arch/x86/kernel/uprobes.c |    1 +
 include/linux/uprobes.h   |    1 +
 kernel/uprobes.c          |    8 ++++++++
 3 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index bc11a89..73f58ad 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -550,6 +550,7 @@ int post_xol(struct uprobe *uprobe, struct pt_regs *regs)
 	int result = 0;
 	long correction;
 
+	WARN_ON_ONCE(current->thread.trap_no != UPROBE_TRAP_NO);
 	current->thread.trap_no = utask->tskinfo.saved_trap_no;
 
 	correction = (long)(utask->vaddr - utask->xol_vaddr);
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 27928e5..2b4bc8c 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -76,6 +76,7 @@ enum uprobe_task_state {
 	UTASK_BP_HIT,
 	UTASK_SSTEP,
 	UTASK_SSTEP_ACK,
+	UTASK_SSTEP_TRAPPED,
 };
 
 /*
diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index d6f4508..aa5492a 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1334,6 +1334,12 @@ bool uprobe_deny_signal(void)
 		spin_lock_irq(&tsk->sighand->siglock);
 		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 		spin_unlock_irq(&tsk->sighand->siglock);
+
+		if (xol_was_trapped(tsk)) {
+			utask->state = UTASK_SSTEP_TRAPPED;
+			set_tsk_thread_flag(tsk, TIF_UPROBE);
+			set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
+		}
 	}
 
 	return true;
@@ -1389,6 +1395,8 @@ void uprobe_notify_resume(struct pt_regs *regs)
 
 		if (utask->state == UTASK_SSTEP_ACK)
 			post_xol(u, regs);	/* TODO: check result? */
+		else if (utask->state == UTASK_SSTEP_TRAPPED)
+			abort_xol(regs);
 		else
 			WARN_ON_ONCE(1);
 
-- 
1.5.5.1


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org


* [PATCH 14/X] uprobes: uprobe_deny_signal: check __fatal_signal_pending()
  2011-10-19 21:51     ` Oleg Nesterov
@ 2011-10-19 21:54       ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-19 21:54 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Change uprobe_deny_signal() to check __fatal_signal_pending() along with
xol_was_trapped().

Normally this is not strictly needed, but it is safer. It also makes
clearer that even SIGKILL is handled via UTASK_SSTEP_TRAPPED. Once
again, SIGKILL can be pending because of core-dumping; we should not
exit with regs->ip pointing to ->xol_vaddr.
---
 kernel/uprobes.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/uprobes.c b/kernel/uprobes.c
index aa5492a..9e9d4e4 100644
--- a/kernel/uprobes.c
+++ b/kernel/uprobes.c
@@ -1335,7 +1335,7 @@ bool uprobe_deny_signal(void)
 		clear_tsk_thread_flag(tsk, TIF_SIGPENDING);
 		spin_unlock_irq(&tsk->sighand->siglock);
 
-		if (xol_was_trapped(tsk)) {
+		if (__fatal_signal_pending(tsk) || xol_was_trapped(tsk)) {
 			utask->state = UTASK_SSTEP_TRAPPED;
 			set_tsk_thread_flag(tsk, TIF_UPROBE);
 			set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
-- 
1.5.5.1




* Re: [PATCH 12/X] uprobes: x86: introduce abort_xol()
  2011-10-19 21:53       ` Oleg Nesterov
@ 2011-10-21 14:42         ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-21 14:42 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

Hey Oleg,

> A separate "patch", just to emphasize that I do not know what
> actually abort_xol() should do! I do not understand this asm
> magic.
> 
> This patch simply changes regs->ip back to the probed insn,
> obviously this is not enough to handle UPROBES_FIX_*. Please
> take care.
> 
> If it is not clear, abort_xol() is needed when we should
> re-execute the original insn (replaced with int3), see the
> next patch.

We should be removing the breakpoint in abort_xol().
Otherwise, if we just set the instruction pointer to the int3 and
signal a SIGILL, the user may be confused about why a breakpoint is
generating SIGILL.

> ---
>  arch/x86/include/asm/uprobes.h |    1 +
>  arch/x86/kernel/uprobes.c      |    9 +++++++++
>  2 files changed, 10 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index f0fbdab..6209da1 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -51,6 +51,7 @@ extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
>  extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
>  extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
>  extern bool xol_was_trapped(struct task_struct *tsk);
> +extern void abort_xol(struct pt_regs *regs);
>  extern int uprobe_exception_notify(struct notifier_block *self,
>  				       unsigned long val, void *data);
>  #endif	/* _ASM_UPROBES_H */
> diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> index c861c27..bc11a89 100644
> --- a/arch/x86/kernel/uprobes.c
> +++ b/arch/x86/kernel/uprobes.c
> @@ -511,6 +511,15 @@ bool xol_was_trapped(struct task_struct *tsk)
>  	return false;
>  }
> 
> +void abort_xol(struct pt_regs *regs)
> +{
> +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> +	// !!! Dear Srikar and Ananth, please implement me !!!
> +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> +	struct uprobe_task *utask = current->utask;
> +	regs->ip = utask->vaddr;

nit:
Shouldn't we be setting the ip to the next instruction after this
one?

> +}
> +
>  /*
>   * Called after single-stepping. To avoid the SMP problems that can
>   * occur when we temporarily put back the original opcode to


I have applied all your patches and ran the tests; they are all
passing.

I will fold them into my patches and send them out.

-- 
Thanks and Regards
Srikar



* Re: [PATCH 12/X] uprobes: x86: introduce abort_xol()
  2011-10-21 14:42         ` Srikar Dronamraju
@ 2011-10-21 16:22           ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-21 16:22 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

On 10/21, Srikar Dronamraju wrote:
>
> > If it is not clear, abort_xol() is needed when we should
> > re-execute the original insn (replaced with int3), see the
> > next patch.
>
> We should be removing the breakpoint in abort_xol().

Why? See also below.

> Otherwise if we just set the instruction pointer to int3 and signal a
> sigill, then the user may be confused why a breakpoint is generating
> SIGILL.

Which user?

gdb? Of course it can be confused. But it can be confused in any case.

> > +void abort_xol(struct pt_regs *regs)
> > +{
> > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > +	// !!! Dear Srikar and Ananth, please implement me !!!
> > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > +	struct uprobe_task *utask = current->utask;
> > +	regs->ip = utask->vaddr;
>
> nit:
> Shouldnt we be setting the ip to the next instruction after this
> instruction?

Not sure...

We should restart the same insn. Say, if the probed insn
was "*(int*)0 = 0", it should be executed again after SIGSEGV. Unless
the task was killed by this signal.

And in this case we should call uprobe_consumer()->handler() again,
we shouldn't remove "int3".

> I have applied all your patches and ran tests, the tests are all
> passing.
>
> I will fold them into my patches and send them out.

Great, thanks.

Oleg.



* Re: [PATCH 12/X] uprobes: x86: introduce abort_xol()
  2011-10-21 14:42         ` Srikar Dronamraju
@ 2011-10-21 16:26           ` Ananth N Mavinakayanahalli
  -1 siblings, 0 replies; 330+ messages in thread
From: Ananth N Mavinakayanahalli @ 2011-10-21 16:26 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Oleg Nesterov, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On Fri, Oct 21, 2011 at 08:12:07PM +0530, Srikar Dronamraju wrote:

...

> > If it is not clear, abort_xol() is needed when we should
> > re-execute the original insn (replaced with int3), see the
> > next patch.
> 
> We should be removing the breakpoint in abort_xol().
> Otherwise if we just set the instruction pointer to int3 and signal a
> sigill, then the user may be confused why a breakpoint is generating
> SIGILL.
> 
> > ---
> >  arch/x86/include/asm/uprobes.h |    1 +
> >  arch/x86/kernel/uprobes.c      |    9 +++++++++
> >  2 files changed, 10 insertions(+), 0 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > index f0fbdab..6209da1 100644
> > --- a/arch/x86/include/asm/uprobes.h
> > +++ b/arch/x86/include/asm/uprobes.h
> > @@ -51,6 +51,7 @@ extern void set_instruction_pointer(struct pt_regs *regs, unsigned long vaddr);
> >  extern int pre_xol(struct uprobe *uprobe, struct pt_regs *regs);
> >  extern int post_xol(struct uprobe *uprobe, struct pt_regs *regs);
> >  extern bool xol_was_trapped(struct task_struct *tsk);
> > +extern void abort_xol(struct pt_regs *regs);
> >  extern int uprobe_exception_notify(struct notifier_block *self,
> >  				       unsigned long val, void *data);
> >  #endif	/* _ASM_UPROBES_H */
> > diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
> > index c861c27..bc11a89 100644
> > --- a/arch/x86/kernel/uprobes.c
> > +++ b/arch/x86/kernel/uprobes.c
> > @@ -511,6 +511,15 @@ bool xol_was_trapped(struct task_struct *tsk)
> >  	return false;
> >  }
> > 
> > +void abort_xol(struct pt_regs *regs)
> > +{
> > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > +	// !!! Dear Srikar and Ananth, please implement me !!!
> > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > +	struct uprobe_task *utask = current->utask;
> > +	regs->ip = utask->vaddr;
> 
> nit:
> Shouldnt we be setting the ip to the next instruction after this
> instruction?

No, since we should re-execute the original instruction after removing
the breakpoint.

Also, wrt the ip being set to the next instruction on a breakpoint hit,
that's arch-specific. For instance, on x86 it points to the next
instruction, while on powerpc the nip points to the breakpoint vaddr
at the time of the exception.

Ananth



* Re: [PATCH 12/X] uprobes: x86: introduce abort_xol()
  2011-10-21 16:26           ` Ananth N Mavinakayanahalli
@ 2011-10-21 16:42             ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-21 16:42 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On 10/21, Ananth N Mavinakayanahalli wrote:
>
> On Fri, Oct 21, 2011 at 08:12:07PM +0530, Srikar Dronamraju wrote:
>
> > > +void abort_xol(struct pt_regs *regs)
> > > +{
> > > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > +	// !!! Dear Srikar and Ananth, please implement me !!!
> > > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > +	struct uprobe_task *utask = current->utask;
> > > +	regs->ip = utask->vaddr;
> >
> > nit:
> > Shouldnt we be setting the ip to the next instruction after this
> > instruction?
>
> No, since we should re-execute the original instruction

Yes,

> after removing
> the breakpoint.

No? we should not remove this uprobe?

> Also, wrt ip being set to the next instruction on a breakpoint hit,
> that's arch specific.

Probably yes, I am not sure. But:

> For instance, on x86, it points to the next
> instruction,

No?

	/**
	 * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
	 * @regs: Reflects the saved state of the task after it has hit a breakpoint
	 * instruction.
	 * Return the address of the breakpoint instruction.
	 */
	unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
	{
		return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
	}

Yes, initially regs->ip points to the next insn after int3, but
utask->vaddr == get_uprobe_bkpt_addr() == addr of int3.

Right?

> while on powerpc, the nip points to the breakpoint vaddr
> at the time of exception.

I think get_uprobe_bkpt_addr() should be consistent on every arch.
That is why (I think) it is __weak.

Anyway, abort_xol() has to be arch-specific.

Oleg.



* test-case (Was: [PATCH 12/X] uprobes: x86: introduce abort_xol())
  2011-10-21 16:42             ` Oleg Nesterov
@ 2011-10-21 17:59               ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-21 17:59 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On 10/21, Oleg Nesterov wrote:
>
> On 10/21, Ananth N Mavinakayanahalli wrote:
> >
> > On Fri, Oct 21, 2011 at 08:12:07PM +0530, Srikar Dronamraju wrote:
> >
> > > > +void abort_xol(struct pt_regs *regs)
> > > > +{
> > > > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > > +	// !!! Dear Srikar and Ananth, please implement me !!!
> > > > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > > +	struct uprobe_task *utask = current->utask;
> > > > +	regs->ip = utask->vaddr;
> > >
> > > nit:
> > > Shouldnt we be setting the ip to the next instruction after this
> > > instruction?
> >
> > No, since we should re-execute the original instruction
>
> Yes,

In case it was not clear, I meant "agree with your 'No'".

> > after removing
> > the breakpoint.
>
> No? we should not remove this uprobe?
>
> > Also, wrt ip being set to the next instruction on a breakpoint hit,
> > that's arch specific.
>
> Probably yes, I am not sure. But:
>
> > For instance, on x86, it points to the next
> > instruction,
>
> No?
>
> 	/**
> 	 * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
> 	 * @regs: Reflects the saved state of the task after it has hit a breakpoint
> 	 * instruction.
> 	 * Return the address of the breakpoint instruction.
> 	 */
> 	unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
> 	{
> 		return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
> 	}
>
> Yes, initially regs->ip points to the next insn after int3, but
> utask->vaddr == get_uprobe_bkpt_addr() == addr of int3.

Ananth, Srikar, I'd suggest this test-case:

	#include <stdio.h>
	#include <signal.h>
	#include <ucontext.h>

	void *fault_insn;

	static inline void *uc_ip(struct ucontext *ctxt)
	{
		return (void*)ctxt->uc_mcontext.gregs[16];	/* 16 == REG_RIP on x86_64 */
	}

	void segv(int sig, siginfo_t *info, void *ctxt)
	{
		static int cnt;

		printf("SIGSEGV! ip=%p addr=%p\n", uc_ip(ctxt), info->si_addr);

		if (uc_ip(ctxt) != fault_insn)
			printf("ERR!! wrong ip\n");
		if (info->si_addr != (void*)0x12345678)
			printf("ERR!! wrong addr\n");

		if (++cnt == 3)
			signal(SIGSEGV, SIG_DFL);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= segv,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGSEGV, &sa, NULL);

		fault_insn = &&label;

	label:
		asm volatile ("movl $0x0,0x12345678");

		return 0;
	}

result:

	$ ulimit -c unlimited

	$ ./segv
	SIGSEGV! ip=0x4006eb addr=0x12345678
	SIGSEGV! ip=0x4006eb addr=0x12345678
	SIGSEGV! ip=0x4006eb addr=0x12345678
	Segmentation fault (core dumped)

	$ gdb -c ./core.1826
	...
	Program terminated with signal 11, Segmentation fault.
	#0  0x00000000004006eb in ?? ()

Now, if you insert a uprobe at the asm("movl") insn, the result should be
the same, or the patches I sent are wrong. In particular, the addr in the
coredump should be correct too. And consumer->handler() should be called
3 times as well, since this insn is really executed 3 times.

I have no idea how I can test this.

Oleg.



* test-case (Was: [PATCH 12/X] uprobes: x86: introduce abort_xol())
@ 2011-10-21 17:59               ` Oleg Nesterov
  0 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-21 17:59 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On 10/21, Oleg Nesterov wrote:
>
> On 10/21, Ananth N Mavinakayanahalli wrote:
> >
> > On Fri, Oct 21, 2011 at 08:12:07PM +0530, Srikar Dronamraju wrote:
> >
> > > > +void abort_xol(struct pt_regs *regs)
> > > > +{
> > > > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > > +	// !!! Dear Srikar and Ananth, please implement me !!!
> > > > +	// !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
> > > > +	struct uprobe_task *utask = current->utask;
> > > > +	regs->ip = utask->vaddr;
> > >
> > > nit:
> > > Shouldnt we be setting the ip to the next instruction after this
> > > instruction?
> >
> > No, since we should re-execute the original instruction
>
> Yes,

In case it was not clear, I meant "agree with your 'No'".

> > after removing
> > the breakpoint.
>
> No? we should not remove this uprobe?
>
> > Also, wrt ip being set to the next instruction on a breakpoint hit,
> > that's arch specific.
>
> Probably yes, I am not sure. But:
>
> > For instance, on x86, it points to the next
> > instruction,
>
> No?
>
> 	/**
> 	 * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
> 	 * @regs: Reflects the saved state of the task after it has hit a breakpoint
> 	 * instruction.
> 	 * Return the address of the breakpoint instruction.
> 	 */
> 	unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
> 	{
> 		return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
> 	}
>
> Yes, initially regs->ip points to the next insn after int3, but
> utask->vaddr == get_uprobe_bkpt_addr() == addr of int3.

Ananth, Srikar, I'd suggest this test-case:

	#include <stdio.h>
	#include <signal.h>
	#include <ucontext.h>

	void *fault_insn;

	static inline void *uc_ip(struct ucontext *ctxt)
	{
		return (void*)ctxt->uc_mcontext.gregs[16];
	}

	void segv(int sig, siginfo_t *info, void *ctxt)
	{
		static int cnt;

		printf("SIGSEGV! ip=%p addr=%p\n", uc_ip(ctxt), info->si_addr);

		if (uc_ip(ctxt) != fault_insn)
			printf("ERR!! wrong ip\n");
		if (info->si_addr != (void*)0x12345678)
			printf("ERR!! wrong addr\n");

		if (++cnt == 3)
			signal(SIGSEGV, SIG_DFL);
	}

	int main(void)
	{
		struct sigaction sa = {
			.sa_sigaction	= segv,
			.sa_flags	= SA_SIGINFO,
		};

		sigaction(SIGSEGV, &sa, NULL);

		fault_insn = &&label;

	label:
		asm volatile ("movl $0x0,0x12345678");

		return 0;
	}

result:

	$ ulimit -c unlimited

	$ ./segv
	SIGSEGV! ip=0x4006eb addr=0x12345678
	SIGSEGV! ip=0x4006eb addr=0x12345678
	SIGSEGV! ip=0x4006eb addr=0x12345678
	Segmentation fault (core dumped)

	$ gdb -c ./core.1826
	...
	Program terminated with signal 11, Segmentation fault.
	#0  0x00000000004006eb in ?? ()

Now, if you insert a uprobe at the asm("movl") insn, the result should be
the same, or the patches I sent are wrong. In particular, the addr in the
coredump should be correct too, and consumer->handler() should be called
3 times as well, since this insn is really executed 3 times.

I have no idea how I can test this.

Oleg.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: email@kvack.org

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 12/X] uprobes: x86: introduce abort_xol()
  2011-10-21 16:42             ` Oleg Nesterov
@ 2011-10-22  7:09               ` Ananth N Mavinakayanahalli
  -1 siblings, 0 replies; 330+ messages in thread
From: Ananth N Mavinakayanahalli @ 2011-10-22  7:09 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On Fri, Oct 21, 2011 at 06:42:21PM +0200, Oleg Nesterov wrote:
> On 10/21, Ananth N Mavinakayanahalli wrote:

...

> > For instance, on x86, it points to the next
> > instruction,
> 
> No?

At exception entry, we'd not have done the following fixup...

> 	/**
> 	 * get_uprobe_bkpt_addr - compute address of bkpt given post-bkpt regs
> 	 * @regs: Reflects the saved state of the task after it has hit a breakpoint
> 	 * instruction.
> 	 * Return the address of the breakpoint instruction.
> 	 */
> 	unsigned long __weak get_uprobe_bkpt_addr(struct pt_regs *regs)
> 	{
> 		return instruction_pointer(regs) - UPROBES_BKPT_INSN_SIZE;
> 	}
> 
> Yes, initially regs->ip points to the next insn after int3, but
> utask->vaddr == get_uprobe_bkpt_addr() == addr of int3.
> 
> Right?

Yes, we fix it up so we point to the right (breakpoint) address.

> > while on powerpc, the nip points to the breakpoint vaddr
> > at the time of exception.
> 
> I think get_uprobe_bkpt_addr() should be consistent on every arch.
> That is why (I think) it is __weak.

Yes, that is the intention.

> Anyway, abort_xol() has to be arch-specific.

Agree.

Ananth

^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 13/X] uprobes: introduce UTASK_SSTEP_TRAPPED logic
  2011-10-19 21:53       ` Oleg Nesterov
@ 2011-10-22  7:20         ` Ananth N Mavinakayanahalli
  -1 siblings, 0 replies; 330+ messages in thread
From: Ananth N Mavinakayanahalli @ 2011-10-22  7:20 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On Wed, Oct 19, 2011 at 11:53:44PM +0200, Oleg Nesterov wrote:
> Finally, add UTASK_SSTEP_TRAPPED state/code to handle the case when
> xol insn itself triggers the signal.
> 
> In this case we should restart the original insn even if the task is
> already SIGKILL'ed (say, the coredump should report the correct ip).
> This is even more important if the task has a handler for SIGSEGV/etc,
> The _same_ instruction should be repeated again after return from the
> signal handler, and SSTEP can never finish in this case.

Oleg,

Not sure I understand this completely...

When you say 'correct ip' you mean the original vaddr where we now have
a uprobe breakpoint and not the xol copy, right?

Coredump needs to report the correct ip, but should it also not report
correctly the instruction that caused the signal? Ergo, shouldn't we
put the original instruction back at the uprobed vaddr?

Ananth


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 13/X] uprobes: introduce UTASK_SSTEP_TRAPPED logic
  2011-10-22  7:20         ` Ananth N Mavinakayanahalli
@ 2011-10-24 14:41           ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-24 14:41 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On 10/22, Ananth N Mavinakayanahalli wrote:
>
> On Wed, Oct 19, 2011 at 11:53:44PM +0200, Oleg Nesterov wrote:
> > Finally, add UTASK_SSTEP_TRAPPED state/code to handle the case when
> > xol insn itself triggers the signal.
> >
> > In this case we should restart the original insn even if the task is
> > already SIGKILL'ed (say, the coredump should report the correct ip).
> > This is even more important if the task has a handler for SIGSEGV/etc,
> > The _same_ instruction should be repeated again after return from the
> > signal handler, and SSTEP can never finish in this case.
>
> Oleg,
>
> Not sure I understand this completely...

I hope you do not think I do ;)

> When you say 'correct ip' you mean the original vaddr where we now have
> a uprobe breakpoint and not the xol copy, right?

Yes,

> Coredump needs to report the correct ip, but should it also not report
> correctly the instruction that caused the signal? Ergo, shouldn't we
> put the original instruction back at the uprobed vaddr?

OK, now I see what you mean. I was confused by the "restore the original
instruction before _restart_" suggestion.

Agreed! It would be nice to "hide" these int3's if we dump the core, but
I think this is a bit off-topic. It makes sense to do this in any case,
even if the core-dumping was triggered by another thread/insn. It makes
sense to remove all int3's, not only the one at the regs->ip location.
But how can we do this? This is nontrivial.

And, even worse: suppose that you do "gdb probed_application". Now you
see int3's in the disassemble output. What can we do?

I think we can do nothing, at least currently. This just reflects the
fact that uprobe connects to inode, not to process/mm/etc.

What do you think?

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 11/X] uprobes: x86: introduce xol_was_trapped()
  2011-10-19 21:53       ` Oleg Nesterov
@ 2011-10-24 14:55         ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-24 14:55 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

> diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> index 1c30cfd..f0fbdab 100644
> --- a/arch/x86/include/asm/uprobes.h
> +++ b/arch/x86/include/asm/uprobes.h
> @@ -39,6 +39,7 @@ struct uprobe_arch_info {
> 
>  struct uprobe_task_arch_info {
>  	unsigned long saved_scratch_register;
> +	unsigned long saved_trap_no;
>  };
>  #else
>  struct uprobe_arch_info {};


One nit: I had to add saved_trap_no to the #else part (i.e., the
uprobe_task_arch_info stub).

--
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 13/X] uprobes: introduce UTASK_SSTEP_TRAPPED logic
  2011-10-24 14:41           ` Oleg Nesterov
@ 2011-10-24 15:16             ` Ananth N Mavinakayanahalli
  -1 siblings, 0 replies; 330+ messages in thread
From: Ananth N Mavinakayanahalli @ 2011-10-24 15:16 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On Mon, Oct 24, 2011 at 04:41:27PM +0200, Oleg Nesterov wrote:
> On 10/22, Ananth N Mavinakayanahalli wrote:
> >
> > On Wed, Oct 19, 2011 at 11:53:44PM +0200, Oleg Nesterov wrote:
> > > Finally, add UTASK_SSTEP_TRAPPED state/code to handle the case when
> > > xol insn itself triggers the signal.
> > >
> > > In this case we should restart the original insn even if the task is
> > > already SIGKILL'ed (say, the coredump should report the correct ip).
> > > This is even more important if the task has a handler for SIGSEGV/etc,
> > > The _same_ instruction should be repeated again after return from the
> > > signal handler, and SSTEP can never finish in this case.
> >
> > Oleg,
> >
> > Not sure I understand this completely...
> 
> I hope you do not think I do ;)

I think you understand it better than you think you do :-)

> > When you say 'correct ip' you mean the original vaddr where we now have
> > a uprobe breakpoint and not the xol copy, right?
> 
> Yes,
> 
> > Coredump needs to report the correct ip, but should it also not report
> > correctly the instruction that caused the signal? Ergo, shouldn't we
> > put the original instruction back at the uprobed vaddr?
> 
> OK, now I see what you mean. I was confused by the "restore the original
> instruction before _restart_" suggestion.
> 
> Agreed! it would be nice to "hide" these int3's if we dump the core, but
> I think this is a bit off-topic. It makes sense to do this in any case,
> even if the core-dumping was triggered by another thread/insn. It makes
> sense to remove all int3's, not only at regs->ip location. But how can
> we do this? This is nontrivial.

I don't think that is a problem.. see below...

> And. Even worse. Suppose that you do "gdb probed_application". Now you
> see int3's in the disassemble output. What can we do?

In this case, nothing.

> I think we can do nothing, at least currently. This just reflects the
> fact that uprobe connects to inode, not to process/mm/etc.
> 
> What do you think?

Thinking further on this, in the normal 'running gdb on a core' case, we
won't have this problem, as the binary that we point gdb to will be a
pristine one, without the uprobe int3s, right?

Ananth


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 11/X] uprobes: x86: introduce xol_was_trapped()
  2011-10-24 14:55         ` Srikar Dronamraju
@ 2011-10-24 16:07           ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-24 16:07 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Steven Rostedt, Linux-mm,
	Arnaldo Carvalho de Melo, Linus Torvalds, Jonathan Corbet,
	Masami Hiramatsu, Hugh Dickins, Christoph Hellwig,
	Ananth N Mavinakayanahalli, Thomas Gleixner, Andi Kleen,
	Andrew Morton, Jim Keniston, Roland McGrath, LKML

On 10/24, Srikar Dronamraju wrote:
>
> > diff --git a/arch/x86/include/asm/uprobes.h b/arch/x86/include/asm/uprobes.h
> > index 1c30cfd..f0fbdab 100644
> > --- a/arch/x86/include/asm/uprobes.h
> > +++ b/arch/x86/include/asm/uprobes.h
> > @@ -39,6 +39,7 @@ struct uprobe_arch_info {
> >
> >  struct uprobe_task_arch_info {
> >  	unsigned long saved_scratch_register;
> > +	unsigned long saved_trap_no;
> >  };
> >  #else
> >  struct uprobe_arch_info {};
>
>
> one nit
> I had to add saved_trap_no to #else part (i.e uprobe_arch_info ).

Yes, thanks, I didn't notice this is for X86_64 only.

And just in case, please feel free to rename/redo/whatever.

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 13/X] uprobes: introduce UTASK_SSTEP_TRAPPED logic
  2011-10-24 15:16             ` Ananth N Mavinakayanahalli
@ 2011-10-24 16:13               ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-24 16:13 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On 10/24, Ananth N Mavinakayanahalli wrote:
>
> On Mon, Oct 24, 2011 at 04:41:27PM +0200, Oleg Nesterov wrote:
> >
> > Agreed! it would be nice to "hide" these int3's if we dump the core, but
> > I think this is a bit off-topic. It makes sense to do this in any case,
> > even if the core-dumping was triggered by another thread/insn. It makes
> > sense to remove all int3's, not only at regs->ip location. But how can
> > we do this? This is nontrivial.
>
> I don't think that is a problem.. see below...
>
> > And. Even worse. Suppose that you do "gdb probed_application". Now you
> > see int3's in the disassemble output. What can we do?
>
> In this case, nothing.
>
> > I think we can do nothing, at least currently. This just reflects the
> > fact that uprobe connects to inode, not to process/mm/etc.
> >
> > What do you think?
>
> Thinking further on this, in the normal 'running gdb on a core' case, we
> won't have this problem, as the binary that we point gdb to, will be a
> pristine one, without the uprobe int3s, right?

Not sure I understand.

I meant, if we have a binary with uprobes (iow, register_uprobe() installed
uprobes into that file), then gdb will see int3's with or without the core.
Or you can add a uprobe into glibc, say to probe getpid(). Now (again,
with or without the core) disassemble shows that getpid() starts with int3.

But I guess you meant something else...

Oleg.


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: [PATCH 13/X] uprobes: introduce UTASK_SSTEP_TRAPPED logic
  2011-10-24 16:13               ` Oleg Nesterov
@ 2011-10-25  6:01                 ` Ananth N Mavinakayanahalli
  -1 siblings, 0 replies; 330+ messages in thread
From: Ananth N Mavinakayanahalli @ 2011-10-25  6:01 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On Mon, Oct 24, 2011 at 06:13:06PM +0200, Oleg Nesterov wrote:
> On 10/24, Ananth N Mavinakayanahalli wrote:
> >
> > Thinking further on this, in the normal 'running gdb on a core' case, we
> > won't have this problem, as the binary that we point gdb to, will be a
> > pristine one, without the uprobe int3s, right?
> 
> Not sure I understand.
> 
> I meant, if we have a binary with uprobes (iow, register_uprobe() installed
> uprobes into that file), then gdb will see int3's with or without the core.
> Or you can add uprobe into glibc, say you can probe getpid(). Now (again,
> with or without the core) disassemble shows that getpid() starts with int3.
> 
> But I guess you meant something else...

No, you are right... my inference was wrong. A core generated with a
uprobe in place and an explicit raise(SIGABRT) does show the breakpoint.

(gdb) disassemble start_thread2
Dump of assembler code for function start_thread2:
   0x0000000000400831 <+0>:	int3   
   0x0000000000400832 <+1>:	mov    %rsp,%rbp
   0x0000000000400835 <+4>:	sub    $0x10,%rsp
   0x0000000000400839 <+8>:	mov    %rdi,-0x8(%rbp)
   0x000000000040083d <+12>:	callq  0x400650 <getpid@plt>

Now, I guess we need to agree on what is the acceptable behavior in the
uprobes case. What's your suggestion?

Ananth


^ permalink raw reply	[flat|nested] 330+ messages in thread

* Re: test-case (Was: [PATCH 12/X] uprobes: x86: introduce abort_xol())
  2011-10-21 17:59               ` Oleg Nesterov
@ 2011-10-25 14:06                 ` Srikar Dronamraju
  -1 siblings, 0 replies; 330+ messages in thread
From: Srikar Dronamraju @ 2011-10-25 14:06 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Ananth N Mavinakayanahalli, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

> 
> Ananth, Srikar, I'd suggest this test-case:
> 
> 	#include <stdio.h>
> 	#include <signal.h>
> 	#include <ucontext.h>
> 
> 	void *fault_insn;
> 
> 	static inline void *uc_ip(struct ucontext *ctxt)
> 	{
> 		return (void*)ctxt->uc_mcontext.gregs[16];
> 	}
> 
> 	void segv(int sig, siginfo_t *info, void *ctxt)
> 	{
> 		static int cnt;
> 
> 		printf("SIGSEGV! ip=%p addr=%p\n", uc_ip(ctxt), info->si_addr);
> 
> 		if (uc_ip(ctxt) != fault_insn)
> 			printf("ERR!! wrong ip\n");
> 		if (info->si_addr != (void*)0x12345678)
> 			printf("ERR!! wrong addr\n");
> 
> 		if (++cnt == 3)
> 			signal(SIGSEGV, SIG_DFL);
> 	}
> 
> 	int main(void)
> 	{
> 		struct sigaction sa = {
> 			.sa_sigaction	= segv,
> 			.sa_flags	= SA_SIGINFO,
> 		};
> 
> 		sigaction(SIGSEGV, &sa, NULL);
> 
> 		fault_insn = &&label;
> 
> 	label:
> 		asm volatile ("movl $0x0,0x12345678");
> 
> 		return 0;
> 	}
> 
> result:
> 
> 	$ ulimit -c unlimited
> 
> 	$ ./segv
> 	SIGSEGV! ip=0x4006eb addr=0x12345678
> 	SIGSEGV! ip=0x4006eb addr=0x12345678
> 	SIGSEGV! ip=0x4006eb addr=0x12345678
> 	Segmentation fault (core dumped)
> 
> 	$ gdb -c ./core.1826
> 	...
> 	Program terminated with signal 11, Segmentation fault.
> 	#0  0x00000000004006eb in ?? ()
> 
> Now. If you insert a uprobe at the asm("movl") insn, the result should be
> the same, or the patches I sent are wrong. In particular, the addr in the
> coredump should be correct too. And consumer->handler() should be called
> 3 times too; this insn really is executed 3 times.
> 
> I have no idea how I can test this.
> 

I have tested this on both x86_32 and x86_64 and can confirm that the
behaviour is the same with or without a uprobe placed at that instruction.
This is with the uprobes code plus your changes.

However, on x86_32 the output is different from x86_64.

On x86_32 (I have additionally printed uc_ip and fault_insn):

SIGSEGV! ip=0x10246 addr=0x12345678
ERR!! wrong ip uc_ip(ctxt) = 10246 fault_insn = 804856c
SIGSEGV! ip=0x10246 addr=0x12345678
ERR!! wrong ip uc_ip(ctxt) = 10246 fault_insn = 804856c
SIGSEGV! ip=0x10246 addr=0x12345678
ERR!! wrong ip uc_ip(ctxt) = 10246 fault_insn = 804856c
Segmentation fault

fault_insn matches the address shown by gdb's disassemble.
I am still trying to dig up what uc_ip is and why it is different on x86_32.

On x86_64 the result is what you pasted above.


Also, I was thinking about your suggestion of making abort_xol a weak
function. In that case we could have an architecture-independent function
in kernel/uprobes.c which is just a wrapper around set_instruction_pointer.

void __weak abort_xol(struct pt_regs *regs, struct uprobe_task *utask)
{
	set_instruction_pointer(regs, utask->vaddr);	
}

where it would be called from uprobe_notify_resume() as

	abort_xol(regs, utask);

If other archs want to do something else, they can override the
abort_xol definition.

-- 
Thanks and Regards
Srikar



* Re: [PATCH 13/X] uprobes: introduce UTASK_SSTEP_TRAPPED logic
  2011-10-25  6:01                 ` Ananth N Mavinakayanahalli
@ 2011-10-25 14:30                   ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-25 14:30 UTC (permalink / raw)
  To: Ananth N Mavinakayanahalli
  Cc: Srikar Dronamraju, Peter Zijlstra, Ingo Molnar, Steven Rostedt,
	Linux-mm, Arnaldo Carvalho de Melo, Linus Torvalds,
	Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On 10/25, Ananth N Mavinakayanahalli wrote:
>
> No, you are right... my inference was wrong. A core generated via an
> explicit raise(SIGABRT), with a uprobe in place, does show the breakpoint.
>
> (gdb) disassemble start_thread2
> Dump of assembler code for function start_thread2:
>    0x0000000000400831 <+0>:	int3
>    0x0000000000400832 <+1>:	mov    %rsp,%rbp
>    0x0000000000400835 <+4>:	sub    $0x10,%rsp
>    0x0000000000400839 <+8>:	mov    %rdi,-0x8(%rbp)
>    0x000000000040083d <+12>:	callq  0x400650 <getpid@plt>
>
> Now, I guess we need to agree on what is the acceptable behavior in the
> uprobes case. What's your suggestion?

Well, personally I think this is acceptable.

Once again, uprobes were designed to be "system wide", and each uprobe
attaches to the file; this int3 reflects that fact. In any case, I do
not see how we can hide these int3's. Perhaps we could fool ptrace/core,
but I am not sure that would really be good; it could add more confusion.
And the application itself can read its own .text and see the int3, so
what can we do?

Oleg.



* Re: test-case (Was: [PATCH 12/X] uprobes: x86: introduce abort_xol())
  2011-10-25 14:06                 ` Srikar Dronamraju
@ 2011-10-25 15:49                   ` Oleg Nesterov
  -1 siblings, 0 replies; 330+ messages in thread
From: Oleg Nesterov @ 2011-10-25 15:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ananth N Mavinakayanahalli, Peter Zijlstra, Ingo Molnar,
	Steven Rostedt, Linux-mm, Arnaldo Carvalho de Melo,
	Linus Torvalds, Jonathan Corbet, Masami Hiramatsu, Hugh Dickins,
	Christoph Hellwig, Thomas Gleixner, Andi Kleen, Andrew Morton,
	Jim Keniston, Roland McGrath, LKML

On 10/25, Srikar Dronamraju wrote:
> >
> > 	static inline void *uc_ip(struct ucontext *ctxt)
> > 	{
> > 		return (void*)ctxt->uc_mcontext.gregs[16];
> > 	}
> > ...
> >
> I have tested this on both x86_32 and x86_64 and can confirm that the
> behaviour is same with or without uprobes placed at that instruction.
> This is on the uprobes code with your changes.

Great, thanks.

> However, on x86_32 the output is different from x86_64.
>
> On x86_32 (I have additionally printed uc_ip and fault_insn):
>
> SIGSEGV! ip=0x10246 addr=0x12345678
> ERR!! wrong ip uc_ip(ctxt) = 10246 fault_insn = 804856c

Yep. uc_ip() is not correct on x86_32. Sorry, I forgot to mention this.

I was really surprised when I wrote this test. I simply can't understand
how I am supposed to play with ucontext in user space. I guess uc_ip()
should use REG_EIP instead of 16, but I wasn't able to compile it even
after I added __USE_GNU. It would be even better to use sigcontext
instead of the ugly mcontext_t, but this looks "impossible". The kernel
is much simpler ;)


> I am still trying to dig up what uc_ip is and why it is different on x86_32.

See above. I guess it needs ctxt->uc_mcontext.gregs[14]. Or REG_EIP.

uc_ip() simply reads sigcontext->ip passed by setup_sigcontext().

> Also, I was thinking about your suggestion of making abort_xol a weak
> function. In that case we could have an architecture-independent function
> in kernel/uprobes.c which is just a wrapper around set_instruction_pointer.
>
> void __weak abort_xol(struct pt_regs *regs, struct uprobe_task *utask)
> {
> 	set_instruction_pointer(regs, utask->vaddr);
> }
>
> where it would be called from uprobe_notify_resume() as
>
> 	abort_xol(regs, utask);
>
> If other archs want to do something else, they can override the
> abort_xol definition.

I didn't suggest this ;) But looks reasonable to me. And afaics x86_32
can use this arch-independent function.

Oleg.



2011-10-25 14:30                   ` Oleg Nesterov
2011-10-19 21:54     ` [PATCH 14/X] uprobes: uprobe_deny_signal: check __fatal_signal_pending() Oleg Nesterov
2011-10-19 21:54       ` Oleg Nesterov
