linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore)
@ 2018-07-16  8:47 Ravi Bangoria
  2018-07-16  8:47 ` [PATCH v6 1/6] Uprobes: Simplify uprobe_register() body Ravi Bangoria
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-16  8:47 UTC (permalink / raw)
  To: srikar, oleg, rostedt, mhiramat
  Cc: peterz, mingo, acme, alexander.shishkin, jolsa, namhyung,
	linux-kernel, ananth, alexis.berlemont, naveen.n.rao,
	linux-arm-kernel, linux-mips, linux, ralf, paul.burton,
	Ravi Bangoria

Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
have these markers embedded in them. These markers are added by developer
at important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:

    if (reference_counter > 0) {
        Execute marker instructions;
    }

Default value of reference counter is 0. Tracer has to increment the 
reference counter before tracing on a marker and decrement it when
done with the tracing.

Currently, perf tool has limited supports for SDT markers. I.e. it
can not trace markers surrounded by reference counter. Also, it's
not easy to add reference counter logic in userspace tool like perf,
so basic idea for this patchset is to add reference counter logic in
the a uprobe infrastructure. Ex,[2]

  # cat tick.c
    ... 
    for (i = 0; i < 100; i++) {
	DTRACE_PROBE1(tick, loop1, i);
        if (TICK_LOOP2_ENABLED()) {
            DTRACE_PROBE1(tick, loop2, i); 
        }
        printf("hi: %d\n", i); 
        sleep(1);
    }   
    ... 

Here tick:loop1 is marker without reference counter where as tick:loop2
is surrounded by reference counter condition.

  # perf buildid-cache --add /tmp/tick
  # perf probe sdt_tick:loop1
  # perf probe sdt_tick:loop2

  # perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
  hi: 0
  hi: 1
  hi: 2
  ^C
  Performance counter stats for '/tmp/tick':
             3      sdt_tick:loop1
             0      sdt_tick:loop2
     2.747086086 seconds time elapsed

Perf failed to record data for tick:loop2. Same experiment with this
patch series:

  # ./perf buildid-cache --add /tmp/tick
  # ./perf probe sdt_tick:loop2
  # ./perf stat -e sdt_tick:loop2 /tmp/tick
    hi: 0
    hi: 1
    hi: 2
    ^C  
     Performance counter stats for '/tmp/tick':
                 3      sdt_tick:loop2
       2.561851452 seconds time elapsed


v6 changes:
 - Do not export struct uprobe outside of uprobe.c. Instead, use
   container_of(arch_uprobe) to regain uprobe in uprobe_write_opcode().
 - Allow 0 as a special value for reference counter _offset_. I.e.
   two uprobes, one having ref_ctr_offset=0 and the other having
   non-zero ref_ctr_offset can coexists.
 - If vma holding reference counter is not present while patching an
   instruction, we add that uprobe in delayed_uprobe_list. When
   appropriate mapping is created, we increment the reference counter
   and remove uprobe from delayed_uprobe_list. While doing all this,
   v5 was searching for all such uprobes in uprobe_tree which is not 
   require. Also, uprobes are stored in rbtree with inode+offset as
   the key and v5 was searching uprobe based on inode+ref_ctr_offset
   which is wrong too. Fix this by directly looing on delayed_uprobe_list.
 - Consider VM_SHARED vma as invalid for reference counter. Return
   false from valid_ref_ctr_vma() if vma->vm_flags has VM_SHARED set.
 - No need to use FOLL_FORCE in get_user_pages_remote() while getting
   a page to update reference counter. Remove it.
 - Do not mention usage of reference counter in Documentation/.


v5 can be found at: https://lkml.org/lkml/2018/6/28/51

Ravi Bangoria (6):
  Uprobes: Simplify uprobe_register() body
  Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
  Uprobes: Support SDT markers having reference count (semaphore)
  trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
  Uprobes/sdt: Prevent multiple reference counter for same uprobe
  perf probe: Support SDT markers having reference counter (semaphore)

 arch/arm/probes/uprobes/core.c |   2 +-
 arch/mips/kernel/uprobes.c     |   2 +-
 include/linux/uprobes.h        |   7 +-
 kernel/events/uprobes.c        | 358 ++++++++++++++++++++++++++++++++++++-----
 kernel/trace/trace.c           |   2 +-
 kernel/trace/trace_uprobe.c    |  78 ++++++++-
 tools/perf/util/probe-event.c  |  39 ++++-
 tools/perf/util/probe-event.h  |   1 +
 tools/perf/util/probe-file.c   |  34 +++-
 tools/perf/util/probe-file.h   |   1 +
 tools/perf/util/symbol-elf.c   |  46 ++++--
 tools/perf/util/symbol.h       |   7 +
 12 files changed, 503 insertions(+), 74 deletions(-)

-- 
2.14.4


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v6 1/6] Uprobes: Simplify uprobe_register() body
  2018-07-16  8:47 [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
@ 2018-07-16  8:47 ` Ravi Bangoria
  2018-07-16  8:47 ` [PATCH v6 2/6] Uprobe: Additional argument arch_uprobe to uprobe_write_opcode() Ravi Bangoria
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-16  8:47 UTC (permalink / raw)
  To: srikar, oleg, rostedt, mhiramat
  Cc: peterz, mingo, acme, alexander.shishkin, jolsa, namhyung,
	linux-kernel, ananth, alexis.berlemont, naveen.n.rao,
	linux-arm-kernel, linux-mips, linux, ralf, paul.burton,
	Ravi Bangoria

Simplify uprobe_register() function body and let __uprobe_register()
handle everything. Also move dependency functions around to fix build
failures.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
 kernel/events/uprobes.c | 69 ++++++++++++++++++++++++++-----------------------
 1 file changed, 36 insertions(+), 33 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ccc579a7d32e..471eac896635 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -840,13 +840,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 	return err;
 }
 
-static int __uprobe_register(struct uprobe *uprobe, struct uprobe_consumer *uc)
-{
-	consumer_add(uprobe, uc);
-	return register_for_each_vma(uprobe, uc);
-}
-
-static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
+static void
+__uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *uc)
 {
 	int err;
 
@@ -860,24 +855,46 @@ static void __uprobe_unregister(struct uprobe *uprobe, struct uprobe_consumer *u
 }
 
 /*
- * uprobe_register - register a probe
+ * uprobe_unregister - unregister a already registered probe.
+ * @inode: the file in which the probe has to be removed.
+ * @offset: offset from the start of the file.
+ * @uc: identify which probe if multiple probes are colocated.
+ */
+void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+{
+	struct uprobe *uprobe;
+
+	uprobe = find_uprobe(inode, offset);
+	if (WARN_ON(!uprobe))
+		return;
+
+	down_write(&uprobe->register_rwsem);
+	__uprobe_unregister(uprobe, uc);
+	up_write(&uprobe->register_rwsem);
+	put_uprobe(uprobe);
+}
+EXPORT_SYMBOL_GPL(uprobe_unregister);
+
+/*
+ * __uprobe_register - register a probe
  * @inode: the file in which the probe has to be placed.
  * @offset: offset from the start of the file.
  * @uc: information on howto handle the probe..
  *
- * Apart from the access refcount, uprobe_register() takes a creation
+ * Apart from the access refcount, __uprobe_register() takes a creation
  * refcount (thro alloc_uprobe) if and only if this @uprobe is getting
  * inserted into the rbtree (i.e first consumer for a @inode:@offset
  * tuple).  Creation refcount stops uprobe_unregister from freeing the
  * @uprobe even before the register operation is complete. Creation
  * refcount is released when the last @uc for the @uprobe
- * unregisters. Caller of uprobe_register() is required to keep @inode
+ * unregisters. Caller of __uprobe_register() is required to keep @inode
  * (and the containing mount) referenced.
  *
  * Return errno if it cannot successully install probes
  * else return 0 (success)
  */
-int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
+static int __uprobe_register(struct inode *inode, loff_t offset,
+			     struct uprobe_consumer *uc)
 {
 	struct uprobe *uprobe;
 	int ret;
@@ -904,7 +921,8 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *
 	down_write(&uprobe->register_rwsem);
 	ret = -EAGAIN;
 	if (likely(uprobe_is_active(uprobe))) {
-		ret = __uprobe_register(uprobe, uc);
+		consumer_add(uprobe, uc);
+		ret = register_for_each_vma(uprobe, uc);
 		if (ret)
 			__uprobe_unregister(uprobe, uc);
 	}
@@ -915,6 +933,12 @@ int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *
 		goto retry;
 	return ret;
 }
+
+int uprobe_register(struct inode *inode, loff_t offset,
+		    struct uprobe_consumer *uc)
+{
+	return __uprobe_register(inode, offset, uc);
+}
 EXPORT_SYMBOL_GPL(uprobe_register);
 
 /*
@@ -946,27 +970,6 @@ int uprobe_apply(struct inode *inode, loff_t offset,
 	return ret;
 }
 
-/*
- * uprobe_unregister - unregister a already registered probe.
- * @inode: the file in which the probe has to be removed.
- * @offset: offset from the start of the file.
- * @uc: identify which probe if multiple probes are colocated.
- */
-void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
-{
-	struct uprobe *uprobe;
-
-	uprobe = find_uprobe(inode, offset);
-	if (WARN_ON(!uprobe))
-		return;
-
-	down_write(&uprobe->register_rwsem);
-	__uprobe_unregister(uprobe, uc);
-	up_write(&uprobe->register_rwsem);
-	put_uprobe(uprobe);
-}
-EXPORT_SYMBOL_GPL(uprobe_unregister);
-
 static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 2/6] Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
  2018-07-16  8:47 [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
  2018-07-16  8:47 ` [PATCH v6 1/6] Uprobes: Simplify uprobe_register() body Ravi Bangoria
@ 2018-07-16  8:47 ` Ravi Bangoria
  2018-07-16  8:47 ` [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-16  8:47 UTC (permalink / raw)
  To: srikar, oleg, rostedt, mhiramat
  Cc: peterz, mingo, acme, alexander.shishkin, jolsa, namhyung,
	linux-kernel, ananth, alexis.berlemont, naveen.n.rao,
	linux-arm-kernel, linux-mips, linux, ralf, paul.burton,
	Ravi Bangoria

Add addition argument 'arch_uprobe' to uprobe_write_opcode().
We need this in later set of patches.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
 arch/arm/probes/uprobes/core.c | 2 +-
 arch/mips/kernel/uprobes.c     | 2 +-
 include/linux/uprobes.h        | 2 +-
 kernel/events/uprobes.c        | 9 +++++----
 4 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/arm/probes/uprobes/core.c b/arch/arm/probes/uprobes/core.c
index d1329f1ba4e4..bf992264060e 100644
--- a/arch/arm/probes/uprobes/core.c
+++ b/arch/arm/probes/uprobes/core.c
@@ -32,7 +32,7 @@ bool is_swbp_insn(uprobe_opcode_t *insn)
 int set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	     unsigned long vaddr)
 {
-	return uprobe_write_opcode(mm, vaddr,
+	return uprobe_write_opcode(auprobe, mm, vaddr,
 		   __opcode_to_mem_arm(auprobe->bpinsn));
 }
 
diff --git a/arch/mips/kernel/uprobes.c b/arch/mips/kernel/uprobes.c
index f7a0645ccb82..4aaff3b3175c 100644
--- a/arch/mips/kernel/uprobes.c
+++ b/arch/mips/kernel/uprobes.c
@@ -224,7 +224,7 @@ unsigned long arch_uretprobe_hijack_return_addr(
 int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	unsigned long vaddr)
 {
-	return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN);
+	return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
 }
 
 void arch_uprobe_copy_ixol(struct page *page, unsigned long vaddr,
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index 0a294e950df8..bb9d2084af03 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -121,7 +121,7 @@ extern bool is_swbp_insn(uprobe_opcode_t *insn);
 extern bool is_trap_insn(uprobe_opcode_t *insn);
 extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
-extern int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
+extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 471eac896635..c0418ba52ba8 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -299,8 +299,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
  * Called with mm->mmap_sem held for write.
  * Return 0 (success) or a negative errno.
  */
-int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
-			uprobe_opcode_t opcode)
+int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
+			unsigned long vaddr, uprobe_opcode_t opcode)
 {
 	struct page *old_page, *new_page;
 	struct vm_area_struct *vma;
@@ -351,7 +351,7 @@ int uprobe_write_opcode(struct mm_struct *mm, unsigned long vaddr,
  */
 int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
-	return uprobe_write_opcode(mm, vaddr, UPROBE_SWBP_INSN);
+	return uprobe_write_opcode(auprobe, mm, vaddr, UPROBE_SWBP_INSN);
 }
 
 /**
@@ -366,7 +366,8 @@ int __weak set_swbp(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned
 int __weak
 set_orig_insn(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr)
 {
-	return uprobe_write_opcode(mm, vaddr, *(uprobe_opcode_t *)&auprobe->insn);
+	return uprobe_write_opcode(auprobe, mm, vaddr,
+			*(uprobe_opcode_t *)&auprobe->insn);
 }
 
 static struct uprobe *get_uprobe(struct uprobe *uprobe)
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore)
  2018-07-16  8:47 [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
  2018-07-16  8:47 ` [PATCH v6 1/6] Uprobes: Simplify uprobe_register() body Ravi Bangoria
  2018-07-16  8:47 ` [PATCH v6 2/6] Uprobe: Additional argument arch_uprobe to uprobe_write_opcode() Ravi Bangoria
@ 2018-07-16  8:47 ` Ravi Bangoria
  2018-07-23 16:26   ` Oleg Nesterov
  2018-07-24 14:21   ` Masami Hiramatsu
  2018-07-16  8:47 ` [PATCH v6 4/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe Ravi Bangoria
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-16  8:47 UTC (permalink / raw)
  To: srikar, oleg, rostedt, mhiramat
  Cc: peterz, mingo, acme, alexander.shishkin, jolsa, namhyung,
	linux-kernel, ananth, alexis.berlemont, naveen.n.rao,
	linux-arm-kernel, linux-mips, linux, ralf, paul.burton,
	Ravi Bangoria

Userspace Statically Defined Tracepoints[1] are dtrace style markers
inside userspace applications. Applications like PostgreSQL, MySQL,
Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
have these markers embedded in them. These markers are added by developer
at important places in the code. Each marker source expands to a single
nop instruction in the compiled code but there may be additional
overhead for computing the marker arguments which expands to couple of
instructions. In case the overhead is more, execution of it can be
omitted by runtime if() condition when no one is tracing on the marker:

    if (reference_counter > 0) {
        Execute marker instructions;
    }

Default value of reference counter is 0. Tracer has to increment the
reference counter before tracing on a marker and decrement it when
done with the tracing.

Implement the reference counter logic in core uprobe. User will be
able to use it from trace_uprobe as well as from kernel module. New
trace_uprobe definition with reference counter will now be:

    <path>:<offset>[(ref_ctr_offset)]

where ref_ctr_offset is an optional field. For kernel module, new
variant of uprobe_register() has been introduced:

    uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer)

No new variant for uprobe_unregister() because it's assumed to have
only one reference counter for one uprobe.

[1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation

Note: 'reference counter' is called as 'semaphore' in original Dtrace
(or Systemtap, bcc and even in ELF) documentation and code. But the
term 'semaphore' is misleading in this context. This is just a counter
used to hold number of tracers tracing on a marker. This is not really
used for any synchronization. So we are referring it as 'reference
counter' in kernel / perf code.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
 include/linux/uprobes.h     |   5 +
 kernel/events/uprobes.c     | 232 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/trace/trace.c        |   2 +-
 kernel/trace/trace_uprobe.c |  38 +++++++-
 4 files changed, 267 insertions(+), 10 deletions(-)

diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index bb9d2084af03..103a48a48872 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -123,6 +123,7 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
 extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
 extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
 extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
+extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
 extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
 extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
 extern int uprobe_mmap(struct vm_area_struct *vma);
@@ -160,6 +161,10 @@ uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
 {
 	return -ENOSYS;
 }
+static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+{
+	return -ENOSYS;
+}
 static inline int
 uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add)
 {
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index c0418ba52ba8..84da8512a974 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -73,6 +73,7 @@ struct uprobe {
 	struct uprobe_consumer	*consumers;
 	struct inode		*inode;		/* Also hold a ref to inode */
 	loff_t			offset;
+	loff_t			ref_ctr_offset;
 	unsigned long		flags;
 
 	/*
@@ -88,6 +89,15 @@ struct uprobe {
 	struct arch_uprobe	arch;
 };
 
+struct delayed_uprobe {
+	struct list_head list;
+	struct uprobe *uprobe;
+	struct mm_struct *mm;
+};
+
+static DEFINE_MUTEX(delayed_uprobe_lock);
+static LIST_HEAD(delayed_uprobe_list);
+
 /*
  * Execute out of line area: anonymous executable mapping installed
  * by the probed task to execute the copy of the original instruction
@@ -282,6 +292,154 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 	return 1;
 }
 
+static struct delayed_uprobe *
+delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
+{
+	struct delayed_uprobe *du;
+
+	list_for_each_entry(du, &delayed_uprobe_list, list)
+		if (du->uprobe == uprobe && du->mm == mm)
+			return du;
+	return NULL;
+}
+
+static int delayed_uprobe_add(struct uprobe *uprobe, struct mm_struct *mm)
+{
+	struct delayed_uprobe *du;
+
+	if (delayed_uprobe_check(uprobe, mm))
+		return 0;
+
+	du  = kzalloc(sizeof(*du), GFP_KERNEL);
+	if (!du)
+		return -ENOMEM;
+
+	du->uprobe = uprobe;
+	du->mm = mm;
+	list_add(&du->list, &delayed_uprobe_list);
+	return 0;
+}
+
+static void delayed_uprobe_delete(struct delayed_uprobe *du)
+{
+	if (!du)
+		return;
+	list_del(&du->list);
+	kfree(du);
+}
+
+static void delayed_uprobe_remove(struct uprobe *uprobe, struct mm_struct *mm)
+{
+	struct list_head *pos, *q;
+	struct delayed_uprobe *du;
+
+	list_for_each_safe(pos, q, &delayed_uprobe_list) {
+		du = list_entry(pos, struct delayed_uprobe, list);
+		if (uprobe) {
+			if (du->uprobe == uprobe && du->mm == mm)
+				delayed_uprobe_delete(du);
+		} else if (du->mm == mm) {
+			delayed_uprobe_delete(du);
+		}
+	}
+}
+
+static bool valid_ref_ctr_vma(struct uprobe *uprobe,
+			      struct vm_area_struct *vma)
+{
+	unsigned long vaddr = offset_to_vaddr(vma, uprobe->ref_ctr_offset);
+
+	return uprobe->ref_ctr_offset &&
+		vma->vm_file &&
+		file_inode(vma->vm_file) == uprobe->inode &&
+		vma->vm_flags & VM_WRITE &&
+		!(vma->vm_flags & VM_SHARED) &&
+		vma->vm_start <= vaddr &&
+		vma->vm_end > vaddr;
+}
+
+static struct vm_area_struct *
+find_ref_ctr_vma(struct uprobe *uprobe, struct mm_struct *mm)
+{
+	struct vm_area_struct *tmp;
+
+	for (tmp = mm->mmap; tmp; tmp = tmp->vm_next)
+		if (valid_ref_ctr_vma(uprobe, tmp))
+			return tmp;
+
+	return NULL;
+}
+
+static int
+__update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
+{
+	void *kaddr;
+	struct page *page;
+	struct vm_area_struct *vma;
+	int ret = 0;
+	short *ptr;
+
+	if (vaddr == 0 || d == 0)
+		return -EINVAL;
+
+	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
+			FOLL_WRITE, &page, &vma, NULL);
+	if (unlikely(ret <= 0)) {
+		/*
+		 * We are asking for 1 page. If get_user_pages_remote() fails,
+		 * it may return 0, in that case we have to return error.
+		 */
+		ret = (ret == 0) ? -EBUSY : ret;
+		pr_warn("Failed to %s ref_ctr. (%d)\n",
+			d > 0 ? "increment" : "decrement", ret);
+		return ret;
+	}
+
+	kaddr = kmap_atomic(page);
+	ptr = kaddr + (vaddr & ~PAGE_MASK);
+
+	if (unlikely(*ptr + d < 0)) {
+		pr_warn("ref_ctr going negative. vaddr: 0x%lx, "
+			"curr val: %d, delta: %d\n", vaddr, *ptr, d);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	*ptr += d;
+	ret = 0;
+out:
+	kunmap_atomic(kaddr);
+	put_page(page);
+	return ret;
+}
+
+static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
+			  bool is_register)
+{
+	struct vm_area_struct *rc_vma;
+	unsigned long rc_vaddr;
+	int ret = 0;
+
+	rc_vma = find_ref_ctr_vma(uprobe, mm);
+
+	if (rc_vma) {
+		rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
+		ret = __update_ref_ctr(mm, rc_vaddr, is_register ? 1 : -1);
+
+		if (is_register)
+			return ret;
+	}
+
+	mutex_lock(&delayed_uprobe_lock);
+	if (is_register)
+		ret = delayed_uprobe_add(uprobe, mm);
+	else
+		delayed_uprobe_remove(uprobe, mm);
+	mutex_unlock(&delayed_uprobe_lock);
+
+	return ret;
+}
+
 /*
  * NOTE:
  * Expect the breakpoint instruction to be the smallest size instruction for
@@ -302,9 +460,13 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
 int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 			unsigned long vaddr, uprobe_opcode_t opcode)
 {
+	struct uprobe *uprobe;
 	struct page *old_page, *new_page;
 	struct vm_area_struct *vma;
-	int ret;
+	int ret, is_register, ref_ctr_updated = 0;
+
+	is_register = is_swbp_insn(&opcode);
+	uprobe = container_of(auprobe, struct uprobe, arch);
 
 retry:
 	/* Read the page with vaddr into memory */
@@ -317,6 +479,15 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 	if (ret <= 0)
 		goto put_old;
 
+	/* We are going to replace instruction, update ref_ctr. */
+	if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
+		ret = update_ref_ctr(uprobe, mm, is_register);
+		if (ret)
+			goto put_old;
+
+		ref_ctr_updated = 1;
+	}
+
 	ret = anon_vma_prepare(vma);
 	if (ret)
 		goto put_old;
@@ -337,6 +508,11 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 
 	if (unlikely(ret == -EAGAIN))
 		goto retry;
+
+	/* Revert back reference counter if instruction update failed. */
+	if (ret && is_register && ref_ctr_updated)
+		update_ref_ctr(uprobe, mm, false);
+
 	return ret;
 }
 
@@ -484,7 +660,8 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
 	return u;
 }
 
-static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
+static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
+				   loff_t ref_ctr_offset)
 {
 	struct uprobe *uprobe, *cur_uprobe;
 
@@ -494,6 +671,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
 
 	uprobe->inode = inode;
 	uprobe->offset = offset;
+	uprobe->ref_ctr_offset = ref_ctr_offset;
 	init_rwsem(&uprobe->register_rwsem);
 	init_rwsem(&uprobe->consumer_rwsem);
 
@@ -895,7 +1073,7 @@ EXPORT_SYMBOL_GPL(uprobe_unregister);
  * else return 0 (success)
  */
 static int __uprobe_register(struct inode *inode, loff_t offset,
-			     struct uprobe_consumer *uc)
+			     loff_t ref_ctr_offset, struct uprobe_consumer *uc)
 {
 	struct uprobe *uprobe;
 	int ret;
@@ -912,7 +1090,7 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 		return -EINVAL;
 
  retry:
-	uprobe = alloc_uprobe(inode, offset);
+	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (!uprobe)
 		return -ENOMEM;
 	/*
@@ -938,10 +1116,17 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 int uprobe_register(struct inode *inode, loff_t offset,
 		    struct uprobe_consumer *uc)
 {
-	return __uprobe_register(inode, offset, uc);
+	return __uprobe_register(inode, offset, 0, uc);
 }
 EXPORT_SYMBOL_GPL(uprobe_register);
 
+int uprobe_register_refctr(struct inode *inode, loff_t offset,
+			   loff_t ref_ctr_offset, struct uprobe_consumer *uc)
+{
+	return __uprobe_register(inode, offset, ref_ctr_offset, uc);
+}
+EXPORT_SYMBOL_GPL(uprobe_register_refctr);
+
 /*
  * uprobe_apply - unregister a already registered probe.
  * @inode: the file in which the probe has to be removed.
@@ -1060,6 +1245,31 @@ static void build_probe_list(struct inode *inode,
 	spin_unlock(&uprobes_treelock);
 }
 
+static int delayed_uprobe_install(struct vm_area_struct *vma)
+{
+	struct list_head *pos, *q;
+	struct delayed_uprobe *du;
+	unsigned long vaddr;
+	int ret = 0, err = 0;
+
+	mutex_lock(&delayed_uprobe_lock);
+	list_for_each_safe(pos, q, &delayed_uprobe_list) {
+		du = list_entry(pos, struct delayed_uprobe, list);
+
+		if (!du->uprobe->ref_ctr_offset ||
+		    !valid_ref_ctr_vma(du->uprobe, vma))
+			continue;
+
+		vaddr = offset_to_vaddr(vma, du->uprobe->ref_ctr_offset);
+		ret = __update_ref_ctr(vma->vm_mm, vaddr, 1);
+		/* Record an error and continue. */
+		err = ret & !err ? ret : err;
+		delayed_uprobe_delete(du);
+	}
+	mutex_unlock(&delayed_uprobe_lock);
+	return err;
+}
+
 /*
  * Called from mmap_region/vma_adjust with mm->mmap_sem acquired.
  *
@@ -1072,7 +1282,13 @@ int uprobe_mmap(struct vm_area_struct *vma)
 	struct uprobe *uprobe, *u;
 	struct inode *inode;
 
-	if (no_uprobe_events() || !valid_vma(vma, true))
+	if (no_uprobe_events())
+		return 0;
+
+	if (vma->vm_flags & VM_WRITE)
+		delayed_uprobe_install(vma);
+
+	if (!valid_vma(vma, true))
 		return 0;
 
 	inode = file_inode(vma->vm_file);
@@ -1246,6 +1462,10 @@ void uprobe_clear_state(struct mm_struct *mm)
 {
 	struct xol_area *area = mm->uprobes_state.xol_area;
 
+	mutex_lock(&delayed_uprobe_lock);
+	delayed_uprobe_remove(NULL, mm);
+	mutex_unlock(&delayed_uprobe_lock);
+
 	if (!area)
 		return;
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index f054bd6a1c66..b431790779f7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -4614,7 +4614,7 @@ static const char readme_msg[] =
   "place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>\n"
 #endif
 #ifdef CONFIG_UPROBE_EVENTS
-	"\t    place: <path>:<offset>\n"
+  "   place (uprobe): <path>:<offset>[(ref_ctr_offset)]\n"
 #endif
 	"\t     args: <name>=fetcharg[:type]\n"
 	"\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n"
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index bf89a51e740d..bf2be098eb08 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -59,6 +59,7 @@ struct trace_uprobe {
 	struct inode			*inode;
 	char				*filename;
 	unsigned long			offset;
+	unsigned long			ref_ctr_offset;
 	unsigned long			nhit;
 	struct trace_probe		tp;
 };
@@ -364,10 +365,10 @@ static int register_trace_uprobe(struct trace_uprobe *tu)
 static int create_trace_uprobe(int argc, char **argv)
 {
 	struct trace_uprobe *tu;
-	char *arg, *event, *group, *filename;
+	char *arg, *event, *group, *filename, *rctr, *rctr_end;
 	char buf[MAX_EVENT_NAME_LEN];
 	struct path path;
-	unsigned long offset;
+	unsigned long offset, ref_ctr_offset;
 	bool is_delete, is_return;
 	int i, ret;
 
@@ -376,6 +377,7 @@ static int create_trace_uprobe(int argc, char **argv)
 	is_return = false;
 	event = NULL;
 	group = NULL;
+	ref_ctr_offset = 0;
 
 	/* argc must be >= 1 */
 	if (argv[0][0] == '-')
@@ -450,6 +452,26 @@ static int create_trace_uprobe(int argc, char **argv)
 		goto fail_address_parse;
 	}
 
+	/* Parse reference counter offset if specified. */
+	rctr = strchr(arg, '(');
+	if (rctr) {
+		rctr_end = strchr(rctr, ')');
+		if (rctr > rctr_end || *(rctr_end + 1) != 0) {
+			ret = -EINVAL;
+			pr_info("Invalid reference counter offset.\n");
+			goto fail_address_parse;
+		}
+
+		*rctr++ = '\0';
+		*rctr_end = '\0';
+		ret = kstrtoul(rctr, 0, &ref_ctr_offset);
+		if (ret) {
+			pr_info("Invalid reference counter offset.\n");
+			goto fail_address_parse;
+		}
+	}
+
+	/* Parse uprobe offset. */
 	ret = kstrtoul(arg, 0, &offset);
 	if (ret)
 		goto fail_address_parse;
@@ -484,6 +506,7 @@ static int create_trace_uprobe(int argc, char **argv)
 		goto fail_address_parse;
 	}
 	tu->offset = offset;
+	tu->ref_ctr_offset = ref_ctr_offset;
 	tu->path = path;
 	tu->filename = kstrdup(filename, GFP_KERNEL);
 
@@ -602,6 +625,9 @@ static int probes_seq_show(struct seq_file *m, void *v)
 			trace_event_name(&tu->tp.call), tu->filename,
 			(int)(sizeof(void *) * 2), tu->offset);
 
+	if (tu->ref_ctr_offset)
+		seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
+
 	for (i = 0; i < tu->tp.nr_args; i++)
 		seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
 
@@ -917,7 +943,13 @@ probe_event_enable(struct trace_uprobe *tu, struct trace_event_file *file,
 
 	tu->consumer.filter = filter;
 	tu->inode = d_real_inode(tu->path.dentry);
-	ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
+	if (tu->ref_ctr_offset) {
+		ret = uprobe_register_refctr(tu->inode, tu->offset,
+				tu->ref_ctr_offset, &tu->consumer);
+	} else {
+		ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
+	}
+
 	if (ret)
 		goto err_buffer;
 
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 4/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
  2018-07-16  8:47 [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
                   ` (2 preceding siblings ...)
  2018-07-16  8:47 ` [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
@ 2018-07-16  8:47 ` Ravi Bangoria
  2018-07-16  8:47 ` [PATCH v6 5/6] Uprobes/sdt: " Ravi Bangoria
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-16  8:47 UTC (permalink / raw)
  To: srikar, oleg, rostedt, mhiramat
  Cc: peterz, mingo, acme, alexander.shishkin, jolsa, namhyung,
	linux-kernel, ananth, alexis.berlemont, naveen.n.rao,
	linux-arm-kernel, linux-mips, linux, ralf, paul.burton,
	Ravi Bangoria

We assume to have only one reference counter for one uprobe.
Don't allow user to add multiple trace_uprobe entries having
same inode+offset but different reference counter.

Ex,
  # echo "p:sdt_tick/loop2 /home/ravi/tick:0x6e4(0x10036)" > uprobe_events
  # echo "p:sdt_tick/loop2_1 /home/ravi/tick:0x6e4(0xfffff)" >> uprobe_events
  bash: echo: write error: Invalid argument

  # dmesg
  trace_kprobe: Reference counter offset mismatch.

There are two exceptions though:
1) When user is trying to replace the old entry with the new
   one, we allow this if the new entry does not conflict with
   any other existing entry.
2) Any one of the reference counter offset is 0 while comparing
   new entry with old one.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
 kernel/trace/trace_uprobe.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index bf2be098eb08..9fafb183c49d 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -324,6 +324,38 @@ static int unregister_trace_uprobe(struct trace_uprobe *tu)
 	return 0;
 }
 
+/*
+ * Uprobe with multiple reference counter is not allowed. i.e.
+ * If inode and offset matches, reference counter offset *must*
+ * match as well. Though, there are two exceptions:
+ * 1) We are replacing old trace_uprobe with new one(same
+ *    group/event) then we don't care as far as the new entry
+ *    does not conflict with any other existing entry.
+ * 2) Any one of the reference counter offset is 0 while comparing
+ *    new entry with old one.
+ */
+static struct trace_uprobe *find_old_trace_uprobe(struct trace_uprobe *new)
+{
+	struct trace_uprobe *tmp, *old = NULL;
+	struct inode *new_inode = d_real_inode(new->path.dentry);
+
+	old = find_probe_event(trace_event_name(&new->tp.call),
+				new->tp.call.class->system);
+
+	list_for_each_entry(tmp, &uprobe_list, list) {
+		if ((old ? old != tmp : true) &&
+		    new_inode == d_real_inode(tmp->path.dentry) &&
+		    new->offset == tmp->offset &&
+		    new->ref_ctr_offset != tmp->ref_ctr_offset &&
+		    new->ref_ctr_offset != 0 &&
+		    tmp->ref_ctr_offset != 0) {
+			pr_warn("Reference counter offset mismatch.");
+			return ERR_PTR(-EINVAL);
+		}
+	}
+	return old;
+}
+
 /* Register a trace_uprobe and probe_event */
 static int register_trace_uprobe(struct trace_uprobe *tu)
 {
@@ -333,8 +365,12 @@ static int register_trace_uprobe(struct trace_uprobe *tu)
 	mutex_lock(&uprobe_lock);
 
 	/* register as an event */
-	old_tu = find_probe_event(trace_event_name(&tu->tp.call),
-			tu->tp.call.class->system);
+	old_tu = find_old_trace_uprobe(tu);
+	if (IS_ERR(old_tu)) {
+		ret = PTR_ERR(old_tu);
+		goto end;
+	}
+
 	if (old_tu) {
 		/* delete old event */
 		ret = unregister_trace_uprobe(old_tu);
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 5/6] Uprobes/sdt: Prevent multiple reference counter for same uprobe
  2018-07-16  8:47 [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
                   ` (3 preceding siblings ...)
  2018-07-16  8:47 ` [PATCH v6 4/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe Ravi Bangoria
@ 2018-07-16  8:47 ` Ravi Bangoria
  2018-07-25 11:08   ` Oleg Nesterov
  2018-07-16  8:47 ` [PATCH v6 6/6] perf probe: Support SDT markers having reference counter (semaphore) Ravi Bangoria
  2018-07-20 13:47 ` [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
  6 siblings, 1 reply; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-16  8:47 UTC (permalink / raw)
  To: srikar, oleg, rostedt, mhiramat
  Cc: peterz, mingo, acme, alexander.shishkin, jolsa, namhyung,
	linux-kernel, ananth, alexis.berlemont, naveen.n.rao,
	linux-arm-kernel, linux-mips, linux, ralf, paul.burton,
	Ravi Bangoria

We assume to have only one reference counter for one uprobe. Don't
allow user to register multiple uprobes having same inode+offset
but different reference counter.

Though, existing tools which already support SDT events creates
normal uprobe and updates reference counter on their own. Allow 0 as
a special value for reference counter offset. I.e. two uprobes, one
having ref_ctr_offset=0 and the other having non-zero ref_ctr_offset
can coexists. This gives user a flexibility to either depend on
kernel uprobe infrastructure to maintain reference counter or just
use normal uprobe and maintain reference counter on his own.

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
 kernel/events/uprobes.c | 52 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84da8512a974..563cc3e625b3 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -63,6 +63,8 @@ static struct percpu_rw_semaphore dup_mmap_sem;
 
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN	0
+/* Reference counter offset is reloaded with non-zero value. */
+#define REF_CTR_OFF_RELOADED	1
 
 struct uprobe {
 	struct rb_node		rb_node;	/* node in the rb tree */
@@ -476,9 +478,23 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
 		return ret;
 
 	ret = verify_opcode(old_page, vaddr, &opcode);
-	if (ret <= 0)
+	if (ret < 0)
 		goto put_old;
 
+	/*
+	 * If instruction is already patched but reference counter offset
+	 * has been reloaded to non-zero value, increment the reference
+	 * counter and return.
+	 */
+	if (ret == 0) {
+		if (is_register &&
+		    test_bit(REF_CTR_OFF_RELOADED, &uprobe->flags)) {
+			WARN_ON(!uprobe->ref_ctr_offset);
+			ret = update_ref_ctr(uprobe, mm, true);
+		}
+		goto put_old;
+	}
+
 	/* We are going to replace instruction, update ref_ctr. */
 	if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
 		ret = update_ref_ctr(uprobe, mm, is_register);
@@ -679,6 +695,30 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
 	cur_uprobe = insert_uprobe(uprobe);
 	/* a uprobe exists for this inode:offset combination */
 	if (cur_uprobe) {
+		/*
+		 * If inode+offset matches, ref_ctr_offset must match as
+		 * well. Though, 0 is a special value for ref_ctr_offset.
+		 */
+		if (cur_uprobe->ref_ctr_offset != uprobe->ref_ctr_offset &&
+		    cur_uprobe->ref_ctr_offset != 0 &&
+		    uprobe->ref_ctr_offset != 0) {
+			pr_warn("Err: Reference counter mismatch.\n");
+			put_uprobe(cur_uprobe);
+			kfree(uprobe);
+			return ERR_PTR(-EINVAL);
+		}
+
+		/*
+		 * If existing uprobe->ref_ctr_offset is 0 and user is
+		 * registering same uprobe with non-zero ref_ctr_offset,
+		 * set new ref_ctr_offset to existing uprobe.
+		 */
+
+		if (!cur_uprobe->ref_ctr_offset && uprobe->ref_ctr_offset) {
+			cur_uprobe->ref_ctr_offset = uprobe->ref_ctr_offset;
+			set_bit(REF_CTR_OFF_RELOADED, &cur_uprobe->flags);
+		}
+
 		kfree(uprobe);
 		uprobe = cur_uprobe;
 	}
@@ -971,6 +1011,7 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 	bool is_register = !!new;
 	struct map_info *info;
 	int err = 0;
+	bool installed = false;
 
 	percpu_down_write(&dup_mmap_sem);
 	info = build_map_info(uprobe->inode->i_mapping,
@@ -1000,8 +1041,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 		if (is_register) {
 			/* consult only the "caller", new consumer. */
 			if (consumer_filter(new,
-					UPROBE_FILTER_REGISTER, mm))
+					UPROBE_FILTER_REGISTER, mm)) {
 				err = install_breakpoint(uprobe, mm, vma, info->vaddr);
+				installed = true;
+			}
 		} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
 			if (!filter_chain(uprobe,
 					UPROBE_FILTER_UNREGISTER, mm))
@@ -1016,6 +1059,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 	}
  out:
 	percpu_up_write(&dup_mmap_sem);
+	if (installed)
+		clear_bit(REF_CTR_OFF_RELOADED, &uprobe->flags);
 	return err;
 }
 
@@ -1093,6 +1138,9 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
 	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
 	if (!uprobe)
 		return -ENOMEM;
+	if (IS_ERR(uprobe))
+		return PTR_ERR(uprobe);
+
 	/*
 	 * We can race with uprobe_unregister()->delete_uprobe().
 	 * Check uprobe_is_active() and retry if it is false.
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH v6 6/6] perf probe: Support SDT markers having reference counter (semaphore)
  2018-07-16  8:47 [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
                   ` (4 preceding siblings ...)
  2018-07-16  8:47 ` [PATCH v6 5/6] Uprobes/sdt: " Ravi Bangoria
@ 2018-07-16  8:47 ` Ravi Bangoria
  2018-07-20 13:47 ` [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
  6 siblings, 0 replies; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-16  8:47 UTC (permalink / raw)
  To: srikar, oleg, rostedt, mhiramat
  Cc: peterz, mingo, acme, alexander.shishkin, jolsa, namhyung,
	linux-kernel, ananth, alexis.berlemont, naveen.n.rao,
	linux-arm-kernel, linux-mips, linux, ralf, paul.burton,
	Ravi Bangoria

With this, perf buildid-cache will save SDT markers with reference
counter in probe cache. Perf probe will be able to probe markers
having reference counter. Ex,

  # readelf -n /tmp/tick | grep -A1 loop2
    Name: loop2
    ... Semaphore: 0x0000000010020036

  # ./perf buildid-cache --add /tmp/tick
  # ./perf probe sdt_tick:loop2
  # ./perf stat -e sdt_tick:loop2 /tmp/tick
    hi: 0
    hi: 1
    hi: 2
    ^C
     Performance counter stats for '/tmp/tick':
                 3      sdt_tick:loop2
       2.561851452 seconds time elapsed

Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 tools/perf/util/probe-event.c | 39 ++++++++++++++++++++++++++++++++----
 tools/perf/util/probe-event.h |  1 +
 tools/perf/util/probe-file.c  | 34 ++++++++++++++++++++++++++------
 tools/perf/util/probe-file.h  |  1 +
 tools/perf/util/symbol-elf.c  | 46 ++++++++++++++++++++++++++++++++-----------
 tools/perf/util/symbol.h      |  7 +++++++
 6 files changed, 106 insertions(+), 22 deletions(-)

diff --git a/tools/perf/util/probe-event.c b/tools/perf/util/probe-event.c
index f119eb628dbb..e86f8be89157 100644
--- a/tools/perf/util/probe-event.c
+++ b/tools/perf/util/probe-event.c
@@ -1819,6 +1819,12 @@ int parse_probe_trace_command(const char *cmd, struct probe_trace_event *tev)
 			tp->offset = strtoul(fmt2_str, NULL, 10);
 	}
 
+	if (tev->uprobes) {
+		fmt2_str = strchr(p, '(');
+		if (fmt2_str)
+			tp->ref_ctr_offset = strtoul(fmt2_str + 1, NULL, 0);
+	}
+
 	tev->nargs = argc - 2;
 	tev->args = zalloc(sizeof(struct probe_trace_arg) * tev->nargs);
 	if (tev->args == NULL) {
@@ -2012,6 +2018,22 @@ static int synthesize_probe_trace_arg(struct probe_trace_arg *arg,
 	return err;
 }
 
+static int
+synthesize_uprobe_trace_def(struct probe_trace_event *tev, struct strbuf *buf)
+{
+	struct probe_trace_point *tp = &tev->point;
+	int err;
+
+	err = strbuf_addf(buf, "%s:0x%lx", tp->module, tp->address);
+
+	if (err >= 0 && tp->ref_ctr_offset) {
+		if (!uprobe_ref_ctr_is_supported())
+			return -1;
+		err = strbuf_addf(buf, "(0x%lx)", tp->ref_ctr_offset);
+	}
+	return err >= 0 ? 0 : -1;
+}
+
 char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 {
 	struct probe_trace_point *tp = &tev->point;
@@ -2041,15 +2063,17 @@ char *synthesize_probe_trace_command(struct probe_trace_event *tev)
 	}
 
 	/* Use the tp->address for uprobes */
-	if (tev->uprobes)
-		err = strbuf_addf(&buf, "%s:0x%lx", tp->module, tp->address);
-	else if (!strncmp(tp->symbol, "0x", 2))
+	if (tev->uprobes) {
+		err = synthesize_uprobe_trace_def(tev, &buf);
+	} else if (!strncmp(tp->symbol, "0x", 2)) {
 		/* Absolute address. See try_to_find_absolute_address() */
 		err = strbuf_addf(&buf, "%s%s0x%lx", tp->module ?: "",
 				  tp->module ? ":" : "", tp->address);
-	else
+	} else {
 		err = strbuf_addf(&buf, "%s%s%s+%lu", tp->module ?: "",
 				tp->module ? ":" : "", tp->symbol, tp->offset);
+	}
+
 	if (err)
 		goto error;
 
@@ -2633,6 +2657,13 @@ static void warn_uprobe_event_compat(struct probe_trace_event *tev)
 {
 	int i;
 	char *buf = synthesize_probe_trace_command(tev);
+	struct probe_trace_point *tp = &tev->point;
+
+	if (tp->ref_ctr_offset && !uprobe_ref_ctr_is_supported()) {
+		pr_warning("A semaphore is associated with %s:%s and "
+			   "seems your kernel doesn't support it.\n",
+			   tev->group, tev->event);
+	}
 
 	/* Old uprobe event doesn't support memory dereference */
 	if (!tev->uprobes || tev->nargs == 0 || !buf)
diff --git a/tools/perf/util/probe-event.h b/tools/perf/util/probe-event.h
index 45b14f020558..15a98c3a2a2f 100644
--- a/tools/perf/util/probe-event.h
+++ b/tools/perf/util/probe-event.h
@@ -27,6 +27,7 @@ struct probe_trace_point {
 	char		*symbol;	/* Base symbol */
 	char		*module;	/* Module name */
 	unsigned long	offset;		/* Offset from symbol */
+	unsigned long	ref_ctr_offset;	/* SDT reference counter offset */
 	unsigned long	address;	/* Actual address of the trace point */
 	bool		retprobe;	/* Return probe flag */
 };
diff --git a/tools/perf/util/probe-file.c b/tools/perf/util/probe-file.c
index b76088fadf3d..aac7817d9e14 100644
--- a/tools/perf/util/probe-file.c
+++ b/tools/perf/util/probe-file.c
@@ -696,8 +696,16 @@ int probe_cache__add_entry(struct probe_cache *pcache,
 #ifdef HAVE_GELF_GETNOTE_SUPPORT
 static unsigned long long sdt_note__get_addr(struct sdt_note *note)
 {
-	return note->bit32 ? (unsigned long long)note->addr.a32[0]
-		 : (unsigned long long)note->addr.a64[0];
+	return note->bit32 ?
+		(unsigned long long)note->addr.a32[SDT_NOTE_IDX_LOC] :
+		(unsigned long long)note->addr.a64[SDT_NOTE_IDX_LOC];
+}
+
+static unsigned long long sdt_note__get_ref_ctr_offset(struct sdt_note *note)
+{
+	return note->bit32 ?
+		(unsigned long long)note->addr.a32[SDT_NOTE_IDX_REFCTR] :
+		(unsigned long long)note->addr.a64[SDT_NOTE_IDX_REFCTR];
 }
 
 static const char * const type_to_suffix[] = {
@@ -775,14 +783,21 @@ static char *synthesize_sdt_probe_command(struct sdt_note *note,
 {
 	struct strbuf buf;
 	char *ret = NULL, **args;
-	int i, args_count;
+	int i, args_count, err;
+	unsigned long long ref_ctr_offset;
 
 	if (strbuf_init(&buf, 32) < 0)
 		return NULL;
 
-	if (strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
-				sdtgrp, note->name, pathname,
-				sdt_note__get_addr(note)) < 0)
+	err = strbuf_addf(&buf, "p:%s/%s %s:0x%llx",
+			sdtgrp, note->name, pathname,
+			sdt_note__get_addr(note));
+
+	ref_ctr_offset = sdt_note__get_ref_ctr_offset(note);
+	if (ref_ctr_offset && err >= 0)
+		err = strbuf_addf(&buf, "(0x%llx)", ref_ctr_offset);
+
+	if (err < 0)
 		goto error;
 
 	if (!note->args)
@@ -998,6 +1013,7 @@ int probe_cache__show_all_caches(struct strfilter *filter)
 enum ftrace_readme {
 	FTRACE_README_PROBE_TYPE_X = 0,
 	FTRACE_README_KRETPROBE_OFFSET,
+	FTRACE_README_UPROBE_REF_CTR,
 	FTRACE_README_END,
 };
 
@@ -1009,6 +1025,7 @@ static struct {
 	[idx] = {.pattern = pat, .avail = false}
 	DEFINE_TYPE(FTRACE_README_PROBE_TYPE_X, "*type: * x8/16/32/64,*"),
 	DEFINE_TYPE(FTRACE_README_KRETPROBE_OFFSET, "*place (kretprobe): *"),
+	DEFINE_TYPE(FTRACE_README_UPROBE_REF_CTR, "*ref_ctr_offset*"),
 };
 
 static bool scan_ftrace_readme(enum ftrace_readme type)
@@ -1064,3 +1081,8 @@ bool kretprobe_offset_is_supported(void)
 {
 	return scan_ftrace_readme(FTRACE_README_KRETPROBE_OFFSET);
 }
+
+bool uprobe_ref_ctr_is_supported(void)
+{
+	return scan_ftrace_readme(FTRACE_README_UPROBE_REF_CTR);
+}
diff --git a/tools/perf/util/probe-file.h b/tools/perf/util/probe-file.h
index 63f29b1d22c1..2a249182f2a6 100644
--- a/tools/perf/util/probe-file.h
+++ b/tools/perf/util/probe-file.h
@@ -69,6 +69,7 @@ struct probe_cache_entry *probe_cache__find_by_name(struct probe_cache *pcache,
 int probe_cache__show_all_caches(struct strfilter *filter);
 bool probe_type_is_available(enum probe_type type);
 bool kretprobe_offset_is_supported(void);
+bool uprobe_ref_ctr_is_supported(void);
 #else	/* ! HAVE_LIBELF_SUPPORT */
 static inline struct probe_cache *probe_cache__new(const char *tgt __maybe_unused, struct nsinfo *nsi __maybe_unused)
 {
diff --git a/tools/perf/util/symbol-elf.c b/tools/perf/util/symbol-elf.c
index 29770ea61768..0281d5e2cd67 100644
--- a/tools/perf/util/symbol-elf.c
+++ b/tools/perf/util/symbol-elf.c
@@ -1947,6 +1947,34 @@ void kcore_extract__delete(struct kcore_extract *kce)
 }
 
 #ifdef HAVE_GELF_GETNOTE_SUPPORT
+
+static void sdt_adjust_loc(struct sdt_note *tmp, GElf_Addr base_off)
+{
+	if (!base_off)
+		return;
+
+	if (tmp->bit32)
+		tmp->addr.a32[SDT_NOTE_IDX_LOC] =
+			tmp->addr.a32[SDT_NOTE_IDX_LOC] + base_off -
+			tmp->addr.a32[SDT_NOTE_IDX_BASE];
+	else
+		tmp->addr.a64[SDT_NOTE_IDX_LOC] =
+			tmp->addr.a64[SDT_NOTE_IDX_LOC] + base_off -
+			tmp->addr.a64[SDT_NOTE_IDX_BASE];
+}
+
+static void sdt_adjust_refctr(struct sdt_note *tmp, GElf_Addr base_addr,
+			      GElf_Addr base_off)
+{
+	if (!base_off)
+		return;
+
+	if (tmp->bit32 && tmp->addr.a32[SDT_NOTE_IDX_REFCTR])
+		tmp->addr.a32[SDT_NOTE_IDX_REFCTR] -= (base_addr - base_off);
+	else if (tmp->addr.a64[SDT_NOTE_IDX_REFCTR])
+		tmp->addr.a64[SDT_NOTE_IDX_REFCTR] -= (base_addr - base_off);
+}
+
 /**
  * populate_sdt_note : Parse raw data and identify SDT note
  * @elf: elf of the opened file
@@ -1964,7 +1992,6 @@ static int populate_sdt_note(Elf **elf, const char *data, size_t len,
 	const char *provider, *name, *args;
 	struct sdt_note *tmp = NULL;
 	GElf_Ehdr ehdr;
-	GElf_Addr base_off = 0;
 	GElf_Shdr shdr;
 	int ret = -EINVAL;
 
@@ -2060,17 +2087,12 @@ static int populate_sdt_note(Elf **elf, const char *data, size_t len,
 	 * base address in the description of the SDT note. If its different,
 	 * then accordingly, adjust the note location.
 	 */
-	if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_BASE_SCN, NULL)) {
-		base_off = shdr.sh_offset;
-		if (base_off) {
-			if (tmp->bit32)
-				tmp->addr.a32[0] = tmp->addr.a32[0] + base_off -
-					tmp->addr.a32[1];
-			else
-				tmp->addr.a64[0] = tmp->addr.a64[0] + base_off -
-					tmp->addr.a64[1];
-		}
-	}
+	if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_BASE_SCN, NULL))
+		sdt_adjust_loc(tmp, shdr.sh_offset);
+
+	/* Adjust reference counter offset */
+	if (elf_section_by_name(*elf, &ehdr, &shdr, SDT_PROBES_SCN, NULL))
+		sdt_adjust_refctr(tmp, shdr.sh_addr, shdr.sh_offset);
 
 	list_add_tail(&tmp->note_list, sdt_notes);
 	return 0;
diff --git a/tools/perf/util/symbol.h b/tools/perf/util/symbol.h
index f25fae4b5743..20f49779116b 100644
--- a/tools/perf/util/symbol.h
+++ b/tools/perf/util/symbol.h
@@ -379,12 +379,19 @@ int get_sdt_note_list(struct list_head *head, const char *target);
 int cleanup_sdt_note_list(struct list_head *sdt_notes);
 int sdt_notes__get_count(struct list_head *start);
 
+#define SDT_PROBES_SCN ".probes"
 #define SDT_BASE_SCN ".stapsdt.base"
 #define SDT_NOTE_SCN  ".note.stapsdt"
 #define SDT_NOTE_TYPE 3
 #define SDT_NOTE_NAME "stapsdt"
 #define NR_ADDR 3
 
+enum {
+	SDT_NOTE_IDX_LOC = 0,
+	SDT_NOTE_IDX_BASE,
+	SDT_NOTE_IDX_REFCTR,
+};
+
 struct mem_info *mem_info__new(void);
 struct mem_info *mem_info__get(struct mem_info *mi);
 void   mem_info__put(struct mem_info *mi);
-- 
2.14.4


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore)
  2018-07-16  8:47 [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
                   ` (5 preceding siblings ...)
  2018-07-16  8:47 ` [PATCH v6 6/6] perf probe: Support SDT markers having reference counter (semaphore) Ravi Bangoria
@ 2018-07-20 13:47 ` Ravi Bangoria
  6 siblings, 0 replies; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-20 13:47 UTC (permalink / raw)
  To: srikar, oleg, mhiramat, Song Liu
  Cc: rostedt, peterz, mingo, acme, alexander.shishkin, jolsa,
	namhyung, linux-kernel, ananth, alexis.berlemont, naveen.n.rao,
	linux-arm-kernel, linux-mips, linux, ralf, paul.burton,
	Ravi Bangoria

Oleg, Srikar, Masami, Song,

Any feedback please? :)

Thanks,
Ravi

On 07/16/2018 02:17 PM, Ravi Bangoria wrote:
> Userspace Statically Defined Tracepoints[1] are dtrace style markers
> inside userspace applications. Applications like PostgreSQL, MySQL,
> Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
> have these markers embedded in them. These markers are added by developer
> at important places in the code. Each marker source expands to a single
> nop instruction in the compiled code but there may be additional
> overhead for computing the marker arguments which expands to couple of
> instructions. In case the overhead is more, execution of it can be
> omitted by runtime if() condition when no one is tracing on the marker:
> 
>     if (reference_counter > 0) {
>         Execute marker instructions;
>     }
> 
> Default value of reference counter is 0. Tracer has to increment the 
> reference counter before tracing on a marker and decrement it when
> done with the tracing.
> 
> Currently, perf tool has limited supports for SDT markers. I.e. it
> can not trace markers surrounded by reference counter. Also, it's
> not easy to add reference counter logic in userspace tool like perf,
> so basic idea for this patchset is to add reference counter logic in
> the a uprobe infrastructure. Ex,[2]
> 
>   # cat tick.c
>     ... 
>     for (i = 0; i < 100; i++) {
> 	DTRACE_PROBE1(tick, loop1, i);
>         if (TICK_LOOP2_ENABLED()) {
>             DTRACE_PROBE1(tick, loop2, i); 
>         }
>         printf("hi: %d\n", i); 
>         sleep(1);
>     }   
>     ... 
> 
> Here tick:loop1 is marker without reference counter where as tick:loop2
> is surrounded by reference counter condition.
> 
>   # perf buildid-cache --add /tmp/tick
>   # perf probe sdt_tick:loop1
>   # perf probe sdt_tick:loop2
> 
>   # perf stat -e sdt_tick:loop1,sdt_tick:loop2 -- /tmp/tick
>   hi: 0
>   hi: 1
>   hi: 2
>   ^C
>   Performance counter stats for '/tmp/tick':
>              3      sdt_tick:loop1
>              0      sdt_tick:loop2
>      2.747086086 seconds time elapsed
> 
> Perf failed to record data for tick:loop2. Same experiment with this
> patch series:
> 
>   # ./perf buildid-cache --add /tmp/tick
>   # ./perf probe sdt_tick:loop2
>   # ./perf stat -e sdt_tick:loop2 /tmp/tick
>     hi: 0
>     hi: 1
>     hi: 2
>     ^C  
>      Performance counter stats for '/tmp/tick':
>                  3      sdt_tick:loop2
>        2.561851452 seconds time elapsed
> 
> 
> v6 changes:
>  - Do not export struct uprobe outside of uprobe.c. Instead, use
>    container_of(arch_uprobe) to regain uprobe in uprobe_write_opcode().
>  - Allow 0 as a special value for reference counter _offset_. I.e.
>    two uprobes, one having ref_ctr_offset=0 and the other having
>    non-zero ref_ctr_offset can coexists.
>  - If vma holding reference counter is not present while patching an
>    instruction, we add that uprobe in delayed_uprobe_list. When
>    appropriate mapping is created, we increment the reference counter
>    and remove uprobe from delayed_uprobe_list. While doing all this,
>    v5 was searching for all such uprobes in uprobe_tree which is not 
>    require. Also, uprobes are stored in rbtree with inode+offset as
>    the key and v5 was searching uprobe based on inode+ref_ctr_offset
>    which is wrong too. Fix this by directly looing on delayed_uprobe_list.
>  - Consider VM_SHARED vma as invalid for reference counter. Return
>    false from valid_ref_ctr_vma() if vma->vm_flags has VM_SHARED set.
>  - No need to use FOLL_FORCE in get_user_pages_remote() while getting
>    a page to update reference counter. Remove it.
>  - Do not mention usage of reference counter in Documentation/.
> 
> 
> v5 can be found at: https://lkml.org/lkml/2018/6/28/51
> 
> Ravi Bangoria (6):
>   Uprobes: Simplify uprobe_register() body
>   Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
>   Uprobes: Support SDT markers having reference count (semaphore)
>   trace_uprobe/sdt: Prevent multiple reference counter for same uprobe
>   Uprobes/sdt: Prevent multiple reference counter for same uprobe
>   perf probe: Support SDT markers having reference counter (semaphore)
> 
>  arch/arm/probes/uprobes/core.c |   2 +-
>  arch/mips/kernel/uprobes.c     |   2 +-
>  include/linux/uprobes.h        |   7 +-
>  kernel/events/uprobes.c        | 358 ++++++++++++++++++++++++++++++++++++-----
>  kernel/trace/trace.c           |   2 +-
>  kernel/trace/trace_uprobe.c    |  78 ++++++++-
>  tools/perf/util/probe-event.c  |  39 ++++-
>  tools/perf/util/probe-event.h  |   1 +
>  tools/perf/util/probe-file.c   |  34 +++-
>  tools/perf/util/probe-file.h   |   1 +
>  tools/perf/util/symbol-elf.c   |  46 ++++--
>  tools/perf/util/symbol.h       |   7 +
>  12 files changed, 503 insertions(+), 74 deletions(-)
> 


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore)
  2018-07-16  8:47 ` [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
@ 2018-07-23 16:26   ` Oleg Nesterov
  2018-07-24  3:34     ` Ravi Bangoria
  2018-07-24 14:21   ` Masami Hiramatsu
  1 sibling, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2018-07-23 16:26 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: srikar, rostedt, mhiramat, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, linux-kernel, ananth,
	alexis.berlemont, naveen.n.rao, linux-arm-kernel, linux-mips,
	linux, ralf, paul.burton

I have a mixed feeling about this series... I'll try to summarise my thinking
tomorrow, but I do not see any obvious problem so far. Although I have some
concerns about 5/6, I need to re-read it after sleep.


On 07/16, Ravi Bangoria wrote:
>
> +static int delayed_uprobe_install(struct vm_area_struct *vma)
> +{
> +	struct list_head *pos, *q;
> +	struct delayed_uprobe *du;
> +	unsigned long vaddr;
> +	int ret = 0, err = 0;
> +
> +	mutex_lock(&delayed_uprobe_lock);
> +	list_for_each_safe(pos, q, &delayed_uprobe_list) {
> +		du = list_entry(pos, struct delayed_uprobe, list);
> +
> +		if (!du->uprobe->ref_ctr_offset ||

Is it possible to see ->ref_ctr_offset == 0 in delayed_uprobe_list?

> @@ -1072,7 +1282,13 @@ int uprobe_mmap(struct vm_area_struct *vma)
>  	struct uprobe *uprobe, *u;
>  	struct inode *inode;
>  
> -	if (no_uprobe_events() || !valid_vma(vma, true))
> +	if (no_uprobe_events())
> +		return 0;
> +
> +	if (vma->vm_flags & VM_WRITE)
> +		delayed_uprobe_install(vma);

Obviously not nice performance-wise... OK, I do not know if it will actually
hurt in practice and probably we can use the better data structures if necessary.
But can't we check MMF_HAS_UPROBES at least? I mean,

	if (vma->vm_flags & VM_WRITE && test_bit(MMF_HAS_UPROBES, &vma->vm_mm->flags))
		delayed_uprobe_install(vma);

?


Perhaps we can even add another flag later, MMF_HAS_DELAYED_UPROBES set by
delayed_uprobe_add().

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore)
  2018-07-23 16:26   ` Oleg Nesterov
@ 2018-07-24  3:34     ` Ravi Bangoria
  2018-07-27 13:59       ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-24  3:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: srikar, rostedt, mhiramat, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, linux-kernel, ananth,
	alexis.berlemont, naveen.n.rao, linux-arm-kernel, linux-mips,
	linux, ralf, paul.burton, Ravi Bangoria

Hi Oleg,

On 07/23/2018 09:56 PM, Oleg Nesterov wrote:
> I have a mixed feeling about this series... I'll try to summarise my thinking
> tomorrow, but I do not see any obvious problem so far. Although I have some
> concerns about 5/6, I need to re-read it after sleep.

Sure.

> 
> 
> On 07/16, Ravi Bangoria wrote:
>>
>> +static int delayed_uprobe_install(struct vm_area_struct *vma)
>> +{
>> +	struct list_head *pos, *q;
>> +	struct delayed_uprobe *du;
>> +	unsigned long vaddr;
>> +	int ret = 0, err = 0;
>> +
>> +	mutex_lock(&delayed_uprobe_lock);
>> +	list_for_each_safe(pos, q, &delayed_uprobe_list) {
>> +		du = list_entry(pos, struct delayed_uprobe, list);
>> +
>> +		if (!du->uprobe->ref_ctr_offset ||
> 
> Is it possible to see ->ref_ctr_offset == 0 in delayed_uprobe_list?

I'll remove this check.

> 
>> @@ -1072,7 +1282,13 @@ int uprobe_mmap(struct vm_area_struct *vma)
>>  	struct uprobe *uprobe, *u;
>>  	struct inode *inode;
>>  
>> -	if (no_uprobe_events() || !valid_vma(vma, true))
>> +	if (no_uprobe_events())
>> +		return 0;
>> +
>> +	if (vma->vm_flags & VM_WRITE)
>> +		delayed_uprobe_install(vma);
> 
> Obviously not nice performance-wise... OK, I do not know if it will actually
> hurt in practice and probably we can use the better data structures if necessary.
> But can't we check MMF_HAS_UPROBES at least? I mean,
> 
> 	if (vma->vm_flags & VM_WRITE && test_bit(MMF_HAS_UPROBES, &vma->vm_mm->flags))
> 		delayed_uprobe_install(vma);
> 
> ?

Yes.

> 
> 
> Perhaps we can even add another flag later, MMF_HAS_DELAYED_UPROBES set by
> delayed_uprobe_add().

Yes, good idea.

Thanks for the review,
Ravi


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore)
  2018-07-16  8:47 ` [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
  2018-07-23 16:26   ` Oleg Nesterov
@ 2018-07-24 14:21   ` Masami Hiramatsu
  1 sibling, 0 replies; 15+ messages in thread
From: Masami Hiramatsu @ 2018-07-24 14:21 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: srikar, oleg, rostedt, peterz, mingo, acme, alexander.shishkin,
	jolsa, namhyung, linux-kernel, ananth, alexis.berlemont,
	naveen.n.rao, linux-arm-kernel, linux-mips, linux, ralf,
	paul.burton

Hi Ravi,

Thank you for updating the series. At least trace_uprobe side,
it looks good to me ;)

Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>

Thanks!

On Mon, 16 Jul 2018 14:17:03 +0530
Ravi Bangoria <ravi.bangoria@linux.ibm.com> wrote:

> Userspace Statically Defined Tracepoints[1] are dtrace style markers
> inside userspace applications. Applications like PostgreSQL, MySQL,
> Pthread, Perl, Python, Java, Ruby, Node.js, libvirt, QEMU, glib etc
> have these markers embedded in them. These markers are added by developer
> at important places in the code. Each marker source expands to a single
> nop instruction in the compiled code but there may be additional
> overhead for computing the marker arguments which expands to couple of
> instructions. In case the overhead is more, execution of it can be
> omitted by runtime if() condition when no one is tracing on the marker:
> 
>     if (reference_counter > 0) {
>         Execute marker instructions;
>     }
> 
> Default value of reference counter is 0. Tracer has to increment the
> reference counter before tracing on a marker and decrement it when
> done with the tracing.
> 
> Implement the reference counter logic in core uprobe. User will be
> able to use it from trace_uprobe as well as from kernel module. New
> trace_uprobe definition with reference counter will now be:
> 
>     <path>:<offset>[(ref_ctr_offset)]
> 
> where ref_ctr_offset is an optional field. For kernel module, new
> variant of uprobe_register() has been introduced:
> 
>     uprobe_register_refctr(inode, offset, ref_ctr_offset, consumer)
> 
> No new variant for uprobe_unregister() because it's assumed to have
> only one reference counter for one uprobe.
> 
> [1] https://sourceware.org/systemtap/wiki/UserSpaceProbeImplementation
> 
> Note: 'reference counter' is called as 'semaphore' in original Dtrace
> (or Systemtap, bcc and even in ELF) documentation and code. But the
> term 'semaphore' is misleading in this context. This is just a counter
> used to hold number of tracers tracing on a marker. This is not really
> used for any synchronization. So we are referring it as 'reference
> counter' in kernel / perf code.
> 
> Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
> ---
>  include/linux/uprobes.h     |   5 +
>  kernel/events/uprobes.c     | 232 ++++++++++++++++++++++++++++++++++++++++++--
>  kernel/trace/trace.c        |   2 +-
>  kernel/trace/trace_uprobe.c |  38 +++++++-
>  4 files changed, 267 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
> index bb9d2084af03..103a48a48872 100644
> --- a/include/linux/uprobes.h
> +++ b/include/linux/uprobes.h
> @@ -123,6 +123,7 @@ extern unsigned long uprobe_get_swbp_addr(struct pt_regs *regs);
>  extern unsigned long uprobe_get_trap_addr(struct pt_regs *regs);
>  extern int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm, unsigned long vaddr, uprobe_opcode_t);
>  extern int uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
> +extern int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc);
>  extern int uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool);
>  extern void uprobe_unregister(struct inode *inode, loff_t offset, struct uprobe_consumer *uc);
>  extern int uprobe_mmap(struct vm_area_struct *vma);
> @@ -160,6 +161,10 @@ uprobe_register(struct inode *inode, loff_t offset, struct uprobe_consumer *uc)
>  {
>  	return -ENOSYS;
>  }
> +static inline int uprobe_register_refctr(struct inode *inode, loff_t offset, loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> +{
> +	return -ENOSYS;
> +}
>  static inline int
>  uprobe_apply(struct inode *inode, loff_t offset, struct uprobe_consumer *uc, bool add)
>  {
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index c0418ba52ba8..84da8512a974 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -73,6 +73,7 @@ struct uprobe {
>  	struct uprobe_consumer	*consumers;
>  	struct inode		*inode;		/* Also hold a ref to inode */
>  	loff_t			offset;
> +	loff_t			ref_ctr_offset;
>  	unsigned long		flags;
>  
>  	/*
> @@ -88,6 +89,15 @@ struct uprobe {
>  	struct arch_uprobe	arch;
>  };
>  
> +struct delayed_uprobe {
> +	struct list_head list;
> +	struct uprobe *uprobe;
> +	struct mm_struct *mm;
> +};
> +
> +static DEFINE_MUTEX(delayed_uprobe_lock);
> +static LIST_HEAD(delayed_uprobe_list);
> +
>  /*
>   * Execute out of line area: anonymous executable mapping installed
>   * by the probed task to execute the copy of the original instruction
> @@ -282,6 +292,154 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
>  	return 1;
>  }
>  
> +static struct delayed_uprobe *
> +delayed_uprobe_check(struct uprobe *uprobe, struct mm_struct *mm)
> +{
> +	struct delayed_uprobe *du;
> +
> +	list_for_each_entry(du, &delayed_uprobe_list, list)
> +		if (du->uprobe == uprobe && du->mm == mm)
> +			return du;
> +	return NULL;
> +}
> +
> +static int delayed_uprobe_add(struct uprobe *uprobe, struct mm_struct *mm)
> +{
> +	struct delayed_uprobe *du;
> +
> +	if (delayed_uprobe_check(uprobe, mm))
> +		return 0;
> +
> +	du  = kzalloc(sizeof(*du), GFP_KERNEL);
> +	if (!du)
> +		return -ENOMEM;
> +
> +	du->uprobe = uprobe;
> +	du->mm = mm;
> +	list_add(&du->list, &delayed_uprobe_list);
> +	return 0;
> +}
> +
> +static void delayed_uprobe_delete(struct delayed_uprobe *du)
> +{
> +	if (!du)
> +		return;
> +	list_del(&du->list);
> +	kfree(du);
> +}
> +
> +static void delayed_uprobe_remove(struct uprobe *uprobe, struct mm_struct *mm)
> +{
> +	struct list_head *pos, *q;
> +	struct delayed_uprobe *du;
> +
> +	list_for_each_safe(pos, q, &delayed_uprobe_list) {
> +		du = list_entry(pos, struct delayed_uprobe, list);
> +		if (uprobe) {
> +			if (du->uprobe == uprobe && du->mm == mm)
> +				delayed_uprobe_delete(du);
> +		} else if (du->mm == mm) {
> +			delayed_uprobe_delete(du);
> +		}
> +	}
> +}
> +
> +static bool valid_ref_ctr_vma(struct uprobe *uprobe,
> +			      struct vm_area_struct *vma)
> +{
> +	unsigned long vaddr = offset_to_vaddr(vma, uprobe->ref_ctr_offset);
> +
> +	return uprobe->ref_ctr_offset &&
> +		vma->vm_file &&
> +		file_inode(vma->vm_file) == uprobe->inode &&
> +		vma->vm_flags & VM_WRITE &&
> +		!(vma->vm_flags & VM_SHARED) &&
> +		vma->vm_start <= vaddr &&
> +		vma->vm_end > vaddr;
> +}
> +
> +static struct vm_area_struct *
> +find_ref_ctr_vma(struct uprobe *uprobe, struct mm_struct *mm)
> +{
> +	struct vm_area_struct *tmp;
> +
> +	for (tmp = mm->mmap; tmp; tmp = tmp->vm_next)
> +		if (valid_ref_ctr_vma(uprobe, tmp))
> +			return tmp;
> +
> +	return NULL;
> +}
> +
> +static int
> +__update_ref_ctr(struct mm_struct *mm, unsigned long vaddr, short d)
> +{
> +	void *kaddr;
> +	struct page *page;
> +	struct vm_area_struct *vma;
> +	int ret = 0;
> +	short *ptr;
> +
> +	if (vaddr == 0 || d == 0)
> +		return -EINVAL;
> +
> +	ret = get_user_pages_remote(NULL, mm, vaddr, 1,
> +			FOLL_WRITE, &page, &vma, NULL);
> +	if (unlikely(ret <= 0)) {
> +		/*
> +		 * We are asking for 1 page. If get_user_pages_remote() fails,
> +		 * it may return 0, in that case we have to return error.
> +		 */
> +		ret = (ret == 0) ? -EBUSY : ret;
> +		pr_warn("Failed to %s ref_ctr. (%d)\n",
> +			d > 0 ? "increment" : "decrement", ret);
> +		return ret;
> +	}
> +
> +	kaddr = kmap_atomic(page);
> +	ptr = kaddr + (vaddr & ~PAGE_MASK);
> +
> +	if (unlikely(*ptr + d < 0)) {
> +		pr_warn("ref_ctr going negative. vaddr: 0x%lx, "
> +			"curr val: %d, delta: %d\n", vaddr, *ptr, d);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	*ptr += d;
> +	ret = 0;
> +out:
> +	kunmap_atomic(kaddr);
> +	put_page(page);
> +	return ret;
> +}
> +
> +static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
> +			  bool is_register)
> +{
> +	struct vm_area_struct *rc_vma;
> +	unsigned long rc_vaddr;
> +	int ret = 0;
> +
> +	rc_vma = find_ref_ctr_vma(uprobe, mm);
> +
> +	if (rc_vma) {
> +		rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
> +		ret = __update_ref_ctr(mm, rc_vaddr, is_register ? 1 : -1);
> +
> +		if (is_register)
> +			return ret;
> +	}
> +
> +	mutex_lock(&delayed_uprobe_lock);
> +	if (is_register)
> +		ret = delayed_uprobe_add(uprobe, mm);
> +	else
> +		delayed_uprobe_remove(uprobe, mm);
> +	mutex_unlock(&delayed_uprobe_lock);
> +
> +	return ret;
> +}
> +
>  /*
>   * NOTE:
>   * Expect the breakpoint instruction to be the smallest size instruction for
> @@ -302,9 +460,13 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
>  int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  			unsigned long vaddr, uprobe_opcode_t opcode)
>  {
> +	struct uprobe *uprobe;
>  	struct page *old_page, *new_page;
>  	struct vm_area_struct *vma;
> -	int ret;
> +	int ret, is_register, ref_ctr_updated = 0;
> +
> +	is_register = is_swbp_insn(&opcode);
> +	uprobe = container_of(auprobe, struct uprobe, arch);
>  
>  retry:
>  	/* Read the page with vaddr into memory */
> @@ -317,6 +479,15 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  	if (ret <= 0)
>  		goto put_old;
>  
> +	/* We are going to replace instruction, update ref_ctr. */
> +	if (!ref_ctr_updated && uprobe->ref_ctr_offset) {
> +		ret = update_ref_ctr(uprobe, mm, is_register);
> +		if (ret)
> +			goto put_old;
> +
> +		ref_ctr_updated = 1;
> +	}
> +
>  	ret = anon_vma_prepare(vma);
>  	if (ret)
>  		goto put_old;
> @@ -337,6 +508,11 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  
>  	if (unlikely(ret == -EAGAIN))
>  		goto retry;
> +
> +	/* Revert back reference counter if instruction update failed. */
> +	if (ret && is_register && ref_ctr_updated)
> +		update_ref_ctr(uprobe, mm, false);
> +
>  	return ret;
>  }
>  
> @@ -484,7 +660,8 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>  	return u;
>  }
>  
> -static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
> +static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset,
> +				   loff_t ref_ctr_offset)
>  {
>  	struct uprobe *uprobe, *cur_uprobe;
>  
> @@ -494,6 +671,7 @@ static struct uprobe *alloc_uprobe(struct inode *inode, loff_t offset)
>  
>  	uprobe->inode = inode;
>  	uprobe->offset = offset;
> +	uprobe->ref_ctr_offset = ref_ctr_offset;
>  	init_rwsem(&uprobe->register_rwsem);
>  	init_rwsem(&uprobe->consumer_rwsem);
>  
> @@ -895,7 +1073,7 @@ EXPORT_SYMBOL_GPL(uprobe_unregister);
>   * else return 0 (success)
>   */
>  static int __uprobe_register(struct inode *inode, loff_t offset,
> -			     struct uprobe_consumer *uc)
> +			     loff_t ref_ctr_offset, struct uprobe_consumer *uc)
>  {
>  	struct uprobe *uprobe;
>  	int ret;
> @@ -912,7 +1090,7 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
>  		return -EINVAL;
>  
>   retry:
> -	uprobe = alloc_uprobe(inode, offset);
> +	uprobe = alloc_uprobe(inode, offset, ref_ctr_offset);
>  	if (!uprobe)
>  		return -ENOMEM;
>  	/*
> @@ -938,10 +1116,17 @@ static int __uprobe_register(struct inode *inode, loff_t offset,
>  int uprobe_register(struct inode *inode, loff_t offset,
>  		    struct uprobe_consumer *uc)
>  {
> -	return __uprobe_register(inode, offset, uc);
> +	return __uprobe_register(inode, offset, 0, uc);
>  }
>  EXPORT_SYMBOL_GPL(uprobe_register);
>  
> +int uprobe_register_refctr(struct inode *inode, loff_t offset,
> +			   loff_t ref_ctr_offset, struct uprobe_consumer *uc)
> +{
> +	return __uprobe_register(inode, offset, ref_ctr_offset, uc);
> +}
> +EXPORT_SYMBOL_GPL(uprobe_register_refctr);
> +
>  /*
>   * uprobe_apply - unregister a already registered probe.
>   * @inode: the file in which the probe has to be removed.
> @@ -1060,6 +1245,31 @@ static void build_probe_list(struct inode *inode,
>  	spin_unlock(&uprobes_treelock);
>  }
>  
> +static int delayed_uprobe_install(struct vm_area_struct *vma)
> +{
> +	struct list_head *pos, *q;
> +	struct delayed_uprobe *du;
> +	unsigned long vaddr;
> +	int ret = 0, err = 0;
> +
> +	mutex_lock(&delayed_uprobe_lock);
> +	list_for_each_safe(pos, q, &delayed_uprobe_list) {
> +		du = list_entry(pos, struct delayed_uprobe, list);
> +
> +		if (!du->uprobe->ref_ctr_offset ||
> +		    !valid_ref_ctr_vma(du->uprobe, vma))
> +			continue;
> +
> +		vaddr = offset_to_vaddr(vma, du->uprobe->ref_ctr_offset);
> +		ret = __update_ref_ctr(vma->vm_mm, vaddr, 1);
> +		/* Record an error and continue. */
> +		err = ret & !err ? ret : err;
> +		delayed_uprobe_delete(du);
> +	}
> +	mutex_unlock(&delayed_uprobe_lock);
> +	return err;
> +}
> +
>  /*
>   * Called from mmap_region/vma_adjust with mm->mmap_sem acquired.
>   *
> @@ -1072,7 +1282,13 @@ int uprobe_mmap(struct vm_area_struct *vma)
>  	struct uprobe *uprobe, *u;
>  	struct inode *inode;
>  
> -	if (no_uprobe_events() || !valid_vma(vma, true))
> +	if (no_uprobe_events())
> +		return 0;
> +
> +	if (vma->vm_flags & VM_WRITE)
> +		delayed_uprobe_install(vma);
> +
> +	if (!valid_vma(vma, true))
>  		return 0;
>  
>  	inode = file_inode(vma->vm_file);
> @@ -1246,6 +1462,10 @@ void uprobe_clear_state(struct mm_struct *mm)
>  {
>  	struct xol_area *area = mm->uprobes_state.xol_area;
>  
> +	mutex_lock(&delayed_uprobe_lock);
> +	delayed_uprobe_remove(NULL, mm);
> +	mutex_unlock(&delayed_uprobe_lock);
> +
>  	if (!area)
>  		return;
>  
> diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
> index f054bd6a1c66..b431790779f7 100644
> --- a/kernel/trace/trace.c
> +++ b/kernel/trace/trace.c
> @@ -4614,7 +4614,7 @@ static const char readme_msg[] =
>    "place (kretprobe): [<module>:]<symbol>[+<offset>]|<memaddr>\n"
>  #endif
>  #ifdef CONFIG_UPROBE_EVENTS
> -	"\t    place: <path>:<offset>\n"
> +  "   place (uprobe): <path>:<offset>[(ref_ctr_offset)]\n"
>  #endif
>  	"\t     args: <name>=fetcharg[:type]\n"
>  	"\t fetcharg: %<register>, @<address>, @<symbol>[+|-<offset>],\n"
> diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
> index bf89a51e740d..bf2be098eb08 100644
> --- a/kernel/trace/trace_uprobe.c
> +++ b/kernel/trace/trace_uprobe.c
> @@ -59,6 +59,7 @@ struct trace_uprobe {
>  	struct inode			*inode;
>  	char				*filename;
>  	unsigned long			offset;
> +	unsigned long			ref_ctr_offset;
>  	unsigned long			nhit;
>  	struct trace_probe		tp;
>  };
> @@ -364,10 +365,10 @@ static int register_trace_uprobe(struct trace_uprobe *tu)
>  static int create_trace_uprobe(int argc, char **argv)
>  {
>  	struct trace_uprobe *tu;
> -	char *arg, *event, *group, *filename;
> +	char *arg, *event, *group, *filename, *rctr, *rctr_end;
>  	char buf[MAX_EVENT_NAME_LEN];
>  	struct path path;
> -	unsigned long offset;
> +	unsigned long offset, ref_ctr_offset;
>  	bool is_delete, is_return;
>  	int i, ret;
>  
> @@ -376,6 +377,7 @@ static int create_trace_uprobe(int argc, char **argv)
>  	is_return = false;
>  	event = NULL;
>  	group = NULL;
> +	ref_ctr_offset = 0;
>  
>  	/* argc must be >= 1 */
>  	if (argv[0][0] == '-')
> @@ -450,6 +452,26 @@ static int create_trace_uprobe(int argc, char **argv)
>  		goto fail_address_parse;
>  	}
>  
> +	/* Parse reference counter offset if specified. */
> +	rctr = strchr(arg, '(');
> +	if (rctr) {
> +		rctr_end = strchr(rctr, ')');
> +		if (rctr > rctr_end || *(rctr_end + 1) != 0) {
> +			ret = -EINVAL;
> +			pr_info("Invalid reference counter offset.\n");
> +			goto fail_address_parse;
> +		}
> +
> +		*rctr++ = '\0';
> +		*rctr_end = '\0';
> +		ret = kstrtoul(rctr, 0, &ref_ctr_offset);
> +		if (ret) {
> +			pr_info("Invalid reference counter offset.\n");
> +			goto fail_address_parse;
> +		}
> +	}
> +
> +	/* Parse uprobe offset. */
>  	ret = kstrtoul(arg, 0, &offset);
>  	if (ret)
>  		goto fail_address_parse;
> @@ -484,6 +506,7 @@ static int create_trace_uprobe(int argc, char **argv)
>  		goto fail_address_parse;
>  	}
>  	tu->offset = offset;
> +	tu->ref_ctr_offset = ref_ctr_offset;
>  	tu->path = path;
>  	tu->filename = kstrdup(filename, GFP_KERNEL);
>  
> @@ -602,6 +625,9 @@ static int probes_seq_show(struct seq_file *m, void *v)
>  			trace_event_name(&tu->tp.call), tu->filename,
>  			(int)(sizeof(void *) * 2), tu->offset);
>  
> +	if (tu->ref_ctr_offset)
> +		seq_printf(m, "(0x%lx)", tu->ref_ctr_offset);
> +
>  	for (i = 0; i < tu->tp.nr_args; i++)
>  		seq_printf(m, " %s=%s", tu->tp.args[i].name, tu->tp.args[i].comm);
>  
> @@ -917,7 +943,13 @@ probe_event_enable(struct trace_uprobe *tu, struct trace_event_file *file,
>  
>  	tu->consumer.filter = filter;
>  	tu->inode = d_real_inode(tu->path.dentry);
> -	ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
> +	if (tu->ref_ctr_offset) {
> +		ret = uprobe_register_refctr(tu->inode, tu->offset,
> +				tu->ref_ctr_offset, &tu->consumer);
> +	} else {
> +		ret = uprobe_register(tu->inode, tu->offset, &tu->consumer);
> +	}
> +
>  	if (ret)
>  		goto err_buffer;
>  
> -- 
> 2.14.4
> 


-- 
Masami Hiramatsu <mhiramat@kernel.org>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 5/6] Uprobes/sdt: Prevent multiple reference counter for same uprobe
  2018-07-16  8:47 ` [PATCH v6 5/6] Uprobes/sdt: " Ravi Bangoria
@ 2018-07-25 11:08   ` Oleg Nesterov
  2018-07-27  4:17     ` Ravi Bangoria
  0 siblings, 1 reply; 15+ messages in thread
From: Oleg Nesterov @ 2018-07-25 11:08 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: srikar, rostedt, mhiramat, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, linux-kernel, ananth,
	alexis.berlemont, naveen.n.rao, linux-arm-kernel, linux-mips,
	linux, ralf, paul.burton

No, I can't understand this patch...

On 07/16, Ravi Bangoria wrote:
>
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -63,6 +63,8 @@ static struct percpu_rw_semaphore dup_mmap_sem;
>
>  /* Have a copy of original instruction */
>  #define UPROBE_COPY_INSN	0
> +/* Reference counter offset is reloaded with non-zero value. */
> +#define REF_CTR_OFF_RELOADED	1
>
>  struct uprobe {
>  	struct rb_node		rb_node;	/* node in the rb tree */
> @@ -476,9 +478,23 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>  		return ret;
>
>  	ret = verify_opcode(old_page, vaddr, &opcode);
> -	if (ret <= 0)
> +	if (ret < 0)
>  		goto put_old;

I agree, "ret <= 0" wasn't nice even before this change, but "ret < 0" looks
worse because this is simply not possible.

> +	/*
> +	 * If instruction is already patched but reference counter offset
> +	 * has been reloaded to non-zero value, increment the reference
> +	 * counter and return.
> +	 */
> +	if (ret == 0) {
> +		if (is_register &&
> +		    test_bit(REF_CTR_OFF_RELOADED, &uprobe->flags)) {
> +			WARN_ON(!uprobe->ref_ctr_offset);
> +			ret = update_ref_ctr(uprobe, mm, true);
> +		}
> +		goto put_old;
> +	}

So we need to force update_ref_ctr(true) in case when uprobe_register_refctr()
detects the already registered uprobe with ref_ctr_offset == 0, and then it calls
register_for_each_vma().

Why this can't race with uprobe_mmap() ?

uprobe_mmap() can do install_breakpoint() right after REF_CTR_OFF_RELOADED was set,
then register_for_each_vma() will find this vma and do install_breakpoint() too.
If ref_ctr_vma was already mmaped, the counter will be incremented twice, no?

> @@ -971,6 +1011,7 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>  	bool is_register = !!new;
>  	struct map_info *info;
>  	int err = 0;
> +	bool installed = false;
>
>  	percpu_down_write(&dup_mmap_sem);
>  	info = build_map_info(uprobe->inode->i_mapping,
> @@ -1000,8 +1041,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>  		if (is_register) {
>  			/* consult only the "caller", new consumer. */
>  			if (consumer_filter(new,
> -					UPROBE_FILTER_REGISTER, mm))
> +					UPROBE_FILTER_REGISTER, mm)) {
>  				err = install_breakpoint(uprobe, mm, vma, info->vaddr);
> +				installed = true;
> +			}
>  		} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
>  			if (!filter_chain(uprobe,
>  					UPROBE_FILTER_UNREGISTER, mm))
> @@ -1016,6 +1059,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>  	}
>   out:
>  	percpu_up_write(&dup_mmap_sem);
> +	if (installed)
> +		clear_bit(REF_CTR_OFF_RELOADED, &uprobe->flags);

I simply can't understand this "bool installed"....

shouldn't we clear REF_CTR_OFF_RELOADED unconditionally after register_for_each_vma()?



Also. Suppose we have a registered uprobe with ref_ctr_offset == 0. Then you add and
remove uprobe with ref_ctr_offset != 0. But afaics uprobe->ref_ctr_offset is never
cleared, so another uprobe with a different ref_ctr_offset != 0 will hit pr_warn/-EINVAL
in alloc_uprobe() and find_old_trace_uprobe() added by the previous patch can't detect
this case?

Plus it seems that we can have the unbalanced update_ref_ctr(false), at least in case
when __uprobe_register() with REF_CTR_OFF_RELOADED set fails before it patches all mm's.
If/when the 1st uprobe with ref_ctr_offset == 0 goes away, remove_breakpoint() will dec
the counter even if wasn't incremented.

Quite possibly I am totally confused, but this patch wrong in many ways...

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 5/6] Uprobes/sdt: Prevent multiple reference counter for same uprobe
  2018-07-25 11:08   ` Oleg Nesterov
@ 2018-07-27  4:17     ` Ravi Bangoria
  2018-07-27 13:55       ` Oleg Nesterov
  0 siblings, 1 reply; 15+ messages in thread
From: Ravi Bangoria @ 2018-07-27  4:17 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: srikar, rostedt, mhiramat, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, linux-kernel, ananth,
	alexis.berlemont, naveen.n.rao, linux-arm-kernel, linux-mips,
	linux, ralf, paul.burton, Ravi Bangoria

Hi Oleg,

On 07/25/2018 04:38 PM, Oleg Nesterov wrote:
> No, I can't understand this patch...
> 
> On 07/16, Ravi Bangoria wrote:
>>
>> --- a/kernel/events/uprobes.c
>> +++ b/kernel/events/uprobes.c
>> @@ -63,6 +63,8 @@ static struct percpu_rw_semaphore dup_mmap_sem;
>>
>>  /* Have a copy of original instruction */
>>  #define UPROBE_COPY_INSN	0
>> +/* Reference counter offset is reloaded with non-zero value. */
>> +#define REF_CTR_OFF_RELOADED	1
>>
>>  struct uprobe {
>>  	struct rb_node		rb_node;	/* node in the rb tree */
>> @@ -476,9 +478,23 @@ int uprobe_write_opcode(struct arch_uprobe *auprobe, struct mm_struct *mm,
>>  		return ret;
>>
>>  	ret = verify_opcode(old_page, vaddr, &opcode);
>> -	if (ret <= 0)
>> +	if (ret < 0)
>>  		goto put_old;
> 
> I agree, "ret <= 0" wasn't nice even before this change, but "ret < 0" looks
> worse because this is simply not possible.


Ok. Any better idea?
I think if we don't track all mms patched by uprobe, we have to rely
on current instruction.


> 
>> +	/*
>> +	 * If instruction is already patched but reference counter offset
>> +	 * has been reloaded to non-zero value, increment the reference
>> +	 * counter and return.
>> +	 */
>> +	if (ret == 0) {
>> +		if (is_register &&
>> +		    test_bit(REF_CTR_OFF_RELOADED, &uprobe->flags)) {
>> +			WARN_ON(!uprobe->ref_ctr_offset);
>> +			ret = update_ref_ctr(uprobe, mm, true);
>> +		}
>> +		goto put_old;
>> +	}
> 
> So we need to force update_ref_ctr(true) in case when uprobe_register_refctr()
> detects the already registered uprobe with ref_ctr_offset == 0, and then it calls
> register_for_each_vma().
> 
> Why this can't race with uprobe_mmap() ?
> 
> uprobe_mmap() can do install_breakpoint() right after REF_CTR_OFF_RELOADED was set,
> then register_for_each_vma() will find this vma and do install_breakpoint() too.
> If ref_ctr_vma was already mmaped, the counter will be incremented twice, no?


Hmm right. Probably, I can fix this race by using some lock, but I don't
know if it's good to do that inside uprobe_mmap().


> 
>> @@ -971,6 +1011,7 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>>  	bool is_register = !!new;
>>  	struct map_info *info;
>>  	int err = 0;
>> +	bool installed = false;
>>
>>  	percpu_down_write(&dup_mmap_sem);
>>  	info = build_map_info(uprobe->inode->i_mapping,
>> @@ -1000,8 +1041,10 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>>  		if (is_register) {
>>  			/* consult only the "caller", new consumer. */
>>  			if (consumer_filter(new,
>> -					UPROBE_FILTER_REGISTER, mm))
>> +					UPROBE_FILTER_REGISTER, mm)) {
>>  				err = install_breakpoint(uprobe, mm, vma, info->vaddr);
>> +				installed = true;
>> +			}
>>  		} else if (test_bit(MMF_HAS_UPROBES, &mm->flags)) {
>>  			if (!filter_chain(uprobe,
>>  					UPROBE_FILTER_UNREGISTER, mm))
>> @@ -1016,6 +1059,8 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
>>  	}
>>   out:
>>  	percpu_up_write(&dup_mmap_sem);
>> +	if (installed)
>> +		clear_bit(REF_CTR_OFF_RELOADED, &uprobe->flags);
> 
> I simply can't understand this "bool installed"....


That boolean is needed because consumer_filter() returns false when this
function gets called first time from uprobe_register(). And consumer_filter
returns true when it gets called by uprobe_apply(). If I make it
unconditional, there is no effect of setting REF_CTR_OFF_RELOADED bit. So
this boolean is needed.

Though, there is one more issue I found. Let's say there are two processes
running and user probes on both of them using uprobe_register() using, let's
say systemtap. Now, some other guy does uprobe_register_refctr() using
'perf -p PID' on same uprobe but he is interested in only one process. Here,
we will increment the reference counter only in the "PID" process and we will
clear REF_CTR_OFF_RELOADED flag. Later, some other guy does 'perf -a' which
will call uprobe_register_refctr() for both the target but we will fail to
increment the counter in "non-PID" process because we had already clear
REF_CTR_OFF_RELOADED.

I have a solution for this. Idea is, if reference counter is reloaded, save
of all mms for which consumer_filter() denied to updated when being called
from register_for_each_vma(). Use this list of mms as checklist next time
onwards. I don't know if it's good to do that or not.


> 
> shouldn't we clear REF_CTR_OFF_RELOADED unconditionally after register_for_each_vma()?
> 


No, because uprobe_register() is simply NOP and breakpoint is actually
installed by uprobe_apply().


> 
> 
> Also. Suppose we have a registered uprobe with ref_ctr_offset == 0. Then you add and
> remove uprobe with ref_ctr_offset != 0. But afaics uprobe->ref_ctr_offset is never
> cleared, so another uprobe with a different ref_ctr_offset != 0 will hit pr_warn/-EINVAL
> in alloc_uprobe() and find_old_trace_uprobe() added by the previous patch can't detect
> this case?


This is a valid concern. So, this point is forcing me to make it a consumer
property. But if I do that, all optimization done by uprobe_perf_open() and
uprobe_perf_close() needs to be reverted, which I don't want to.


> 
> Plus it seems that we can have the unbalanced update_ref_ctr(false), at least in case
> when __uprobe_register() with REF_CTR_OFF_RELOADED set fails before it patches all mm's.
> If/when the 1st uprobe with ref_ctr_offset == 0 goes away, remove_breakpoint() will dec
> the counter even if wasn't incremented.


Hmm incase of failure, this could be possible.


> 
> Quite possibly I am totally confused, but this patch wrong in many ways...

No, you are right.

Please let me know if you have any better approach.

Thanks for the review :)


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 5/6] Uprobes/sdt: Prevent multiple reference counter for same uprobe
  2018-07-27  4:17     ` Ravi Bangoria
@ 2018-07-27 13:55       ` Oleg Nesterov
  0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2018-07-27 13:55 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: srikar, rostedt, mhiramat, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, linux-kernel, ananth,
	alexis.berlemont, naveen.n.rao, linux-arm-kernel, linux-mips,
	linux, ralf, paul.burton

Hi Ravi,

On 07/27, Ravi Bangoria wrote:
>
> > I simply can't understand this "bool installed"....
>
>
> That boolean is needed because consumer_filter() returns false when this
> function gets called first time from uprobe_register().

Ah yes, I forgot about the (ugly) 2-stage TRACE_REG_PERF_REGISTER +
TRACE_REG_PERF_OPEN logic...

But then I think the whole idea of REF_CTR_OFF_RELOADED is even more broken.
Nevermind.

> I have a solution for this. Idea is, if reference counter is reloaded, save
> of all mms for which consumer_filter() denied to updated when being called
> from register_for_each_vma(). Use this list of mms as checklist next time
> onwards. I don't know if it's good to do that or not.

Sounds horrible ;) and I bet this is not enough.

> Please let me know if you have any better approach.

Just drop this patch.

If we can't make it per consumer, let it be global like it was in first version.

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore)
  2018-07-24  3:34     ` Ravi Bangoria
@ 2018-07-27 13:59       ` Oleg Nesterov
  0 siblings, 0 replies; 15+ messages in thread
From: Oleg Nesterov @ 2018-07-27 13:59 UTC (permalink / raw)
  To: Ravi Bangoria
  Cc: srikar, rostedt, mhiramat, peterz, mingo, acme,
	alexander.shishkin, jolsa, namhyung, linux-kernel, ananth,
	alexis.berlemont, naveen.n.rao, linux-arm-kernel, linux-mips,
	linux, ralf, paul.burton

On 07/24, Ravi Bangoria wrote:
>
> > Perhaps we can even add another flag later, MMF_HAS_DELAYED_UPROBES set by
> > delayed_uprobe_add().
>
> Yes, good idea.

But let me repeat, lets do this later. We can do even better I think, say,
move this delaed-list into uprobes_state. But imo the 1st version can just
check MMF_HAS_UPROBES because this is simple, obviously correct, and can
actually help.

Oleg.


^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-07-27 13:59 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-16  8:47 [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
2018-07-16  8:47 ` [PATCH v6 1/6] Uprobes: Simplify uprobe_register() body Ravi Bangoria
2018-07-16  8:47 ` [PATCH v6 2/6] Uprobe: Additional argument arch_uprobe to uprobe_write_opcode() Ravi Bangoria
2018-07-16  8:47 ` [PATCH v6 3/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria
2018-07-23 16:26   ` Oleg Nesterov
2018-07-24  3:34     ` Ravi Bangoria
2018-07-27 13:59       ` Oleg Nesterov
2018-07-24 14:21   ` Masami Hiramatsu
2018-07-16  8:47 ` [PATCH v6 4/6] trace_uprobe/sdt: Prevent multiple reference counter for same uprobe Ravi Bangoria
2018-07-16  8:47 ` [PATCH v6 5/6] Uprobes/sdt: " Ravi Bangoria
2018-07-25 11:08   ` Oleg Nesterov
2018-07-27  4:17     ` Ravi Bangoria
2018-07-27 13:55       ` Oleg Nesterov
2018-07-16  8:47 ` [PATCH v6 6/6] perf probe: Support SDT markers having reference counter (semaphore) Ravi Bangoria
2018-07-20 13:47 ` [PATCH v6 0/6] Uprobes: Support SDT markers having reference count (semaphore) Ravi Bangoria

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).