* [PATCH v3 0/2] Add tracepoints around mmap_lock acquisition
@ 2020-10-09 22:05 ` Axel Rasmussen
  0 siblings, 0 replies; 24+ messages in thread
From: Axel Rasmussen @ 2020-10-09 22:05 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Axel Rasmussen,
	Jann Horn, Chinwen Chang
  Cc: Yafang Shao, linux-kernel, linux-mm

This patchset adds tracepoints around mmap_lock acquisition. This is useful so
we can measure the latency of lock acquisition, in order to detect contention.

This version is based upon linux-next (since it depends on some recently-merged
patches [1] [2]).

Changes since v2:

- Refactored tracing helper functions so the helpers are simpler, but the locking
  functions are slightly more verbose. Overall, this decreased the delta to
  mmap_lock.h slightly.

- Fixed a typo in a comment. :)

Changes since v1:

- Functions renamed to reserve the "trace_" prefix for actual tracepoints.

- We no longer measure the duration directly. Instead, users are expected to
  construct a synthetic event which computes the interval between "start
  locking" and "acquire returned".

- The new helper for checking if tracepoints are enabled in a header is used to
  avoid un-inlining any of the lock wrappers. This yields ~zero overhead if the
  tracepoints aren't enabled, and therefore obviates the need for a Kconfig for
  this change.

[1] https://lore.kernel.org/patchwork/patch/1316922/
[2] https://lore.kernel.org/patchwork/patch/1311996/

Axel Rasmussen (2):
  tracing: support "bool" type in synthetic trace events
  mmap_lock: add tracepoints around lock acquisition

 include/linux/mmap_lock.h         | 93 +++++++++++++++++++++++++++++--
 include/trace/events/mmap_lock.h  | 70 +++++++++++++++++++++++
 kernel/trace/trace_events_synth.c |  4 ++
 mm/Makefile                       |  2 +-
 mm/mmap_lock.c                    | 87 +++++++++++++++++++++++++++++
 5 files changed, 250 insertions(+), 6 deletions(-)
 create mode 100644 include/trace/events/mmap_lock.h
 create mode 100644 mm/mmap_lock.c

--
2.28.0.1011.ga647a8990f-goog


* [PATCH v3 1/2] tracing: support "bool" type in synthetic trace events
  2020-10-09 22:05 ` Axel Rasmussen
@ 2020-10-09 22:05   ` Axel Rasmussen
  -1 siblings, 0 replies; 24+ messages in thread
From: Axel Rasmussen @ 2020-10-09 22:05 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Axel Rasmussen,
	Jann Horn, Chinwen Chang
  Cc: Yafang Shao, linux-kernel, linux-mm

It's common [1] to define tracepoint fields as "bool" when they contain
a true / false value. Currently, defining a synthetic event with a
"bool" field yields EINVAL. It's possible to work around this by using
e.g. u8 (assuming sizeof(bool) is 1, and bool is unsigned; if either of
these properties doesn't match, you get EINVAL [2]).

Supporting "bool" explicitly makes hooking this up easier and more
portable for userspace.

[1]: grep -r "bool" include/trace/events/
[2]: check_synth_field() in kernel/trace/trace_events_hist.c
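
For illustration only, here is a minimal userspace sketch of the idea: a
type name is mapped to a storage size and a printf format, and "bool" is
simply one more entry. It assumes a table-driven lookup instead of the
strcmp() chain the kernel uses; struct field_type and lookup_field_type()
are hypothetical names and are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* One row per type name accepted for a field (hypothetical structure). */
struct field_type {
	const char *name;	/* type name as written by the user */
	size_t size;		/* storage size of the field */
	const char *fmt;	/* printf format used to print it */
};

static const struct field_type field_types[] = {
	{ "long",          sizeof(long),          "%ld" },
	{ "unsigned long", sizeof(unsigned long), "%lu" },
	{ "bool",          sizeof(bool),          "%d"  },
};

/* Return the matching row, or NULL if the type name is not supported. */
static const struct field_type *lookup_field_type(const char *type)
{
	size_t i;

	assert(type != NULL);
	for (i = 0; i < sizeof(field_types) / sizeof(field_types[0]); i++) {
		if (strcmp(type, field_types[i].name) == 0)
			return &field_types[i];
	}
	return NULL;
}
```

A NULL return here plays the role of the EINVAL described above: an
unknown type name is rejected rather than guessed at.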

Acked-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 kernel/trace/trace_events_synth.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
index 8e1974fbad0e..8f94c84349a6 100644
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -234,6 +234,8 @@ static int synth_field_size(char *type)
 		size = sizeof(long);
 	else if (strcmp(type, "unsigned long") == 0)
 		size = sizeof(unsigned long);
+	else if (strcmp(type, "bool") == 0)
+		size = sizeof(bool);
 	else if (strcmp(type, "pid_t") == 0)
 		size = sizeof(pid_t);
 	else if (strcmp(type, "gfp_t") == 0)
@@ -276,6 +278,8 @@ static const char *synth_field_fmt(char *type)
 		fmt = "%ld";
 	else if (strcmp(type, "unsigned long") == 0)
 		fmt = "%lu";
+	else if (strcmp(type, "bool") == 0)
+		fmt = "%d";
 	else if (strcmp(type, "pid_t") == 0)
 		fmt = "%d";
 	else if (strcmp(type, "gfp_t") == 0)
--
2.28.0.1011.ga647a8990f-goog


* [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition
  2020-10-09 22:05 ` Axel Rasmussen
@ 2020-10-09 22:05   ` Axel Rasmussen
  -1 siblings, 0 replies; 24+ messages in thread
From: Axel Rasmussen @ 2020-10-09 22:05 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Axel Rasmussen,
	Jann Horn, Chinwen Chang
  Cc: Yafang Shao, linux-kernel, linux-mm

The goal of these tracepoints is to be able to debug lock contention
issues. This lock is acquired on most (all?) mmap / munmap / page fault
operations, so a multi-threaded process which does a lot of these can
experience significant contention.

We trace just before we start acquisition, when the acquisition returns
(whether it succeeded or not), and when the lock is released (or
downgraded). The events are broken out by lock type (read / write).

The events are also broken out by memcg path. For container-based
workloads, users often think of several processes in a memcg as a single
logical "task", so collecting statistics at this level is useful.

The end goal is to get latency information. This isn't directly included
in the trace events. Instead, users are expected to compute the time
between "start locking" and "acquire returned", using e.g. synthetic
events or BPF. The benefit we get from this is simpler code.
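
As a rough sketch of that computation (not part of this patch), the
userspace code below pairs each "start locking" event with the next
"acquire returned" event from the same pid and reports the largest
interval. The struct lock_event layout, the per-pid pairing, the
MAX_PIDS limit and the function name are all assumptions made for the
example; a real consumer would read these fields from the trace buffer
or from a BPF map.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum ev_kind { EV_START_LOCKING, EV_ACQUIRE_RETURNED };

/* Hypothetical, already-parsed trace record. */
struct lock_event {
	enum ev_kind kind;
	int pid;		/* task that emitted the event */
	uint64_t ts_ns;		/* event timestamp in nanoseconds */
	bool success;		/* only meaningful for acquire_returned */
};

#define MAX_PIDS 64		/* arbitrary limit for the sketch */

/*
 * Walk a time-ordered event stream and return the largest
 * start_locking -> acquire_returned interval in nanoseconds.
 * Failed acquisitions are included, since waiting still cost time.
 * Simplification: at most one pending start is tracked per pid.
 */
static uint64_t max_acquire_latency_ns(const struct lock_event *ev, size_t n)
{
	int pids[MAX_PIDS];
	uint64_t start[MAX_PIDS];
	size_t used = 0, i, slot;
	uint64_t max = 0;

	assert(ev != NULL || n == 0);
	for (i = 0; i < n; i++) {
		/* Find the pending entry for this pid, if any. */
		for (slot = 0; slot < used; slot++)
			if (pids[slot] == ev[i].pid)
				break;

		if (ev[i].kind == EV_START_LOCKING) {
			if (slot == used) {
				if (used == MAX_PIDS)
					continue;	/* table full: drop */
				pids[used++] = ev[i].pid;
			}
			start[slot] = ev[i].ts_ns;
		} else if (slot < used) {
			uint64_t lat = ev[i].ts_ns - start[slot];

			if (lat > max)
				max = lat;
			/* Forget the pending start: move the last entry here. */
			pids[slot] = pids[used - 1];
			start[slot] = start[used - 1];
			used--;
		}
	}
	return max;
}
```

An "acquire returned" event with no matching start (for example, the one
emitted on downgrade) is ignored by this sketch.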

Because we use tracepoint_enabled() to decide whether or not to trace,
this patch has effectively no overhead unless tracepoints are enabled at
runtime. If tracepoints are enabled, there is a performance impact, but
how much depends on exactly what e.g. the BPF program does.
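
The shape of that gating can be sketched in plain userspace C. This is
only an illustration of the pattern: the kernel's tracepoint_enabled()
is, as far as I understand it, backed by a static key rather than the
ordinary bool used here, and every demo_* name below is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the tracepoint's enabled state (a static key in the kernel). */
static bool demo_trace_enabled;

/* Counts how often the out-of-line slow path actually ran. */
static unsigned long demo_slow_path_calls;

/* Out-of-line slow path; the real tracing work would happen here. */
static void demo_do_trace_start_locking(const void *mm, bool write)
{
	(void)mm;
	(void)write;
	demo_slow_path_calls++;
}

/* Inline check: when tracing is off, this is a single untaken branch. */
static inline void demo_trace_start_locking(const void *mm, bool write)
{
	if (demo_trace_enabled)
		demo_do_trace_start_locking(mm, write);
}

/* Lock wrapper stays inline; only the enabled case calls out of line. */
static inline void demo_write_lock(const void *mm)
{
	assert(mm != NULL);
	demo_trace_start_locking(mm, true);
	/* the actual lock acquisition would go here */
}
```

With demo_trace_enabled left false, demo_write_lock() never reaches the
out-of-line function, which is the property the paragraph above relies
on.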

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/mmap_lock.h        | 93 ++++++++++++++++++++++++++++++--
 include/trace/events/mmap_lock.h | 70 ++++++++++++++++++++++++
 mm/Makefile                      |  2 +-
 mm/mmap_lock.c                   | 87 ++++++++++++++++++++++++++++++
 4 files changed, 246 insertions(+), 6 deletions(-)
 create mode 100644 include/trace/events/mmap_lock.h
 create mode 100644 mm/mmap_lock.c

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 0707671851a8..6586b42b4faa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -1,11 +1,63 @@
 #ifndef _LINUX_MMAP_LOCK_H
 #define _LINUX_MMAP_LOCK_H
 
+#include <linux/lockdep.h>
+#include <linux/mm_types.h>
 #include <linux/mmdebug.h>
+#include <linux/rwsem.h>
+#include <linux/tracepoint-defs.h>
+#include <linux/types.h>
 
 #define MMAP_LOCK_INITIALIZER(name) \
 	.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
 
+DECLARE_TRACEPOINT(mmap_lock_start_locking);
+DECLARE_TRACEPOINT(mmap_lock_acquire_returned);
+DECLARE_TRACEPOINT(mmap_lock_released);
+
+#ifdef CONFIG_TRACING
+
+void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write);
+void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
+					   bool success);
+void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write);
+
+static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm,
+						   bool write)
+{
+	if (tracepoint_enabled(mmap_lock_start_locking))
+		__mmap_lock_do_trace_start_locking(mm, write);
+}
+
+static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm,
+						      bool write, bool success)
+{
+	if (tracepoint_enabled(mmap_lock_acquire_returned))
+		__mmap_lock_do_trace_acquire_returned(mm, write, success);
+}
+
+static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
+{
+	if (tracepoint_enabled(mmap_lock_released))
+		__mmap_lock_do_trace_released(mm, write);
+}
+
+#else /* !CONFIG_TRACING */
+
+static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm,
+						   bool write)
+{
+}
+static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm,
+						      bool write, bool success)
+{
+}
+static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
+{
+}
+
+#endif /* CONFIG_TRACING */
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
@@ -13,58 +65,88 @@ static inline void mmap_init_lock(struct mm_struct *mm)
 
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
+	__mmap_lock_trace_start_locking(mm, true);
 	down_write(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
 static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
 {
+	__mmap_lock_trace_start_locking(mm, true);
 	down_write_nested(&mm->mmap_lock, subclass);
+	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
 static inline int mmap_write_lock_killable(struct mm_struct *mm)
 {
-	return down_write_killable(&mm->mmap_lock);
+	int ret;
+
+	__mmap_lock_trace_start_locking(mm, true);
+	ret = down_write_killable(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, true, ret == 0);
+	return ret;
 }
 
 static inline bool mmap_write_trylock(struct mm_struct *mm)
 {
-	return down_write_trylock(&mm->mmap_lock) != 0;
+	bool ret;
+
+	__mmap_lock_trace_start_locking(mm, true);
+	ret = down_write_trylock(&mm->mmap_lock) != 0;
+	__mmap_lock_trace_acquire_returned(mm, true, ret);
+	return ret;
 }
 
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	up_write(&mm->mmap_lock);
+	__mmap_lock_trace_released(mm, true);
 }
 
 static inline void mmap_write_downgrade(struct mm_struct *mm)
 {
 	downgrade_write(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, false, true);
 }
 
 static inline void mmap_read_lock(struct mm_struct *mm)
 {
+	__mmap_lock_trace_start_locking(mm, false);
 	down_read(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, false, true);
 }
 
 static inline int mmap_read_lock_killable(struct mm_struct *mm)
 {
-	return down_read_killable(&mm->mmap_lock);
+	int ret;
+
+	__mmap_lock_trace_start_locking(mm, false);
+	ret = down_read_killable(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, false, ret == 0);
+	return ret;
 }
 
 static inline bool mmap_read_trylock(struct mm_struct *mm)
 {
-	return down_read_trylock(&mm->mmap_lock) != 0;
+	bool ret;
+
+	__mmap_lock_trace_start_locking(mm, false);
+	ret = down_read_trylock(&mm->mmap_lock) != 0;
+	__mmap_lock_trace_acquire_returned(mm, false, ret);
+	return ret;
 }
 
 static inline void mmap_read_unlock(struct mm_struct *mm)
 {
 	up_read(&mm->mmap_lock);
+	__mmap_lock_trace_released(mm, false);
 }
 
 static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
 {
-	if (down_read_trylock(&mm->mmap_lock)) {
+	if (mmap_read_trylock(mm)) {
 		rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
+		__mmap_lock_trace_released(mm, false);
 		return true;
 	}
 	return false;
@@ -73,6 +155,7 @@ static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
 static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
 {
 	up_read_non_owner(&mm->mmap_lock);
+	__mmap_lock_trace_released(mm, false);
 }
 
 static inline void mmap_assert_locked(struct mm_struct *mm)
diff --git a/include/trace/events/mmap_lock.h b/include/trace/events/mmap_lock.h
new file mode 100644
index 000000000000..ca652b52510e
--- /dev/null
+++ b/include/trace/events/mmap_lock.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mmap_lock
+
+#if !defined(_TRACE_MMAP_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MMAP_LOCK_H
+
+#include <linux/tracepoint.h>
+#include <linux/types.h>
+
+struct mm_struct;
+
+DECLARE_EVENT_CLASS(
+	mmap_lock_template,
+
+	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+		bool success),
+
+	TP_ARGS(mm, memcg_path, write, success),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__string(memcg_path, memcg_path)
+		__field(bool, write)
+		__field(bool, success)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__assign_str(memcg_path, memcg_path);
+		__entry->write = write;
+		__entry->success = success;
+	),
+
+	TP_printk(
+		"mm=%p memcg_path=%s write=%s success=%s\n",
+		__entry->mm,
+		__get_str(memcg_path),
+		__entry->write ? "true" : "false",
+		__entry->success ? "true" : "false")
+	);
+
+DEFINE_EVENT(mmap_lock_template, mmap_lock_start_locking,
+
+	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+		bool success),
+
+	TP_ARGS(mm, memcg_path, write, success)
+);
+
+DEFINE_EVENT(mmap_lock_template, mmap_lock_acquire_returned,
+
+	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+		bool success),
+
+	TP_ARGS(mm, memcg_path, write, success)
+);
+
+DEFINE_EVENT(mmap_lock_template, mmap_lock_released,
+
+	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+		bool success),
+
+	TP_ARGS(mm, memcg_path, write, success)
+);
+
+#endif /* _TRACE_MMAP_LOCK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..1a7ea212fd8b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
new file mode 100644
index 000000000000..b849287bd12a
--- /dev/null
+++ b/mm/mmap_lock.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+#define CREATE_TRACE_POINTS
+#include <trace/events/mmap_lock.h>
+
+#include <linux/mm.h>
+#include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/mmap_lock.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+#include <linux/trace_events.h>
+
+/*
+ * We have to export these, as drivers use mmap_lock, and our inline functions
+ * in the header check if the tracepoint is enabled. They can't be GPL, as e.g.
+ * the nvidia driver is an existing caller of this code.
+ */
+EXPORT_SYMBOL(__tracepoint_mmap_lock_start_locking);
+EXPORT_SYMBOL(__tracepoint_mmap_lock_acquire_returned);
+EXPORT_SYMBOL(__tracepoint_mmap_lock_released);
+
+#ifdef CONFIG_MEMCG
+
+DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
+
+/*
+ * Write the given mm_struct's memcg path to a percpu buffer, and return a
+ * pointer to it. If the path cannot be determined, the buffer will contain the
+ * empty string.
+ *
+ * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
+ * disabled by the caller before calling us, and re-enabled only after the
+ * caller is done with the pointer.
+ */
+static const char *get_mm_memcg_path(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+
+	if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
+		char *buf = this_cpu_ptr(trace_memcg_path);
+
+		cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
+		return buf;
+	}
+	return "";
+}
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
+	do {                                                                   \
+		if (trace_mmap_lock_##type##_enabled()) {                      \
+			get_cpu();                                             \
+			trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),      \
+					       ##__VA_ARGS__);                 \
+			put_cpu();                                             \
+		}                                                              \
+	} while (0)
+
+#else /* !CONFIG_MEMCG */
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
+	trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
+
+#endif /* CONFIG_MEMCG */
+
+/*
+ * Trace calls must be in a separate file, as otherwise there's a circular
+ * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
+ */
+
+void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
+{
+	TRACE_MMAP_LOCK_EVENT(start_locking, mm, write, true);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
+
+void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
+					   bool success)
+{
+	TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
+
+void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
+{
+	TRACE_MMAP_LOCK_EVENT(released, mm, write, true);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_released);
-- 
2.28.0.1011.ga647a8990f-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition
@ 2020-10-09 22:05   ` Axel Rasmussen
  0 siblings, 0 replies; 24+ messages in thread
From: Axel Rasmussen @ 2020-10-09 22:05 UTC (permalink / raw)
  To: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Axel Rasmussen,
	Jann Horn, Chinwen Chang
  Cc: Yafang Shao, linux-kernel, linux-mm

The goal of these tracepoints is to be able to debug lock contention
issues. This lock is acquired on most (all?) mmap / munmap / page fault
operations, so a multi-threaded process which does a lot of these can
experience significant contention.

We trace just before we start acquisition, when the acquisition returns
(whether it succeeded or not), and when the lock is released (or
downgraded). The events are broken out by lock type (read / write).

The events are also broken out by memcg path. For container-based
workloads, users often think of several processes in a memcg as a single
logical "task", so collecting statistics at this level is useful.

The end goal is to get latency information. This isn't directly included
in the trace events. Instead, users are expected to compute the time
between "start locking" and "acquire returned", using e.g. synthetic
events or BPF. The benefit we get from this is simpler code.

Because we use tracepoint_enabled() to decide whether or not to trace,
this patch has effectively no overhead unless tracepoints are enabled at
runtime. If tracepoints are enabled, there is a performance impact, but
how much depends on exactly what e.g. the BPF program does.

Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
---
 include/linux/mmap_lock.h        | 93 ++++++++++++++++++++++++++++++--
 include/trace/events/mmap_lock.h | 70 ++++++++++++++++++++++++
 mm/Makefile                      |  2 +-
 mm/mmap_lock.c                   | 87 ++++++++++++++++++++++++++++++
 4 files changed, 246 insertions(+), 6 deletions(-)
 create mode 100644 include/trace/events/mmap_lock.h
 create mode 100644 mm/mmap_lock.c

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index 0707671851a8..6586b42b4faa 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -1,11 +1,63 @@
 #ifndef _LINUX_MMAP_LOCK_H
 #define _LINUX_MMAP_LOCK_H
 
+#include <linux/lockdep.h>
+#include <linux/mm_types.h>
 #include <linux/mmdebug.h>
+#include <linux/rwsem.h>
+#include <linux/tracepoint-defs.h>
+#include <linux/types.h>
 
 #define MMAP_LOCK_INITIALIZER(name) \
 	.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
 
+DECLARE_TRACEPOINT(mmap_lock_start_locking);
+DECLARE_TRACEPOINT(mmap_lock_acquire_returned);
+DECLARE_TRACEPOINT(mmap_lock_released);
+
+#ifdef CONFIG_TRACING
+
+void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write);
+void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
+					   bool success);
+void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write);
+
+static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm,
+						   bool write)
+{
+	if (tracepoint_enabled(mmap_lock_start_locking))
+		__mmap_lock_do_trace_start_locking(mm, write);
+}
+
+static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm,
+						      bool write, bool success)
+{
+	if (tracepoint_enabled(mmap_lock_acquire_returned))
+		__mmap_lock_do_trace_acquire_returned(mm, write, success);
+}
+
+static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
+{
+	if (tracepoint_enabled(mmap_lock_released))
+		__mmap_lock_do_trace_released(mm, write);
+}
+
+#else /* !CONFIG_TRACING */
+
+static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm,
+						   bool write)
+{
+}
+static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm,
+						      bool write, bool success)
+{
+}
+static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
+{
+}
+
+#endif /* CONFIG_TRACING */
+
 static inline void mmap_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_lock);
@@ -13,58 +65,88 @@ static inline void mmap_init_lock(struct mm_struct *mm)
 
 static inline void mmap_write_lock(struct mm_struct *mm)
 {
+	__mmap_lock_trace_start_locking(mm, true);
 	down_write(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
 static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
 {
+	__mmap_lock_trace_start_locking(mm, true);
 	down_write_nested(&mm->mmap_lock, subclass);
+	__mmap_lock_trace_acquire_returned(mm, true, true);
 }
 
 static inline int mmap_write_lock_killable(struct mm_struct *mm)
 {
-	return down_write_killable(&mm->mmap_lock);
+	int ret;
+
+	__mmap_lock_trace_start_locking(mm, true);
+	ret = down_write_killable(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, true, ret == 0);
+	return ret;
 }
 
 static inline bool mmap_write_trylock(struct mm_struct *mm)
 {
-	return down_write_trylock(&mm->mmap_lock) != 0;
+	bool ret;
+
+	__mmap_lock_trace_start_locking(mm, true);
+	ret = down_write_trylock(&mm->mmap_lock) != 0;
+	__mmap_lock_trace_acquire_returned(mm, true, ret);
+	return ret;
 }
 
 static inline void mmap_write_unlock(struct mm_struct *mm)
 {
 	up_write(&mm->mmap_lock);
+	__mmap_lock_trace_released(mm, true);
 }
 
 static inline void mmap_write_downgrade(struct mm_struct *mm)
 {
 	downgrade_write(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, false, true);
 }
 
 static inline void mmap_read_lock(struct mm_struct *mm)
 {
+	__mmap_lock_trace_start_locking(mm, false);
 	down_read(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, false, true);
 }
 
 static inline int mmap_read_lock_killable(struct mm_struct *mm)
 {
-	return down_read_killable(&mm->mmap_lock);
+	int ret;
+
+	__mmap_lock_trace_start_locking(mm, false);
+	ret = down_read_killable(&mm->mmap_lock);
+	__mmap_lock_trace_acquire_returned(mm, false, ret == 0);
+	return ret;
 }
 
 static inline bool mmap_read_trylock(struct mm_struct *mm)
 {
-	return down_read_trylock(&mm->mmap_lock) != 0;
+	bool ret;
+
+	__mmap_lock_trace_start_locking(mm, false);
+	ret = down_read_trylock(&mm->mmap_lock) != 0;
+	__mmap_lock_trace_acquire_returned(mm, false, ret);
+	return ret;
 }
 
 static inline void mmap_read_unlock(struct mm_struct *mm)
 {
 	up_read(&mm->mmap_lock);
+	__mmap_lock_trace_released(mm, false);
 }
 
 static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
 {
-	if (down_read_trylock(&mm->mmap_lock)) {
+	if (mmap_read_trylock(mm)) {
 		rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
+		__mmap_lock_trace_released(mm, false);
 		return true;
 	}
 	return false;
@@ -73,6 +155,7 @@ static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
 static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
 {
 	up_read_non_owner(&mm->mmap_lock);
+	__mmap_lock_trace_released(mm, false);
 }
 
 static inline void mmap_assert_locked(struct mm_struct *mm)
diff --git a/include/trace/events/mmap_lock.h b/include/trace/events/mmap_lock.h
new file mode 100644
index 000000000000..ca652b52510e
--- /dev/null
+++ b/include/trace/events/mmap_lock.h
@@ -0,0 +1,70 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mmap_lock
+
+#if !defined(_TRACE_MMAP_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MMAP_LOCK_H
+
+#include <linux/tracepoint.h>
+#include <linux/types.h>
+
+struct mm_struct;
+
+DECLARE_EVENT_CLASS(
+	mmap_lock_template,
+
+	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+		bool success),
+
+	TP_ARGS(mm, memcg_path, write, success),
+
+	TP_STRUCT__entry(
+		__field(struct mm_struct *, mm)
+		__string(memcg_path, memcg_path)
+		__field(bool, write)
+		__field(bool, success)
+	),
+
+	TP_fast_assign(
+		__entry->mm = mm;
+		__assign_str(memcg_path, memcg_path);
+		__entry->write = write;
+		__entry->success = success;
+	),
+
+	TP_printk(
+		"mm=%p memcg_path=%s write=%s success=%s\n",
+		__entry->mm,
+		__get_str(memcg_path),
+		__entry->write ? "true" : "false",
+		__entry->success ? "true" : "false")
+	);
+
+DEFINE_EVENT(mmap_lock_template, mmap_lock_start_locking,
+
+	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+		bool success),
+
+	TP_ARGS(mm, memcg_path, write, success)
+);
+
+DEFINE_EVENT(mmap_lock_template, mmap_lock_acquire_returned,
+
+	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+		bool success),
+
+	TP_ARGS(mm, memcg_path, write, success)
+);
+
+DEFINE_EVENT(mmap_lock_template, mmap_lock_released,
+
+	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+		bool success),
+
+	TP_ARGS(mm, memcg_path, write, success)
+);
+
+#endif /* _TRACE_MMAP_LOCK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/Makefile b/mm/Makefile
index d5649f1c12c0..1a7ea212fd8b 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   mm_init.o percpu.o slab_common.o \
 			   compaction.o vmacache.o \
 			   interval_tree.o list_lru.o workingset.o \
-			   debug.o gup.o $(mmu-y)
+			   debug.o gup.o mmap_lock.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
new file mode 100644
index 000000000000..b849287bd12a
--- /dev/null
+++ b/mm/mmap_lock.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+#define CREATE_TRACE_POINTS
+#include <trace/events/mmap_lock.h>
+
+#include <linux/mm.h>
+#include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/mmap_lock.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+#include <linux/trace_events.h>
+
+/*
+ * We have to export these, as drivers use mmap_lock, and our inline functions
+ * in the header check if the tracepoint is enabled. They can't be GPL, as e.g.
+ * the nvidia driver is an existing caller of this code.
+ */
+EXPORT_SYMBOL(__tracepoint_mmap_lock_start_locking);
+EXPORT_SYMBOL(__tracepoint_mmap_lock_acquire_returned);
+EXPORT_SYMBOL(__tracepoint_mmap_lock_released);
+
+#ifdef CONFIG_MEMCG
+
+DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
+
+/*
+ * Write the given mm_struct's memcg path to a percpu buffer, and return a
+ * pointer to it. If the path cannot be determined, the buffer will contain the
+ * empty string.
+ *
+ * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
+ * disabled by the caller before calling us, and re-enabled only after the
+ * caller is done with the pointer.
+ */
+static const char *get_mm_memcg_path(struct mm_struct *mm)
+{
+	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+
+	if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
+		char *buf = this_cpu_ptr(trace_memcg_path);
+
+		cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
+		return buf;
+	}
+	return "";
+}
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
+	do {                                                                   \
+		if (trace_mmap_lock_##type##_enabled()) {                      \
+			get_cpu();                                             \
+			trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),      \
+					       ##__VA_ARGS__);                 \
+			put_cpu();                                             \
+		}                                                              \
+	} while (0)
+
+#else /* !CONFIG_MEMCG */
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
+	trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
+
+#endif /* CONFIG_MEMCG */
+
+/*
+ * Trace calls must be in a separate file, as otherwise there's a circular
+ * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
+ */
+
+void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
+{
+	TRACE_MMAP_LOCK_EVENT(start_locking, mm, write, true);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
+
+void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
+					   bool success)
+{
+	TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
+
+void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
+{
+	TRACE_MMAP_LOCK_EVENT(released, mm, write, true);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_released);
-- 
2.28.0.1011.ga647a8990f-goog
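
The split between the inline helpers in the header and the out-of-line functions in mm/mmap_lock.c can be sketched in userspace as follows. This is only an illustration under stated assumptions: every demo_* name is hypothetical, and a plain bool stands in for the kernel's tracepoint_enabled() check, which is a static key in the real code.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the tracepoint's enabled state. */
static bool demo_tracepoint_enabled;

/* Counts slow-path calls so the effect of the check is observable. */
static int demo_slow_path_calls;

/*
 * Out-of-line slow path. In the patch this role is played by
 * __mmap_lock_do_trace_start_locking() in mm/mmap_lock.c.
 */
void demo_do_trace_start_locking(const void *mm, bool write)
{
	(void)mm;
	(void)write;
	demo_slow_path_calls++;
}

/*
 * Inline fast path, same shape as __mmap_lock_trace_start_locking():
 * when tracing is off, the only cost is the flag test.
 */
static inline void demo_trace_start_locking(const void *mm, bool write)
{
	if (demo_tracepoint_enabled)
		demo_do_trace_start_locking(mm, write);
}

/* Run the wrapper a few times and report how often the slow path ran. */
int demo_run(bool enabled, int iterations)
{
	int dummy_mm = 0;

	demo_tracepoint_enabled = enabled;
	demo_slow_path_calls = 0;
	for (int i = 0; i < iterations; i++)
		demo_trace_start_locking(&dummy_mm, true);
	return demo_slow_path_calls;
}
```

With the flag off, demo_run(false, 3) never reaches the slow path. With it on, demo_run(true, 3) reaches it three times.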



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition
  2020-10-09 22:05   ` Axel Rasmussen
@ 2020-10-09 22:35     ` Michel Lespinasse
  -1 siblings, 0 replies; 24+ messages in thread
From: Michel Lespinasse @ 2020-10-09 22:35 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Vlastimil Babka,
	Daniel Jordan, Laurent Dufour, Jann Horn, Chinwen Chang,
	Yafang Shao, LKML, linux-mm

On Fri, Oct 9, 2020 at 3:05 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> The goal of these tracepoints is to be able to debug lock contention
> issues. This lock is acquired on most (all?) mmap / munmap / page fault
> operations, so a multi-threaded process which does a lot of these can
> experience significant contention.
>
> We trace just before we start acquisition, when the acquisition returns
> (whether it succeeded or not), and when the lock is released (or
> downgraded). The events are broken out by lock type (read / write).
>
> The events are also broken out by memcg path. For container-based
> workloads, users often think of several processes in a memcg as a single
> logical "task", so collecting statistics at this level is useful.
>
> The end goal is to get latency information. This isn't directly included
> in the trace events. Instead, users are expected to compute the time
> between "start locking" and "acquire returned", using e.g. synthetic
> events or BPF. The benefit we get from this is simpler code.
>
> Because we use tracepoint_enabled() to decide whether or not to trace,
> this patch has effectively no overhead unless tracepoints are enabled at
> runtime. If tracepoints are enabled, there is a performance impact, but
> how much depends on exactly what e.g. the BPF program does.
>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

Reviewed-by: Michel Lespinasse <walken@google.com>

Looks good to me, thanks!
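
As a rough illustration of the latency computation the commit message leaves to users, the sketch below pairs each "start locking" event with the next "acquire returned" event from the same task and reports the worst successful wait. It is a userspace sketch, not a synthetic event or BPF program. The demo_* names and the event struct are hypothetical, and keying on a pid is an assumption, since the events themselves carry only mm, memcg_path, write and success.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum demo_event_type {
	DEMO_START_LOCKING,
	DEMO_ACQUIRE_RETURNED,
	DEMO_RELEASED,
};

/* Hypothetical flattened trace record; pid is assumed to come from the tracer. */
struct demo_event {
	uint64_t ts_ns;
	int pid;
	enum demo_event_type type;
	bool write;
	bool success;
};

#define DEMO_MAX_PIDS 64

/*
 * Return the largest interval between "start locking" and a successful
 * "acquire returned" for the same pid, or 0 if there is no such pair.
 * An "acquire returned" with no pending start (as emitted on downgrade)
 * is skipped, as are failed acquisitions.
 */
uint64_t demo_max_acquire_latency_ns(const struct demo_event *ev, size_t n)
{
	uint64_t start_ts[DEMO_MAX_PIDS] = { 0 };
	bool pending[DEMO_MAX_PIDS] = { false };
	uint64_t max = 0;

	for (size_t i = 0; i < n; i++) {
		int pid = ev[i].pid;

		if (pid < 0 || pid >= DEMO_MAX_PIDS)
			continue;

		if (ev[i].type == DEMO_START_LOCKING) {
			start_ts[pid] = ev[i].ts_ns;
			pending[pid] = true;
		} else if (ev[i].type == DEMO_ACQUIRE_RETURNED && pending[pid]) {
			uint64_t delta = ev[i].ts_ns - start_ts[pid];

			pending[pid] = false;
			if (ev[i].success && delta > max)
				max = delta;
		}
	}
	return max;
}
```

Grouping the same intervals by memcg_path instead of taking a single maximum would give the per-container view the commit message describes.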

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition
  2020-10-09 22:05   ` Axel Rasmussen
@ 2020-10-10  5:31     ` Yafang Shao
  -1 siblings, 0 replies; 24+ messages in thread
From: Yafang Shao @ 2020-10-10  5:31 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Jann Horn,
	Chinwen Chang, LKML, Linux MM

On Sat, Oct 10, 2020 at 6:05 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> The goal of these tracepoints is to be able to debug lock contention
> issues. This lock is acquired on most (all?) mmap / munmap / page fault
> operations, so a multi-threaded process which does a lot of these can
> experience significant contention.
>
> We trace just before we start acquisition, when the acquisition returns
> (whether it succeeded or not), and when the lock is released (or
> downgraded). The events are broken out by lock type (read / write).
>
> The events are also broken out by memcg path. For container-based
> workloads, users often think of several processes in a memcg as a single
> logical "task", so collecting statistics at this level is useful.
>
> The end goal is to get latency information. This isn't directly included
> in the trace events. Instead, users are expected to compute the time
> between "start locking" and "acquire returned", using e.g. synthetic
> events or BPF. The benefit we get from this is simpler code.
>
> Because we use tracepoint_enabled() to decide whether or not to trace,
> this patch has effectively no overhead unless tracepoints are enabled at
> runtime. If tracepoints are enabled, there is a performance impact, but
> how much depends on exactly what e.g. the BPF program does.
>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

Acked-by: Yafang Shao <laoar.shao@gmail.com>

> ---
>  include/linux/mmap_lock.h        | 93 ++++++++++++++++++++++++++++++--
>  include/trace/events/mmap_lock.h | 70 ++++++++++++++++++++++++
>  mm/Makefile                      |  2 +-
>  mm/mmap_lock.c                   | 87 ++++++++++++++++++++++++++++++
>  4 files changed, 246 insertions(+), 6 deletions(-)
>  create mode 100644 include/trace/events/mmap_lock.h
>  create mode 100644 mm/mmap_lock.c
>
> diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
> index 0707671851a8..6586b42b4faa 100644
> --- a/include/linux/mmap_lock.h
> +++ b/include/linux/mmap_lock.h
> @@ -1,11 +1,63 @@
>  #ifndef _LINUX_MMAP_LOCK_H
>  #define _LINUX_MMAP_LOCK_H
>
> +#include <linux/lockdep.h>
> +#include <linux/mm_types.h>
>  #include <linux/mmdebug.h>
> +#include <linux/rwsem.h>
> +#include <linux/tracepoint-defs.h>
> +#include <linux/types.h>
>
>  #define MMAP_LOCK_INITIALIZER(name) \
>         .mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),
>
> +DECLARE_TRACEPOINT(mmap_lock_start_locking);
> +DECLARE_TRACEPOINT(mmap_lock_acquire_returned);
> +DECLARE_TRACEPOINT(mmap_lock_released);
> +
> +#ifdef CONFIG_TRACING
> +
> +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write);
> +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
> +                                          bool success);
> +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write);
> +
> +static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm,
> +                                                  bool write)
> +{
> +       if (tracepoint_enabled(mmap_lock_start_locking))
> +               __mmap_lock_do_trace_start_locking(mm, write);
> +}
> +
> +static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm,
> +                                                     bool write, bool success)
> +{
> +       if (tracepoint_enabled(mmap_lock_acquire_returned))
> +               __mmap_lock_do_trace_acquire_returned(mm, write, success);
> +}
> +
> +static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
> +{
> +       if (tracepoint_enabled(mmap_lock_released))
> +               __mmap_lock_do_trace_released(mm, write);
> +}
> +
> +#else /* !CONFIG_TRACING */
> +
> +static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm,
> +                                                  bool write)
> +{
> +}
> +static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm,
> +                                                     bool write, bool success)
> +{
> +}
> +static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
> +{
> +}
> +
> +#endif /* CONFIG_TRACING */
> +
>  static inline void mmap_init_lock(struct mm_struct *mm)
>  {
>         init_rwsem(&mm->mmap_lock);
> @@ -13,58 +65,88 @@ static inline void mmap_init_lock(struct mm_struct *mm)
>
>  static inline void mmap_write_lock(struct mm_struct *mm)
>  {
> +       __mmap_lock_trace_start_locking(mm, true);
>         down_write(&mm->mmap_lock);
> +       __mmap_lock_trace_acquire_returned(mm, true, true);
>  }
>
>  static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
>  {
> +       __mmap_lock_trace_start_locking(mm, true);
>         down_write_nested(&mm->mmap_lock, subclass);
> +       __mmap_lock_trace_acquire_returned(mm, true, true);
>  }
>
>  static inline int mmap_write_lock_killable(struct mm_struct *mm)
>  {
> -       return down_write_killable(&mm->mmap_lock);
> +       int ret;
> +
> +       __mmap_lock_trace_start_locking(mm, true);
> +       ret = down_write_killable(&mm->mmap_lock);
> +       __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
> +       return ret;
>  }
>
>  static inline bool mmap_write_trylock(struct mm_struct *mm)
>  {
> -       return down_write_trylock(&mm->mmap_lock) != 0;
> +       bool ret;
> +
> +       __mmap_lock_trace_start_locking(mm, true);
> +       ret = down_write_trylock(&mm->mmap_lock) != 0;
> +       __mmap_lock_trace_acquire_returned(mm, true, ret);
> +       return ret;
>  }
>
>  static inline void mmap_write_unlock(struct mm_struct *mm)
>  {
>         up_write(&mm->mmap_lock);
> +       __mmap_lock_trace_released(mm, true);
>  }
>
>  static inline void mmap_write_downgrade(struct mm_struct *mm)
>  {
>         downgrade_write(&mm->mmap_lock);
> +       __mmap_lock_trace_acquire_returned(mm, false, true);
>  }
>
>  static inline void mmap_read_lock(struct mm_struct *mm)
>  {
> +       __mmap_lock_trace_start_locking(mm, false);
>         down_read(&mm->mmap_lock);
> +       __mmap_lock_trace_acquire_returned(mm, false, true);
>  }
>
>  static inline int mmap_read_lock_killable(struct mm_struct *mm)
>  {
> -       return down_read_killable(&mm->mmap_lock);
> +       int ret;
> +
> +       __mmap_lock_trace_start_locking(mm, false);
> +       ret = down_read_killable(&mm->mmap_lock);
> +       __mmap_lock_trace_acquire_returned(mm, false, ret == 0);
> +       return ret;
>  }
>
>  static inline bool mmap_read_trylock(struct mm_struct *mm)
>  {
> -       return down_read_trylock(&mm->mmap_lock) != 0;
> +       bool ret;
> +
> +       __mmap_lock_trace_start_locking(mm, false);
> +       ret = down_read_trylock(&mm->mmap_lock) != 0;
> +       __mmap_lock_trace_acquire_returned(mm, false, ret);
> +       return ret;
>  }
>
>  static inline void mmap_read_unlock(struct mm_struct *mm)
>  {
>         up_read(&mm->mmap_lock);
> +       __mmap_lock_trace_released(mm, false);
>  }
>
>  static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
>  {
> -       if (down_read_trylock(&mm->mmap_lock)) {
> +       if (mmap_read_trylock(mm)) {
>                 rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
> +               __mmap_lock_trace_released(mm, false);
>                 return true;
>         }
>         return false;
> @@ -73,6 +155,7 @@ static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
>  static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
>  {
>         up_read_non_owner(&mm->mmap_lock);
> +       __mmap_lock_trace_released(mm, false);
>  }
>
>  static inline void mmap_assert_locked(struct mm_struct *mm)
> diff --git a/include/trace/events/mmap_lock.h b/include/trace/events/mmap_lock.h
> new file mode 100644
> index 000000000000..ca652b52510e
> --- /dev/null
> +++ b/include/trace/events/mmap_lock.h
> @@ -0,0 +1,70 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM mmap_lock
> +
> +#if !defined(_TRACE_MMAP_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_MMAP_LOCK_H
> +
> +#include <linux/tracepoint.h>
> +#include <linux/types.h>
> +
> +struct mm_struct;
> +
> +DECLARE_EVENT_CLASS(
> +       mmap_lock_template,
> +
> +       TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
> +               bool success),
> +
> +       TP_ARGS(mm, memcg_path, write, success),
> +
> +       TP_STRUCT__entry(
> +               __field(struct mm_struct *, mm)
> +               __string(memcg_path, memcg_path)
> +               __field(bool, write)
> +               __field(bool, success)
> +       ),
> +
> +       TP_fast_assign(
> +               __entry->mm = mm;
> +               __assign_str(memcg_path, memcg_path);
> +               __entry->write = write;
> +               __entry->success = success;
> +       ),
> +
> +       TP_printk(
> +               "mm=%p memcg_path=%s write=%s success=%s\n",
> +               __entry->mm,
> +               __get_str(memcg_path),
> +               __entry->write ? "true" : "false",
> +               __entry->success ? "true" : "false")
> +       );
> +
> +DEFINE_EVENT(mmap_lock_template, mmap_lock_start_locking,
> +
> +       TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
> +               bool success),
> +
> +       TP_ARGS(mm, memcg_path, write, success)
> +);
> +
> +DEFINE_EVENT(mmap_lock_template, mmap_lock_acquire_returned,
> +
> +       TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
> +               bool success),
> +
> +       TP_ARGS(mm, memcg_path, write, success)
> +);
> +
> +DEFINE_EVENT(mmap_lock_template, mmap_lock_released,
> +
> +       TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
> +               bool success),
> +
> +       TP_ARGS(mm, memcg_path, write, success)
> +);
> +
> +#endif /* _TRACE_MMAP_LOCK_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>
> diff --git a/mm/Makefile b/mm/Makefile
> index d5649f1c12c0..1a7ea212fd8b 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -52,7 +52,7 @@ obj-y                 := filemap.o mempool.o oom_kill.o fadvise.o \
>                            mm_init.o percpu.o slab_common.o \
>                            compaction.o vmacache.o \
>                            interval_tree.o list_lru.o workingset.o \
> -                          debug.o gup.o $(mmu-y)
> +                          debug.o gup.o mmap_lock.o $(mmu-y)
>
>  # Give 'page_alloc' its own module-parameter namespace
>  page-alloc-y := page_alloc.o
> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> new file mode 100644
> index 000000000000..b849287bd12a
> --- /dev/null
> +++ b/mm/mmap_lock.c
> @@ -0,0 +1,87 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/mmap_lock.h>
> +
> +#include <linux/mm.h>
> +#include <linux/cgroup.h>
> +#include <linux/memcontrol.h>
> +#include <linux/mmap_lock.h>
> +#include <linux/percpu.h>
> +#include <linux/smp.h>
> +#include <linux/trace_events.h>
> +
> +/*
> + * We have to export these, as drivers use mmap_lock, and our inline functions
> + * in the header check if the tracepoint is enabled. They can't be GPL, as e.g.
> + * the nvidia driver is an existing caller of this code.
> + */
> +EXPORT_SYMBOL(__tracepoint_mmap_lock_start_locking);
> +EXPORT_SYMBOL(__tracepoint_mmap_lock_acquire_returned);
> +EXPORT_SYMBOL(__tracepoint_mmap_lock_released);
> +
> +#ifdef CONFIG_MEMCG
> +
> +DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
> +
> +/*
> + * Write the given mm_struct's memcg path to a percpu buffer, and return a
> + * pointer to it. If the path cannot be determined, the buffer will contain the
> + * empty string.
> + *
> + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
> + * disabled by the caller before calling us, and re-enabled only after the
> + * caller is done with the pointer.
> + */
> +static const char *get_mm_memcg_path(struct mm_struct *mm)
> +{
> +       struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> +
> +       if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
> +               char *buf = this_cpu_ptr(trace_memcg_path);
> +
> +               cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
> +               return buf;
> +       }
> +       return "";
> +}
> +
> +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
> +       do {                                                                   \
> +               if (trace_mmap_lock_##type##_enabled()) {                      \
> +                       get_cpu();                                             \
> +                       trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),      \
> +                                              ##__VA_ARGS__);                 \
> +                       put_cpu();                                             \
> +               }                                                              \
> +       } while (0)
> +
> +#else /* !CONFIG_MEMCG */
> +
> +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
> +       trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
> +
> +#endif /* CONFIG_MEMCG */
> +
> +/*
> + * Trace calls must be in a separate file, as otherwise there's a circular
> + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
> + */
> +
> +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
> +{
> +       TRACE_MMAP_LOCK_EVENT(start_locking, mm, write, true);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
> +
> +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
> +                                          bool success)
> +{
> +       TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
> +
> +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
> +{
> +       TRACE_MMAP_LOCK_EVENT(released, mm, write, true);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
> --
> 2.28.0.1011.ga647a8990f-goog
>


-- 
Thanks
Yafang
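
One detail in the quoted wrappers is that the "success" flag is derived from two different return conventions: the killable variants return 0 on success and a negative errno otherwise, while the trylock variants return non-zero on success. A minimal userspace sketch of that normalization, with hypothetical demo_* helper names that do not appear in the patch:

```c
#include <errno.h>
#include <stdbool.h>

/* Killable-style result: 0 means the lock was taken, e.g. -EINTR means it was not. */
bool demo_success_from_killable(int ret)
{
	return ret == 0;
}

/* Trylock-style result: non-zero means the lock was taken, 0 means contended. */
bool demo_success_from_trylock(int ret)
{
	return ret != 0;
}
```

This mirrors the "ret == 0" and "!= 0" expressions passed to __mmap_lock_trace_acquire_returned() in the quoted diff.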

^ permalink raw reply	[flat|nested] 24+ messages in thread

> +/*
> + * Trace calls must be in a separate file, as otherwise there's a circular
> + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
> + */
> +
> +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
> +{
> +       TRACE_MMAP_LOCK_EVENT(start_locking, mm, write, true);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
> +
> +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
> +                                          bool success)
> +{
> +       TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
> +
> +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
> +{
> +       TRACE_MMAP_LOCK_EVENT(released, mm, write, true);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
> --
> 2.28.0.1011.ga647a8990f-goog
>


-- 
Thanks
Yafang


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/2] tracing: support "bool" type in synthetic trace events
  2020-10-09 22:05   ` Axel Rasmussen
  (?)
@ 2020-10-12 14:15   ` Steven Rostedt
  2020-10-12 14:26       ` Tom Zanussi
  -1 siblings, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2020-10-12 14:15 UTC (permalink / raw)
  To: Axel Rasmussen, Tom Zanussi
  Cc: Ingo Molnar, Andrew Morton, Michel Lespinasse, Vlastimil Babka,
	Daniel Jordan, Laurent Dufour, Jann Horn, Chinwen Chang,
	Yafang Shao, linux-kernel, linux-mm


Tom,

Can you ack this patch for me?

-- Steve


On Fri,  9 Oct 2020 15:05:23 -0700
Axel Rasmussen <axelrasmussen@google.com> wrote:

> It's common [1] to define tracepoint fields as "bool" when they contain
> a true / false value. Currently, defining a synthetic event with a
> "bool" field yields EINVAL. It's possible to work around this by using
> e.g. u8 (assuming sizeof(bool) is 1, and bool is unsigned; if either of
> these properties doesn't hold, you get EINVAL [2]).
> 
> Supporting "bool" explicitly makes hooking this up easier and more
> portable for userspace.
> 
> [1]: grep -r "bool" include/trace/events/
> [2]: check_synth_field() in kernel/trace/trace_events_hist.c
> 
> Acked-by: Michel Lespinasse <walken@google.com>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
>  kernel/trace/trace_events_synth.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
> index 8e1974fbad0e..8f94c84349a6 100644
> --- a/kernel/trace/trace_events_synth.c
> +++ b/kernel/trace/trace_events_synth.c
> @@ -234,6 +234,8 @@ static int synth_field_size(char *type)
>  		size = sizeof(long);
>  	else if (strcmp(type, "unsigned long") == 0)
>  		size = sizeof(unsigned long);
> +	else if (strcmp(type, "bool") == 0)
> +		size = sizeof(bool);
>  	else if (strcmp(type, "pid_t") == 0)
>  		size = sizeof(pid_t);
>  	else if (strcmp(type, "gfp_t") == 0)
> @@ -276,6 +278,8 @@ static const char *synth_field_fmt(char *type)
>  		fmt = "%ld";
>  	else if (strcmp(type, "unsigned long") == 0)
>  		fmt = "%lu";
> +	else if (strcmp(type, "bool") == 0)
> +		fmt = "%d";
>  	else if (strcmp(type, "pid_t") == 0)
>  		fmt = "%d";
>  	else if (strcmp(type, "gfp_t") == 0)
> --
> 2.28.0.1011.ga647a8990f-goog
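
For readers skimming the diff above, here is a minimal userspace sketch of
the same idea: a type-name lookup that learns one extra name, "bool". It is
not the kernel code. The sketch_* function names are made up for
illustration, only three type names are handled, and returning 0 / NULL for
an unknown name stands in for the EINVAL the commit message mentions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical userspace stand-in for a type-name -> size lookup.
 * Returns the size of the named field type, or 0 if the name is unknown.
 */
size_t sketch_field_size(const char *type)
{
	if (strcmp(type, "long") == 0)
		return sizeof(long);
	if (strcmp(type, "unsigned long") == 0)
		return sizeof(unsigned long);
	if (strcmp(type, "bool") == 0)	/* the newly accepted name */
		return sizeof(bool);
	return 0;
}

/*
 * Hypothetical companion lookup: the printf-style format used to render
 * a field of the named type, or NULL if the name is unknown.
 */
const char *sketch_field_fmt(const char *type)
{
	if (strcmp(type, "long") == 0)
		return "%ld";
	if (strcmp(type, "unsigned long") == 0)
		return "%lu";
	if (strcmp(type, "bool") == 0)	/* rendered as an integer */
		return "%d";
	return NULL;
}
```

With the "bool" branch present, userspace no longer has to guess whether u8
happens to match the size and signedness of bool on a given build.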


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/2] tracing: support "bool" type in synthetic trace events
  2020-10-12 14:15   ` Steven Rostedt
@ 2020-10-12 14:26       ` Tom Zanussi
  0 siblings, 0 replies; 24+ messages in thread
From: Tom Zanussi @ 2020-10-12 14:26 UTC (permalink / raw)
  To: Steven Rostedt, Axel Rasmussen
  Cc: Ingo Molnar, Andrew Morton, Michel Lespinasse, Vlastimil Babka,
	Daniel Jordan, Laurent Dufour, Jann Horn, Chinwen Chang,
	Yafang Shao, linux-kernel, linux-mm

Hi Steve,

Looks ok to me.

Acked-by: Tom Zanussi <zanussi@kernel.org>

Thanks,

Tom


On Mon, 2020-10-12 at 10:15 -0400, Steven Rostedt wrote:
> Tom,
> 
> Can you ack this patch for me?
> 
> -- Steve
> 
> 
> On Fri,  9 Oct 2020 15:05:23 -0700
> Axel Rasmussen <axelrasmussen@google.com> wrote:
> 
> > It's common [1] to define tracepoint fields as "bool" when they contain
> > a true / false value. Currently, defining a synthetic event with a
> > "bool" field yields EINVAL. It's possible to work around this by using
> > e.g. u8 (assuming sizeof(bool) is 1, and bool is unsigned; if either of
> > these properties doesn't hold, you get EINVAL [2]).
> > 
> > Supporting "bool" explicitly makes hooking this up easier and more
> > portable for userspace.
> > 
> > [1]: grep -r "bool" include/trace/events/
> > [2]: check_synth_field() in kernel/trace/trace_events_hist.c
> > 
> > Acked-by: Michel Lespinasse <walken@google.com>
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> > ---
> >  kernel/trace/trace_events_synth.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
> > index 8e1974fbad0e..8f94c84349a6 100644
> > --- a/kernel/trace/trace_events_synth.c
> > +++ b/kernel/trace/trace_events_synth.c
> > @@ -234,6 +234,8 @@ static int synth_field_size(char *type)
> >  		size = sizeof(long);
> >  	else if (strcmp(type, "unsigned long") == 0)
> >  		size = sizeof(unsigned long);
> > +	else if (strcmp(type, "bool") == 0)
> > +		size = sizeof(bool);
> >  	else if (strcmp(type, "pid_t") == 0)
> >  		size = sizeof(pid_t);
> >  	else if (strcmp(type, "gfp_t") == 0)
> > @@ -276,6 +278,8 @@ static const char *synth_field_fmt(char *type)
> >  		fmt = "%ld";
> >  	else if (strcmp(type, "unsigned long") == 0)
> >  		fmt = "%lu";
> > +	else if (strcmp(type, "bool") == 0)
> > +		fmt = "%d";
> >  	else if (strcmp(type, "pid_t") == 0)
> >  		fmt = "%d";
> >  	else if (strcmp(type, "gfp_t") == 0)
> > --
> > 2.28.0.1011.ga647a8990f-goog
> 
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/2] tracing: support "bool" type in synthetic trace events
  2020-10-12 14:26       ` Tom Zanussi
@ 2020-10-12 14:46         ` Steven Rostedt
  -1 siblings, 0 replies; 24+ messages in thread
From: Steven Rostedt @ 2020-10-12 14:46 UTC (permalink / raw)
  To: Tom Zanussi
  Cc: Axel Rasmussen, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Jann Horn,
	Chinwen Chang, Yafang Shao, linux-kernel, linux-mm

On Mon, 12 Oct 2020 09:26:13 -0500
Tom Zanussi <zanussi@kernel.org> wrote:

> Hi Steve,
> 
> Looks ok to me.
> 
> Acked-by: Tom Zanussi <zanussi@kernel.org>

Great!

I'll pull this patch into my tree. It doesn't look like patch 2/2 is
dependent on this and these two can go through different trees.

Is everyone OK if I take this patch through my tree?

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/2] tracing: support "bool" type in synthetic trace events
  2020-10-12 14:46         ` Steven Rostedt
@ 2020-10-12 16:23           ` Axel Rasmussen
  -1 siblings, 0 replies; 24+ messages in thread
From: Axel Rasmussen @ 2020-10-12 16:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Tom Zanussi, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Jann Horn,
	Chinwen Chang, Yafang Shao, LKML, Linux MM

On Mon, Oct 12, 2020 at 7:46 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 12 Oct 2020 09:26:13 -0500
> Tom Zanussi <zanussi@kernel.org> wrote:
>
> > Hi Steve,
> >
> > Looks ok to me.
> >
> > Acked-by: Tom Zanussi <zanussi@kernel.org>
>
> Great!
>
> I'll pull this patch into my tree. It doesn't look like patch 2/2 is
> dependent on this and these two can go through different trees.
>
> Is everyone OK if I take this patch through my tree?

Sounds good to me. You're right that there is no compile-time
dependency between the two patches.

Thanks!

>
> -- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 1/2] tracing: support "bool" type in synthetic trace events
  2020-10-09 22:05   ` Axel Rasmussen
@ 2020-10-13 19:41     ` David Rientjes
  -1 siblings, 0 replies; 24+ messages in thread
From: David Rientjes @ 2020-10-13 19:41 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Jann Horn,
	Chinwen Chang, Yafang Shao, linux-kernel, linux-mm

On Fri, 9 Oct 2020, Axel Rasmussen wrote:

> It's common [1] to define tracepoint fields as "bool" when they contain
> a true / false value. Currently, defining a synthetic event with a
> "bool" field yields EINVAL. It's possible to work around this by using
> e.g. u8 (assuming sizeof(bool) is 1, and bool is unsigned; if either of
> these properties doesn't hold, you get EINVAL [2]).
> 
> Supporting "bool" explicitly makes hooking this up easier and more
> portable for userspace.
> 
> [1]: grep -r "bool" include/trace/events/
> [2]: check_synth_field() in kernel/trace/trace_events_hist.c
> 
> Acked-by: Michel Lespinasse <walken@google.com>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition
  2020-10-09 22:05   ` Axel Rasmussen
@ 2020-10-13 19:42     ` David Rientjes
  -1 siblings, 0 replies; 24+ messages in thread
From: David Rientjes @ 2020-10-13 19:42 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Vlastimil Babka, Daniel Jordan, Laurent Dufour, Jann Horn,
	Chinwen Chang, Yafang Shao, linux-kernel, linux-mm

On Fri, 9 Oct 2020, Axel Rasmussen wrote:

> The goal of these tracepoints is to be able to debug lock contention
> issues. This lock is acquired on most (all?) mmap / munmap / page fault
> operations, so a multi-threaded process which does a lot of these can
> experience significant contention.
> 
> We trace just before we start acquisition, when the acquisition returns
> (whether it succeeded or not), and when the lock is released (or
> downgraded). The events are broken out by lock type (read / write).
> 
> The events are also broken out by memcg path. For container-based
> workloads, users often think of several processes in a memcg as a single
> logical "task", so collecting statistics at this level is useful.
> 
> The end goal is to get latency information. This isn't directly included
> in the trace events. Instead, users are expected to compute the time
> between "start locking" and "acquire returned", using e.g. synthetic
> events or BPF. The benefit we get from this is simpler code.
> 
> Because we use tracepoint_enabled() to decide whether or not to trace,
> this patch has effectively no overhead unless tracepoints are enabled at
> runtime. If tracepoints are enabled, there is a performance impact, but
> how much depends on exactly what e.g. the BPF program does.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

Acked-by: David Rientjes <rientjes@google.com>
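
As an illustration of the "compute the time between start locking and
acquire returned" step described in the commit message, here is a small
userspace sketch in C. It is only a model of the arithmetic a synthetic
event or BPF program would do. The struct layout, the enum, and the
function name are all hypothetical and do not reflect the real trace
record format; the events are assumed to belong to one task and to be
sorted by timestamp.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical event kinds; names are illustrative only. */
enum sketch_event_kind { SKETCH_START_LOCKING, SKETCH_ACQUIRE_RETURNED };

/* Hypothetical event record, not the real trace format. */
struct sketch_event {
	enum sketch_event_kind kind;
	uint64_t timestamp_ns;
	bool success;	/* only meaningful for SKETCH_ACQUIRE_RETURNED */
};

/*
 * Walk one task's events in timestamp order and return the longest
 * interval between a "start locking" event and the next "acquire
 * returned" event, whether or not the acquisition succeeded.
 * Returns 0 if no complete pair is found.
 */
uint64_t sketch_max_acquire_latency_ns(const struct sketch_event *events,
				       int count)
{
	uint64_t start_ns = 0, max_ns = 0;
	bool have_start = false;

	for (int i = 0; i < count; i++) {
		if (events[i].kind == SKETCH_START_LOCKING) {
			start_ns = events[i].timestamp_ns;
			have_start = true;
		} else if (have_start) {
			uint64_t delta = events[i].timestamp_ns - start_ns;

			if (delta > max_ns)
				max_ns = delta;
			have_start = false;
		}
	}
	return max_ns;
}
```

A trailing "start locking" with no matching return is simply ignored, which
is one reasonable choice for a trace that was cut off mid-acquisition.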

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition
  2020-10-09 22:05   ` Axel Rasmussen
                     ` (3 preceding siblings ...)
  (?)
@ 2020-10-20 14:50   ` Vlastimil Babka
  2020-10-20 18:17       ` Axel Rasmussen
  -1 siblings, 1 reply; 24+ messages in thread
From: Vlastimil Babka @ 2020-10-20 14:50 UTC (permalink / raw)
  To: Axel Rasmussen, Steven Rostedt, Ingo Molnar, Andrew Morton,
	Michel Lespinasse, Daniel Jordan, Laurent Dufour, Jann Horn,
	Chinwen Chang
  Cc: Yafang Shao, linux-kernel, linux-mm

On 10/10/20 12:05 AM, Axel Rasmussen wrote:
> The goal of these tracepoints is to be able to debug lock contention
> issues. This lock is acquired on most (all?) mmap / munmap / page fault
> operations, so a multi-threaded process which does a lot of these can
> experience significant contention.
> 
> We trace just before we start acquisition, when the acquisition returns
> (whether it succeeded or not), and when the lock is released (or
> downgraded). The events are broken out by lock type (read / write).
> 
> The events are also broken out by memcg path. For container-based
> workloads, users often think of several processes in a memcg as a single
> logical "task", so collecting statistics at this level is useful.
> 
> The end goal is to get latency information. This isn't directly included
> in the trace events. Instead, users are expected to compute the time
> between "start locking" and "acquire returned", using e.g. synthetic
> events or BPF. The benefit we get from this is simpler code.
> 
> Because we use tracepoint_enabled() to decide whether or not to trace,
> this patch has effectively no overhead unless tracepoints are enabled at
> runtime. If tracepoints are enabled, there is a performance impact, but
> how much depends on exactly what e.g. the BPF program does.
> 
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>

Yeah I agree with this approach that follows the page ref one.

...

> diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> new file mode 100644
> index 000000000000..b849287bd12a
> --- /dev/null
> +++ b/mm/mmap_lock.c
> @@ -0,0 +1,87 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/mmap_lock.h>
> +
> +#include <linux/mm.h>
> +#include <linux/cgroup.h>
> +#include <linux/memcontrol.h>
> +#include <linux/mmap_lock.h>
> +#include <linux/percpu.h>
> +#include <linux/smp.h>
> +#include <linux/trace_events.h>
> +
> +/*
> + * We have to export these, as drivers use mmap_lock, and our inline functions
> + * in the header check if the tracepoint is enabled. They can't be GPL, as e.g.
> + * the nvidia driver is an existing caller of this code.

I don't think this argument works in the kernel community. I would just remove 
this comment.

> + */
> +EXPORT_SYMBOL(__tracepoint_mmap_lock_start_locking);
> +EXPORT_SYMBOL(__tracepoint_mmap_lock_acquire_returned);
> +EXPORT_SYMBOL(__tracepoint_mmap_lock_released);

You can use EXPORT_TRACEPOINT_SYMBOL() here.

> +#ifdef CONFIG_MEMCG
> +
> +DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
> +
> +/*
> + * Write the given mm_struct's memcg path to a percpu buffer, and return a
> + * pointer to it. If the path cannot be determined, the buffer will contain the
> + * empty string.
> + *
> + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
> + * disabled by the caller before calling us, and re-enabled only after the
> + * caller is done with the pointer.
> + */
> +static const char *get_mm_memcg_path(struct mm_struct *mm)
> +{
> +	struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> +
> +	if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
> +		char *buf = this_cpu_ptr(trace_memcg_path);
> +
> +		cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
> +		return buf;
> +	}
> +	return "";
> +}
> +
> +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
> +	do {                                                                   \
> +		if (trace_mmap_lock_##type##_enabled()) {                      \

Is this check really needed? We only get called from the functions inlined in
the .h file because tracepoint_enabled() was true in the first place, so this
seems redundant.

> +			get_cpu();                                             \
> +			trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),      \
> +					       ##__VA_ARGS__);                 \
> +			put_cpu();                                             \
> +		}                                                              \
> +	} while (0)
> +
> +#else /* !CONFIG_MEMCG */
> +
> +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
> +	trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
> +
> +#endif /* CONFIG_MEMCG */
> +
> +/*
> + * Trace calls must be in a separate file, as otherwise there's a circular
> + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
> + */
> +
> +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
> +{
> +	TRACE_MMAP_LOCK_EVENT(start_locking, mm, write, true);

Seems wasteful to have an always-true success field here. Yeah, not reusing the 
same event class for all three tracepoints means more code, but for tracing 
efficiency it's worth it, IMHO.

> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
> +
> +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
> +					   bool success)
> +{
> +	TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
> +
> +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
> +{
> +	TRACE_MMAP_LOCK_EVENT(released, mm, write, true);

Ditto.

> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition
  2020-10-20 14:50   ` Vlastimil Babka
@ 2020-10-20 18:17       ` Axel Rasmussen
  0 siblings, 0 replies; 24+ messages in thread
From: Axel Rasmussen @ 2020-10-20 18:17 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Steven Rostedt, Ingo Molnar, Andrew Morton, Michel Lespinasse,
	Daniel Jordan, Laurent Dufour, Jann Horn, Chinwen Chang,
	Yafang Shao, LKML, Linux MM

On Tue, Oct 20, 2020 at 7:50 AM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> On 10/10/20 12:05 AM, Axel Rasmussen wrote:
> > The goal of these tracepoints is to be able to debug lock contention
> > issues. This lock is acquired on most (all?) mmap / munmap / page fault
> > operations, so a multi-threaded process which does a lot of these can
> > experience significant contention.
> >
> > We trace just before we start acquisition, when the acquisition returns
> > (whether it succeeded or not), and when the lock is released (or
> > downgraded). The events are broken out by lock type (read / write).
> >
> > The events are also broken out by memcg path. For container-based
> > workloads, users often think of several processes in a memcg as a single
> > logical "task", so collecting statistics at this level is useful.
> >
> > The end goal is to get latency information. This isn't directly included
> > in the trace events. Instead, users are expected to compute the time
> > between "start locking" and "acquire returned", using e.g. synthetic
> > events or BPF. The benefit we get from this is simpler code.
> >
> > Because we use tracepoint_enabled() to decide whether or not to trace,
> > this patch has effectively no overhead unless tracepoints are enabled at
> > runtime. If tracepoints are enabled, there is a performance impact, but
> > how much depends on exactly what e.g. the BPF program does.
> >
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
>
> Yeah I agree with this approach that follows the page ref one.
>
> ...
>
> > diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
> > new file mode 100644
> > index 000000000000..b849287bd12a
> > --- /dev/null
> > +++ b/mm/mmap_lock.c
> > @@ -0,0 +1,87 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#define CREATE_TRACE_POINTS
> > +#include <trace/events/mmap_lock.h>
> > +
> > +#include <linux/mm.h>
> > +#include <linux/cgroup.h>
> > +#include <linux/memcontrol.h>
> > +#include <linux/mmap_lock.h>
> > +#include <linux/percpu.h>
> > +#include <linux/smp.h>
> > +#include <linux/trace_events.h>
> > +
> > +/*
> > + * We have to export these, as drivers use mmap_lock, and our inline functions
> > + * in the header check if the tracepoint is enabled. They can't be GPL, as e.g.
> > + * the nvidia driver is an existing caller of this code.
>
> I don't think this argument works in the kernel community. I would just remove
> this comment.
>
> > + */
> > +EXPORT_SYMBOL(__tracepoint_mmap_lock_start_locking);
> > +EXPORT_SYMBOL(__tracepoint_mmap_lock_acquire_returned);
> > +EXPORT_SYMBOL(__tracepoint_mmap_lock_released);
>
> You can use EXPORT_TRACEPOINT_SYMBOL() here.

This is simpler, thanks for the pointer!

Agreed, the comment isn't needed in this case. I added it mainly because
checkpatch.pl complains when EXPORT_SYMBOL() is applied to something not
defined just above, but EXPORT_TRACEPOINT_SYMBOL() won't raise the same
concern.

>
> > +#ifdef CONFIG_MEMCG
> > +
> > +DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
> > +
> > +/*
> > + * Write the given mm_struct's memcg path to a percpu buffer, and return a
> > + * pointer to it. If the path cannot be determined, the buffer will contain the
> > + * empty string.
> > + *
> > + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
> > + * disabled by the caller before calling us, and re-enabled only after the
> > + * caller is done with the pointer.
> > + */
> > +static const char *get_mm_memcg_path(struct mm_struct *mm)
> > +{
> > +     struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> > +
> > +     if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
> > +             char *buf = this_cpu_ptr(trace_memcg_path);
> > +
> > +             cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
> > +             return buf;
> > +     }
> > +     return "";
> > +}
> > +
> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
> > +     do {                                                                   \
> > +             if (trace_mmap_lock_##type##_enabled()) {                      \
>
> Is this check really needed? We only get called from the functions inlined in
> the .h file because tracepoint_enabled() was true in the first place, so this
> seems redundant.

Right, now that we've moved the check into the header, this isn't needed.
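
To make the resulting structure concrete, here is a minimal userspace
sketch of the split being discussed: the inline wrapper does the "is
tracing on?" check, and the out-of-line function does the work without
checking again. Everything here is a stand-in. The sketch_* names, the
plain bool flag, and the counter are hypothetical; the real code uses
tracepoint_enabled() and the generated trace_mmap_lock_*() calls.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: a runtime flag and a count of emitted events. */
bool sketch_tracepoint_enabled;
int sketch_events_emitted;

/*
 * Out-of-line slow path. It is only reached when the caller has already
 * seen the flag set, so it does not test the flag a second time.
 */
void sketch_do_trace_start_locking(void)
{
	sketch_events_emitted++;
}

/*
 * Inline fast path, as a lock wrapper in a header would have it. When
 * tracing is off this costs one test and no function call.
 */
static inline void sketch_trace_start_locking(void)
{
	if (sketch_tracepoint_enabled)
		sketch_do_trace_start_locking();
}
```

Keeping the single check in the inline wrapper is what lets the lock
wrappers stay inlined while the heavier work lives in a separate file.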

>
> > +                     get_cpu();                                             \
> > +                     trace_mmap_lock_##type(mm, get_mm_memcg_path(mm),      \
> > +                                            ##__VA_ARGS__);                 \
> > +                     put_cpu();                                             \
> > +             }                                                              \
> > +     } while (0)
> > +
> > +#else /* !CONFIG_MEMCG */
> > +
> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...)                                   \
> > +     trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
> > +
> > +#endif /* CONFIG_MEMCG */
> > +
> > +/*
> > + * Trace calls must be in a separate file, as otherwise there's a circular
> > + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
> > + */
> > +
> > +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
> > +{
> > +     TRACE_MMAP_LOCK_EVENT(start_locking, mm, write, true);
>
> Seems wasteful to have an always-true success field here. Yeah, not reusing the
> same event class for all three tracepoints means more code, but for tracing
> efficiency it's worth it, IMHO.

Right, originally I was worried about code size. But after switching to
separate events instead of re-using one event class, I only measure an
increase of 524 bytes in .text, which seems trivial.

I'll send a v4 with all of the above changes.
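
As a rough picture of what dropping the shared event class means for the
recorded data, here is a userspace C sketch. The struct and function names
are hypothetical, and the real events are defined with the TRACE_EVENT
machinery rather than hand-written structs. The point is only that
start_locking and released no longer carry a success field, while
acquire_returned still does.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical record for start_locking and released: no success field. */
struct fake_lock_event {
	const void *mm;
	const char *memcg_path;
	bool write;
};

/* Hypothetical record for acquire_returned, the only event that can fail. */
struct fake_acquire_event {
	const void *mm;
	const char *memcg_path;
	bool write;
	bool success;
};

/* Format a start_locking/released record; returns snprintf()'s result. */
static int fake_format_lock_event(char *buf, size_t len,
				  const struct fake_lock_event *ev)
{
	return snprintf(buf, len, "memcg_path=%s write=%s",
			ev->memcg_path, ev->write ? "true" : "false");
}

/* Format an acquire_returned record, including the success field. */
static int fake_format_acquire_event(char *buf, size_t len,
				     const struct fake_acquire_event *ev)
{
	return snprintf(buf, len, "memcg_path=%s write=%s success=%s",
			ev->memcg_path, ev->write ? "true" : "false",
			ev->success ? "true" : "false");
}
```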

>
> > +}
> > +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
> > +
> > +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
> > +                                        bool success)
> > +{
> > +     TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
> > +}
> > +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
> > +
> > +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
> > +{
> > +     TRACE_MMAP_LOCK_EVENT(released, mm, write, true);
>
> Ditto.
>
> > +}
> > +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
> >
>


end of thread, other threads:[~2020-10-20 18:18 UTC | newest]

Thread overview: 24+ messages
2020-10-09 22:05 [PATCH v3 0/2] Add tracepoints around mmap_lock acquisition Axel Rasmussen
2020-10-09 22:05 ` Axel Rasmussen
2020-10-09 22:05 ` [PATCH v3 1/2] tracing: support "bool" type in synthetic trace events Axel Rasmussen
2020-10-09 22:05   ` Axel Rasmussen
2020-10-12 14:15   ` Steven Rostedt
2020-10-12 14:26     ` Tom Zanussi
2020-10-12 14:26       ` Tom Zanussi
2020-10-12 14:46       ` Steven Rostedt
2020-10-12 14:46         ` Steven Rostedt
2020-10-12 16:23         ` Axel Rasmussen
2020-10-12 16:23           ` Axel Rasmussen
2020-10-13 19:41   ` David Rientjes
2020-10-13 19:41     ` David Rientjes
2020-10-09 22:05 ` [PATCH v3 2/2] mmap_lock: add tracepoints around lock acquisition Axel Rasmussen
2020-10-09 22:05   ` Axel Rasmussen
2020-10-09 22:35   ` Michel Lespinasse
2020-10-09 22:35     ` Michel Lespinasse
2020-10-10  5:31   ` Yafang Shao
2020-10-10  5:31     ` Yafang Shao
2020-10-13 19:42   ` David Rientjes
2020-10-13 19:42     ` David Rientjes
2020-10-20 14:50   ` Vlastimil Babka
2020-10-20 18:17     ` Axel Rasmussen
2020-10-20 18:17       ` Axel Rasmussen
