* [PATCH V4 1/6] perf: Save PMU specific data in task_struct
@ 2021-05-19 15:06 kan.liang
  2021-05-19 15:06 ` [PATCH V4 2/6] perf: attach/detach PMU specific data kan.liang
                   ` (5 more replies)
  0 siblings, 6 replies; 8+ messages in thread
From: kan.liang @ 2021-05-19 15:06 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, linux-kernel
  Cc: eranian, vitaly.slobodskoy, namhyung, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Some PMU specific data has to be saved/restored during context switches,
e.g. the LBR call stack data. Currently, the data is saved in the event
context structure, but only for per-process events. For system-wide
events, the LBR call stack data is lost across context switches, so the
LBR call stacks are always shorter in comparison to per-process mode.

For example,
  Per-process mode:
  $perf record --call-graph lbr -- taskset -c 0 ./tchain_edit

  -   99.90%    99.86%  tchain_edit  tchain_edit       [.] f3
       99.86% _start
          __libc_start_main
          generic_start_main
          main
          f1
        - f2
             f3

  System-wide mode:
  $perf record --call-graph lbr -a -- taskset -c 0 ./tchain_edit

  -   99.88%    99.82%  tchain_edit  tchain_edit        [.] f3
   - 62.02% main
        f1
        f2
        f3
   - 28.83% f1
      - f2
        f3
   - 28.83% f1
      - f2
           f3
   - 8.88% generic_start_main
        main
        f1
        f2
        f3

It isn't practical to simply allocate the data for system-wide events in
the CPU context structure for all tasks, since there is no way to know
which CPU a task will be scheduled to. The LBR data would have to be
duplicated in every CPU context structure, which is a huge waste;
otherwise, the LBR data would still be lost if the task is scheduled to
another CPU.

Save the PMU specific data in task_struct instead. The size of the PMU
specific data for the LBR call stack is 788 bytes. Usually, the overall
number of threads doesn't exceed a few thousand. For 10K threads, keeping
the LBR data would consume an additional ~8MB. The additional space is
only allocated during LBR call stack monitoring and is released when the
monitoring finishes.
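
For reference, the arithmetic behind the ~8MB estimate:

  788 bytes/task * 10,000 tasks = 7,880,000 bytes ~= 7.5 MiB (~8MB)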

Furthermore, moving task_ctx_data from perf_event_context to task_struct
reduces complexity and makes things clearer. E.g. perf no longer needs to
swap task_ctx_data on the optimized context switch path.
This patch set is just the first step. There could be other
optimizations/extensions on top of it. E.g. for cgroup profiling, perf
only needs to save/store the LBR call stack information for tasks in a
specific cgroup, which could reduce the additional space. Also, the LBR
call stack could be made available for software events, or even enable
debugging use cases, like examining the LBRs after a crash.

The data can be shared among events. To sync the writers of the
perf_ctx_data RCU pointer, add a lock in task_struct as well.

The kmem cache of the PMU specific data is saved in struct perf_ctx_data.
It's required when a child task allocates the space.
The refcount in struct perf_ctx_data is used to track the users of the
PMU specific data.

Reviewed-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---

The V3 can be found here.
https://lore.kernel.org/lkml/1578495789-95006-1-git-send-email-kan.liang@linux.intel.com/

Changes since V3:
- Rebase for the Arch LBR. Use Kmem cache to replace the data_size.

Changes since V2:
- Cannot use mutex inside rcu_read_lock().
  Restore the pin lock perf_ctx_data_lock


 include/linux/perf_event.h | 28 ++++++++++++++++++++++++++++
 include/linux/sched.h      |  4 ++++
 kernel/events/core.c       |  2 ++
 3 files changed, 34 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f5a6a2f..d46b7e1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -851,6 +851,34 @@ struct perf_event_context {
 	struct rcu_head			rcu_head;
 };
 
+/**
+ * struct perf_ctx_data - PMU specific data for a task
+ * @rcu_head:  To avoid races when freeing the PMU specific data
+ * @refcount:  To track users
+ * @ctx_cache: Kmem cache of PMU specific data
+ * @data:      PMU specific data
+ *
+ * Currently, the struct is only used in Intel LBR call stack mode to
+ * save/restore the call stack of a task on context switches.
+ * The data is only allocated when Intel LBR call stack mode is enabled.
+ * The data will be freed when the mode is disabled. The rcu_head is
+ * used to prevent races when freeing the data.
+ * The content of the data will only be accessed on context switches,
+ * which should be protected by rcu_read_lock().
+ *
+ * Careful: struct perf_ctx_data is added as a pointer to struct task_struct.
+ * When system-wide Intel LBR call stack mode is enabled, a buffer of
+ * constant size will be allocated for each task.
+ * Also, system memory consumption grows further when the size of
+ * struct perf_ctx_data increases.
+ */
+struct perf_ctx_data {
+	struct rcu_head			rcu_head;
+	refcount_t			refcount;
+	struct kmem_cache		*ctx_cache;
+	void				*data;
+};
+
 /*
  * Number of contexts where an event can trigger:
  *	task, softirq, hardirq, nmi.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d2c8813..700f56f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -52,6 +52,7 @@ struct mempolicy;
 struct nameidata;
 struct nsproxy;
 struct perf_event_context;
+struct perf_ctx_data;
 struct pid_namespace;
 struct pipe_inode_info;
 struct rcu_node;
@@ -1135,6 +1136,9 @@ struct task_struct {
 	struct perf_event_context	*perf_event_ctxp[perf_nr_task_contexts];
 	struct mutex			perf_event_mutex;
 	struct list_head		perf_event_list;
+	/* Sync the writers of perf_ctx_data RCU pointer */
+	raw_spinlock_t			perf_ctx_data_lock;
+	struct perf_ctx_data __rcu	*perf_ctx_data;
 #endif
 #ifdef CONFIG_DEBUG_PREEMPT
 	unsigned long			preempt_disable_ip;
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2e947a4..9bb9bee 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -13131,6 +13131,8 @@ int perf_event_init_task(struct task_struct *child, u64 clone_flags)
 	memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
 	mutex_init(&child->perf_event_mutex);
 	INIT_LIST_HEAD(&child->perf_event_list);
+	child->perf_ctx_data = NULL;
+	raw_spin_lock_init(&child->perf_ctx_data_lock);
 
 	for_each_task_context_nr(ctxn) {
 		ret = perf_event_init_context(child, ctxn, clone_flags);
-- 
2.7.4



* [PATCH V4 2/6] perf: attach/detach PMU specific data
  2021-05-19 15:06 [PATCH V4 1/6] perf: Save PMU specific data in task_struct kan.liang
@ 2021-05-19 15:06 ` kan.liang
  2021-05-19 15:06 ` [PATCH V4 3/6] perf: Supply task information to sched_task() kan.liang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: kan.liang @ 2021-05-19 15:06 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, linux-kernel
  Cc: eranian, vitaly.slobodskoy, namhyung, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The LBR call stack data has to be saved/restored during context switches
to fix the shorter LBR call stacks issue in system-wide mode.
Allocate PMU specific data and attach it to the corresponding
task_struct during LBR call stack monitoring.

When an LBR call stack event is accounted, the perf_ctx_data for the
related tasks will be allocated/attached by attach_perf_ctx_data().
When an LBR call stack event is unaccounted, the perf_ctx_data for the
related tasks will be detached/freed by detach_perf_ctx_data().

The LBR call stack event could be a per-task event or a system-wide
event.
- For a per-task event, perf only allocates the perf_ctx_data for the
  current task. If the allocation fails, perf will error out.
- For a system-wide event, perf has to allocate the perf_ctx_data for
  both the existing tasks and the upcoming tasks.
  The allocation for the existing tasks is done in perf_event_alloc().
  The allocation for the new tasks will be done in perf_event_fork().
  If any allocation fails, perf doesn't error out for the system-wide
  event. A debug message will be dumped to the system log instead. The
  LBR callstack may be cut off for tasks that don't have the space
  allocated.
- The perf_ctx_data is only freed by the last LBR call stack event.
  The number of per-task events is tracked by the refcount of each task.
  Since system-wide events impact all tasks, it's not practical to
  go through the whole task list to update the refcount for each
  system-wide event. The number of system-wide events is tracked by a
  global variable, nr_task_data_sys_wide_events.
  Introduce a macro TASK_DATA_SYS_WIDE for the refcount to indicate that
  the PMU specific data is used by system-wide events (illustrated below).
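
For illustration only (not part of the patch), the refcount encoding
described above works out as follows for a single task, using the
TASK_DATA_SYS_WIDE value defined below:

  A task monitored by two per-task LBR call stack events while a
  system-wide event is also active has

    refcount = TASK_DATA_SYS_WIDE + 2 = 0x1000000 + 2 = 0x1000002

  so has_task_data_sys_wide() is true, the number of per-task users is
  refcount - TASK_DATA_SYS_WIDE = 2, and exceed_task_data_events_limit()
  is false (2 <= MAX_NR_TASK_DATA_EVENTS).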

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---

Changes since V3:
- Rebase for the Arch LBR
- Use kvcalloc to replace kcalloc (Andi)

Changes since V2:
- Remove global spin lock task_data_sys_wide_events_lock
  Since the global spin lock has been removed, we cannot guarantee
  that the allocation/assignments for existing threads and free are
  serialized.
  To fix it, in V3, we go through the task list when accounting for
  each system-wide event, and assign the perf_ctx_data pointer if needed.
  (In V2, we only do the assignment for the first system-wide event).
  In V3, we also add a breaker in the free process for system-wide
  events. If a new system-wide event is accounted, stop the free process
  immediately.
- Add a macro TASK_DATA_SYS_WIDE to indicate that the PMU specific data
  is used by system-wide events.

 kernel/events/core.c | 380 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 380 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 9bb9bee..bb1b27e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -48,6 +48,7 @@
 #include <linux/parser.h>
 #include <linux/sched/clock.h>
 #include <linux/sched/mm.h>
+#include <linux/sched/stat.h>
 #include <linux/proc_ns.h>
 #include <linux/mount.h>
 #include <linux/min_heap.h>
@@ -401,6 +402,39 @@ static atomic_t nr_cgroup_events __read_mostly;
 static atomic_t nr_text_poke_events __read_mostly;
 static atomic_t nr_build_id_events __read_mostly;
 
+/* Track the number of system-wide events which require PMU specific data */
+static atomic_t nr_task_data_sys_wide_events;
+
+/*
+ * There are two types of users of PMU specific data: system-wide events
+ * and per-task events.
+ *
+ * The number of system-wide events is already tracked by the global variable
+ * nr_task_data_sys_wide_events. Set TASK_DATA_SYS_WIDE in the refcount to
+ * indicate that the PMU specific data is used by system-wide events.
+ *
+ * The number of per-task event users is tracked by the refcount. Since the
+ * TASK_DATA_SYS_WIDE bit is already occupied by system-wide events, limit
+ * the max number of per-task event users to less than half of TASK_DATA_SYS_WIDE.
+ */
+#define TASK_DATA_SYS_WIDE		0x1000000
+#define MAX_NR_TASK_DATA_EVENTS		(TASK_DATA_SYS_WIDE >> 1)
+
+static inline bool has_task_data_sys_wide(struct perf_ctx_data *perf_ctx_data)
+{
+	return !!(refcount_read(&perf_ctx_data->refcount) & TASK_DATA_SYS_WIDE);
+}
+
+static inline bool exceed_task_data_events_limit(struct perf_ctx_data *perf_ctx_data)
+{
+	unsigned int count = refcount_read(&perf_ctx_data->refcount);
+
+	if (has_task_data_sys_wide(perf_ctx_data))
+		return (count - TASK_DATA_SYS_WIDE) > MAX_NR_TASK_DATA_EVENTS;
+	else
+		return count > MAX_NR_TASK_DATA_EVENTS;
+}
+
 static LIST_HEAD(pmus);
 static DEFINE_MUTEX(pmus_lock);
 static struct srcu_struct pmus_srcu;
@@ -4768,6 +4802,288 @@ static void unaccount_freq_event(void)
 		atomic_dec(&nr_freq_events);
 }
 
+static int
+alloc_perf_ctx_data(struct kmem_cache *ctx_cache, gfp_t flags,
+		    struct perf_ctx_data **task_ctx_data)
+{
+	struct perf_ctx_data *ctx_data;
+
+	if (!ctx_cache)
+		return -EINVAL;
+
+	ctx_data = kzalloc(sizeof(struct perf_ctx_data), flags);
+	if (!ctx_data)
+		return -ENOMEM;
+
+	ctx_data->data = kmem_cache_zalloc(ctx_cache, flags);
+	if (!ctx_data->data) {
+		kfree(ctx_data);
+		return -ENOMEM;
+	}
+
+	ctx_data->ctx_cache = ctx_cache;
+	*task_ctx_data = ctx_data;
+
+	return 0;
+}
+
+static void
+free_perf_ctx_data(struct perf_ctx_data *ctx_data)
+{
+	kfree(ctx_data->data);
+	kfree(ctx_data);
+}
+
+static void
+free_perf_ctx_data_rcu(struct rcu_head *rcu_head)
+{
+	struct perf_ctx_data *ctx_data;
+
+	ctx_data = container_of(rcu_head, struct perf_ctx_data, rcu_head);
+	free_perf_ctx_data(ctx_data);
+}
+
+static int
+attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache)
+{
+	struct perf_ctx_data *ctx_data, *tsk_data;
+
+	/*
+	 * To make the code RT friendly, make the allocation out of
+	 * the spinlock.
+	 */
+	if (alloc_perf_ctx_data(ctx_cache, GFP_KERNEL, &ctx_data))
+		return -ENOMEM;
+
+	raw_spin_lock(&task->perf_ctx_data_lock);
+
+	tsk_data = rcu_dereference_protected(task->perf_ctx_data,
+				lockdep_is_held(&task->perf_ctx_data_lock));
+	if (tsk_data) {
+		free_perf_ctx_data(ctx_data);
+		if (WARN_ON_ONCE(exceed_task_data_events_limit(tsk_data))) {
+			raw_spin_unlock(&task->perf_ctx_data_lock);
+			return -EINVAL;
+		}
+		refcount_inc(&tsk_data->refcount);
+	} else {
+		refcount_set(&ctx_data->refcount, 1);
+		/* System-wide event is active as well */
+		if (atomic_read(&nr_task_data_sys_wide_events))
+			refcount_add(TASK_DATA_SYS_WIDE, &ctx_data->refcount);
+
+		rcu_assign_pointer(task->perf_ctx_data, ctx_data);
+	}
+
+	raw_spin_unlock(&task->perf_ctx_data_lock);
+	return 0;
+}
+
+static int
+attach_system_wide_ctx_data(struct kmem_cache *ctx_cache)
+{
+	int i, num_thread, pos, nr_failed_alloc;
+	struct perf_ctx_data *tsk_data;
+	struct perf_ctx_data **data;
+	struct task_struct *g, *p;
+	gfp_t flags = GFP_ATOMIC;
+	bool re_alloc = true;
+
+	/* Retrieve total number of threads */
+	num_thread = nr_threads;
+
+	data = kvcalloc(num_thread, sizeof(*data), GFP_KERNEL);
+	if (!data) {
+		printk_once(KERN_DEBUG
+			    "Failed to allocate space for LBR callstack. "
+			    "The LBR callstack for all tasks may be cut off.\n");
+		return -ENOMEM;
+	}
+
+	atomic_inc(&nr_task_data_sys_wide_events);
+
+repeat:
+	/*
+	 * Allocate perf_ctx_data for all existing threads.
+	 * The perf_ctx_data for new threads will be allocated in
+	 * perf_event_fork().
+	 * Do a quick allocation in the first round with GFP_ATOMIC.
+	 */
+	for (i = 0; i < num_thread; i++) {
+		if (alloc_perf_ctx_data(ctx_cache, flags, &data[i]))
+			break;
+	}
+	num_thread = i;
+	nr_failed_alloc = 0;
+	pos = 0;
+
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		raw_spin_lock(&p->perf_ctx_data_lock);
+		tsk_data = p->perf_ctx_data;
+		if (tsk_data) {
+			/*
+			 * The perf_ctx_data for this thread may have been
+			 * allocated by a per-task event.
+			 * Only update the refcount in that case.
+			 */
+			if (!has_task_data_sys_wide(tsk_data))
+				refcount_add(TASK_DATA_SYS_WIDE, &tsk_data->refcount);
+			raw_spin_unlock(&p->perf_ctx_data_lock);
+			continue;
+		}
+
+		if (pos < num_thread) {
+			refcount_set(&data[pos]->refcount, TASK_DATA_SYS_WIDE);
+			rcu_assign_pointer(p->perf_ctx_data, data[pos++]);
+		} else {
+			/*
+			 * The quick allocation in the first round may have failed.
+			 * Track the number in nr_failed_alloc.
+			 */
+			nr_failed_alloc++;
+		}
+		raw_spin_unlock(&p->perf_ctx_data_lock);
+	}
+	rcu_read_unlock();
+
+	if (re_alloc && nr_failed_alloc) {
+		num_thread = nr_failed_alloc;
+		flags = GFP_KERNEL;
+		re_alloc = false;
+		goto repeat;
+	}
+
+	if (nr_failed_alloc) {
+		printk_once(KERN_DEBUG
+			    "Failed to allocate space for LBR callstack. "
+			    "The LBR callstack for some tasks may be cut off.\n");
+	}
+
+	for (; pos < num_thread; pos++)
+		free_perf_ctx_data(data[pos]);
+
+	kvfree(data);
+	return 0;
+}
+
+static int
+attach_perf_ctx_data(struct perf_event *event)
+{
+	struct task_struct *task = event->hw.target;
+	struct kmem_cache *ctx_cache = event->pmu->task_ctx_cache;
+
+	if (task)
+		return attach_task_ctx_data(task, ctx_cache);
+	else
+		return attach_system_wide_ctx_data(ctx_cache);
+}
+
+/**
+ * detach_task_ctx_data - Detach perf_ctx_data RCU pointer for a task
+ *			  monitored by per-task event
+ * @task:        Target Task
+ * @force:       Unconditionally free perf_ctx_data
+ *
+ * If force is set, free perf_ctx_data unconditionally.
+ * Otherwise, free perf_ctx_data when there are no users.
+ * A lock is required to sync the writers of the perf_ctx_data RCU pointer.
+ */
+static void
+detach_task_ctx_data(struct task_struct *task, bool force)
+{
+	struct perf_ctx_data *ctx_data;
+
+	raw_spin_lock(&task->perf_ctx_data_lock);
+
+	ctx_data = rcu_dereference_protected(task->perf_ctx_data,
+				lockdep_is_held(&task->perf_ctx_data_lock));
+
+	if (!ctx_data)
+		goto unlock;
+
+	if (!force) {
+		WARN_ON_ONCE(refcount_read(&ctx_data->refcount) == TASK_DATA_SYS_WIDE);
+
+		if (!refcount_dec_and_test(&ctx_data->refcount))
+			goto unlock;
+	}
+
+	RCU_INIT_POINTER(task->perf_ctx_data, NULL);
+	call_rcu(&ctx_data->rcu_head, free_perf_ctx_data_rcu);
+
+unlock:
+	raw_spin_unlock(&task->perf_ctx_data_lock);
+}
+
+/**
+ * detach_task_ctx_data_sys_wide - Detach perf_ctx_data RCU pointer for
+ *				   a task monitored by system-wide event
+ * @task:        Target Task
+ *
+ * Free perf_ctx_data when there are no users.
+ */
+static void
+detach_task_ctx_data_sys_wide(struct task_struct *task)
+{
+	struct perf_ctx_data *ctx_data;
+
+	lockdep_assert_held(&task->perf_ctx_data_lock);
+
+	ctx_data = rcu_dereference_protected(task->perf_ctx_data,
+				lockdep_is_held(&task->perf_ctx_data_lock));
+	if (!ctx_data)
+		return;
+
+	WARN_ON_ONCE(!has_task_data_sys_wide(ctx_data));
+
+	if (!refcount_sub_and_test(TASK_DATA_SYS_WIDE, &ctx_data->refcount))
+		return;
+
+	RCU_INIT_POINTER(task->perf_ctx_data, NULL);
+	call_rcu(&ctx_data->rcu_head, free_perf_ctx_data_rcu);
+}
+
+static void detach_system_wide_ctx_data(void)
+{
+	struct task_struct *g, *p;
+
+	if (!atomic_dec_and_test(&nr_task_data_sys_wide_events))
+		return;
+
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		raw_spin_lock(&p->perf_ctx_data_lock);
+
+		/*
+		 * A new system-wide event may be attached while freeing
+		 * everything for the old event.
+		 * If so, stop the free process immediately.
+		 * For the freed threads, attach_system_wide_ctx_data()
+		 * will re-allocate the space.
+		 */
+		if (unlikely(atomic_read(&nr_task_data_sys_wide_events))) {
+			raw_spin_unlock(&p->perf_ctx_data_lock);
+			goto unlock;
+		}
+
+		detach_task_ctx_data_sys_wide(p);
+		raw_spin_unlock(&p->perf_ctx_data_lock);
+	}
+unlock:
+	rcu_read_unlock();
+}
+
+static void detach_perf_ctx_data(struct perf_event *event)
+{
+	struct task_struct *task = event->hw.target;
+
+	if (task)
+		detach_task_ctx_data(task, false);
+	else
+		detach_system_wide_ctx_data();
+}
+
 static void unaccount_event(struct perf_event *event)
 {
 	bool dec = false;
@@ -4805,6 +5121,8 @@ static void unaccount_event(struct perf_event *event)
 		atomic_dec(&nr_bpf_events);
 	if (event->attr.text_poke)
 		atomic_dec(&nr_text_poke_events);
+	if (event->attach_state & PERF_ATTACH_TASK_DATA)
+		detach_perf_ctx_data(event);
 
 	if (dec) {
 		if (!atomic_add_unless(&perf_sched_count, -1, 1))
@@ -7841,10 +8159,63 @@ static void perf_event_task(struct task_struct *task,
 		       task_ctx);
 }
 
+/*
+ * Allocate data for a new task when profiling system-wide
+ * events which require PMU specific data
+ */
+static void perf_event_alloc_task_data(struct task_struct *child,
+				       struct task_struct *parent)
+{
+	struct kmem_cache *ctx_cache = NULL;
+	struct perf_ctx_data *ctx_data;
+
+	if (!atomic_read(&nr_task_data_sys_wide_events))
+		return;
+
+	rcu_read_lock();
+	ctx_data = rcu_dereference(parent->perf_ctx_data);
+	if (ctx_data)
+		ctx_cache = ctx_data->ctx_cache;
+	rcu_read_unlock();
+
+	if (!ctx_cache)
+		return;
+
+	if (alloc_perf_ctx_data(ctx_cache, GFP_KERNEL, &ctx_data))
+		return;
+
+	raw_spin_lock(&child->perf_ctx_data_lock);
+
+	if (child->perf_ctx_data) {
+		free_perf_ctx_data(ctx_data);
+	} else {
+		refcount_set(&ctx_data->refcount, TASK_DATA_SYS_WIDE);
+		rcu_assign_pointer(child->perf_ctx_data, ctx_data);
+	}
+
+	/*
+	 * A system-wide event may be unaccounted while attaching the
+	 * perf_ctx_data. For example,
+	 *                CPU A                              CPU B
+	 *        perf_event_alloc_task_data():
+	 *          read(nr_task_data_sys_wide_events)
+	 *                                         detach_system_wide_ctx_data()
+	 *          alloc_perf_ctx_data()
+	 *          rcu_assign_pointer(perf_ctx_data);
+	 *
+	 * Then the perf_ctx_data may not be freed until the task exits.
+	 */
+	if (unlikely(!atomic_read(&nr_task_data_sys_wide_events)))
+		detach_task_ctx_data_sys_wide(child);
+
+	raw_spin_unlock(&child->perf_ctx_data_lock);
+}
+
 void perf_event_fork(struct task_struct *task)
 {
 	perf_event_task(task, NULL, 1);
 	perf_event_namespaces(task);
+	perf_event_alloc_task_data(task, current);
 }
 
 /*
@@ -11614,11 +11985,18 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (err)
 		goto err_callchain_buffer;
 
+	if ((event->attach_state & PERF_ATTACH_TASK_DATA) &&
+	    attach_perf_ctx_data(event))
+		goto err_task_ctx_data;
+
 	/* symmetric to unaccount_event() in _free_event() */
 	account_event(event);
 
 	return event;
 
+err_task_ctx_data:
+	if (!event->parent && (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN))
+		put_callchain_buffers();
 err_callchain_buffer:
 	if (!event->parent) {
 		if (event->attr.sample_type & PERF_SAMPLE_CALLCHAIN)
@@ -12696,6 +13074,8 @@ void perf_event_exit_task(struct task_struct *child)
 	 * At this point we need to send EXIT events to cpu contexts.
 	 */
 	perf_event_task(child, NULL, 0);
+
+	detach_task_ctx_data(child, true);
 }
 
 static void perf_free_event(struct perf_event *event,
-- 
2.7.4



* [PATCH V4 3/6] perf: Supply task information to sched_task()
  2021-05-19 15:06 [PATCH V4 1/6] perf: Save PMU specific data in task_struct kan.liang
  2021-05-19 15:06 ` [PATCH V4 2/6] perf: attach/detach PMU specific data kan.liang
@ 2021-05-19 15:06 ` kan.liang
  2021-05-19 15:06 ` [PATCH V4 4/6] perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode kan.liang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: kan.liang @ 2021-05-19 15:06 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, linux-kernel
  Cc: eranian, vitaly.slobodskoy, namhyung, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

To save/restore LBR call stack data in system-wide mode, the task_struct
information is required.

Extend the parameters of sched_task() to supply task_struct information.

On schedule in, the LBR call stack data for the new task will be restored.
On schedule out, the LBR call stack data for the old task will be saved.
Only the required task_struct information needs to be passed.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---

Changes since V3:
- Rebase on top of the 5.13-rc2

 arch/powerpc/perf/core-book3s.c |  8 ++++++--
 arch/x86/events/core.c          |  5 +++--
 arch/x86/events/intel/core.c    |  4 ++--
 arch/x86/events/intel/lbr.c     |  3 ++-
 arch/x86/events/perf_event.h    |  5 +++--
 include/linux/perf_event.h      |  2 +-
 kernel/events/core.c            | 15 ++++++++-------
 7 files changed, 25 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 16d4d1b..5c2cc4c 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -131,7 +131,10 @@ static unsigned long ebb_switch_in(bool ebb, struct cpu_hw_events *cpuhw)
 
 static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
 static inline void power_pmu_bhrb_disable(struct perf_event *event) {}
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in) {}
+static void power_pmu_sched_task(struct perf_event_context *ctx,
+				 struct task_struct *task, bool sched_in)
+{
+}
 static inline void power_pmu_bhrb_read(struct perf_event *event, struct cpu_hw_events *cpuhw) {}
 static void pmao_restore_workaround(bool ebb) { }
 #endif /* CONFIG_PPC32 */
@@ -441,7 +444,8 @@ static void power_pmu_bhrb_disable(struct perf_event *event)
 /* Called from ctxsw to prevent one process's branch entries to
  * mingle with the other process's entries during context switch.
  */
-static void power_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void power_pmu_sched_task(struct perf_event_context *ctx,
+				 struct task_struct *task, bool sched_in)
 {
 	if (!ppmu->bhrb_nr)
 		return;
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8f71dd7..8368cc7 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2603,9 +2603,10 @@ static const struct attribute_group *x86_pmu_attr_groups[] = {
 	NULL,
 };
 
-static void x86_pmu_sched_task(struct perf_event_context *ctx, bool sched_in)
+static void x86_pmu_sched_task(struct perf_event_context *ctx,
+			       struct task_struct *task, bool sched_in)
 {
-	static_call_cond(x86_pmu_sched_task)(ctx, sched_in);
+	static_call_cond(x86_pmu_sched_task)(ctx, task, sched_in);
 }
 
 static void x86_pmu_swap_task_ctx(struct perf_event_context *prev,
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e288922..b7fc7c9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4467,10 +4467,10 @@ static void intel_pmu_cpu_dead(int cpu)
 }
 
 static void intel_pmu_sched_task(struct perf_event_context *ctx,
-				 bool sched_in)
+				 struct task_struct *task, bool sched_in)
 {
 	intel_pmu_pebs_sched_task(ctx, sched_in);
-	intel_pmu_lbr_sched_task(ctx, sched_in);
+	intel_pmu_lbr_sched_task(ctx, task, sched_in);
 }
 
 static void intel_pmu_swap_task_ctx(struct perf_event_context *prev,
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 4409d2c..7914f40 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -619,7 +619,8 @@ void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
 	     task_context_opt(next_ctx_data)->lbr_callstack_users);
 }
 
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in)
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx,
+			      struct task_struct *task, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	void *task_ctx;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ad87cb3..29f8df7 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -777,7 +777,7 @@ struct x86_pmu {
 
 	void		(*check_microcode)(void);
 	void		(*sched_task)(struct perf_event_context *ctx,
-				      bool sched_in);
+				      struct task_struct *task, bool sched_in);
 
 	/*
 	 * Intel Arch Perfmon v2+
@@ -1310,7 +1310,8 @@ void intel_ds_init(void);
 void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
 				 struct perf_event_context *next);
 
-void intel_pmu_lbr_sched_task(struct perf_event_context *ctx, bool sched_in);
+void intel_pmu_lbr_sched_task(struct perf_event_context *ctx,
+			      struct task_struct *task, bool sched_in);
 
 u64 lbr_from_signext_quirk_wr(u64 val);
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d46b7e1..0085b97 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -428,7 +428,7 @@ struct pmu {
 	 * context-switches callback
 	 */
 	void (*sched_task)		(struct perf_event_context *ctx,
-					bool sched_in);
+					 struct task_struct *task, bool sched_in);
 
 	/*
 	 * Kmem cache of PMU specific data
diff --git a/kernel/events/core.c b/kernel/events/core.c
index bb1b27e..a83284b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3483,7 +3483,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 			perf_pmu_disable(pmu);
 
 			if (cpuctx->sched_cb_usage && pmu->sched_task)
-				pmu->sched_task(ctx, false);
+				pmu->sched_task(ctx, task, false);
 
 			/*
 			 * PMU specific parts of task perf context can require
@@ -3523,7 +3523,7 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 		perf_pmu_disable(pmu);
 
 		if (cpuctx->sched_cb_usage && pmu->sched_task)
-			pmu->sched_task(ctx, false);
+			pmu->sched_task(ctx, task, false);
 		task_ctx_sched_out(cpuctx, ctx, EVENT_ALL);
 
 		perf_pmu_enable(pmu);
@@ -3562,7 +3562,8 @@ void perf_sched_cb_inc(struct pmu *pmu)
  * PEBS requires this to provide PID/TID information. This requires we flush
  * all queued PEBS records before we context switch to a new task.
  */
-static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in)
+static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx,
+				  struct task_struct *task, bool sched_in)
 {
 	struct pmu *pmu;
 
@@ -3574,7 +3575,7 @@ static void __perf_pmu_sched_task(struct perf_cpu_context *cpuctx, bool sched_in
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(pmu);
 
-	pmu->sched_task(cpuctx->task_ctx, sched_in);
+	pmu->sched_task(cpuctx->task_ctx, task, sched_in);
 
 	perf_pmu_enable(pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3594,7 +3595,7 @@ static void perf_pmu_sched_task(struct task_struct *prev,
 		if (cpuctx->task_ctx)
 			continue;
 
-		__perf_pmu_sched_task(cpuctx, sched_in);
+		__perf_pmu_sched_task(cpuctx, sched_in ? next : prev, sched_in);
 	}
 }
 
@@ -3860,7 +3861,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	cpuctx = __get_cpu_context(ctx);
 	if (cpuctx->task_ctx == ctx) {
 		if (cpuctx->sched_cb_usage)
-			__perf_pmu_sched_task(cpuctx, true);
+			__perf_pmu_sched_task(cpuctx, task, true);
 		return;
 	}
 
@@ -3886,7 +3887,7 @@ static void perf_event_context_sched_in(struct perf_event_context *ctx,
 	perf_event_sched_in(cpuctx, ctx, task);
 
 	if (cpuctx->sched_cb_usage && pmu->sched_task)
-		pmu->sched_task(cpuctx->task_ctx, true);
+		pmu->sched_task(cpuctx->task_ctx, task, true);
 
 	perf_pmu_enable(pmu);
 
-- 
2.7.4



* [PATCH V4 4/6] perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode
  2021-05-19 15:06 [PATCH V4 1/6] perf: Save PMU specific data in task_struct kan.liang
  2021-05-19 15:06 ` [PATCH V4 2/6] perf: attach/detach PMU specific data kan.liang
  2021-05-19 15:06 ` [PATCH V4 3/6] perf: Supply task information to sched_task() kan.liang
@ 2021-05-19 15:06 ` kan.liang
  2021-05-19 15:06 ` [PATCH V4 5/6] perf/x86: Remove swap_task_ctx() kan.liang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 8+ messages in thread
From: kan.liang @ 2021-05-19 15:06 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, linux-kernel
  Cc: eranian, vitaly.slobodskoy, namhyung, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

In system-wide mode, LBR call stacks are shorter in comparison to
per-process mode.

LBR MSRs are reset during a context switch in system-wide mode. For
the LBR call stack, the LBRs should always be saved/restored during a
context switch.

Use the space in task_struct to save/restore the LBR call stack data.

For a system-wide event, it's unnecessary to update the
lbr_callstack_users count for each thread. Add a variable in x86_pmu to
indicate whether a system-wide event is active.

Fixes: 76cb2c617f12 ("perf/x86/intel: Save/restore LBR stack during context switch")
Reported-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Debugged-by: Alexey Budankov <alexey.budankov@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---

Changes since V3:
- Rebase on top of the 5.13-rc2

 arch/x86/events/intel/lbr.c  | 49 ++++++++++++++++++++++++++++++++++++--------
 arch/x86/events/perf_event.h |  1 +
 2 files changed, 41 insertions(+), 9 deletions(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 7914f40..bef40b9 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -502,11 +502,17 @@ static __always_inline bool lbr_is_reset_in_cstate(void *ctx)
 	return !rdlbr_from(((struct x86_perf_task_context *)ctx)->tos, NULL);
 }
 
+static inline bool has_lbr_callstack_users(void *ctx)
+{
+	return task_context_opt(ctx)->lbr_callstack_users ||
+	       x86_pmu.lbr_callstack_users;
+}
+
 static void __intel_pmu_lbr_restore(void *ctx)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (task_context_opt(ctx)->lbr_callstack_users == 0 ||
+	if (!has_lbr_callstack_users(ctx) ||
 	    task_context_opt(ctx)->lbr_stack_state == LBR_NONE) {
 		intel_pmu_lbr_reset();
 		return;
@@ -583,7 +589,7 @@ static void __intel_pmu_lbr_save(void *ctx)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 
-	if (task_context_opt(ctx)->lbr_callstack_users == 0) {
+	if (!has_lbr_callstack_users(ctx)) {
 		task_context_opt(ctx)->lbr_stack_state = LBR_NONE;
 		return;
 	}
@@ -623,6 +629,7 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx,
 			      struct task_struct *task, bool sched_in)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	struct perf_ctx_data *ctx_data;
 	void *task_ctx;
 
 	if (!cpuc->lbr_users)
@@ -633,15 +640,18 @@ void intel_pmu_lbr_sched_task(struct perf_event_context *ctx,
 	 * the task was scheduled out, restore the stack. Otherwise flush
 	 * the LBR stack.
 	 */
-	task_ctx = ctx ? ctx->task_ctx_data : NULL;
+	rcu_read_lock();
+	ctx_data = rcu_dereference(task->perf_ctx_data);
+	task_ctx = ctx_data ? ctx_data->data : NULL;
 	if (task_ctx) {
 		if (sched_in)
 			__intel_pmu_lbr_restore(task_ctx);
 		else
 			__intel_pmu_lbr_save(task_ctx);
+		rcu_read_unlock();
 		return;
 	}
-
+	rcu_read_unlock();
 	/*
 	 * Since a context switch can flip the address space and LBR entries
 	 * are not tagged with an identifier, we need to wipe the LBR, even for
@@ -669,8 +679,19 @@ void intel_pmu_lbr_add(struct perf_event *event)
 
 	cpuc->br_sel = event->hw.branch_reg.reg;
 
-	if (branch_user_callstack(cpuc->br_sel) && event->ctx->task_ctx_data)
-		task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users++;
+	if (branch_user_callstack(cpuc->br_sel)) {
+		if (event->attach_state & PERF_ATTACH_TASK) {
+			struct task_struct *task = event->hw.target;
+			struct perf_ctx_data *ctx_data;
+
+			rcu_read_lock();
+			ctx_data = rcu_dereference(task->perf_ctx_data);
+			if (ctx_data)
+				task_context_opt(ctx_data->data)->lbr_callstack_users++;
+			rcu_read_unlock();
+		} else
+			x86_pmu.lbr_callstack_users++;
+	}
 
 	/*
 	 * Request pmu::sched_task() callback, which will fire inside the
@@ -744,9 +765,19 @@ void intel_pmu_lbr_del(struct perf_event *event)
 	if (!x86_pmu.lbr_nr)
 		return;
 
-	if (branch_user_callstack(cpuc->br_sel) &&
-	    event->ctx->task_ctx_data)
-		task_context_opt(event->ctx->task_ctx_data)->lbr_callstack_users--;
+	if (branch_user_callstack(cpuc->br_sel)) {
+		if (event->attach_state & PERF_ATTACH_TASK) {
+			struct task_struct *task = event->hw.target;
+			struct perf_ctx_data *ctx_data;
+
+			rcu_read_lock();
+			ctx_data = rcu_dereference(task->perf_ctx_data);
+			if (ctx_data)
+				task_context_opt(ctx_data->data)->lbr_callstack_users--;
+			rcu_read_unlock();
+		} else
+			x86_pmu.lbr_callstack_users--;
+	}
 
 	if (event->hw.flags & PERF_X86_EVENT_LBR_SELECT)
 		cpuc->lbr_select = 0;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 29f8df7..208b859 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -819,6 +819,7 @@ struct x86_pmu {
 		const int	*lbr_sel_map;	   /* lbr_select mappings */
 		int		*lbr_ctl_map;	   /* LBR_CTL mappings */
 	};
+	u64		lbr_callstack_users;	   /* lbr callstack system wide users */
 	bool		lbr_double_abort;	   /* duplicated lbr aborts */
 	bool		lbr_pt_coexist;		   /* (LBR|BTS) may coexist with PT */
 
-- 
2.7.4



* [PATCH V4 5/6] perf/x86: Remove swap_task_ctx()
  2021-05-19 15:06 [PATCH V4 1/6] perf: Save PMU specific data in task_struct kan.liang
                   ` (2 preceding siblings ...)
  2021-05-19 15:06 ` [PATCH V4 4/6] perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode kan.liang
@ 2021-05-19 15:06 ` kan.liang
  2021-05-19 15:06 ` [PATCH V4 6/6] perf: Clean up pmu specific data kan.liang
  2021-05-21  7:24 ` [PATCH V4 1/6] perf: Save PMU specific data in task_struct Peter Zijlstra
  5 siblings, 0 replies; 8+ messages in thread
From: kan.liang @ 2021-05-19 15:06 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, linux-kernel
  Cc: eranian, vitaly.slobodskoy, namhyung, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The PMU specific data is saved in task_struct now. It no longer needs
to be swapped between contexts.

Remove swap_task_ctx() support.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---

Changes since V3:
- Rebase on top of the 5.13-rc2

 arch/x86/events/core.c       |  9 ---------
 arch/x86/events/intel/core.c |  7 -------
 arch/x86/events/intel/lbr.c  | 23 -----------------------
 arch/x86/events/perf_event.h | 11 -----------
 4 files changed, 50 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 8368cc7..d89a942 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -79,7 +79,6 @@ DEFINE_STATIC_CALL_NULL(x86_pmu_commit_scheduling, *x86_pmu.commit_scheduling);
 DEFINE_STATIC_CALL_NULL(x86_pmu_stop_scheduling,   *x86_pmu.stop_scheduling);
 
 DEFINE_STATIC_CALL_NULL(x86_pmu_sched_task,    *x86_pmu.sched_task);
-DEFINE_STATIC_CALL_NULL(x86_pmu_swap_task_ctx, *x86_pmu.swap_task_ctx);
 
 DEFINE_STATIC_CALL_NULL(x86_pmu_drain_pebs,   *x86_pmu.drain_pebs);
 DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_aliases, *x86_pmu.pebs_aliases);
@@ -2018,7 +2017,6 @@ static void x86_pmu_static_call_update(void)
 	static_call_update(x86_pmu_stop_scheduling, x86_pmu.stop_scheduling);
 
 	static_call_update(x86_pmu_sched_task, x86_pmu.sched_task);
-	static_call_update(x86_pmu_swap_task_ctx, x86_pmu.swap_task_ctx);
 
 	static_call_update(x86_pmu_drain_pebs, x86_pmu.drain_pebs);
 	static_call_update(x86_pmu_pebs_aliases, x86_pmu.pebs_aliases);
@@ -2609,12 +2607,6 @@ static void x86_pmu_sched_task(struct perf_event_context *ctx,
 	static_call_cond(x86_pmu_sched_task)(ctx, task, sched_in);
 }
 
-static void x86_pmu_swap_task_ctx(struct perf_event_context *prev,
-				  struct perf_event_context *next)
-{
-	static_call_cond(x86_pmu_swap_task_ctx)(prev, next);
-}
-
 void perf_check_microcode(void)
 {
 	if (x86_pmu.check_microcode)
@@ -2676,7 +2668,6 @@ static struct pmu pmu = {
 
 	.event_idx		= x86_pmu_event_idx,
 	.sched_task		= x86_pmu_sched_task,
-	.swap_task_ctx		= x86_pmu_swap_task_ctx,
 	.check_period		= x86_pmu_check_period,
 
 	.aux_output_match	= x86_pmu_aux_output_match,
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b7fc7c9..8c3c885 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4473,12 +4473,6 @@ static void intel_pmu_sched_task(struct perf_event_context *ctx,
 	intel_pmu_lbr_sched_task(ctx, task, sched_in);
 }
 
-static void intel_pmu_swap_task_ctx(struct perf_event_context *prev,
-				    struct perf_event_context *next)
-{
-	intel_pmu_lbr_swap_task_ctx(prev, next);
-}
-
 static int intel_pmu_check_period(struct perf_event *event, u64 value)
 {
 	return intel_pmu_has_bts_period(event, value) ? -EINVAL : 0;
@@ -4627,7 +4621,6 @@ static __initconst const struct x86_pmu intel_pmu = {
 
 	.guest_get_msrs		= intel_guest_get_msrs,
 	.sched_task		= intel_pmu_sched_task,
-	.swap_task_ctx		= intel_pmu_swap_task_ctx,
 
 	.check_period		= intel_pmu_check_period,
 
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index bef40b9..84d0a80 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -602,29 +602,6 @@ static void __intel_pmu_lbr_save(void *ctx)
 	cpuc->last_log_id = ++task_context_opt(ctx)->log_id;
 }
 
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
-				 struct perf_event_context *next)
-{
-	void *prev_ctx_data, *next_ctx_data;
-
-	swap(prev->task_ctx_data, next->task_ctx_data);
-
-	/*
-	 * Architecture specific synchronization makes sense in
-	 * case both prev->task_ctx_data and next->task_ctx_data
-	 * pointers are allocated.
-	 */
-
-	prev_ctx_data = next->task_ctx_data;
-	next_ctx_data = prev->task_ctx_data;
-
-	if (!prev_ctx_data || !next_ctx_data)
-		return;
-
-	swap(task_context_opt(prev_ctx_data)->lbr_callstack_users,
-	     task_context_opt(next_ctx_data)->lbr_callstack_users);
-}
-
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx,
 			      struct task_struct *task, bool sched_in)
 {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 208b859..abeb01c 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -854,14 +854,6 @@ struct x86_pmu {
 	int		(*set_topdown_event_period)(struct perf_event *event);
 
 	/*
-	 * perf task context (i.e. struct perf_event_context::task_ctx_data)
-	 * switch helper to bridge calls from perf/core to perf/x86.
-	 * See struct pmu::swap_task_ctx() usage for examples;
-	 */
-	void		(*swap_task_ctx)(struct perf_event_context *prev,
-					 struct perf_event_context *next);
-
-	/*
 	 * AMD bits
 	 */
 	unsigned int	amd_nb_constraints : 1;
@@ -1308,9 +1300,6 @@ void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
 
 void intel_ds_init(void);
 
-void intel_pmu_lbr_swap_task_ctx(struct perf_event_context *prev,
-				 struct perf_event_context *next);
-
 void intel_pmu_lbr_sched_task(struct perf_event_context *ctx,
 			      struct task_struct *task, bool sched_in);
 
-- 
2.7.4



* [PATCH V4 6/6] perf: Clean up pmu specific data
  2021-05-19 15:06 [PATCH V4 1/6] perf: Save PMU specific data in task_struct kan.liang
                   ` (3 preceding siblings ...)
  2021-05-19 15:06 ` [PATCH V4 5/6] perf/x86: Remove swap_task_ctx() kan.liang
@ 2021-05-19 15:06 ` kan.liang
  2021-05-21  7:24 ` [PATCH V4 1/6] perf: Save PMU specific data in task_struct Peter Zijlstra
  5 siblings, 0 replies; 8+ messages in thread
From: kan.liang @ 2021-05-19 15:06 UTC (permalink / raw)
  To: peterz, mingo, acme, tglx, bp, linux-kernel
  Cc: eranian, vitaly.slobodskoy, namhyung, ak, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The PMU specific data is saved in task_struct now. Remove it from the
event context structure.

Remove swap_task_ctx() as well.

Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---

Changes since V3:
- Rebase on top of the 5.13-rc2

 include/linux/perf_event.h | 11 --------
 kernel/events/core.c       | 64 ++--------------------------------------------
 2 files changed, 2 insertions(+), 73 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0085b97..5624792 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -436,16 +436,6 @@ struct pmu {
 	struct kmem_cache		*task_ctx_cache;
 
 	/*
-	 * PMU specific parts of task perf event context (i.e. ctx->task_ctx_data)
-	 * can be synchronized using this function. See Intel LBR callstack support
-	 * implementation and Perf core context switch handling callbacks for usage
-	 * examples.
-	 */
-	void (*swap_task_ctx)		(struct perf_event_context *prev,
-					 struct perf_event_context *next);
-					/* optional */
-
-	/*
 	 * Set up pmu-private data structures for an AUX area
 	 */
 	void *(*setup_aux)		(struct perf_event *event, void **pages,
@@ -847,7 +837,6 @@ struct perf_event_context {
 #ifdef CONFIG_CGROUP_PERF
 	int				nr_cgroups;	 /* cgroup evts */
 #endif
-	void				*task_ctx_data; /* pmu specific data */
 	struct rcu_head			rcu_head;
 };
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index a83284b..ddebbdb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1273,26 +1273,11 @@ static void get_ctx(struct perf_event_context *ctx)
 	refcount_inc(&ctx->refcount);
 }
 
-static void *alloc_task_ctx_data(struct pmu *pmu)
-{
-	if (pmu->task_ctx_cache)
-		return kmem_cache_zalloc(pmu->task_ctx_cache, GFP_KERNEL);
-
-	return NULL;
-}
-
-static void free_task_ctx_data(struct pmu *pmu, void *task_ctx_data)
-{
-	if (pmu->task_ctx_cache && task_ctx_data)
-		kmem_cache_free(pmu->task_ctx_cache, task_ctx_data);
-}
-
 static void free_ctx(struct rcu_head *head)
 {
 	struct perf_event_context *ctx;
 
 	ctx = container_of(head, struct perf_event_context, rcu_head);
-	free_task_ctx_data(ctx->pmu, ctx->task_ctx_data);
 	kfree(ctx);
 }
 
@@ -3485,25 +3470,13 @@ static void perf_event_context_sched_out(struct task_struct *task, int ctxn,
 			if (cpuctx->sched_cb_usage && pmu->sched_task)
 				pmu->sched_task(ctx, task, false);
 
-			/*
-			 * PMU specific parts of task perf context can require
-			 * additional synchronization. As an example of such
-			 * synchronization see implementation details of Intel
-			 * LBR call stack data profiling;
-			 */
-			if (pmu->swap_task_ctx)
-				pmu->swap_task_ctx(ctx, next_ctx);
-			else
-				swap(ctx->task_ctx_data, next_ctx->task_ctx_data);
-
 			perf_pmu_enable(pmu);
 
 			/*
 			 * RCU_INIT_POINTER here is safe because we've not
 			 * modified the ctx and the above modification of
-			 * ctx->task and ctx->task_ctx_data are immaterial
-			 * since those values are always verified under
-			 * ctx->lock which we're now holding.
+			 * ctx->task is immaterial since this value is always
+			 * verified under ctx->lock which we're now holding.
 			 */
 			RCU_INIT_POINTER(task->perf_event_ctxp[ctxn], next_ctx);
 			RCU_INIT_POINTER(next->perf_event_ctxp[ctxn], ctx);
@@ -4630,7 +4603,6 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 {
 	struct perf_event_context *ctx, *clone_ctx = NULL;
 	struct perf_cpu_context *cpuctx;
-	void *task_ctx_data = NULL;
 	unsigned long flags;
 	int ctxn, err;
 	int cpu = event->cpu;
@@ -4654,24 +4626,12 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 	if (ctxn < 0)
 		goto errout;
 
-	if (event->attach_state & PERF_ATTACH_TASK_DATA) {
-		task_ctx_data = alloc_task_ctx_data(pmu);
-		if (!task_ctx_data) {
-			err = -ENOMEM;
-			goto errout;
-		}
-	}
-
 retry:
 	ctx = perf_lock_task_context(task, ctxn, &flags);
 	if (ctx) {
 		clone_ctx = unclone_ctx(ctx);
 		++ctx->pin_count;
 
-		if (task_ctx_data && !ctx->task_ctx_data) {
-			ctx->task_ctx_data = task_ctx_data;
-			task_ctx_data = NULL;
-		}
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 
 		if (clone_ctx)
@@ -4682,11 +4642,6 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 		if (!ctx)
 			goto errout;
 
-		if (task_ctx_data) {
-			ctx->task_ctx_data = task_ctx_data;
-			task_ctx_data = NULL;
-		}
-
 		err = 0;
 		mutex_lock(&task->perf_event_mutex);
 		/*
@@ -4713,11 +4668,9 @@ find_get_context(struct pmu *pmu, struct task_struct *task,
 		}
 	}
 
-	free_task_ctx_data(pmu, task_ctx_data);
 	return ctx;
 
 errout:
-	free_task_ctx_data(pmu, task_ctx_data);
 	return ERR_PTR(err);
 }
 
@@ -13230,18 +13183,6 @@ inherit_event(struct perf_event *parent_event,
 	if (IS_ERR(child_event))
 		return child_event;
 
-
-	if ((child_event->attach_state & PERF_ATTACH_TASK_DATA) &&
-	    !child_ctx->task_ctx_data) {
-		struct pmu *pmu = child_event->pmu;
-
-		child_ctx->task_ctx_data = alloc_task_ctx_data(pmu);
-		if (!child_ctx->task_ctx_data) {
-			free_event(child_event);
-			return ERR_PTR(-ENOMEM);
-		}
-	}
-
 	/*
 	 * is_orphaned_event() and list_add_tail(&parent_event->child_list)
 	 * must be under the same lock in order to serialize against
@@ -13252,7 +13193,6 @@ inherit_event(struct perf_event *parent_event,
 	if (is_orphaned_event(parent_event) ||
 	    !atomic_long_inc_not_zero(&parent_event->refcount)) {
 		mutex_unlock(&parent_event->child_mutex);
-		/* task_ctx_data is freed with child_ctx */
 		free_event(child_event);
 		return NULL;
 	}
-- 
2.7.4



* Re: [PATCH V4 1/6] perf: Save PMU specific data in task_struct
  2021-05-19 15:06 [PATCH V4 1/6] perf: Save PMU specific data in task_struct kan.liang
                   ` (4 preceding siblings ...)
  2021-05-19 15:06 ` [PATCH V4 6/6] perf: Clean up pmu specific data kan.liang
@ 2021-05-21  7:24 ` Peter Zijlstra
  2021-05-21 15:54   ` Liang, Kan
  5 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2021-05-21  7:24 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, tglx, bp, linux-kernel, eranian, vitaly.slobodskoy,
	namhyung, ak


I'm replying to the whole thing squashed, because I couldn't make sense
of the individual patches much.

> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f5a6a2f069ed..5624792c2c87 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -428,23 +428,13 @@ struct pmu {
>  	 * context-switches callback
>  	 */
>  	void (*sched_task)		(struct perf_event_context *ctx,
> -					bool sched_in);
> +					 struct task_struct *task, bool sched_in);
>  
>  	/*
>  	 * Kmem cache of PMU specific data
>  	 */
>  	struct kmem_cache		*task_ctx_cache;
>  
> -	/*
> -	 * PMU specific parts of task perf event context (i.e. ctx->task_ctx_data)
> -	 * can be synchronized using this function. See Intel LBR callstack support
> -	 * implementation and Perf core context switch handling callbacks for usage
> -	 * examples.
> -	 */
> -	void (*swap_task_ctx)		(struct perf_event_context *prev,
> -					 struct perf_event_context *next);
> -					/* optional */
> -
>  	/*
>  	 * Set up pmu-private data structures for an AUX area
>  	 */
> @@ -847,10 +837,37 @@ struct perf_event_context {
>  #ifdef CONFIG_CGROUP_PERF
>  	int				nr_cgroups;	 /* cgroup evts */
>  #endif
> -	void				*task_ctx_data; /* pmu specific data */
>  	struct rcu_head			rcu_head;
>  };

So this I like, less is more. The rest however needs some serious
surgery :/

> @@ -401,6 +402,39 @@ static atomic_t nr_cgroup_events __read_mostly;
>  static atomic_t nr_text_poke_events __read_mostly;
>  static atomic_t nr_build_id_events __read_mostly;
>  
> +/* Track the number of system-wide event which requires pmu specific data */
> +static atomic_t nr_task_data_sys_wide_events;
> +
> +/*
> + * There are two types of users for pmu specific data, system-wide event and
> + * per-task event.
> + *
> + * The number of system-wide events is already tracked by global variable
> + * nr_task_data_sys_wide_events. Set TASK_DATA_SYS_WIDE in refcount to
> + * indicate the PMU specific data is used by system-wide events.
> + *
> + * The number of per-task event users is tracked by refcount. Since the
> + * TASK_DATA_SYS_WIDE is already occupied by system-wide events, limit
> + * the max number of per-task event users less than half of TASK_DATA_SYS_WIDE.
> + */
> +#define TASK_DATA_SYS_WIDE		0x1000000
> +#define MAX_NR_TASK_DATA_EVENTS		(TASK_DATA_SYS_WIDE >> 1)
> +
> +static inline bool has_task_data_sys_wide(struct perf_ctx_data *perf_ctx_data)
> +{
> +	return !!(refcount_read(&perf_ctx_data->refcount) & TASK_DATA_SYS_WIDE);
> +}
> +
> +static inline bool exceed_task_data_events_limit(struct perf_ctx_data *perf_ctx_data)
> +{
> +	unsigned int count = refcount_read(&perf_ctx_data->refcount);
> +
> +	if (has_task_data_sys_wide(perf_ctx_data))
> +		return (count - TASK_DATA_SYS_WIDE) > MAX_NR_TASK_DATA_EVENTS;
> +	else
> +		return count > MAX_NR_TASK_DATA_EVENTS;
> +}

This OTOH is terrible. Please don't play games like that with
refcount_t. Also naming :/

Just add an extra variable that indicates we hold one global reference
on the object. You have a hole anyway:

struct perf_ctx_data {
	struct rcu_head                 rcu_head;
	refcount_t                      refcount;
	int				global;
	struct kmem_cache               *ctx_cache;
	void                            *data;
};

> @@ -4768,6 +4756,288 @@ static void unaccount_freq_event(void)
>  		atomic_dec(&nr_freq_events);
>  }
>  
> +static int
> +alloc_perf_ctx_data(struct kmem_cache *ctx_cache, gfp_t flags,
> +		    struct perf_ctx_data **task_ctx_data)
> +{
> +	struct perf_ctx_data *ctx_data;
> +
> +	if (!ctx_cache)
> +		return -EINVAL;
> +
> +	ctx_data = kzalloc(sizeof(struct perf_ctx_data), flags);
> +	if (!ctx_data)
> +		return -ENOMEM;
> +
> +	ctx_data->data = kmem_cache_zalloc(ctx_cache, flags);
> +	if (!ctx_data->data) {
> +		kfree(ctx_data);
> +		return -ENOMEM;
> +	}
> +
> +	ctx_data->ctx_cache = ctx_cache;
> +	*task_ctx_data = ctx_data;
> +
> +	return 0;
> +}

That's pretty horrible too; what's wrong with something simpler?

static struct perf_ctx_data *alloc_perf_ctx_data(struct kmem_cache *ctx_cache, bool global)
{
	struct perf_ctx_data *cd;

	cd = kzalloc(sizeof(*cd), GFP_KERNEL);
	if (!cd)
		return NULL;

	cd->data = kmem_cache_zalloc(ctx_cache, GFP_KERNEL);
	if (!cd->data) {
		kfree(cd);
		return NULL;
	}

	cd->global = global;
	cd->ctx_cache = ctx_cache;
	refcount_set(&cd->refcount, 1);

	return cd;
}

> +
> +static void
> +free_perf_ctx_data(struct perf_ctx_data *ctx_data)
> +{
> +	kfree(ctx_data->data);

we just allocated that using kmem_cache_alloc(); shouldn't this be:

	kmem_cache_free(cd->ctx_cache, cd->data);

> +	kfree(ctx_data);
> +}
> +
> +static void
> +free_perf_ctx_data_rcu(struct rcu_head *rcu_head)
__free_perf_ctx_data_rcu(
> +{
> +	struct perf_ctx_data *ctx_data;
> +
> +	ctx_data = container_of(rcu_head, struct perf_ctx_data, rcu_head);
> +	free_perf_ctx_data(ctx_data);
> +}

static inline void perf_free_ctx_data_rcu(struct perf_ctx_data *cd)
{
	call_rcu(&cd->rcuhead, __free_perf_ctx_data_rcu);
}

> +static int
> +attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache)
> +{
> +	struct perf_ctx_data *ctx_data, *tsk_data;
> +
> +	/*
> +	 * To make the code RT friendly, make the allocation out of
> +	 * the spinlock.
> +	 */

Nothing RT specific there, doing allocations under spinlocks is crap at
all times. RT just really doesn't let you do it under raw_spinlock,
rightfully so.

> +	if (alloc_perf_ctx_data(ctx_cache, GFP_KERNEL, &ctx_data))
> +		return -ENOMEM;
> +
> +	raw_spin_lock(&task->perf_ctx_data_lock);
> +
> +	tsk_data = rcu_dereference_protected(task->perf_ctx_data,
> +				lockdep_is_held(&task->perf_ctx_data_lock));
> +	if (tsk_data) {
> +		free_perf_ctx_data(ctx_data);
> +		if (WARN_ON_ONCE(exceed_task_data_events_limit(tsk_data))) {
> +			raw_spin_unlock(&task->perf_ctx_data_lock);
> +			return -EINVAL;
> +		}
> +		refcount_inc(&tsk_data->refcount);
> +	} else {
> +		refcount_set(&ctx_data->refcount, 1);
> +		/* System-wide event is active as well */
> +		if (atomic_read(&nr_task_data_sys_wide_events))
> +			refcount_add(TASK_DATA_SYS_WIDE, &ctx_data->refcount);
> +
> +		rcu_assign_pointer(task->perf_ctx_data, ctx_data);
> +	}
> +
> +	raw_spin_unlock(&task->perf_ctx_data_lock);

I think you can do without that lock:

	struct perf_ctx_data *old = NULL;

	cd = alloc_perf_ctx_data(ctx_cache);
	if (!cd)
		return -ENOMEM;

	for (;;) {
		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
			if (old)
				free_perf_ctx_data_rcu(old);
			return 0;
		}

		if (!old) {
			/*
			 * After seeing a dead @old, we raced with
			 * removal and lost, try again to install @cd.
			 */
			continue;
		}

		if (refcount_inc_not_zero(&old->refcount)) {
			free_perf_ctx_data(cd); /* unused */
			return 0;
		}

		/*
		 * @old is a dead object, refcount==0 is stable, try and
		 * replace it with @cd.
		 */
	}

And it can be *much* simpler if we ditch that refcount crud and never
release the data once allocated.
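
For illustration, a minimal sketch of that simpler variant (an assumption
of what it could look like, reusing the alloc/free helpers sketched above
minus the refcount, and assuming the data is then kept until the task
exits; not code from the thread):

	cd = alloc_perf_ctx_data(ctx_cache);
	if (!cd)
		return -ENOMEM;

	/* Keep whatever is already installed; it is never freed again. */
	if (cmpxchg(&task->perf_ctx_data, NULL, cd) != NULL)
		free_perf_ctx_data(cd);	/* lost the race, reuse the old one */

	return 0;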

> +	return 0;
> +}
> +
> +static int
> +attach_system_wide_ctx_data(struct kmem_cache *ctx_cache)
> +{
> +	int i, num_thread, pos, nr_failed_alloc;
> +	struct perf_ctx_data *tsk_data;
> +	struct perf_ctx_data **data;
> +	struct task_struct *g, *p;
> +	gfp_t flags = GFP_ATOMIC;
> +	bool re_alloc = true;
> +
> +	/* Retrieve total number of threads */
> +	num_thread = nr_threads;
> +
> +	data = kvcalloc(num_thread, sizeof(*data), GFP_KERNEL);
> +	if (!data) {
> +		printk_once(KERN_DEBUG
> +			    "Failed to allocate space for LBR callstack. "
> +			    "The LBR callstack for all tasks may be cutoff.\n");
> +		return -ENOMEM;
> +	}
> +
> +	atomic_inc(&nr_task_data_sys_wide_events);

This is rather unfortunate; you're going to do this massive amount of
allocation for every event, regardless of whether all tasks already have
a data entry.

The alternative is a global lock around this; whichever way around, a
second invocation is going to have to wait for completion anyway.

This suggests:

static DEFINE_MUTEX(global_ctx_data_lock);
static refcount_t global_ctx_data_ref;


attach_global_ctx_data()
{
	if (refcount_inc_not_zero(&global_ctx_data_ref))
		return;

	mutex_lock(&global_ctx_data_lock);
	if (!refcount_inc_not_zero(&global_ctx_data_ref)) {

		/*
		 * allocate everything
		 */

		refcount_set(&global_ctx_data_ref, 1);
	}
	mutex_unlock(&global_ctx_data_lock);
}


detach_global_ctx_data()
{
	if (refcount_dec_not_one(&global_ctx_data_ref))
		return;

	mutex_lock(&global_ctx_data_lock);
	if (!refcount_dec_and_test(&global_ctx_data_ref)) {
		mutex_unlock(&global_ctx_data_lock);
		return;
	}

	/*
	 * remove everything
	 */

	mutex_unlock(&global_ctx_data_lock);
}

(NB. the beginning of detach is normally spelled
refcount_dec_and_mutex_lock(), but seeing how you're going to need
another lock type, read below, I figured I'd spell this out, since
otherwise you're likely to get it wrong)

But that leaves you in a bind vs perf_event_fork() <-
perf_event_alloc_task_data(), which can still race. I think the simplest
solution is replacing the above DEFINE_MUTEX with DEFINE_PERCPU_RWSEM
and using the read side in perf_event_alloc_task_data().

AFAICT this solves all the global races.
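
Spelled out with the percpu-rwsem in place of the mutex (sketch only;
the per-thread allocate/remove loops and error handling elided):

static DEFINE_STATIC_PERCPU_RWSEM(global_ctx_data_rwsem);
static refcount_t global_ctx_data_ref;

static int attach_global_ctx_data(struct kmem_cache *ctx_cache)
{
	if (refcount_inc_not_zero(&global_ctx_data_ref))
		return 0;

	percpu_down_write(&global_ctx_data_rwsem);
	if (!refcount_inc_not_zero(&global_ctx_data_ref)) {

		/* ... attach perf_ctx_data to every existing thread ... */

		refcount_set(&global_ctx_data_ref, 1);
	}
	percpu_up_write(&global_ctx_data_rwsem);

	return 0;
}

static void detach_global_ctx_data(void)
{
	if (refcount_dec_not_one(&global_ctx_data_ref))
		return;

	percpu_down_write(&global_ctx_data_rwsem);
	if (!refcount_dec_and_test(&global_ctx_data_ref)) {
		percpu_up_write(&global_ctx_data_rwsem);
		return;
	}

	/* ... detach perf_ctx_data from every thread ... */

	percpu_up_write(&global_ctx_data_rwsem);
}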

> +repeat:
> +	/*
> +	 * Allocate perf_ctx_data for all existing threads.
> +	 * The perf_ctx_data for new threads will be allocated in
> +	 * perf_event_fork().
> +	 * Do a quick allocation in first round with GFP_ATOMIC.
> +	 */
> +	for (i = 0; i < num_thread; i++) {
> +		if (alloc_perf_ctx_data(ctx_cache, flags, &data[i]))
> +			break;
> +	}
> +	num_thread = i;
> +	nr_failed_alloc = 0;
> +	pos = 0;
> +
> +	rcu_read_lock();
> +	for_each_process_thread(g, p) {
> +		raw_spin_lock(&p->perf_ctx_data_lock);
> +		tsk_data = p->perf_ctx_data;
> +		if (tsk_data) {
> +			/*
> +			 * The perf_ctx_data for this thread may have been
> +			 * allocated by a per-task event.
> +			 * Only update refcount for the case.
> +			 */
> +			if (!has_task_data_sys_wide(tsk_data))
> +				refcount_add(TASK_DATA_SYS_WIDE, &tsk_data->refcount);
> +			raw_spin_unlock(&p->perf_ctx_data_lock);
> +			continue;
> +		}
> +
> +		if (pos < num_thread) {
> +			refcount_set(&data[pos]->refcount, TASK_DATA_SYS_WIDE);
> +			rcu_assign_pointer(p->perf_ctx_data, data[pos++]);
> +		} else {
> +			/*
> +			 * The quick allocation in the first round may have failed.
> +			 * Track the number in nr_failed_alloc.
> +			 */
> +			nr_failed_alloc++;
> +		}
> +		raw_spin_unlock(&p->perf_ctx_data_lock);
> +	}
> +	rcu_read_unlock();
> +
> +	if (re_alloc && nr_failed_alloc) {
> +		num_thread = nr_failed_alloc;
> +		flags = GFP_KERNEL;
> +		re_alloc = false;
> +		goto repeat;
> +	}
> +
> +	if (nr_failed_alloc) {
> +		printk_once(KERN_DEBUG
> +			    "Failed to allocate space for LBR callstack. "
> +			    "The LBR callstack for some tasks may be cutoff.\n");
> +	}
> +
> +	for (; pos < num_thread; pos++)
> +		free_perf_ctx_data(data[pos]);
> +
> +	kvfree(data);
> +	return 0;

*groan*

What's wrong with something simple, like this:

again:
	rcu_read_lock();
	for_each_process_thread(g, p) {
		struct perf_ctx_data *cd = rcu_dereference(p->perf_ctx_data);
		if (cd && !cd->global) {
			cd->global = 1;
			if (!refcount_inc_not_zero(&cd->refcount))
				cd = NULL;
		}
		if (!cd) {
			get_task_struct(p);
			rcu_read_unlock();

			ret = attach_task_ctx_data(p, ctx_cache, true);
			put_task_struct(p);
			if (ret)
				return ret;

			goto again;
		}
	}
	rcu_read_unlock();
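
(The three-argument attach_task_ctx_data() used here is a guess;
presumably it is the earlier cmpxchg loop with the @global flag plumbed
through, roughly:)

static int attach_task_ctx_data(struct task_struct *task,
				struct kmem_cache *ctx_cache, bool global)
{
	struct perf_ctx_data *cd, *old = NULL;

	cd = alloc_perf_ctx_data(ctx_cache, global);
	if (!cd)
		return -ENOMEM;

	for (;;) {
		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
			if (old)
				free_perf_ctx_data_rcu(old);
			return 0;
		}

		if (!old) {
			/* raced with removal and lost, retry the install */
			continue;
		}

		if (refcount_inc_not_zero(&old->refcount)) {
			if (global)
				old->global = 1;
			free_perf_ctx_data(cd);	/* unused */
			return 0;
		}

		/* @old is dead, refcount==0 is stable, replace it with @cd */
	}
}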

> +}
> +
> +static int
> +attach_perf_ctx_data(struct perf_event *event)
> +{
> +	struct task_struct *task = event->hw.target;
> +	struct kmem_cache *ctx_cache = event->pmu->task_ctx_cache;

	if (!ctx_cache)
		return;

This is the place to stop if there's no ctx_cache.

> +
> +	if (task)
> +		return attach_task_ctx_data(task, ctx_cache);
> +	else
> +		return attach_system_wide_ctx_data(ctx_cache);
> +}
> +
> +/**
> + * detach_task_ctx_data - Detach perf_ctx_data RCU pointer for a task
> + *			  monitored by per-task event
> + * @task:        Target Task
> + * @force:       Unconditionally free perf_ctx_data
> + *
> + * If force is set, free perf_ctx_data unconditionally.
> + * Otherwise, free perf_ctx_data when there are no users.
> + * Lock is required to sync the writers of perf_ctx_data RCU pointer
> + */
> +static void
> +detach_task_ctx_data(struct task_struct *task, bool force)

You're conflating detach_task_ctx_data() with free_task_ctx_data_rcu().

> +{
> +	struct perf_ctx_data *ctx_data;
> +
> +	raw_spin_lock(&task->perf_ctx_data_lock);
> +
> +	ctx_data = rcu_dereference_protected(task->perf_ctx_data,
> +				lockdep_is_held(&task->perf_ctx_data_lock));
> +
> +	if (!ctx_data)
> +		goto unlock;
> +
> +	if (!force) {
> +		WARN_ON_ONCE(refcount_read(&ctx_data->refcount) == TASK_DATA_SYS_WIDE);
> +
> +		if (!refcount_dec_and_test(&ctx_data->refcount))
> +			goto unlock;
> +	}
> +
> +	RCU_INIT_POINTER(task->perf_ctx_data, NULL);
> +	call_rcu(&ctx_data->rcu_head, free_perf_ctx_data_rcu);
> +
> +unlock:
> +	raw_spin_unlock(&task->perf_ctx_data_lock);


detach_task_ctx_data(struct task_struct *p)
{
	struct perf_ctx_data *cd = rcu_dereference(p->perf_ctx_data);

	if (!cd)
		return;

	if (!refcount_dec_and_test(&cd->refcount))
		return;

	if (!try_cmpxchg(&p->perf_ctx_data, &cd, NULL)) {
		/* we lost the race, nothing more to do */
		return;
	}

	free_task_ctx_data_rcu(cd);
}

> +}
> +
> +/**
> + * detach_task_ctx_data_sys_wide - Detach perf_ctx_data RCU pointer for
> + *				   a task monitored by system-wide event
> + * @task:        Target Task
> + *
> + * Free perf_ctx_data when there are no users.
> + */
> +static void
> +detach_task_ctx_data_sys_wide(struct task_struct *task)
> +{
> +	struct perf_ctx_data *ctx_data;
> +
> +	lockdep_assert_held(&task->perf_ctx_data_lock);
> +
> +	ctx_data = rcu_dereference_protected(task->perf_ctx_data,
> +				lockdep_is_held(&task->perf_ctx_data_lock));
> +	if (!ctx_data)
> +		return;
> +
> +	WARN_ON_ONCE(!has_task_data_sys_wide(ctx_data));
> +
> +	if (!refcount_sub_and_test(TASK_DATA_SYS_WIDE, &ctx_data->refcount))
> +		return;
> +
> +	RCU_INIT_POINTER(task->perf_ctx_data, NULL);
> +	call_rcu(&ctx_data->rcu_head, free_perf_ctx_data_rcu);

	struct perf_ctx_data *cd = rcu_dereference(p->perf_ctx_data);

	if (!cd || !cd->global)
		return;

	cd->global = 0;
	detach_task_ctx_data(p);

> +}
> +
> +static void detach_system_wide_ctx_data(void)
> +{
> +	struct task_struct *g, *p;
> +
> +	if (!atomic_dec_and_test(&nr_task_data_sys_wide_events))
> +		return;
> +
> +	rcu_read_lock();
> +	for_each_process_thread(g, p) {
> +		raw_spin_lock(&p->perf_ctx_data_lock);
> +
> +		/*
> +		 * A new system-wide event may be attached while freeing
> +		 * everything for the old event.
> +		 * If so, stop the free process immediately.
> +		 * For the freed threads, attach_system_wide_ctx_data()
> +		 * will re-allocate the space.
> +		 */
> +		if (unlikely(atomic_read(&nr_task_data_sys_wide_events))) {
> +			raw_spin_unlock(&p->perf_ctx_data_lock);
> +			goto unlock;
> +		}
> +
> +		detach_task_ctx_data_sys_wide(p);
> +		raw_spin_unlock(&p->perf_ctx_data_lock);
> +	}
> +unlock:
> +	rcu_read_unlock();

See above with attach().

> +}
> +
> +static void detach_perf_ctx_data(struct perf_event *event)
> +{
> +	struct task_struct *task = event->hw.target;

An early termination if !ctx_cache might be useful, saves a bunch of
iteration work.

> +	if (task)
> +		detach_task_ctx_data(task, false);
> +	else
> +		detach_system_wide_ctx_data();
> +}
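
With the simplified helpers sketched above, that early out would be
something like:

static void detach_perf_ctx_data(struct perf_event *event)
{
	struct task_struct *task = event->hw.target;

	if (!event->pmu->task_ctx_cache)
		return;

	if (task)
		detach_task_ctx_data(task);
	else
		detach_global_ctx_data();
}
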
> +
>  static void unaccount_event(struct perf_event *event)
>  {
>  	bool dec = false;

> @@ -7841,10 +8113,63 @@ static void perf_event_task(struct task_struct *task,
>  		       task_ctx);
>  }
>  
> +/*
> + * Allocate data for a new task when profiling system-wide
> + * events which require PMU specific data
> + */
> +static void perf_event_alloc_task_data(struct task_struct *child,
> +				       struct task_struct *parent)
> +{
> +	struct kmem_cache *ctx_cache = NULL;
> +	struct perf_ctx_data *ctx_data;
> +
> +	if (!atomic_read(&nr_task_data_sys_wide_events))
> +		return;
> +
> +	rcu_read_lock();
> +	ctx_data = rcu_dereference(parent->perf_ctx_data);
> +	if (ctx_data)
> +		ctx_cache = ctx_data->ctx_cache;
> +	rcu_read_unlock();
> +
> +	if (!ctx_cache)
> +		return;
> +
> +	if (alloc_perf_ctx_data(ctx_cache, GFP_KERNEL, &ctx_data))
> +		return;
> +
> +	raw_spin_lock(&child->perf_ctx_data_lock);
> +
> +	if (child->perf_ctx_data) {
> +		free_perf_ctx_data(ctx_data);
> +	} else {
> +		refcount_set(&ctx_data->refcount, TASK_DATA_SYS_WIDE);
> +		rcu_assign_pointer(child->perf_ctx_data, ctx_data);
> +	}
> +
> +	/*
> +	 * A system-wide event may be unaccounted while attaching the perf_ctx_data.
> +	 * For example,
> +	 *                CPU A                              CPU B
> +	 *        perf_event_alloc_task_data():
> +	 *          read(nr_task_data_sys_wide_events)
> +	 *                                         detach_system_wide_ctx_data()
> +	 *          alloc_perf_ctx_data()
> +	 *          rcu_assign_pointer(perf_ctx_data);
> +	 *
> +	 * The perf_ctx_data may never be freed until the task is terminated.
> +	 */
> +	if (unlikely(!atomic_read(&nr_task_data_sys_wide_events)))
> +		detach_task_ctx_data_sys_wide(child);
> +
> +	raw_spin_unlock(&child->perf_ctx_data_lock);
> +}

That race goes away if you use the percpu-rwsem properly.
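
i.e. the fork side takes the read side around the whole thing, roughly
(a sketch, reusing the names from the global sketch above):

static void perf_event_alloc_task_data(struct task_struct *child,
				       struct task_struct *parent)
{
	struct kmem_cache *ctx_cache = NULL;
	struct perf_ctx_data *cd;

	/* cheap check before taking the lock */
	if (!refcount_read(&global_ctx_data_ref))
		return;

	/* the read side keeps attach/detach_global_ctx_data() out */
	percpu_down_read(&global_ctx_data_rwsem);

	if (!refcount_read(&global_ctx_data_ref))
		goto unlock;

	rcu_read_lock();
	cd = rcu_dereference(parent->perf_ctx_data);
	if (cd)
		ctx_cache = cd->ctx_cache;
	rcu_read_unlock();

	if (ctx_cache)
		attach_task_ctx_data(child, ctx_cache, true);
unlock:
	percpu_up_read(&global_ctx_data_rwsem);
}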




* Re: [PATCH V4 1/6] perf: Save PMU specific data in task_struct
  2021-05-21  7:24 ` [PATCH V4 1/6] perf: Save PMU specific data in task_struct Peter Zijlstra
@ 2021-05-21 15:54   ` Liang, Kan
  0 siblings, 0 replies; 8+ messages in thread
From: Liang, Kan @ 2021-05-21 15:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, tglx, bp, linux-kernel, eranian, vitaly.slobodskoy,
	namhyung, ak



On 5/21/2021 3:24 AM, Peter Zijlstra wrote:
> I'm replying to the whole thing squashed, because I couldn't make sense
> of the individual patches much.

Thanks for the detailed comments. The optimization suggestions would
make the code much simpler and more efficient.
I will modify the code based on your suggestions and do more tests.

Thanks,
Kan


Thread overview: 8+ messages
2021-05-19 15:06 [PATCH V4 1/6] perf: Save PMU specific data in task_struct kan.liang
2021-05-19 15:06 ` [PATCH V4 2/6] perf: attach/detach PMU specific data kan.liang
2021-05-19 15:06 ` [PATCH V4 3/6] perf: Supply task information to sched_task() kan.liang
2021-05-19 15:06 ` [PATCH V4 4/6] perf/x86/lbr: Fix shorter LBRs call stacks for the system-wide mode kan.liang
2021-05-19 15:06 ` [PATCH V4 5/6] perf/x86: Remove swap_task_ctx() kan.liang
2021-05-19 15:06 ` [PATCH V4 6/6] perf: Clean up pmu specific data kan.liang
2021-05-21  7:24 ` [PATCH V4 1/6] perf: Save PMU specific data in task_struct Peter Zijlstra
2021-05-21 15:54   ` Liang, Kan
