linux-kernel.vger.kernel.org archive mirror
* [PATCH v4 0/10] Implement group-read of events using txn interface
@ 2015-07-27  5:40 Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 01/10] perf: Add a flags parameter to pmu txn interfaces Sukadev Bhattiprolu
                   ` (9 more replies)
  0 siblings, 10 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

Unlike normal hardware PMCs, the 24x7 counters in Power8 are stored in
memory and accessed via a hypervisor call (HCALL).  A major aspect of the
HCALL is that it allows retrieving _several_ counters at once (unlike
regular PMCs, which are read one at a time). By reading several counters
at once, we can get a more consistent snapshot of the system.

This patchset extends the transaction interface so that several events can
be submitted to the PMU and the PMU can read them all at once. Users are
expected to submit the set of events they want to read as an "event group".

In the kernel, we submit each event to the PMU using the following logic
(from Peter Zijlstra).

        pmu->start_txn(pmu, PMU_TXN_READ);

        leader->read();
        for_each_sibling()
                sibling->read();
        pmu->commit_txn();

where:
        - the ->read()s queue events to be submitted to the hypervisor, and,
        - the ->commit_txn() issues the HCALL, retrieves the result and
          updates the event count.

Architectures/PMUs that don't need or implement the PMU_TXN_READ type of
transaction simply ignore the ->start_txn() and ->commit_txn() calls and
continue to read the counters one at a time in ->read().
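
For such PMUs the change amounts to caching the flags in ->start_txn() and
bailing out early for anything other than PERF_PMU_TXN_ADD. A minimal sketch
of that pattern (it mirrors the per-architecture diffs below; treat the
function and struct names as illustrative):

        static void example_pmu_start_txn(struct pmu *pmu, int txn_flags)
        {
                struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);

                cpuhw->txn_flags = txn_flags;   /* remember the txn type */
                if (txn_flags & ~PERF_PMU_TXN_ADD)
                        return;                 /* not an ADD txn, nothing to do */

                perf_pmu_disable(pmu);          /* existing ADD-txn setup */
        }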

Compile/touch tested on x86. Need help testing on s390 and Sparc.

Thanks to Peter Zijlstra for his input/code.

Changelog[v4]
	- Ensure all the transactions operations happen on the same CPU so PMUs
	  can use per-CPU buffers for the transaction.
	- Add lockdep assert and fix a locking issue in perf_read_group().

Changelog [v3]
	- Simple changes/reorg of patchset to split/rename functions
	- [Peter Zijlstra] Save the transaction flags in ->start_txn() and
	  drop the flags parameter from ->commit_txn() and ->cancel_txn().
	- [Peter Zijlstra] The nop txn interfaces don't need to disable/enable
	  PMU for PERF_PMU_TXN_READ transactions.

Changelog [v2]
        - Use the transaction interface unconditionally to avoid special-case
          code. Architectures/PMUs that don't need the READ transaction types
          simply ignore the ->start_txn() and ->commit_txn() calls.

Peter Zijlstra (Intel) (1):
  perf: Rename perf_event_read_{one,group}, perf_read_hw

Sukadev Bhattiprolu (9):
  perf: Add a flags parameter to pmu txn interfaces
  perf: Split perf_event_read() and perf_event_count()
  perf: Define perf_event_aggregate()
  perf: Unroll perf_event_read_value() in perf_read_group()
  perf: Add return value for perf_event_read().
  perf: Add group parameter to perf_event_read()
  perf: Add return value to __perf_event_read()
  Define PERF_PMU_TXN_READ interface
  powerpc/perf/hv-24x7: Use PMU_TXN_READ interface

 arch/powerpc/perf/core-book3s.c  |   25 +++++-
 arch/powerpc/perf/hv-24x7.c      |  166 ++++++++++++++++++++++++++++++++++++-
 arch/s390/kernel/perf_cpum_cf.c  |   24 +++++-
 arch/sparc/kernel/perf_event.c   |   19 ++++-
 arch/x86/kernel/cpu/perf_event.c |   27 +++++-
 arch/x86/kernel/cpu/perf_event.h |    1 +
 include/linux/perf_event.h       |   15 +++-
 kernel/events/core.c             |  167 +++++++++++++++++++++++++++++++-------
 8 files changed, 403 insertions(+), 41 deletions(-)

-- 
1.7.9.5


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH 01/10] perf: Add a flags parameter to pmu txn interfaces
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 02/10] perf: Split perf_event_read() and perf_event_count() Sukadev Bhattiprolu
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

Currently, the PMU interface allows reading only one counter at a time.
But some PMUs, like the 24x7 counters in Power, support reading several
counters at once. To leverage this functionality, extend the transaction
interface to support a "transaction type".

The first type, PERF_PMU_TXN_ADD, refers to the existing transactions,
i.e. those used to _schedule_ all the events on the PMU as a group. A second
transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch,
by the 24x7 counters to read several counters at once.

Extend the transaction interfaces to the PMU to accept a 'txn_flags'
parameter and use this parameter to ignore any transactions that are
not of type PERF_PMU_TXN_ADD.

Thanks to Peter Zijlstra for his input.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
Changelog[v4]
	- [Peter Zijlstra] Fix a copy-paste error in power_pmu_cancel_txn().
	- [Peter Zijlstra] Use __this_cpu_read() and __this_cpu_write().

Changelog[v3]
	- [Peter Zijlstra] Ensure the nop_txn interfaces disable/enable
	  PMU only for TXN_ADD transactions.
	- [Peter Zijlstra] Cache the flags parameter in ->start_txn() and
	  drop the flags parameter from ->commit_txn() and ->cancel_txn().

---
 arch/powerpc/perf/core-book3s.c  |   25 ++++++++++++++++++++++++-
 arch/s390/kernel/perf_cpum_cf.c  |   24 +++++++++++++++++++++++-
 arch/sparc/kernel/perf_event.c   |   19 ++++++++++++++++++-
 arch/x86/kernel/cpu/perf_event.c |   27 +++++++++++++++++++++++++--
 arch/x86/kernel/cpu/perf_event.h |    1 +
 include/linux/perf_event.h       |   14 +++++++++++---
 kernel/events/core.c             |   31 ++++++++++++++++++++++++++++---
 7 files changed, 130 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index d90893b..b18efe4 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -50,6 +50,7 @@ struct cpu_hw_events {
 
 	unsigned int group_flag;
 	int n_txn_start;
+	int txn_flags;
 
 	/* BHRB bits */
 	u64				bhrb_filter;	/* BHRB HW branch filter */
@@ -1586,11 +1587,19 @@ static void power_pmu_stop(struct perf_event *event, int ef_flags)
  * Start group events scheduling transaction
  * Set the flag to make pmu::enable() not perform the
  * schedulability test, it will be performed at commit time
+ *
+ * We only support PERF_PMU_TXN_ADD transactions. Save the
+ * transaction flags but otherwise ignore non-PERF_PMU_TXN_ADD
+ * transactions.
  */
-static void power_pmu_start_txn(struct pmu *pmu)
+static void power_pmu_start_txn(struct pmu *pmu, int txn_flags)
 {
 	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
 
+	cpuhw->txn_flags = txn_flags;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return;
+
 	perf_pmu_disable(pmu);
 	cpuhw->group_flag |= PERF_EVENT_TXN;
 	cpuhw->n_txn_start = cpuhw->n_events;
@@ -1604,6 +1613,12 @@ static void power_pmu_start_txn(struct pmu *pmu)
 static void power_pmu_cancel_txn(struct pmu *pmu)
 {
 	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
+	int txn_flags;
+
+	txn_flags = cpuhw->txn_flags;
+	cpuhw->txn_flags = 0;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return;
 
 	cpuhw->group_flag &= ~PERF_EVENT_TXN;
 	perf_pmu_enable(pmu);
@@ -1618,10 +1633,18 @@ static int power_pmu_commit_txn(struct pmu *pmu)
 {
 	struct cpu_hw_events *cpuhw;
 	long i, n;
+	int txn_flags;
 
 	if (!ppmu)
 		return -EAGAIN;
+
 	cpuhw = this_cpu_ptr(&cpu_hw_events);
+
+	txn_flags = cpuhw->txn_flags;
+	cpuhw->txn_flags = 0;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return 0;
+
 	n = cpuhw->n_events;
 	if (check_excludes(cpuhw->event, cpuhw->flags, 0, n))
 		return -EAGAIN;
diff --git a/arch/s390/kernel/perf_cpum_cf.c b/arch/s390/kernel/perf_cpum_cf.c
index 56fdad4..a6f9e7b 100644
--- a/arch/s390/kernel/perf_cpum_cf.c
+++ b/arch/s390/kernel/perf_cpum_cf.c
@@ -72,6 +72,7 @@ struct cpu_hw_events {
 	atomic_t		ctr_set[CPUMF_CTR_SET_MAX];
 	u64			state, tx_state;
 	unsigned int		flags;
+	int			txn_flags;
 };
 static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
 	.ctr_set = {
@@ -82,6 +83,7 @@ static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = {
 	},
 	.state = 0,
 	.flags = 0,
+	.txn_flags = 0,
 };
 
 static int get_counter_set(u64 event)
@@ -572,11 +574,19 @@ static void cpumf_pmu_del(struct perf_event *event, int flags)
 /*
  * Start group events scheduling transaction.
  * Set flags to perform a single test at commit time.
+ *
+ * We only support PERF_PMU_TXN_ADD transactions. Save the
+ * transaction flags but otherwise ignore non-PERF_PMU_TXN_ADD
+ * transactions.
  */
-static void cpumf_pmu_start_txn(struct pmu *pmu)
+static void cpumf_pmu_start_txn(struct pmu *pmu, int txn_flags)
 {
 	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
 
+	cpuhw->txn_flags = txn_flags;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return;
+
 	perf_pmu_disable(pmu);
 	cpuhw->flags |= PERF_EVENT_TXN;
 	cpuhw->tx_state = cpuhw->state;
@@ -589,8 +599,14 @@ static void cpumf_pmu_start_txn(struct pmu *pmu)
  */
 static void cpumf_pmu_cancel_txn(struct pmu *pmu)
 {
+	int txn_flags;
 	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
 
+	txn_flags = cpuhw->txn_flags;
+	cpuhw->txn_flags = 0;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return;
+
 	WARN_ON(cpuhw->tx_state != cpuhw->state);
 
 	cpuhw->flags &= ~PERF_EVENT_TXN;
@@ -604,9 +620,15 @@ static void cpumf_pmu_cancel_txn(struct pmu *pmu)
  */
 static int cpumf_pmu_commit_txn(struct pmu *pmu)
 {
+	int txn_flags;
 	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
 	u64 state;
 
+	txn_flags = cpuhw->txn_flags;
+	cpuhw->txn_flags = 0;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return 0;
+
 	/* check if the updated state can be scheduled */
 	state = cpuhw->state & ~((1 << CPUMF_LCCTL_ENABLE_SHIFT) - 1);
 	state >>= CPUMF_LCCTL_ENABLE_SHIFT;
diff --git a/arch/sparc/kernel/perf_event.c b/arch/sparc/kernel/perf_event.c
index 689db65..b5fcbce 100644
--- a/arch/sparc/kernel/perf_event.c
+++ b/arch/sparc/kernel/perf_event.c
@@ -109,6 +109,7 @@ struct cpu_hw_events {
 	int			enabled;
 
 	unsigned int		group_flag;
+	int			txn_flags;
 };
 static DEFINE_PER_CPU(struct cpu_hw_events, cpu_hw_events) = { .enabled = 1, };
 
@@ -1494,10 +1495,14 @@ static int sparc_pmu_event_init(struct perf_event *event)
  * Set the flag to make pmu::enable() not perform the
  * schedulability test, it will be performed at commit time
  */
-static void sparc_pmu_start_txn(struct pmu *pmu)
+static void sparc_pmu_start_txn(struct pmu *pmu, int txn_flags)
 {
 	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
 
+	cpuhw->txn_flags = txn_flags;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return;
+
 	perf_pmu_disable(pmu);
 	cpuhw->group_flag |= PERF_EVENT_TXN;
 }
@@ -1510,6 +1515,12 @@ static void sparc_pmu_start_txn(struct pmu *pmu)
 static void sparc_pmu_cancel_txn(struct pmu *pmu)
 {
 	struct cpu_hw_events *cpuhw = this_cpu_ptr(&cpu_hw_events);
+	int txn_flags;
+
+	txn_flags = cpuhw->txn_flags;
+	cpuhw->txn_flags = 0;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return;
 
 	cpuhw->group_flag &= ~PERF_EVENT_TXN;
 	perf_pmu_enable(pmu);
@@ -1524,11 +1535,17 @@ static int sparc_pmu_commit_txn(struct pmu *pmu)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int n;
+	int txn_flags;
 
 	if (!sparc_pmu)
 		return -EINVAL;
 
 	cpuc = this_cpu_ptr(&cpu_hw_events);
+	txn_flags = cpuc->txn_flags;
+	cpuc->txn_flags = 0;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return 0;
+
 	n = cpuc->n_events;
 	if (check_excludes(cpuc->event, 0, n))
 		return -EINVAL;
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 3658de4..3a7e08f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1748,9 +1748,19 @@ static inline void x86_pmu_read(struct perf_event *event)
  * Start group events scheduling transaction
  * Set the flag to make pmu::enable() not perform the
  * schedulability test, it will be performed at commit time
+ *
+ * We only support PERF_PMU_TXN_ADD transactions. Save the
+ * transaction flags but otherwise ignore non-PERF_PMU_TXN_ADD
+ * transactions.
  */
-static void x86_pmu_start_txn(struct pmu *pmu)
+static void x86_pmu_start_txn(struct pmu *pmu, int txn_flags)
 {
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	cpuc->txn_flags = txn_flags;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return;
+
 	perf_pmu_disable(pmu);
 	__this_cpu_or(cpu_hw_events.group_flag, PERF_EVENT_TXN);
 	__this_cpu_write(cpu_hw_events.n_txn, 0);
@@ -1763,6 +1773,14 @@ static void x86_pmu_start_txn(struct pmu *pmu)
  */
 static void x86_pmu_cancel_txn(struct pmu *pmu)
 {
+	int txn_flags;
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	txn_flags = cpuc->txn_flags;
+	cpuc->txn_flags = 0;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return;
+
 	__this_cpu_and(cpu_hw_events.group_flag, ~PERF_EVENT_TXN);
 	/*
 	 * Truncate collected array by the number of events added in this
@@ -1784,7 +1802,12 @@ static int x86_pmu_commit_txn(struct pmu *pmu)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int assign[X86_PMC_IDX_MAX];
-	int n, ret;
+	int n, ret, txn_flags;
+
+	txn_flags = cpuc->txn_flags;
+	cpuc->txn_flags = 0;
+	if (txn_flags & ~PERF_PMU_TXN_ADD)
+		return 0;
 
 	n = cpuc->n_events;
 
diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h
index 3e7fd27..87b6bb7 100644
--- a/arch/x86/kernel/cpu/perf_event.h
+++ b/arch/x86/kernel/cpu/perf_event.h
@@ -197,6 +197,7 @@ struct cpu_hw_events {
 
 	unsigned int		group_flag;
 	int			is_fake;
+	int			txn_flags;
 
 	/*
 	 * Intel DebugStore bits
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2027809..44bf05f 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -168,6 +168,8 @@ struct perf_event;
  */
 #define PERF_EVENT_TXN 0x1
 
+#define PERF_PMU_TXN_ADD  0x1		/* txn to add/schedule event on PMU */
+
 /**
  * pmu::capabilities flags
  */
@@ -252,20 +254,26 @@ struct pmu {
 	 *
 	 * Start the transaction, after this ->add() doesn't need to
 	 * do schedulability tests.
+	 *
+	 * Optional.
 	 */
-	void (*start_txn)		(struct pmu *pmu); /* optional */
+	void (*start_txn)		(struct pmu *pmu, int txn_flags);
 	/*
 	 * If ->start_txn() disabled the ->add() schedulability test
 	 * then ->commit_txn() is required to perform one. On success
 	 * the transaction is closed. On error the transaction is kept
 	 * open until ->cancel_txn() is called.
+	 *
+	 * Optional.
 	 */
-	int  (*commit_txn)		(struct pmu *pmu); /* optional */
+	int  (*commit_txn)		(struct pmu *pmu);
 	/*
 	 * Will cancel the transaction, assumes ->del() is called
 	 * for each successful ->add() during the transaction.
+	 *
+	 * Optional.
 	 */
-	void (*cancel_txn)		(struct pmu *pmu); /* optional */
+	void (*cancel_txn)		(struct pmu *pmu);
 
 	/*
 	 * Will return the value for perf_event_mmap_page::index for this event,
diff --git a/kernel/events/core.c b/kernel/events/core.c
index d3dae34..4435bf5 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -1913,7 +1913,7 @@ group_sched_in(struct perf_event *group_event,
 	if (group_event->state == PERF_EVENT_STATE_OFF)
 		return 0;
 
-	pmu->start_txn(pmu);
+	pmu->start_txn(pmu, PERF_PMU_TXN_ADD);
 
 	if (event_sched_in(group_event, cpuctx, ctx)) {
 		pmu->cancel_txn(pmu);
@@ -7074,24 +7074,49 @@ static void perf_pmu_nop_void(struct pmu *pmu)
 {
 }
 
+static void perf_pmu_nop_txn(struct pmu *pmu, int flags)
+{
+}
+
 static int perf_pmu_nop_int(struct pmu *pmu)
 {
 	return 0;
 }
 
-static void perf_pmu_start_txn(struct pmu *pmu)
+DEFINE_PER_CPU(int, nop_txn_flags);
+
+static void perf_pmu_start_txn(struct pmu *pmu, int flags)
 {
+	__this_cpu_write(nop_txn_flags, flags);
+
+	if (flags & ~PERF_PMU_TXN_ADD)
+		return;
+
 	perf_pmu_disable(pmu);
 }
 
 static int perf_pmu_commit_txn(struct pmu *pmu)
 {
+	int flags = __this_cpu_read(nop_txn_flags);
+
+	__this_cpu_write(nop_txn_flags, 0);
+
+	if (flags & ~PERF_PMU_TXN_ADD)
+		return 0;
+
 	perf_pmu_enable(pmu);
 	return 0;
 }
 
 static void perf_pmu_cancel_txn(struct pmu *pmu)
 {
+	int flags =  __this_cpu_read(nop_txn_flags);
+
+	__this_cpu_write(nop_txn_flags, 0);
+
+	if (flags & ~PERF_PMU_TXN_ADD)
+		return;
+
 	perf_pmu_enable(pmu);
 }
 
@@ -7330,7 +7355,7 @@ got_cpu_context:
 			pmu->commit_txn = perf_pmu_commit_txn;
 			pmu->cancel_txn = perf_pmu_cancel_txn;
 		} else {
-			pmu->start_txn  = perf_pmu_nop_void;
+			pmu->start_txn  = perf_pmu_nop_txn;
 			pmu->commit_txn = perf_pmu_nop_int;
 			pmu->cancel_txn = perf_pmu_nop_void;
 		}
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 02/10] perf: Split perf_event_read() and perf_event_count()
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 01/10] perf: Add a flags parameter to pmu txn interfaces Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 03/10] perf: Define perf_event_aggregate() Sukadev Bhattiprolu
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

perf_event_read() does two things:

	- call the PMU to read/update the counter value, and
	- compute the total count of the event and its children

Not all callers need both. perf_event_reset() for instance needs the
first piece but doesn't need the second.  Similarly, when we implement
the ability to read a group of events using the transaction interface,
we would need the two pieces done independently.

Break up perf_event_read() and have it just read/update the counter
and have the callers compute the total count if necessary.
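
After the split, a caller that wants the total count follows the pattern the
diff below applies to perf_event_read_value() (a sketch, locking elided):

	perf_event_read(event);			/* have the PMU update the count */
	total = perf_event_count(event);	/* caller computes the total */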

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 kernel/events/core.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 4435bf5..f9ca8cb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3212,7 +3212,7 @@ static inline u64 perf_event_count(struct perf_event *event)
 	return __perf_event_count(event);
 }
 
-static u64 perf_event_read(struct perf_event *event)
+static void perf_event_read(struct perf_event *event)
 {
 	/*
 	 * If event is enabled and currently active on a CPU, update the
@@ -3238,8 +3238,6 @@ static u64 perf_event_read(struct perf_event *event)
 		update_event_times(event);
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
-
-	return perf_event_count(event);
 }
 
 /*
@@ -3751,14 +3749,18 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 	*running = 0;
 
 	mutex_lock(&event->child_mutex);
-	total += perf_event_read(event);
+
+	perf_event_read(event);
+	total += perf_event_count(event);
+
 	*enabled += event->total_time_enabled +
 			atomic64_read(&event->child_total_time_enabled);
 	*running += event->total_time_running +
 			atomic64_read(&event->child_total_time_running);
 
 	list_for_each_entry(child, &event->child_list, child_list) {
-		total += perf_event_read(child);
+		perf_event_read(child);
+		total += perf_event_count(child);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
 	}
@@ -3918,7 +3920,7 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
 
 static void _perf_event_reset(struct perf_event *event)
 {
-	(void)perf_event_read(event);
+	perf_event_read(event);
 	local64_set(&event->count, 0);
 	perf_event_update_userpage(event);
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 03/10] perf: Define perf_event_aggregate()
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 01/10] perf: Add a flags parameter to pmu txn interfaces Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 02/10] perf: Split perf_event_read() and perf_event_count() Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 04/10] perf: Rename perf_event_read_{one,group}, perf_read_hw Sukadev Bhattiprolu
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

Move the part of perf_event_read_value() that aggregates the event
counts and event times into a new function, perf_event_aggregate().

This would allow us to call perf_event_aggregate() independently.
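
With the aggregation factored out, perf_event_read_value() reduces to roughly
the following (this is the shape the diff below produces):

	mutex_lock(&event->child_mutex);

	perf_event_read(event);
	total = perf_event_aggregate(event, enabled, running);

	mutex_unlock(&event->child_mutex);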

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
Changelog[v4]
	[Peter Zijlstra] Add missing lockdep_assert(). Rename
	perf_event_compute() (to perf_event_aggregate()).

Changelog[v3]
	Rather than move perf_event_read() into callers and then
	rename, just move the computations into a separate function
	(redesign to address comment from Peter Zijlstra).

---
 kernel/events/core.c |   39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f9ca8cb..97619ed 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3704,6 +3704,31 @@ static int perf_release(struct inode *inode, struct file *file)
 	return 0;
 }
 
+static u64 perf_event_aggregate(struct perf_event *event, u64 *enabled,
+			      u64 *running)
+{
+	struct perf_event *child;
+	u64 total;
+
+	total = perf_event_count(event);
+
+	*enabled += event->total_time_enabled +
+			atomic64_read(&event->child_total_time_enabled);
+	*running += event->total_time_running +
+			atomic64_read(&event->child_total_time_running);
+
+	lockdep_assert_held(&event->child_mutex);
+
+	list_for_each_entry(child, &event->child_list, child_list) {
+		perf_event_read(child);
+		total += perf_event_count(child);
+		*enabled += child->total_time_enabled;
+		*running += child->total_time_running;
+	}
+
+	return total;
+}
+
 /*
  * Remove all orphanes events from the context.
  */
@@ -3742,7 +3767,6 @@ static void orphans_remove_work(struct work_struct *work)
 
 u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 {
-	struct perf_event *child;
 	u64 total = 0;
 
 	*enabled = 0;
@@ -3751,19 +3775,8 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 	mutex_lock(&event->child_mutex);
 
 	perf_event_read(event);
-	total += perf_event_count(event);
-
-	*enabled += event->total_time_enabled +
-			atomic64_read(&event->child_total_time_enabled);
-	*running += event->total_time_running +
-			atomic64_read(&event->child_total_time_running);
+	total = perf_event_aggregate(event, enabled, running);
 
-	list_for_each_entry(child, &event->child_list, child_list) {
-		perf_event_read(child);
-		total += perf_event_count(child);
-		*enabled += child->total_time_enabled;
-		*running += child->total_time_running;
-	}
 	mutex_unlock(&event->child_mutex);
 
 	return total;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 04/10] perf: Rename perf_event_read_{one,group}, perf_read_hw
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
                   ` (2 preceding siblings ...)
  2015-07-27  5:40 ` [PATCH 03/10] perf: Define perf_event_aggregate() Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 05/10] perf: Unroll perf_event_read_value() in perf_read_group() Sukadev Bhattiprolu
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

From: "Peter Zijlstra (Intel)" <peterz@infradead.org>

In order to free up the perf_event_read_group() name:

 s/perf_event_read_\(one\|group\)/perf_read_\1/g
 s/perf_read_hw/__perf_read/g

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 97619ed..a6bd09d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3675,7 +3675,7 @@ static void put_event(struct perf_event *event)
 	 *     see the comment there.
 	 *
 	 *  2) there is a lock-inversion with mmap_sem through
-	 *     perf_event_read_group(), which takes faults while
+	 *     perf_read_group(), which takes faults while
 	 *     holding ctx->mutex, however this is called after
 	 *     the last filedesc died, so there is no possibility
 	 *     to trigger the AB-BA case.
@@ -3783,7 +3783,7 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 }
 EXPORT_SYMBOL_GPL(perf_event_read_value);
 
-static int perf_event_read_group(struct perf_event *event,
+static int perf_read_group(struct perf_event *event,
 				   u64 read_format, char __user *buf)
 {
 	struct perf_event *leader = event->group_leader, *sub;
@@ -3831,7 +3831,7 @@ static int perf_event_read_group(struct perf_event *event,
 	return ret;
 }
 
-static int perf_event_read_one(struct perf_event *event,
+static int perf_read_one(struct perf_event *event,
 				 u64 read_format, char __user *buf)
 {
 	u64 enabled, running;
@@ -3869,7 +3869,7 @@ static bool is_event_hup(struct perf_event *event)
  * Read the performance event - simple non blocking version for now
  */
 static ssize_t
-perf_read_hw(struct perf_event *event, char __user *buf, size_t count)
+__perf_read(struct perf_event *event, char __user *buf, size_t count)
 {
 	u64 read_format = event->attr.read_format;
 	int ret;
@@ -3887,9 +3887,9 @@ perf_read_hw(struct perf_event *event, char __user *buf, size_t count)
 
 	WARN_ON_ONCE(event->ctx->parent_ctx);
 	if (read_format & PERF_FORMAT_GROUP)
-		ret = perf_event_read_group(event, read_format, buf);
+		ret = perf_read_group(event, read_format, buf);
 	else
-		ret = perf_event_read_one(event, read_format, buf);
+		ret = perf_read_one(event, read_format, buf);
 
 	return ret;
 }
@@ -3902,7 +3902,7 @@ perf_read(struct file *file, char __user *buf, size_t count, loff_t *ppos)
 	int ret;
 
 	ctx = perf_event_ctx_lock(event);
-	ret = perf_read_hw(event, buf, count);
+	ret = __perf_read(event, buf, count);
 	perf_event_ctx_unlock(event, ctx);
 
 	return ret;
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 05/10] perf: Unroll perf_event_read_value() in perf_read_group()
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
                   ` (3 preceding siblings ...)
  2015-07-27  5:40 ` [PATCH 04/10] perf: Rename perf_event_read_{one,group}, perf_read_hw Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 06/10] perf: Add return value for perf_event_read() Sukadev Bhattiprolu
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

Unroll the calls to perf_event_read_value() in perf_read_group()
so we can later optimize out parts we don't need for group events.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 kernel/events/core.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index a6bd09d..0ce3012 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3794,7 +3794,14 @@ static int perf_read_group(struct perf_event *event,
 
 	lockdep_assert_held(&ctx->mutex);
 
-	count = perf_event_read_value(leader, &enabled, &running);
+	enabled = running = 0;
+
+	mutex_lock(&leader->child_mutex);
+
+	perf_event_read(leader);
+	count = perf_event_aggregate(leader, &enabled, &running);
+
+	mutex_unlock(&leader->child_mutex);
 
 	values[n++] = 1 + leader->nr_siblings;
 	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
@@ -3815,7 +3822,13 @@ static int perf_read_group(struct perf_event *event,
 	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
 		n = 0;
 
-		values[n++] = perf_event_read_value(sub, &enabled, &running);
+		mutex_lock(&leader->child_mutex);
+
+		perf_event_read(sub);
+		values[n++] = perf_event_aggregate(sub, &enabled, &running);
+
+		mutex_unlock(&leader->child_mutex);
+
 		if (read_format & PERF_FORMAT_ID)
 			values[n++] = primary_event_id(sub);
 
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 06/10] perf: Add return value for perf_event_read().
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
                   ` (4 preceding siblings ...)
  2015-07-27  5:40 ` [PATCH 05/10] perf: Unroll perf_event_read_value() in perf_read_group() Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 07/10] perf: Add group parameter to perf_event_read() Sukadev Bhattiprolu
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

Add a return value to perf_event_read(). The return value will be
needed later when perf_read_group() implements the ability to read
several counters in a PERF_PMU_TXN_READ transaction.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 kernel/events/core.c |   19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0ce3012..21a55d1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3212,7 +3212,7 @@ static inline u64 perf_event_count(struct perf_event *event)
 	return __perf_event_count(event);
 }
 
-static void perf_event_read(struct perf_event *event)
+static int perf_event_read(struct perf_event *event)
 {
 	/*
 	 * If event is enabled and currently active on a CPU, update the
@@ -3238,6 +3238,8 @@ static void perf_event_read(struct perf_event *event)
 		update_event_times(event);
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
+
+	return 0;
 }
 
 /*
@@ -3720,7 +3722,7 @@ static u64 perf_event_aggregate(struct perf_event *event, u64 *enabled,
 	lockdep_assert_held(&event->child_mutex);
 
 	list_for_each_entry(child, &event->child_list, child_list) {
-		perf_event_read(child);
+		(void)perf_event_read(child);
 		total += perf_event_count(child);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
@@ -3774,7 +3776,7 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 
 	mutex_lock(&event->child_mutex);
 
-	perf_event_read(event);
+	(void)perf_event_read(event);
 	total = perf_event_aggregate(event, enabled, running);
 
 	mutex_unlock(&event->child_mutex);
@@ -3798,7 +3800,12 @@ static int perf_read_group(struct perf_event *event,
 
 	mutex_lock(&leader->child_mutex);
 
-	perf_event_read(leader);
+	ret = perf_event_read(leader);
+	if (ret) {
+		mutex_unlock(&leader->child_mutex);
+		return ret;
+	}
+
 	count = perf_event_aggregate(leader, &enabled, &running);
 
 	mutex_unlock(&leader->child_mutex);
@@ -3824,7 +3831,7 @@ static int perf_read_group(struct perf_event *event,
 
 		mutex_lock(&leader->child_mutex);
 
-		perf_event_read(sub);
+		(void)perf_event_read(sub);
 		values[n++] = perf_event_aggregate(sub, &enabled, &running);
 
 		mutex_unlock(&leader->child_mutex);
@@ -3946,7 +3953,7 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
 
 static void _perf_event_reset(struct perf_event *event)
 {
-	perf_event_read(event);
+	(void)perf_event_read(event);
 	local64_set(&event->count, 0);
 	perf_event_update_userpage(event);
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 07/10] perf: Add group parameter to perf_event_read()
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
                   ` (5 preceding siblings ...)
  2015-07-27  5:40 ` [PATCH 06/10] perf: Add return value for perf_event_read() Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 08/10] perf: Add return value to __perf_event_read() Sukadev Bhattiprolu
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

Add a 'group' parameter to perf_event_read(). It will be used (set
to true) in a follow-on patch to update event times of the group.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 kernel/events/core.c |   17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 21a55d1..f38fe0b 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3212,7 +3212,7 @@ static inline u64 perf_event_count(struct perf_event *event)
 	return __perf_event_count(event);
 }
 
-static int perf_event_read(struct perf_event *event)
+static int perf_event_read(struct perf_event *event, bool group)
 {
 	/*
 	 * If event is enabled and currently active on a CPU, update the
@@ -3235,7 +3235,12 @@ static int perf_event_read(struct perf_event *event)
 			update_context_time(ctx);
 			update_cgrp_time_from_event(event);
 		}
-		update_event_times(event);
+
+		if (group)
+			update_group_times(event);
+		else
+			update_event_times(event);
+
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
 
@@ -3722,7 +3727,7 @@ static u64 perf_event_aggregate(struct perf_event *event, u64 *enabled,
 	lockdep_assert_held(&event->child_mutex);
 
 	list_for_each_entry(child, &event->child_list, child_list) {
-		(void)perf_event_read(child);
+		(void)perf_event_read(child, false);
 		total += perf_event_count(child);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
@@ -3776,7 +3781,7 @@ u64 perf_event_read_value(struct perf_event *event, u64 *enabled, u64 *running)
 
 	mutex_lock(&event->child_mutex);
 
-	(void)perf_event_read(event);
+	(void)perf_event_read(event, false);
 	total = perf_event_aggregate(event, enabled, running);
 
 	mutex_unlock(&event->child_mutex);
@@ -3831,7 +3836,7 @@ static int perf_read_group(struct perf_event *event,
 
 		mutex_lock(&leader->child_mutex);
 
-		(void)perf_event_read(sub);
+		(void)perf_event_read(sub, false);
 		values[n++] = perf_event_aggregate(sub, &enabled, &running);
 
 		mutex_unlock(&leader->child_mutex);
@@ -3953,7 +3958,7 @@ static unsigned int perf_poll(struct file *file, poll_table *wait)
 
 static void _perf_event_reset(struct perf_event *event)
 {
-	(void)perf_event_read(event);
+	(void)perf_event_read(event, false);
 	local64_set(&event->count, 0);
 	perf_event_update_userpage(event);
 }
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 08/10] perf: Add return value to __perf_event_read()
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
                   ` (6 preceding siblings ...)
  2015-07-27  5:40 ` [PATCH 07/10] perf: Add group parameter to perf_event_read() Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 09/10] Define PERF_PMU_TXN_READ interface Sukadev Bhattiprolu
  2015-07-27  5:40 ` [PATCH 10/10] powerpc/perf/hv-24x7: Use PMU_TXN_READ interface Sukadev Bhattiprolu
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

Add a return value to __perf_event_read(). The return value will be
needed later when perf_read_group() implements the ability to read
several counters in a PERF_PMU_TXN_READ transaction.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
 kernel/events/core.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f38fe0b..951d835 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3174,12 +3174,18 @@ void perf_event_exec(void)
 	rcu_read_unlock();
 }
 
+struct perf_read_data {
+	struct perf_event *event;
+	int ret;
+};
+
 /*
  * Cross CPU call to read the hardware event
  */
 static void __perf_event_read(void *info)
 {
-	struct perf_event *event = info;
+	struct perf_read_data *data = info;
+	struct perf_event *event = data->event;
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 
@@ -3201,6 +3207,8 @@ static void __perf_event_read(void *info)
 	update_event_times(event);
 	if (event->state == PERF_EVENT_STATE_ACTIVE)
 		event->pmu->read(event);
+
+	data->ret = 0;
 	raw_spin_unlock(&ctx->lock);
 }
 
@@ -3214,13 +3222,21 @@ static inline u64 perf_event_count(struct perf_event *event)
 
 static int perf_event_read(struct perf_event *event, bool group)
 {
+	int ret = 0;
+
 	/*
 	 * If event is enabled and currently active on a CPU, update the
 	 * value in the event structure:
 	 */
 	if (event->state == PERF_EVENT_STATE_ACTIVE) {
+		struct perf_read_data data = {
+			.event = event,
+			.ret = 0,
+		};
+
 		smp_call_function_single(event->oncpu,
-					 __perf_event_read, event, 1);
+					 __perf_event_read, &data, 1);
+		ret = data.ret;
 	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
 		struct perf_event_context *ctx = event->ctx;
 		unsigned long flags;
@@ -3244,7 +3260,7 @@ static int perf_event_read(struct perf_event *event, bool group)
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
 
-	return 0;
+	return ret;
 }
 
 /*
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 09/10] Define PERF_PMU_TXN_READ interface
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
                   ` (7 preceding siblings ...)
  2015-07-27  5:40 ` [PATCH 08/10] perf: Add return value to __perf_event_read() Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  2015-08-06 12:10   ` Peter Zijlstra
  2015-07-27  5:40 ` [PATCH 10/10] powerpc/perf/hv-24x7: Use PMU_TXN_READ interface Sukadev Bhattiprolu
  9 siblings, 1 reply; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

Define a new PERF_PMU_TXN_READ interface to read a group of counters
at once.

        pmu->start_txn()                // Initialize before first event

        for each event in group
                pmu->read(event);       // Queue each event to be read

        pmu->commit_txn()               // Read/update all queued counters

Note that we use this interface with all PMUs.  PMUs that implement this
interface use the ->read() operation to _queue_ the counters to be read
and use ->commit_txn() to actually read all the queued counters at once.

PMUs that don't implement PERF_PMU_TXN_READ ignore ->start_txn() and
->commit_txn() and continue to read counters one at a time.
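
A PMU that does implement PERF_PMU_TXN_READ ends up with roughly the shape
below (a sketch only; the foo_* helpers are placeholders, and the concrete
implementation for the 24x7 counters is in the last patch of this series):

        static void foo_pmu_read(struct perf_event *event)
        {
                if (__this_cpu_read(foo_txn_flags) & PERF_PMU_TXN_READ)
                        foo_queue_counter(event);       /* read later, in bulk */
                else
                        foo_read_counter(event);        /* read it right away */
        }

        static int foo_pmu_commit_txn(struct pmu *pmu)
        {
                int ret;

                ret = foo_read_queued_counters();       /* one request for all */
                __this_cpu_write(foo_txn_flags, 0);

                return ret;
        }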

Thanks to input from Peter Zijlstra.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
---
Changelog[v4]
        [Peter Zijlstra] Add lockdep_assert_held() in perf_event_read_group().
        Make sure the entire transaction happens on the same CPU.

---
 include/linux/perf_event.h |    1 +
 kernel/events/core.c       |   38 +++++++++++++++++++++++++++++++-------
 2 files changed, 32 insertions(+), 7 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 44bf05f..da307ad 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -169,6 +169,7 @@ struct perf_event;
 #define PERF_EVENT_TXN 0x1
 
 #define PERF_PMU_TXN_ADD  0x1		/* txn to add/schedule event on PMU */
+#define PERF_PMU_TXN_READ 0x2		/* txn to read event group from PMU */
 
 /**
  * pmu::capabilities flags
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 951d835..b5aa92c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3176,6 +3176,7 @@ void perf_event_exec(void)
 
 struct perf_read_data {
 	struct perf_event *event;
+	bool group;
 	int ret;
 };
 
@@ -3186,8 +3187,10 @@ static void __perf_event_read(void *info)
 {
 	struct perf_read_data *data = info;
 	struct perf_event *event = data->event;
+	struct perf_event *sub;
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
+	struct pmu *pmu = event->pmu;
 
 	/*
 	 * If this is a task context, we need to check whether it is
@@ -3205,10 +3208,25 @@ static void __perf_event_read(void *info)
 		update_cgrp_time_from_event(event);
 	}
 	update_event_times(event);
-	if (event->state == PERF_EVENT_STATE_ACTIVE)
-		event->pmu->read(event);
+	if (event->state != PERF_EVENT_STATE_ACTIVE)
+		goto unlock;
+
+	if (!data->group) {
+		pmu->read(event);
+		goto unlock;
+	}
+
+	pmu->start_txn(pmu, PERF_PMU_TXN_READ);
 
-	data->ret = 0;
+	pmu->read(event);
+	list_for_each_entry(sub, &event->sibling_list, group_entry) {
+		if (sub->state == PERF_EVENT_STATE_ACTIVE)
+			pmu->read(sub);
+	}
+
+	data->ret = pmu->commit_txn(pmu);
+
+unlock:
 	raw_spin_unlock(&ctx->lock);
 }
 
@@ -3231,6 +3249,7 @@ static int perf_event_read(struct perf_event *event, bool group)
 	if (event->state == PERF_EVENT_STATE_ACTIVE) {
 		struct perf_read_data data = {
 			.event = event,
+			.group = group,
 			.ret = 0,
 		};
 
@@ -3743,7 +3762,13 @@ static u64 perf_event_aggregate(struct perf_event *event, u64 *enabled,
 	lockdep_assert_held(&event->child_mutex);
 
 	list_for_each_entry(child, &event->child_list, child_list) {
+#if 0
+		/*
+		 * TODO: Do we need this read() for group events on PMUs that
+		 * 	 don't implement PERF_PMU_TXN_READ transactions?
+		 */
 		(void)perf_event_read(child, false);
+#endif
 		total += perf_event_count(child);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
@@ -3821,7 +3846,7 @@ static int perf_read_group(struct perf_event *event,
 
 	mutex_lock(&leader->child_mutex);
 
-	ret = perf_event_read(leader);
+	ret = perf_event_read(leader, true);
 	if (ret) {
 		mutex_unlock(&leader->child_mutex);
 		return ret;
@@ -3850,12 +3875,11 @@ static int perf_read_group(struct perf_event *event,
 	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
 		n = 0;
 
-		mutex_lock(&leader->child_mutex);
+		mutex_lock(&sub->child_mutex);
 
-		(void)perf_event_read(sub, false);
 		values[n++] = perf_event_aggregate(sub, &enabled, &running);
 
-		mutex_unlock(&leader->child_mutex);
+		mutex_unlock(&sub->child_mutex);
 
 		if (read_format & PERF_FORMAT_ID)
 			values[n++] = primary_event_id(sub);
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 10/10] powerpc/perf/hv-24x7: Use PMU_TXN_READ interface
  2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
                   ` (8 preceding siblings ...)
  2015-07-27  5:40 ` [PATCH 09/10] Define PERF_PMU_TXN_READ interface Sukadev Bhattiprolu
@ 2015-07-27  5:40 ` Sukadev Bhattiprolu
  9 siblings, 0 replies; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-07-27  5:40 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman
  Cc: linux-kernel, linuxppc-dev, linux-s390, sparclinux

The 24x7 counters on Powerpc allow a large number of events to be monitored
simultaneously. They also allow reading several counters in a single
HCALL, so we can get a more consistent snapshot of the system.

Use the PMU's transaction interface to monitor and read several event
counters at once. The idea is that users can combine several 24x7 events
into a single event group. We use the following logic to submit
the group of events to the PMU and read the values:

	pmu->start_txn()		// Initialize before first event

	for each event in group
		pmu->read(event);	// Queue each event to be read

	pmu->commit_txn()		// Read/update all queued counters

The ->commit_txn() also updates the event counts in the respective
perf_event objects.  The perf subsystem can then directly get the
event counts from the perf_event and can avoid submitting a new
->read() request to the PMU.

Thanks to input from Peter Zijlstra.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

---
Changelog[v3]
	[Peter Zijlstra] Save the transaction state in ->start_txn() and
	remove the flags parameter from ->commit_txn() and ->cancel_txn().

---
 arch/powerpc/perf/hv-24x7.c |  166 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 164 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 4d1a8d1..b86121c 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -142,6 +142,15 @@ static struct attribute_group event_long_desc_group = {
 
 static struct kmem_cache *hv_page_cache;
 
+DEFINE_PER_CPU(int, hv_24x7_txn_flags);
+DEFINE_PER_CPU(int, hv_24x7_txn_err);
+
+struct hv_24x7_hw {
+	struct perf_event *events[255];
+};
+
+DEFINE_PER_CPU(struct hv_24x7_hw, hv_24x7_hw);
+
 /*
  * request_buffer and result_buffer are not required to be 4k aligned,
  * but are not allowed to cross any 4k boundary. Aligning them to 4k is
@@ -1233,9 +1242,48 @@ static void update_event_count(struct perf_event *event, u64 now)
 static void h_24x7_event_read(struct perf_event *event)
 {
 	u64 now;
+	struct hv_24x7_request_buffer *request_buffer;
+	struct hv_24x7_hw *h24x7hw;
+	int txn_flags;
+
+	txn_flags = __this_cpu_read(hv_24x7_txn_flags);
+
+	/*
+	 * If in a READ transaction, add this counter to the list of
+	 * counters to read during the next HCALL (i.e commit_txn()).
+	 * If not in a READ transaction, go ahead and make the HCALL
+	 * to read this counter by itself.
+	 */
+
+	if (txn_flags & PERF_PMU_TXN_READ) {
+		int i;
+		int ret;
 
-	now = h_24x7_get_value(event);
-	update_event_count(event, now);
+		if (__this_cpu_read(hv_24x7_txn_err))
+			return;
+
+		request_buffer = (void *)get_cpu_var(hv_24x7_reqb);
+
+		ret = add_event_to_24x7_request(event, request_buffer);
+		if (ret) {
+			__this_cpu_write(hv_24x7_txn_err, ret);
+		} else {
+			/*
+			 * Associate the event with the HCALL request index,
+			 * so ->commit_txn() can quickly find/update count.
+			 */
+			i = request_buffer->num_requests - 1;
+
+			h24x7hw = &get_cpu_var(hv_24x7_hw);
+			h24x7hw->events[i] = event;
+			put_cpu_var(hv_24x7_hw);
+		}
+
+		put_cpu_var(hv_24x7_reqb);
+	} else {
+		now = h_24x7_get_value(event);
+		update_event_count(event, now);
+	}
 }
 
 static void h_24x7_event_start(struct perf_event *event, int flags)
@@ -1257,6 +1305,117 @@ static int h_24x7_event_add(struct perf_event *event, int flags)
 	return 0;
 }
 
+/*
+ * 24x7 counters only support READ transactions. They are
+ * always counting and don't need/support ADD transactions.
+ * Cache the flags, but otherwise ignore transactions that
+ * are not PERF_PMU_TXN_READ.
+ */
+static void h_24x7_event_start_txn(struct pmu *pmu, int flags)
+{
+	struct hv_24x7_request_buffer *request_buffer;
+	struct hv_24x7_data_result_buffer *result_buffer;
+
+	/* We should not be called if we are already in a txn */
+	WARN_ON_ONCE(__this_cpu_read(hv_24x7_txn_flags));
+
+	__this_cpu_write(hv_24x7_txn_flags, flags);
+	if (flags & ~PERF_PMU_TXN_READ)
+		return;
+
+	request_buffer = (void *)get_cpu_var(hv_24x7_reqb);
+	result_buffer = (void *)get_cpu_var(hv_24x7_resb);
+
+	init_24x7_request(request_buffer, result_buffer);
+
+	put_cpu_var(hv_24x7_resb);
+	put_cpu_var(hv_24x7_reqb);
+}
+
+/*
+ * Clean up transaction state.
+ *
+ * NOTE: Ignore state of request and result buffers for now.
+ * 	 We will initialize them during the next read/txn.
+ */
+static void reset_txn(void)
+{
+	__this_cpu_write(hv_24x7_txn_flags, 0);
+	__this_cpu_write(hv_24x7_txn_err, 0);
+}
+
+/*
+ * 24x7 counters only support READ transactions. They are always counting
+ * and don't need/support ADD transactions. Clear ->txn_flags but otherwise
+ * ignore transactions that are not of type PERF_PMU_TXN_READ.
+ *
+ * For READ transactions, submit all pending 24x7 requests (i.e requests
+ * that were queued by h_24x7_event_read()), to the hypervisor and update
+ * the event counts.
+ */
+static int h_24x7_event_commit_txn(struct pmu *pmu)
+{
+	struct hv_24x7_request_buffer *request_buffer;
+	struct hv_24x7_data_result_buffer *result_buffer;
+	struct hv_24x7_result *resb;
+	struct perf_event *event;
+	u64 count;
+	int i, ret, txn_flags;
+	struct hv_24x7_hw *h24x7hw;
+
+	txn_flags = __this_cpu_read(hv_24x7_txn_flags);
+	WARN_ON_ONCE(!txn_flags);
+
+	ret = 0;
+	if (txn_flags & ~PERF_PMU_TXN_READ)
+		goto out;
+
+	ret = __this_cpu_read(hv_24x7_txn_err);
+	if (ret)
+		goto out;
+
+	request_buffer = (void *)get_cpu_var(hv_24x7_reqb);
+	result_buffer = (void *)get_cpu_var(hv_24x7_resb);
+
+	ret = make_24x7_request(request_buffer, result_buffer);
+	if (ret) {
+		log_24x7_hcall(request_buffer, result_buffer, ret);
+		goto put_reqb;
+	}
+
+	h24x7hw = &get_cpu_var(hv_24x7_hw);
+
+	/* Update event counts from hcall */
+	for (i = 0; i < request_buffer->num_requests; i++) {
+		resb = &result_buffer->results[i];
+		count = be64_to_cpu(resb->elements[0].element_data[0]);
+		event = h24x7hw->events[i];
+		h24x7hw->events[i] = NULL;
+		update_event_count(event, count);
+	}
+
+	put_cpu_var(hv_24x7_hw);
+
+put_reqb:
+	put_cpu_var(hv_24x7_resb);
+	put_cpu_var(hv_24x7_reqb);
+out:
+	reset_txn();
+	return ret;
+}
+
+/*
+ * 24x7 counters only support READ transactions. They are always counting
+ * and don't need/support ADD transactions. However, regardless of type
+ * of transaction, all we need to do is cleanup, so we don't have to check
+ * the type of transaction.
+ */
+static void h_24x7_event_cancel_txn(struct pmu *pmu)
+{
+	WARN_ON_ONCE(!__this_cpu_read(hv_24x7_txn_flags));
+	reset_txn();
+}
+
 static struct pmu h_24x7_pmu = {
 	.task_ctx_nr = perf_invalid_context,
 
@@ -1268,6 +1427,9 @@ static struct pmu h_24x7_pmu = {
 	.start       = h_24x7_event_start,
 	.stop        = h_24x7_event_stop,
 	.read        = h_24x7_event_read,
+	.start_txn   = h_24x7_event_start_txn,
+	.commit_txn  = h_24x7_event_commit_txn,
+	.cancel_txn  = h_24x7_event_cancel_txn,
 };
 
 static int hv_24x7_init(void)
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 09/10] Define PERF_PMU_TXN_READ interface
  2015-07-27  5:40 ` [PATCH 09/10] Define PERF_PMU_TXN_READ interface Sukadev Bhattiprolu
@ 2015-08-06 12:10   ` Peter Zijlstra
  2015-08-12  4:14     ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2015-08-06 12:10 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman,
	linux-kernel, linuxppc-dev, linux-s390, sparclinux

[-- Attachment #1: Type: text/plain, Size: 959 bytes --]

On Sun, Jul 26, 2015 at 10:40:37PM -0700, Sukadev Bhattiprolu wrote:
> @@ -3743,7 +3762,13 @@ static u64 perf_event_aggregate(struct perf_event *event, u64 *enabled,
>  	lockdep_assert_held(&event->child_mutex);
>  
>  	list_for_each_entry(child, &event->child_list, child_list) {
> +#if 0
> +		/*
> +		 * TODO: Do we need this read() for group events on PMUs that
> +		 * 	 don't implement PERF_PMU_TXN_READ transactions?
> +		 */
>  		(void)perf_event_read(child, false);
> +#endif
>  		total += perf_event_count(child);
>  		*enabled += child->total_time_enabled;
>  		*running += child->total_time_running;

Aw gawd, I've been an idiot!!

I just realized this is a _CHILD_ loop, not a _SIBLING_ loop !!

We need to flip the loops in perf_read_group(), find attached two
patches that go on top of 1,2,4.

After this you can add the perf_event_read() return value (just fold
patches 6,8) after which you can do patch 10 (which has a broken
Subject fwiw).



[-- Attachment #2: peterz-perf_event_read-group.patch --]
[-- Type: text/plain, Size: 3606 bytes --]

Subject: perf: Add group reads to perf_event_read()
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu, 23 Jul 2015 10:04:35 +0200

Enable perf_event_read() to update entire groups at once, this will be
useful for read transactions.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150723080435.GE25159@twins.programming.kicks-ass.net
---
 kernel/events/core.c |   39 ++++++++++++++++++++++++++++++++-------
 1 file changed, 32 insertions(+), 7 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3184,12 +3184,18 @@ void perf_event_exec(void)
 	rcu_read_unlock();
 }
 
+struct perf_read_data {
+	struct perf_event *event;
+	bool group;
+};
+
 /*
  * Cross CPU call to read the hardware event
  */
 static void __perf_event_read(void *info)
 {
-	struct perf_event *event = info;
+	struct perf_read_data *data = info;
+	struct perf_event *sub, *event = data->event;
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 
@@ -3208,9 +3214,21 @@ static void __perf_event_read(void *info
 		update_context_time(ctx);
 		update_cgrp_time_from_event(event);
 	}
+
 	update_event_times(event);
 	if (event->state == PERF_EVENT_STATE_ACTIVE)
 		event->pmu->read(event);
+
+	if (!data->group)
+		goto unlock;
+
+	list_for_each_entry(sub, &event->sibling_list, group_entry) {
+		update_event_times(sub);
+		if (sub->state == PERF_EVENT_STATE_ACTIVE)
+			sub->pmu->read(sub);
+	}
+
+unlock:
 	raw_spin_unlock(&ctx->lock);
 }
 
@@ -3222,15 +3240,19 @@ static inline u64 perf_event_count(struc
 	return __perf_event_count(event);
 }
 
-static void perf_event_read(struct perf_event *event)
+static void perf_event_read(struct perf_event *event, bool group)
 {
 	/*
 	 * If event is enabled and currently active on a CPU, update the
 	 * value in the event structure:
 	 */
 	if (event->state == PERF_EVENT_STATE_ACTIVE) {
+		struct perf_read_data data = {
+			.event = event,
+			.group = group,
+		};
 		smp_call_function_single(event->oncpu,
-					 __perf_event_read, event, 1);
+					 __perf_event_read, &data, 1);
 	} else if (event->state == PERF_EVENT_STATE_INACTIVE) {
 		struct perf_event_context *ctx = event->ctx;
 		unsigned long flags;
@@ -3245,7 +3267,10 @@ static void perf_event_read(struct perf_
 			update_context_time(ctx);
 			update_cgrp_time_from_event(event);
 		}
-		update_event_times(event);
+		if (group)
+			update_group_times(event);
+		else
+			update_event_times(event);
 		raw_spin_unlock_irqrestore(&ctx->lock, flags);
 	}
 }
@@ -3764,7 +3789,7 @@ u64 perf_event_read_value(struct perf_ev
 
 	mutex_lock(&event->child_mutex);
 
-	perf_event_read(event);
+	perf_event_read(event, false);
 	total += perf_event_count(event);
 
 	*enabled += event->total_time_enabled +
@@ -3773,7 +3798,7 @@ u64 perf_event_read_value(struct perf_ev
 			atomic64_read(&event->child_total_time_running);
 
 	list_for_each_entry(child, &event->child_list, child_list) {
-		perf_event_read(child);
+		perf_event_read(child, false);
 		total += perf_event_count(child);
 		*enabled += child->total_time_enabled;
 		*running += child->total_time_running;
@@ -3934,7 +3959,7 @@ static unsigned int perf_poll(struct fil
 
 static void _perf_event_reset(struct perf_event *event)
 {
-	perf_event_read(event);
+	perf_event_read(event, false);
 	local64_set(&event->count, 0);
 	perf_event_update_userpage(event);
 }
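
For illustration, a minimal sketch of how a caller inside
kernel/events/core.c could use the new group flag once this patch is
applied (read_group_counts() is a hypothetical helper, not part of the
patch):

	/*
	 * Hypothetical helper, sketch only: with .group = true the cross-CPU
	 * call reads the leader and then every sibling under ctx->lock on
	 * the event's CPU, so the whole group is sampled together.
	 */
	static void read_group_counts(struct perf_event *leader)
	{
		struct perf_event *sub;

		perf_event_read(leader, true);

		pr_debug("leader count: %llu\n", perf_event_count(leader));
		list_for_each_entry(sub, &leader->sibling_list, group_entry)
			pr_debug("sibling count: %llu\n",
				 perf_event_count(sub));
	}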

[-- Attachment #3: peterz-perf-invert-perf_read_group.patch --]
[-- Type: text/plain, Size: 3745 bytes --]

Subject: perf: Invert perf_read_group() loops
From: Peter Zijlstra <peterz@infradead.org>
Date: Thu Aug 6 13:41:13 CEST 2015

In order to enable the use of perf_event_read(.group = true), we need
to invert the sibling-child loop nesting of perf_read_group().

Currently we iterate the child list for each sibling; this precludes
using group reads. Flip things around so that we iterate each group for
each child.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/events/core.c |   84 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 54 insertions(+), 30 deletions(-)

--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3809,50 +3809,74 @@ u64 perf_event_read_value(struct perf_ev
 }
 EXPORT_SYMBOL_GPL(perf_event_read_value);
 
-static int perf_read_group(struct perf_event *event,
-				   u64 read_format, char __user *buf)
+static void __perf_read_group_add(struct perf_event *leader, u64 read_format, u64 *values)
 {
-	struct perf_event *leader = event->group_leader, *sub;
-	struct perf_event_context *ctx = leader->ctx;
-	int n = 0, size = 0, ret;
-	u64 count, enabled, running;
-	u64 values[5];
+	struct perf_event *sub;
+	int n = 1; /* skip @nr */
 
-	lockdep_assert_held(&ctx->mutex);
+	perf_event_read(leader, true);
+
+	/*
+	 * Since we co-schedule groups, {enabled,running} times of siblings
+	 * will be identical to those of the leader, so we only publish one
+	 * set.
+	 */
+	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
+		values[n++] += leader->total_time_enabled +
+			atomic64_read(&leader->child_total_time_enabled);
+	}
 
-	count = perf_event_read_value(leader, &enabled, &running);
+	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
+		values[n++] += leader->total_time_running +
+			atomic64_read(&leader->child_total_time_running);
+	}
 
-	values[n++] = 1 + leader->nr_siblings;
-	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
-		values[n++] = enabled;
-	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
-		values[n++] = running;
-	values[n++] = count;
+	/*
+	 * Write {count,id} tuples for every sibling.
+	 */
+	values[n++] += perf_event_count(leader);
 	if (read_format & PERF_FORMAT_ID)
 		values[n++] = primary_event_id(leader);
 
-	size = n * sizeof(u64);
+	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
+		values[n++] += perf_event_count(sub);
+		if (read_format & PERF_FORMAT_ID)
+			values[n++] = primary_event_id(sub);
+	}
+}
 
-	if (copy_to_user(buf, values, size))
-		return -EFAULT;
+static int perf_read_group(struct perf_event *event,
+				   u64 read_format, char __user *buf)
+{
+	struct perf_event *leader = event->group_leader, *child;
+	struct perf_event_context *ctx = leader->ctx;
+	int ret = leader->read_size;
+	u64 *values;
 
-	ret = size;
+	lockdep_assert_held(&ctx->mutex);
 
-	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
-		n = 0;
+	values = kzalloc(event->read_size, GFP_KERNEL);
+	if (!values)
+		return -ENOMEM;
 
-		values[n++] = perf_event_read_value(sub, &enabled, &running);
-		if (read_format & PERF_FORMAT_ID)
-			values[n++] = primary_event_id(sub);
+	values[0] = 1 + leader->nr_siblings;
 
-		size = n * sizeof(u64);
+	/*
+	 * By locking the child_mutex of the leader we effectively
+	 * lock the child list of all siblings.. XXX explain how.
+	 */
+	mutex_lock(&leader->child_mutex);
 
-		if (copy_to_user(buf + ret, values, size)) {
-			return -EFAULT;
-		}
+	__perf_read_group_add(leader, read_format, values);
+	list_for_each_entry(child, &leader->child_list, child_list)
+		__perf_read_group_add(child, read_format, values);
 
-		ret += size;
-	}
+	mutex_unlock(&leader->child_mutex);
+
+	if (copy_to_user(buf, values, event->read_size))
+		ret = -EFAULT;
+
+	kfree(values);
 
 	return ret;
 }
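
To make the inversion concrete, here is a compressed sketch of the new
shape next to the old one (illustrative only; it reuses the names from the
diff above, and setup, locking and error handling are omitted):

	/*
	 * Old shape, sibling-major: each group member sums its own child
	 * list and is copied out on its own, roughly
	 *
	 *	for each member of the group:
	 *		for each child of that member:
	 *			sum += count
	 *		copy_to_user(sum)
	 *
	 * New shape, child-major: every child context reads its whole group
	 * at once (which a PMU can turn into one transaction) and the
	 * results accumulate into a single values[] buffer:
	 */
	__perf_read_group_add(leader, read_format, values);
	list_for_each_entry(child, &leader->child_list, child_list)
		__perf_read_group_add(child, read_format, values);

	if (copy_to_user(buf, values, event->read_size))
		ret = -EFAULT;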

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 09/10] Define PERF_PMU_TXN_READ interface
  2015-08-06 12:10   ` Peter Zijlstra
@ 2015-08-12  4:14     ` Sukadev Bhattiprolu
  2015-08-12  8:45       ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-08-12  4:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman,
	linux-kernel, linuxppc-dev, linux-s390, sparclinux

Peter Zijlstra [peterz@infradead.org] wrote:
| On Sun, Jul 26, 2015 at 10:40:37PM -0700, Sukadev Bhattiprolu wrote:
| > @@ -3743,7 +3762,13 @@ static u64 perf_event_aggregate(struct perf_event *event, u64 *enabled,
| >  	lockdep_assert_held(&event->child_mutex);
| >  
| >  	list_for_each_entry(child, &event->child_list, child_list) {
| > +#if 0
| > +		/*
| > +		 * TODO: Do we need this read() for group events on PMUs that
| > +		 * 	 don't implement PERF_PMU_TXN_READ transactions?
| > +		 */
| >  		(void)perf_event_read(child, false);
| > +#endif
| >  		total += perf_event_count(child);
| >  		*enabled += child->total_time_enabled;
| >  		*running += child->total_time_running;
| 
| Aw gawd, I've been an idiot!!
| 
| I just realized this is a _CHILD_ loop, not a _SIBLING_ loop !!
| 
| We need to flip the loops in perf_read_group(); find attached two
| patches that go on top of 1,2,4.
| 
| After this you can add the perf_event_read() return value (just fold
| patches 6 and 8), after which you can do patch 10 (which has a broken
| Subject, fwiw).

Thanks for the patches. I am building and testing, but have a question
on the second patch below:

<snip>


| Subject: perf: Invert perf_read_group() loops
| From: Peter Zijlstra <peterz@infradead.org>
| Date: Thu Aug 6 13:41:13 CEST 2015
| 
| In order to enable the use of perf_event_read(.group = true), we need
| to invert the sibling-child loop nesting of perf_read_group().
| 
| Currently we iterate the child list for each sibling; this precludes
| using group reads. Flip things around so that we iterate each group for
| each child.
| 
| Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
| ---
|  kernel/events/core.c |   84 ++++++++++++++++++++++++++++++++-------------------
|  1 file changed, 54 insertions(+), 30 deletions(-)
| 
| --- a/kernel/events/core.c
| +++ b/kernel/events/core.c
| @@ -3809,50 +3809,74 @@ u64 perf_event_read_value(struct perf_ev
|  }
|  EXPORT_SYMBOL_GPL(perf_event_read_value);
| 
| -static int perf_read_group(struct perf_event *event,
| -				   u64 read_format, char __user *buf)
| +static void __perf_read_group_add(struct perf_event *leader, u64 read_format, u64 *values)
|  {
| -	struct perf_event *leader = event->group_leader, *sub;
| -	struct perf_event_context *ctx = leader->ctx;
| -	int n = 0, size = 0, ret;
| -	u64 count, enabled, running;
| -	u64 values[5];
| +	struct perf_event *sub;
| +	int n = 1; /* skip @nr */

This n = 1 is to skip over the values[0] = 1 + nr_siblings in the
caller.

Anyway, in __perf_read_group_add() we always start with n = 1, however
...
| 
| -	lockdep_assert_held(&ctx->mutex);
| +	perf_event_read(leader, true);
| +
| +	/*
| +	 * Since we co-schedule groups, {enabled,running} times of siblings
| +	 * will be identical to those of the leader, so we only publish one
| +	 * set.
| +	 */
| +	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
| +		values[n++] += leader->total_time_enabled +
| +			atomic64_read(&leader->child_total_time_enabled);
| +	}
| 
| -	count = perf_event_read_value(leader, &enabled, &running);
| +	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
| +		values[n++] += leader->total_time_running +
| +			atomic64_read(&leader->child_total_time_running);
| +	}
| 
| -	values[n++] = 1 + leader->nr_siblings;
| -	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED)
| -		values[n++] = enabled;
| -	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING)
| -		values[n++] = running;
| -	values[n++] = count;
| +	/*
| +	 * Write {count,id} tuples for every sibling.
| +	 */
| +	values[n++] += perf_event_count(leader);
|  	if (read_format & PERF_FORMAT_ID)
|  		values[n++] = primary_event_id(leader);
| 
| -	size = n * sizeof(u64);
| +	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
| +		values[n++] += perf_event_count(sub);
| +		if (read_format & PERF_FORMAT_ID)
| +			values[n++] = primary_event_id(sub);
| +	}
| +}
| 
| -	if (copy_to_user(buf, values, size))
| -		return -EFAULT;
| +static int perf_read_group(struct perf_event *event,
| +				   u64 read_format, char __user *buf)
| +{
| +	struct perf_event *leader = event->group_leader, *child;
| +	struct perf_event_context *ctx = leader->ctx;
| +	int ret = leader->read_size;
| +	u64 *values;
| 
| -	ret = size;
| +	lockdep_assert_held(&ctx->mutex);
| 
| -	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
| -		n = 0;
| +	values = kzalloc(event->read_size, GFP_KERNEL);
| +	if (!values)
| +		return -ENOMEM;
| 
| -		values[n++] = perf_event_read_value(sub, &enabled, &running);
| -		if (read_format & PERF_FORMAT_ID)
| -			values[n++] = primary_event_id(sub);
| +	values[0] = 1 + leader->nr_siblings;
| 
| -		size = n * sizeof(u64);
| +	/*
| +	 * By locking the child_mutex of the leader we effectively
| +	 * lock the child list of all siblings.. XXX explain how.
| +	 */
| +	mutex_lock(&leader->child_mutex);
| 
| -		if (copy_to_user(buf + ret, values, size)) {
| -			return -EFAULT;
| -		}
| +	__perf_read_group_add(leader, read_format, values);

... we don't copy_to_user() here,

| +	list_for_each_entry(child, &leader->child_list, child_list)
| +		__perf_read_group_add(child, read_format, values);

so won't we overwrite the values[], if we always start at n = 1
in __perf_read_group_add()?

| 

| -		ret += size;
| -	}
| +	mutex_unlock(&leader->child_mutex);
| +
| +	if (copy_to_user(buf, values, event->read_size))
| +		ret = -EFAULT;
| +
| +	kfree(values);
| 
|  	return ret;
|  }


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 09/10] Define PERF_PMU_TXN_READ interface
  2015-08-12  4:14     ` Sukadev Bhattiprolu
@ 2015-08-12  8:45       ` Peter Zijlstra
  2015-08-13 20:04         ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 16+ messages in thread
From: Peter Zijlstra @ 2015-08-12  8:45 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman,
	linux-kernel, linuxppc-dev, linux-s390, sparclinux

On Tue, Aug 11, 2015 at 09:14:00PM -0700, Sukadev Bhattiprolu wrote:
> | +static void __perf_read_group_add(struct perf_event *leader, u64 read_format, u64 *values)
> |  {
> | +	struct perf_event *sub;
> | +	int n = 1; /* skip @nr */
> 
> This n = 1 is to skip over the values[0] = 1 + nr_siblings in the
> caller.
> 
> Anyway, in __perf_read_group_add() we always start with n = 1, however
> ...
> | 
> | +	perf_event_read(leader, true);
> | +
> | +	/*
> | +	 * Since we co-schedule groups, {enabled,running} times of siblings
> | +	 * will be identical to those of the leader, so we only publish one
> | +	 * set.
> | +	 */
> | +	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
> | +		values[n++] += leader->total_time_enabled +
> | +			atomic64_read(&leader->child_total_time_enabled);

Note how this is an in-place addition,

> | +	}
> | 
> | +	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
> | +		values[n++] += leader->total_time_running +
> | +			atomic64_read(&leader->child_total_time_running);

and here,

> | +	}
> | 
> | +	/*
> | +	 * Write {count,id} tuples for every sibling.
> | +	 */
> | +	values[n++] += perf_event_count(leader);

and here,


> |  	if (read_format & PERF_FORMAT_ID)
> |  		values[n++] = primary_event_id(leader);

and this will always assign the same value.

> | +	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
> | +		values[n++] += perf_event_count(sub);
> | +		if (read_format & PERF_FORMAT_ID)
> | +			values[n++] = primary_event_id(sub);

Same for these, therefore,

> | +	}
> | +}
> | 
> | +static int perf_read_group(struct perf_event *event,
> | +				   u64 read_format, char __user *buf)
> | +{
> | +	struct perf_event *leader = event->group_leader, *child;
> | +	struct perf_event_context *ctx = leader->ctx;
> | +	int ret = leader->read_size;
> | +	u64 *values;
> | 
> | +	lockdep_assert_held(&ctx->mutex);
> | 
> | +	values = kzalloc(event->read_size, GFP_KERNEL);
> | +	if (!values)
> | +		return -ENOMEM;
> | 
> | +	values[0] = 1 + leader->nr_siblings;
> | 
> | +	/*
> | +	 * By locking the child_mutex of the leader we effectively
> | +	 * lock the child list of all siblings.. XXX explain how.
> | +	 */
> | +	mutex_lock(&leader->child_mutex);
> | 
> | +	__perf_read_group_add(leader, read_format, values);
> 
> ... we don't copy_to_user() here,
> 
> | +	list_for_each_entry(child, &leader->child_list, child_list)
> | +		__perf_read_group_add(child, read_format, values);
> 
> so won't we overwrite the values[], if we always start at n = 1
> in __perf_read_group_add()?

Yes and no; we have to re-iterate the same values for each child as they
all have the same group, but we add to the time and count fields; we do
not overwrite them. The _add() suffix was supposed to be a hint ;-)

> | +	mutex_unlock(&leader->child_mutex);
> | +
> | +	if (copy_to_user(buf, values, event->read_size))
> | +		ret = -EFAULT;
> | +
> | +	kfree(values);
> | 
> |  	return ret;
> |  }

Where previously we would iterate the group and for each member
iterate/sum all the child values together before copying the value out,
we now, because we need to read groups together, need to first iterate
the child list and sum whole groups.
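
Taking hypothetical numbers as an example: a leader L with one sibling S,
one inherited child context, and read_format = PERF_FORMAT_ID:

	after the leader pass:  values[] = { 2, 100, id(L), 40, id(S) }
	after the child pass:   values[] = { 2, 125, id(L), 50, id(S) }

The counts at values[1] and values[3] accumulate with '+=' across the
passes, while the ids at values[2] and values[4] are simply re-assigned
to the same value each time.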


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 09/10] Define PERF_PMU_TXN_READ interface
  2015-08-12  8:45       ` Peter Zijlstra
@ 2015-08-13 20:04         ` Sukadev Bhattiprolu
  2015-08-13 20:47           ` Peter Zijlstra
  0 siblings, 1 reply; 16+ messages in thread
From: Sukadev Bhattiprolu @ 2015-08-13 20:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman,
	linux-kernel, linuxppc-dev, linux-s390, sparclinux

Peter Zijlstra [peterz@infradead.org] wrote:
| On Tue, Aug 11, 2015 at 09:14:00PM -0700, Sukadev Bhattiprolu wrote:
| > | +static void __perf_read_group_add(struct perf_event *leader, u64 read_format, u64 *values)
| > |  {
| > | +	struct perf_event *sub;
| > | +	int n = 1; /* skip @nr */
| > 
| > This n = 1 is to skip over the values[0] = 1 + nr_siblings in the
| > caller.
| > 
| > Anyway, in __perf_read_group_add() we always start with n = 1, however
| > ...
| > | 
| > | +	perf_event_read(leader, true);
| > | +
| > | +	/*
| > | +	 * Since we co-schedule groups, {enabled,running} times of siblings
| > | +	 * will be identical to those of the leader, so we only publish one
| > | +	 * set.
| > | +	 */
| > | +	if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
| > | +		values[n++] += leader->total_time_enabled +
| > | +			atomic64_read(&leader->child_total_time_enabled);
| 
| Note how this is an in-place addition,

Ah yes, sorry I missed that. It makes sense now and my tests seem to
be running fine.

| 
| > | +	}
| > | 
| > | +	if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
| > | +		values[n++] += leader->total_time_running +
| > | +			atomic64_read(&leader->child_total_time_running);
| 
| and here,
| 
| > | +	}
| > | 
| > | +	/*
| > | +	 * Write {count,id} tuples for every sibling.
| > | +	 */
| > | +	values[n++] += perf_event_count(leader);
| 
| and here,
| 
| 
| > |  	if (read_format & PERF_FORMAT_ID)
| > |  		values[n++] = primary_event_id(leader);
| 
| and this will always assign the same value.
| 
| > | +	list_for_each_entry(sub, &leader->sibling_list, group_entry) {
| > | +		values[n++] += perf_event_count(sub);
| > | +		if (read_format & PERF_FORMAT_ID)
| > | +			values[n++] = primary_event_id(sub);
| 
| Same for these, therefore,
| 
| > | +	}
| > | +}
| > | 
| > | +static int perf_read_group(struct perf_event *event,
| > | +				   u64 read_format, char __user *buf)
| > | +{
| > | +	struct perf_event *leader = event->group_leader, *child;
| > | +	struct perf_event_context *ctx = leader->ctx;
| > | +	int ret = leader->read_size;

One other question: we return leader->read_size but allocate/copy_to_user
using the (possibly sibling) event's read_size. We consistently use
read_format from the 'event' being read, rather than from its
'group_leader', so we are OK in terms of what we copy into values[] for
each event in the group.

But can the leader's read_format (and hence its read_size) differ from
its siblings' read_size? If so, in the current code we return the event's
read_size, but in the new code we return the leader's read_size.

| > | +	u64 *values;
| > | 
| > | +	lockdep_assert_held(&ctx->mutex);
| > | 
| > | +	values = kzalloc(event->read_size, GFP_KERNEL);
| > | +	if (!values)
| > | +		return -ENOMEM;
| > | 
| > | +	values[0] = 1 + leader->nr_siblings;
| > | 
| > | +	/*
| > | +	 * By locking the child_mutex of the leader we effectively
| > | +	 * lock the child list of all siblings.. XXX explain how.
| > | +	 */
| > | +	mutex_lock(&leader->child_mutex);
| > | 
| > | +	__perf_read_group_add(leader, read_format, values);
| > 
| > ... we don't copy_to_user() here,
| > 
| > | +	list_for_each_entry(child, &leader->child_list, child_list)
| > | +		__perf_read_group_add(child, read_format, values);
| > 
| > so won't we overwrite the values[], if we always start at n = 1
| > in __perf_read_group_add()?
| 
| Yes and no; we have to re-iterate the same values for each child as they
| all have the same group, but we add to the time and count fields; we do
| not overwrite them. The _add() suffix was supposed to be a hint ;-)
| 
| > | +	mutex_unlock(&leader->child_mutex);
| > | +
| > | +	if (copy_to_user(buf, values, event->read_size))
| > | +		ret = -EFAULT;
| > | +
| > | +	kfree(values);
| > | 
| > |  	return ret;
| > |  }
| 
| Where previously we would iterate the group and for each member
| iterate/sum all the child values together before copying the value out,
| we now, because we need to read groups together, need to first iterate
| the child list and sum whole groups.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 09/10] Define PERF_PMU_TXN_READ interface
  2015-08-13 20:04         ` Sukadev Bhattiprolu
@ 2015-08-13 20:47           ` Peter Zijlstra
  0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2015-08-13 20:47 UTC (permalink / raw)
  To: Sukadev Bhattiprolu
  Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Michael Ellerman,
	linux-kernel, linuxppc-dev, linux-s390, sparclinux

On Thu, Aug 13, 2015 at 01:04:28PM -0700, Sukadev Bhattiprolu wrote:

> | > | +static int perf_read_group(struct perf_event *event,
> | > | +				   u64 read_format, char __user *buf)
> | > | +{
> | > | +	struct perf_event *leader = event->group_leader, *child;
> | > | +	struct perf_event_context *ctx = leader->ctx;
> | > | +	int ret = leader->read_size;

> One other question: we return leader->read_size but allocate/copy_to_user
> using the (possibly sibling) event's read_size. We consistently use
> read_format from the 'event' being read, rather than from its
> 'group_leader', so we are OK in terms of what we copy into values[] for
> each event in the group.
> 
> But can the leader's read_format (and hence its read_size) differ from
> its siblings' read_size? If so, in the current code we return the event's
> read_size, but in the new code we return the leader's read_size.

Hmm, good spotting that. I'm fairly sure I didn't do that on purpose.

I think we should use event->read_size there too and have the lot
consistent. I don't think we require read_format to be uniform across
siblings.
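
A minimal sketch of that suggestion (illustrative fragment only, untested):

 static int perf_read_group(struct perf_event *event,
 				   u64 read_format, char __user *buf)
 {
 	struct perf_event *leader = event->group_leader, *child;
 	struct perf_event_context *ctx = leader->ctx;
-	int ret = leader->read_size;
+	int ret = event->read_size;
 	u64 *values;

That way the allocation, the copy_to_user() and the returned size are all
taken from the event actually being read.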

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread

Thread overview: 16+ messages
2015-07-27  5:40 [PATCH v4 0/10] Implement group-read of events using txn interface Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 01/10] perf: Add a flags parameter to pmu txn interfaces Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 02/10] perf: Split perf_event_read() and perf_event_count() Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 03/10] perf: Define perf_event_aggregate() Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 04/10] perf: Rename perf_event_read_{one,group}, perf_read_hw Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 05/10] perf: Unroll perf_event_read_value() in perf_read_group() Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 06/10] perf: Add return value for perf_event_read() Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 07/10] perf: Add group parameter to perf_event_read() Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 08/10] perf: Add return value to __perf_event_read() Sukadev Bhattiprolu
2015-07-27  5:40 ` [PATCH 09/10] Define PERF_PMU_TXN_READ interface Sukadev Bhattiprolu
2015-08-06 12:10   ` Peter Zijlstra
2015-08-12  4:14     ` Sukadev Bhattiprolu
2015-08-12  8:45       ` Peter Zijlstra
2015-08-13 20:04         ` Sukadev Bhattiprolu
2015-08-13 20:47           ` Peter Zijlstra
2015-07-27  5:40 ` [PATCH 10/10] powerpc/perf/hv-24x7: Use PMU_TXN_READ interface Sukadev Bhattiprolu
