linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/2] patches to reduce perf overhead
@ 2017-01-18 13:21 kan.liang
  2017-01-18 13:21 ` [PATCH 1/2] perf,core: try parent pmu first when init child event kan.liang
  2017-01-18 13:21 ` [PATCH 2/2] perf,core: use parent avg sample period as child initial period kan.liang
  0 siblings, 2 replies; 6+ messages in thread
From: kan.liang @ 2017-01-18 13:21 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo; +Cc: alexander.shishkin, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

perf brings additional overhead when monitoring a task which
frequently generates child tasks. Two of the related issues are fixed
in this patch series.

Here is some data from the overhead test on a Broadwell server:
  perf record -e $TEST_EVENTS -- ./loop.sh 50000

loop.sh
  start=$(date +%s%N)
  i=0
  while [ "$i" -le "$1" ]
  do
          date > /dev/null
          i=`expr $i + 1`
  done
  end=$(date +%s%N)
  elapsed=`expr $end - $start`

Event#	Original elapsed time (ns)	Elapsed time with two patches (ns)	delta
1	196,573,192,397		187,781,884,240			-4.47%
2	257,567,753,013		238,954,990,569			-7.23%
4	398,730,726,971		354,101,552,792			-11.19%
8	824,983,761,120		673,623,574,792			-18.35%
16	1,883,411,923,498	1,456,452,391,357		-22.67%

Kan Liang (2):
  perf,core: try parent pmu first when init child event
  perf,core: use parent avg sample period as child initial period

 include/linux/perf_event.h |  3 +++
 kernel/events/core.c       | 28 ++++++++++++++++++++++++++--
 2 files changed, 29 insertions(+), 2 deletions(-)

-- 
2.4.3


* [PATCH 1/2] perf,core: try parent pmu first when init child event
  2017-01-18 13:21 [PATCH 0/2] patches to reduce perf overhead kan.liang
@ 2017-01-18 13:21 ` kan.liang
  2017-01-30 11:59   ` [tip:perf/core] perf/core: Try parent PMU first when initializing a " tip-bot for Kan Liang
  2017-01-18 13:21 ` [PATCH 2/2] perf,core: use parent avg sample period as child initial period kan.liang
  1 sibling, 1 reply; 6+ messages in thread
From: kan.liang @ 2017-01-18 13:21 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo; +Cc: alexander.shishkin, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

perf brings additional overhead when monitoring a task which
frequently generates child tasks.

perf_init_event() is one of the hotspots for the additional overhead.
Currently, to get the pmu, it first tries to look up the event type in
pmu_idr. But that lookup is not always successful, especially for the
widely used PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events. So it has
to fall back to the slow path, which walks the whole pmus list.
This becomes a big performance issue if the pmus list is long (e.g. a
server with many uncore boxes) and the task frequently generates child
tasks.
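
For reference, the lookup order in the current perf_init_event() looks
roughly like this (a simplified sketch, not an exact copy of the code):

  static struct pmu *perf_init_event(struct perf_event *event)
  {
          struct pmu *pmu;
          int idx, ret;

          idx = srcu_read_lock(&pmus_srcu);

          /* Fast path: fixed-type pmus registered in pmu_idr. */
          rcu_read_lock();
          pmu = idr_find(&pmu_idr, event->attr.type);
          rcu_read_unlock();
          if (pmu) {
                  ret = perf_try_init_event(pmu, event);
                  if (ret)
                          pmu = ERR_PTR(ret);
                  goto unlock;
          }

          /*
           * Slow path: walk every registered pmu until one accepts
           * the event.  With many uncore pmus this list walk is what
           * makes inheriting PERF_TYPE_HARDWARE/HW_CACHE events
           * expensive.
           */
          list_for_each_entry_rcu(pmu, &pmus, entry) {
                  ret = perf_try_init_event(pmu, event);
                  if (!ret)
                          goto unlock;
                  if (ret != -ENOENT) {
                          pmu = ERR_PTR(ret);
                          goto unlock;
                  }
          }
          pmu = ERR_PTR(-ENOENT);
  unlock:
          srcu_read_unlock(&pmus_srcu, idx);

          return pmu;
  }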

A child event is inherited from its parent event, so the child event
should try its parent's pmu first.

Here is some data from the overhead test on a Broadwell server:
  perf record -e $TEST_EVENTS -- ./loop.sh 50000

loop.sh
  start=$(date +%s%N)
  i=0
  while [ "$i" -le "$1" ]
  do
          date > /dev/null
          i=`expr $i + 1`
  done
  end=$(date +%s%N)
  elapsed=`expr $end - $start`

Event#	Original elapsed time (ns)	Elapsed time with patch (ns)	delta
1	196,573,192,397		189,162,029,998			-3.77%
2	257,567,753,013		241,620,788,683			-6.19%
4	398,730,726,971		370,518,938,714			-7.08%
8	824,983,761,120		740,702,489,329			-10.22%
16	1,883,411,923,498	1,672,027,508,355		-11.22%

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 kernel/events/core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 110b38a..924268c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8977,6 +8977,14 @@ static struct pmu *perf_init_event(struct perf_event *event)
 
 	idx = srcu_read_lock(&pmus_srcu);
 
+	/* try parent's pmu first */
+	if (event->parent && event->parent->pmu) {
+		pmu = event->parent->pmu;
+		ret = perf_try_init_event(pmu, event);
+		if (!ret)
+			goto unlock;
+	}
+
 	rcu_read_lock();
 	pmu = idr_find(&pmu_idr, event->attr.type);
 	rcu_read_unlock();
-- 
2.4.3


* [PATCH 2/2] perf,core: use parent avg sample period as child initial period
  2017-01-18 13:21 [PATCH 0/2] patches to reduce perf overhead kan.liang
  2017-01-18 13:21 ` [PATCH 1/2] perf,core: try parent pmu first when init child event kan.liang
@ 2017-01-18 13:21 ` kan.liang
  2017-01-25 15:33   ` Peter Zijlstra
  1 sibling, 1 reply; 6+ messages in thread
From: kan.liang @ 2017-01-18 13:21 UTC (permalink / raw)
  To: linux-kernel, peterz, mingo; +Cc: alexander.shishkin, eranian, ak, Kan Liang

From: Kan Liang <kan.liang@intel.com>

perf brings additional overhead when monitoring a task which
frequently generates child tasks.

When inheriting an event from a parent task to a child task, the
sample_period of the original parent event (parent_event->parent) is
assigned to the child event as its initial period, which is usually
still the default sample_period of 1. But too many samples with a very
short period like 1 increase overhead and may cause various problems.
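
For reference, the default of 1 comes from this pre-patch code in
perf_event_alloc(), which the hunk below modifies:

          hwc->sample_period = attr->sample_period;
          if (attr->freq && attr->sample_freq)
                  hwc->sample_period = 1;
          hwc->last_period = hwc->sample_period;

In frequency mode every new event therefore starts sampling on nearly
every event until perf_adjust_period() has ramped the period up.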

avg_sample_period is introduced to track the average sample period.
Each child event can use its original parent event's average period as
its initial sample period, which reduces the overhead.

The average is updated at most once per tick (avg_time_stamp records
the time of the last update), to avoid contention.
For each new child event, the parent event's refcount is incremented,
so the parent will not go away until all of its children go away, and
it is safe to access the parent.

Here is some data from the overhead test on a Broadwell server:
  perf record -e $TEST_EVENTS -- ./loop.sh 50000

loop.sh
  start=$(date +%s%N)
  i=0
  while [ "$i" -le "$1" ]
  do
          date > /dev/null
          i=`expr $i + 1`
  done
  end=$(date +%s%N)
  elapsed=`expr $end - $start`

Event#	Original elapsed time (ns)	Elapsed time with patch (ns)	delta
1	196,573,192,397		188,480,366,278			-4.12%
2	257,567,753,013		242,256,126,043			-5.94%
4	398,730,726,971		373,882,492,502			-6.23%
8	824,983,761,120		750,906,525,917			-8.98%
16	1,883,411,923,498	1,648,192,098,897		-12.49%

Signed-off-by: Kan Liang <kan.liang@intel.com>
---
 include/linux/perf_event.h |  3 +++
 kernel/events/core.c       | 20 ++++++++++++++++++--
 2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 78ed810..84b0f47 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -648,6 +648,9 @@ struct perf_event {
 	struct list_head		child_list;
 	struct perf_event		*parent;
 
+	atomic64_t			avg_sample_period;
+	u64				avg_time_stamp;
+
 	int				oncpu;
 	int				cpu;
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 924268c..82a2c0e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3237,9 +3237,11 @@ static DEFINE_PER_CPU(u64, perf_throttled_seq);
 
 static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bool disable)
 {
+	struct perf_event *head_event = (event->parent != NULL) ? event->parent : event;
 	struct hw_perf_event *hwc = &event->hw;
 	s64 period, sample_period;
 	s64 delta;
+	u64 now;
 
 	period = perf_calculate_period(event, nsec, count);
 
@@ -3253,6 +3255,15 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
 
 	hwc->sample_period = sample_period;
 
+	now = perf_clock();
+	if ((now - head_event->avg_time_stamp) > TICK_NSEC) {
+		s64 avg_period;
+
+		head_event->avg_time_stamp = now;
+		avg_period = (atomic64_read(&head_event->avg_sample_period) + sample_period) / 2;
+		atomic64_set(&head_event->avg_sample_period, avg_period);
+	}
+
 	if (local64_read(&hwc->period_left) > 8*sample_period) {
 		if (disable)
 			event->pmu->stop(event, PERF_EF_UPDATE);
@@ -9231,8 +9242,13 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 
 	hwc = &event->hw;
 	hwc->sample_period = attr->sample_period;
-	if (attr->freq && attr->sample_freq)
+	if (attr->freq && attr->sample_freq) {
 		hwc->sample_period = 1;
+		if (parent_event)
+			hwc->sample_period = atomic64_read(&parent_event->avg_sample_period);
+		else
+			atomic64_set(&event->avg_sample_period, hwc->sample_period);
+	}
 	hwc->last_period = hwc->sample_period;
 
 	local64_set(&hwc->period_left, hwc->sample_period);
@@ -10464,8 +10480,8 @@ inherit_event(struct perf_event *parent_event,
 		child_event->state = PERF_EVENT_STATE_OFF;
 
 	if (parent_event->attr.freq) {
-		u64 sample_period = parent_event->hw.sample_period;
 		struct hw_perf_event *hwc = &child_event->hw;
+		u64 sample_period = atomic64_read(&parent_event->avg_sample_period);
 
 		hwc->sample_period = sample_period;
 		hwc->last_period   = sample_period;
-- 
2.4.3


* Re: [PATCH 2/2] perf,core: use parent avg sample period as child initial period
  2017-01-18 13:21 ` [PATCH 2/2] perf,core: use parent avg sample period as child initial period kan.liang
@ 2017-01-25 15:33   ` Peter Zijlstra
  2017-01-29 15:24     ` Liang, Kan
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2017-01-25 15:33 UTC (permalink / raw)
  To: kan.liang; +Cc: linux-kernel, mingo, alexander.shishkin, eranian, ak

On Wed, Jan 18, 2017 at 08:21:02AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang <kan.liang@intel.com>
> 
> perf brings additional overhead when monitoring the task which
> frequently generates child task.
> 
> When inheriting a event from parent task to child task, the
> sample_period of original parent event (parent_event->parent) will be
> assigned to child event as its initial period, which is usually the
> default sample_period 1.


Why is that mostly 1? I would expect the parent event's sample_period to
ramp up as well.


* RE: [PATCH 2/2] perf,core: use parent avg sample period as child initial period
  2017-01-25 15:33   ` Peter Zijlstra
@ 2017-01-29 15:24     ` Liang, Kan
  0 siblings, 0 replies; 6+ messages in thread
From: Liang, Kan @ 2017-01-29 15:24 UTC (permalink / raw)
  To: 'Peter Zijlstra'
  Cc: linux-kernel, mingo, alexander.shishkin, eranian, ak


> 
> On Wed, Jan 18, 2017 at 08:21:02AM -0500, kan.liang@intel.com wrote:
> > From: Kan Liang <kan.liang@intel.com>
> >
> > perf brings additional overhead when monitoring the task which
> > frequently generates child task.
> >
> > When inheriting a event from parent task to child task, the
> > sample_period of original parent event (parent_event->parent) will be
> > assigned to child event as its initial period, which is usually the
> > default sample_period 1.
> 
> 
> Why is that mostly 1? I would expect the parent event's sample_period to
> ramp up as well.

The conclusion is based on the observation in the specific test case
mentioned in the description. The sample_period is for the original parent
event (parent_event->parent), not the direct parent.

I did several tests these days; it can be 100% reproduced by this
particular test case under multiplexing.
I think the reason is that the original parent event is scheduled out
shortly after it starts and never gets a chance to trigger an interrupt
in the first round, so its sample_period never gets updated. The test
case then repeatedly generates child tasks and child events. The new
child events are scheduled in/out continually, but the original parent
event never gets a chance to be scheduled again. So the original parent
event's sample_period stays at 1.
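
To illustrate, inherit_event() always links a new child back to the
original parent rather than to the forking task's event, roughly like
this (simplified excerpt, not the exact code):

          /*
           * Instead of creating recursive hierarchies of events,
           * inherited events are linked back to the original parent.
           * Every new child therefore copies its initial period from
           * an event that may never be scheduled again and so never
           * ramps up from 1.
           */
          if (parent_event->parent)
                  parent_event = parent_event->parent;

          /* child_event is then allocated from parent_event->attr ... */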

Thanks,
Kan


* [tip:perf/core] perf/core: Try parent PMU first when initializing a child event
  2017-01-18 13:21 ` [PATCH 1/2] perf,core: try parent pmu first when init child event kan.liang
@ 2017-01-30 11:59   ` tip-bot for Kan Liang
  0 siblings, 0 replies; 6+ messages in thread
From: tip-bot for Kan Liang @ 2017-01-30 11:59 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: eranian, kan.liang, peterz, hpa, acme, mingo, vincent.weaver,
	tglx, torvalds, alexander.shishkin, jolsa, linux-kernel

Commit-ID:  40999312c703f80e8d31bc77cf00e6e84d36e091
Gitweb:     http://git.kernel.org/tip/40999312c703f80e8d31bc77cf00e6e84d36e091
Author:     Kan Liang <kan.liang@intel.com>
AuthorDate: Wed, 18 Jan 2017 08:21:01 -0500
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Mon, 30 Jan 2017 12:01:16 +0100

perf/core: Try parent PMU first when initializing a child event

perf has additional overhead when monitoring a task which
frequently generates child tasks.

perf_init_event() is one of the hotspots for the additional overhead:

Currently, to get the PMU, it first tries to look up the event type in
pmu_idr. But that lookup is not always successful, especially for the
widely used PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events. So it has
to fall back to the slow path, which walks the whole PMUs list.

This becomes a big performance issue if the PMUs list is long (e.g. a
server with many uncore boxes) and the task frequently generates child
tasks.

The child event inherits its parent event. So the child event should
try its parent PMU first.

Here is some data from the overhead test on Broadwell server:

  perf record -e $TEST_EVENTS -- ./loop.sh 50000

  loop.sh
    start=$(date +%s%N)
    i=0
    while [ "$i" -le "$1" ]
    do
            date > /dev/null
            i=`expr $i + 1`
    done
    end=$(date +%s%N)
    elapsed=`expr $end - $start`

  Event#	Original elapsed time (ns)	Elapsed time with patch (ns)	delta
  1		196,573,192,397		189,162,029,998			-3.77%
  2		257,567,753,013		241,620,788,683			-6.19%
  4		398,730,726,971		370,518,938,714			-7.08%
  8		824,983,761,120		740,702,489,329			-10.22%
  16		1,883,411,923,498	1,672,027,508,355		-11.22%

... which shows a nice performance improvement.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1484745662-15928-2-git-send-email-kan.liang@intel.com
[ Tidied up the changelog and the code comment. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/events/core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index cbcee23..88676ff 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9022,6 +9022,14 @@ static struct pmu *perf_init_event(struct perf_event *event)
 
 	idx = srcu_read_lock(&pmus_srcu);
 
+	/* Try parent's PMU first: */
+	if (event->parent && event->parent->pmu) {
+		pmu = event->parent->pmu;
+		ret = perf_try_init_event(pmu, event);
+		if (!ret)
+			goto unlock;
+	}
+
 	rcu_read_lock();
 	pmu = idr_find(&pmu_idr, event->attr.type);
 	rcu_read_unlock();

