* [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel

Hello, all.

This patchset does the following two things.

* 0001-0007: Unify the three tracers (tracepoints, perf_events and
  preempt/sched notifiers) in the scheduler.

* 0008-0012: Move the perf hooks in sched on top of tracepoints when
  TPs are enabled.

This probably should be two separate patchsets, with the first part
on top of sched/core which then gets pulled into the perf branch, on
top of which the second part is applied; but in the spirit of RFC I'm
posting it as a single chunk on top of mainline+perf/urgent.

This patchset contains the following 12 patches.

 0001-sched-drop-cpu-argument-from-sched_in-preempt-notifi.patch
 0002-sched-rename-preempt_notifiers-to-sched_notifiers-an.patch
 0003-perf-add-perf_event_task_migrate.patch
 0004-perf-add-rq-to-perf_event_task_sched_out.patch
 0005-perf-move-perf_event_task_sched_in-next-to-fire_sche.patch
 0006-sched-relocate-fire_sched_notifiers_out-and-trace_sc.patch
 0007-sched-coalesce-event-notifiers.patch
 0008-sched-add-switch_in-and-tick-tracepoints.patch
 0009-perf-factor-out-perf_event_switch_clones.patch
 0010-perf-make-nr_events-an-int-and-add-perf_online_mutex.patch
 0011-perf-prepare-to-move-sched-perf-functions-on-top-of-.patch
 0012-perf-move-sched-perf-functions-on-top-of-tracepoints.patch

0001-0006 move the tracers in sched around and massage them here and
there to bring the related ones together and make them take the same
arguments.  The only change worth noting is 0005, where
perf_event_task_sched_in() is moved after finish_lock_switch(), which
costs the function a pair of local_irq_disable/enable() when there
actually is a task context to switch in (the unlikely path).

0007 implements the SCHED_EVENT[_RQ]() macros, which can invoke all
or any subset of the three tracers at once, and uses them to replace
all the tracer calls.
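
For example, with 0007 applied, the three separate calls in
schedule(),

	trace_sched_switch(rq, prev, next);
	perf_event_task_sched_out(rq, prev, next);
	fire_sched_notifiers_out(prev, next);

collapse into

	SCHED_EVENT_RQ(switch, sched_out, out, rq, SE_ARGS(prev, next));

where NONE can be put in any of the first three slots to omit that
mechanism.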

0008-0011 prepare to move the sched perf functions on top of the
matching TPs and to make them enabled on demand.

0012 turns all the sched perf functions into TP probe functions and
enables/disables them on demand.

The whole patchset is more or less straightforward, except that the
on-demand (or rather off-demand?) disabling of perf functions in 0012
is somewhat complicated: the callback which can determine when to turn
them off is called under the rq lock, while unregistering requires a
thread context.  As the rq lock is depended upon by workqueue, tasklet
and timer, none of those can be used directly from there.  It's worked
around by having a periodic timer poll for completion.
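
To illustrate the shape of that workaround, a minimal sketch (all
names here are hypothetical and the details differ from the actual
code in 0012):

	/* hypothetical: cpus which still have a task context */
	static struct cpumask drain_pending;
	static struct timer_list drain_timer;
	static DECLARE_COMPLETION(drain_done);

	static void drain_timer_fn(unsigned long data)
	{
		/* timer context, not under rq lock, so waking up the
		 * thread waiting to unregister the probe is safe here */
		if (cpumask_empty(&drain_pending))
			complete(&drain_done);
		else
			mod_timer(&drain_timer, jiffies + HZ / 10);
	}

The disabling thread arms the timer, sleeps on drain_done and only
then performs the final unregistration from its own thread context.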

This patchset is available in the following git branch.

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git sched-trace

The tree is on top of linus#master + sched/urgent for now.  Diffstat
follows.

 arch/ia64/kvm/Kconfig        |    2 
 arch/powerpc/kvm/Kconfig     |    2 
 arch/s390/kvm/Kconfig        |    2 
 arch/x86/kvm/Kconfig         |    2 
 include/linux/kvm_host.h     |    4 
 include/linux/perf_event.h   |   39 ++++-
 include/linux/preempt.h      |   48 -------
 include/linux/sched.h        |   52 +++++++
 include/trace/events/sched.h |   14 ++
 init/Kconfig                 |    2 
 kernel/perf_event.c          |  288 +++++++++++++++++++++++++++++++++++++------
 kernel/sched.c               |  175 ++++++++++++++------------
 virt/kvm/kvm_main.c          |   28 +---
 13 files changed, 459 insertions(+), 199 deletions(-)

Thanks.

--
tejun

* [PATCH 01/12] sched: drop @cpu argument from sched_in preempt notifier
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel; +Cc: Tejun Heo

The @cpu parameter is superfluous.  Drop it.  This will help unify
the notifiers in sched.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 include/linux/preempt.h |    3 +--
 kernel/sched.c          |    2 +-
 virt/kvm/kvm_main.c     |    4 ++--
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index 2e681d9..cd7d090 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -101,7 +101,6 @@ struct preempt_notifier;
  * preempt_ops - notifiers called when a task is preempted and rescheduled
  * @sched_in: we're about to be rescheduled:
  *    notifier: struct preempt_notifier for the task being scheduled
- *    cpu:  cpu we're scheduled on
  * @sched_out: we've just been preempted
  *    notifier: struct preempt_notifier for the task being preempted
  *    next: the task that's kicking us out
@@ -112,7 +111,7 @@ struct preempt_notifier;
  * difference is intentional and depended upon by its users.
  */
 struct preempt_ops {
-	void (*sched_in)(struct preempt_notifier *notifier, int cpu);
+	void (*sched_in)(struct preempt_notifier *notifier);
 	void (*sched_out)(struct preempt_notifier *notifier,
 			  struct task_struct *next);
 };
diff --git a/kernel/sched.c b/kernel/sched.c
index 6af210a..724f0e4 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2718,7 +2718,7 @@ static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
 	struct hlist_node *node;
 
 	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
-		notifier->ops->sched_in(notifier, raw_smp_processor_id());
+		notifier->ops->sched_in(notifier);
 }
 
 static void
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c82ae24..2c06489 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2164,11 +2164,11 @@ struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
 	return container_of(pn, struct kvm_vcpu, preempt_notifier);
 }
 
-static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
+static void kvm_sched_in(struct preempt_notifier *pn)
 {
 	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
-	kvm_arch_vcpu_load(vcpu, cpu);
+	kvm_arch_vcpu_load(vcpu, raw_smp_processor_id());
 }
 
 static void kvm_sched_out(struct preempt_notifier *pn,
-- 
1.6.4.2


* [PATCH 02/12] sched: rename preempt_notifiers to sched_notifiers and refactor implementation
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel; +Cc: Tejun Heo

Rename preempt_notifiers to sched_notifiers and move the definitions
to sched.h.  Also, refactor the implementation in sched.c such that
adding new callbacks is easier.

This patch does not introduce any functional change and in fact
generates the same binary, at least with my configuration (x86_64
SMP, kvm and some debug options).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 arch/ia64/kvm/Kconfig    |    2 +-
 arch/powerpc/kvm/Kconfig |    2 +-
 arch/s390/kvm/Kconfig    |    2 +-
 arch/x86/kvm/Kconfig     |    2 +-
 include/linux/kvm_host.h |    4 +-
 include/linux/preempt.h  |   47 ------------------
 include/linux/sched.h    |   52 +++++++++++++++++++-
 init/Kconfig             |    2 +-
 kernel/sched.c           |  119 +++++++++++++++++++++-------------------------
 virt/kvm/kvm_main.c      |   26 +++++-----
 10 files changed, 123 insertions(+), 135 deletions(-)

diff --git a/arch/ia64/kvm/Kconfig b/arch/ia64/kvm/Kconfig
index fa4d1e5..2bacf94 100644
--- a/arch/ia64/kvm/Kconfig
+++ b/arch/ia64/kvm/Kconfig
@@ -22,7 +22,7 @@ config KVM
 	depends on HAVE_KVM && MODULES && EXPERIMENTAL
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
 	select KVM_APIC_ARCHITECTURE
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 60624cc..b40771f 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -18,7 +18,7 @@ if VIRTUALIZATION
 
 config KVM
 	bool
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 	select KVM_MMIO
 
diff --git a/arch/s390/kvm/Kconfig b/arch/s390/kvm/Kconfig
index a725158..5a9a4f8 100644
--- a/arch/s390/kvm/Kconfig
+++ b/arch/s390/kvm/Kconfig
@@ -18,7 +18,7 @@ if VIRTUALIZATION
 config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM && EXPERIMENTAL
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select ANON_INODES
 	---help---
 	  Support hosting paravirtualized guest machines using the SIE
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 970bbd4..48cfec7 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -22,7 +22,7 @@ config KVM
 	depends on HAVE_KVM
 	# for device assignment:
 	depends on PCI
-	select PREEMPT_NOTIFIERS
+	select SCHED_NOTIFIERS
 	select MMU_NOTIFIER
 	select ANON_INODES
 	select HAVE_KVM_IRQCHIP
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 169d077..00d7e46 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -75,8 +75,8 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx,
 
 struct kvm_vcpu {
 	struct kvm *kvm;
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	struct preempt_notifier preempt_notifier;
+#ifdef CONFIG_SCHED_NOTIFIERS
+	struct sched_notifier sched_notifier;
 #endif
 	int vcpu_id;
 	struct mutex mutex;
diff --git a/include/linux/preempt.h b/include/linux/preempt.h
index cd7d090..538c675 100644
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -93,51 +93,4 @@ do { \
 
 #endif
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
-struct preempt_notifier;
-
-/**
- * preempt_ops - notifiers called when a task is preempted and rescheduled
- * @sched_in: we're about to be rescheduled:
- *    notifier: struct preempt_notifier for the task being scheduled
- * @sched_out: we've just been preempted
- *    notifier: struct preempt_notifier for the task being preempted
- *    next: the task that's kicking us out
- *
- * Please note that sched_in and out are called under different
- * contexts.  sched_out is called with rq lock held and irq disabled
- * while sched_in is called without rq lock and irq enabled.  This
- * difference is intentional and depended upon by its users.
- */
-struct preempt_ops {
-	void (*sched_in)(struct preempt_notifier *notifier);
-	void (*sched_out)(struct preempt_notifier *notifier,
-			  struct task_struct *next);
-};
-
-/**
- * preempt_notifier - key for installing preemption notifiers
- * @link: internal use
- * @ops: defines the notifier functions to be called
- *
- * Usually used in conjunction with container_of().
- */
-struct preempt_notifier {
-	struct hlist_node link;
-	struct preempt_ops *ops;
-};
-
-void preempt_notifier_register(struct preempt_notifier *notifier);
-void preempt_notifier_unregister(struct preempt_notifier *notifier);
-
-static inline void preempt_notifier_init(struct preempt_notifier *notifier,
-				     struct preempt_ops *ops)
-{
-	INIT_HLIST_NODE(&notifier->link);
-	notifier->ops = ops;
-}
-
-#endif
-
 #endif /* __LINUX_PREEMPT_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..c26e078 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1165,6 +1165,52 @@ struct sched_rt_entity {
 #endif
 };
 
+#ifdef CONFIG_SCHED_NOTIFIERS
+
+struct sched_notifier;
+
+/**
+ * sched_notifier_ops - notifiers called for scheduling events
+ * @in: we're about to be rescheduled:
+ *    notifier: struct sched_notifier for the task being scheduled
+ * @out: we've just been preempted
+ *    notifier: struct sched_notifier for the task being preempted
+ *    next: the task that's kicking us out
+ *
+ * Please note that in and out are called under different contexts.
+ * out is called with rq lock held and irq disabled while in is called
+ * without rq lock and irq enabled.  This difference is intentional
+ * and depended upon by its users.
+ */
+struct sched_notifier_ops {
+	void (*in)(struct sched_notifier *notifier);
+	void (*out)(struct sched_notifier *notifier, struct task_struct *next);
+};
+
+/**
+ * sched_notifier - key for installing scheduler notifiers
+ * @link: internal use
+ * @ops: defines the notifier functions to be called
+ *
+ * Usually used in conjunction with container_of().
+ */
+struct sched_notifier {
+	struct hlist_node link;
+	struct sched_notifier_ops *ops;
+};
+
+void sched_notifier_register(struct sched_notifier *notifier);
+void sched_notifier_unregister(struct sched_notifier *notifier);
+
+static inline void sched_notifier_init(struct sched_notifier *notifier,
+				       struct sched_notifier_ops *ops)
+{
+	INIT_HLIST_NODE(&notifier->link);
+	notifier->ops = ops;
+}
+
+#endif	/* CONFIG_SCHED_NOTIFIERS */
+
 struct rcu_node;
 
 struct task_struct {
@@ -1188,9 +1234,9 @@ struct task_struct {
 	struct sched_entity se;
 	struct sched_rt_entity rt;
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	/* list of struct preempt_notifier: */
-	struct hlist_head preempt_notifiers;
+#ifdef CONFIG_SCHED_NOTIFIERS
+	/* list of struct sched_notifier: */
+	struct hlist_head sched_notifiers;
 #endif
 
 	/*
diff --git a/init/Kconfig b/init/Kconfig
index eb77e8c..5b54a55 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1242,7 +1242,7 @@ config STOP_MACHINE
 
 source "block/Kconfig"
 
-config PREEMPT_NOTIFIERS
+config SCHED_NOTIFIERS
 	bool
 
 config PADATA
diff --git a/kernel/sched.c b/kernel/sched.c
index 724f0e4..c20fd31 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1390,6 +1390,55 @@ static inline void cpuacct_update_stats(struct task_struct *tsk,
 		enum cpuacct_stat_index idx, cputime_t val) {}
 #endif
 
+#ifdef CONFIG_SCHED_NOTIFIERS
+
+#define fire_sched_notifiers(event, p, args...) do {			\
+	struct sched_notifier *__sn;					\
+	struct hlist_node *__pos;					\
+									\
+	hlist_for_each_entry(__sn, __pos, &(p)->sched_notifiers, link)	\
+		__sn->ops->event(__sn , ##args);			\
+} while (0)
+
+/**
+ * sched_notifier_register - register scheduler notifier
+ * @notifier: notifier struct to register
+ */
+void sched_notifier_register(struct sched_notifier *notifier)
+{
+	hlist_add_head(&notifier->link, &current->sched_notifiers);
+}
+EXPORT_SYMBOL_GPL(sched_notifier_register);
+
+/**
+ * sched_notifier_unregister - unregister scheduler notifier
+ * @notifier: notifier struct to unregister
+ *
+ * This is safe to call from within a scheduler notifier.
+ */
+void sched_notifier_unregister(struct sched_notifier *notifier)
+{
+	hlist_del(&notifier->link);
+}
+EXPORT_SYMBOL_GPL(sched_notifier_unregister);
+
+#else	/* !CONFIG_SCHED_NOTIFIERS */
+
+#define fire_sched_notifiers(event, p, args...)	do { } while (0)
+
+#endif	/* CONFIG_SCHED_NOTIFIERS */
+
+static inline void fire_sched_notifiers_out(struct task_struct *p,
+					    struct task_struct *next)
+{
+	fire_sched_notifiers(out, p, next);
+}
+
+static inline void fire_sched_notifiers_in(struct task_struct *p)
+{
+	fire_sched_notifiers(in, p);
+}
+
 static inline void inc_cpu_load(struct rq *rq, unsigned long load)
 {
 	update_load_add(&rq->load, load);
@@ -2569,8 +2618,8 @@ static void __sched_fork(struct task_struct *p)
 	p->se.on_rq = 0;
 	INIT_LIST_HEAD(&p->se.group_node);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	INIT_HLIST_HEAD(&p->preempt_notifiers);
+#ifdef CONFIG_SCHED_NOTIFIERS
+	INIT_HLIST_HEAD(&p->sched_notifiers);
 #endif
 }
 
@@ -2688,64 +2737,6 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 	put_cpu();
 }
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-
-/**
- * preempt_notifier_register - tell me when current is being preempted & rescheduled
- * @notifier: notifier struct to register
- */
-void preempt_notifier_register(struct preempt_notifier *notifier)
-{
-	hlist_add_head(&notifier->link, &current->preempt_notifiers);
-}
-EXPORT_SYMBOL_GPL(preempt_notifier_register);
-
-/**
- * preempt_notifier_unregister - no longer interested in preemption notifications
- * @notifier: notifier struct to unregister
- *
- * This is safe to call from within a preemption notifier.
- */
-void preempt_notifier_unregister(struct preempt_notifier *notifier)
-{
-	hlist_del(&notifier->link);
-}
-EXPORT_SYMBOL_GPL(preempt_notifier_unregister);
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-	struct preempt_notifier *notifier;
-	struct hlist_node *node;
-
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
-		notifier->ops->sched_in(notifier);
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-	struct preempt_notifier *notifier;
-	struct hlist_node *node;
-
-	hlist_for_each_entry(notifier, node, &curr->preempt_notifiers, link)
-		notifier->ops->sched_out(notifier, next);
-}
-
-#else /* !CONFIG_PREEMPT_NOTIFIERS */
-
-static void fire_sched_in_preempt_notifiers(struct task_struct *curr)
-{
-}
-
-static void
-fire_sched_out_preempt_notifiers(struct task_struct *curr,
-				 struct task_struct *next)
-{
-}
-
-#endif /* CONFIG_PREEMPT_NOTIFIERS */
-
 /**
  * prepare_task_switch - prepare to switch tasks
  * @rq: the runqueue preparing to switch
@@ -2763,7 +2754,7 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
-	fire_sched_out_preempt_notifiers(prev, next);
+	fire_sched_notifiers_out(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
 }
@@ -2813,7 +2804,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
 
-	fire_sched_in_preempt_notifiers(current);
+	fire_sched_notifiers_in(current);
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
@@ -7800,8 +7791,8 @@ void __init sched_init(void)
 
 	set_load_weight(&init_task);
 
-#ifdef CONFIG_PREEMPT_NOTIFIERS
-	INIT_HLIST_HEAD(&init_task.preempt_notifiers);
+#ifdef CONFIG_SCHED_NOTIFIERS
+	INIT_HLIST_HEAD(&init_task.sched_notifiers);
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2c06489..9d13481 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -77,7 +77,7 @@ static atomic_t hardware_enable_failed;
 struct kmem_cache *kvm_vcpu_cache;
 EXPORT_SYMBOL_GPL(kvm_vcpu_cache);
 
-static __read_mostly struct preempt_ops kvm_preempt_ops;
+static __read_mostly struct sched_notifier_ops kvm_sched_notifier_ops;
 
 struct dentry *kvm_debugfs_dir;
 
@@ -111,7 +111,7 @@ void vcpu_load(struct kvm_vcpu *vcpu)
 
 	mutex_lock(&vcpu->mutex);
 	cpu = get_cpu();
-	preempt_notifier_register(&vcpu->preempt_notifier);
+	sched_notifier_register(&vcpu->sched_notifier);
 	kvm_arch_vcpu_load(vcpu, cpu);
 	put_cpu();
 }
@@ -120,7 +120,7 @@ void vcpu_put(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
 	kvm_arch_vcpu_put(vcpu);
-	preempt_notifier_unregister(&vcpu->preempt_notifier);
+	sched_notifier_unregister(&vcpu->sched_notifier);
 	preempt_enable();
 	mutex_unlock(&vcpu->mutex);
 }
@@ -1315,7 +1315,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 	if (IS_ERR(vcpu))
 		return PTR_ERR(vcpu);
 
-	preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);
+	sched_notifier_init(&vcpu->sched_notifier, &kvm_sched_notifier_ops);
 
 	r = kvm_arch_vcpu_setup(vcpu);
 	if (r)
@@ -2158,23 +2158,21 @@ static struct sys_device kvm_sysdev = {
 struct page *bad_page;
 pfn_t bad_pfn;
 
-static inline
-struct kvm_vcpu *preempt_notifier_to_vcpu(struct preempt_notifier *pn)
+static inline struct kvm_vcpu *sched_notifier_to_vcpu(struct sched_notifier *sn)
 {
-	return container_of(pn, struct kvm_vcpu, preempt_notifier);
+	return container_of(sn, struct kvm_vcpu, sched_notifier);
 }
 
-static void kvm_sched_in(struct preempt_notifier *pn)
+static void kvm_sched_in(struct sched_notifier *sn)
 {
-	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+	struct kvm_vcpu *vcpu = sched_notifier_to_vcpu(sn);
 
 	kvm_arch_vcpu_load(vcpu, raw_smp_processor_id());
 }
 
-static void kvm_sched_out(struct preempt_notifier *pn,
-			  struct task_struct *next)
+static void kvm_sched_out(struct sched_notifier *sn, struct task_struct *next)
 {
-	struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
+	struct kvm_vcpu *vcpu = sched_notifier_to_vcpu(sn);
 
 	kvm_arch_vcpu_put(vcpu);
 }
@@ -2247,8 +2245,8 @@ int kvm_init(void *opaque, unsigned int vcpu_size,
 		goto out_free;
 	}
 
-	kvm_preempt_ops.sched_in = kvm_sched_in;
-	kvm_preempt_ops.sched_out = kvm_sched_out;
+	kvm_sched_notifier_ops.in = kvm_sched_in;
+	kvm_sched_notifier_ops.out = kvm_sched_out;
 
 	kvm_init_debug();
 
-- 
1.6.4.2


* [PATCH 03/12] perf: add perf_event_task_migrate()
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel
  Cc: Tejun Heo, Peter Zijlstra

Instead of calling perf_sw_event() directly from set_task_cpu(),
implement perf_event_task_migrate(), which takes the same arguments
as trace_sched_migrate_task() and invokes perf_sw_event() if the task
is really migrating (cur_cpu != new_cpu).  This will help unify the
notifiers in sched.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 include/linux/perf_event.h |    3 +++
 kernel/perf_event.c        |   11 +++++++++++
 kernel/sched.c             |    5 ++---
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c8e3754..a5eec48 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -754,6 +754,7 @@ extern int perf_max_events;
 
 extern const struct pmu *hw_perf_event_init(struct perf_event *event);
 
+extern void perf_event_task_migrate(struct task_struct *task, int new_cpu);
 extern void perf_event_task_sched_in(struct task_struct *task);
 extern void perf_event_task_sched_out(struct task_struct *task, struct task_struct *next);
 extern void perf_event_task_tick(struct task_struct *task);
@@ -949,6 +950,8 @@ extern void perf_event_enable(struct perf_event *event);
 extern void perf_event_disable(struct perf_event *event);
 #else
 static inline void
+perf_event_task_migrate(struct task_struct *task, int new_cpu)		{ }
+static inline void
 perf_event_task_sched_in(struct task_struct *task)			{ }
 static inline void
 perf_event_task_sched_out(struct task_struct *task,
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 3d1552d..a01ba31 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1148,6 +1148,17 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,
 }
 
 /*
+ * Called from scheduler set_task_cpu() to notify migration events.
+ * If the task is moving to a different cpu, generate a migration sw
+ * event.
+ */
+void perf_event_task_migrate(struct task_struct *task, int new_cpu)
+{
+	if (task_cpu(task) != new_cpu)
+		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
+}
+
+/*
  * Called from scheduler to remove the events of the current task,
  * with interrupts disabled.
  *
diff --git a/kernel/sched.c b/kernel/sched.c
index c20fd31..2568911 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2084,11 +2084,10 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 #endif
 
 	trace_sched_migrate_task(p, new_cpu);
+	perf_event_task_migrate(p, new_cpu);
 
-	if (task_cpu(p) != new_cpu) {
+	if (task_cpu(p) != new_cpu)
 		p->se.nr_migrations++;
-		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
-	}
 
 	__set_task_cpu(p, new_cpu);
 }
-- 
1.6.4.2


* [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel
  Cc: Tejun Heo, Peter Zijlstra

Add @rq to perf_event_task_sched_out() so that its arguments match
those of trace_sched_switch().  This will help unify the notifiers in
sched.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 include/linux/perf_event.h |    5 +++--
 kernel/perf_event.c        |    4 ++--
 kernel/sched.c             |    2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index a5eec48..1e3c6c3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -756,7 +756,8 @@ extern const struct pmu *hw_perf_event_init(struct perf_event *event);
 
 extern void perf_event_task_migrate(struct task_struct *task, int new_cpu);
 extern void perf_event_task_sched_in(struct task_struct *task);
-extern void perf_event_task_sched_out(struct task_struct *task, struct task_struct *next);
+extern void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
+				      struct task_struct *next);
 extern void perf_event_task_tick(struct task_struct *task);
 extern int perf_event_init_task(struct task_struct *child);
 extern void perf_event_exit_task(struct task_struct *child);
@@ -954,7 +955,7 @@ perf_event_task_migrate(struct task_struct *task, int new_cpu)		{ }
 static inline void
 perf_event_task_sched_in(struct task_struct *task)			{ }
 static inline void
-perf_event_task_sched_out(struct task_struct *task,
+perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
 			    struct task_struct *next)			{ }
 static inline void
 perf_event_task_tick(struct task_struct *task)				{ }
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index a01ba31..73e2c1c 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1169,8 +1169,8 @@ void perf_event_task_migrate(struct task_struct *task, int new_cpu)
  * accessing the event control register. If a NMI hits, then it will
  * not restart the event.
  */
-void perf_event_task_sched_out(struct task_struct *task,
-				 struct task_struct *next)
+void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
+			       struct task_struct *next)
 {
 	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
 	struct perf_event_context *ctx = task->perf_event_ctxp;
diff --git a/kernel/sched.c b/kernel/sched.c
index 2568911..4c5e4c9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3720,7 +3720,7 @@ need_resched_nonpreemptible:
 
 	if (likely(prev != next)) {
 		sched_info_switch(prev, next);
-		perf_event_task_sched_out(prev, next);
+		perf_event_task_sched_out(rq, prev, next);
 
 		rq->nr_switches++;
 		rq->curr = next;
-- 
1.6.4.2


* [PATCH 05/12] perf: move perf_event_task_sched_in() next to fire_sched_notifiers_in()
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel
  Cc: Tejun Heo, Peter Zijlstra

Move perf_event_task_sched_in() to after finish_lock_switch(), right
before fire_sched_notifiers_in().  This costs an extra pair of irq
disable/enable operations when switching in a perf task context but
allows unifying the event functions at a more flexible position (irq
enabled, rq lock released).

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 kernel/perf_event.c |    9 +++++++--
 kernel/sched.c      |    8 +-------
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 73e2c1c..295699f 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1369,14 +1369,17 @@ static void task_ctx_sched_in(struct task_struct *task,
  */
 void perf_event_task_sched_in(struct task_struct *task)
 {
-	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
 	struct perf_event_context *ctx = task->perf_event_ctxp;
+	struct perf_cpu_context *cpuctx;
 
 	if (likely(!ctx))
 		return;
 
+	local_irq_disable();
+
+	cpuctx = &__get_cpu_var(perf_cpu_context);
 	if (cpuctx->task_ctx == ctx)
-		return;
+		goto out_enable;
 
 	/*
 	 * We want to keep the following priority order:
@@ -1390,6 +1393,8 @@ void perf_event_task_sched_in(struct task_struct *task)
 	ctx_sched_in(ctx, cpuctx, EVENT_FLEXIBLE);
 
 	cpuctx->task_ctx = ctx;
+out_enable:
+	local_irq_enable();
 }
 
 #define MAX_INTERRUPTS (~0ULL)
diff --git a/kernel/sched.c b/kernel/sched.c
index 4c5e4c9..df6e0af 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2794,15 +2794,9 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	 */
 	prev_state = prev->state;
 	finish_arch_switch(prev);
-#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
-	local_irq_disable();
-#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
-	perf_event_task_sched_in(current);
-#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
-	local_irq_enable();
-#endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */
 	finish_lock_switch(rq, prev);
 
+	perf_event_task_sched_in(current);
 	fire_sched_notifiers_in(current);
 	if (mm)
 		mmdrop(mm);
-- 
1.6.4.2


* [PATCH 06/12] sched: relocate fire_sched_notifiers_out() and trace_sched_switch()
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel; +Cc: Tejun Heo

Relocate fire_sched_notifiers_out() from prepare_task_switch() and
trace_sched_switch() from context_switch() to schedule() so that
trace_sched_switch(), perf_event_task_sched_out() and
fire_sched_notifiers_out() are colocated.

Both are now called before the switch counts and rq->curr are
updated, and in addition trace_sched_switch() is called before
prepare_lock/arch_switch(); however, this shouldn't make any visible
difference for either.

This will help unify the notifiers in sched.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index df6e0af..0b753f0 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2753,7 +2753,6 @@ static inline void
 prepare_task_switch(struct rq *rq, struct task_struct *prev,
 		    struct task_struct *next)
 {
-	fire_sched_notifiers_out(prev, next);
 	prepare_lock_switch(rq, next);
 	prepare_arch_switch(next);
 }
@@ -2882,7 +2881,6 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(rq, prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
@@ -3714,7 +3712,9 @@ need_resched_nonpreemptible:
 
 	if (likely(prev != next)) {
 		sched_info_switch(prev, next);
+		trace_sched_switch(rq, prev, next);
 		perf_event_task_sched_out(rq, prev, next);
+		fire_sched_notifiers_out(prev, next);
 
 		rq->nr_switches++;
 		rq->curr = next;
-- 
1.6.4.2


* [PATCH 07/12] sched: coalesce event notifiers
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel; +Cc: Tejun Heo

sched currently hosts three different event notification mechanisms:
tracepoints, perf_event functions and sched_notifiers.  The previous
patches modified and moved them around so that they are colocated
where applicable and share most of their arguments.  This patch
introduces and uses SCHED_EVENT() and SCHED_EVENT_RQ() to coalesce
calls to the different mechanisms.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 kernel/sched.c |   51 ++++++++++++++++++++++++++++++++++++++++-----------
 1 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0b753f0..1acec30 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1439,6 +1439,39 @@ static inline void fire_sched_notifiers_in(struct task_struct *p)
 	fire_sched_notifiers(in, p);
 }
 
+/*
+ * Sched is watched by three different mechanisms - tracepoint,
+ * perf_event and sched_notifiers.  SCHED_EVENT*() can be used to
+ * define all or any part of them at once so that code clutter is kept
+ * to minimum and optimizations can be applied according to different
+ * config options.
+ *
+ * In SCHED_EVENT(), the first three arguments specify the name of
+ * tracepoint, perf_event and sched_notifier.  NONE can be used to
+ * omit any subset of the three.  The last argument is event arguments
+ * wrapped inside SE_ARGS() macro.
+ *
+ * SCHED_EVENT_RQ() is identical except that @rq argument will be
+ * added to tracepoint and perf calls.
+ */
+#define trace_sched_NONE(args...)		do { } while (0)
+#define perf_event_task_NONE(args...)		do { } while (0)
+#define fire_sched_notifiers_NONE(args...)	do { } while (0)
+
+#define SE_ARGS(args...)			args
+
+#define SCHED_EVENT(TP, PERF, SN, args) do {				\
+	trace_sched_##TP(args);						\
+	perf_event_task_##PERF(args);					\
+	fire_sched_notifiers_##SN(args);				\
+} while (0)
+
+#define SCHED_EVENT_RQ(TP, PERF, SN, rq, args) do {			\
+	trace_sched_##TP(rq, args);					\
+	perf_event_task_##PERF(rq, args);				\
+	fire_sched_notifiers_##SN(args);				\
+} while (0)
+
 static inline void inc_cpu_load(struct rq *rq, unsigned long load)
 {
 	update_load_add(&rq->load, load);
@@ -2083,8 +2116,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 			!(task_thread_info(p)->preempt_count & PREEMPT_ACTIVE));
 #endif
 
-	trace_sched_migrate_task(p, new_cpu);
-	perf_event_task_migrate(p, new_cpu);
+	SCHED_EVENT(migrate_task, migrate, NONE, SE_ARGS(p, new_cpu));
 
 	if (task_cpu(p) != new_cpu)
 		p->se.nr_migrations++;
@@ -2223,7 +2255,7 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
-		trace_sched_wait_task(rq, p);
+		SCHED_EVENT_RQ(wait_task, NONE, NONE, rq, SE_ARGS(p));
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		ncsw = 0;
@@ -2515,7 +2547,7 @@ out_activate:
 	}
 
 out_running:
-	trace_sched_wakeup(rq, p, success);
+	SCHED_EVENT_RQ(wakeup, NONE, NONE, rq, SE_ARGS(p, success));
 	check_preempt_curr(rq, p, wake_flags);
 
 	p->state = TASK_RUNNING;
@@ -2726,7 +2758,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 	p->state = TASK_RUNNING;
 	update_rq_clock(rq);
 	activate_task(rq, p, 0);
-	trace_sched_wakeup_new(rq, p, 1);
+	SCHED_EVENT_RQ(wakeup_new, NONE, NONE, rq, SE_ARGS(p, 1));
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)
@@ -2795,8 +2827,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	finish_arch_switch(prev);
 	finish_lock_switch(rq, prev);
 
-	perf_event_task_sched_in(current);
-	fire_sched_notifiers_in(current);
+	SCHED_EVENT(NONE, sched_in, in, SE_ARGS(current));
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
@@ -3498,7 +3529,7 @@ void scheduler_tick(void)
 	curr->sched_class->task_tick(rq, curr, 0);
 	raw_spin_unlock(&rq->lock);
 
-	perf_event_task_tick(curr);
+	SCHED_EVENT(NONE, tick, NONE, SE_ARGS(curr));
 
 #ifdef CONFIG_SMP
 	rq->idle_at_tick = idle_cpu(cpu);
@@ -3712,9 +3743,7 @@ need_resched_nonpreemptible:
 
 	if (likely(prev != next)) {
 		sched_info_switch(prev, next);
-		trace_sched_switch(rq, prev, next);
-		perf_event_task_sched_out(rq, prev, next);
-		fire_sched_notifiers_out(prev, next);
+		SCHED_EVENT_RQ(switch, sched_out, out, rq, SE_ARGS(prev, next));
 
 		rq->nr_switches++;
 		rq->curr = next;
-- 
1.6.4.2


* [PATCH 08/12] sched: add switch_in and tick tracepoints
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel; +Cc: Tejun Heo

Define and add the sched_switch_in and sched_tick tracepoints.  Both
are colocated with the matching perf event functions.  These will be
used to make perf use the tracepoints.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 include/trace/events/sched.h |   14 ++++++++++++++
 kernel/sched.c               |    4 ++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index cfceb0b..2bed8de 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -220,6 +220,20 @@ DECLARE_EVENT_CLASS(sched_process_template,
 );
 
 /*
+ * Tracepoint for task switch in:
+ */
+DEFINE_EVENT(sched_process_template, sched_switch_in,
+	     TP_PROTO(struct task_struct *p),
+	     TP_ARGS(p));
+
+/*
+ * Tracepoint for scheduler tick:
+ */
+DEFINE_EVENT(sched_process_template, sched_tick,
+	     TP_PROTO(struct task_struct *p),
+	     TP_ARGS(p));
+
+/*
  * Tracepoint for freeing a task:
  */
 DEFINE_EVENT(sched_process_template, sched_process_free,
diff --git a/kernel/sched.c b/kernel/sched.c
index 1acec30..f70f9d7 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2827,7 +2827,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
 	finish_arch_switch(prev);
 	finish_lock_switch(rq, prev);
 
-	SCHED_EVENT(NONE, sched_in, in, SE_ARGS(current));
+	SCHED_EVENT(switch_in, sched_in, in, SE_ARGS(current));
 	if (mm)
 		mmdrop(mm);
 	if (unlikely(prev_state == TASK_DEAD)) {
@@ -3529,7 +3529,7 @@ void scheduler_tick(void)
 	curr->sched_class->task_tick(rq, curr, 0);
 	raw_spin_unlock(&rq->lock);
 
-	SCHED_EVENT(NONE, tick, NONE, SE_ARGS(curr));
+	SCHED_EVENT(tick, tick, NONE, SE_ARGS(curr));
 
 #ifdef CONFIG_SMP
 	rq->idle_at_tick = idle_cpu(cpu);
-- 
1.6.4.2


* [PATCH 09/12] perf: factor out perf_event_switch_clones()
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel
  Cc: Tejun Heo, Peter Zijlstra

Factor out perf_event_switch_clones() from
perf_event_task_sched_out().  This is to ease future changes and
doesn't cause any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 kernel/perf_event.c |   62 +++++++++++++++++++++++++++++---------------------
 1 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 295699f..3f3e328 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -1158,30 +1158,14 @@ void perf_event_task_migrate(struct task_struct *task, int new_cpu)
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
 }
 
-/*
- * Called from scheduler to remove the events of the current task,
- * with interrupts disabled.
- *
- * We stop each event and update the event value in event->count.
- *
- * This does not protect us against NMI, but disable()
- * sets the disabled bit in the control field of event _before_
- * accessing the event control register. If a NMI hits, then it will
- * not restart the event.
- */
-void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
-			       struct task_struct *next)
+static bool perf_event_switch_clones(struct perf_cpu_context *cpuctx,
+				     struct perf_event_context *ctx,
+				     struct task_struct *task,
+				     struct task_struct *next)
 {
-	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
-	struct perf_event_context *ctx = task->perf_event_ctxp;
 	struct perf_event_context *next_ctx;
 	struct perf_event_context *parent;
-	int do_switch = 1;
-
-	perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0);
-
-	if (likely(!ctx || !cpuctx->task_ctx))
-		return;
+	bool switched = false;
 
 	rcu_read_lock();
 	parent = rcu_dereference(ctx->parent_ctx);
@@ -1208,7 +1192,7 @@ void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
 			next->perf_event_ctxp = ctx;
 			ctx->task = next;
 			next_ctx->task = task;
-			do_switch = 0;
+			switched = true;
 
 			perf_event_sync_stat(ctx, next_ctx);
 		}
@@ -1217,10 +1201,36 @@ void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
 	}
 	rcu_read_unlock();
 
-	if (do_switch) {
-		ctx_sched_out(ctx, cpuctx, EVENT_ALL);
-		cpuctx->task_ctx = NULL;
-	}
+	return switched;
+}
+
+/*
+ * Called from scheduler to remove the events of the current task,
+ * with interrupts disabled.
+ *
+ * We stop each event and update the event value in event->count.
+ *
+ * This does not protect us against NMI, but disable()
+ * sets the disabled bit in the control field of event _before_
+ * accessing the event control register. If a NMI hits, then it will
+ * not restart the event.
+ */
+void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
+			       struct task_struct *next)
+{
+	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
+	struct perf_event_context *ctx = task->perf_event_ctxp;
+
+	perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0);
+
+	if (likely(!ctx || !cpuctx->task_ctx))
+		return;
+
+	if (perf_event_switch_clones(cpuctx, ctx, task, next))
+		return;
+
+	ctx_sched_out(ctx, cpuctx, EVENT_ALL);
+	cpuctx->task_ctx = NULL;
 }
 
 static void task_ctx_sched_out(struct perf_event_context *ctx,
-- 
1.6.4.2


* [PATCH 10/12] perf: make nr_events an int and add perf_online_mutex to protect it
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel
  Cc: Tejun Heo, Peter Zijlstra

Change nr_events from atomic_t to int and add perf_online_mutex to
protect it.  Helpers to manipulate the count, perf_inc_nr_events()
and perf_dec_nr_events(), are added.  The former may fail, although
it doesn't in the current code.

This prepares for making the sched perf functions be called via
tracepoints.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 kernel/perf_event.c |   40 ++++++++++++++++++++++++++++++++++++----
 1 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 3f3e328..621d1f1 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -43,7 +43,7 @@ int perf_max_events __read_mostly = 1;
 static int perf_reserved_percpu __read_mostly;
 static int perf_overcommit __read_mostly = 1;
 
-static atomic_t nr_events __read_mostly;
+static int nr_events __read_mostly;
 static atomic_t nr_mmap_events __read_mostly;
 static atomic_t nr_comm_events __read_mostly;
 static atomic_t nr_task_events __read_mostly;
@@ -72,6 +72,26 @@ static atomic64_t perf_event_id;
 static DEFINE_SPINLOCK(perf_resource_lock);
 
 /*
+ * Lock to protect nr_events and master enable:
+ */
+static DEFINE_MUTEX(perf_online_mutex);
+
+static int perf_inc_nr_events(void)
+{
+	mutex_lock(&perf_online_mutex);
+	nr_events++;
+	mutex_unlock(&perf_online_mutex);
+	return 0;
+}
+
+static void perf_dec_nr_events(void)
+{
+	mutex_lock(&perf_online_mutex);
+	nr_events--;
+	mutex_unlock(&perf_online_mutex);
+}
+
+/*
  * Architecture provided APIs - weak aliases:
  */
 extern __weak const struct pmu *hw_perf_event_init(struct perf_event *event)
@@ -1589,7 +1609,13 @@ void perf_event_task_tick(struct task_struct *curr)
 	struct perf_event_context *ctx;
 	int rotate = 0;
 
-	if (!atomic_read(&nr_events))
+	/*
+	 * nr_events is incremented before events are enabled and
+	 * decremented after they have been pulled out, so it is
+	 * guaranteed to read as non-zero with any enabled event on
+	 * this cpu.
+	 */
+	if (!nr_events)
 		return;
 
 	cpuctx = &__get_cpu_var(perf_cpu_context);
@@ -1857,7 +1883,7 @@ static void free_event(struct perf_event *event)
 	perf_pending_sync(event);
 
 	if (!event->parent) {
-		atomic_dec(&nr_events);
+		perf_dec_nr_events();
 		if (event->attr.mmap)
 			atomic_dec(&nr_mmap_events);
 		if (event->attr.comm)
@@ -4569,6 +4595,12 @@ perf_event_alloc(struct perf_event_attr *attr,
 	if (!event)
 		return ERR_PTR(-ENOMEM);
 
+	err = perf_inc_nr_events();
+	if (err) {
+		kfree(event);
+		return ERR_PTR(err);
+	}
+
 	/*
 	 * Single events are their own group leaders, with an
 	 * empty sibling list:
@@ -4657,6 +4689,7 @@ done:
 	if (err) {
 		if (event->ns)
 			put_pid_ns(event->ns);
+		perf_dec_nr_events();
 		kfree(event);
 		return ERR_PTR(err);
 	}
@@ -4664,7 +4697,6 @@ done:
 	event->pmu = pmu;
 
 	if (!event->parent) {
-		atomic_inc(&nr_events);
 		if (event->attr.mmap)
 			atomic_inc(&nr_mmap_events);
 		if (event->attr.comm)
-- 
1.6.4.2


* [PATCH 11/12] perf: prepare to move sched perf functions on top of tracepoints
From: Tejun Heo @ 2010-05-04 12:38 UTC
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel
  Cc: Tejun Heo, Peter Zijlstra

* Move the prototypes of perf_event_task_{migrate|sched_in|sched_out|tick}()
  into a separate #ifdef block, append _fn to the function names and
  define macros to redirect the calls.

* Define PE_STATIC, which currently is empty, so that these functions
  can be made static depending on config options.

* Define a no-op perf_task_sched_out_done() and call it from the end
  of perf_event_task_sched_out().

Other than renaming the functions, this patch doesn't introduce any
visible change.  This is to prepare for moving these functions on top
of tracepoints.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 include/linux/perf_event.h |   43 +++++++++++++++++++++++++++++--------------
 kernel/perf_event.c        |   30 ++++++++++++++++++++----------
 2 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1e3c6c3..0ad898b 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -747,6 +747,35 @@ struct perf_output_handle {
 
 #ifdef CONFIG_PERF_EVENTS
 
+extern void perf_event_task_migrate_fn(struct task_struct *task, int new_cpu);
+extern void perf_event_task_sched_in_fn(struct task_struct *task);
+extern void perf_event_task_sched_out_fn(struct rq *rq,
+					 struct task_struct *task,
+					 struct task_struct *next);
+extern void perf_event_task_tick_fn(struct task_struct *task);
+
+#define perf_event_task_migrate(t, c)	perf_event_task_migrate_fn((t), (c))
+#define perf_event_task_sched_in(t)	perf_event_task_sched_in_fn((t))
+#define perf_event_task_sched_out(r, t, n) \
+	perf_event_task_sched_out_fn((r), (t), (n))
+#define perf_event_task_tick(t)		perf_event_task_tick_fn((t))
+
+#else
+
+static inline void
+perf_event_task_migrate(struct task_struct *task, int new_cpu)		{ }
+static inline void
+perf_event_task_sched_in(struct task_struct *task)			{ }
+static inline void
+perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
+			  struct task_struct *next)			{ }
+static inline void
+perf_event_task_tick(struct task_struct *task)				{ }
+
+#endif
+
+#ifdef CONFIG_PERF_EVENTS
+
 /*
  * Set by architecture code:
  */
@@ -754,11 +783,6 @@ extern int perf_max_events;
 
 extern const struct pmu *hw_perf_event_init(struct perf_event *event);
 
-extern void perf_event_task_migrate(struct task_struct *task, int new_cpu);
-extern void perf_event_task_sched_in(struct task_struct *task);
-extern void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
-				      struct task_struct *next);
-extern void perf_event_task_tick(struct task_struct *task);
 extern int perf_event_init_task(struct task_struct *child);
 extern void perf_event_exit_task(struct task_struct *child);
 extern void perf_event_free_task(struct task_struct *task);
@@ -950,15 +974,6 @@ extern void perf_swevent_put_recursion_context(int rctx);
 extern void perf_event_enable(struct perf_event *event);
 extern void perf_event_disable(struct perf_event *event);
 #else
-static inline void
-perf_event_task_migrate(struct task_struct *task, int new_cpu)		{ }
-static inline void
-perf_event_task_sched_in(struct task_struct *task)			{ }
-static inline void
-perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
-			    struct task_struct *next)			{ }
-static inline void
-perf_event_task_tick(struct task_struct *task)				{ }
 static inline int perf_event_init_task(struct task_struct *child)	{ return 0; }
 static inline void perf_event_exit_task(struct task_struct *child)	{ }
 static inline void perf_event_free_task(struct task_struct *task)	{ }
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 621d1f1..1c83dc6 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -76,6 +76,8 @@ static DEFINE_SPINLOCK(perf_resource_lock);
  */
 static DEFINE_MUTEX(perf_online_mutex);
 
+#define PE_STATIC
+
 static int perf_inc_nr_events(void)
 {
 	mutex_lock(&perf_online_mutex);
@@ -91,6 +93,10 @@ static void perf_dec_nr_events(void)
 	mutex_unlock(&perf_online_mutex);
 }
 
+static void perf_task_sched_out_done(struct perf_event_context *ctx)
+{
+}
+
 /*
  * Architecture provided APIs - weak aliases:
  */
@@ -189,7 +195,7 @@ perf_lock_task_context(struct task_struct *task, unsigned long *flags)
 		/*
 		 * If this context is a clone of another, it might
 		 * get swapped for another underneath us by
-		 * perf_event_task_sched_out, though the
+		 * perf_event_task_sched_out_fn, though the
 		 * rcu_read_lock() protects us from any context
 		 * getting freed.  Lock the context and check if it
 		 * got swapped before we could get the lock, and retry
@@ -582,7 +588,8 @@ static void __perf_event_disable(void *info)
  * goes to exit will block in sync_child_event.
  * When called from perf_pending_event it's OK because event->ctx
  * is the current context on this CPU and preemption is disabled,
- * hence we can't get into perf_event_task_sched_out for this context.
+ * hence we can't get into perf_event_task_sched_out_fn for this
+ * context.
  */
 void perf_event_disable(struct perf_event *event)
 {
@@ -1172,7 +1179,7 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,
  * If the task is moving to a different cpu, generate a migration sw
  * event.
  */
-void perf_event_task_migrate(struct task_struct *task, int new_cpu)
+PE_STATIC void perf_event_task_migrate_fn(struct task_struct *task, int new_cpu)
 {
 	if (task_cpu(task) != new_cpu)
 		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
@@ -1235,8 +1242,9 @@ static bool perf_event_switch_clones(struct perf_cpu_context *cpuctx,
  * accessing the event control register. If a NMI hits, then it will
  * not restart the event.
  */
-void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
-			       struct task_struct *next)
+PE_STATIC void perf_event_task_sched_out_fn(struct rq *rq,
+					    struct task_struct *task,
+					    struct task_struct *next)
 {
 	struct perf_cpu_context *cpuctx = &__get_cpu_var(perf_cpu_context);
 	struct perf_event_context *ctx = task->perf_event_ctxp;
@@ -1244,13 +1252,15 @@ void perf_event_task_sched_out(struct rq *rq, struct task_struct *task,
 	perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 1, NULL, 0);
 
 	if (likely(!ctx || !cpuctx->task_ctx))
-		return;
+		goto out;
 
 	if (perf_event_switch_clones(cpuctx, ctx, task, next))
-		return;
+		goto out;
 
 	ctx_sched_out(ctx, cpuctx, EVENT_ALL);
 	cpuctx->task_ctx = NULL;
+out:
+	perf_task_sched_out_done(cpuctx->task_ctx);
 }
 
 static void task_ctx_sched_out(struct perf_event_context *ctx,
@@ -1397,7 +1407,7 @@ static void task_ctx_sched_in(struct task_struct *task,
  * accessing the event control register. If a NMI hits, then it will
  * keep the event running.
  */
-void perf_event_task_sched_in(struct task_struct *task)
+PE_STATIC void perf_event_task_sched_in_fn(struct task_struct *task)
 {
 	struct perf_event_context *ctx = task->perf_event_ctxp;
 	struct perf_cpu_context *cpuctx;
@@ -1603,7 +1613,7 @@ static void rotate_ctx(struct perf_event_context *ctx)
 	raw_spin_unlock(&ctx->lock);
 }
 
-void perf_event_task_tick(struct task_struct *curr)
+PE_STATIC void perf_event_task_tick_fn(struct task_struct *curr)
 {
 	struct perf_cpu_context *cpuctx;
 	struct perf_event_context *ctx;
@@ -1705,7 +1715,7 @@ static void perf_event_enable_on_exec(struct task_struct *task)
 
 	raw_spin_unlock(&ctx->lock);
 
-	perf_event_task_sched_in(task);
+	perf_event_task_sched_in_fn(task);
  out:
 	local_irq_restore(flags);
 }
-- 
1.6.4.2



* [PATCH 12/12] perf: move sched perf functions on top of tracepoints
  2010-05-04 12:38 [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP Tejun Heo
                   ` (10 preceding siblings ...)
  2010-05-04 12:38 ` [PATCH 11/12] perf: prepare to move sched perf functions on top of tracepoints Tejun Heo
@ 2010-05-04 12:38 ` Tejun Heo
  2010-05-04 17:29 ` [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP Peter Zijlstra
  12 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-04 12:38 UTC (permalink / raw)
  To: mingo, peterz, efault, avi, paulus, acme, linux-kernel
  Cc: Tejun Heo, Peter Zijlstra

Now that all sched perf functions are colocated with tracepoints,
those perf functions can be moved on top of tracepoints instead of
being called directly.  After this patch, if both perf and tracepoints
are enabled, the four sched perf macros become noops and the backend
functions are defined static and registered as tracepoint probes on
demand.

The enable part is relatively simple.  Perf functions are registered
as TP probes.  sched_in is registered last so that contexts don't
get scheduled in without all the functions active.

Disable is a bit more involved.  First, all probes other than
sched_out are unregistered and drained, and the online cpus are recorded
in a cpumask.  With zero nr_events, sched_out always switches out the
task context and records that there's no task context for the cpu.  A
periodic timer is set up to watch the cpumask, and when it sees that all
cpus have switched out their contexts, the sched_out probe is
unregistered.

The timer trick is necessary because unregistering a probe requires
thread context while neither workqueues nor tasklets can be used
directly from sched_out, which is called under the rq lock.

This results in reduced overhead when both tracepoints and perf are
enabled and opens up possibilities for further optimization.  Although
the sched functions are the most frequently called ones, other perf
functions can also be converted to use TPs in a similar manner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 include/linux/perf_event.h |    2 +-
 kernel/perf_event.c        |  152 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 152 insertions(+), 2 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 0ad898b..66f2cba 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -745,7 +745,7 @@ struct perf_output_handle {
 	int				locked;
 };
 
-#ifdef CONFIG_PERF_EVENTS
+#if defined(CONFIG_PERF_EVENTS) && !defined(CONFIG_TRACEPOINTS)
 
 extern void perf_event_task_migrate_fn(struct task_struct *task, int new_cpu);
 extern void perf_event_task_sched_in_fn(struct task_struct *task);
diff --git a/kernel/perf_event.c b/kernel/perf_event.c
index 1c83dc6..1424aac 100644
--- a/kernel/perf_event.c
+++ b/kernel/perf_event.c
@@ -17,6 +17,7 @@
 #include <linux/poll.h>
 #include <linux/slab.h>
 #include <linux/sysfs.h>
+#include <linux/timer.h>
 #include <linux/dcache.h>
 #include <linux/percpu.h>
 #include <linux/ptrace.h>
@@ -26,12 +27,15 @@
 #include <linux/rculist.h>
 #include <linux/uaccess.h>
 #include <linux/syscalls.h>
+#include <linux/workqueue.h>
 #include <linux/anon_inodes.h>
 #include <linux/kernel_stat.h>
 #include <linux/perf_event.h>
 #include <linux/ftrace_event.h>
 #include <linux/hw_breakpoint.h>
 
+#include <trace/events/sched.h>
+
 #include <asm/irq_regs.h>
 
 /*
@@ -76,6 +80,151 @@ static DEFINE_SPINLOCK(perf_resource_lock);
  */
 static DEFINE_MUTEX(perf_online_mutex);
 
+#ifdef CONFIG_TRACEPOINTS
+/*
+ * Tracepoints are enabled.  Some perf event functions (currently the
+ * sched related ones) are called via tracepoints.  The functions are
+ * registered to respective tracepoints when the first event is
+ * created and start to unregister after the last event is destroyed.
+ */
+
+/* won't be called directly, make them static and declare them */
+#define PE_STATIC static
+
+static void perf_event_task_migrate_fn(struct task_struct *task, int new_cpu);
+static void perf_event_task_sched_in_fn(struct task_struct *task);
+static void perf_event_task_sched_out_fn(struct rq *rq,
+					 struct task_struct *task,
+					 struct task_struct *next);
+static void perf_event_task_tick_fn(struct task_struct *task);
+
+/*
+ * After the last event is destroyed, all event functions except for
+ * sched_out are disabled.  With zero nr_events, sched_out will always
+ * switch out context and a timer is set up to periodically watch the
+ * perf_online_mask.  When all the cpus have seen NULL context at
+ * least once, the timer schedules perf_offline_work to unregister
+ * sched_out.
+ *
+ * The offline timer is necessary because sched_out is called under rq
+ * lock and no async mechanism other than SOFTIRQ can be scheduled
+ * from under there.  Although slightly convoluted, it's not really
+ * bad.  There's nothing urgent about unregistering sched_out anyway.
+ */
+static cpumask_t perf_online_mask;
+
+static void perf_offline_work_fn(struct work_struct *work)
+{
+	unregister_trace_sched_switch(perf_event_task_sched_out_fn);
+}
+static DECLARE_WORK(perf_offline_work, perf_offline_work_fn);
+
+static void perf_offline_timer_fn(unsigned long data);
+static DEFINE_TIMER(perf_offline_timer, perf_offline_timer_fn, 0, 0);
+
+static void perf_offline_timer_fn(unsigned long data)
+{
+	/*
+	 * We don't care about CPUs which have come up in between as
+	 * they would never have task context set, but need to
+	 * explicitly ignore CPUs which went down in between.  Consider
+	 * draining done if there's no CPU left which was online when
+	 * nr_events hit zero and has stayed online.
+	 */
+	if (cpumask_any_and(&perf_online_mask, cpu_online_mask) >= nr_cpu_ids)
+		schedule_work(&perf_offline_work);
+	else
+		mod_timer(&perf_offline_timer, jiffies + HZ);
+}
+
+static int perf_inc_nr_events(void)
+{
+	int err = 0;
+
+	mutex_lock(&perf_online_mutex);
+
+	if (nr_events++)
+		goto out;
+
+	/* make sure nr_events > 0 is visible and cancel offline timer & work */
+	synchronize_sched();
+	del_timer_sync(&perf_offline_timer);
+	cancel_work_sync(&perf_offline_work);
+
+	/* first event, register probe functions */
+	err = register_trace_sched_migrate_task(perf_event_task_migrate_fn);
+	if (err && err != -EEXIST)
+		goto out;
+	err = register_trace_sched_tick(perf_event_task_tick_fn);
+	if (err && err != -EEXIST)
+		goto out;
+	err = register_trace_sched_switch(perf_event_task_sched_out_fn);
+	if (err && err != -EEXIST)
+		goto out;
+	/*
+	 * Register sched_in last so that contexts don't get scheduled
+	 * in with events partially enabled.  There already are enough
+	 * barriers to make this ordering effective.
+	 */
+	err = register_trace_sched_switch_in(perf_event_task_sched_in_fn);
+out:
+	if (err && err != -EEXIST) {
+		unregister_trace_sched_migrate_task(perf_event_task_migrate_fn);
+		unregister_trace_sched_tick(perf_event_task_tick_fn);
+		unregister_trace_sched_switch(perf_event_task_sched_out_fn);
+		nr_events--;
+	}
+	mutex_unlock(&perf_online_mutex);
+	return err;
+}
+
+static void perf_dec_nr_events(void)
+{
+	mutex_lock(&perf_online_mutex);
+	if (nr_events > 1) {
+		nr_events--;
+		goto out;
+	}
+
+	/* unregister anything other than sched_out */
+	unregister_trace_sched_migrate_task(perf_event_task_migrate_fn);
+	unregister_trace_sched_tick(perf_event_task_tick_fn);
+	unregister_trace_sched_switch_in(perf_event_task_sched_in_fn);
+
+	/* make sure probe functions are done */
+	synchronize_sched();
+
+	/*
+	 * Drain is complete when sched_out has seen NULL task context
+	 * at least once on all currently online CPUs after nr_events
+	 * hits zero.
+	 */
+	get_online_cpus();
+	cpumask_copy(&perf_online_mask, cpu_online_mask);
+	put_online_cpus();
+	smp_wmb();	/* online mask must be visible before zero nr_events */
+	nr_events--;	/* gogogo */
+
+	/* kick offline timer */
+	mod_timer(&perf_offline_timer, jiffies + HZ);
+out:
+	mutex_unlock(&perf_online_mutex);
+}
+
+static void perf_task_sched_out_done(struct perf_event_context *ctx)
+{
+	if (likely(nr_events) || ctx)
+		return;
+
+	smp_mb__before_clear_bit();	/* matches smp_wmb() in dec */
+	cpumask_clear_cpu(smp_processor_id(), &perf_online_mask);
+}
+
+#else
+/*
+ * Tracepoints not available.  Event functions are declared external
+ * and will be called directly.
+ */
 #define PE_STATIC
 
 static int perf_inc_nr_events(void)
@@ -96,6 +245,7 @@ static void perf_dec_nr_events(void)
 static void perf_task_sched_out_done(struct perf_event_context *ctx)
 {
 }
+#endif
 
 /*
  * Architecture provided APIs - weak aliases:
@@ -1254,7 +1404,7 @@ PE_STATIC void perf_event_task_sched_out_fn(struct rq *rq,
 	if (likely(!ctx || !cpuctx->task_ctx))
 		goto out;
 
-	if (perf_event_switch_clones(cpuctx, ctx, task, next))
+	if (nr_events && perf_event_switch_clones(cpuctx, ctx, task, next))
 		goto out;
 
 	ctx_sched_out(ctx, cpuctx, EVENT_ALL);
-- 
1.6.4.2



* Re: [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
  2010-05-04 12:38 ` [PATCH 04/12] perf: add @rq to perf_event_task_sched_out() Tejun Heo
@ 2010-05-04 17:11   ` Peter Zijlstra
  2010-05-04 17:22     ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-04 17:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

On Tue, 2010-05-04 at 14:38 +0200, Tejun Heo wrote:
> Add @rq to perf_event_task_sched_out() so that its argument matches
> those of trace_sched_switch().  This will help unifying notifiers in
> sched.

The alternative is dropping rq from the trace events, its not like
anybody outside sched.o can do anything sensible with it anyway.



* Re: [PATCH 01/12] sched: drop @cpu argument from sched_in preempt notifier
  2010-05-04 12:38 ` [PATCH 01/12] sched: drop @cpu argument from sched_in preempt notifier Tejun Heo
@ 2010-05-04 17:11   ` Peter Zijlstra
  0 siblings, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-04 17:11 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

On Tue, 2010-05-04 at 14:38 +0200, Tejun Heo wrote:
> @cpu parameter is superflous.  Drop it.  This will help unifying
> notifiers in sched.

This seems sensible enough.



* Re: [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
  2010-05-04 17:11   ` Peter Zijlstra
@ 2010-05-04 17:22     ` Tejun Heo
  2010-05-04 17:42       ` Peter Zijlstra
  2010-05-04 18:22       ` Steven Rostedt
  0 siblings, 2 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-04 17:22 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

Hello,

On 05/04/2010 07:11 PM, Peter Zijlstra wrote:
> On Tue, 2010-05-04 at 14:38 +0200, Tejun Heo wrote:
>> Add @rq to perf_event_task_sched_out() so that its argument matches
>> those of trace_sched_switch().  This will help unifying notifiers in
>> sched.
> 
> The alternative is dropping rq from the trace events, its not like
> anybody outside sched.o can do anything sensible with it anyway.

The comment in include/trace/events/sched.h says...

 * (NOTE: the 'rq' argument is not used by generic trace events,
 *        but used by the latency tracer plugin. )

So, I left it there.  It would be great if all those @rq params can be
removed tho.  The latency tracer thing needs it, right?

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-04 12:38 [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP Tejun Heo
                   ` (11 preceding siblings ...)
  2010-05-04 12:38 ` [PATCH 12/12] perf: " Tejun Heo
@ 2010-05-04 17:29 ` Peter Zijlstra
  2010-05-05  5:00   ` Tejun Heo
  12 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-04 17:29 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

On Tue, 2010-05-04 at 14:38 +0200, Tejun Heo wrote:
> 
> This patchset does the following two things.
> 
> * 0001-0007: Unify the three tracers (tracepoints, perf_events and
>   preempt/sched notifiers) in scheduler.

Right, so I really don't like the SCHED_EVENT() thing much, esp the 3
different function postfixes.

> * 0008-0012: Move perf hooks in sched on top of tracepoints if TPs are
>              enabled.

This seems to add a terrible amount of overhead for no particular
reason.

I know Ingo was pushing to consolidate things, but imho this cure is
worse than the ailment.





* Re: [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
  2010-05-04 17:22     ` Tejun Heo
@ 2010-05-04 17:42       ` Peter Zijlstra
  2010-05-04 18:22       ` Steven Rostedt
  1 sibling, 0 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-04 17:42 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

On Tue, 2010-05-04 at 19:22 +0200, Tejun Heo wrote:
> Hello,
> 
> On 05/04/2010 07:11 PM, Peter Zijlstra wrote:
> > On Tue, 2010-05-04 at 14:38 +0200, Tejun Heo wrote:
> >> Add @rq to perf_event_task_sched_out() so that its argument matches
> >> those of trace_sched_switch().  This will help unifying notifiers in
> >> sched.
> > 
> > The alternative is dropping rq from the trace events, its not like
> > anybody outside sched.o can do anything sensible with it anyway.
> 
> The comment in include/trace/events/sched.h says...
> 
>  * (NOTE: the 'rq' argument is not used by generic trace events,
>  *        but used by the latency tracer plugin. )
> 
> So, I left it there.  It would be great if all those @rq params can be
> removed tho.  The latency tracer thing needs it, right?

I had a quick look but couldn't actually find anything.



* Re: [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
  2010-05-04 17:22     ` Tejun Heo
  2010-05-04 17:42       ` Peter Zijlstra
@ 2010-05-04 18:22       ` Steven Rostedt
  2010-05-04 18:26         ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Steven Rostedt @ 2010-05-04 18:22 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, mingo, efault, avi, paulus, acme, linux-kernel

On Tue, 2010-05-04 at 19:22 +0200, Tejun Heo wrote:
> Hello,
> 
> On 05/04/2010 07:11 PM, Peter Zijlstra wrote:
> > On Tue, 2010-05-04 at 14:38 +0200, Tejun Heo wrote:
> >> Add @rq to perf_event_task_sched_out() so that its argument matches
> >> those of trace_sched_switch().  This will help unifying notifiers in
> >> sched.
> > 
> > The alternative is dropping rq from the trace events, its not like
> > anybody outside sched.o can do anything sensible with it anyway.
> 
> The comment in include/trace/events/sched.h says...
> 
>  * (NOTE: the 'rq' argument is not used by generic trace events,
>  *        but used by the latency tracer plugin. )
> 
> So, I left it there.  It would be great if all those @rq params can be
> removed tho.  The latency tracer thing needs it, right?

I believe the rq was used by the original latency tracer code that was
in the -rt patch set before ftrace. It is most likely there for
historical purposes. I guess it would be fine to just remove it.

-- Steve




* Re: [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
  2010-05-04 18:22       ` Steven Rostedt
@ 2010-05-04 18:26         ` Peter Zijlstra
  2010-05-04 18:32           ` Steven Rostedt
  2010-05-07 18:41           ` [tip:sched/core] sched: Remove rq argument to the tracepoints tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-04 18:26 UTC (permalink / raw)
  To: rostedt; +Cc: Tejun Heo, mingo, efault, avi, paulus, acme, linux-kernel

On Tue, 2010-05-04 at 14:22 -0400, Steven Rostedt wrote:
> On Tue, 2010-05-04 at 19:22 +0200, Tejun Heo wrote:
> > Hello,
> > 
> > On 05/04/2010 07:11 PM, Peter Zijlstra wrote:
> > > On Tue, 2010-05-04 at 14:38 +0200, Tejun Heo wrote:
> > >> Add @rq to perf_event_task_sched_out() so that its argument matches
> > >> those of trace_sched_switch().  This will help unifying notifiers in
> > >> sched.
> > > 
> > > The alternative is dropping rq from the trace events, its not like
> > > anybody outside sched.o can do anything sensible with it anyway.
> > 
> > The comment in include/trace/events/sched.h says...
> > 
> >  * (NOTE: the 'rq' argument is not used by generic trace events,
> >  *        but used by the latency tracer plugin. )
> > 
> > So, I left it there.  It would be great if all those @rq params can be
> > removed tho.  The latency tracer thing needs it, right?
> 
> I believe the rq was used by the original latency tracer code that was
> in the -rt patch set before ftrace. It is most likely there for
> historical purposes. I guess it would be fine to just remove it.

I'm building the below, so far so good... ;-)

---
 include/trace/events/sched.h      |   32 ++++++++++----------------------
 kernel/sched.c                    |    8 ++++----
 kernel/trace/trace_sched_wakeup.c |    5 ++---
 3 files changed, 16 insertions(+), 29 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index cfceb0b..4f733ec 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -51,15 +51,12 @@ TRACE_EVENT(sched_kthread_stop_ret,
 
 /*
  * Tracepoint for waiting on task to unschedule:
- *
- * (NOTE: the 'rq' argument is not used by generic trace events,
- *        but used by the latency tracer plugin. )
  */
 TRACE_EVENT(sched_wait_task,
 
-	TP_PROTO(struct rq *rq, struct task_struct *p),
+	TP_PROTO(struct task_struct *p),
 
-	TP_ARGS(rq, p),
+	TP_ARGS(p),
 
 	TP_STRUCT__entry(
 		__array(	char,	comm,	TASK_COMM_LEN	)
@@ -79,15 +76,12 @@ TRACE_EVENT(sched_wait_task,
 
 /*
  * Tracepoint for waking up a task:
- *
- * (NOTE: the 'rq' argument is not used by generic trace events,
- *        but used by the latency tracer plugin. )
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,
 
-	TP_PROTO(struct rq *rq, struct task_struct *p, int success),
+	TP_PROTO(struct task_struct *p, int success),
 
-	TP_ARGS(rq, p, success),
+	TP_ARGS(p, success),
 
 	TP_STRUCT__entry(
 		__array(	char,	comm,	TASK_COMM_LEN	)
@@ -111,31 +105,25 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 );
 
 DEFINE_EVENT(sched_wakeup_template, sched_wakeup,
-	     TP_PROTO(struct rq *rq, struct task_struct *p, int success),
-	     TP_ARGS(rq, p, success));
+	     TP_PROTO(struct task_struct *p, int success),
+	     TP_ARGS(p, success));
 
 /*
  * Tracepoint for waking up a new task:
- *
- * (NOTE: the 'rq' argument is not used by generic trace events,
- *        but used by the latency tracer plugin. )
  */
 DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
-	     TP_PROTO(struct rq *rq, struct task_struct *p, int success),
-	     TP_ARGS(rq, p, success));
+	     TP_PROTO(struct task_struct *p, int success),
+	     TP_ARGS(p, success));
 
 /*
  * Tracepoint for task switches, performed by the scheduler:
- *
- * (NOTE: the 'rq' argument is not used by generic trace events,
- *        but used by the latency tracer plugin. )
  */
 TRACE_EVENT(sched_switch,
 
-	TP_PROTO(struct rq *rq, struct task_struct *prev,
+	TP_PROTO(struct task_struct *prev,
 		 struct task_struct *next),
 
-	TP_ARGS(rq, prev, next),
+	TP_ARGS(prev, next),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
diff --git a/kernel/sched.c b/kernel/sched.c
index 9a0f37c..2642864 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2135,7 +2135,7 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
-		trace_sched_wait_task(rq, p);
+		trace_sched_wait_task(p);
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		ncsw = 0;
@@ -2406,7 +2406,7 @@ out_activate:
 	success = 1;
 
 out_running:
-	trace_sched_wakeup(rq, p, success);
+	trace_sched_wakeup(p, success);
 	check_preempt_curr(rq, p, wake_flags);
 
 	p->state = TASK_RUNNING;
@@ -2580,7 +2580,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 
 	rq = task_rq_lock(p, &flags);
 	activate_task(rq, p, 0);
-	trace_sched_wakeup_new(rq, p, 1);
+	trace_sched_wakeup_new(p, 1);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)
@@ -2800,7 +2800,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(rq, prev, next);
+	trace_sched_switch(prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 0271742..8052446 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -107,8 +107,7 @@ static void probe_wakeup_migrate_task(struct task_struct *task, int cpu)
 }
 
 static void notrace
-probe_wakeup_sched_switch(struct rq *rq, struct task_struct *prev,
-	struct task_struct *next)
+probe_wakeup_sched_switch(struct task_struct *prev, struct task_struct *next)
 {
 	struct trace_array_cpu *data;
 	cycle_t T0, T1, delta;
@@ -200,7 +199,7 @@ static void wakeup_reset(struct trace_array *tr)
 }
 
 static void
-probe_wakeup(struct rq *rq, struct task_struct *p, int success)
+probe_wakeup(struct task_struct *p, int success)
 {
 	struct trace_array_cpu *data;
 	int cpu = smp_processor_id();




* Re: [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
  2010-05-04 18:26         ` Peter Zijlstra
@ 2010-05-04 18:32           ` Steven Rostedt
  2010-05-05  4:48             ` Tejun Heo
  2010-05-05  9:58             ` Tejun Heo
  2010-05-07 18:41           ` [tip:sched/core] sched: Remove rq argument to the tracepoints tip-bot for Peter Zijlstra
  1 sibling, 2 replies; 55+ messages in thread
From: Steven Rostedt @ 2010-05-04 18:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Tejun Heo, mingo, efault, avi, paulus, acme, linux-kernel

On Tue, 2010-05-04 at 20:26 +0200, Peter Zijlstra wrote:
> On Tue, 2010-05-04 at 14:22 -0400, Steven Rostedt wrote:
> > On Tue, 2010-05-04 at 19:22 +0200, Tejun Heo wrote:
> > > Hello,
> > > 
> > > On 05/04/2010 07:11 PM, Peter Zijlstra wrote:
> > > > On Tue, 2010-05-04 at 14:38 +0200, Tejun Heo wrote:
> > > >> Add @rq to perf_event_task_sched_out() so that its argument matches
> > > >> those of trace_sched_switch().  This will help unifying notifiers in
> > > >> sched.
> > > > 
> > > > The alternative is dropping rq from the trace events, its not like
> > > > anybody outside sched.o can do anything sensible with it anyway.
> > > 
> > > The comment in include/trace/events/sched.h says...
> > > 
> > >  * (NOTE: the 'rq' argument is not used by generic trace events,
> > >  *        but used by the latency tracer plugin. )
> > > 
> > > So, I left it there.  It would be great if all those @rq params can be
> > > removed tho.  The latency tracer thing needs it, right?
> > 
> > I believe the rq was used by the original latency tracer code that was
> > in the -rt patch set before ftrace. It is most likely there for
> > historical purposes. I guess it would be fine to just remove it.
> 
> I'm building the below, so far so good... ;-)
> 
> ---
>  include/trace/events/sched.h      |   32 ++++++++++----------------------
>  kernel/sched.c                    |    8 ++++----
>  kernel/trace/trace_sched_wakeup.c |    5 ++---
>  3 files changed, 16 insertions(+), 29 deletions(-)
> 
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index cfceb0b..4f733ec 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -51,15 +51,12 @@ TRACE_EVENT(sched_kthread_stop_ret,
>  
>  /*
>   * Tracepoint for waiting on task to unschedule:
> - *
> - * (NOTE: the 'rq' argument is not used by generic trace events,
> - *        but used by the latency tracer plugin. )
>   */
>  TRACE_EVENT(sched_wait_task,
>  
> -	TP_PROTO(struct rq *rq, struct task_struct *p),
> +	TP_PROTO(struct task_struct *p),

Acked-by: Steven Rostedt <rostedt@goodmis.org>

-- Steve




* Re: [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
  2010-05-04 18:32           ` Steven Rostedt
@ 2010-05-05  4:48             ` Tejun Heo
  2010-05-05  9:58             ` Tejun Heo
  1 sibling, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-05  4:48 UTC (permalink / raw)
  To: rostedt; +Cc: Peter Zijlstra, mingo, efault, avi, paulus, acme, linux-kernel

On 05/04/2010 08:32 PM, Steven Rostedt wrote:
>>> I believe the rq was used by the original latency tracer code that was
>>> in the -rt patch set before ftrace. It is most likely there for
>>> historical purposes. I guess it would be fine to just remove it.
>>
>> I'm building the below, so far so good... ;-)
> 
> Acked-by: Steven Rostedt <rostedt@goodmis.org>

Cool, then I can remove the SCHED_EVENT_RQ() thing.

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-04 17:29 ` [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP Peter Zijlstra
@ 2010-05-05  5:00   ` Tejun Heo
  2010-05-05  9:06     ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2010-05-05  5:00 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

Hello,

On 05/04/2010 07:29 PM, Peter Zijlstra wrote:
>> * 0001-0007: Unify the three tracers (tracepoints, perf_events and
>>   preempt/sched notifiers) in scheduler.
> 
> Right, so I really don't like the SCHED_EVENT() thing much, esp the 3
> different function postfixes.

Yeap, it's not the prettiest thing.  I thought about renaming all of
them so that they share the same postfix but then again I need a way
to tell the script which ones to enable, which is easiest with the
postfixes specified as macro arguments.  Any better ideas?

>> * 0008-0012: Move perf hooks in sched on top of tracepoints if TPs are
>>              enabled.
> 
> This seems to add a terrible amount of overhead for no particular
> reason.

Hmm... What overhead?

What matters more here is avoiding the overhead when perf or whatever
tracing mechanism is disabled.  When both perf and TPs are compiled in
but not enabled, moving perf hooks on top of TPs means that the sched
core code doesn't have to call into extra functions for perf and TPs
can be nooped.  ie. less overhead.  Also, even when perf is enabled,
there isn't any inherent extra runtime overhead other than during
enabling/disabling which again is a much colder path.
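
To illustrate what "nooped" means here, this is roughly what
DECLARE_TRACE() expands to today (simplified sketch, register/unregister
helpers omitted) - with no probe registered, a TP call site is just a
flag test and a predicted-not-taken branch, no function call:

#define DECLARE_TRACE(name, proto, args)				\
	extern struct tracepoint __tracepoint_##name;			\
	static inline void trace_##name(proto)				\
	{								\
		if (unlikely(__tracepoint_##name.state))		\
			__DO_TRACE(&__tracepoint_##name,		\
				   TP_PROTO(proto), TP_ARGS(args));	\
	}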

The only place where noticeable overhead is added is the extra pair of
irq enable/disable that I added for the sched_out hook.  After glancing
through what perf does during a context switch, I thought that overhead
wouldn't be anything noticeable but if it is, they can definitely be
separated.  The only purpose of that change was colocating sched
notifiers and perf hooks.

Also, if a few more perf hooks are converted to use the same
mechanism, it would be possible to put perf properly on top of TPs
except for the init/exit code paths, which will be cleaner for the rest
of the kernel, and there won't be any extra runtime overhead.

The code will be much cleaner if perf depends on TPs.  With perf hooks
completely removed, there won't be much point in SCHED_EVENT and the
messy ifdefs in perf can also go away.  Would this be a possibility?

Thanks.

-- 
tejun


* Re: [PATCH 03/12] perf: add perf_event_task_migrate()
  2010-05-04 12:38 ` [PATCH 03/12] perf: add perf_event_task_migrate() Tejun Heo
@ 2010-05-05  5:08   ` Frederic Weisbecker
  2010-05-05  5:16     ` Tejun Heo
  2010-05-05  9:11     ` Peter Zijlstra
  0 siblings, 2 replies; 55+ messages in thread
From: Frederic Weisbecker @ 2010-05-05  5:08 UTC (permalink / raw)
  To: Tejun Heo
  Cc: mingo, peterz, efault, avi, paulus, acme, linux-kernel, Peter Zijlstra

On Tue, May 04, 2010 at 02:38:35PM +0200, Tejun Heo wrote:
> Instead of calling perf_sw_event() directly from set_task_cpu(),
> implement perf_event_task_migrate() which takes the same arguments as
> trace_sched_migrate_task() and invokes perf_sw_event() if the task is
> really migrating (cur_cpu != new_cpu).  This will help unifying
> notifiers in sched.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Paul Mackerras <paulus@samba.org>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
> ---
>  include/linux/perf_event.h |    3 +++
>  kernel/perf_event.c        |   11 +++++++++++
>  kernel/sched.c             |    5 ++---
>  3 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index c8e3754..a5eec48 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -754,6 +754,7 @@ extern int perf_max_events;
>  
>  extern const struct pmu *hw_perf_event_init(struct perf_event *event);
>  
> +extern void perf_event_task_migrate(struct task_struct *task, int new_cpu);
>  extern void perf_event_task_sched_in(struct task_struct *task);
>  extern void perf_event_task_sched_out(struct task_struct *task, struct task_struct *next);
>  extern void perf_event_task_tick(struct task_struct *task);
> @@ -949,6 +950,8 @@ extern void perf_event_enable(struct perf_event *event);
>  extern void perf_event_disable(struct perf_event *event);
>  #else
>  static inline void
> +perf_event_task_migrate(struct task_struct *task, int new_cpu)		{ }
> +static inline void
>  perf_event_task_sched_in(struct task_struct *task)			{ }
>  static inline void
>  perf_event_task_sched_out(struct task_struct *task,
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index 3d1552d..a01ba31 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -1148,6 +1148,17 @@ static void perf_event_sync_stat(struct perf_event_context *ctx,
>  }
>  
>  /*
> + * Called from scheduler set_task_cpu() to notify migration events.
> + * If the task is moving to a different cpu, generate a migration sw
> + * event.
> + */
> +void perf_event_task_migrate(struct task_struct *task, int new_cpu)
> +{
> +	if (task_cpu(task) != new_cpu)
> +		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
> +}



This needs to be static and inline (I haven't seen external users in this
patchset).

And we want it to be inlined because we save the caller address and the frame
pointer from perf_sw_event(), and a new level of call is not wanted here.
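
For reference, the perf_sw_event() fast path currently looks roughly
like this (simplified); note the fixed skip level passed to
perf_fetch_caller_regs(), which is what an extra call level would
throw off:

static inline void
perf_sw_event(u32 event_id, u64 nr, int nmi, struct pt_regs *regs, u64 addr)
{
	if (atomic_read(&perf_swevent_enabled[event_id])) {
		struct pt_regs hot_regs;

		if (!regs) {
			/* skip == 1 accounts for this inlined frame */
			perf_fetch_caller_regs(&hot_regs, 1);
			regs = &hot_regs;
		}
		__perf_sw_event(event_id, nr, nmi, regs, addr);
	}
}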



> +
> +/*
>   * Called from scheduler to remove the events of the current task,
>   * with interrupts disabled.
>   *
> diff --git a/kernel/sched.c b/kernel/sched.c
> index c20fd31..2568911 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2084,11 +2084,10 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  #endif
>  
>  	trace_sched_migrate_task(p, new_cpu);
> +	perf_event_task_migrate(p, new_cpu);
>  
> -	if (task_cpu(p) != new_cpu) {
> +	if (task_cpu(p) != new_cpu)



In fact, why not move both tracing calls under this check?
This is going to fix the migrate trace event that gets called
even on "spurious" migrations, and you avoid the duplicate check
in the perf callback.
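
IOW, something like this in set_task_cpu() (untested sketch), with the
task_cpu() check then dropped from perf_event_task_migrate() itself:

	if (task_cpu(p) != new_cpu) {
		trace_sched_migrate_task(p, new_cpu);
		perf_event_task_migrate(p, new_cpu);
		p->se.nr_migrations++;
	}

	__set_task_cpu(p, new_cpu);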

Thanks.


>  		p->se.nr_migrations++;
> -		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
> -	}
>  
>  	__set_task_cpu(p, new_cpu);
>  }
> -- 
> 1.6.4.2
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/



* Re: [PATCH 03/12] perf: add perf_event_task_migrate()
  2010-05-05  5:08   ` Frederic Weisbecker
@ 2010-05-05  5:16     ` Tejun Heo
  2010-05-05  9:11     ` Peter Zijlstra
  1 sibling, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-05  5:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: mingo, peterz, efault, avi, paulus, acme, linux-kernel, Peter Zijlstra

Hello,

On 05/05/2010 07:08 AM, Frederic Weisbecker wrote:
>>  /*
>> + * Called from scheduler set_task_cpu() to notify migration events.
>> + * If the task is moving to a different cpu, generate a migration sw
>> + * event.
>> + */
>> +void perf_event_task_migrate(struct task_struct *task, int new_cpu)
>> +{
>> +	if (task_cpu(task) != new_cpu)
>> +		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
>> +}
> 
> This needs to be static and inline (I haven't seem external users in this
> patchset).

Hmm... after the last patch, this one becomes a TP probe function
which can't be inlined.

> And we want it to be inlined because we save the caller address and the frame
> pointer from perf_sw_event(), and a new level of call is not wanted here.

Oh I see.  So, to use it with TP, I would need to increase the @skip
parameter to perf_fetch_caller_regs() somehow, right?  Also, it
probably would be a good idea to add a comment w/ big fat warning to
perf_sw_event().

>> @@ -2084,11 +2084,10 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>>  #endif
>>  
>>  	trace_sched_migrate_task(p, new_cpu);
>> +	perf_event_task_migrate(p, new_cpu);
>>  
>> -	if (task_cpu(p) != new_cpu) {
>> +	if (task_cpu(p) != new_cpu)
> 
> In fact, why not move both tracing calls under this check?
> This is going to fix the migrate trace event that gets called
> even on "spurious" migrations, and you avoid the duplicate check
> in the perf callback.

Yeah, that would be my preferred choice too.  I just didn't know that
could be changed.  Cool, I'll do that.

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05  5:00   ` Tejun Heo
@ 2010-05-05  9:06     ` Peter Zijlstra
  2010-05-05  9:32       ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-05  9:06 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

On Wed, 2010-05-05 at 07:00 +0200, Tejun Heo wrote:
> Hello,
> 
> On 05/04/2010 07:29 PM, Peter Zijlstra wrote:
> >> * 0001-0007: Unify the three tracers (tracepoints, perf_events and
> >>   preempt/sched notifiers) in scheduler.
> > 
> > Right, so I really don't like the SCHED_EVENT() thing much, esp the 3
> > different function postfixes.
> 
> Yeap, it's not the prettiest thing.  I thought about renaming all of
> them so that they share the same postfix but then again I need a way
> to tell the script which to enable which is the easiest with
> specifying postfixes as macro argument.  Any better ideas?

Add more NOP functions for the missing bits I guess, but that gets
cumbersome too.

> >> * 0008-0012: Move perf hooks in sched on top of tracepoints if TPs are
> >>              enabled.
> > 
> > This seems to add a terrible amount of overhead for no particular
> > reason.
> 
> Hmm... What overhead?

Well, the perf_{inc,dec}_nr_events() stuff, that's massively painful. If
that is the price to pay for using tracepoints then I'd rather not.

Also, the whole magic dance to keep !CONFIG_TRACEPOINTS working doesn't
make the code any saner.

> What matters more here is avoiding the overhead when perf or whatever
> tracing mechanism is disabled.  When both perf and TPs are compiled in
> but not enabled, moving perf hooks on top of TPs means that the sched
> core code doesn't have to call into extra functions for perf and TPs
> can be nooped.  ie. less overhead.  Also, even when perf is enabled,
> there isn't any inherent extra runtime overhead other than during
> enabling/disabling which again is a much colder path.

Well, in the current code we get the call overhead, but all perf
functions bail on !nr_events. We could invert that and have an inline
function check nr_events and only then do the call.
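
Something like the below (untested sketch; __perf_event_task_tick()
and perf_nr_events are made-up names here, and the counter would have
to become visible to the header in some form):

extern int perf_nr_events;
extern void __perf_event_task_tick(struct task_struct *task);

static inline void perf_event_task_tick(struct task_struct *task)
{
	/* bail here in the caller, avoiding the function call entirely */
	if (likely(!perf_nr_events))
		return;
	__perf_event_task_tick(task);
}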

> The only place where noticeable overhead is added is the extra pair of
> irq enable/disable that I added for sched_out hook.  After glancing
> through what perf does during context switch, I thought that overhead
> wouldn't be anything noticeable but if it is, they can definitely be
> separated.  The only purpose of that change was colocating sched
> notifiers and perf hooks.

Right, recent Intel chips seem to do much better at IRQ disable than say
P4 and IA64, and poking at the PMU MSRs is way more painful, not sure
how it would balance out for PowerPC (or anything else) though.

> Also, if a few more perf hooks are converted to use the same
> mechanism, it would be possible to put perf properly on top of TPs
> other than init/exit code paths which will be cleaner for the rest of
> the kernel and there won't be any extra runtime overhead.
> 
> The code will be much cleaner if perf depends on TPs.  With perf hooks
> completely removed, there won't be much point in SCHED_EVENT and the
> messy ifdefs in perf can also go away.  Would this be a possibility?

Well, I already utterly hate that x86 can't build with !
CONFIG_PERF_EVENTS, also requiring CONFIG_TRACEPOINTS is going the wrong
way.

I've always explicitly avoided depending on tracepoints, and I'd very
much like to keep it that way.



* Re: [PATCH 03/12] perf: add perf_event_task_migrate()
  2010-05-05  5:08   ` Frederic Weisbecker
  2010-05-05  5:16     ` Tejun Heo
@ 2010-05-05  9:11     ` Peter Zijlstra
  2010-05-05  9:37       ` Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-05  9:11 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Tejun Heo, mingo, efault, avi, paulus, acme, linux-kernel

On Wed, 2010-05-05 at 07:08 +0200, Frederic Weisbecker wrote:
> 
> In fact, why not move both tracing calls under this check?
> This is going to fix the migrate trace event that gets called
> even on "spurious" migrations, and you avoid the duplicate check
> in the perf callback. 

I kept the tracepoint out of that because I wanted to see how often it
attempts silly migrations :-)





* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05  9:06     ` Peter Zijlstra
@ 2010-05-05  9:32       ` Tejun Heo
  2010-05-05  9:51         ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2010-05-05  9:32 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

Hello,

On 05/05/2010 11:06 AM, Peter Zijlstra wrote:
>> Yeap, it's not the prettiest thing.  I thought about renaming all of
>> them so that they share the same postfix but then again I need a way
>> to tell the script which to enable which is the easiest with
>> specifying postfixes as macro argument.  Any better ideas?
> 
> Add more NOP functions for the missing bits I guess, but that gets
> cumbersome too.

Hmm... yeah, we'll need NOP functions for every missing hook.  Might
as well just spell each one out.

>> Hmm... What overhead?
> 
> Well, the perf_{inc,dec}_nr_events() stuff, that's massively painful. If
> that is the price to pay for using tracepoints then I'd rather not.

Ah, okay, you mean the code complexity overhead.  Eh well, if you
wanna do it on-demand, there's a certain amount you need to pay.  As
long as the runtime overhead on the hot path isn't increased, I don't think
it's too bad tho.

> Also, the whole magic dance to keep !CONFIG_TRACEPOINTS working doesn't
> make the code any saner.

Things would have been much prettier if PERF could simply depend on
TRACEPOINTS.  More on this later.

> Well, in the current code we get the call overhead, but all perf
> functions bail on !nr_events. We could invert that and have an inline
> function check nr_events and only then do the call.

The thing is that unifying hooks not only saves the extra check now
but also makes future optimizations much easier, e.g. Ingo was talking
about runtime patching TPs so that disabled ones just become noops.
It's far better to have all the tracing stuff on single hooking
mechanism for optimization efforts like that.

> Right, recent Intel chips seem to do much better at IRQ disable than say
> P4 and IA64, and poking at the PMU MSRs is way more painful, not sure
> how it would balance out for PowerPC (or anything else) though.

As I wrote above, here, the problem is mostly cosmetic.  It's just
that the kvm hook needs irqs on while the perf hook can save one irq flip
if called with irqs off.  If we call them separately, there's no
problem.  If we wanna colocate them, it makes sense to pull perf hook
out of irq disabled section.  This one doesn't really make that much
of a difference.

> Well, I already utterly hate that x86 can't build with !
> CONFIG_PERF_EVENTS, also requiring CONFIG_TRACEPOINTS is going the wrong
> way.
> 
> I've always explicitly avoided depending on tracepoints, and I'd very
> much like to keep it that way.

I was wondering the other way around - ie. the possibility to make
perf optional and maybe even a module which depends on TPs, which
would be nicer than the current situation and make the code less
cluttered too.

Thanks.

-- 
tejun


* Re: [PATCH 03/12] perf: add perf_event_task_migrate()
  2010-05-05  9:11     ` Peter Zijlstra
@ 2010-05-05  9:37       ` Tejun Heo
  2010-05-05  9:50         ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2010-05-05  9:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, mingo, efault, avi, paulus, acme, linux-kernel

On 05/05/2010 11:11 AM, Peter Zijlstra wrote:
> On Wed, 2010-05-05 at 07:08 +0200, Frederic Weisbecker wrote:
>>
>> In fact, why not move both tracing calls under this check?
>> This is going to fix the migrate trace event that gets called
>> even on "spurious" migrations, and you avoid the duplicate check
>> in the perf callback. 
> 
> I kept the tracepoint out of that because I wanted to see how often it
> attempts silly migrations :-)

Do you still need to keep it or is it okay to move it under the
silliness check?

Thanks.

-- 
tejun


* Re: [PATCH 03/12] perf: add perf_event_task_migrate()
  2010-05-05  9:37       ` Tejun Heo
@ 2010-05-05  9:50         ` Peter Zijlstra
  2010-05-05  9:56           ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-05  9:50 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, mingo, efault, avi, paulus, acme, linux-kernel

On Wed, 2010-05-05 at 11:37 +0200, Tejun Heo wrote:
> On 05/05/2010 11:11 AM, Peter Zijlstra wrote:
> > On Wed, 2010-05-05 at 07:08 +0200, Frederic Weisbecker wrote:
> >>
> >> In fact, why not move both tracing calls under this check?
> >> This is going to fix the migrate trace event that gets called
> >> even on "spurious" migrations, and you avoid the duplicate check
> >> in the perf callback. 
> > 
> > I kept the tracepoint out of that because I wanted to see how often it
> > attempts silly migrations :-)
> 
> Do you still need to keep it or is it okay to move it under the
> silliness check?

I guess you can move it, I often end up adding tons of trace_printk()
anyway when hunting funnies, might as well add one here when it's
relevant.



* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05  9:32       ` Tejun Heo
@ 2010-05-05  9:51         ` Peter Zijlstra
  2010-05-05  9:54           ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-05  9:51 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

On Wed, 2010-05-05 at 11:32 +0200, Tejun Heo wrote:
> 
> > Well, I already utterly hate that x86 can't build with !
> > CONFIG_PERF_EVENTS, also requiring CONFIG_TRACEPOINTS is going the wrong
> > way.
> > 
> > I've always explicitly avoided depending on tracepoints, and I'd very
> > much like to keep it that way.
> 
> I was wondering the other way around - ie. the possibility to make
> perf optional and maybe even a module which depends on TPs, which
> would be nicer than the current situation and make the code less
> cluttered too. 

I really really hate making perf rely on tracepoints.



* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05  9:51         ` Peter Zijlstra
@ 2010-05-05  9:54           ` Tejun Heo
  2010-05-05 11:38             ` Peter Zijlstra
  2010-05-10  5:20             ` Paul Mackerras
  0 siblings, 2 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-05  9:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

Hello,

On 05/05/2010 11:51 AM, Peter Zijlstra wrote:
>> I was wondering the other way around - ie. the possibility to make
>> perf optional and maybe even a module which depends on TPs, which
>> would be nicer than the current situation and make the code less
>> cluttered too. 
> 
> I really really hate making perf rely on tracepoints.

Hmmm.... may I ask why?  Unifying the hooking mechanisms seems like a
good idea to me, and it's not like it's gonna add any runtime overhead.
It does complicate init/exit, but that's something you have to pay if
you wanna do things dynamically, and sans the ifdef stuff it's like a
couple hundred lines of isolated code.

Thanks.

-- 
tejun


* Re: [PATCH 03/12] perf: add perf_event_task_migrate()
  2010-05-05  9:50         ` Peter Zijlstra
@ 2010-05-05  9:56           ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-05  9:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, mingo, efault, avi, paulus, acme, linux-kernel

On 05/05/2010 11:50 AM, Peter Zijlstra wrote:
> I guess you can move it, I often end up adding tons of trace_printk()
> anyway when hunting funnies, might as well add one here when its
> relevant.

Great, will do so.

Thanks.

-- 
tejun


* Re: [PATCH 04/12] perf: add @rq to perf_event_task_sched_out()
  2010-05-04 18:32           ` Steven Rostedt
  2010-05-05  4:48             ` Tejun Heo
@ 2010-05-05  9:58             ` Tejun Heo
  1 sibling, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-05  9:58 UTC (permalink / raw)
  To: rostedt; +Cc: Peter Zijlstra, mingo, efault, avi, paulus, acme, linux-kernel

On 05/04/2010 08:32 PM, Steven Rostedt wrote:
>> ---
>>  include/trace/events/sched.h      |   32 ++++++++++----------------------
>>  kernel/sched.c                    |    8 ++++----
>>  kernel/trace/trace_sched_wakeup.c |    5 ++---
>>  3 files changed, 16 insertions(+), 29 deletions(-)
> 
> Acked-by: Steven Rostedt <rostedt@goodmis.org>

Peter, can I include this patch into my series?  If so, can you please
post a patch w/ description?

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05  9:54           ` Tejun Heo
@ 2010-05-05 11:38             ` Peter Zijlstra
  2010-05-05 12:28               ` Tejun Heo
  2010-05-05 12:33               ` Avi Kivity
  2010-05-10  5:20             ` Paul Mackerras
  1 sibling, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-05 11:38 UTC (permalink / raw)
  To: Tejun Heo; +Cc: mingo, efault, avi, paulus, acme, linux-kernel

On Wed, 2010-05-05 at 11:54 +0200, Tejun Heo wrote:
> Hello,
> 
> On 05/05/2010 11:51 AM, Peter Zijlstra wrote:
> >> I was wondering the other way around - ie. the possibility to make
> >> perf optional and maybe even a module which depends on TPs, which
> >> would be nicer than the current situation and make the code less
> >> cluttered too. 
> > 
> > I really really hate making perf rely on tracepoints.
> 
> Hmmm.... may I ask why? 

Because it will bloat the kernel for no reason. And I don't like the
endless indirections either.

Like I said, I already really dislike that x86 won't even build without
CONFIG_PERF_EVENTS, adding a hard requirement on TRACEEVENTS will only
make the whole thing even worse.

Sure, most distro configs will include both perf and tracepoints, but I
don't want to burden everybody who wants to use perf with the tracepoint
overhead.

As it stands I'd argue to simply drop this whole idea. The SCHED_EVENT()
thing doesn't look like it's worth the obfuscation and I'm very much
opposed to making perf and sched_notifiers rely on tracepoints.

I don't see the few direct function calls we have now as a big problem
and it sure as hell is a lot easier to read than the whole tracepoint
indirection mess.

If you care about the call overhead of the perf hooks, we can do the
same immediate value optimization to them as would be done to the
tracepoints.
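
I.e. keep the direct call but guard it with a load that gets patched
at runtime, roughly like this hand-waving sketch (the immediate-values
infrastructure isn't upstream; imv_read() is from those patches and
perf_sched_events is a made-up variable):

static inline void perf_event_task_sched_in(struct task_struct *task)
{
	/* the imv_read() load gets patched to a constant when perf is idle */
	if (unlikely(imv_read(perf_sched_events)))
		__perf_event_task_sched_in(task);
}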




* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 11:38             ` Peter Zijlstra
@ 2010-05-05 12:28               ` Tejun Heo
  2010-05-05 16:55                 ` Ingo Molnar
  2010-05-05 12:33               ` Avi Kivity
  1 sibling, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2010-05-05 12:28 UTC (permalink / raw)
  To: Peter Zijlstra, mingo; +Cc: efault, avi, paulus, acme, linux-kernel

Hello,

On 05/05/2010 01:38 PM, Peter Zijlstra wrote:
> As it stands I'd argue to simply drop this whole idea. The SCHED_EVENT()
> thing doesn't look like it's worth the obfuscation and I'm very much
> opposed to making perf and sched_notifiers rely on tracepoints.

Hmmm... okay.  Well, this thing is born out of Ingo's suggestion that
for more sched notifiers to be added notification mechanisms need to
be consolidated.  As long as I can get those two notifiers needed for
cmwq, it's okay with me.  Ingo, what do you think?

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 11:38             ` Peter Zijlstra
  2010-05-05 12:28               ` Tejun Heo
@ 2010-05-05 12:33               ` Avi Kivity
  2010-05-05 13:09                 ` Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: Avi Kivity @ 2010-05-05 12:33 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Tejun Heo, mingo, efault, paulus, acme, linux-kernel

On 05/05/2010 02:38 PM, Peter Zijlstra wrote:
> As it stands I'd argue to simply drop this whole idea. The SCHED_EVENT()
> thing doesn't look like it's worth the obfuscation and I'm very much
> opposed to making perf and sched_notifiers rely on tracepoints.
>    

We can make perf rely on sched_notifiers, I think it's exactly the same 
case as kvm - extra cpu state that needs switching at context switch 
time.  We could also make fpu switching use sched_notifiers, but that's 
pushing it a bit.

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 12:33               ` Avi Kivity
@ 2010-05-05 13:09                 ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-05 13:09 UTC (permalink / raw)
  To: Avi Kivity; +Cc: Peter Zijlstra, mingo, efault, paulus, acme, linux-kernel

Hello,

On 05/05/2010 02:33 PM, Avi Kivity wrote:
> On 05/05/2010 02:38 PM, Peter Zijlstra wrote:
>> As it stands I'd argue to simply drop this whole idea. The SCHED_EVENT()
>> thing doesn't look like it's worth the obfuscation and I'm very much
>> opposed to making perf and sched_notifiers rely on tracepoints.
> 
> We can make perf rely on sched_notifiers, I think it's exactly the same
> case as kvm - extra cpu state that needs switching at context switch
> time.  We could also make fpu switching use sched_notifiers, but that's
> pushing it a bit.

Oh, that was my first attempt too.  The difference between TPs and
sched_notifiers is the enable condition.  TP is system-wide while
sched_notifier is per-task.  The thing is that for perf hooks,
a system-wide enable condition works much better.  Also, w/
sched_notifiers, there's the problem that no one but the task itself can
enable/disable them (which can be changed, but doing it without causing
call time overhead requires a bit of effort).
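
For reference, firing a sched notifier is just a walk of the list
hanging off the task being switched, roughly like this (field and
function names approximate, modulo the renames in this series), which
is why the enable condition is inherently per-task:

static void fire_sched_in_notifiers(struct task_struct *curr)
{
	struct sched_notifier *notifier;
	struct hlist_node *node;

	/* only tasks which registered a notifier pay anything here */
	hlist_for_each_entry(notifier, node, &curr->sched_notifiers, link)
		notifier->ops->sched_in(notifier);
}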

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 12:28               ` Tejun Heo
@ 2010-05-05 16:55                 ` Ingo Molnar
  2010-05-05 18:12                   ` Peter Zijlstra
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2010-05-05 16:55 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, efault, avi, paulus, acme, linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 05/05/2010 01:38 PM, Peter Zijlstra wrote:
> > As it stands I'd argue to simply drop this whole idea. The SCHED_EVENT()
> > thing doesn't look like it's worth the obfuscation and I'm very much
> > opposed to making perf and sched_notifiers rely on tracepoints.
> 
> Hmmm... okay.  Well, this thing is born out of Ingo's suggestion that for 
> more sched notifiers to be added, the notification mechanisms need to be 
> consolidated.  As long as I can get those two notifiers needed for cmwq, 
> it's okay with me.  Ingo, what do you think?

I'd much rather see the *_EVENT() APIs used - but enhanced to address Peter's 
objections. One change would be to add a DEFINE_EVENT_ABI() variant, which 
would be called via trace_abi_*() calls. That way we always know they are 
'hardwired' events in the extreme.

That then would allow the software events to be consolidated.

Peter?

	Ingo


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 16:55                 ` Ingo Molnar
@ 2010-05-05 18:12                   ` Peter Zijlstra
  2010-05-05 18:16                     ` Peter Zijlstra
  2010-05-06  6:31                     ` Ingo Molnar
  0 siblings, 2 replies; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-05 18:12 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Tejun Heo, efault, avi, paulus, acme, linux-kernel

On Wed, 2010-05-05 at 18:55 +0200, Ingo Molnar wrote:
> * Tejun Heo <tj@kernel.org> wrote:
> 
> > Hello,
> > 
> > On 05/05/2010 01:38 PM, Peter Zijlstra wrote:
> > > As it stands I'd argue to simply drop this whole idea. The SCHED_EVENT()
> > > thing doesn't look like its worth the obfuscation and I'm very much
> > > opposed to making perf and sched_notifiers rely on tracepoints.
> > 
> > Hmmm... okay.  Well, this thing is born out of Ingo's suggestion that,
> > for more sched notifiers to be added, the notification mechanisms need
> > to be consolidated.  As long as I can get those two notifiers needed
> > for cmwq, it's okay with me.  Ingo, what do you think?
> 
> I'd much rather see the *_EVENT() APIs used - but enhanced to address Peter's 
> objections. One change would be to add a DEFINE_EVENT_ABI() variant, which 
> would be called via trace_abi_*() calls. That way we always know they are 
> 'hardwired' events in the extreme.
> 
> That then would allow the software events to be consolidated.
> 
> Peter?

Well, I'd much rather just see a direct call in the code than having to
reverse engineer wth hangs onto that _EVENT() junk.

Also, we don't need ABI muck for this.



* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 18:12                   ` Peter Zijlstra
@ 2010-05-05 18:16                     ` Peter Zijlstra
  2010-05-05 18:30                       ` Frederic Weisbecker
  2010-05-06  6:31                     ` Ingo Molnar
  1 sibling, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-05 18:16 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Tejun Heo, efault, avi, paulus, acme, linux-kernel

On Wed, 2010-05-05 at 20:15 +0200, Peter Zijlstra wrote:
> 
> Well, I'd much rather just see a direct call in the code than having to
> reverse engineer wth hangs onto that _EVENT() junk.

And again, I oppose mandating CONFIG_TRACEEVENT.



* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 18:16                     ` Peter Zijlstra
@ 2010-05-05 18:30                       ` Frederic Weisbecker
  2010-05-06  6:28                         ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Frederic Weisbecker @ 2010-05-05 18:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Tejun Heo, efault, avi, paulus, acme, linux-kernel

On Wed, May 05, 2010 at 08:16:36PM +0200, Peter Zijlstra wrote:
> On Wed, 2010-05-05 at 20:15 +0200, Peter Zijlstra wrote:
> > 
> > Well, I'd much rather just see a direct call in the code than having to
> > reverse engineer wth hangs onto that _EVENT() junk.
> 
> And again, I oppose mandating CONFIG_TRACEEVENT.


And me too. But you don't need CONFIG_EVENT_TRACING for that.
TRACE_EVENT() with !CONFIG_EVENT_TRACING only produces tracepoints
if CONFIG_TRACEPOINTS.


In fact, a first step that would handle these compromises would
be to have CONFIG_PERF_EVENT_SW.

For now perf_event_task_sched_in and perf_event_task_sched_out can
stay as is because they are perf core utils.

But all the rest (faults, migrations, etc.) could be tracepoints built in
only if CONFIG_PERF_EVENT_SW is set, which means CONFIG_PERF_EVENT_SW
depends on CONFIG_TRACEPOINTS.

But nobody is forced to build CONFIG_PERF_EVENT_SW, breakpoints don't need
it.
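
A sketch of how that split could look (CONFIG_PERF_EVENT_SW is the
symbol proposed above, the probe name is made up, and the
perf_sw_event() signature is approximated from kernels of the time):

  #ifdef CONFIG_PERF_EVENT_SW
  /* hypothetical probe: count migrations via the existing tracepoint */
  static void perf_probe_task_migrate(struct task_struct *p, int dest_cpu)
  {
          perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 1, NULL, 0);
  }

  static int __init perf_sw_init(void)
  {
          return register_trace_sched_migrate_task(perf_probe_task_migrate);
  }
  #endif  /* CONFIG_PERF_EVENT_SW */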



* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 18:30                       ` Frederic Weisbecker
@ 2010-05-06  6:28                         ` Ingo Molnar
  2010-05-06  7:11                           ` Tejun Heo
  2010-05-06  8:18                           ` Avi Kivity
  0 siblings, 2 replies; 55+ messages in thread
From: Ingo Molnar @ 2010-05-06  6:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Tejun Heo, efault, avi, paulus, acme, linux-kernel


* Frederic Weisbecker <fweisbec@gmail.com> wrote:

> On Wed, May 05, 2010 at 08:16:36PM +0200, Peter Zijlstra wrote:
> > On Wed, 2010-05-05 at 20:15 +0200, Peter Zijlstra wrote:
> > > 
> > > Well, I'd much rather just see a direct call in the code than having to
> > > reverse engineer wth hangs onto that _EVENT() junk.
> > 
> > And again, I oppose mandating CONFIG_TRACEEVENT.
> 
> 
> And me too. But you don't need CONFIG_EVENT_TRACING for that. TRACE_EVENT() 
> with !CONFIG_EVENT_TRACING only produces tracepoints if CONFIG_TRACEPOINTS.
> 
> In fact, a first step that would handle these compromises would be to
> have CONFIG_PERF_EVENT_SW.
> 
> For now perf_event_task_sched_in and perf_event_task_sched_out can stay as 
> is because they are perf core utils.
> 
> But all the rest (faults, migrations, etc.) could be tracepoints built in
> only if CONFIG_PERF_EVENT_SW is set, which means CONFIG_PERF_EVENT_SW
> depends on CONFIG_TRACEPOINTS.
> 
> But nobody is forced to build CONFIG_PERF_EVENT_SW, breakpoints don't need 
> it.

Yep. Also, most of the default distro kernels will have 3 sets of facilities:

 - preempt notifiers
 - tracepoints
 - sw events

which is crazy. We can just standardize on using the tracepoint interface 
definition methods - they are properly typed, widespread and well-known enough 
to be perfect for this.

( They are also under intense optimization - the jump-tracepoints patch makes 
  them probably even cheaper than preempt notifiers, in the off case. )
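
Schematically, the off-case they optimize is the difference between
these two forms (a simplified sketch; the flag name is made up and the
real jump-label implementation differs in detail):

  /* today: a load plus a conditional branch at every call site */
  if (unlikely(sched_switch_tp_active))
          do_trace_sched_switch(prev, next);

  /*
   * with jump-tracepoints the compare goes away: the call site holds
   * a single nop which is live-patched into a jump when the
   * tracepoint gets enabled, so the disabled case costs ~nothing.
   */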

So let's get over this and consolidate our crazy hookery, and stand behind a 
single facility.

I'm also all for slimming down the trace events facilities by not requiring 
the /debug/tracing/ bits to be present.

	Ingo


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05 18:12                   ` Peter Zijlstra
  2010-05-05 18:16                     ` Peter Zijlstra
@ 2010-05-06  6:31                     ` Ingo Molnar
  2010-05-06  7:04                       ` Peter Zijlstra
  1 sibling, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2010-05-06  6:31 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Tejun Heo, efault, avi, paulus, acme, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> Well, I'd much rather just see a direct call in the code than having to 
> reverse engineer wth hangs onto that _EVENT() junk.

Direct calls into code were fine 10 years ago, but since then we got:

 - preempt notifiers
 - sw events
 - tracepoints

Which add up to a lot more than just a plain call into code.

Also, with the jump-optimizations we will have tracepoints that are _cheaper_ 
than a plain function call.

> Also, we don't need ABI muck for this.

we already have ABIs in place here - this would just properly unify and 
enumerate them.

	Ingo


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-06  6:31                     ` Ingo Molnar
@ 2010-05-06  7:04                       ` Peter Zijlstra
  2010-05-06  7:11                         ` Ingo Molnar
  0 siblings, 1 reply; 55+ messages in thread
From: Peter Zijlstra @ 2010-05-06  7:04 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Tejun Heo, efault, avi, paulus, acme, linux-kernel

On Thu, 2010-05-06 at 08:31 +0200, Ingo Molnar wrote:
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > Well, I'd much rather just see a direct call in the code than having to 
> > reverse engineer wth hangs onto that _EVENT() junk.
> 
> Direct calls into code were fine 10 years ago, but since then we got:
> 
>  - preempt notifiers

Are per task and no good for driving perf.

>  - sw events
>  - tracepoints

Are unrelated to the core perf scheduler calls.

> Which add up to a lot more than just a plain call into code.
> 
> Also, with the jump-optimizations we will have tracepoints that are _cheaper_ 
> than a plain function call.

Which can just as easily be used on the core perf hooks.

> > Also, we don't need ABI muck for this.
> 
> we already have an ABIs in place here - this would just properly unify and 
> enumerate it.

I'm not getting it, this is about in-kernel stuff, there are _NO_ in
kernel ABIs.


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-06  6:28                         ` Ingo Molnar
@ 2010-05-06  7:11                           ` Tejun Heo
  2010-05-06  8:27                             ` Ingo Molnar
  2010-05-06  8:18                           ` Avi Kivity
  1 sibling, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2010-05-06  7:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, Peter Zijlstra, efault, avi, paulus, acme,
	linux-kernel

Hello,

On 05/06/2010 08:28 AM, Ingo Molnar wrote:
> ( They are also under intense optimization - the jump-tracepoints patch makes 
>   them probably even cheaper than preempt notifiers, in the off case. )

I mostly agree on other points but TPs and sched_notifiers are
inherently different in how they are enabled/disabled.
sched_notifiers are enabled/disabled per-task, and at least w/ cmwq
there will always be some tasks with active sched_notifiers, so
code-level optimizations aren't really useful.
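
Schematically, the fire path looks like this (type and field names
follow the patchset's renaming and are assumptions):

  static inline void fire_sched_notifiers_out(struct task_struct *prev,
                                              struct task_struct *next)
  {
          struct sched_notifier *sn;
          struct hlist_node *pos;

          /* the enable condition lives in the task, not in the text,
           * so a patched-out call site can never help while some task
           * always keeps a notifier registered */
          hlist_for_each_entry(sn, pos, &prev->sched_notifiers, link)
                  sn->ops->out(sn, next);
  }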

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-06  7:04                       ` Peter Zijlstra
@ 2010-05-06  7:11                         ` Ingo Molnar
  2010-05-06  7:29                           ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2010-05-06  7:11 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Tejun Heo, efault, avi, paulus, acme, linux-kernel


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, 2010-05-06 at 08:31 +0200, Ingo Molnar wrote:
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > Well, I'd much rather just see a direct call in the code than having to 
> > > reverse engineer wth hangs onto that _EVENT() junk.
> > 
> > Direct calls into code were fine 10 years ago, but since then we got:
> > 
> >  - preempt notifiers
> 
> Are per task and no good for driving perf.

No, but they could be expressed via tracepoints.

> >  - sw events
> >  - tracepoints
> 
> Are unrelated to the core perf scheduler calls.

Still, they pose 3 different types of callback regimes, all for a
slightly different purpose.

> > Which add up to a lot more than just a plain call into code.
> > 
> > Also, with the jump-optimizations we will have tracepoints that are 
> > _cheaper_ than a plain function call.
> 
> Which can just as easily be used on the core perf hooks.
> 
> > > Also, we don't need ABI muck for this.
> > 
> > we already have ABIs in place here - this would just properly unify and 
> > enumerate them.
> 
> I'm not getting it, this is about in-kernel stuff, there are _NO_ in kernel 
> ABIs.

Yes - but the callbacks i mentioned do include ABIs, such as the sw event 
callbacks.

So my point is to have one coherent callback mechanism for it.

That said, i agree with you that turning the basic perf scheduling calls into 
events goes one step too far. But all the _other_ 3 types of callbacks (and we 
might as well add a fourth type as well: the butt-ugly hotplug event notifier 
chains) should be consolidated really.

	Ingo


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-06  7:11                         ` Ingo Molnar
@ 2010-05-06  7:29                           ` Tejun Heo
  2010-05-06  7:33                             ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Tejun Heo @ 2010-05-06  7:29 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, efault, avi, paulus, acme, linux-kernel

Hello,

On 05/06/2010 09:11 AM, Ingo Molnar wrote:
> That said, i agree with you that turning the basic perf scheduling
> calls into events goes one step too far.

But if you put all those calls on top of TPs and apply static
optimization to TPs (yeah, that can be applied directly to perf calls
too but it's simpler to consolidate, IMHO), perf becomes zero-cost
when disabled.
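
i.e. something along these lines (the probe name is made up; the
register/unregister and synchronize calls are the tracepoint API of
the time):

  /* first sw event created: hook perf into the scheduler */
  register_trace_sched_switch(perf_sched_switch_probe);

  /* last sw event gone: unhook, then wait out in-flight probes --
   * this part needs process context, hence the complications */
  unregister_trace_sched_switch(perf_sched_switch_probe);
  tracepoint_synchronize_unregister();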

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-06  7:29                           ` Tejun Heo
@ 2010-05-06  7:33                             ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-06  7:33 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Peter Zijlstra, efault, avi, paulus, acme, linux-kernel

On 05/06/2010 09:29 AM, Tejun Heo wrote:
> (yeah, that can be applied directly to perf calls too but it's
> simpler to consolidate, IMHO)

Oh, also, if static text optimization is applied to perf calls
directly, it would require about the same complexity overhead as doing
it on top of TPs.  It would still have to drain contexts before
turning off those hooks, so the only advantage is the ability to build
the kernel w/ PERF but w/o TPs.

Thanks.

-- 
tejun


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-06  6:28                         ` Ingo Molnar
  2010-05-06  7:11                           ` Tejun Heo
@ 2010-05-06  8:18                           ` Avi Kivity
  1 sibling, 0 replies; 55+ messages in thread
From: Avi Kivity @ 2010-05-06  8:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, Peter Zijlstra, Tejun Heo, efault, paulus,
	acme, linux-kernel

On 05/06/2010 09:28 AM, Ingo Molnar wrote:
>
> Yep. Also, most of the default distro kernels will have 3 sets of facilities:
>
>   - preempt notifiers
>   - tracepoints
>   - sw events
>
> which is crazy. We can just standardize on using the tracepoint interface
> definition methods - they are properly typed, widespread and well-known enough
> to be perfect for this.
>
> ( They are also under intense optimization - the jump-tracepoints patch makes
>    them probably even cheaper than preempt notifiers, in the off case. )
>
>    

What about the on case?  kvm has preempt notifiers enabled all the time, 
and relies on their being fast.
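
For reference, this is roughly how kvm uses them (a from-memory sketch
of the virt/kvm pattern, not verbatim code):

  static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
  {
          struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

          kvm_arch_vcpu_load(vcpu, cpu);  /* restore guest cpu state */
  }

  static void kvm_sched_out(struct preempt_notifier *pn,
                            struct task_struct *next)
  {
          struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);

          kvm_arch_vcpu_put(vcpu);        /* stash guest cpu state */
  }

  /* at vcpu setup: registered per vcpu thread and left enabled while
   * it runs, so it's the on case, not the off case, that matters */
  preempt_notifier_register(&vcpu->preempt_notifier);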

-- 
error compiling committee.c: too many arguments to function



* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-06  7:11                           ` Tejun Heo
@ 2010-05-06  8:27                             ` Ingo Molnar
  2010-05-06  8:41                               ` Tejun Heo
  0 siblings, 1 reply; 55+ messages in thread
From: Ingo Molnar @ 2010-05-06  8:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Frederic Weisbecker, Peter Zijlstra, efault, avi, paulus, acme,
	linux-kernel


* Tejun Heo <tj@kernel.org> wrote:

> Hello,
> 
> On 05/06/2010 08:28 AM, Ingo Molnar wrote:
> > ( They are also under intense optimization - the jump-tracepoints patch makes 
> >   them probably even cheaper than preempt notifiers, in the off case. )
> 
> I mostly agree on other points but TPs and sched_notifiers are inherently 
> different in how they are enabled/disabled. sched_notifiers are 
> enabled/disabled per-task and at least w/ cmwq, there will always be some 
> tasks with active sched_notifiers so code level optimizations aren't really 
> useful.

Note that i'm really against preempt notifiers being used to modify 
scheduling behavior/semantics.  Please get that functionality into the 
scheduler - especially if it's essentially always-on.

Preempt notifiers are for things like KVM which want to do extended context 
saving. They are _NOT_ 'hooks' into the scheduler.

	Ingo


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-06  8:27                             ` Ingo Molnar
@ 2010-05-06  8:41                               ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-06  8:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frederic Weisbecker, Peter Zijlstra, efault, avi, paulus, acme,
	linux-kernel

Hello, Ingo.

On 05/06/2010 10:27 AM, Ingo Molnar wrote:
> Note that preempt notifiers used to modify scheduling behavior/semantics i'm 
> really against. Please get that functionality into the scheduler - especially 
> if it's essentially always-on.

Okay.  Well, cmwq's usage of it isn't that hairy.  It's used for two
things.

1. to count the number of currently running workers which I think is
   fine for such hooks.

2. to wake up another worker.  I don't think this falls in the
   category of "behavior/semantics" change.  It just does
   try_to_wake_up() with the only caveat being that it's called under
   the rq lock, so it requires a different API.  There's nothing which
   alters the scheduler behavior in any pervasive way.  It just looks
   at the counter and wakes up another task if the condition is right,
   *before* beginning any real scheduling work (see the sketch below).
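
Roughly, the sched_out side does this (a simplified sketch of the cmwq
usage described above; the names are illustrative and
wake_up_worker_rq_locked() is a hypothetical rq-lock-safe wakeup):

  static void worker_sched_out(struct sched_notifier *sn,
                               struct task_struct *next)
  {
          struct worker *worker = container_of(sn, struct worker,
                                               sched_notifier);
          struct global_cwq *gcwq = worker->gcwq;

          /* 1. bookkeeping: one less worker running on this cpu */
          if (!atomic_dec_and_test(&gcwq->nr_running))
                  return;

          /* 2. the last running worker is going to sleep: kick an
           * idle one so pending work keeps flowing; note that we
           * hold the rq lock here */
          if (!list_empty(&gcwq->worklist))
                  wake_up_worker_rq_locked(gcwq);
  }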

But, anyways, making dedicated fixed function calls certainly isn't
a problem and might even be slightly cheaper as they're always gonna
be there anyway.  If that's preferred, I'll happily take that path.

Thanks.

-- 
tejun


* [tip:sched/core] sched: Remove rq argument to the tracepoints
  2010-05-04 18:26         ` Peter Zijlstra
  2010-05-04 18:32           ` Steven Rostedt
@ 2010-05-07 18:41           ` tip-bot for Peter Zijlstra
  1 sibling, 0 replies; 55+ messages in thread
From: tip-bot for Peter Zijlstra @ 2010-05-07 18:41 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, rostedt, a.p.zijlstra, tglx, mingo

Commit-ID:  27a9da6538ee18046d7bff8e36a9f783542c54c3
Gitweb:     http://git.kernel.org/tip/27a9da6538ee18046d7bff8e36a9f783542c54c3
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Tue, 4 May 2010 20:36:56 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Fri, 7 May 2010 11:28:17 +0200

sched: Remove rq argument to the tracepoints

struct rq isn't visible outside of sched.o, so it's near useless to
expose the pointer; also there are no users of it, so remove it.

Acked-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1272997616.1642.207.camel@laptop>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/trace/events/sched.h      |   32 ++++++++++----------------------
 kernel/sched.c                    |    8 ++++----
 kernel/trace/ftrace.c             |    3 +--
 kernel/trace/trace_sched_switch.c |    5 ++---
 kernel/trace/trace_sched_wakeup.c |    5 ++---
 5 files changed, 19 insertions(+), 34 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index cfceb0b..4f733ec 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -51,15 +51,12 @@ TRACE_EVENT(sched_kthread_stop_ret,
 
 /*
  * Tracepoint for waiting on task to unschedule:
- *
- * (NOTE: the 'rq' argument is not used by generic trace events,
- *        but used by the latency tracer plugin. )
  */
 TRACE_EVENT(sched_wait_task,
 
-	TP_PROTO(struct rq *rq, struct task_struct *p),
+	TP_PROTO(struct task_struct *p),
 
-	TP_ARGS(rq, p),
+	TP_ARGS(p),
 
 	TP_STRUCT__entry(
 		__array(	char,	comm,	TASK_COMM_LEN	)
@@ -79,15 +76,12 @@ TRACE_EVENT(sched_wait_task,
 
 /*
  * Tracepoint for waking up a task:
- *
- * (NOTE: the 'rq' argument is not used by generic trace events,
- *        but used by the latency tracer plugin. )
  */
 DECLARE_EVENT_CLASS(sched_wakeup_template,
 
-	TP_PROTO(struct rq *rq, struct task_struct *p, int success),
+	TP_PROTO(struct task_struct *p, int success),
 
-	TP_ARGS(rq, p, success),
+	TP_ARGS(p, success),
 
 	TP_STRUCT__entry(
 		__array(	char,	comm,	TASK_COMM_LEN	)
@@ -111,31 +105,25 @@ DECLARE_EVENT_CLASS(sched_wakeup_template,
 );
 
 DEFINE_EVENT(sched_wakeup_template, sched_wakeup,
-	     TP_PROTO(struct rq *rq, struct task_struct *p, int success),
-	     TP_ARGS(rq, p, success));
+	     TP_PROTO(struct task_struct *p, int success),
+	     TP_ARGS(p, success));
 
 /*
  * Tracepoint for waking up a new task:
- *
- * (NOTE: the 'rq' argument is not used by generic trace events,
- *        but used by the latency tracer plugin. )
  */
 DEFINE_EVENT(sched_wakeup_template, sched_wakeup_new,
-	     TP_PROTO(struct rq *rq, struct task_struct *p, int success),
-	     TP_ARGS(rq, p, success));
+	     TP_PROTO(struct task_struct *p, int success),
+	     TP_ARGS(p, success));
 
 /*
  * Tracepoint for task switches, performed by the scheduler:
- *
- * (NOTE: the 'rq' argument is not used by generic trace events,
- *        but used by the latency tracer plugin. )
  */
 TRACE_EVENT(sched_switch,
 
-	TP_PROTO(struct rq *rq, struct task_struct *prev,
+	TP_PROTO(struct task_struct *prev,
 		 struct task_struct *next),
 
-	TP_ARGS(rq, prev, next),
+	TP_ARGS(prev, next),
 
 	TP_STRUCT__entry(
 		__array(	char,	prev_comm,	TASK_COMM_LEN	)
diff --git a/kernel/sched.c b/kernel/sched.c
index 4956ed0..11ac0eb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2168,7 +2168,7 @@ unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &flags);
-		trace_sched_wait_task(rq, p);
+		trace_sched_wait_task(p);
 		running = task_running(rq, p);
 		on_rq = p->se.on_rq;
 		ncsw = 0;
@@ -2439,7 +2439,7 @@ out_activate:
 	success = 1;
 
 out_running:
-	trace_sched_wakeup(rq, p, success);
+	trace_sched_wakeup(p, success);
 	check_preempt_curr(rq, p, wake_flags);
 
 	p->state = TASK_RUNNING;
@@ -2613,7 +2613,7 @@ void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
 
 	rq = task_rq_lock(p, &flags);
 	activate_task(rq, p, 0);
-	trace_sched_wakeup_new(rq, p, 1);
+	trace_sched_wakeup_new(p, 1);
 	check_preempt_curr(rq, p, WF_FORK);
 #ifdef CONFIG_SMP
 	if (p->sched_class->task_woken)
@@ -2833,7 +2833,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	struct mm_struct *mm, *oldmm;
 
 	prepare_task_switch(rq, prev, next);
-	trace_sched_switch(rq, prev, next);
+	trace_sched_switch(prev, next);
 	mm = next->mm;
 	oldmm = prev->active_mm;
 	/*
diff --git a/kernel/trace/ftrace.c b/kernel/trace/ftrace.c
index 2404b59..aa3a92b 100644
--- a/kernel/trace/ftrace.c
+++ b/kernel/trace/ftrace.c
@@ -3212,8 +3212,7 @@ free:
 }
 
 static void
-ftrace_graph_probe_sched_switch(struct rq *__rq, struct task_struct *prev,
-				struct task_struct *next)
+ftrace_graph_probe_sched_switch(struct task_struct *prev, struct task_struct *next)
 {
 	unsigned long long timestamp;
 	int index;
diff --git a/kernel/trace/trace_sched_switch.c b/kernel/trace/trace_sched_switch.c
index 5fca0f5..a55fccf 100644
--- a/kernel/trace/trace_sched_switch.c
+++ b/kernel/trace/trace_sched_switch.c
@@ -50,8 +50,7 @@ tracing_sched_switch_trace(struct trace_array *tr,
 }
 
 static void
-probe_sched_switch(struct rq *__rq, struct task_struct *prev,
-			struct task_struct *next)
+probe_sched_switch(struct task_struct *prev, struct task_struct *next)
 {
 	struct trace_array_cpu *data;
 	unsigned long flags;
@@ -109,7 +108,7 @@ tracing_sched_wakeup_trace(struct trace_array *tr,
 }
 
 static void
-probe_sched_wakeup(struct rq *__rq, struct task_struct *wakee, int success)
+probe_sched_wakeup(struct task_struct *wakee, int success)
 {
 	struct trace_array_cpu *data;
 	unsigned long flags;
diff --git a/kernel/trace/trace_sched_wakeup.c b/kernel/trace/trace_sched_wakeup.c
index 0271742..8052446 100644
--- a/kernel/trace/trace_sched_wakeup.c
+++ b/kernel/trace/trace_sched_wakeup.c
@@ -107,8 +107,7 @@ static void probe_wakeup_migrate_task(struct task_struct *task, int cpu)
 }
 
 static void notrace
-probe_wakeup_sched_switch(struct rq *rq, struct task_struct *prev,
-	struct task_struct *next)
+probe_wakeup_sched_switch(struct task_struct *prev, struct task_struct *next)
 {
 	struct trace_array_cpu *data;
 	cycle_t T0, T1, delta;
@@ -200,7 +199,7 @@ static void wakeup_reset(struct trace_array *tr)
 }
 
 static void
-probe_wakeup(struct rq *rq, struct task_struct *p, int success)
+probe_wakeup(struct task_struct *p, int success)
 {
 	struct trace_array_cpu *data;
 	int cpu = smp_processor_id();


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-05  9:54           ` Tejun Heo
  2010-05-05 11:38             ` Peter Zijlstra
@ 2010-05-10  5:20             ` Paul Mackerras
  2010-05-10  5:48               ` Tejun Heo
  1 sibling, 1 reply; 55+ messages in thread
From: Paul Mackerras @ 2010-05-10  5:20 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Peter Zijlstra, mingo, efault, avi, acme, linux-kernel

On Wed, May 05, 2010 at 11:54:52AM +0200, Tejun Heo wrote:

> On 05/05/2010 11:51 AM, Peter Zijlstra wrote:
> >> I was wondering the other way around - ie. the possibility to make
> >> perf optional and maybe even as a module which depends on TPs, which
> >> would be nicer than the current situation and make the code less
> >> cluttered too. 
> > 
> > I really really hate making perf rely on tracepoints.
> 
> Hmmm.... may I ask why?  Unifying the hooking mechanisms seems like a
> good idea to me, and it's not like it's gonna add any runtime overhead.
> It does complicate init/exit, but that's something you have to pay if
> you wanna do things dynamically, and sans the ifdef stuff it's like a
> couple hundred lines of isolated code.

Don't forget where perf_events started out - as a way to count and
record hardware events.  So perf_events is very useful even in a
kernel that has no tracing infrastructure configured in at all. 

Paul.


* Re: [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP
  2010-05-10  5:20             ` Paul Mackerras
@ 2010-05-10  5:48               ` Tejun Heo
  0 siblings, 0 replies; 55+ messages in thread
From: Tejun Heo @ 2010-05-10  5:48 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Peter Zijlstra, mingo, efault, avi, acme, linux-kernel

Hello,

On 05/10/2010 07:20 AM, Paul Mackerras wrote:
> Don't forget where perf_events started out - as a way to count and
> record hardware events.  So perf_events is very useful even in a
> kernel that has no tracing infrastructure configured in at all. 

Alright, but I don't think it's very relevant to worry about the
!CONFIG_TRACEPOINTS && CONFIG_PERF_EVENTS case as long as both can be
compiled out.
overhead and building them into the kernel shouldn't cost all that
much either (I hear that it has been improved lately).  Sure, perf can
be used w/o TPs but I can't see what that configurability buys us.

Thanks.

-- 
tejun


end of thread

Thread overview: 55+ messages
2010-05-04 12:38 [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP Tejun Heo
2010-05-04 12:38 ` [PATCH 01/12] sched: drop @cpu argument from sched_in preempt notifier Tejun Heo
2010-05-04 17:11   ` Peter Zijlstra
2010-05-04 12:38 ` [PATCH 02/12] sched: rename preempt_notifiers to sched_notifiers and refactor implementation Tejun Heo
2010-05-04 12:38 ` [PATCH 03/12] perf: add perf_event_task_migrate() Tejun Heo
2010-05-05  5:08   ` Frederic Weisbecker
2010-05-05  5:16     ` Tejun Heo
2010-05-05  9:11     ` Peter Zijlstra
2010-05-05  9:37       ` Tejun Heo
2010-05-05  9:50         ` Peter Zijlstra
2010-05-05  9:56           ` Tejun Heo
2010-05-04 12:38 ` [PATCH 04/12] perf: add @rq to perf_event_task_sched_out() Tejun Heo
2010-05-04 17:11   ` Peter Zijlstra
2010-05-04 17:22     ` Tejun Heo
2010-05-04 17:42       ` Peter Zijlstra
2010-05-04 18:22       ` Steven Rostedt
2010-05-04 18:26         ` Peter Zijlstra
2010-05-04 18:32           ` Steven Rostedt
2010-05-05  4:48             ` Tejun Heo
2010-05-05  9:58             ` Tejun Heo
2010-05-07 18:41           ` [tip:sched/core] sched: Remove rq argument to the tracepoints tip-bot for Peter Zijlstra
2010-05-04 12:38 ` [PATCH 05/12] perf: move perf_event_task_sched_in() next to fire_sched_notifiers_in() Tejun Heo
2010-05-04 12:38 ` [PATCH 06/12] sched: relocate fire_sched_notifiers_out() and trace_sched_switch() Tejun Heo
2010-05-04 12:38 ` [PATCH 07/12] sched: coalesce event notifiers Tejun Heo
2010-05-04 12:38 ` [PATCH 08/12] sched: add switch_in and tick tracepoints Tejun Heo
2010-05-04 12:38 ` [PATCH 09/12] perf: factor out perf_event_switch_clones() Tejun Heo
2010-05-04 12:38 ` [PATCH 10/12] perf: make nr_events an int and add perf_online_mutex to protect it Tejun Heo
2010-05-04 12:38 ` [PATCH 11/12] perf: prepare to move sched perf functions on top of tracepoints Tejun Heo
2010-05-04 12:38 ` [PATCH 12/12] perf: " Tejun Heo
2010-05-04 17:29 ` [RFC PATCHSET] sched,perf: unify tracers in sched and move perf on top of TP Peter Zijlstra
2010-05-05  5:00   ` Tejun Heo
2010-05-05  9:06     ` Peter Zijlstra
2010-05-05  9:32       ` Tejun Heo
2010-05-05  9:51         ` Peter Zijlstra
2010-05-05  9:54           ` Tejun Heo
2010-05-05 11:38             ` Peter Zijlstra
2010-05-05 12:28               ` Tejun Heo
2010-05-05 16:55                 ` Ingo Molnar
2010-05-05 18:12                   ` Peter Zijlstra
2010-05-05 18:16                     ` Peter Zijlstra
2010-05-05 18:30                       ` Frederic Weisbecker
2010-05-06  6:28                         ` Ingo Molnar
2010-05-06  7:11                           ` Tejun Heo
2010-05-06  8:27                             ` Ingo Molnar
2010-05-06  8:41                               ` Tejun Heo
2010-05-06  8:18                           ` Avi Kivity
2010-05-06  6:31                     ` Ingo Molnar
2010-05-06  7:04                       ` Peter Zijlstra
2010-05-06  7:11                         ` Ingo Molnar
2010-05-06  7:29                           ` Tejun Heo
2010-05-06  7:33                             ` Tejun Heo
2010-05-05 12:33               ` Avi Kivity
2010-05-05 13:09                 ` Tejun Heo
2010-05-10  5:20             ` Paul Mackerras
2010-05-10  5:48               ` Tejun Heo
