* [RFC PATCH 0/2] IRQ based next prediction
@ 2016-01-06 15:22 Daniel Lezcano
2016-01-06 15:22 ` [RFC PATCH 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
2016-01-06 15:22 ` [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
0 siblings, 2 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-06 15:22 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
The current approach to select an idle state is based on the idle period
statistics computation.
Useless to say this approach satisfied everyone as a solution to find the
best trade-off between the performances and the energy saving via the menu
governor.
However, the kernel is evolving to act pro-actively regarding the energy
constraints with the scheduler and the different power management subsystems
are not collaborating with the scheduler as the conductor of the decisions,
they all act independently.
The cpuidle governors are based on idle period statistics, without knowledge
of what woke up the cpu. In these sources of wakes up, the IPI are of course
accounted (as well as the timers irq) which results in doing statistics on
the scheduler behavior too. It is no sense to let the scheduler to take a
decision based on a next prediction of its own decisions.
In order to integrate the cpuidle framework into the scheduler, we have to
radically change the approach by clearly identifying what is causing a wake
up and how it behaves.
This serie inverts the logic.
Instead of tracking the idle durations and do statistics on them, these
patches track the interrupt individually and try to predict the next interrupt.
By combining the interrupts' next event on a single CPU, we can predict the
next event for the CPU, hence predict how long we will be sleeping when
entering idle.
The IPI and timer interrupts are not taken into account.
The first patch provides a callback to be registered in the irq subsystem
and to be called when an interrupt is handled with a timestamp.
The second patch uses the callback provided by the patch above to compute
the delta and store it in a circular buffer. It is per cpu, the callback
implements minimal operations as it is in an interrupt context.
When we the cpu enters idle, it asks for the expected sleep time. Then the
expected minimum sleep length for all interrupts is used and compared to
the timer sleep length, again the minimum is taken and gives the expected
sleep time.
The statistics are very trivial and could be improved later but this first
step shows we have a nice overall improvement in SMP. In UP the menu governor
is a bit better which may indicate the next prediction computation could be
improved but confirms removing the IPI from the equation increase the
accuracy.
These are the results with a workload emulator (mp3, video, browser, ...) on
a Dual Xeon 6 cores. Each test has been run 10 times. The results are based
on a debugfs statistics where, when we exit idle, we check if the idle state
was the right one which gives a successful prediction even if the exact
wakeup time is not correct. The error margin is absorbed by the idle state's
target residency, eg. idle1 is 10 us, idle2 is 50us, idle3 is 150us, if we
predict to sleep 110us but at the end we sleep 80us, the prediction is still
successful as the idle2 is the correct one for 80us and 110us.
(mean colum - 5th -, greater is better).
SMP (12 cores):
--------------------------
Successful predictions (%)
--------------------------
scripts/iolatsimu.sh.menu.dat:
N min max sum mean stddev
10 30.96 39.3 346.15 34.615 2.37729
scripts/iolatsimu.sh.irq.dat:
N min max sum mean stddev
10 22.21 54.39 377.1 37.71 11.3374
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-aio.sh.menu.dat:
N min max sum mean stddev
10 44.32 48.81 459.5 45.95 1.59114
scripts/fio-aio.sh.irq.dat:
N min max sum mean stddev
10 54.18 64.75 592.86 59.286 2.83509
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-cpuio.sh.menu.dat:
N min max sum mean stddev
10 86.86 91 890.21 89.021 1.34882
scripts/fio-cpuio.sh.irq.dat:
N min max sum mean stddev
10 97.97 98.57 981.85 98.185 0.171804
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-zipf.sh.menu.dat:
N min max sum mean stddev
10 87.58 94.53 917.07 91.707 1.81287
scripts/fio-zipf.sh.irq.dat:
N min max sum mean stddev
10 95.51 98.88 980.25 98.025 1.06026
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-netio.sh.menu.dat:
N min max sum mean stddev
10 65.51 81.82 729.41 72.941 5.34134
scripts/fio-netio.sh.irq.dat:
N min max sum mean stddev
10 97.38 97.88 976.19 97.619 0.162238
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-falloc.sh.menu.dat:
N min max sum mean stddev
10 79.4 91.58 858.71 85.871 3.51946
scripts/fio-falloc.sh.irq.dat:
N min max sum mean stddev
10 65.06 87.54 728.72 72.872 6.29703
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-null.sh.menu.dat:
N min max sum mean stddev
10 89.44 92.83 908.96 90.896 1.12064
scripts/fio-null.sh.irq.dat:
N min max sum mean stddev
10 95.68 98.74 978.19 97.819 0.954468
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-browser.sh.menu.dat:
N min max sum mean stddev
10 59 67.66 640.21 64.021 3.11123
scripts/rt-app-browser.sh.irq.dat:
N min max sum mean stddev
10 81.45 91.48 865.2 86.52 3.39756
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-mp3.sh.menu.dat:
N min max sum mean stddev
10 65.85 70.67 674.85 67.485 1.34196
scripts/rt-app-mp3.sh.irq.dat:
N min max sum mean stddev
10 94.97 97.24 963.96 96.396 0.743897
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-video.sh.menu.dat:
N min max sum mean stddev
10 59.41 63.14 615.57 61.557 1.24713
scripts/rt-app-video.sh.irq.dat:
N min max sum mean stddev
10 70.92 76.8 747.7 74.77 1.84455
--------------------------
Successful predictions (%)
--------------------------
scripts/video.sh.menu.dat:
N min max sum mean stddev
10 27.99 35.54 302.23 30.223 2.27853
scripts/video.sh.irq.dat:
N min max sum mean stddev
10 54.42 72.42 635.55 63.555 6.48589
--------------------------
Successful predictions (%)
--------------------------
scripts/netperf.sh.menu.dat:
N min max sum mean stddev
10 67.49 82.46 748.43 74.843 4.64976
scripts/netperf.sh.irq.dat:
N min max sum mean stddev
10 97.51 99.43 990.12 99.012 0.545768
In UP:
--------------------------
Successful predictions (%)
--------------------------
scripts/iolatsimu.sh.menu.dat:
N min max sum mean stddev
10 49.5 55.66 514.02 51.402 2.31278
scripts/iolatsimu.sh.irq.dat:
N min max sum mean stddev
10 39.49 63.51 552.35 55.235 6.54344
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-aio.sh.menu.dat:
N min max sum mean stddev
10 49.44 51.9 502.44 50.244 0.849552
scripts/fio-aio.sh.irq.dat:
N min max sum mean stddev
10 35.58 42.06 381.57 38.157 2.05544
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-cpuio.sh.menu.dat:
N min max sum mean stddev
10 93.07 97.26 954.71 95.471 1.52628
scripts/fio-cpuio.sh.irq.dat:
N min max sum mean stddev
10 91.02 97.48 943.92 94.392 2.21478
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-zipf.sh.menu.dat:
N min max sum mean stddev
10 76.92 88 831.94 83.194 3.2838
scripts/fio-zipf.sh.irq.dat:
(2 runs without idle transitions)
N min max sum mean stddev
8 50 100 657.5 82.1875 22.418
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-netio.sh.menu.dat:
N min max sum mean stddev
10 78.95 85.71 818.7 81.87 2.01494
scripts/fio-netio.sh.irq.dat:
N min max sum mean stddev
10 62.5 90 733.51 73.351 7.77129
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-falloc.sh.menu.dat:
N min max sum mean stddev
10 29.42 38.46 357.83 35.783 2.57226
scripts/fio-falloc.sh.irq.dat:
N min max sum mean stddev
10 28.85 37.48 333.22 33.322 3.121
--------------------------
Successful predictions (%)
--------------------------
scripts/fio-null.sh.menu.dat:
N min max sum mean stddev
10 73.53 84.62 806.44 80.644 3.57214
scripts/fio-null.sh.irq.dat:
N min max sum mean stddev
10 57.14 85.71 761.68 76.168 8.73407
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-browser.sh.menu.dat:
N min max sum mean stddev
10 96.33 98.8 976.84 97.684 0.834921
scripts/rt-app-browser.sh.irq.dat:
N min max sum mean stddev
10 93.02 100 984.58 98.458 2.08916
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-mp3.sh.menu.dat:
N min max sum mean stddev
10 99.06 99.59 993.8 99.38 0.178948
scripts/rt-app-mp3.sh.irq.dat:
N min max sum mean stddev
10 95.81 99.54 978.95 97.895 1.36267
--------------------------
Successful predictions (%)
--------------------------
scripts/rt-app-video.sh.menu.dat:
N min max sum mean stddev
10 96.96 99.08 983.25 98.325 0.632899
scripts/rt-app-video.sh.irq.dat:
N min max sum mean stddev
10 74.77 97.5 829.45 82.945 7.10247
--------------------------
Successful predictions (%)
--------------------------
scripts/video.sh.menu.dat:
N min max sum mean stddev
10 48.52 59.73 541.3 54.13 3.11611
scripts/video.sh.irq.dat:
N min max sum mean stddev
10 43.71 58.5 533.62 53.362 4.84646
Daniel Lezcano (2):
irq: Add a framework to measure interrupt timings
sched: idle: IRQ based next prediction for idle period
drivers/cpuidle/Kconfig | 5 +
include/linux/interrupt.h | 45 ++++
include/linux/irqdesc.h | 3 +
kernel/irq/Kconfig | 3 +
kernel/irq/handle.c | 12 +
kernel/irq/manage.c | 65 +++++-
kernel/sched/Makefile | 1 +
kernel/sched/idle-sched.c | 553 ++++++++++++++++++++++++++++++++++++++++++++++
8 files changed, 686 insertions(+), 1 deletion(-)
create mode 100644 kernel/sched/idle-sched.c
--
1.9.1
^ permalink raw reply [flat|nested] 63+ messages in thread
* [RFC PATCH 1/2] irq: Add a framework to measure interrupt timings
2016-01-06 15:22 [RFC PATCH 0/2] IRQ based next prediction Daniel Lezcano
@ 2016-01-06 15:22 ` Daniel Lezcano
2016-01-08 15:31 ` Thomas Gleixner
2016-01-06 15:22 ` [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
1 sibling, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-06 15:22 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
The interrupt framework gives a lot of information and statistics about
each interrupt.
Unfortunately there is no way to measure when interrupts occur and provide
a mathematical model for their behavior which could help in predicting
their next occurence.
This framework allows for registering a callback function that is invoked
when an interrupt occurs.
Each time, the callback will be called with the timestamp corresponding to
when happened the interrupt.
This framework allows a subsystem to register a handler in order to receive
the timing information for the registered interrupt. That gives other
subsystems the ability to compute predictions for the next interrupt occurence.
The main objective is to track and detect the periodic interrupts in order
to predict the next event on a cpu and anticipate the sleeping time when
entering idle. This fine grain approach allows to simplify and rationalize
a wake up event prediction without IPIs interference, thus letting the
scheduler to be smarter with the wakeup IPIs regarding the idle period.
The irq timings tracking showed, in the proof-of-concept, an improvement with
the predictions, the approach is correct but my knowledge to the irq
subsystem is limited. I am not sure this patch measuring irq time interval
is correct or acceptable, so it is at the RFC state (minus some polishing).
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
include/linux/interrupt.h | 45 ++++++++++++++++++++++++++++++++
include/linux/irqdesc.h | 3 +++
kernel/irq/Kconfig | 4 +++
kernel/irq/handle.c | 12 +++++++++
kernel/irq/manage.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 128 insertions(+), 1 deletion(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index be7e75c..f48e8ff 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -123,6 +123,51 @@ struct irqaction {
extern irqreturn_t no_action(int cpl, void *dev_id);
+#ifdef CONFIG_IRQ_TIMINGS
+/**
+ * timing handler to be called when an interrupt happens
+ */
+typedef void (*irqt_handler_t)(unsigned int, ktime_t, void *, void *);
+
+/**
+ * struct irqtimings - per interrupt irq timings descriptor
+ * @handler: interrupt handler timings function
+ * @data: pointer to the private data to be passed to the handler
+ * @timestamp: latest interruption occurence
+ */
+struct irqtimings {
+ irqt_handler_t handler;
+ void *data;
+} ____cacheline_internodealigned_in_smp;
+
+/**
+ * struct irqt_ops - structure to be used by the subsystem to call the
+ * register and unregister ops when an irq is setup or freed.
+ * @setup: registering callback
+ * @free: unregistering callback
+ *
+ * The callbacks assumes the lock is held on the irq desc
+ */
+struct irqtimings_ops {
+ int (*setup)(unsigned int, struct irqaction *);
+ void (*free)(unsigned int, void *);
+};
+
+extern int register_irq_timings(struct irqtimings_ops *ops);
+extern int setup_irq_timings(unsigned int irq, struct irqaction *act);
+extern void free_irq_timings(unsigned int irq, void *dev_id);
+#else
+static inline int setup_irq_timings(unsigned int irq, struct irqaction *act)
+{
+ return 0;
+}
+
+static inline void free_irq_timings(unsigned int irq, void *dev_id)
+{
+ ;
+}
+#endif
+
extern int __must_check
request_threaded_irq(unsigned int irq, irq_handler_t handler,
irq_handler_t thread_fn,
diff --git a/include/linux/irqdesc.h b/include/linux/irqdesc.h
index a587a33..e0d4263 100644
--- a/include/linux/irqdesc.h
+++ b/include/linux/irqdesc.h
@@ -51,6 +51,9 @@ struct irq_desc {
#ifdef CONFIG_IRQ_PREFLOW_FASTEOI
irq_preflow_handler_t preflow_handler;
#endif
+#ifdef CONFIG_IRQ_TIMINGS
+ struct irqtimings *timings;
+#endif
struct irqaction *action; /* IRQ action list */
unsigned int status_use_accessors;
unsigned int core_internal_state__do_not_mess_with_it;
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 9a76e3b..1275fd1 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -73,6 +73,10 @@ config GENERIC_MSI_IRQ_DOMAIN
config HANDLE_DOMAIN_IRQ
bool
+config IRQ_TIMINGS
+ bool
+ default y
+
config IRQ_DOMAIN_DEBUG
bool "Expose hardware/virtual IRQ mapping via debugfs"
depends on IRQ_DOMAIN && DEBUG_FS
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index e25a83b..ca8b0c5 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -132,6 +132,17 @@ void __irq_wake_thread(struct irq_desc *desc, struct irqaction *action)
wake_up_process(action->thread);
}
+#ifdef CONFIG_IRQ_TIMINGS
+void handle_irqt_event(struct irqtimings *irqt, struct irqaction *action)
+{
+ if (irqt)
+ irqt->handler(action->irq, ktime_get(),
+ action->dev_id, irqt->data);
+}
+#else
+#define handle_irqt_event(a, b)
+#endif
+
irqreturn_t
handle_irq_event_percpu(struct irq_desc *desc, struct irqaction *action)
{
@@ -165,6 +176,7 @@ handle_irq_event_percpu(struct irq_desc *desc, struct irqaction *action)
/* Fall through to add to randomness */
case IRQ_HANDLED:
flags |= action->flags;
+ handle_irqt_event(desc->timings, action);
break;
default:
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index f9a59f6..21cc7bf 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1017,6 +1017,60 @@ static void irq_release_resources(struct irq_desc *desc)
c->irq_release_resources(d);
}
+#ifdef CONFIG_IRQ_TIMINGS
+/*
+ * Global variable, only used by accessor functions, currently only
+ * one user is allowed and it is up to the caller to make sure to
+ * setup the irq timings which are already setup.
+ */
+static struct irqtimings_ops *irqtimings_ops;
+
+/**
+ * register_irq_timings - register the ops when an irq is setup or freed
+ *
+ * @ops: the register/unregister ops to be called when at setup or
+ * free time
+ *
+ * Returns -EBUSY if the slot is already in use, zero on success.
+ */
+int register_irq_timings(struct irqtimings_ops *ops)
+{
+ if (irqtimings_ops)
+ return -EBUSY;
+
+ irqtimings_ops = ops;
+
+ return 0;
+}
+
+/**
+ * setup_irq_timings - call the timing register callback
+ *
+ * @desc: an irq desc structure
+ *
+ * Returns -EINVAL in case of error, zero on success.
+ */
+int setup_irq_timings(unsigned int irq, struct irqaction *act)
+{
+ if (irqtimings_ops && irqtimings_ops->setup)
+ return irqtimings_ops->setup(irq, act);
+ return 0;
+}
+
+/**
+ * free_irq_timings - call the timing unregister callback
+ *
+ * @irq: the interrupt number
+ * @dev_id: the device id
+ *
+ */
+void free_irq_timings(unsigned int irq, void *dev_id)
+{
+ if (irqtimings_ops && irqtimings_ops->free)
+ irqtimings_ops->free(irq, dev_id);
+}
+#endif /* CONFIG_IRQ_TIMINGS */
+
/*
* Internal function to register an irqaction - typically used to
* allocate special interrupts that are part of the architecture.
@@ -1037,6 +1091,9 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
if (!try_module_get(desc->owner))
return -ENODEV;
+ ret = setup_irq_timings(irq, new);
+ if (ret)
+ goto out_mput;
/*
* Check whether the interrupt nests into another interrupt
* thread.
@@ -1045,7 +1102,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
if (nested) {
if (!new->thread_fn) {
ret = -EINVAL;
- goto out_mput;
+ goto out_free_timings;
}
/*
* Replace the primary handler which was provided from
@@ -1323,6 +1380,10 @@ out_thread:
kthread_stop(t);
put_task_struct(t);
}
+
+out_free_timings:
+ free_irq_timings(irq, new->dev_id);
+
out_mput:
module_put(desc->owner);
return ret;
@@ -1408,6 +1469,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
unregister_handler_proc(irq, action);
+ free_irq_timings(irq, dev_id);
+
/* Make sure it's not being used on another CPU: */
synchronize_irq(irq);
--
1.9.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-06 15:22 [RFC PATCH 0/2] IRQ based next prediction Daniel Lezcano
2016-01-06 15:22 ` [RFC PATCH 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
@ 2016-01-06 15:22 ` Daniel Lezcano
2016-01-06 17:40 ` Nicolas Pitre
2016-01-08 15:43 ` Thomas Gleixner
1 sibling, 2 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-06 15:22 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
Many IRQs are quiet most of the time, or they tend to come in bursts of
fairly equal time intervals within each burst. It is therefore possible
to detect those IRQs with stable intervals and guestimate when the next
IRQ event is most likely to happen.
Examples of such IRQs may include audio related IRQs where the FIFO size
and/or DMA descriptor size with the sample rate create stable intervals,
block devices during large data transfers, etc. Even network streaming
of multimedia content creates patterns of periodic network interface IRQs
in some cases.
This patch adds code to track the mean interval and variance for each IRQ
over a window of time intervals between IRQ events. Those statistics can
be used to assist cpuidle in selecting the most appropriate sleep state
by predicting the most likely time for the next interrupt.
Because the stats are gathered in interrupt context, the core computation
is as light as possible.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
drivers/cpuidle/Kconfig | 5 +
kernel/irq/Kconfig | 1 -
kernel/sched/Makefile | 1 +
kernel/sched/idle-sched.c | 549 ++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 555 insertions(+), 1 deletion(-)
create mode 100644 kernel/sched/idle-sched.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 8c7930b..dd17215 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -25,6 +25,11 @@ config CPU_IDLE_GOV_MENU
bool "Menu governor (for tickless system)"
default y
+config CPU_IDLE_GOV_SCHED
+ bool "Sched idle governor"
+ select IRQ_TIMINGS
+ default y
+
config DT_IDLE_STATES
bool
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 1275fd1..81557ae 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -75,7 +75,6 @@ config HANDLE_DOMAIN_IRQ
config IRQ_TIMINGS
bool
- default y
config IRQ_DOMAIN_DEBUG
bool "Expose hardware/virtual IRQ mapping via debugfs"
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 6768797..f7d5a35 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_IDLE_GOV_SCHED) += idle-sched.o
diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c
new file mode 100644
index 0000000..a8e7557
--- /dev/null
+++ b/kernel/sched/idle-sched.c
@@ -0,0 +1,549 @@
+/*
+ * Copyright (C) 2016 Linaro Ltd, Daniel Lezcano <daniel.lezcano@linaro.org>
+ * Nicolas Pitre <nicolas.pitre@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+#include <linux/cpuidle.h>
+#include <linux/interrupt.h>
+#include <linux/irqdesc.h>
+#include <linux/ktime.h>
+#include <linux/slab.h>
+#include <linux/tick.h>
+#include <linux/time64.h>
+
+/*
+ * Define the number of samples over which the average and variance
+ * are computed. A power of 2 is preferred so to let the compiler
+ * optimize divisions by that number with simple arithmetic shifts.
+ */
+#define STATS_NR_VALUES 4
+
+/**
+ * struct stats - internal structure to encapsulate stats informations
+ *
+ * @sum: sum of the values
+ * @values: array of values to do stats on
+ * @n: current number of values
+ * @w_ptr: current buffer pointer
+ */
+struct stats {
+ u64 sum; /* sum of values */
+ u32 values[STATS_NR_VALUES]; /* array of values */
+ unsigned int w_ptr; /* current window pointer */
+};
+
+/**
+ * struct wakeup - internal structure describing a source of wakeup
+ *
+ * @stats: the stats structure on the different event intervals
+ * @timestamp: latest update timestamp
+ */
+struct wakeup {
+ struct stats stats;
+ ktime_t timestamp;
+};
+
+/*
+ * Per cpu and irq statistics. Each cpu receives interrupts and those
+ * ones can be distributed following an irq chip specific
+ * algorithm. Random irq distribution is the worst case to predict
+ * interruption behavior but usually that does not happen or could be
+ * fixed from userspace by setting the irq affinity.
+ */
+static DEFINE_PER_CPU(struct wakeup, *wakeups[NR_IRQS]);
+
+/**
+ * stats_add - add a new value in the statistic structure
+ *
+ * @s: the statistic structure
+ * @value: the new value to be added
+ *
+ * Adds the value to the array, if the array is full, the oldest value
+ * is replaced.
+ */
+static void stats_add(int irq, struct stats *s, u32 value)
+{
+ /*
+ * This is a circular buffer, so the oldest value is
+ * the next one in the buffer. Let's compute the next
+ * pointer to retrieve the oldest value and re-use it
+ * to update the w_ptr after adding the new value.
+ */
+ unsigned int w_ptr = s->w_ptr;
+
+ /*
+ * Remove the oldest value from the summing. If this is the
+ * first time we go through this array slot, the previous
+ * value will be zero and we won't substract anything from the
+ * current sum. Hence this code relies on a zero-ed stat
+ * structure at init time via memset or kzalloc.
+ */
+ s->sum -= s->values[w_ptr];
+ s->values[w_ptr] = value;
+ s->w_ptr = (w_ptr + 1) % STATS_NR_VALUES;
+
+ /*
+ * In order to reduce the overhead and to prevent value
+ * derivation due to the integer computation, we just sum the
+ * value and do the division when the average and the variance
+ * are requested.
+ */
+ s->sum += value;
+}
+
+/**
+ * stats_reset - reset the stats
+ *
+ * @s: the statistic structure
+ *
+ * Reset the statistics and reset the values
+ */
+static inline void stats_reset(struct stats *s)
+{
+ memset(s, 0, sizeof(*s));
+}
+
+/**
+ * stats_mean - compute the average
+ *
+ * @s: the statistics structure
+ *
+ * Returns an u32 corresponding to the mean value, or zero if there is
+ * no data
+ */
+static inline u32 stats_mean(struct stats *s)
+{
+ /*
+ * gcc is smart enough to convert to a bits shift when the
+ * divisor is constant and multiple of 2^x.
+ *
+ * The number of values could have not reached STATS_NR_VALUES
+ * yet, but we can consider it acceptable as the situation is
+ * only at the very first beginning in the system life cycle.
+ */
+ return s->sum / STATS_NR_VALUES;
+}
+
+/**
+ * stats_variance - compute the variance
+ *
+ * @s: the statistic structure
+ *
+ * Returns an u64 corresponding to the variance, or zero if there is
+ * no data
+ */
+static u64 stats_variance(struct stats *s, u32 mean)
+{
+ int i;
+ u64 variance = 0;
+
+ /*
+ * The variance is the sum of the squared difference to the
+ * average divided by the number of elements.
+ */
+ for (i = 0; i < STATS_NR_VALUES; i++) {
+ s32 diff = s->values[i] - mean;
+ variance += (u64)diff * diff;
+ }
+
+ return variance / STATS_NR_VALUES;
+}
+
+/**
+ * sched_idle_irq - irq timestamp callback
+ *
+ * @irq: the irq number
+ * @timestamp: when the interrupt occured
+ * @dev_id: device id for shared interrupt (not yet used)
+ * @data: private data given when registering this callback
+ *
+ * Interrupt callback called when an interrupt happens. This function
+ * is critical as it is called under an interrupt section: minimum
+ * operations as possible are done here:
+ */
+static void sched_idle_irq(unsigned int irq, ktime_t timestamp,
+ void *dev_id, void *data)
+{
+ u32 diff;
+ unsigned int cpu = raw_smp_processor_id();
+ struct wakeup *w = per_cpu(wakeups[irq], cpu);
+
+ if (!w)
+ return;
+
+ /*
+ * It is the first time the interrupt occurs of the series, we
+ * can't do any stats as we don't have an interval, just store
+ * the timestamp and exit.
+ */
+ if (ktime_equal(w->timestamp, ktime_set(0, 0))) {
+ w->timestamp = timestamp;
+ return;
+ }
+
+ /*
+ * Microsec resolution is enough for our purpose.
+ */
+ diff = ktime_us_delta(timestamp, w->timestamp);
+ w->timestamp = timestamp;
+ stats_add(irq, &w->stats, diff);
+}
+
+/*
+ * Callback to be called when an interrupt happens.
+ */
+static struct irqtimings irq_timings = {
+ .handler = sched_idle_irq,
+};
+
+static ktime_t next_irq_event(void)
+{
+ unsigned int irq, cpu = raw_smp_processor_id();
+ ktime_t diff, next, min = { .tv64 = KTIME_MAX };
+ struct wakeup *w;
+ u32 interval, mean;
+ u64 variance;
+
+ /*
+ * Lookup the interrupt array for this cpu and search for the
+ * earlier expected interruption.
+ */
+ for (irq = 0; irq < NR_IRQS; irq++) {
+
+ ktime_t now = ktime_get();
+
+ w = per_cpu(wakeups[irq], cpu);
+
+ /*
+ * The interrupt was not setup as a source of a wakeup
+ * or the wakeup source is not considered at this
+ * moment stable enough to do a prediction.
+ */
+ if (!w)
+ continue;
+
+ /*
+ * No statistics available yet.
+ */
+ if (ktime_equal(w->timestamp, ktime_set(0, 0)))
+ continue;
+
+ diff = ktime_sub(now, w->timestamp);
+
+ /*
+ * There is no point attempting predictions on interrupts more
+ * than 1 second apart. This has no benefit for sleep state
+ * selection and increases the risk of overflowing our variance
+ * computation. Reset all stats in that case.
+ */
+ if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
+ stats_reset(&w->stats);
+ continue;
+ }
+
+ /*
+ * If the mean value is null, just ignore this wakeup
+ * source.
+ */
+ mean = stats_mean(&w->stats);
+ if (!mean)
+ continue;
+
+ variance = stats_variance(&w->stats, mean);
+ /*
+ * We want to check the last interval is:
+ *
+ * mean - stddev < interval < mean + stddev
+ *
+ * That simplifies to:
+ *
+ * -stddev < interval - mean < stddev
+ *
+ * abs(interval - mean) < stddev
+ *
+ * The standard deviation is the sqrt of the variance:
+ *
+ * abs(interval - mean) < sqrt(variance)
+ *
+ * and we want to prevent to do an sqrt, so we square
+ * the equation:
+ *
+ * (interval - mean)^2 < variance
+ *
+ * So if the latest value of the stats complies with
+ * this condition, then the wakeup source is
+ * considered predictable and can be used to predict
+ * the next event.
+ */
+ interval = w->stats.values[(w->stats.w_ptr + 1) % STATS_NR_VALUES];
+ if ((u64)((interval - mean) * (interval - mean)) > variance)
+ continue;
+
+ /*
+ * Let's compute the next event: the wakeup source is
+ * considered predictable, we add the average interval
+ * time added to the latest interruption event time.
+ */
+ next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
+
+ /*
+ * If the interrupt is supposed to happen before the
+ * minimum time, then it becomes the minimum.
+ */
+ if (ktime_before(next, min))
+ min = next;
+ }
+
+ /*
+ * At this point, we have our prediction but the caller is
+ * expecting the remaining time before the next event, so
+ * compute the expected sleep length.
+ */
+ diff = ktime_sub(min, ktime_get());
+
+ /*
+ * The result could be negative for different reasons:
+ * - the prediction is incorrect
+ * - the prediction was too near now and expired while we were
+ * in this function
+ *
+ * In both cases, we return KTIME_MAX as a failure to do a
+ * prediction
+ */
+ if (ktime_compare(diff, ktime_set(0, 0)) <= 0)
+ return (ktime_t){ .tv64 = KTIME_MAX };
+
+ return diff;
+}
+
+/**
+ * sched_idle_next_wakeup - Predict the next wakeup on the current cpu
+ *
+ * The next event on the cpu is based on a statistic approach of the
+ * interrupt events and the timer deterministic value. From the timer
+ * or the irqs, we return the one expected to occur first.
+ *
+ * Returns the expected remaining idle time before being woken up by
+ * an interruption.
+ */
+s64 sched_idle_next_wakeup(void)
+{
+ s64 next_timer = ktime_to_us(tick_nohz_get_sleep_length());
+ s64 next_irq = ktime_to_us(next_irq_event());
+
+ return min(next_irq, next_timer);
+}
+
+/**
+ * sched_idle - go to idle for a specified amount of time
+ *
+ * @duration: the idle duration time
+ * @latency: the latency constraint
+ *
+ * Returns 0 on success, < 0 otherwise.
+ */
+int sched_idle(s64 duration, unsigned int latency)
+{
+ struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
+ struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+ struct cpuidle_state_usage *su;
+ struct cpuidle_state *s;
+ int i, ret = 0, index = -1;
+
+ rcu_idle_enter();
+
+ /*
+ * No cpuidle driver is available, let's use the default arch
+ * idle function.
+ */
+ if (cpuidle_not_available(drv, dev))
+ goto default_idle;
+
+ /*
+ * Find the idle state with the lowest power while satisfying
+ * our constraints. We will save energy if the duration of the
+ * idle time is bigger than the target residency which is the
+ * break even point. The choice will be modulated by the
+ * latency.
+ */
+ for (i = 0; i < drv->state_count; i++) {
+
+ s = &drv->states[i];
+
+ su = &dev->states_usage[i];
+
+ if (s->disabled || su->disable)
+ continue;
+ if (s->target_residency > duration)
+ continue;
+ if (s->exit_latency > latency)
+ continue;
+
+ index = i;
+ }
+
+ /*
+ * The idle task must be scheduled, it is pointless to go to
+ * idle, just re-enable the interrupt and return.
+ */
+ if (current_clr_polling_and_test()) {
+ local_irq_enable();
+ goto out;
+ }
+
+ if (index < 0) {
+ /*
+ * No idle callbacks fulfilled the constraints, jump
+ * to the default function like there wasn't any
+ * cpuidle driver.
+ */
+ goto default_idle;
+ } else {
+ /*
+ * Enter the idle state previously returned by the
+ * governor decision. This function will block until
+ * an interrupt occurs and will take care of
+ * re-enabling the local interrupts
+ */
+ return cpuidle_enter(drv, dev, index);
+ }
+
+default_idle:
+ default_idle_call();
+out:
+ rcu_idle_exit();
+ return ret;
+}
+
+/**
+ * sched_irq_timing_free - free_irq callback
+ *
+ * @irq: the irq number to stop tracking
+ * @dev_id: not used at the moment
+ *
+ * This function will remove from the wakeup source prediction table.
+ */
+static void sched_irq_timing_free(unsigned int irq, void *dev_id)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ struct wakeup *w;
+ unsigned int cpu;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ desc->timings = NULL;
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+ for_each_possible_cpu(cpu) {
+ w = per_cpu(wakeups[irq], cpu);
+ if (!w)
+ continue;
+
+ per_cpu(wakeups[irq], cpu) = NULL;
+ kfree(w);
+ }
+}
+
+/**
+ * sched_irq_timing_setup - setup_irq callback
+ *
+ * @irq: the interrupt numbe to be tracked
+ * @act: the new irq action to be set to this interrupt
+ *
+ * Update the irq table to be tracked in order to predict the next event.
+ *
+ * Returns zero on success. On error it returns -ENOMEM.
+ */
+static int sched_irq_timing_setup(unsigned int irq, struct irqaction *act)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ struct wakeup *w;
+ unsigned int cpu;
+ unsigned long flags;
+ int ret = -ENOMEM;
+
+ /*
+ * No interrupt set for this descriptor or related to a timer.
+ * Timers are deterministic, so no need to try to do any
+ * prediction on them. No error for both cases, we are just not
+ * interested.
+ */
+ if (!act || (act->flags & __IRQF_TIMER))
+ return 0;
+
+ /*
+ * Allocates the wakeup structure and the stats structure. As
+ * the interrupt can occur on any cpu, allocate the wakeup
+ * structure per cpu basis.
+ */
+ for_each_possible_cpu(cpu) {
+
+ w = kzalloc(sizeof(*w), GFP_KERNEL);
+ if (!w)
+ goto undo;
+
+ per_cpu(wakeups[irq], cpu) = w;
+ }
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ desc->timings = &irq_timings;
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+ ret = 0;
+undo:
+ /*
+ * Rollback all allocations on failure
+ */
+ if (ret)
+ sched_irq_timing_free(irq, act->dev_id);
+
+ return ret;
+}
+
+/*
+ * Setup/free irq callbacks
+ */
+static struct irqtimings_ops irqt_ops = {
+ .setup = sched_irq_timing_setup,
+ .free = sched_irq_timing_free,
+};
+
+/**
+ * sched_idle_init - setup the interrupt tracking table
+ *
+ * At init time, some interrupts could have been setup and in the
+ * system life time, some devices could be setup. In order to track
+ * all interrupts we are interested in, we first register a couple of
+ * callback to keep up-to-date the interrupt tracking table and then
+ * we setup the table with the interrupt which were already set up.
+ */
+int __init sched_idle_init(void)
+{
+ struct irq_desc *desc;
+ unsigned int irq;
+ int ret;
+
+ /*
+ * Register the setup/free irq callbacks, so new interrupt or
+ * freed interrupt will update their tracking.
+ */
+ ret = register_irq_timings(&irqt_ops);
+ if (ret) {
+ pr_err("Failed to register timings ops\n");
+ return ret;
+ }
+
+ /*
+ * For all the irq already setup, assign the timing callback.
+ * All interrupts with their desc NULL will be discarded.
+ */
+ for_each_irq_desc(irq, desc)
+ sched_irq_timing_setup(irq, desc->action);
+
+ return 0;
+}
+late_initcall(sched_idle_init);
--
1.9.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-06 15:22 ` [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
@ 2016-01-06 17:40 ` Nicolas Pitre
2016-01-07 15:42 ` Daniel Lezcano
2016-01-10 22:37 ` Daniel Lezcano
2016-01-08 15:43 ` Thomas Gleixner
1 sibling, 2 replies; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-06 17:40 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On Wed, 6 Jan 2016, Daniel Lezcano wrote:
> Many IRQs are quiet most of the time, or they tend to come in bursts of
> fairly equal time intervals within each burst. It is therefore possible
> to detect those IRQs with stable intervals and guestimate when the next
> IRQ event is most likely to happen.
>
> Examples of such IRQs may include audio related IRQs where the FIFO size
> and/or DMA descriptor size with the sample rate create stable intervals,
> block devices during large data transfers, etc. Even network streaming
> of multimedia content creates patterns of periodic network interface IRQs
> in some cases.
>
> This patch adds code to track the mean interval and variance for each IRQ
> over a window of time intervals between IRQ events. Those statistics can
> be used to assist cpuidle in selecting the most appropriate sleep state
> by predicting the most likely time for the next interrupt.
>
> Because the stats are gathered in interrupt context, the core computation
> is as light as possible.
>
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
There are still a few problems with this patch.
Please see comments below.
> ---
> drivers/cpuidle/Kconfig | 5 +
> kernel/irq/Kconfig | 1 -
> kernel/sched/Makefile | 1 +
> kernel/sched/idle-sched.c | 549 ++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 555 insertions(+), 1 deletion(-)
> create mode 100644 kernel/sched/idle-sched.c
>
> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
> index 8c7930b..dd17215 100644
> --- a/drivers/cpuidle/Kconfig
> +++ b/drivers/cpuidle/Kconfig
> @@ -25,6 +25,11 @@ config CPU_IDLE_GOV_MENU
> bool "Menu governor (for tickless system)"
> default y
>
> +config CPU_IDLE_GOV_SCHED
> + bool "Sched idle governor"
> + select IRQ_TIMINGS
> + default y
> +
> config DT_IDLE_STATES
> bool
>
> diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
> index 1275fd1..81557ae 100644
> --- a/kernel/irq/Kconfig
> +++ b/kernel/irq/Kconfig
> @@ -75,7 +75,6 @@ config HANDLE_DOMAIN_IRQ
>
> config IRQ_TIMINGS
> bool
> - default y
>
> config IRQ_DOMAIN_DEBUG
> bool "Expose hardware/virtual IRQ mapping via debugfs"
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 6768797..f7d5a35 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
> obj-$(CONFIG_SCHEDSTATS) += stats.o
> obj-$(CONFIG_SCHED_DEBUG) += debug.o
> obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_IDLE_GOV_SCHED) += idle-sched.o
> diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c
> new file mode 100644
> index 0000000..a8e7557
> --- /dev/null
> +++ b/kernel/sched/idle-sched.c
> @@ -0,0 +1,549 @@
> +/*
> + * Copyright (C) 2016 Linaro Ltd, Daniel Lezcano <daniel.lezcano@linaro.org>
> + * Nicolas Pitre <nicolas.pitre@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + */
> +#include <linux/cpuidle.h>
> +#include <linux/interrupt.h>
> +#include <linux/irqdesc.h>
> +#include <linux/ktime.h>
> +#include <linux/slab.h>
> +#include <linux/tick.h>
> +#include <linux/time64.h>
> +
> +/*
> + * Define the number of samples over which the average and variance
> + * are computed. A power of 2 is preferred so to let the compiler
> + * optimize divisions by that number with simple arithmetic shifts.
> + */
> +#define STATS_NR_VALUES 4
> +
> +/**
> + * struct stats - internal structure to encapsulate stats informations
> + *
> + * @sum: sum of the values
> + * @values: array of values to do stats on
> + * @n: current number of values
> + * @w_ptr: current buffer pointer
> + */
> +struct stats {
> + u64 sum; /* sum of values */
> + u32 values[STATS_NR_VALUES]; /* array of values */
> + unsigned int w_ptr; /* current window pointer */
> +};
> +
> +/**
> + * struct wakeup - internal structure describing a source of wakeup
> + *
> + * @stats: the stats structure on the different event intervals
> + * @timestamp: latest update timestamp
> + */
> +struct wakeup {
> + struct stats stats;
> + ktime_t timestamp;
> +};
> +
> +/*
> + * Per cpu and irq statistics. Each cpu receives interrupts and those
> + * ones can be distributed following an irq chip specific
> + * algorithm. Random irq distribution is the worst case to predict
> + * interruption behavior but usually that does not happen or could be
> + * fixed from userspace by setting the irq affinity.
> + */
> +static DEFINE_PER_CPU(struct wakeup, *wakeups[NR_IRQS]);
> +
> +/**
> + * stats_add - add a new value in the statistic structure
> + *
> + * @s: the statistic structure
> + * @value: the new value to be added
> + *
> + * Adds the value to the array, if the array is full, the oldest value
> + * is replaced.
> + */
> +static void stats_add(int irq, struct stats *s, u32 value)
> +{
> + /*
> + * This is a circular buffer, so the oldest value is
> + * the next one in the buffer. Let's compute the next
> + * pointer to retrieve the oldest value and re-use it
> + * to update the w_ptr after adding the new value.
> + */
> + unsigned int w_ptr = s->w_ptr;
> +
> + /*
> + * Remove the oldest value from the summing. If this is the
> + * first time we go through this array slot, the previous
> + * value will be zero and we won't substract anything from the
> + * current sum. Hence this code relies on a zero-ed stat
> + * structure at init time via memset or kzalloc.
> + */
> + s->sum -= s->values[w_ptr];
> + s->values[w_ptr] = value;
> + s->w_ptr = (w_ptr + 1) % STATS_NR_VALUES;
> +
> + /*
> + * In order to reduce the overhead and to prevent value
> + * derivation due to the integer computation, we just sum the
> + * value and do the division when the average and the variance
> + * are requested.
> + */
> + s->sum += value;
> +}
> +
> +/**
> + * stats_reset - reset the stats
> + *
> + * @s: the statistic structure
> + *
> + * Reset the statistics and reset the values
> + */
> +static inline void stats_reset(struct stats *s)
> +{
> + memset(s, 0, sizeof(*s));
> +}
> +
> +/**
> + * stats_mean - compute the average
> + *
> + * @s: the statistics structure
> + *
> + * Returns an u32 corresponding to the mean value, or zero if there is
> + * no data
> + */
> +static inline u32 stats_mean(struct stats *s)
> +{
> + /*
> + * gcc is smart enough to convert to a bits shift when the
> + * divisor is constant and multiple of 2^x.
> + *
> + * The number of values could have not reached STATS_NR_VALUES
> + * yet, but we can consider it acceptable as the situation is
> + * only at the very first beginning in the system life cycle.
> + */
> + return s->sum / STATS_NR_VALUES;
> +}
> +
> +/**
> + * stats_variance - compute the variance
> + *
> + * @s: the statistic structure
> + *
> + * Returns an u64 corresponding to the variance, or zero if there is
> + * no data
> + */
> +static u64 stats_variance(struct stats *s, u32 mean)
> +{
> + int i;
> + u64 variance = 0;
> +
> + /*
> + * The variance is the sum of the squared difference to the
> + * average divided by the number of elements.
> + */
> + for (i = 0; i < STATS_NR_VALUES; i++) {
> + s32 diff = s->values[i] - mean;
> + variance += (u64)diff * diff;
Strictly speaking, diff should be casted to s64 as it is a signed value
that may actually be negative. Because of the strange C type promotion
rules, the generated code appears correct (at least on ARM), but it
would be clearer to use s64 anyway.
The product will end up being positive in all cases of course so
variance may remain as a u64.
> + }
> +
> + return variance / STATS_NR_VALUES;
> +}
> +
> +/**
> + * sched_idle_irq - irq timestamp callback
> + *
> + * @irq: the irq number
> + * @timestamp: when the interrupt occured
> + * @dev_id: device id for shared interrupt (not yet used)
> + * @data: private data given when registering this callback
> + *
> + * Interrupt callback called when an interrupt happens. This function
> + * is critical as it is called under an interrupt section: minimum
> + * operations as possible are done here:
> + */
> +static void sched_idle_irq(unsigned int irq, ktime_t timestamp,
> + void *dev_id, void *data)
> +{
> + u32 diff;
> + unsigned int cpu = raw_smp_processor_id();
> + struct wakeup *w = per_cpu(wakeups[irq], cpu);
> +
> + if (!w)
> + return;
> +
> + /*
> + * It is the first time the interrupt occurs of the series, we
> + * can't do any stats as we don't have an interval, just store
> + * the timestamp and exit.
> + */
> + if (ktime_equal(w->timestamp, ktime_set(0, 0))) {
> + w->timestamp = timestamp;
> + return;
> + }
> +
> + /*
> + * Microsec resolution is enough for our purpose.
> + */
> + diff = ktime_us_delta(timestamp, w->timestamp);
> + w->timestamp = timestamp;
> + stats_add(irq, &w->stats, diff);
> +}
> +
> +/*
> + * Callback to be called when an interrupt happens.
> + */
> +static struct irqtimings irq_timings = {
> + .handler = sched_idle_irq,
> +};
> +
> +static ktime_t next_irq_event(void)
> +{
> + unsigned int irq, cpu = raw_smp_processor_id();
> + ktime_t diff, next, min = { .tv64 = KTIME_MAX };
To avoid exposing the ktime_t definition, you should use
ktime_set(KTIME_SEC_MAX, 0) instead of { .tv64 = KTIME_MAX }.
> + struct wakeup *w;
> + u32 interval, mean;
> + u64 variance;
> +
> + /*
> + * Lookup the interrupt array for this cpu and search for the
> + * earlier expected interruption.
> + */
> + for (irq = 0; irq < NR_IRQS; irq++) {
> +
> + ktime_t now = ktime_get();
ktime_get() is potentially non-trivial. It should be plenty good enough
to call it only once outside the loop.
> + w = per_cpu(wakeups[irq], cpu);
> +
> + /*
> + * The interrupt was not setup as a source of a wakeup
> + * or the wakeup source is not considered at this
> + * moment stable enough to do a prediction.
> + */
> + if (!w)
> + continue;
> +
> + /*
> + * No statistics available yet.
> + */
> + if (ktime_equal(w->timestamp, ktime_set(0, 0)))
> + continue;
> +
> + diff = ktime_sub(now, w->timestamp);
> +
> + /*
> + * There is no point attempting predictions on interrupts more
> + * than 1 second apart. This has no benefit for sleep state
> + * selection and increases the risk of overflowing our variance
> + * computation. Reset all stats in that case.
> + */
> + if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
> + stats_reset(&w->stats);
> + continue;
> + }
The above is wrong. It is not computing the interval between successive
interruts but rather the interval between the last interrupt occurrence
and the present time (i.e. when we're about to go idle). This won't
prevent interrupt intervals greater than one second from being summed
and potentially overflowing the variance if this code is executed less
than a second after one such IRQ interval. This test should rather be
performed in sched_idle_irq().
> + * If the mean value is null, just ignore this wakeup
> + * source.
> + */
> + mean = stats_mean(&w->stats);
> + if (!mean)
> + continue;
> +
> + variance = stats_variance(&w->stats, mean);
> + /*
> + * We want to check the last interval is:
> + *
> + * mean - stddev < interval < mean + stddev
> + *
> + * That simplifies to:
> + *
> + * -stddev < interval - mean < stddev
> + *
> + * abs(interval - mean) < stddev
> + *
> + * The standard deviation is the sqrt of the variance:
> + *
> + * abs(interval - mean) < sqrt(variance)
> + *
> + * and we want to prevent to do an sqrt, so we square
> + * the equation:
> + *
> + * (interval - mean)^2 < variance
> + *
> + * So if the latest value of the stats complies with
> + * this condition, then the wakeup source is
> + * considered predictable and can be used to predict
> + * the next event.
> + */
> + interval = w->stats.values[(w->stats.w_ptr + 1) % STATS_NR_VALUES];
But here (w->stats.w_ptr + 1) % STATS_NR_VALUES does not point at the
last interval. It rather points at the second oldest.
To make things simpler, you should rather pre-increment the pointer
before updating the array in stats_add(), and here the value you want
will directly be accessible with w->stats.values[w->stats.w_ptr].
> + if ((u64)((interval - mean) * (interval - mean)) > variance)
> + continue;
Same comment as in stats_variance(): this should be s64.
> + /*
> + * Let's compute the next event: the wakeup source is
> + * considered predictable, we add the average interval
> + * time added to the latest interruption event time.
> + */
> + next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
> +
> + /*
> + * If the interrupt is supposed to happen before the
> + * minimum time, then it becomes the minimum.
> + */
> + if (ktime_before(next, min))
> + min = next;
> + }
> +
> + /*
> + * At this point, we have our prediction but the caller is
> + * expecting the remaining time before the next event, so
> + * compute the expected sleep length.
> + */
> + diff = ktime_sub(min, ktime_get());
You should use the variable 'now' rather than asking for the current
time again.
> + /*
> + * The result could be negative for different reasons:
> + * - the prediction is incorrect
> + * - the prediction was too near now and expired while we were
> + * in this function
> + *
> + * In both cases, we return KTIME_MAX as a failure to do a
> + * prediction
> + */
> + if (ktime_compare(diff, ktime_set(0, 0)) <= 0)
> + return (ktime_t){ .tv64 = KTIME_MAX };
See my comment about ktime_t internals at the beginning of this
function.
> +
> + return diff;
> +}
> +
> +/**
> + * sched_idle_next_wakeup - Predict the next wakeup on the current cpu
> + *
> + * The next event on the cpu is based on a statistic approach of the
> + * interrupt events and the timer deterministic value. From the timer
> + * or the irqs, we return the one expected to occur first.
> + *
> + * Returns the expected remaining idle time before being woken up by
> + * an interruption.
> + */
> +s64 sched_idle_next_wakeup(void)
> +{
> + s64 next_timer = ktime_to_us(tick_nohz_get_sleep_length());
> + s64 next_irq = ktime_to_us(next_irq_event());
> +
> + return min(next_irq, next_timer);
> +}
> +
> +/**
> + * sched_idle - go to idle for a specified amount of time
> + *
> + * @duration: the idle duration time
> + * @latency: the latency constraint
> + *
> + * Returns 0 on success, < 0 otherwise.
> + */
> +int sched_idle(s64 duration, unsigned int latency)
> +{
> + struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
> + struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> + struct cpuidle_state_usage *su;
> + struct cpuidle_state *s;
> + int i, ret = 0, index = -1;
> +
> + rcu_idle_enter();
> +
> + /*
> + * No cpuidle driver is available, let's use the default arch
> + * idle function.
> + */
> + if (cpuidle_not_available(drv, dev))
> + goto default_idle;
> +
> + /*
> + * Find the idle state with the lowest power while satisfying
> + * our constraints. We will save energy if the duration of the
> + * idle time is bigger than the target residency which is the
> + * break even point. The choice will be modulated by the
> + * latency.
> + */
> + for (i = 0; i < drv->state_count; i++) {
> +
> + s = &drv->states[i];
> +
> + su = &dev->states_usage[i];
> +
> + if (s->disabled || su->disable)
> + continue;
> + if (s->target_residency > duration)
> + continue;
> + if (s->exit_latency > latency)
> + continue;
> +
> + index = i;
> + }
> +
> + /*
> + * The idle task must be scheduled, it is pointless to go to
> + * idle, just re-enable the interrupt and return.
> + */
> + if (current_clr_polling_and_test()) {
> + local_irq_enable();
> + goto out;
> + }
> +
> + if (index < 0) {
> + /*
> + * No idle callbacks fulfilled the constraints, jump
> + * to the default function like there wasn't any
> + * cpuidle driver.
> + */
> + goto default_idle;
> + } else {
> + /*
> + * Enter the idle state previously returned by the
> + * governor decision. This function will block until
> + * an interrupt occurs and will take care of
> + * re-enabling the local interrupts
> + */
> + return cpuidle_enter(drv, dev, index);
> + }
> +
> +default_idle:
> + default_idle_call();
> +out:
> + rcu_idle_exit();
> + return ret;
> +}
> +
> +/**
> + * sched_irq_timing_free - free_irq callback
> + *
> + * @irq: the irq number to stop tracking
> + * @dev_id: not used at the moment
> + *
> + * This function will remove from the wakeup source prediction table.
> + */
> +static void sched_irq_timing_free(unsigned int irq, void *dev_id)
> +{
> + struct irq_desc *desc = irq_to_desc(irq);
> + struct wakeup *w;
> + unsigned int cpu;
> + unsigned long flags;
> +
> + raw_spin_lock_irqsave(&desc->lock, flags);
> + desc->timings = NULL;
> + raw_spin_unlock_irqrestore(&desc->lock, flags);
> +
> + for_each_possible_cpu(cpu) {
> + w = per_cpu(wakeups[irq], cpu);
> + if (!w)
> + continue;
> +
> + per_cpu(wakeups[irq], cpu) = NULL;
> + kfree(w);
> + }
> +}
> +
> +/**
> + * sched_irq_timing_setup - setup_irq callback
> + *
> + * @irq: the interrupt numbe to be tracked
> + * @act: the new irq action to be set to this interrupt
> + *
> + * Update the irq table to be tracked in order to predict the next event.
> + *
> + * Returns zero on success. On error it returns -ENOMEM.
> + */
> +static int sched_irq_timing_setup(unsigned int irq, struct irqaction *act)
> +{
> + struct irq_desc *desc = irq_to_desc(irq);
> + struct wakeup *w;
> + unsigned int cpu;
> + unsigned long flags;
> + int ret = -ENOMEM;
> +
> + /*
> + * No interrupt set for this descriptor or related to a timer.
> + * Timers are deterministic, so no need to try to do any
> + * prediction on them. No error for both cases, we are just not
> + * interested.
> + */
> + if (!act || (act->flags & __IRQF_TIMER))
> + return 0;
> +
> + /*
> + * Allocates the wakeup structure and the stats structure. As
> + * the interrupt can occur on any cpu, allocate the wakeup
> + * structure per cpu basis.
> + */
> + for_each_possible_cpu(cpu) {
> +
> + w = kzalloc(sizeof(*w), GFP_KERNEL);
> + if (!w)
> + goto undo;
> +
> + per_cpu(wakeups[irq], cpu) = w;
> + }
> +
> + raw_spin_lock_irqsave(&desc->lock, flags);
> + desc->timings = &irq_timings;
> + raw_spin_unlock_irqrestore(&desc->lock, flags);
> +
> + ret = 0;
> +undo:
> + /*
> + * Rollback all allocations on failure
> + */
> + if (ret)
> + sched_irq_timing_free(irq, act->dev_id);
> +
> + return ret;
> +}
> +
> +/*
> + * Setup/free irq callbacks
> + */
> +static struct irqtimings_ops irqt_ops = {
> + .setup = sched_irq_timing_setup,
> + .free = sched_irq_timing_free,
> +};
> +
> +/**
> + * sched_idle_init - setup the interrupt tracking table
> + *
> + * At init time, some interrupts could have been setup and in the
> + * system life time, some devices could be setup. In order to track
> + * all interrupts we are interested in, we first register a couple of
> + * callback to keep up-to-date the interrupt tracking table and then
> + * we setup the table with the interrupt which were already set up.
> + */
> +int __init sched_idle_init(void)
> +{
> + struct irq_desc *desc;
> + unsigned int irq;
> + int ret;
> +
> + /*
> + * Register the setup/free irq callbacks, so new interrupt or
> + * freed interrupt will update their tracking.
> + */
> + ret = register_irq_timings(&irqt_ops);
> + if (ret) {
> + pr_err("Failed to register timings ops\n");
> + return ret;
> + }
> +
> + /*
> + * For all the irq already setup, assign the timing callback.
> + * All interrupts with their desc NULL will be discarded.
> + */
> + for_each_irq_desc(irq, desc)
> + sched_irq_timing_setup(irq, desc->action);
> +
> + return 0;
> +}
> +late_initcall(sched_idle_init);
> --
> 1.9.1
>
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-06 17:40 ` Nicolas Pitre
@ 2016-01-07 15:42 ` Daniel Lezcano
2016-01-12 19:27 ` Nicolas Pitre
2016-01-10 22:37 ` Daniel Lezcano
1 sibling, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-07 15:42 UTC (permalink / raw)
To: Nicolas Pitre
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On 01/06/2016 06:40 PM, Nicolas Pitre wrote:
> On Wed, 6 Jan 2016, Daniel Lezcano wrote:
>
>> Many IRQs are quiet most of the time, or they tend to come in bursts of
>> fairly equal time intervals within each burst. It is therefore possible
>> to detect those IRQs with stable intervals and guestimate when the next
>> IRQ event is most likely to happen.
>>
>> Examples of such IRQs may include audio related IRQs where the FIFO size
>> and/or DMA descriptor size with the sample rate create stable intervals,
>> block devices during large data transfers, etc. Even network streaming
>> of multimedia content creates patterns of periodic network interface IRQs
>> in some cases.
>>
>> This patch adds code to track the mean interval and variance for each IRQ
>> over a window of time intervals between IRQ events. Those statistics can
>> be used to assist cpuidle in selecting the most appropriate sleep state
>> by predicting the most likely time for the next interrupt.
>>
>> Because the stats are gathered in interrupt context, the core computation
>> is as light as possible.
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>
> There are still a few problems with this patch.
>
> Please see comments below.
[ ... ]
>> +/**
>> + * stats_variance - compute the variance
>> + *
>> + * @s: the statistic structure
>> + *
>> + * Returns an u64 corresponding to the variance, or zero if there is
>> + * no data
>> + */
>> +static u64 stats_variance(struct stats *s, u32 mean)
>> +{
>> + int i;
>> + u64 variance = 0;
>> +
>> + /*
>> + * The variance is the sum of the squared difference to the
>> + * average divided by the number of elements.
>> + */
>> + for (i = 0; i < STATS_NR_VALUES; i++) {
>> + s32 diff = s->values[i] - mean;
>> + variance += (u64)diff * diff;
>
> Strictly speaking, diff should be casted to s64 as it is a signed value
> that may actually be negative. Because of the strange C type promotion
> rules, the generated code appears correct (at least on ARM), but it
> would be clearer to use s64 anyway.
I don't get the connection in your explanation of why it should be a
s64. It is already a signed s32, s->values[] are s32 and mean is u32.
What would be the benefit to convert diff to s64 ?
> The product will end up being positive in all cases of course so
> variance may remain as a u64.
>
[ ... ]
>> +static ktime_t next_irq_event(void)
>> +{
>> + unsigned int irq, cpu = raw_smp_processor_id();
>> + ktime_t diff, next, min = { .tv64 = KTIME_MAX };
>
> To avoid exposing the ktime_t definition, you should use
> ktime_set(KTIME_SEC_MAX, 0) instead of { .tv64 = KTIME_MAX }.
Ok.
>> + struct wakeup *w;
>> + u32 interval, mean;
>> + u64 variance;
>> +
>> + /*
>> + * Lookup the interrupt array for this cpu and search for the
>> + * earlier expected interruption.
>> + */
>> + for (irq = 0; irq < NR_IRQS; irq++) {
>> +
>> + ktime_t now = ktime_get();
>
> ktime_get() is potentially non-trivial. It should be plenty good enough
> to call it only once outside the loop.
Ok.
>> + w = per_cpu(wakeups[irq], cpu);
>> +
>> + /*
>> + * The interrupt was not setup as a source of a wakeup
>> + * or the wakeup source is not considered at this
>> + * moment stable enough to do a prediction.
>> + */
>> + if (!w)
>> + continue;
>> +
>> + /*
>> + * No statistics available yet.
>> + */
>> + if (ktime_equal(w->timestamp, ktime_set(0, 0)))
>> + continue;
>> +
>> + diff = ktime_sub(now, w->timestamp);
>> +
>> + /*
>> + * There is no point attempting predictions on interrupts more
>> + * than 1 second apart. This has no benefit for sleep state
>> + * selection and increases the risk of overflowing our variance
>> + * computation. Reset all stats in that case.
>> + */
>> + if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
>> + stats_reset(&w->stats);
>> + continue;
>> + }
>
> The above is wrong. It is not computing the interval between successive
> interruts but rather the interval between the last interrupt occurrence
> and the present time (i.e. when we're about to go idle). This won't
> prevent interrupt intervals greater than one second from being summed
> and potentially overflowing the variance if this code is executed less
> than a second after one such IRQ interval. This test should rather be
> performed in sched_idle_irq().
Ok, I will move it.
>> + * If the mean value is null, just ignore this wakeup
>> + * source.
>> + */
>> + mean = stats_mean(&w->stats);
>> + if (!mean)
>> + continue;
>> +
>> + variance = stats_variance(&w->stats, mean);
>> + /*
>> + * We want to check the last interval is:
>> + *
>> + * mean - stddev < interval < mean + stddev
>> + *
>> + * That simplifies to:
>> + *
>> + * -stddev < interval - mean < stddev
>> + *
>> + * abs(interval - mean) < stddev
>> + *
>> + * The standard deviation is the sqrt of the variance:
>> + *
>> + * abs(interval - mean) < sqrt(variance)
>> + *
>> + * and we want to prevent to do an sqrt, so we square
>> + * the equation:
>> + *
>> + * (interval - mean)^2 < variance
>> + *
>> + * So if the latest value of the stats complies with
>> + * this condition, then the wakeup source is
>> + * considered predictable and can be used to predict
>> + * the next event.
>> + */
>> + interval = w->stats.values[(w->stats.w_ptr + 1) % STATS_NR_VALUES];
>
> But here (w->stats.w_ptr + 1) % STATS_NR_VALUES does not point at the
> last interval. It rather points at the second oldest.
> To make things simpler, you should rather pre-increment the pointer
> before updating the array in stats_add(), and here the value you want
> will directly be accessible with w->stats.values[w->stats.w_ptr].
Yeah, there is a glitch here. I will look at it.
>> + if ((u64)((interval - mean) * (interval - mean)) > variance)
>> + continue;
>
> Same comment as in stats_variance(): this should be s64.
>
>> + /*
>> + * Let's compute the next event: the wakeup source is
>> + * considered predictable, we add the average interval
>> + * time added to the latest interruption event time.
>> + */
>> + next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
>> +
>> + /*
>> + * If the interrupt is supposed to happen before the
>> + * minimum time, then it becomes the minimum.
>> + */
>> + if (ktime_before(next, min))
>> + min = next;
>> + }
>> +
>> + /*
>> + * At this point, we have our prediction but the caller is
>> + * expecting the remaining time before the next event, so
>> + * compute the expected sleep length.
>> + */
>> + diff = ktime_sub(min, ktime_get());
>
> You should use the variable 'now' rather than asking for the current
> time again.
Yep.
>> + /*
>> + * The result could be negative for different reasons:
>> + * - the prediction is incorrect
>> + * - the prediction was too near now and expired while we were
>> + * in this function
>> + *
>> + * In both cases, we return KTIME_MAX as a failure to do a
>> + * prediction
>> + */
>> + if (ktime_compare(diff, ktime_set(0, 0)) <= 0)
>> + return (ktime_t){ .tv64 = KTIME_MAX };
>
> See my comment about ktime_t internals at the beginning of this
> function.
Ok.
Thanks for the review.
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 1/2] irq: Add a framework to measure interrupt timings
2016-01-06 15:22 ` [RFC PATCH 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
@ 2016-01-08 15:31 ` Thomas Gleixner
2016-01-12 11:42 ` Daniel Lezcano
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-08 15:31 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, 6 Jan 2016, Daniel Lezcano wrote:
> +#ifdef CONFIG_IRQ_TIMINGS
> +/**
> + * timing handler to be called when an interrupt happens
> + */
> +typedef void (*irqt_handler_t)(unsigned int, ktime_t, void *, void *);
> +
> +/**
> + * struct irqtimings - per interrupt irq timings descriptor
> + * @handler: interrupt handler timings function
> + * @data: pointer to the private data to be passed to the handler
> + * @timestamp: latest interruption occurence
There is no timestamp member.
> + */
> +struct irqtimings {
> + irqt_handler_t handler;
> + void *data;
What's that data thingy for. The proposed user does not use it at all and I
have no idea why any user would want it. All it does is provide another level
of indirection in the hotpath.
> +} ____cacheline_internodealigned_in_smp;
> +/**
> + * struct irqt_ops - structure to be used by the subsystem to call the
> + * register and unregister ops when an irq is setup or freed.
> + * @setup: registering callback
> + * @free: unregistering callback
> + *
> + * The callbacks assumes the lock is held on the irq desc
Crap. It's called outside of the locked region and the proposed user locks the
descriptor itself, but that's a different story.
> +static inline void free_irq_timings(unsigned int irq, void *dev_id)
> +{
> + ;
What's the purpose of this semicolon?
> +#ifdef CONFIG_IRQ_TIMINGS
> +void handle_irqt_event(struct irqtimings *irqt, struct irqaction *action)
static ?
> +{
> + if (irqt)
This want's to use a static key.
> + irqt->handler(action->irq, ktime_get(),
> + action->dev_id, irqt->data);
> +}
> +#else
> +#define handle_irqt_event(a, b)
static inline stub if at all.
> +#ifdef CONFIG_IRQ_TIMINGS
> +/*
> + * Global variable, only used by accessor functions, currently only
> + * one user is allowed ...
That variable is static not global. And what the heck means:
> ... and it is up to the caller to make sure to
> + * setup the irq timings which are already setup.
-ENOPARSE.
> + */
> +static struct irqtimings_ops *irqtimings_ops;
> +
> +/**
> + * register_irq_timings - register the ops when an irq is setup or freed
> + *
> + * @ops: the register/unregister ops to be called when at setup or
> + * free time
> + *
> + * Returns -EBUSY if the slot is already in use, zero on success.
> + */
> +int register_irq_timings(struct irqtimings_ops *ops)
> +{
> + if (irqtimings_ops)
> + return -EBUSY;
> +
> + irqtimings_ops = ops;
> +
> + return 0;
> +}
> +
> +/**
> + * setup_irq_timings - call the timing register callback
> + *
> + * @desc: an irq desc structure
The argument list tells a different story.
> + *
> + * Returns -EINVAL in case of error, zero on success.
> + */
> +int setup_irq_timings(unsigned int irq, struct irqaction *act)
static is not in your book, right? These functions are only used in this file,
so no point for having them global visible and the stubs should be local as
well.
> +{
> + if (irqtimings_ops && irqtimings_ops->setup)
> + return irqtimings_ops->setup(irq, act);
> + return 0;
> +}
...
> @@ -1408,6 +1469,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>
> unregister_handler_proc(irq, action);
>
> + free_irq_timings(irq, dev_id);
This needs to go to the point where the interrupt is already synchronized and
the action about to be destroyed.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-06 15:22 ` [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
2016-01-06 17:40 ` Nicolas Pitre
@ 2016-01-08 15:43 ` Thomas Gleixner
2016-01-12 12:41 ` Daniel Lezcano
2016-01-18 13:21 ` Daniel Lezcano
1 sibling, 2 replies; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-08 15:43 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, 6 Jan 2016, Daniel Lezcano wrote:
> +/**
> + * sched_irq_timing_free - free_irq callback
> + *
> + * @irq: the irq number to stop tracking
> + * @dev_id: not used at the moment
> + *
> + * This function will remove from the wakeup source prediction table.
> + */
> +static void sched_irq_timing_free(unsigned int irq, void *dev_id)
> +{
> + struct irq_desc *desc = irq_to_desc(irq);
What has this code to fiddle with irq_desc? Nothing!
> + raw_spin_lock_irqsave(&desc->lock, flags);
> + desc->timings = &irq_timings;
> + raw_spin_unlock_irqrestore(&desc->lock, flags);
Bah, no. This belongs into the core code. Add that handler - and yes, that's
all you need - to your timing_ops and let the core code do it in a proper way.
> +/**
> + * sched_idle_init - setup the interrupt tracking table
> + *
> + * At init time, some interrupts could have been setup and in the
> + * system life time, some devices could be setup. In order to track
> + * all interrupts we are interested in, we first register a couple of
> + * callback to keep up-to-date the interrupt tracking table and then
> + * we setup the table with the interrupt which were already set up.
> + */
> +int __init sched_idle_init(void)
> +{
> + struct irq_desc *desc;
> + unsigned int irq;
> + int ret;
> +
> + /*
> + * Register the setup/free irq callbacks, so new interrupt or
> + * freed interrupt will update their tracking.
> + */
> + ret = register_irq_timings(&irqt_ops);
> + if (ret) {
> + pr_err("Failed to register timings ops\n");
> + return ret;
> + }
So that stuff is installed unconditionally. Is it used unconditionally as
well?
> + /*
> + * For all the irq already setup, assign the timing callback.
> + * All interrupts with their desc NULL will be discarded.
> + */
> + for_each_irq_desc(irq, desc)
> + sched_irq_timing_setup(irq, desc->action);
No, no, no. This belongs into the core code register_irq_timings() function
which installs the handler into the irq descs with the proper protections and
once it has done that enables the static key.
The above is completely unprotected against interrupts being setup or even
freed concurrently.
Aside of that, you call that setup function in setup_irq for each action() and
here you call it only for the first one.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-06 17:40 ` Nicolas Pitre
2016-01-07 15:42 ` Daniel Lezcano
@ 2016-01-10 22:37 ` Daniel Lezcano
2016-01-10 22:46 ` Nicolas Pitre
1 sibling, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-10 22:37 UTC (permalink / raw)
To: Nicolas Pitre
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On 01/06/2016 06:40 PM, Nicolas Pitre wrote:
> On Wed, 6 Jan 2016, Daniel Lezcano wrote:
>
>> Many IRQs are quiet most of the time, or they tend to come in bursts of
>> fairly equal time intervals within each burst. It is therefore possible
>> to detect those IRQs with stable intervals and guestimate when the next
>> IRQ event is most likely to happen.
>>
>> Examples of such IRQs may include audio related IRQs where the FIFO size
>> and/or DMA descriptor size with the sample rate create stable intervals,
>> block devices during large data transfers, etc. Even network streaming
>> of multimedia content creates patterns of periodic network interface IRQs
>> in some cases.
>>
>> This patch adds code to track the mean interval and variance for each IRQ
>> over a window of time intervals between IRQ events. Those statistics can
>> be used to assist cpuidle in selecting the most appropriate sleep state
>> by predicting the most likely time for the next interrupt.
>>
>> Because the stats are gathered in interrupt context, the core computation
>> is as light as possible.
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
[ ... ]
>> +
>> + diff = ktime_sub(now, w->timestamp);
>> +
>> + /*
>> + * There is no point attempting predictions on interrupts more
>> + * than 1 second apart. This has no benefit for sleep state
>> + * selection and increases the risk of overflowing our variance
>> + * computation. Reset all stats in that case.
>> + */
>> + if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
>> + stats_reset(&w->stats);
>> + continue;
>> + }
>
> The above is wrong. It is not computing the interval between successive
> interruts but rather the interval between the last interrupt occurrence
> and the present time (i.e. when we're about to go idle). This won't
> prevent interrupt intervals greater than one second from being summed
> and potentially overflowing the variance if this code is executed less
> than a second after one such IRQ interval. This test should rather be
> performed in sched_idle_irq().
Hi Nico,
I have been through here again and think we should duplicate the test
because there are two cases:
1. We did not go idle and the interval measured in sched_idle_irq is
more than one second, then the stats are reset. I suggest to use an
approximation of one second: (diff < (1 << 20)) as we are in the fast
path.
2. We are going idle and the latest interrupt happened one second apart
from now. So we keep the current test.
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-10 22:37 ` Daniel Lezcano
@ 2016-01-10 22:46 ` Nicolas Pitre
2016-01-10 22:58 ` Daniel Lezcano
0 siblings, 1 reply; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-10 22:46 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On Sun, 10 Jan 2016, Daniel Lezcano wrote:
> On 01/06/2016 06:40 PM, Nicolas Pitre wrote:
> > On Wed, 6 Jan 2016, Daniel Lezcano wrote:
> >
> > > Many IRQs are quiet most of the time, or they tend to come in bursts of
> > > fairly equal time intervals within each burst. It is therefore possible
> > > to detect those IRQs with stable intervals and guestimate when the next
> > > IRQ event is most likely to happen.
> > >
> > > Examples of such IRQs may include audio related IRQs where the FIFO size
> > > and/or DMA descriptor size with the sample rate create stable intervals,
> > > block devices during large data transfers, etc. Even network streaming
> > > of multimedia content creates patterns of periodic network interface IRQs
> > > in some cases.
> > >
> > > This patch adds code to track the mean interval and variance for each IRQ
> > > over a window of time intervals between IRQ events. Those statistics can
> > > be used to assist cpuidle in selecting the most appropriate sleep state
> > > by predicting the most likely time for the next interrupt.
> > >
> > > Because the stats are gathered in interrupt context, the core computation
> > > is as light as possible.
> > >
> > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>
> [ ... ]
>
> > > +
> > > + diff = ktime_sub(now, w->timestamp);
> > > +
> > > + /*
> > > + * There is no point attempting predictions on interrupts more
> > > + * than 1 second apart. This has no benefit for sleep state
> > > + * selection and increases the risk of overflowing our
> > > variance
> > > + * computation. Reset all stats in that case.
> > > + */
> > > + if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
> > > + stats_reset(&w->stats);
> > > + continue;
> > > + }
> >
> > The above is wrong. It is not computing the interval between successive
> > interruts but rather the interval between the last interrupt occurrence
> > and the present time (i.e. when we're about to go idle). This won't
> > prevent interrupt intervals greater than one second from being summed
> > and potentially overflowing the variance if this code is executed less
> > than a second after one such IRQ interval. This test should rather be
> > performed in sched_idle_irq().
>
> Hi Nico,
>
> I have been through here again and think we should duplicate the test because
> there are two cases:
>
> 1. We did not go idle and the interval measured in sched_idle_irq is more than
> one second, then the stats are reset. I suggest to use an approximation of one
> second: (diff < (1 << 20)) as we are in the fast
> path.
>
> 2. We are going idle and the latest interrupt happened one second apart from
> now. So we keep the current test.
You don't need the current test if the interval is already limited
earlier on. Predictions that would otherwise trip that test will target
a time in the past and be discarded.
Nicolas
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-10 22:46 ` Nicolas Pitre
@ 2016-01-10 22:58 ` Daniel Lezcano
2016-01-10 23:13 ` Nicolas Pitre
0 siblings, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-10 22:58 UTC (permalink / raw)
To: Nicolas Pitre
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On 01/10/2016 11:46 PM, Nicolas Pitre wrote:
> On Sun, 10 Jan 2016, Daniel Lezcano wrote:
>
>> On 01/06/2016 06:40 PM, Nicolas Pitre wrote:
>>> On Wed, 6 Jan 2016, Daniel Lezcano wrote:
>>>
>>>> Many IRQs are quiet most of the time, or they tend to come in bursts of
>>>> fairly equal time intervals within each burst. It is therefore possible
>>>> to detect those IRQs with stable intervals and guestimate when the next
>>>> IRQ event is most likely to happen.
>>>>
>>>> Examples of such IRQs may include audio related IRQs where the FIFO size
>>>> and/or DMA descriptor size with the sample rate create stable intervals,
>>>> block devices during large data transfers, etc. Even network streaming
>>>> of multimedia content creates patterns of periodic network interface IRQs
>>>> in some cases.
>>>>
>>>> This patch adds code to track the mean interval and variance for each IRQ
>>>> over a window of time intervals between IRQ events. Those statistics can
>>>> be used to assist cpuidle in selecting the most appropriate sleep state
>>>> by predicting the most likely time for the next interrupt.
>>>>
>>>> Because the stats are gathered in interrupt context, the core computation
>>>> is as light as possible.
>>>>
>>>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>>
>> [ ... ]
>>
>>>> +
>>>> + diff = ktime_sub(now, w->timestamp);
>>>> +
>>>> + /*
>>>> + * There is no point attempting predictions on interrupts more
>>>> + * than 1 second apart. This has no benefit for sleep state
>>>> + * selection and increases the risk of overflowing our
>>>> variance
>>>> + * computation. Reset all stats in that case.
>>>> + */
>>>> + if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
>>>> + stats_reset(&w->stats);
>>>> + continue;
>>>> + }
>>>
>>> The above is wrong. It is not computing the interval between successive
>>> interruts but rather the interval between the last interrupt occurrence
>>> and the present time (i.e. when we're about to go idle). This won't
>>> prevent interrupt intervals greater than one second from being summed
>>> and potentially overflowing the variance if this code is executed less
>>> than a second after one such IRQ interval. This test should rather be
>>> performed in sched_idle_irq().
>>
>> Hi Nico,
>>
>> I have been through here again and think we should duplicate the test because
>> there are two cases:
>>
>> 1. We did not go idle and the interval measured in sched_idle_irq is more than
>> one second, then the stats are reset. I suggest to use an approximation of one
>> second: (diff < (1 << 20)) as we are in the fast
>> path.
>>
>> 2. We are going idle and the latest interrupt happened one second apart from
>> now. So we keep the current test.
>
> You don't need the current test if the interval is already limited
> earlier on. Predictions that would otherwise trip that test will target
> a time in the past and be discarded.
Yes, but that wake up source should be discarded in the process of the
selection, so ignored in the loop, otherwise it can end up as the next
event (which is obviously wrong) and discarded at the end by returning
KTIME_MAX, instead of giving the opportunity to find another interrupt
as the next event with a greater value.
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-10 22:58 ` Daniel Lezcano
@ 2016-01-10 23:13 ` Nicolas Pitre
0 siblings, 0 replies; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-10 23:13 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On Sun, 10 Jan 2016, Daniel Lezcano wrote:
> On 01/10/2016 11:46 PM, Nicolas Pitre wrote:
> > On Sun, 10 Jan 2016, Daniel Lezcano wrote:
> >
> > > On 01/06/2016 06:40 PM, Nicolas Pitre wrote:
> > > > On Wed, 6 Jan 2016, Daniel Lezcano wrote:
> > > >
> > > > > Many IRQs are quiet most of the time, or they tend to come in bursts
> > > > > of
> > > > > fairly equal time intervals within each burst. It is therefore
> > > > > possible
> > > > > to detect those IRQs with stable intervals and guestimate when the
> > > > > next
> > > > > IRQ event is most likely to happen.
> > > > >
> > > > > Examples of such IRQs may include audio related IRQs where the FIFO
> > > > > size
> > > > > and/or DMA descriptor size with the sample rate create stable
> > > > > intervals,
> > > > > block devices during large data transfers, etc. Even network
> > > > > streaming
> > > > > of multimedia content creates patterns of periodic network interface
> > > > > IRQs
> > > > > in some cases.
> > > > >
> > > > > This patch adds code to track the mean interval and variance for each
> > > > > IRQ
> > > > > over a window of time intervals between IRQ events. Those statistics
> > > > > can
> > > > > be used to assist cpuidle in selecting the most appropriate sleep
> > > > > state
> > > > > by predicting the most likely time for the next interrupt.
> > > > >
> > > > > Because the stats are gathered in interrupt context, the core
> > > > > computation
> > > > > is as light as possible.
> > > > >
> > > > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> > >
> > > [ ... ]
> > >
> > > > > +
> > > > > + diff = ktime_sub(now, w->timestamp);
> > > > > +
> > > > > + /*
> > > > > + * There is no point attempting predictions on
> > > > > interrupts more
> > > > > + * than 1 second apart. This has no benefit for sleep
> > > > > state
> > > > > + * selection and increases the risk of overflowing our
> > > > > variance
> > > > > + * computation. Reset all stats in that case.
> > > > > + */
> > > > > + if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
> > > > > + stats_reset(&w->stats);
> > > > > + continue;
> > > > > + }
> > > >
> > > > The above is wrong. It is not computing the interval between successive
> > > > interruts but rather the interval between the last interrupt occurrence
> > > > and the present time (i.e. when we're about to go idle). This won't
> > > > prevent interrupt intervals greater than one second from being summed
> > > > and potentially overflowing the variance if this code is executed less
> > > > than a second after one such IRQ interval. This test should rather be
> > > > performed in sched_idle_irq().
> > >
> > > Hi Nico,
> > >
> > > I have been through here again and think we should duplicate the test
> > > because
> > > there are two cases:
> > >
> > > 1. We did not go idle and the interval measured in sched_idle_irq is more
> > > than
> > > one second, then the stats are reset. I suggest to use an approximation of
> > > one
> > > second: (diff < (1 << 20)) as we are in the fast
> > > path.
> > >
> > > 2. We are going idle and the latest interrupt happened one second apart
> > > from
> > > now. So we keep the current test.
> >
> > You don't need the current test if the interval is already limited
> > earlier on. Predictions that would otherwise trip that test will target
> > a time in the past and be discarded.
>
> Yes, but that wake up source should be discarded in the process of the
> selection, so ignored in the loop, otherwise it can end up as the next event
> (which is obviously wrong) and discarded at the end by returning KTIME_MAX,
> instead of giving the opportunity to find another interrupt as the next event
> with a greater value.
The loop should always discard any prediction for a past time and move
to the next, not only at the end.
If interrupt stats are not gathered if an interval is greater than one
second, that means no prediction will ever be more than one second away
from the last IRQ occurrence. If you ask for a prediction one second
after the last IRQ then those predictions will all be in the past.
Nicolas
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 1/2] irq: Add a framework to measure interrupt timings
2016-01-08 15:31 ` Thomas Gleixner
@ 2016-01-12 11:42 ` Daniel Lezcano
0 siblings, 0 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-12 11:42 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
Hi Thomas,
thanks for taking some time to review the patches.
On 01/08/2016 04:31 PM, Thomas Gleixner wrote:
> On Wed, 6 Jan 2016, Daniel Lezcano wrote:
>> +#ifdef CONFIG_IRQ_TIMINGS
>> +/**
>> + * timing handler to be called when an interrupt happens
>> + */
>> +typedef void (*irqt_handler_t)(unsigned int, ktime_t, void *, void *);
>> +
>> +/**
>> + * struct irqtimings - per interrupt irq timings descriptor
>> + * @handler: interrupt handler timings function
>> + * @data: pointer to the private data to be passed to the handler
>> + * @timestamp: latest interruption occurence
>
> There is no timestamp member.
>
>> + */
>> +struct irqtimings {
>> + irqt_handler_t handler;
>> + void *data;
>
> What's that data thingy for. The proposed user does not use it at all and I
> have no idea why any user would want it. All it does is provide another level
> of indirection in the hotpath.
Yes, I agree. I added this private_data field for future use in case it
would be needed but it does not make sense now.
>> +} ____cacheline_internodealigned_in_smp;
>
>> +/**
>> + * struct irqt_ops - structure to be used by the subsystem to call the
>> + * register and unregister ops when an irq is setup or freed.
>> + * @setup: registering callback
>> + * @free: unregistering callback
>> + *
>> + * The callbacks assumes the lock is held on the irq desc
>
> Crap. It's called outside of the locked region and the proposed user locks the
> descriptor itself, but that's a different story.
>
>> +static inline void free_irq_timings(unsigned int irq, void *dev_id)
>> +{
>> + ;
>
> What's the purpose of this semicolon?
Bah, old habit. I will remove it.
>> +#ifdef CONFIG_IRQ_TIMINGS
>> +void handle_irqt_event(struct irqtimings *irqt, struct irqaction *action)
>
> static ?
>
>> +{
>> + if (irqt)
>
> This want's to use a static key.
Ok, I will look at that. I already used static keys to disable a portion
of code from sysfs but never this way.
>> + irqt->handler(action->irq, ktime_get(),
>> + action->dev_id, irqt->data);
>> +}
>> +#else
>> +#define handle_irqt_event(a, b)
>
> static inline stub if at all.
ok.
>> +#ifdef CONFIG_IRQ_TIMINGS
>> +/*
>> + * Global variable, only used by accessor functions, currently only
>> + * one user is allowed ...
>
> That variable is static not global. And what the heck means:
>
>> ... and it is up to the caller to make sure to
>> + * setup the irq timings which are already setup.
>
> -ENOPARSE.
hmm , yes ... it is not clear :)
I should have say:
"... and it is up to the caller to register the irq timing callback for
the interrupts which are already setup."
>> + */
>> +static struct irqtimings_ops *irqtimings_ops;
>> +
>> +/**
>> + * register_irq_timings - register the ops when an irq is setup or freed
>> + *
>> + * @ops: the register/unregister ops to be called when at setup or
>> + * free time
>> + *
>> + * Returns -EBUSY if the slot is already in use, zero on success.
>> + */
>> +int register_irq_timings(struct irqtimings_ops *ops)
>> +{
>> + if (irqtimings_ops)
>> + return -EBUSY;
>> +
>> + irqtimings_ops = ops;
>> +
>> + return 0;
>> +}
>> +
>> +/**
>> + * setup_irq_timings - call the timing register callback
>> + *
>> + * @desc: an irq desc structure
>
> The argument list tells a different story.
>
>> + *
>> + * Returns -EINVAL in case of error, zero on success.
>> + */
>> +int setup_irq_timings(unsigned int irq, struct irqaction *act)
>
> static is not in your book, right? These functions are only used in this file,
> so no point for having them global visible and the stubs should be local as
> well.
Ok.
>> +{
>> + if (irqtimings_ops && irqtimings_ops->setup)
>> + return irqtimings_ops->setup(irq, act);
>> + return 0;
>> +}
>
> ...
>
>> @@ -1408,6 +1469,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
>>
>> unregister_handler_proc(irq, action);
>>
>> + free_irq_timings(irq, dev_id);
>
> This needs to go to the point where the interrupt is already synchronized and
> the action about to be destroyed.
Ok, noted.
Thanks !
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-08 15:43 ` Thomas Gleixner
@ 2016-01-12 12:41 ` Daniel Lezcano
2016-01-12 13:42 ` Thomas Gleixner
2016-01-18 13:21 ` Daniel Lezcano
1 sibling, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-12 12:41 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/08/2016 04:43 PM, Thomas Gleixner wrote:
> On Wed, 6 Jan 2016, Daniel Lezcano wrote:
>> +/**
>> + * sched_irq_timing_free - free_irq callback
>> + *
>> + * @irq: the irq number to stop tracking
>> + * @dev_id: not used at the moment
>> + *
>> + * This function will remove from the wakeup source prediction table.
>> + */
>> +static void sched_irq_timing_free(unsigned int irq, void *dev_id)
>> +{
>> + struct irq_desc *desc = irq_to_desc(irq);
>
> What has this code to fiddle with irq_desc? Nothing!
>
>> + raw_spin_lock_irqsave(&desc->lock, flags);
>> + desc->timings = &irq_timings;
>> + raw_spin_unlock_irqrestore(&desc->lock, flags);
>
> Bah, no. This belongs into the core code. Add that handler - and yes, that's
> all you need - to your timing_ops and let the core code do it in a proper way.
Ah, yes, you are right. That will be much more cleaner.
>> +/**
>> + * sched_idle_init - setup the interrupt tracking table
>> + *
>> + * At init time, some interrupts could have been setup and in the
>> + * system life time, some devices could be setup. In order to track
>> + * all interrupts we are interested in, we first register a couple of
>> + * callback to keep up-to-date the interrupt tracking table and then
>> + * we setup the table with the interrupt which were already set up.
>> + */
>> +int __init sched_idle_init(void)
>> +{
>> + struct irq_desc *desc;
>> + unsigned int irq;
>> + int ret;
>> +
>> + /*
>> + * Register the setup/free irq callbacks, so new interrupt or
>> + * freed interrupt will update their tracking.
>> + */
>> + ret = register_irq_timings(&irqt_ops);
>> + if (ret) {
>> + pr_err("Failed to register timings ops\n");
>> + return ret;
>> + }
>
> So that stuff is installed unconditionally. Is it used unconditionally as
> well?
Sorry, I am not sure to understand your question. If the kernel is
compiled with CONFIG_CPU_IDLE_GOV_SCHED=y, this code is enabled and use
the irq timings. The condition comes from the compilation option.
Is that your question ?
>> + /*
>> + * For all the irq already setup, assign the timing callback.
>> + * All interrupts with their desc NULL will be discarded.
>> + */
>> + for_each_irq_desc(irq, desc)
>> + sched_irq_timing_setup(irq, desc->action);
>
> No, no, no. This belongs into the core code register_irq_timings() function
> which installs the handler into the irq descs with the proper protections and
> once it has done that enables the static key.
Ok.
> The above is completely unprotected against interrupts being setup or even
> freed concurrently.
Ok, I will do the change.
> Aside of that, you call that setup function in setup_irq for each action() and
> here you call it only for the first one.
Ah, right.
Thanks for the review.
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-12 12:41 ` Daniel Lezcano
@ 2016-01-12 13:42 ` Thomas Gleixner
2016-01-12 14:16 ` Daniel Lezcano
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-12 13:42 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Tue, 12 Jan 2016, Daniel Lezcano wrote:
> On 01/08/2016 04:43 PM, Thomas Gleixner wrote:
> > > + /*
> > > + * Register the setup/free irq callbacks, so new interrupt or
> > > + * freed interrupt will update their tracking.
> > > + */
> > > + ret = register_irq_timings(&irqt_ops);
> > > + if (ret) {
> > > + pr_err("Failed to register timings ops\n");
> > > + return ret;
> > > + }
> >
> > So that stuff is installed unconditionally. Is it used unconditionally as
> > well?
>
> Sorry, I am not sure to understand your question. If the kernel is compiled
> with CONFIG_CPU_IDLE_GOV_SCHED=y, this code is enabled and use the irq
> timings. The condition comes from the compilation option.
The question is whether the option also activates that thing or is there still
some /sys/whatever/idlegov magic where you can (de)select it.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-12 13:42 ` Thomas Gleixner
@ 2016-01-12 14:16 ` Daniel Lezcano
2016-01-12 14:26 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-12 14:16 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/12/2016 02:42 PM, Thomas Gleixner wrote:
> On Tue, 12 Jan 2016, Daniel Lezcano wrote:
>> On 01/08/2016 04:43 PM, Thomas Gleixner wrote:
>>>> + /*
>>>> + * Register the setup/free irq callbacks, so new interrupt or
>>>> + * freed interrupt will update their tracking.
>>>> + */
>>>> + ret = register_irq_timings(&irqt_ops);
>>>> + if (ret) {
>>>> + pr_err("Failed to register timings ops\n");
>>>> + return ret;
>>>> + }
>>>
>>> So that stuff is installed unconditionally. Is it used unconditionally as
>>> well?
>>
>> Sorry, I am not sure to understand your question. If the kernel is compiled
>> with CONFIG_CPU_IDLE_GOV_SCHED=y, this code is enabled and use the irq
>> timings. The condition comes from the compilation option.
>
> The question is whether the option also activates that thing or is there still
> some /sys/whatever/idlegov magic where you can (de)select it.
Yes, in the next patches of the series I did not send, we can switch to
the cpuidle's governor framework or idle-sched. I will look at how to
disable it when switching to the cpuidle's governors.
Thanks
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-12 14:16 ` Daniel Lezcano
@ 2016-01-12 14:26 ` Thomas Gleixner
2016-01-12 14:52 ` Daniel Lezcano
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-12 14:26 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Tue, 12 Jan 2016, Daniel Lezcano wrote:
> On 01/12/2016 02:42 PM, Thomas Gleixner wrote:
> > On Tue, 12 Jan 2016, Daniel Lezcano wrote:
> > > On 01/08/2016 04:43 PM, Thomas Gleixner wrote:
> > > > > + /*
> > > > > + * Register the setup/free irq callbacks, so new interrupt or
> > > > > + * freed interrupt will update their tracking.
> > > > > + */
> > > > > + ret = register_irq_timings(&irqt_ops);
> > > > > + if (ret) {
> > > > > + pr_err("Failed to register timings ops\n");
> > > > > + return ret;
> > > > > + }
> > > >
> > > > So that stuff is installed unconditionally. Is it used unconditionally
> > > > as
> > > > well?
> > >
> > > Sorry, I am not sure to understand your question. If the kernel is
> > > compiled
> > > with CONFIG_CPU_IDLE_GOV_SCHED=y, this code is enabled and use the irq
> > > timings. The condition comes from the compilation option.
> >
> > The question is whether the option also activates that thing or is there
> > still
> > some /sys/whatever/idlegov magic where you can (de)select it.
>
> Yes, in the next patches of the series I did not send, we can switch to the
> cpuidle's governor framework or idle-sched. I will look at how to disable it
> when switching to the cpuidle's governors.
You better implement the switching part in the cpuidle core first, i.e. proper
callbacks when a governor is switched in/out. Then make use of this switcheroo
right away. Doing it the other way round is just wrong.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-12 14:26 ` Thomas Gleixner
@ 2016-01-12 14:52 ` Daniel Lezcano
2016-01-12 15:12 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-12 14:52 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/12/2016 03:26 PM, Thomas Gleixner wrote:
> On Tue, 12 Jan 2016, Daniel Lezcano wrote:
>> On 01/12/2016 02:42 PM, Thomas Gleixner wrote:
>>> On Tue, 12 Jan 2016, Daniel Lezcano wrote:
>>>> On 01/08/2016 04:43 PM, Thomas Gleixner wrote:
>>>>>> + /*
>>>>>> + * Register the setup/free irq callbacks, so new interrupt or
>>>>>> + * freed interrupt will update their tracking.
>>>>>> + */
>>>>>> + ret = register_irq_timings(&irqt_ops);
>>>>>> + if (ret) {
>>>>>> + pr_err("Failed to register timings ops\n");
>>>>>> + return ret;
>>>>>> + }
>>>>>
>>>>> So that stuff is installed unconditionally. Is it used unconditionally
>>>>> as
>>>>> well?
>>>>
>>>> Sorry, I am not sure to understand your question. If the kernel is
>>>> compiled
>>>> with CONFIG_CPU_IDLE_GOV_SCHED=y, this code is enabled and use the irq
>>>> timings. The condition comes from the compilation option.
>>>
>>> The question is whether the option also activates that thing or is there
>>> still
>>> some /sys/whatever/idlegov magic where you can (de)select it.
>>
>> Yes, in the next patches of the series I did not send, we can switch to the
>> cpuidle's governor framework or idle-sched. I will look at how to disable it
>> when switching to the cpuidle's governors.
>
> You better implement the switching part in the cpuidle core first, i.e. proper
> callbacks when a governor is switched in/out. Then make use of this switcheroo
> right away. Doing it the other way round is just wrong.
The problem is this code is not another governor but a 'predictor' where
the scheduler will use the information to ask the cpuidle to go to a
specific idle state without going through the governor code, so into the
governor's callbacks. It is on top of cpuidle. The scheduler will become
the governor.
The current straightforward code, does the switch in the cpu_idle_loop
idle_task's function:
[ ... ]
if (cpu_idle_force_poll || tick_check_broadcast_expired())
cpu_idle_poll();
else {
if (sched_idle_enabled()) {
int latency = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
s64 duration = sched_idle_next_wakeup();
sched_idle(duration, latency);
} else {
cpuidle_idle_call();
}
}
Due to the complexity of the code, this first step introduce a mechanism
to predict the next event and re-use it trivially in the idle task.
Perhaps, it would be acceptable to have cpuidle_idle_call() to be
replaced by a callback and the switch acts at this level ?
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-12 14:52 ` Daniel Lezcano
@ 2016-01-12 15:12 ` Thomas Gleixner
2016-01-12 16:04 ` Daniel Lezcano
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-12 15:12 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Tue, 12 Jan 2016, Daniel Lezcano wrote:
> On 01/12/2016 03:26 PM, Thomas Gleixner wrote:
> > You better implement the switching part in the cpuidle core first, i.e.
> > proper
> > callbacks when a governor is switched in/out. Then make use of this
> > switcheroo
> > right away. Doing it the other way round is just wrong.
>
> The problem is this code is not another governor but a 'predictor' where the
> scheduler will use the information to ask the cpuidle to go to a specific idle
> state without going through the governor code, so into the governor's
> callbacks. It is on top of cpuidle. The scheduler will become the governor.
>
> The current straightforward code, does the switch in the cpu_idle_loop
> idle_task's function:
>
> [ ... ]
>
> if (cpu_idle_force_poll || tick_check_broadcast_expired())
> cpu_idle_poll();
> else {
> if (sched_idle_enabled()) {
> int latency = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
> s64 duration = sched_idle_next_wakeup();
> sched_idle(duration, latency);
> } else {
> cpuidle_idle_call();
> }
> }
>
> Due to the complexity of the code, this first step introduce a mechanism to
> predict the next event and re-use it trivially in the idle task.
This looks really wrong. Why on earth don't you implement a proper governor
and just get rid of this extra hackery?
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-12 15:12 ` Thomas Gleixner
@ 2016-01-12 16:04 ` Daniel Lezcano
2016-01-13 9:17 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-12 16:04 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/12/2016 04:12 PM, Thomas Gleixner wrote:
> On Tue, 12 Jan 2016, Daniel Lezcano wrote:
>> On 01/12/2016 03:26 PM, Thomas Gleixner wrote:
>>> You better implement the switching part in the cpuidle core first, i.e.
>>> proper
>>> callbacks when a governor is switched in/out. Then make use of this
>>> switcheroo
>>> right away. Doing it the other way round is just wrong.
>>
>> The problem is this code is not another governor but a 'predictor' where the
>> scheduler will use the information to ask the cpuidle to go to a specific idle
>> state without going through the governor code, so into the governor's
>> callbacks. It is on top of cpuidle. The scheduler will become the governor.
>>
>> The current straightforward code, does the switch in the cpu_idle_loop
>> idle_task's function:
>>
>> [ ... ]
>>
>> if (cpu_idle_force_poll || tick_check_broadcast_expired())
>> cpu_idle_poll();
>> else {
>> if (sched_idle_enabled()) {
>> int latency = pm_qos_request(PM_QOS_CPU_DMA_LATENCY);
>> s64 duration = sched_idle_next_wakeup();
>> sched_idle(duration, latency);
>> } else {
>> cpuidle_idle_call();
>> }
>> }
>>
>> Due to the complexity of the code, this first step introduce a mechanism to
>> predict the next event and re-use it trivially in the idle task.
>
> This looks really wrong. Why on earth don't you implement a proper governor
> and just get rid of this extra hackery?
That is part of the ongoing work where we are integrating the different
PM subsystems with the scheduler in order have them collaborating
together as asked by Ingo [1].
The idea is to get rid of the governors and let the scheduler to tell
the Cpuidle framework : "I expect to sleep <x> nsec and I have a <y>
nsec latency requirement" as stated by Peter Zijlstra [2].
The code above could be not nice but I think it begins to do what is
expecting Peter. It is an embryonic code and in order to prevent too
much different topics for a single series, I posted the two first
patches which provide the next event API as the foundations for the next
changes. How we integrate the 'next event' is a question I wanted to
postpone in a different series of RFC patches.
-- Daniel
[1] http://lwn.net/Articles/552885/
[2] https://lkml.org/lkml/2013/11/11/353 (broken here is another archive [3]
[3] http://lkml.iu.edu/hypermail/linux/kernel/1311.1/01360.html
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-07 15:42 ` Daniel Lezcano
@ 2016-01-12 19:27 ` Nicolas Pitre
0 siblings, 0 replies; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-12 19:27 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On Thu, 7 Jan 2016, Daniel Lezcano wrote:
> On 01/06/2016 06:40 PM, Nicolas Pitre wrote:
> > On Wed, 6 Jan 2016, Daniel Lezcano wrote:
> >
> > > Many IRQs are quiet most of the time, or they tend to come in bursts of
> > > fairly equal time intervals within each burst. It is therefore possible
> > > to detect those IRQs with stable intervals and guestimate when the next
> > > IRQ event is most likely to happen.
> > >
> > > Examples of such IRQs may include audio related IRQs where the FIFO size
> > > and/or DMA descriptor size with the sample rate create stable intervals,
> > > block devices during large data transfers, etc. Even network streaming
> > > of multimedia content creates patterns of periodic network interface IRQs
> > > in some cases.
> > >
> > > This patch adds code to track the mean interval and variance for each IRQ
> > > over a window of time intervals between IRQ events. Those statistics can
> > > be used to assist cpuidle in selecting the most appropriate sleep state
> > > by predicting the most likely time for the next interrupt.
> > >
> > > Because the stats are gathered in interrupt context, the core computation
> > > is as light as possible.
> > >
> > > Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> >
> > There are still a few problems with this patch.
> >
> > Please see comments below.
>
> [ ... ]
>
> > > +/**
> > > + * stats_variance - compute the variance
> > > + *
> > > + * @s: the statistic structure
> > > + *
> > > + * Returns an u64 corresponding to the variance, or zero if there is
> > > + * no data
> > > + */
> > > +static u64 stats_variance(struct stats *s, u32 mean)
> > > +{
> > > + int i;
> > > + u64 variance = 0;
> > > +
> > > + /*
> > > + * The variance is the sum of the squared difference to the
> > > + * average divided by the number of elements.
> > > + */
> > > + for (i = 0; i < STATS_NR_VALUES; i++) {
> > > + s32 diff = s->values[i] - mean;
> > > + variance += (u64)diff * diff;
> >
> > Strictly speaking, diff should be casted to s64 as it is a signed value
> > that may actually be negative. Because of the strange C type promotion
> > rules, the generated code appears correct (at least on ARM), but it
> > would be clearer to use s64 anyway.
>
> I don't get the connection in your explanation of why it should be a s64. It
> is already a signed s32, s->values[] are s32 and mean is u32.
>
> What would be the benefit to convert diff to s64 ?
Suppose diff ends up being -1. Normally, -1 * -1 = 1.
Now you cast the first term to a 64-bit value so the result of the
product becomes a 64-bit value to avoid overflows. However you cast it
to an _unsigned_ 64-bit value. Therefore you could expect something
like 0x00000000ffffffff * -1 would take place and that wouldn't produce
1 for result at all.
Yet, the C standard is weird and completely unintuitive sometimes.
Here is an example of such a time. Even though you cast the first term
to an unsigned value, the compiler is performing sign extension
nevertheless. So casting to u64 or s64 doesn't change anything in
practice. But for a human reading the code, using s64 will be more
intuitive.
> > The product will end up being positive in all cases of course so
> > variance may remain as a u64.
> >
>
> [ ... ]
Nicolas
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-12 16:04 ` Daniel Lezcano
@ 2016-01-13 9:17 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-13 9:17 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Tue, 12 Jan 2016, Daniel Lezcano wrote:
> On 01/12/2016 04:12 PM, Thomas Gleixner wrote:
> > This looks really wrong. Why on earth don't you implement a proper governor
> > and just get rid of this extra hackery?
>
> That is part of the ongoing work where we are integrating the different PM
> subsystems with the scheduler in order have them collaborating together as
> asked by Ingo [1].
>
> The idea is to get rid of the governors and let the scheduler to tell the
> Cpuidle framework : "I expect to sleep <x> nsec and I have a <y>
> nsec latency requirement" as stated by Peter Zijlstra [2].
>
> The code above could be not nice but I think it begins to do what is expecting
> Peter. It is an embryonic code and in order to prevent too much different
> topics for a single series, I posted the two first patches which provide the
> next event API as the foundations for the next changes. How we integrate the
> 'next event' is a question I wanted to postpone in a different series of RFC
> patches.
Fair enough.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-08 15:43 ` Thomas Gleixner
2016-01-12 12:41 ` Daniel Lezcano
@ 2016-01-18 13:21 ` Daniel Lezcano
2016-01-20 15:41 ` Thomas Gleixner
1 sibling, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-18 13:21 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/08/2016 04:43 PM, Thomas Gleixner wrote:
> On Wed, 6 Jan 2016, Daniel Lezcano wrote:
[ ... ]
>> + /*
>> + * For all the irq already setup, assign the timing callback.
>> + * All interrupts with their desc NULL will be discarded.
>> + */
>> + for_each_irq_desc(irq, desc)
>> + sched_irq_timing_setup(irq, desc->action);
>
> No, no, no. This belongs into the core code register_irq_timings() function
> which installs the handler into the irq descs with the proper protections and
> once it has done that enables the static key.
>
> The above is completely unprotected against interrupts being setup or even
> freed concurrently.
>
> Aside of that, you call that setup function in setup_irq for each action() and
> here you call it only for the first one.
Hi Thomas,
I went through the different comments and almost finished the changes
but I think the 'register_ops' approach, which happens after some irq
were setup, introduces some useless complexity and because of the desc
lock section, the ops can't do memory allocation. Before going further,
I am wondering if declaring the irq_timings_ops statically (read without
'register_ops' - hence without a init time dependency) and calling the
init/free ops from alloc_desc/free_desc wouldn't be cleaner and simpler.
What do you think ?
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-18 13:21 ` Daniel Lezcano
@ 2016-01-20 15:41 ` Thomas Gleixner
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-20 15:41 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Mon, 18 Jan 2016, Daniel Lezcano wrote:
> On 01/08/2016 04:43 PM, Thomas Gleixner wrote:
> > The above is completely unprotected against interrupts being setup or even
> > freed concurrently.
> >
> > Aside of that, you call that setup function in setup_irq for each action()
> > and
> > here you call it only for the first one.
>
> I went through the different comments and almost finished the changes but I
> think the 'register_ops' approach, which happens after some irq were setup,
> introduces some useless complexity and because of the desc lock section, the
> ops can't do memory allocation.
You can't protect that with desc_lock. You need to take the sparse_irq_lock,
which is a mutex, to protect the irq desc walk.
> Before going further, I am wondering if declaring the irq_timings_ops
> statically (read without 'register_ops' - hence without a init time
> dependency) and calling the init/free ops from alloc_desc/free_desc wouldn't
> be cleaner and simpler.
Then you don't need those ops at all. You can make it simple function calls,
which get compiled out if that stuff is not enabled.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* [RFC V2 0/2] IRQ based next prediction
2016-01-20 15:41 ` Thomas Gleixner
@ 2016-01-20 16:00 ` Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
` (4 more replies)
0 siblings, 5 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-20 16:00 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
The current approach to select an idle state is based on the idle period
statistics computation.
Useless to say this approach satisfied everyone as a solution to find the
best trade-off between the performances and the energy saving via the menu
governor.
However, the kernel is evolving to act pro-actively regarding the energy
constraints with the scheduler and the different power management subsystems
are not collaborating with the scheduler as the conductor of the decisions,
they all act independently.
The cpuidle governors are based on idle period statistics, without knowledge
of what woke up the cpu. In these sources of wakes up, the IPI are of course
accounted (as well as the timers irq) which results in doing statistics on
the scheduler behavior too. It is no sense to let the scheduler to take a
decision based on a next prediction of its own decisions.
In order to integrate the cpuidle framework into the scheduler, we have to
radically change the approach by clearly identifying what is causing a wake
up and how it behaves.
This serie inverts the logic.
Instead of tracking the idle durations and do statistics on them, these
patches track the interrupt individually and try to predict the next interrupt.
By combining the interrupts' next event on a single CPU, we can predict the
next event for the CPU, hence predict how long we will be sleeping when
entering idle.
The IPI and timer interrupts are not taken into account.
The first patch provides a callback to be registered in the irq subsystem
and to be called when an interrupt is handled with a timestamp.
The second patch uses the callback provided by the patch above to compute
the delta and store it in a circular buffer. It is per cpu, the callback
implements minimal operations as it is in an interrupt context.
When we the cpu enters idle, it asks for the expected sleep time. Then the
expected minimum sleep length for all interrupts is used and compared to
the timer sleep length, again the minimum is taken and gives the expected
sleep time.
The statistics are very trivial and could be improved later but this first
step shows we have a nice overall improvement in SMP. In UP the menu governor
is a bit better which may indicate the next prediction computation could be
improved but confirms removing the IPI from the equation increase the
accuracy.
Changelog:
V2:
- Changed the register_ops approach for the irq subsystem
- Fixed Nicolas's comments
Daniel Lezcano (2):
irq: Add a framework to measure interrupt timings
sched: idle: IRQ based next prediction for idle period
drivers/cpuidle/Kconfig | 9 +
include/linux/interrupt.h | 26 +++
include/linux/irqhandler.h | 1 +
kernel/irq/Kconfig | 4 +
kernel/irq/handle.c | 1 +
kernel/irq/internals.h | 43 ++++
kernel/irq/irqdesc.c | 6 +
kernel/irq/manage.c | 10 +-
kernel/sched/Makefile | 1 +
kernel/sched/idle-sched.c | 529 +++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 629 insertions(+), 1 deletion(-)
create mode 100644 kernel/sched/idle-sched.c
--
1.9.1
^ permalink raw reply [flat|nested] 63+ messages in thread
* [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
@ 2016-01-20 16:00 ` Daniel Lezcano
2016-01-20 17:55 ` Thomas Gleixner
` (2 more replies)
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
` (3 subsequent siblings)
4 siblings, 3 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-20 16:00 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
The interrupt framework gives a lot of information and statistics about
each interrupt.
Unfortunately there is no way to measure when interrupts occur and provide
a mathematical model for their behavior which could help in predicting
their next occurence.
This framework allows for registering a callback function that is invoked
when an interrupt occurs. Each time, the callback will be called with the
timestamp.
The main objective is to track and detect the periodic interrupts in order
to predict the next event on a cpu and anticipate the sleeping time when
entering idle. This fine grain approach allows to simplify and rationalize
a wake up event prediction without IPIs interference, thus letting the
scheduler to be smarter with the wakeup IPIs regarding the idle period.
The irq timing feature must be enabled by the subsytem at compile time and
this one must use the DECLARE_IRQ_TIMINGS(ops) macro in order to declare
the ops to be used. Without this, the kernel will fail to compile with an
unresolved symbol. That is the guarantee the irq timings is not enabled
for nothing and will be used if it is defined in the config file.
Moreover, using a global ops variable, encapsulated in the irq code via the
DECLARE_IRQ_TIMINGS macro, allows to have the ops to be called at init
time, before the interrupts are setup. That prevents to introduce complex
code to update the subsystem's irq tracking table *after* the irqs init
happened.
The ops are as the following:
- alloc : called when an irqdesc is allocated
- free : called when an irqdesc is freed
- setup : called when an irq is registered with the irq handler
- remove : called when an irq is removed
- handler : called when an irq was handled
A static key will be introduced when the irq prediction is switched on at
runtime in order to reduce an overhead near to zero when the kernel is not
using it.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
include/linux/interrupt.h | 26 ++++++++++++++++++++++++++
include/linux/irqhandler.h | 1 +
kernel/irq/Kconfig | 4 ++++
kernel/irq/handle.c | 1 +
kernel/irq/internals.h | 43 +++++++++++++++++++++++++++++++++++++++++++
kernel/irq/irqdesc.c | 6 ++++++
kernel/irq/manage.c | 10 +++++++++-
7 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index ad16809..f7ff6fe 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -123,6 +123,32 @@ struct irqaction {
struct proc_dir_entry *dir;
} ____cacheline_internodealigned_in_smp;
+#ifdef CONFIG_IRQ_TIMINGS
+/**
+ * struct irqt_ops - structure to be used by the subsystem to track
+ * irq timings
+ * @alloc: called when an irqdesc is allocated
+ * @free: called when an irqdesc is free
+ * @setup: called when an irq is setup, this is called under lock
+ * @remove: called when an irq is removed
+ * @handler: called when an interrupt is handled
+ */
+struct irqtimings_ops {
+ int (*alloc)(unsigned int);
+ void (*free)(unsigned int);
+ int (*setup)(unsigned int, struct irqaction *act);
+ void (*remove)(unsigned int, void *dev_id);
+ irqt_handler_t handler;
+};
+
+/**
+ * This macro *must* be used by the subsystem interested by the irq
+ * timing information.
+ */
+#define DECLARE_IRQ_TIMINGS(__ops) \
+ const struct irqtimings_ops *__irqtimings = __ops;
+#endif
+
extern irqreturn_t no_action(int cpl, void *dev_id);
extern int __must_check
diff --git a/include/linux/irqhandler.h b/include/linux/irqhandler.h
index 661bed0..4c1c77e 100644
--- a/include/linux/irqhandler.h
+++ b/include/linux/irqhandler.h
@@ -10,5 +10,6 @@ struct irq_desc;
struct irq_data;
typedef void (*irq_flow_handler_t)(struct irq_desc *desc);
typedef void (*irq_preflow_handler_t)(struct irq_data *data);
+typedef void (*irqt_handler_t)(unsigned int, ktime_t, void *);
#endif
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 3b48dab..3f68619 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -77,6 +77,10 @@ config GENERIC_MSI_IRQ_DOMAIN
config HANDLE_DOMAIN_IRQ
bool
+config IRQ_TIMINGS
+ bool
+ default n
+
config IRQ_DOMAIN_DEBUG
bool "Expose hardware/virtual IRQ mapping via debugfs"
depends on IRQ_DOMAIN && DEBUG_FS
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index a302cf9..cfc76fd 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
/* Fall through to add to randomness */
case IRQ_HANDLED:
flags |= action->flags;
+ handle_irqtiming(irq, action->dev_id);
break;
default:
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index fcab63c..cd4f61a 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -20,6 +20,49 @@ extern bool noirqdebug;
extern struct irqaction chained_action;
+#ifdef CONFIG_IRQ_TIMINGS
+
+extern const struct irqtimings_ops *__irqtimings;
+
+static inline int alloc_irqtiming(unsigned int irq)
+{
+ if (__irqtimings->alloc)
+ return __irqtimings->alloc(irq);
+ return 0;
+}
+
+static inline void free_irqtiming(unsigned int irq)
+{
+ if (__irqtimings->free)
+ __irqtimings->free(irq);
+}
+
+static inline int setup_irqtiming(unsigned int irq, struct irqaction *act)
+{
+ if (__irqtimings->setup)
+ return __irqtimings->setup(irq, act);
+ return 0;
+}
+
+static inline void remove_irqtiming(unsigned int irq, void *dev_id)
+{
+ if (__irqtimings->remove)
+ __irqtimings->remove(irq, dev_id);
+}
+
+static inline void handle_irqtiming(unsigned int irq, void *dev_id)
+{
+ if (__irqtimings->handler)
+ __irqtimings->handler(irq, ktime_get(), dev_id);
+}
+#else
+static inline int alloc_irqtiming(unsigned int irq) { return 0; }
+static inline int setup_irqtiming(unsigned int irq, void *dev_id) { return 0; }
+static inline void free_irqtiming(unsigned int irq) {}
+static inline void remove_irqtiming(unsigned int irq, void *dev_id) { }
+static inline void handle_irqtiming(unsigned int irq, void *dev_id) { }
+#endif
+
/*
* Bits used by threaded handlers:
* IRQTF_RUNTHREAD - signals that the interrupt handler thread should run
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 239e2ae..dc52f3a 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -157,6 +157,9 @@ static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
if (alloc_masks(desc, gfp, node))
goto err_kstat;
+ if (alloc_irqtiming(irq))
+ goto err_mask;
+
raw_spin_lock_init(&desc->lock);
lockdep_set_class(&desc->lock, &irq_desc_lock_class);
@@ -164,6 +167,8 @@ static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
return desc;
+err_mask:
+ free_masks(desc);
err_kstat:
free_percpu(desc->kstat_irqs);
err_desc:
@@ -187,6 +192,7 @@ static void free_desc(unsigned int irq)
delete_irq_desc(irq);
mutex_unlock(&sparse_irq_lock);
+ free_irqtiming(irq);
free_masks(desc);
free_percpu(desc->kstat_irqs);
kfree(desc);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 6ead200..df1cdb6 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1282,13 +1282,17 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
init_waitqueue_head(&desc->wait_for_threads);
+ ret = setup_irqtiming(irq, new);
+ if (ret)
+ goto out_mask;
+
/* Setup the type (level, edge polarity) if configured: */
if (new->flags & IRQF_TRIGGER_MASK) {
ret = __irq_set_trigger(desc,
new->flags & IRQF_TRIGGER_MASK);
if (ret)
- goto out_mask;
+ goto out_irqtiming;
}
desc->istate &= ~(IRQS_AUTODETECT | IRQS_SPURIOUS_DISABLED | \
@@ -1373,6 +1377,8 @@ mismatch:
}
ret = -EBUSY;
+out_irqtiming:
+ remove_irqtiming(irq, new->dev_id);
out_mask:
raw_spin_unlock_irqrestore(&desc->lock, flags);
free_cpumask_var(mask);
@@ -1483,6 +1489,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
/* Make sure it's not being used on another CPU: */
synchronize_irq(irq);
+ remove_irqtiming(irq, dev_id);
+
#ifdef CONFIG_DEBUG_SHIRQ
/*
* It's a shared IRQ -- the driver ought to be prepared for an IRQ
--
1.9.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
@ 2016-01-20 16:00 ` Daniel Lezcano
2016-01-20 17:46 ` Nicolas Pitre
` (4 more replies)
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
` (2 subsequent siblings)
4 siblings, 5 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-20 16:00 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
Many IRQs are quiet most of the time, or they tend to come in bursts of
fairly equal time intervals within each burst. It is therefore possible
to detect those IRQs with stable intervals and guestimate when the next
IRQ event is most likely to happen.
Examples of such IRQs may include audio related IRQs where the FIFO size
and/or DMA descriptor size with the sample rate create stable intervals,
block devices during large data transfers, etc. Even network streaming
of multimedia content creates patterns of periodic network interface IRQs
in some cases.
This patch adds code to track the mean interval and variance for each IRQ
over a window of time intervals between IRQ events. Those statistics can
be used to assist cpuidle in selecting the most appropriate sleep state
by predicting the most likely time for the next interrupt.
Because the stats are gathered in interrupt context, the core computation
is as light as possible.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
drivers/cpuidle/Kconfig | 9 +
kernel/sched/Makefile | 1 +
kernel/sched/idle-sched.c | 529 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 539 insertions(+)
create mode 100644 kernel/sched/idle-sched.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 8c7930b..a606106 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -25,6 +25,15 @@ config CPU_IDLE_GOV_MENU
bool "Menu governor (for tickless system)"
default y
+config CPU_IDLE_GOV_SCHED
+ bool "Sched idle governor"
+ select IRQ_TIMINGS
+ help
+ Enables an irq timings tracking mechanism to track the wakeup sources
+ of the platform.
+
+ If you are unsure, it is safe to say N.
+
config DT_IDLE_STATES
bool
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 6768797..f7d5a35 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_IDLE_GOV_SCHED) += idle-sched.o
diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c
new file mode 100644
index 0000000..c2b8568
--- /dev/null
+++ b/kernel/sched/idle-sched.c
@@ -0,0 +1,529 @@
+/*
+ * Copyright (C) 2016 Linaro Ltd, Daniel Lezcano <daniel.lezcano@linaro.org>
+ * Nicolas Pitre <nicolas.pitre@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+#include <linux/cpuidle.h>
+#include <linux/interrupt.h>
+#include <linux/irqdesc.h>
+#include <linux/ktime.h>
+#include <linux/slab.h>
+#include <linux/tick.h>
+#include <linux/time64.h>
+
+/*
+ * Define the number of samples over which the average and variance
+ * are computed. A power of 2 is preferred so to let the compiler
+ * optimize divisions by that number with simple arithmetic shifts.
+ */
+#define STATS_NR_VALUES 4
+
+/**
+ * struct stats - internal structure to encapsulate stats informations
+ *
+ * @sum: sum of the values
+ * @values: array of values to do stats on
+ * @w_ptr: current buffer pointer
+ */
+struct stats {
+ u64 sum; /* sum of values */
+ u32 values[STATS_NR_VALUES]; /* array of values */
+ unsigned char w_ptr; /* current window pointer */
+};
+
+/**
+ * struct wakeup - internal structure describing a source of wakeup
+ *
+ * @stats: the stats structure on the different event intervals
+ * @timestamp: latest update timestamp
+ */
+struct wakeup {
+ struct stats stats;
+ ktime_t timestamp;
+};
+
+/*
+ * Per cpu and irq statistics. Each cpu receives interrupts and those
+ * ones can be distributed following an irq chip specific
+ * algorithm. Random irq distribution is the worst case to predict
+ * interruption behavior but usually that does not happen or could be
+ * fixed from userspace by setting the irq affinity.
+ */
+static DEFINE_PER_CPU(struct wakeup, *wakeups[NR_IRQS]);
+
+static DECLARE_BITMAP(enabled_irq, NR_IRQS);
+
+/**
+ * stats_add - add a new value in the statistic structure
+ *
+ * @s: the statistic structure
+ * @value: the new value to be added
+ *
+ * Adds the value to the array, if the array is full, the oldest value
+ * is replaced.
+ */
+static void stats_add(struct stats *s, u32 value)
+{
+ /*
+ * This is a circular buffer, so the oldest value is the next
+ * one in the buffer. Let's compute the next pointer to
+ * retrieve the oldest value and re-use it to update the w_ptr
+ * after adding the new value.
+ */
+ s->w_ptr = (s->w_ptr + 1) % STATS_NR_VALUES;
+
+ /*
+ * Remove the oldest value from the summing. If this is the
+ * first time we go through this array slot, the previous
+ * value will be zero and we won't substract anything from the
+ * current sum. Hence this code relies on a zero-ed stat
+ * structure at init time via memset or kzalloc.
+ */
+ s->sum -= s->values[s->w_ptr];
+ s->values[s->w_ptr] = value;
+
+ /*
+ * In order to reduce the overhead and to prevent value
+ * derivation due to the integer computation, we just sum the
+ * value and do the division when the average and the variance
+ * are requested.
+ */
+ s->sum += value;
+}
+
+/**
+ * stats_reset - reset the stats
+ *
+ * @s: the statistic structure
+ *
+ * Reset the statistics and reset the values
+ */
+static inline void stats_reset(struct stats *s)
+{
+ memset(s, 0, sizeof(*s));
+}
+
+/**
+ * stats_mean - compute the average
+ *
+ * @s: the statistics structure
+ *
+ * Returns an u32 corresponding to the mean value, or zero if there is
+ * no data
+ */
+static inline u32 stats_mean(struct stats *s)
+{
+ /*
+ * gcc is smart enough to convert to a bits shift when the
+ * divisor is constant and multiple of 2^x.
+ *
+ * The number of values could have not reached STATS_NR_VALUES
+ * yet, but we can consider it acceptable as the situation is
+ * only at the beginning of the burst of irqs.
+ */
+ return s->sum / STATS_NR_VALUES;
+}
+
+/**
+ * stats_variance - compute the variance
+ *
+ * @s: the statistic structure
+ *
+ * Returns an u64 corresponding to the variance, or zero if there is
+ * no data
+ */
+static u64 stats_variance(struct stats *s, u32 mean)
+{
+ int i;
+ u64 variance = 0;
+
+ /*
+ * The variance is the sum of the squared difference to the
+ * average divided by the number of elements.
+ */
+ for (i = 0; i < STATS_NR_VALUES; i++) {
+ s64 diff = s->values[i] - mean;
+ variance += (u64)diff * diff;
+ }
+
+ return variance / STATS_NR_VALUES;
+}
+
+/**
+ * sched_idle_irq - irq timestamp callback
+ *
+ * @irq: the irq number
+ * @timestamp: when the interrupt occured
+ * @dev_id: device id for shared interrupt (not yet used)
+ *
+ * Interrupt callback called when an interrupt happens. This function
+ * is critical as it is called under an interrupt section: minimum
+ * operations as possible are done here:
+ */
+static void sched_irq_timing_handler(unsigned int irq, ktime_t timestamp, void *dev_id)
+{
+ u32 diff;
+ unsigned int cpu = raw_smp_processor_id();
+ struct wakeup *w = per_cpu(wakeups[irq], cpu);
+
+ /*
+ * It is the first time the interrupt occurs of the series, we
+ * can't do any stats as we don't have an interval, just store
+ * the timestamp and exit.
+ */
+ if (ktime_equal(w->timestamp, ktime_set(0, 0))) {
+ w->timestamp = timestamp;
+ return;
+ }
+
+ /*
+ * Microsec resolution is enough for our purpose.
+ */
+ diff = ktime_us_delta(timestamp, w->timestamp);
+ w->timestamp = timestamp;
+
+ /*
+ * There is no point attempting predictions on interrupts more
+ * than ~1 second apart. This has no benefit for sleep state
+ * selection and increases the risk of overflowing our variance
+ * computation. Reset all stats in that case.
+ */
+ if (diff > (1 << 20)) {
+ stats_reset(&w->stats);
+ return;
+ }
+
+ stats_add(&w->stats, diff);
+}
+
+static ktime_t next_irq_event(void)
+{
+ unsigned int irq, cpu = raw_smp_processor_id();
+ ktime_t diff, next, min = ktime_set(KTIME_SEC_MAX, 0);
+ ktime_t now = ktime_get();
+ struct wakeup *w;
+ u32 interval, mean;
+ u64 variance;
+
+ /*
+ * Lookup the interrupt array for this cpu and search for the
+ * earlier expected interruption.
+ */
+ for (irq = 0; irq < NR_IRQS; irq = find_next_bit(enabled_irq, NR_IRQS, irq)) {
+
+ w = per_cpu(wakeups[irq], cpu);
+
+ /*
+ * The interrupt was not setup as a source of a wakeup
+ * or the wakeup source is not considered at this
+ * moment stable enough to do a prediction.
+ */
+ if (!w)
+ continue;
+
+ /*
+ * No statistics available yet.
+ */
+ if (ktime_equal(w->timestamp, ktime_set(0, 0)))
+ continue;
+
+ diff = ktime_sub(now, w->timestamp);
+
+ /*
+ * There is no point attempting predictions on interrupts more
+ * than 1 second apart. This has no benefit for sleep state
+ * selection and increases the risk of overflowing our variance
+ * computation. Reset all stats in that case.
+ */
+ if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
+ stats_reset(&w->stats);
+ continue;
+ }
+
+ /*
+ * If the mean value is null, just ignore this wakeup
+ * source.
+ */
+ mean = stats_mean(&w->stats);
+ if (!mean)
+ continue;
+
+ variance = stats_variance(&w->stats, mean);
+ /*
+ * We want to check the last interval is:
+ *
+ * mean - stddev < interval < mean + stddev
+ *
+ * That simplifies to:
+ *
+ * -stddev < interval - mean < stddev
+ *
+ * abs(interval - mean) < stddev
+ *
+ * The standard deviation is the sqrt of the variance:
+ *
+ * abs(interval - mean) < sqrt(variance)
+ *
+ * and we want to prevent to do an sqrt, so we square
+ * the equation:
+ *
+ * (interval - mean)^2 < variance
+ *
+ * So if the latest value of the stats complies with
+ * this condition, then the wakeup source is
+ * considered predictable and can be used to predict
+ * the next event.
+ */
+ interval = w->stats.values[w->stats.w_ptr];
+ if ((u64)((interval - mean) * (interval - mean)) > variance)
+ continue;
+
+ /*
+ * Let's compute the next event: the wakeup source is
+ * considered predictable, we add the average interval
+ * time added to the latest interruption event time.
+ */
+ next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
+
+ /*
+ * If the interrupt is supposed to happen before the
+ * minimum time, then it becomes the minimum.
+ */
+ if (ktime_before(next, min))
+ min = next;
+ }
+
+ /*
+ * At this point, we have our prediction but the caller is
+ * expecting the remaining time before the next event, so
+ * compute the expected sleep length.
+ */
+ diff = ktime_sub(min, now);
+
+ /*
+ * The result could be negative for different reasons:
+ * - the prediction is incorrect
+ * - the prediction was too near now and expired while we were
+ * in this function
+ *
+ * In both cases, we return KTIME_MAX as a failure to do a
+ * prediction
+ */
+ if (ktime_compare(diff, ktime_set(0, 0)) <= 0)
+ return ktime_set(KTIME_SEC_MAX, 0);
+
+ return diff;
+}
+
+/**
+ * sched_idle_next_wakeup - Predict the next wakeup on the current cpu
+ *
+ * The next event on the cpu is based on a statistic approach of the
+ * interrupt events and the timer deterministic value. From the timer
+ * or the irqs, we return the one expected to occur first.
+ *
+ * Returns the expected remaining idle time before being woken up by
+ * an interruption.
+ */
+s64 sched_idle_next_wakeup(void)
+{
+ s64 next_timer = ktime_to_us(tick_nohz_get_sleep_length());
+ s64 next_irq = ktime_to_us(next_irq_event());
+
+ return min(next_irq, next_timer);
+}
+
+/**
+ * sched_idle - go to idle for a specified amount of time
+ *
+ * @duration: the idle duration time
+ * @latency: the latency constraint
+ *
+ * Returns 0 on success, < 0 otherwise.
+ */
+int sched_idle(s64 duration, unsigned int latency)
+{
+ struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
+ struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+ struct cpuidle_state_usage *su;
+ struct cpuidle_state *s;
+ int i, ret = 0, index = -1;
+
+ rcu_idle_enter();
+
+ /*
+ * No cpuidle driver is available, let's use the default arch
+ * idle function.
+ */
+ if (cpuidle_not_available(drv, dev))
+ goto default_idle;
+
+ /*
+ * Find the idle state with the lowest power while satisfying
+ * our constraints. We will save energy if the duration of the
+ * idle time is bigger than the target residency which is the
+ * break even point. The choice will be modulated by the
+ * latency.
+ */
+ for (i = 0; i < drv->state_count; i++) {
+
+ s = &drv->states[i];
+
+ su = &dev->states_usage[i];
+
+ if (s->disabled || su->disable)
+ continue;
+ if (s->target_residency > duration)
+ continue;
+ if (s->exit_latency > latency)
+ continue;
+
+ index = i;
+ }
+
+ /*
+ * The idle task must be scheduled, it is pointless to go to
+ * idle, just re-enable the interrupt and return.
+ */
+ if (current_clr_polling_and_test()) {
+ local_irq_enable();
+ goto out;
+ }
+
+ if (index < 0) {
+ /*
+ * No idle callbacks fulfilled the constraints, jump
+ * to the default function like there wasn't any
+ * cpuidle driver.
+ */
+ goto default_idle;
+ } else {
+ /*
+ * Enter the idle state previously returned by the
+ * governor decision. This function will block until
+ * an interrupt occurs and will take care of
+ * re-enabling the local interrupts
+ */
+ return cpuidle_enter(drv, dev, index);
+ }
+
+default_idle:
+ default_idle_call();
+out:
+ rcu_idle_exit();
+ return ret;
+}
+
+/**
+ * sched_irq_timing_remove - disable the tracking of the specified irq
+ *
+ * Clear the irq table slot to stop tracking the interrupt.
+ *
+ * @irq: the irq number to stop tracking
+ * @dev_id: the device id for shared irq
+ *
+ * This function will remove from the wakeup source prediction table.
+ */
+static void sched_irq_timing_remove(unsigned int irq, void *dev_id)
+{
+ clear_bit(irq, enabled_irq);
+}
+
+/**
+ * sched_irq_timing_setup - enable the tracking of the specified irq
+ *
+ * Function is called with the corresponding irqdesc lock taken. It is
+ * not allowed to do any memory allocation or blocking call. Flag the
+ * irq table slot to be tracked in order to predict the next event.
+ *
+ * @irq: the interrupt numbe to be tracked
+ * @act: the new irq action to be set to this interrupt
+ *
+ * Returns zero on success, < 0 otherwise.
+ */
+static int sched_irq_timing_setup(unsigned int irq, struct irqaction *act)
+{
+ /*
+ * No interrupt set for this descriptor or related to a timer.
+ * Timers are deterministic, so no need to try to do any
+ * prediction on them. No error for both cases, we are just not
+ * interested.
+ */
+ if (!(act->flags & __IRQF_TIMER))
+ return 0;
+
+ set_bit(irq, enabled_irq);
+
+ return 0;
+}
+
+/**
+ * sched_irq_timing_free - free memory previously allocated
+ *
+ * @irq: the interrupt number
+ */
+static void sched_irq_timing_free(unsigned int irq)
+{
+ struct wakeup *w;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+
+ w = per_cpu(wakeups[irq], cpu);
+ if (!w)
+ continue;
+
+ per_cpu(wakeups[irq], cpu) = NULL;
+ kfree(w);
+ }
+}
+
+/**
+ * sched_irq_timing_alloc - allocates memory for irq tracking
+ *
+ * Allocates the memory to track the specified irq.
+ *
+ * @irq: the interrupt number
+ *
+ * Returns 0 on success, -ENOMEM on error.
+ */
+static int sched_irq_timing_alloc(unsigned int irq)
+{
+ struct wakeup *w;
+ int cpu, ret = -ENOMEM;
+
+ /*
+ * Allocates the wakeup structure and the stats structure. As
+ * the interrupt can occur on any cpu, allocate the wakeup
+ * structure per cpu basis.
+ */
+ for_each_possible_cpu(cpu) {
+
+ w = kzalloc(sizeof(*w), GFP_KERNEL);
+ if (!w)
+ goto out;
+
+ per_cpu(wakeups[irq], cpu) = w;
+ }
+
+ ret = 0;
+out:
+ if (ret)
+ sched_irq_timing_free(irq);
+
+ return ret;
+}
+
+static struct irqtimings_ops irqt_ops = {
+ .alloc = sched_irq_timing_alloc,
+ .free = sched_irq_timing_free,
+ .setup = sched_irq_timing_setup,
+ .remove = sched_irq_timing_remove,
+ .handler = sched_irq_timing_handler,
+};
+
+DECLARE_IRQ_TIMINGS(&irqt_ops);
--
1.9.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC V2 0/2] IRQ based next prediction
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
@ 2016-01-20 16:00 ` Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
4 siblings, 0 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-20 16:00 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
The current approach to select an idle state is based on the idle period
statistics computation.
Useless to say this approach satisfied everyone as a solution to find the
best trade-off between the performances and the energy saving via the menu
governor.
However, the kernel is evolving to act pro-actively regarding the energy
constraints with the scheduler and the different power management subsystems
are not collaborating with the scheduler as the conductor of the decisions,
they all act independently.
The cpuidle governors are based on idle period statistics, without knowledge
of what woke up the cpu. In these sources of wakes up, the IPI are of course
accounted (as well as the timers irq) which results in doing statistics on
the scheduler behavior too. It is no sense to let the scheduler to take a
decision based on a next prediction of its own decisions.
In order to integrate the cpuidle framework into the scheduler, we have to
radically change the approach by clearly identifying what is causing a wake
up and how it behaves.
This serie inverts the logic.
Instead of tracking the idle durations and do statistics on them, these
patches track the interrupt individually and try to predict the next interrupt.
By combining the interrupts' next event on a single CPU, we can predict the
next event for the CPU, hence predict how long we will be sleeping when
entering idle.
The IPI and timer interrupts are not taken into account.
The first patch provides a callback to be registered in the irq subsystem
and to be called when an interrupt is handled with a timestamp.
The second patch uses the callback provided by the patch above to compute
the delta and store it in a circular buffer. It is per cpu, the callback
implements minimal operations as it is in an interrupt context.
When we the cpu enters idle, it asks for the expected sleep time. Then the
expected minimum sleep length for all interrupts is used and compared to
the timer sleep length, again the minimum is taken and gives the expected
sleep time.
The statistics are very trivial and could be improved later but this first
step shows we have a nice overall improvement in SMP. In UP the menu governor
is a bit better which may indicate the next prediction computation could be
improved but confirms removing the IPI from the equation increase the
accuracy.
Changelog:
V2:
- Changed the register_ops approach for the irq subsystem
- Fixed Nicolas's comments
Daniel Lezcano (2):
irq: Add a framework to measure interrupt timings
sched: idle: IRQ based next prediction for idle period
drivers/cpuidle/Kconfig | 9 +
include/linux/interrupt.h | 26 +++
include/linux/irqhandler.h | 1 +
kernel/irq/Kconfig | 4 +
kernel/irq/handle.c | 1 +
kernel/irq/internals.h | 43 ++++
kernel/irq/irqdesc.c | 6 +
kernel/irq/manage.c | 10 +-
kernel/sched/Makefile | 1 +
kernel/sched/idle-sched.c | 529 +++++++++++++++++++++++++++++++++++++++++++++
10 files changed, 629 insertions(+), 1 deletion(-)
create mode 100644 kernel/sched/idle-sched.c
--
1.9.1
^ permalink raw reply [flat|nested] 63+ messages in thread
* [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
` (2 preceding siblings ...)
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
@ 2016-01-20 16:00 ` Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
4 siblings, 0 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-20 16:00 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
The interrupt framework gives a lot of information and statistics about
each interrupt.
Unfortunately there is no way to measure when interrupts occur and provide
a mathematical model for their behavior which could help in predicting
their next occurence.
This framework allows for registering a callback function that is invoked
when an interrupt occurs. Each time, the callback will be called with the
timestamp.
The main objective is to track and detect the periodic interrupts in order
to predict the next event on a cpu and anticipate the sleeping time when
entering idle. This fine grain approach allows to simplify and rationalize
a wake up event prediction without IPIs interference, thus letting the
scheduler to be smarter with the wakeup IPIs regarding the idle period.
The irq timing feature must be enabled by the subsytem at compile time and
this one must use the DECLARE_IRQ_TIMINGS(ops) macro in order to declare
the ops to be used. Without this, the kernel will fail to compile with an
unresolved symbol. That is the guarantee the irq timings is not enabled
for nothing and will be used if it is defined in the config file.
Moreover, using a global ops variable, encapsulated in the irq code via the
DECLARE_IRQ_TIMINGS macro, allows to have the ops to be called at init
time, before the interrupts are setup. That prevents to introduce complex
code to update the subsystem's irq tracking table *after* the irqs init
happened.
The ops are as the following:
- alloc : called when an irqdesc is allocated
- free : called when an irqdesc is freed
- setup : called when an irq is registered with the irq handler
- remove : called when an irq is removed
- handler : called when an irq was handled
A static key will be introduced when the irq prediction is switched on at
runtime in order to reduce an overhead near to zero when the kernel is not
using it.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
include/linux/interrupt.h | 26 ++++++++++++++++++++++++++
include/linux/irqhandler.h | 1 +
kernel/irq/Kconfig | 4 ++++
kernel/irq/handle.c | 1 +
kernel/irq/internals.h | 43 +++++++++++++++++++++++++++++++++++++++++++
kernel/irq/irqdesc.c | 6 ++++++
kernel/irq/manage.c | 10 +++++++++-
7 files changed, 90 insertions(+), 1 deletion(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index ad16809..f7ff6fe 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -123,6 +123,32 @@ struct irqaction {
struct proc_dir_entry *dir;
} ____cacheline_internodealigned_in_smp;
+#ifdef CONFIG_IRQ_TIMINGS
+/**
+ * struct irqt_ops - structure to be used by the subsystem to track
+ * irq timings
+ * @alloc: called when an irqdesc is allocated
+ * @free: called when an irqdesc is free
+ * @setup: called when an irq is setup, this is called under lock
+ * @remove: called when an irq is removed
+ * @handler: called when an interrupt is handled
+ */
+struct irqtimings_ops {
+ int (*alloc)(unsigned int);
+ void (*free)(unsigned int);
+ int (*setup)(unsigned int, struct irqaction *act);
+ void (*remove)(unsigned int, void *dev_id);
+ irqt_handler_t handler;
+};
+
+/**
+ * This macro *must* be used by the subsystem interested by the irq
+ * timing information.
+ */
+#define DECLARE_IRQ_TIMINGS(__ops) \
+ const struct irqtimings_ops *__irqtimings = __ops;
+#endif
+
extern irqreturn_t no_action(int cpl, void *dev_id);
extern int __must_check
diff --git a/include/linux/irqhandler.h b/include/linux/irqhandler.h
index 661bed0..4c1c77e 100644
--- a/include/linux/irqhandler.h
+++ b/include/linux/irqhandler.h
@@ -10,5 +10,6 @@ struct irq_desc;
struct irq_data;
typedef void (*irq_flow_handler_t)(struct irq_desc *desc);
typedef void (*irq_preflow_handler_t)(struct irq_data *data);
+typedef void (*irqt_handler_t)(unsigned int, ktime_t, void *);
#endif
diff --git a/kernel/irq/Kconfig b/kernel/irq/Kconfig
index 3b48dab..3f68619 100644
--- a/kernel/irq/Kconfig
+++ b/kernel/irq/Kconfig
@@ -77,6 +77,10 @@ config GENERIC_MSI_IRQ_DOMAIN
config HANDLE_DOMAIN_IRQ
bool
+config IRQ_TIMINGS
+ bool
+ default n
+
config IRQ_DOMAIN_DEBUG
bool "Expose hardware/virtual IRQ mapping via debugfs"
depends on IRQ_DOMAIN && DEBUG_FS
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index a302cf9..cfc76fd 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
/* Fall through to add to randomness */
case IRQ_HANDLED:
flags |= action->flags;
+ handle_irqtiming(irq, action->dev_id);
break;
default:
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index fcab63c..cd4f61a 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -20,6 +20,49 @@ extern bool noirqdebug;
extern struct irqaction chained_action;
+#ifdef CONFIG_IRQ_TIMINGS
+
+extern const struct irqtimings_ops *__irqtimings;
+
+static inline int alloc_irqtiming(unsigned int irq)
+{
+ if (__irqtimings->alloc)
+ return __irqtimings->alloc(irq);
+ return 0;
+}
+
+static inline void free_irqtiming(unsigned int irq)
+{
+ if (__irqtimings->free)
+ __irqtimings->free(irq);
+}
+
+static inline int setup_irqtiming(unsigned int irq, struct irqaction *act)
+{
+ if (__irqtimings->setup)
+ return __irqtimings->setup(irq, act);
+ return 0;
+}
+
+static inline void remove_irqtiming(unsigned int irq, void *dev_id)
+{
+ if (__irqtimings->remove)
+ __irqtimings->remove(irq, dev_id);
+}
+
+static inline void handle_irqtiming(unsigned int irq, void *dev_id)
+{
+ if (__irqtimings->handler)
+ __irqtimings->handler(irq, ktime_get(), dev_id);
+}
+#else
+static inline int alloc_irqtiming(unsigned int irq) { return 0; }
+static inline int setup_irqtiming(unsigned int irq, void *dev_id) { return 0; }
+static inline void free_irqtiming(unsigned int irq) {}
+static inline void remove_irqtiming(unsigned int irq, void *dev_id) { }
+static inline void handle_irqtiming(unsigned int irq, void *dev_id) { }
+#endif
+
/*
* Bits used by threaded handlers:
* IRQTF_RUNTHREAD - signals that the interrupt handler thread should run
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 239e2ae..dc52f3a 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -157,6 +157,9 @@ static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
if (alloc_masks(desc, gfp, node))
goto err_kstat;
+ if (alloc_irqtiming(irq))
+ goto err_mask;
+
raw_spin_lock_init(&desc->lock);
lockdep_set_class(&desc->lock, &irq_desc_lock_class);
@@ -164,6 +167,8 @@ static struct irq_desc *alloc_desc(int irq, int node, struct module *owner)
return desc;
+err_mask:
+ free_masks(desc);
err_kstat:
free_percpu(desc->kstat_irqs);
err_desc:
@@ -187,6 +192,7 @@ static void free_desc(unsigned int irq)
delete_irq_desc(irq);
mutex_unlock(&sparse_irq_lock);
+ free_irqtiming(irq);
free_masks(desc);
free_percpu(desc->kstat_irqs);
kfree(desc);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 6ead200..df1cdb6 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1282,13 +1282,17 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
init_waitqueue_head(&desc->wait_for_threads);
+ ret = setup_irqtiming(irq, new);
+ if (ret)
+ goto out_mask;
+
/* Setup the type (level, edge polarity) if configured: */
if (new->flags & IRQF_TRIGGER_MASK) {
ret = __irq_set_trigger(desc,
new->flags & IRQF_TRIGGER_MASK);
if (ret)
- goto out_mask;
+ goto out_irqtiming;
}
desc->istate &= ~(IRQS_AUTODETECT | IRQS_SPURIOUS_DISABLED | \
@@ -1373,6 +1377,8 @@ mismatch:
}
ret = -EBUSY;
+out_irqtiming:
+ remove_irqtiming(irq, new->dev_id);
out_mask:
raw_spin_unlock_irqrestore(&desc->lock, flags);
free_cpumask_var(mask);
@@ -1483,6 +1489,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
/* Make sure it's not being used on another CPU: */
synchronize_irq(irq);
+ remove_irqtiming(irq, dev_id);
+
#ifdef CONFIG_DEBUG_SHIRQ
/*
* It's a shared IRQ -- the driver ought to be prepared for an IRQ
--
1.9.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
` (3 preceding siblings ...)
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
@ 2016-01-20 16:00 ` Daniel Lezcano
2016-01-20 20:14 ` Nicolas Pitre
4 siblings, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-20 16:00 UTC (permalink / raw)
To: tglx, peterz, rafael
Cc: linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
Many IRQs are quiet most of the time, or they tend to come in bursts of
fairly equal time intervals within each burst. It is therefore possible
to detect those IRQs with stable intervals and guestimate when the next
IRQ event is most likely to happen.
Examples of such IRQs may include audio related IRQs where the FIFO size
and/or DMA descriptor size with the sample rate create stable intervals,
block devices during large data transfers, etc. Even network streaming
of multimedia content creates patterns of periodic network interface IRQs
in some cases.
This patch adds code to track the mean interval and variance for each IRQ
over a window of time intervals between IRQ events. Those statistics can
be used to assist cpuidle in selecting the most appropriate sleep state
by predicting the most likely time for the next interrupt.
Because the stats are gathered in interrupt context, the core computation
is as light as possible.
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
---
drivers/cpuidle/Kconfig | 9 +
kernel/sched/Makefile | 1 +
kernel/sched/idle-sched.c | 529 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 539 insertions(+)
create mode 100644 kernel/sched/idle-sched.c
diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
index 8c7930b..a606106 100644
--- a/drivers/cpuidle/Kconfig
+++ b/drivers/cpuidle/Kconfig
@@ -25,6 +25,15 @@ config CPU_IDLE_GOV_MENU
bool "Menu governor (for tickless system)"
default y
+config CPU_IDLE_GOV_SCHED
+ bool "Sched idle governor"
+ select IRQ_TIMINGS
+ help
+ Enables an irq timings tracking mechanism to track the wakeup sources
+ of the platform.
+
+ If you are unsure, it is safe to say N.
+
config DT_IDLE_STATES
bool
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 6768797..f7d5a35 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
+obj-$(CONFIG_CPU_IDLE_GOV_SCHED) += idle-sched.o
diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c
new file mode 100644
index 0000000..c2b8568
--- /dev/null
+++ b/kernel/sched/idle-sched.c
@@ -0,0 +1,529 @@
+/*
+ * Copyright (C) 2016 Linaro Ltd, Daniel Lezcano <daniel.lezcano@linaro.org>
+ * Nicolas Pitre <nicolas.pitre@linaro.org>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ */
+#include <linux/cpuidle.h>
+#include <linux/interrupt.h>
+#include <linux/irqdesc.h>
+#include <linux/ktime.h>
+#include <linux/slab.h>
+#include <linux/tick.h>
+#include <linux/time64.h>
+
+/*
+ * Define the number of samples over which the average and variance
+ * are computed. A power of 2 is preferred so to let the compiler
+ * optimize divisions by that number with simple arithmetic shifts.
+ */
+#define STATS_NR_VALUES 4
+
+/**
+ * struct stats - internal structure to encapsulate stats informations
+ *
+ * @sum: sum of the values
+ * @values: array of values to do stats on
+ * @w_ptr: current buffer pointer
+ */
+struct stats {
+ u64 sum; /* sum of values */
+ u32 values[STATS_NR_VALUES]; /* array of values */
+ unsigned char w_ptr; /* current window pointer */
+};
+
+/**
+ * struct wakeup - internal structure describing a source of wakeup
+ *
+ * @stats: the stats structure on the different event intervals
+ * @timestamp: latest update timestamp
+ */
+struct wakeup {
+ struct stats stats;
+ ktime_t timestamp;
+};
+
+/*
+ * Per cpu and irq statistics. Each cpu receives interrupts and those
+ * ones can be distributed following an irq chip specific
+ * algorithm. Random irq distribution is the worst case to predict
+ * interruption behavior but usually that does not happen or could be
+ * fixed from userspace by setting the irq affinity.
+ */
+static DEFINE_PER_CPU(struct wakeup, *wakeups[NR_IRQS]);
+
+static DECLARE_BITMAP(enabled_irq, NR_IRQS);
+
+/**
+ * stats_add - add a new value in the statistic structure
+ *
+ * @s: the statistic structure
+ * @value: the new value to be added
+ *
+ * Adds the value to the array, if the array is full, the oldest value
+ * is replaced.
+ */
+static void stats_add(struct stats *s, u32 value)
+{
+ /*
+ * This is a circular buffer, so the oldest value is the next
+ * one in the buffer. Let's compute the next pointer to
+ * retrieve the oldest value and re-use it to update the w_ptr
+ * after adding the new value.
+ */
+ s->w_ptr = (s->w_ptr + 1) % STATS_NR_VALUES;
+
+ /*
+ * Remove the oldest value from the summing. If this is the
+ * first time we go through this array slot, the previous
+ * value will be zero and we won't substract anything from the
+ * current sum. Hence this code relies on a zero-ed stat
+ * structure at init time via memset or kzalloc.
+ */
+ s->sum -= s->values[s->w_ptr];
+ s->values[s->w_ptr] = value;
+
+ /*
+ * In order to reduce the overhead and to prevent value
+ * derivation due to the integer computation, we just sum the
+ * value and do the division when the average and the variance
+ * are requested.
+ */
+ s->sum += value;
+}
+
+/**
+ * stats_reset - reset the stats
+ *
+ * @s: the statistic structure
+ *
+ * Reset the statistics and reset the values
+ */
+static inline void stats_reset(struct stats *s)
+{
+ memset(s, 0, sizeof(*s));
+}
+
+/**
+ * stats_mean - compute the average
+ *
+ * @s: the statistics structure
+ *
+ * Returns an u32 corresponding to the mean value, or zero if there is
+ * no data
+ */
+static inline u32 stats_mean(struct stats *s)
+{
+ /*
+ * gcc is smart enough to convert to a bits shift when the
+ * divisor is constant and multiple of 2^x.
+ *
+ * The number of values could have not reached STATS_NR_VALUES
+ * yet, but we can consider it acceptable as the situation is
+ * only at the beginning of the burst of irqs.
+ */
+ return s->sum / STATS_NR_VALUES;
+}
+
+/**
+ * stats_variance - compute the variance
+ *
+ * @s: the statistic structure
+ *
+ * Returns an u64 corresponding to the variance, or zero if there is
+ * no data
+ */
+static u64 stats_variance(struct stats *s, u32 mean)
+{
+ int i;
+ u64 variance = 0;
+
+ /*
+ * The variance is the sum of the squared difference to the
+ * average divided by the number of elements.
+ */
+ for (i = 0; i < STATS_NR_VALUES; i++) {
+ s64 diff = s->values[i] - mean;
+ variance += (u64)diff * diff;
+ }
+
+ return variance / STATS_NR_VALUES;
+}
+
+/**
+ * sched_idle_irq - irq timestamp callback
+ *
+ * @irq: the irq number
+ * @timestamp: when the interrupt occured
+ * @dev_id: device id for shared interrupt (not yet used)
+ *
+ * Interrupt callback called when an interrupt happens. This function
+ * is critical as it is called under an interrupt section: minimum
+ * operations as possible are done here:
+ */
+static void sched_irq_timing_handler(unsigned int irq, ktime_t timestamp, void *dev_id)
+{
+ u32 diff;
+ unsigned int cpu = raw_smp_processor_id();
+ struct wakeup *w = per_cpu(wakeups[irq], cpu);
+
+ /*
+ * It is the first time the interrupt occurs of the series, we
+ * can't do any stats as we don't have an interval, just store
+ * the timestamp and exit.
+ */
+ if (ktime_equal(w->timestamp, ktime_set(0, 0))) {
+ w->timestamp = timestamp;
+ return;
+ }
+
+ /*
+ * Microsec resolution is enough for our purpose.
+ */
+ diff = ktime_us_delta(timestamp, w->timestamp);
+ w->timestamp = timestamp;
+
+ /*
+ * There is no point attempting predictions on interrupts more
+ * than ~1 second apart. This has no benefit for sleep state
+ * selection and increases the risk of overflowing our variance
+ * computation. Reset all stats in that case.
+ */
+ if (diff > (1 << 20)) {
+ stats_reset(&w->stats);
+ return;
+ }
+
+ stats_add(&w->stats, diff);
+}
+
+static ktime_t next_irq_event(void)
+{
+ unsigned int irq, cpu = raw_smp_processor_id();
+ ktime_t diff, next, min = ktime_set(KTIME_SEC_MAX, 0);
+ ktime_t now = ktime_get();
+ struct wakeup *w;
+ u32 interval, mean;
+ u64 variance;
+
+ /*
+ * Lookup the interrupt array for this cpu and search for the
+ * earlier expected interruption.
+ */
+ for (irq = 0; irq < NR_IRQS; irq = find_next_bit(enabled_irq, NR_IRQS, irq)) {
+
+ w = per_cpu(wakeups[irq], cpu);
+
+ /*
+ * The interrupt was not setup as a source of a wakeup
+ * or the wakeup source is not considered at this
+ * moment stable enough to do a prediction.
+ */
+ if (!w)
+ continue;
+
+ /*
+ * No statistics available yet.
+ */
+ if (ktime_equal(w->timestamp, ktime_set(0, 0)))
+ continue;
+
+ diff = ktime_sub(now, w->timestamp);
+
+ /*
+ * There is no point attempting predictions on interrupts more
+ * than 1 second apart. This has no benefit for sleep state
+ * selection and increases the risk of overflowing our variance
+ * computation. Reset all stats in that case.
+ */
+ if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
+ stats_reset(&w->stats);
+ continue;
+ }
+
+ /*
+ * If the mean value is null, just ignore this wakeup
+ * source.
+ */
+ mean = stats_mean(&w->stats);
+ if (!mean)
+ continue;
+
+ variance = stats_variance(&w->stats, mean);
+ /*
+ * We want to check the last interval is:
+ *
+ * mean - stddev < interval < mean + stddev
+ *
+ * That simplifies to:
+ *
+ * -stddev < interval - mean < stddev
+ *
+ * abs(interval - mean) < stddev
+ *
+ * The standard deviation is the sqrt of the variance:
+ *
+ * abs(interval - mean) < sqrt(variance)
+ *
+ * and we want to prevent to do an sqrt, so we square
+ * the equation:
+ *
+ * (interval - mean)^2 < variance
+ *
+ * So if the latest value of the stats complies with
+ * this condition, then the wakeup source is
+ * considered predictable and can be used to predict
+ * the next event.
+ */
+ interval = w->stats.values[w->stats.w_ptr];
+ if ((u64)((interval - mean) * (interval - mean)) > variance)
+ continue;
+
+ /*
+ * Let's compute the next event: the wakeup source is
+ * considered predictable, we add the average interval
+ * time added to the latest interruption event time.
+ */
+ next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
+
+ /*
+ * If the interrupt is supposed to happen before the
+ * minimum time, then it becomes the minimum.
+ */
+ if (ktime_before(next, min))
+ min = next;
+ }
+
+ /*
+ * At this point, we have our prediction but the caller is
+ * expecting the remaining time before the next event, so
+ * compute the expected sleep length.
+ */
+ diff = ktime_sub(min, now);
+
+ /*
+ * The result could be negative for different reasons:
+ * - the prediction is incorrect
+ * - the prediction was too near now and expired while we were
+ * in this function
+ *
+ * In both cases, we return KTIME_MAX as a failure to do a
+ * prediction
+ */
+ if (ktime_compare(diff, ktime_set(0, 0)) <= 0)
+ return ktime_set(KTIME_SEC_MAX, 0);
+
+ return diff;
+}
+
+/**
+ * sched_idle_next_wakeup - Predict the next wakeup on the current cpu
+ *
+ * The next event on the cpu is based on a statistic approach of the
+ * interrupt events and the timer deterministic value. From the timer
+ * or the irqs, we return the one expected to occur first.
+ *
+ * Returns the expected remaining idle time before being woken up by
+ * an interruption.
+ */
+s64 sched_idle_next_wakeup(void)
+{
+ s64 next_timer = ktime_to_us(tick_nohz_get_sleep_length());
+ s64 next_irq = ktime_to_us(next_irq_event());
+
+ return min(next_irq, next_timer);
+}
+
+/**
+ * sched_idle - go to idle for a specified amount of time
+ *
+ * @duration: the idle duration time
+ * @latency: the latency constraint
+ *
+ * Returns 0 on success, < 0 otherwise.
+ */
+int sched_idle(s64 duration, unsigned int latency)
+{
+ struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
+ struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
+ struct cpuidle_state_usage *su;
+ struct cpuidle_state *s;
+ int i, ret = 0, index = -1;
+
+ rcu_idle_enter();
+
+ /*
+ * No cpuidle driver is available, let's use the default arch
+ * idle function.
+ */
+ if (cpuidle_not_available(drv, dev))
+ goto default_idle;
+
+ /*
+ * Find the idle state with the lowest power while satisfying
+ * our constraints. We will save energy if the duration of the
+ * idle time is bigger than the target residency which is the
+ * break even point. The choice will be modulated by the
+ * latency.
+ */
+ for (i = 0; i < drv->state_count; i++) {
+
+ s = &drv->states[i];
+
+ su = &dev->states_usage[i];
+
+ if (s->disabled || su->disable)
+ continue;
+ if (s->target_residency > duration)
+ continue;
+ if (s->exit_latency > latency)
+ continue;
+
+ index = i;
+ }
+
+ /*
+ * The idle task must be scheduled, it is pointless to go to
+ * idle, just re-enable the interrupt and return.
+ */
+ if (current_clr_polling_and_test()) {
+ local_irq_enable();
+ goto out;
+ }
+
+ if (index < 0) {
+ /*
+ * No idle callbacks fulfilled the constraints, jump
+ * to the default function like there wasn't any
+ * cpuidle driver.
+ */
+ goto default_idle;
+ } else {
+ /*
+ * Enter the idle state previously returned by the
+ * governor decision. This function will block until
+ * an interrupt occurs and will take care of
+ * re-enabling the local interrupts
+ */
+ return cpuidle_enter(drv, dev, index);
+ }
+
+default_idle:
+ default_idle_call();
+out:
+ rcu_idle_exit();
+ return ret;
+}
+
+/**
+ * sched_irq_timing_remove - disable the tracking of the specified irq
+ *
+ * Clear the irq table slot to stop tracking the interrupt.
+ *
+ * @irq: the irq number to stop tracking
+ * @dev_id: the device id for shared irq
+ *
+ * This function will remove from the wakeup source prediction table.
+ */
+static void sched_irq_timing_remove(unsigned int irq, void *dev_id)
+{
+ clear_bit(irq, enabled_irq);
+}
+
+/**
+ * sched_irq_timing_setup - enable the tracking of the specified irq
+ *
+ * Function is called with the corresponding irqdesc lock taken. It is
+ * not allowed to do any memory allocation or blocking call. Flag the
+ * irq table slot to be tracked in order to predict the next event.
+ *
+ * @irq: the interrupt numbe to be tracked
+ * @act: the new irq action to be set to this interrupt
+ *
+ * Returns zero on success, < 0 otherwise.
+ */
+static int sched_irq_timing_setup(unsigned int irq, struct irqaction *act)
+{
+ /*
+ * No interrupt set for this descriptor or related to a timer.
+ * Timers are deterministic, so no need to try to do any
+ * prediction on them. No error for both cases, we are just not
+ * interested.
+ */
+ if (!(act->flags & __IRQF_TIMER))
+ return 0;
+
+ set_bit(irq, enabled_irq);
+
+ return 0;
+}
+
+/**
+ * sched_irq_timing_free - free memory previously allocated
+ *
+ * @irq: the interrupt number
+ */
+static void sched_irq_timing_free(unsigned int irq)
+{
+ struct wakeup *w;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+
+ w = per_cpu(wakeups[irq], cpu);
+ if (!w)
+ continue;
+
+ per_cpu(wakeups[irq], cpu) = NULL;
+ kfree(w);
+ }
+}
+
+/**
+ * sched_irq_timing_alloc - allocates memory for irq tracking
+ *
+ * Allocates the memory to track the specified irq.
+ *
+ * @irq: the interrupt number
+ *
+ * Returns 0 on success, -ENOMEM on error.
+ */
+static int sched_irq_timing_alloc(unsigned int irq)
+{
+ struct wakeup *w;
+ int cpu, ret = -ENOMEM;
+
+ /*
+ * Allocates the wakeup structure and the stats structure. As
+ * the interrupt can occur on any cpu, allocate the wakeup
+ * structure per cpu basis.
+ */
+ for_each_possible_cpu(cpu) {
+
+ w = kzalloc(sizeof(*w), GFP_KERNEL);
+ if (!w)
+ goto out;
+
+ per_cpu(wakeups[irq], cpu) = w;
+ }
+
+ ret = 0;
+out:
+ if (ret)
+ sched_irq_timing_free(irq);
+
+ return ret;
+}
+
+static struct irqtimings_ops irqt_ops = {
+ .alloc = sched_irq_timing_alloc,
+ .free = sched_irq_timing_free,
+ .setup = sched_irq_timing_setup,
+ .remove = sched_irq_timing_remove,
+ .handler = sched_irq_timing_handler,
+};
+
+DECLARE_IRQ_TIMINGS(&irqt_ops);
--
1.9.1
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
@ 2016-01-20 17:46 ` Nicolas Pitre
2016-01-20 18:44 ` Peter Zijlstra
2016-01-21 10:03 ` Daniel Lezcano
2016-01-20 19:02 ` Peter Zijlstra
` (3 subsequent siblings)
4 siblings, 2 replies; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-20 17:46 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On Wed, 20 Jan 2016, Daniel Lezcano wrote:
> Many IRQs are quiet most of the time, or they tend to come in bursts of
> fairly equal time intervals within each burst. It is therefore possible
> to detect those IRQs with stable intervals and guestimate when the next
> IRQ event is most likely to happen.
>
> Examples of such IRQs may include audio related IRQs where the FIFO size
> and/or DMA descriptor size with the sample rate create stable intervals,
> block devices during large data transfers, etc. Even network streaming
> of multimedia content creates patterns of periodic network interface IRQs
> in some cases.
>
> This patch adds code to track the mean interval and variance for each IRQ
> over a window of time intervals between IRQ events. Those statistics can
> be used to assist cpuidle in selecting the most appropriate sleep state
> by predicting the most likely time for the next interrupt.
>
> Because the stats are gathered in interrupt context, the core computation
> is as light as possible.
>
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> ---
> drivers/cpuidle/Kconfig | 9 +
> kernel/sched/Makefile | 1 +
> kernel/sched/idle-sched.c | 529 ++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 539 insertions(+)
> create mode 100644 kernel/sched/idle-sched.c
>
> diff --git a/drivers/cpuidle/Kconfig b/drivers/cpuidle/Kconfig
> index 8c7930b..a606106 100644
> --- a/drivers/cpuidle/Kconfig
> +++ b/drivers/cpuidle/Kconfig
> @@ -25,6 +25,15 @@ config CPU_IDLE_GOV_MENU
> bool "Menu governor (for tickless system)"
> default y
>
> +config CPU_IDLE_GOV_SCHED
> + bool "Sched idle governor"
> + select IRQ_TIMINGS
> + help
> + Enables an irq timings tracking mechanism to track the wakeup sources
> + of the platform.
> +
> + If you are unsure, it is safe to say N.
> +
> config DT_IDLE_STATES
> bool
>
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 6768797..f7d5a35 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -19,3 +19,4 @@ obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
> obj-$(CONFIG_SCHEDSTATS) += stats.o
> obj-$(CONFIG_SCHED_DEBUG) += debug.o
> obj-$(CONFIG_CGROUP_CPUACCT) += cpuacct.o
> +obj-$(CONFIG_CPU_IDLE_GOV_SCHED) += idle-sched.o
> diff --git a/kernel/sched/idle-sched.c b/kernel/sched/idle-sched.c
> new file mode 100644
> index 0000000..c2b8568
> --- /dev/null
> +++ b/kernel/sched/idle-sched.c
> @@ -0,0 +1,529 @@
> +/*
> + * Copyright (C) 2016 Linaro Ltd, Daniel Lezcano <daniel.lezcano@linaro.org>
> + * Nicolas Pitre <nicolas.pitre@linaro.org>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + */
> +#include <linux/cpuidle.h>
> +#include <linux/interrupt.h>
> +#include <linux/irqdesc.h>
> +#include <linux/ktime.h>
> +#include <linux/slab.h>
> +#include <linux/tick.h>
> +#include <linux/time64.h>
> +
> +/*
> + * Define the number of samples over which the average and variance
> + * are computed. A power of 2 is preferred so to let the compiler
> + * optimize divisions by that number with simple arithmetic shifts.
> + */
> +#define STATS_NR_VALUES 4
> +
> +/**
> + * struct stats - internal structure to encapsulate stats informations
> + *
> + * @sum: sum of the values
> + * @values: array of values to do stats on
> + * @w_ptr: current buffer pointer
> + */
> +struct stats {
> + u64 sum; /* sum of values */
> + u32 values[STATS_NR_VALUES]; /* array of values */
> + unsigned char w_ptr; /* current window pointer */
Why did you change this from an unsigned int?
This won't provide any memory space saving given that the structure has
to be padded up to the next 64-bit boundary.
> +};
> +
> +/**
> + * struct wakeup - internal structure describing a source of wakeup
> + *
> + * @stats: the stats structure on the different event intervals
> + * @timestamp: latest update timestamp
> + */
> +struct wakeup {
> + struct stats stats;
> + ktime_t timestamp;
> +};
> +
> +/*
> + * Per cpu and irq statistics. Each cpu receives interrupts and those
> + * ones can be distributed following an irq chip specific
> + * algorithm. Random irq distribution is the worst case to predict
> + * interruption behavior but usually that does not happen or could be
> + * fixed from userspace by setting the irq affinity.
> + */
> +static DEFINE_PER_CPU(struct wakeup, *wakeups[NR_IRQS]);
> +
> +static DECLARE_BITMAP(enabled_irq, NR_IRQS);
> +
> +/**
> + * stats_add - add a new value in the statistic structure
> + *
> + * @s: the statistic structure
> + * @value: the new value to be added
> + *
> + * Adds the value to the array, if the array is full, the oldest value
> + * is replaced.
> + */
> +static void stats_add(struct stats *s, u32 value)
> +{
> + /*
> + * This is a circular buffer, so the oldest value is the next
> + * one in the buffer. Let's compute the next pointer to
> + * retrieve the oldest value and re-use it to update the w_ptr
> + * after adding the new value.
> + */
> + s->w_ptr = (s->w_ptr + 1) % STATS_NR_VALUES;
> +
> + /*
> + * Remove the oldest value from the summing. If this is the
> + * first time we go through this array slot, the previous
> + * value will be zero and we won't substract anything from the
> + * current sum. Hence this code relies on a zero-ed stat
> + * structure at init time via memset or kzalloc.
> + */
> + s->sum -= s->values[s->w_ptr];
> + s->values[s->w_ptr] = value;
> +
> + /*
> + * In order to reduce the overhead and to prevent value
> + * derivation due to the integer computation, we just sum the
> + * value and do the division when the average and the variance
> + * are requested.
> + */
> + s->sum += value;
> +}
> +
> +/**
> + * stats_reset - reset the stats
> + *
> + * @s: the statistic structure
> + *
> + * Reset the statistics and reset the values
> + */
> +static inline void stats_reset(struct stats *s)
> +{
> + memset(s, 0, sizeof(*s));
> +}
> +
> +/**
> + * stats_mean - compute the average
> + *
> + * @s: the statistics structure
> + *
> + * Returns an u32 corresponding to the mean value, or zero if there is
> + * no data
> + */
> +static inline u32 stats_mean(struct stats *s)
> +{
> + /*
> + * gcc is smart enough to convert to a bits shift when the
> + * divisor is constant and multiple of 2^x.
> + *
> + * The number of values could have not reached STATS_NR_VALUES
> + * yet, but we can consider it acceptable as the situation is
> + * only at the beginning of the burst of irqs.
> + */
> + return s->sum / STATS_NR_VALUES;
> +}
> +
> +/**
> + * stats_variance - compute the variance
> + *
> + * @s: the statistic structure
> + *
> + * Returns an u64 corresponding to the variance, or zero if there is
> + * no data
> + */
> +static u64 stats_variance(struct stats *s, u32 mean)
> +{
> + int i;
> + u64 variance = 0;
> +
> + /*
> + * The variance is the sum of the squared difference to the
> + * average divided by the number of elements.
> + */
> + for (i = 0; i < STATS_NR_VALUES; i++) {
> + s64 diff = s->values[i] - mean;
> + variance += (u64)diff * diff;
> + }
This is completely wrong. Even more wrong than it used to be. I must
have expressed myself badly about this last time.
To avoid any confusion, here's what the code should be:
int i;
u64 variance = 0;
for (i = 0; i < STATS_NR_VALUES; i++) {
s32 diff = s->values[i] - mean;
variance += (s64)diff * diff;
}
[...]
> +
> + return variance / STATS_NR_VALUES;
> +}
> +
> +/**
> + * sched_idle_irq - irq timestamp callback
> + *
> + * @irq: the irq number
> + * @timestamp: when the interrupt occured
> + * @dev_id: device id for shared interrupt (not yet used)
> + *
> + * Interrupt callback called when an interrupt happens. This function
> + * is critical as it is called under an interrupt section: minimum
> + * operations as possible are done here:
> + */
> +static void sched_irq_timing_handler(unsigned int irq, ktime_t timestamp, void *dev_id)
> +{
> + u32 diff;
> + unsigned int cpu = raw_smp_processor_id();
> + struct wakeup *w = per_cpu(wakeups[irq], cpu);
> +
> + /*
> + * It is the first time the interrupt occurs of the series, we
> + * can't do any stats as we don't have an interval, just store
> + * the timestamp and exit.
> + */
> + if (ktime_equal(w->timestamp, ktime_set(0, 0))) {
> + w->timestamp = timestamp;
> + return;
> + }
> +
> + /*
> + * Microsec resolution is enough for our purpose.
> + */
> + diff = ktime_us_delta(timestamp, w->timestamp);
> + w->timestamp = timestamp;
> +
> + /*
> + * There is no point attempting predictions on interrupts more
> + * than ~1 second apart. This has no benefit for sleep state
> + * selection and increases the risk of overflowing our variance
> + * computation. Reset all stats in that case.
> + */
> + if (diff > (1 << 20)) {
You could use the USEC_PER_SEC constant here. It is already widely used
and would make the code even more obvious.
> + stats_reset(&w->stats);
> + return;
> + }
> +
> + stats_add(&w->stats, diff);
> +}
> +
> +static ktime_t next_irq_event(void)
> +{
> + unsigned int irq, cpu = raw_smp_processor_id();
> + ktime_t diff, next, min = ktime_set(KTIME_SEC_MAX, 0);
> + ktime_t now = ktime_get();
> + struct wakeup *w;
> + u32 interval, mean;
> + u64 variance;
> +
> + /*
> + * Lookup the interrupt array for this cpu and search for the
> + * earlier expected interruption.
> + */
> + for (irq = 0; irq < NR_IRQS; irq = find_next_bit(enabled_irq, NR_IRQS, irq)) {
> +
> + w = per_cpu(wakeups[irq], cpu);
> +
> + /*
> + * The interrupt was not setup as a source of a wakeup
> + * or the wakeup source is not considered at this
> + * moment stable enough to do a prediction.
> + */
> + if (!w)
> + continue;
> +
> + /*
> + * No statistics available yet.
> + */
> + if (ktime_equal(w->timestamp, ktime_set(0, 0)))
> + continue;
> +
> + diff = ktime_sub(now, w->timestamp);
> +
> + /*
> + * There is no point attempting predictions on interrupts more
> + * than 1 second apart. This has no benefit for sleep state
> + * selection and increases the risk of overflowing our variance
> + * computation. Reset all stats in that case.
> + */
This comment is wrong. It is relevant in sched_irq_timing_handler() but
not here. Instead this should be something like:
/*
* This interrupt last triggered more than a second ago.
* It is definitely not predictable for our purpose anymore.
*/
> + if (unlikely(ktime_after(diff, ktime_set(1, 0)))) {
> + stats_reset(&w->stats);
> + continue;
> + }
> +
> + /*
> + * If the mean value is null, just ignore this wakeup
> + * source.
> + */
> + mean = stats_mean(&w->stats);
> + if (!mean)
> + continue;
> +
> + variance = stats_variance(&w->stats, mean);
> + /*
> + * We want to check the last interval is:
> + *
> + * mean - stddev < interval < mean + stddev
> + *
> + * That simplifies to:
> + *
> + * -stddev < interval - mean < stddev
> + *
> + * abs(interval - mean) < stddev
> + *
> + * The standard deviation is the sqrt of the variance:
> + *
> + * abs(interval - mean) < sqrt(variance)
> + *
> + * and we want to prevent to do an sqrt, so we square
> + * the equation:
> + *
> + * (interval - mean)^2 < variance
> + *
> + * So if the latest value of the stats complies with
> + * this condition, then the wakeup source is
> + * considered predictable and can be used to predict
> + * the next event.
> + */
> + interval = w->stats.values[w->stats.w_ptr];
> + if ((u64)((interval - mean) * (interval - mean)) > variance)
s/u64/s64/ please.
> + continue;
> +
> + /*
> + * Let's compute the next event: the wakeup source is
> + * considered predictable, we add the average interval
> + * time added to the latest interruption event time.
> + */
> + next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
> +
> + /*
> + * If the interrupt is supposed to happen before the
> + * minimum time, then it becomes the minimum.
> + */
> + if (ktime_before(next, min))
> + min = next;
> + }
> +
> + /*
> + * At this point, we have our prediction but the caller is
> + * expecting the remaining time before the next event, so
> + * compute the expected sleep length.
> + */
> + diff = ktime_sub(min, now);
> +
> + /*
> + * The result could be negative for different reasons:
> + * - the prediction is incorrect
> + * - the prediction was too near now and expired while we were
> + * in this function
> + *
> + * In both cases, we return KTIME_MAX as a failure to do a
> + * prediction
> + */
> + if (ktime_compare(diff, ktime_set(0, 0)) <= 0)
> + return ktime_set(KTIME_SEC_MAX, 0);
> +
> + return diff;
> +}
> +
> +/**
> + * sched_idle_next_wakeup - Predict the next wakeup on the current cpu
> + *
> + * The next event on the cpu is based on a statistic approach of the
> + * interrupt events and the timer deterministic value. From the timer
> + * or the irqs, we return the one expected to occur first.
> + *
> + * Returns the expected remaining idle time before being woken up by
> + * an interruption.
> + */
> +s64 sched_idle_next_wakeup(void)
> +{
> + s64 next_timer = ktime_to_us(tick_nohz_get_sleep_length());
> + s64 next_irq = ktime_to_us(next_irq_event());
> +
> + return min(next_irq, next_timer);
> +}
> +
> +/**
> + * sched_idle - go to idle for a specified amount of time
> + *
> + * @duration: the idle duration time
> + * @latency: the latency constraint
> + *
> + * Returns 0 on success, < 0 otherwise.
> + */
> +int sched_idle(s64 duration, unsigned int latency)
> +{
> + struct cpuidle_device *dev = __this_cpu_read(cpuidle_devices);
> + struct cpuidle_driver *drv = cpuidle_get_cpu_driver(dev);
> + struct cpuidle_state_usage *su;
> + struct cpuidle_state *s;
> + int i, ret = 0, index = -1;
> +
> + rcu_idle_enter();
> +
> + /*
> + * No cpuidle driver is available, let's use the default arch
> + * idle function.
> + */
> + if (cpuidle_not_available(drv, dev))
> + goto default_idle;
> +
> + /*
> + * Find the idle state with the lowest power while satisfying
> + * our constraints. We will save energy if the duration of the
> + * idle time is bigger than the target residency which is the
> + * break even point. The choice will be modulated by the
> + * latency.
> + */
> + for (i = 0; i < drv->state_count; i++) {
> +
> + s = &drv->states[i];
> +
> + su = &dev->states_usage[i];
> +
> + if (s->disabled || su->disable)
> + continue;
> + if (s->target_residency > duration)
> + continue;
> + if (s->exit_latency > latency)
> + continue;
> +
> + index = i;
> + }
> +
> + /*
> + * The idle task must be scheduled, it is pointless to go to
> + * idle, just re-enable the interrupt and return.
> + */
> + if (current_clr_polling_and_test()) {
> + local_irq_enable();
> + goto out;
> + }
> +
> + if (index < 0) {
> + /*
> + * No idle callbacks fulfilled the constraints, jump
> + * to the default function like there wasn't any
> + * cpuidle driver.
> + */
> + goto default_idle;
> + } else {
> + /*
> + * Enter the idle state previously returned by the
> + * governor decision. This function will block until
> + * an interrupt occurs and will take care of
> + * re-enabling the local interrupts
> + */
> + return cpuidle_enter(drv, dev, index);
> + }
> +
> +default_idle:
> + default_idle_call();
> +out:
> + rcu_idle_exit();
> + return ret;
> +}
> +
> +/**
> + * sched_irq_timing_remove - disable the tracking of the specified irq
> + *
> + * Clear the irq table slot to stop tracking the interrupt.
> + *
> + * @irq: the irq number to stop tracking
> + * @dev_id: the device id for shared irq
> + *
> + * This function will remove from the wakeup source prediction table.
> + */
> +static void sched_irq_timing_remove(unsigned int irq, void *dev_id)
> +{
> + clear_bit(irq, enabled_irq);
> +}
> +
> +/**
> + * sched_irq_timing_setup - enable the tracking of the specified irq
> + *
> + * Function is called with the corresponding irqdesc lock taken. It is
> + * not allowed to do any memory allocation or blocking call. Flag the
> + * irq table slot to be tracked in order to predict the next event.
> + *
> + * @irq: the interrupt numbe to be tracked
> + * @act: the new irq action to be set to this interrupt
> + *
> + * Returns zero on success, < 0 otherwise.
> + */
> +static int sched_irq_timing_setup(unsigned int irq, struct irqaction *act)
> +{
> + /*
> + * No interrupt set for this descriptor or related to a timer.
> + * Timers are deterministic, so no need to try to do any
> + * prediction on them. No error for both cases, we are just not
> + * interested.
> + */
> + if (!(act->flags & __IRQF_TIMER))
> + return 0;
> +
> + set_bit(irq, enabled_irq);
> +
> + return 0;
> +}
> +
> +/**
> + * sched_irq_timing_free - free memory previously allocated
> + *
> + * @irq: the interrupt number
> + */
> +static void sched_irq_timing_free(unsigned int irq)
> +{
> + struct wakeup *w;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> +
> + w = per_cpu(wakeups[irq], cpu);
> + if (!w)
> + continue;
> +
> + per_cpu(wakeups[irq], cpu) = NULL;
> + kfree(w);
> + }
> +}
> +
> +/**
> + * sched_irq_timing_alloc - allocates memory for irq tracking
> + *
> + * Allocates the memory to track the specified irq.
> + *
> + * @irq: the interrupt number
> + *
> + * Returns 0 on success, -ENOMEM on error.
> + */
> +static int sched_irq_timing_alloc(unsigned int irq)
> +{
> + struct wakeup *w;
> + int cpu, ret = -ENOMEM;
> +
> + /*
> + * Allocates the wakeup structure and the stats structure. As
> + * the interrupt can occur on any cpu, allocate the wakeup
> + * structure per cpu basis.
> + */
> + for_each_possible_cpu(cpu) {
> +
> + w = kzalloc(sizeof(*w), GFP_KERNEL);
> + if (!w)
> + goto out;
> +
> + per_cpu(wakeups[irq], cpu) = w;
> + }
> +
> + ret = 0;
> +out:
> + if (ret)
> + sched_irq_timing_free(irq);
> +
> + return ret;
> +}
> +
> +static struct irqtimings_ops irqt_ops = {
> + .alloc = sched_irq_timing_alloc,
> + .free = sched_irq_timing_free,
> + .setup = sched_irq_timing_setup,
> + .remove = sched_irq_timing_remove,
> + .handler = sched_irq_timing_handler,
> +};
> +
> +DECLARE_IRQ_TIMINGS(&irqt_ops);
> --
> 1.9.1
>
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
@ 2016-01-20 17:55 ` Thomas Gleixner
2016-01-21 9:25 ` Daniel Lezcano
2016-01-20 19:07 ` Peter Zijlstra
2016-01-20 19:28 ` Peter Zijlstra
2 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-20 17:55 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, 20 Jan 2016, Daniel Lezcano wrote:
> +#ifdef CONFIG_IRQ_TIMINGS
> +/**
> + * struct irqt_ops - structure to be used by the subsystem to track
> + * irq timings
> + * @alloc: called when an irqdesc is allocated
> + * @free: called when an irqdesc is free
> + * @setup: called when an irq is setup, this is called under lock
> + * @remove: called when an irq is removed
> + * @handler: called when an interrupt is handled
> + */
> +struct irqtimings_ops {
> + int (*alloc)(unsigned int);
> + void (*free)(unsigned int);
> + int (*setup)(unsigned int, struct irqaction *act);
> + void (*remove)(unsigned int, void *dev_id);
> + irqt_handler_t handler;
> +};
> +
> +/**
> + * This macro *must* be used by the subsystem interested by the irq
> + * timing information.
> + */
> +#define DECLARE_IRQ_TIMINGS(__ops) \
> + const struct irqtimings_ops *__irqtimings = __ops;
> +#endif
> @@ -20,6 +20,49 @@ extern bool noirqdebug;
>
> extern struct irqaction chained_action;
>
> +#ifdef CONFIG_IRQ_TIMINGS
> +
> +extern const struct irqtimings_ops *__irqtimings;
> +
> +static inline int alloc_irqtiming(unsigned int irq)
> +{
> + if (__irqtimings->alloc)
> + return __irqtimings->alloc(irq);
I really have a hard time to understand that indirection. __irqtimings is
statically allocated and compiled in. There can be only one user for this in
the system ever and that user has all callbacks populated.
Why can't you spare all that pointer muck and simply have:
#ifdef CONFIG_IRQ_TIMINGS
int irqtiming_alloc(usigned int irq);
....
#else
static int irqtiming_alloc(usigned int irq) { return 0; }
...
#endif
and implement those functions in your idle thingy?
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 17:46 ` Nicolas Pitre
@ 2016-01-20 18:44 ` Peter Zijlstra
2016-01-21 10:03 ` Daniel Lezcano
1 sibling, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 18:44 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Daniel Lezcano, tglx, rafael, linux-pm, linux-kernel, vincent.guittot
On Wed, Jan 20, 2016 at 12:46:48PM -0500, Nicolas Pitre wrote:
> > +struct stats {
> > + u64 sum; /* sum of values */
> > + u32 values[STATS_NR_VALUES]; /* array of values */
> > + unsigned char w_ptr; /* current window pointer */
>
> Why did you change this from an unsigned int?
>
> This won't provide any memory space saving given that the structure has
> to be padded up to the next 64-bit boundary.
Not to mention that loading bytes is more expensive on many archs
compared to full words.
Also, its not a pointer, its an index.
So: unsigned int w_idx;
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
2016-01-20 17:46 ` Nicolas Pitre
@ 2016-01-20 19:02 ` Peter Zijlstra
2016-01-20 19:17 ` Nicolas Pitre
2016-01-20 19:34 ` Peter Zijlstra
` (2 subsequent siblings)
4 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 19:02 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, Jan 20, 2016 at 05:00:33PM +0100, Daniel Lezcano wrote:
> +static void sched_irq_timing_handler(unsigned int irq, ktime_t timestamp, void *dev_id)
> +{
> + u32 diff;
> + unsigned int cpu = raw_smp_processor_id();
> + struct wakeup *w = per_cpu(wakeups[irq], cpu);
> +
> + /*
> + * It is the first time the interrupt occurs of the series, we
> + * can't do any stats as we don't have an interval, just store
> + * the timestamp and exit.
> + */
> + if (ktime_equal(w->timestamp, ktime_set(0, 0))) {
> + w->timestamp = timestamp;
> + return;
> + }
> +
> + /*
> + * Microsec resolution is enough for our purpose.
> + */
It is also a friggin pointless /1000. The cpuidle code also loves to do
this, and its silly, u64 add/sub are _way_ cheaper than u64 / 1000.
> + diff = ktime_us_delta(timestamp, w->timestamp);
> + w->timestamp = timestamp;
> +
> + /*
> + * There is no point attempting predictions on interrupts more
> + * than ~1 second apart. This has no benefit for sleep state
> + * selection and increases the risk of overflowing our variance
> + * computation. Reset all stats in that case.
> + */
> + if (diff > (1 << 20)) {
> + stats_reset(&w->stats);
> + return;
> + }
> +
> + stats_add(&w->stats, diff);
> +}
> +
> +static ktime_t next_irq_event(void)
> +{
> + unsigned int irq, cpu = raw_smp_processor_id();
> + ktime_t diff, next, min = ktime_set(KTIME_SEC_MAX, 0);
> + ktime_t now = ktime_get();
Why !?! do we care about NTP correct timestamps?
ktime_get() can be horrendously slow, don't use it for statistics.
> + next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
> + s64 next_timer = ktime_to_us(tick_nohz_get_sleep_length());
> + s64 next_irq = ktime_to_us(next_irq_event());
more nonsense, just say no.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
2016-01-20 17:55 ` Thomas Gleixner
@ 2016-01-20 19:07 ` Peter Zijlstra
2016-01-20 19:57 ` Thomas Gleixner
2016-01-21 9:26 ` Daniel Lezcano
2016-01-20 19:28 ` Peter Zijlstra
2 siblings, 2 replies; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 19:07 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, Jan 20, 2016 at 05:00:32PM +0100, Daniel Lezcano wrote:
> +++ b/kernel/irq/handle.c
> @@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
> /* Fall through to add to randomness */
> case IRQ_HANDLED:
> flags |= action->flags;
> + handle_irqtiming(irq, action->dev_id);
> break;
>
> default:
> +++ b/kernel/irq/internals.h
> +static inline void handle_irqtiming(unsigned int irq, void *dev_id)
> +{
> + if (__irqtimings->handler)
> + __irqtimings->handler(irq, ktime_get(), dev_id);
> +}
Here too, ktime_get() is daft.
Also, you really want to take the timestamp _before_ we call the
handlers, not after, otherwise you mix in whatever variance exist in the
handler duration.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 19:02 ` Peter Zijlstra
@ 2016-01-20 19:17 ` Nicolas Pitre
2016-01-20 19:29 ` Peter Zijlstra
0 siblings, 1 reply; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-20 19:17 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Daniel Lezcano, tglx, rafael, linux-pm, linux-kernel, vincent.guittot
On Wed, 20 Jan 2016, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 05:00:33PM +0100, Daniel Lezcano wrote:
> > +static void sched_irq_timing_handler(unsigned int irq, ktime_t timestamp, void *dev_id)
> > +{
> > + u32 diff;
> > + unsigned int cpu = raw_smp_processor_id();
> > + struct wakeup *w = per_cpu(wakeups[irq], cpu);
> > +
> > + /*
> > + * It is the first time the interrupt occurs of the series, we
> > + * can't do any stats as we don't have an interval, just store
> > + * the timestamp and exit.
> > + */
> > + if (ktime_equal(w->timestamp, ktime_set(0, 0))) {
> > + w->timestamp = timestamp;
> > + return;
> > + }
> > +
> > + /*
> > + * Microsec resolution is enough for our purpose.
> > + */
>
> It is also a friggin pointless /1000. The cpuidle code also loves to do
> this, and its silly, u64 add/sub are _way_ cheaper than u64 / 1000.
For the purpose of this code, nanoseconds simply provides too many bits
for what we care. Computing the variance implies squared values.
*However* we can simply do diff = (timestamp - w->timestamp) >> 10
instead. No need to have an exact microsecs base.
> > + ktime_t now = ktime_get();
>
> Why !?! do we care about NTP correct timestamps?
Not at all. Using sched_clock() instead would be more than good enough
indeed.
Nicolas
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
2016-01-20 17:55 ` Thomas Gleixner
2016-01-20 19:07 ` Peter Zijlstra
@ 2016-01-20 19:28 ` Peter Zijlstra
2016-01-21 9:53 ` Daniel Lezcano
2 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 19:28 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, Jan 20, 2016 at 05:00:32PM +0100, Daniel Lezcano wrote:
> +++ b/kernel/irq/handle.c
> @@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
> /* Fall through to add to randomness */
> case IRQ_HANDLED:
> flags |= action->flags;
> + handle_irqtiming(irq, action->dev_id);
> break;
This also looks completely busted for shared interrupts.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 19:17 ` Nicolas Pitre
@ 2016-01-20 19:29 ` Peter Zijlstra
0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 19:29 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Daniel Lezcano, tglx, rafael, linux-pm, linux-kernel, vincent.guittot
On Wed, Jan 20, 2016 at 02:17:57PM -0500, Nicolas Pitre wrote:
> > It is also a friggin pointless /1000. The cpuidle code also loves to do
> > this, and its silly, u64 add/sub are _way_ cheaper than u64 / 1000.
>
> For the purpose of this code, nanoseconds simply provides too many bits
> for what we care. Computing the variance implies squared values.
>
> *However* we can simply do diff = (timestamp - w->timestamp) >> 10
> instead. No need to have an exact microsecs base.
Right, you could also reduce bits at the variance computation, but yes.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
2016-01-20 17:46 ` Nicolas Pitre
2016-01-20 19:02 ` Peter Zijlstra
@ 2016-01-20 19:34 ` Peter Zijlstra
2016-01-20 19:40 ` Peter Zijlstra
2016-01-20 19:49 ` Thomas Gleixner
4 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 19:34 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, Jan 20, 2016 at 05:00:33PM +0100, Daniel Lezcano wrote:
> + variance = stats_variance(&w->stats, mean);
> + /*
> + * We want to check the last interval is:
> + *
> + * mean - stddev < interval < mean + stddev
> + *
> + * That simplifies to:
> + *
> + * -stddev < interval - mean < stddev
> + *
> + * abs(interval - mean) < stddev
> + *
> + * The standard deviation is the sqrt of the variance:
> + *
> + * abs(interval - mean) < sqrt(variance)
> + *
> + * and we want to prevent to do an sqrt, so we square
> + * the equation:
> + *
> + * (interval - mean)^2 < variance
> + *
> + * So if the latest value of the stats complies with
> + * this condition, then the wakeup source is
> + * considered predictable and can be used to predict
> + * the next event.
> + */
> + interval = w->stats.values[w->stats.w_ptr];
> + if ((u64)((interval - mean) * (interval - mean)) > variance)
> + continue;
> +
Cute :-)
You could consider putting a simple 3 point median filter in front of
stat_add() to get rid of the worst input noise.
(I know, improving all this was a point for later)
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
` (2 preceding siblings ...)
2016-01-20 19:34 ` Peter Zijlstra
@ 2016-01-20 19:40 ` Peter Zijlstra
2016-01-20 19:57 ` Nicolas Pitre
2016-01-20 19:49 ` Thomas Gleixner
4 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 19:40 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, Jan 20, 2016 at 05:00:33PM +0100, Daniel Lezcano wrote:
> +static inline u32 stats_mean(struct stats *s)
> +{
> + /*
> + * gcc is smart enough to convert to a bits shift when the
> + * divisor is constant and multiple of 2^x.
> + *
> + * The number of values could have not reached STATS_NR_VALUES
> + * yet, but we can consider it acceptable as the situation is
> + * only at the beginning of the burst of irqs.
> + */
> + return s->sum / STATS_NR_VALUES;
> +}
Note that ->sum is u64, so you're very prone to truncation.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
` (3 preceding siblings ...)
2016-01-20 19:40 ` Peter Zijlstra
@ 2016-01-20 19:49 ` Thomas Gleixner
2016-01-21 13:54 ` Daniel Lezcano
4 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-20 19:49 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Wed, 20 Jan 2016, Daniel Lezcano wrote:
> +/*
> + * Per cpu and irq statistics. Each cpu receives interrupts and those
> + * ones can be distributed following an irq chip specific
> + * algorithm. Random irq distribution is the worst case to predict
> + * interruption behavior but usually that does not happen or could be
> + * fixed from userspace by setting the irq affinity.
> + */
> +static DEFINE_PER_CPU(struct wakeup, *wakeups[NR_IRQS]);
That does not work with sparse irqs. With sparse irqs NR_IRQS can be very
small and the core can expand up to NR_IRQS + 8196
> +static DECLARE_BITMAP(enabled_irq, NR_IRQS);
Ditto.
> +static void stats_add(struct stats *s, u32 value)
> +{
> + /*
> + * This is a circular buffer, so the oldest value is the next
> + * one in the buffer. Let's compute the next pointer to
> + * retrieve the oldest value and re-use it to update the w_ptr
> + * after adding the new value.
> + */
> + s->w_ptr = (s->w_ptr + 1) % STATS_NR_VALUES;
I rather have that an explicit '& STATS_NR_MASK' here and some enforcement
that STATS_NR_VALUES is a power of two.
> +/**
> + * stats_mean - compute the average
> + *
> + * @s: the statistics structure
> + *
> + * Returns an u32 corresponding to the mean value, or zero if there is
> + * no data
> + */
> +static inline u32 stats_mean(struct stats *s)
> +{
> + /*
> + * gcc is smart enough to convert to a bits shift when the
> + * divisor is constant and multiple of 2^x.
Instead of adding those comments everywhere, you just could use a shift value,
which naturally enforces the power of 2 of STATS_NR_VALUES
> + *
> + * The number of values could have not reached STATS_NR_VALUES
> + * yet, but we can consider it acceptable as the situation is
> + * only at the beginning of the burst of irqs.
> + */
> + return s->sum / STATS_NR_VALUES;
> +}
> +/**
> + * sched_idle_irq - irq timestamp callback
There is a way to actually compile KernelDoc ....
> + *
> + * @irq: the irq number
> + * @timestamp: when the interrupt occured
> + * @dev_id: device id for shared interrupt (not yet used)
What is it going to be used for?
> + /*
> + * Microsec resolution is enough for our purpose.
> + */
> + diff = ktime_us_delta(timestamp, w->timestamp);
You force a division by 1000 here. Storing the nsec values is way cheaper. If
you really need to make it smaller for computational reasons, then use a shift
value. There is no requirement for this to be precise.
> +static ktime_t next_irq_event(void)
> +{
> + unsigned int irq, cpu = raw_smp_processor_id();
> + ktime_t diff, next, min = ktime_set(KTIME_SEC_MAX, 0);
> + ktime_t now = ktime_get();
> + struct wakeup *w;
> + u32 interval, mean;
> + u64 variance;
> +
> + /*
> + * Lookup the interrupt array for this cpu and search for the
> + * earlier expected interruption.
> + */
> + for (irq = 0; irq < NR_IRQS; irq = find_next_bit(enabled_irq, NR_IRQS, irq)) {
Again. This cannot work.
1) NR_IRQS is the wrong thing to look at.
2) It's racy against a concurrent teardown of interrupts.
> + w = per_cpu(wakeups[irq], cpu);
> +
> + /*
> + * The interrupt was not setup as a source of a wakeup
> + * or the wakeup source is not considered at this
> + * moment stable enough to do a prediction.
> + */
> + if (!w)
> + continue;
That comment makes no sense whatsoever. You look at the data which you
allocated from __setup_irq(). So how is that related?
> +/**
> + * sched_irq_timing_remove - disable the tracking of the specified irq
> + *
> + * Clear the irq table slot to stop tracking the interrupt.
> + *
> + * @irq: the irq number to stop tracking
> + * @dev_id: the device id for shared irq
> + *
> + * This function will remove from the wakeup source prediction table.
> + */
> +static void sched_irq_timing_remove(unsigned int irq, void *dev_id)
> +{
> + clear_bit(irq, enabled_irq);
> +}
> +
> +/**
> + * sched_irq_timing_setup - enable the tracking of the specified irq
> + *
> + * Function is called with the corresponding irqdesc lock taken. It is
> + * not allowed to do any memory allocation or blocking call. Flag the
> + * irq table slot to be tracked in order to predict the next event.
> + *
> + * @irq: the interrupt numbe to be tracked
> + * @act: the new irq action to be set to this interrupt
> + *
> + * Returns zero on success, < 0 otherwise.
> + */
> +static int sched_irq_timing_setup(unsigned int irq, struct irqaction *act)
> +{
> + /*
> + * No interrupt set for this descriptor or related to a timer.
> + * Timers are deterministic, so no need to try to do any
> + * prediction on them. No error for both cases, we are just not
> + * interested.
> + */
> + if (!(act->flags & __IRQF_TIMER))
> + return 0;
Well, you do not make any predicitions, but you allocate the data and do
sampling ...
> +
> + set_bit(irq, enabled_irq);
> +
> + return 0;
> +}
These two are required for what? The callsite is broken as well. You call
setup() for the first action of an irq and remove() for any action which gets
torn down. That does not make any sense at all.
> +/**
> + * sched_irq_timing_free - free memory previously allocated
> + *
> + * @irq: the interrupt number
> + */
> +static void sched_irq_timing_free(unsigned int irq)
> +{
> + struct wakeup *w;
> + int cpu;
> +
> + for_each_possible_cpu(cpu) {
> +
> + w = per_cpu(wakeups[irq], cpu);
> + if (!w)
> + continue;
> +
> + per_cpu(wakeups[irq], cpu) = NULL;
> + kfree(w);
Your simple array does not work. You need a radix_tree to handle SPARSE_IRQ
and you need proper protection against teardown.
So we can avoid all that stuff and simply stick that data into irqdesc and let
the core handle it. That allows us to use proper percpu allocations and avoid
that for_each_possible_cpu() sillyness.
That leaves the iterator, but that's a solvable problem. We simply can have an
iterator function in the irq core, which gives you the next sample
structure. Something like this:
struct irqtiming_sample *irqtiming_get_next(int *irq)
{
struct irq_desc *desc;
int next;
/* Do a racy lookup of the next allocated irq */
next = irq_get_next_irq(*irq);
if (next >= nr_irqs)
return NULL;
*irq = next + 1;
/* Now lookup the descriptor. It's RCU protected. */
desc = irq_to_desc(next);
if (!desc || !desc->irqtimings || !(desc->istate & IRQS_TIMING))
return NULL;
return this_cpu_ptr(&desc->irqtimings);
}
And that needs to be called rcu protected;
next = 0;
rcu_read_lock();
sample = irqtiming_get_next(&next);
while (sample) {
....
sample = irqtiming_get_next(&next);
}
rcu_read_unlock();
So the interrupt part becomes:
if (desc->istate & IRQS_TIMING)
irqtimings_handle(__this_cpu_ptr(&desc->irqtimings));
So now for the allocation/free of that data. We simply allocate/free it along
with the irq descriptor. That IRQS_TIMING bit gets set in __setup_irq() except
for timer interrupts. That's simple and avoid _all_ the issues.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 19:07 ` Peter Zijlstra
@ 2016-01-20 19:57 ` Thomas Gleixner
2016-01-20 20:04 ` Nicolas Pitre
` (2 more replies)
2016-01-21 9:26 ` Daniel Lezcano
1 sibling, 3 replies; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-20 19:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Daniel Lezcano, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On Wed, 20 Jan 2016, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 05:00:32PM +0100, Daniel Lezcano wrote:
> > +++ b/kernel/irq/handle.c
> > @@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
> > /* Fall through to add to randomness */
> > case IRQ_HANDLED:
> > flags |= action->flags;
> > + handle_irqtiming(irq, action->dev_id);
> > break;
> >
> > default:
>
> > +++ b/kernel/irq/internals.h
>
> > +static inline void handle_irqtiming(unsigned int irq, void *dev_id)
> > +{
> > + if (__irqtimings->handler)
> > + __irqtimings->handler(irq, ktime_get(), dev_id);
> > +}
>
> Here too, ktime_get() is daft.
What's the problem? ktime_xxx() itself or just the clock monotonic variant?
On 99.9999% of the platforms ktime_get_mono_fast/raw_fast is not any slower
than sched_clock(). The only case where sched_clock is faster is if your TSC
is buggered and the box switches to HPET for timekeeping.
But I wonder, whether this couldn't do with jiffies in the first place. If the
interrupt comes faster than a jiffie then you hardly go into some interesting
power state, but I might be wrong as usual :)
> Also, you really want to take the timestamp _before_ we call the
> handlers, not after, otherwise you mix in whatever variance exist in the
> handler duration.
That and we don't want to call it for each handler which returned handled. The
called code would do two samples in a row for the same interrupt in case of
two shared handlers which get raised at the same time. Not very likely, but
possible.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 19:40 ` Peter Zijlstra
@ 2016-01-20 19:57 ` Nicolas Pitre
2016-01-20 20:22 ` Peter Zijlstra
0 siblings, 1 reply; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-20 19:57 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Daniel Lezcano, tglx, rafael, linux-pm, linux-kernel, vincent.guittot
On Wed, 20 Jan 2016, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 05:00:33PM +0100, Daniel Lezcano wrote:
> > +static inline u32 stats_mean(struct stats *s)
> > +{
> > + /*
> > + * gcc is smart enough to convert to a bits shift when the
> > + * divisor is constant and multiple of 2^x.
> > + *
> > + * The number of values could have not reached STATS_NR_VALUES
> > + * yet, but we can consider it acceptable as the situation is
> > + * only at the beginning of the burst of irqs.
> > + */
> > + return s->sum / STATS_NR_VALUES;
> > +}
>
> Note that ->sum is u64, so you're very prone to truncation.
It won't. It is the sum of u32 values, so the mean of those values
can't exceed 32 bits.
Nicolas
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 19:57 ` Thomas Gleixner
@ 2016-01-20 20:04 ` Nicolas Pitre
2016-01-20 20:20 ` Peter Zijlstra
2016-01-21 9:50 ` Daniel Lezcano
2 siblings, 0 replies; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-20 20:04 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Peter Zijlstra, Daniel Lezcano, rafael, linux-pm, linux-kernel,
vincent.guittot
On Wed, 20 Jan 2016, Thomas Gleixner wrote:
> On Wed, 20 Jan 2016, Peter Zijlstra wrote:
>
> > On Wed, Jan 20, 2016 at 05:00:32PM +0100, Daniel Lezcano wrote:
> > > +++ b/kernel/irq/handle.c
> > > @@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
> > > /* Fall through to add to randomness */
> > > case IRQ_HANDLED:
> > > flags |= action->flags;
> > > + handle_irqtiming(irq, action->dev_id);
> > > break;
> > >
> > > default:
> >
> > > +++ b/kernel/irq/internals.h
> >
> > > +static inline void handle_irqtiming(unsigned int irq, void *dev_id)
> > > +{
> > > + if (__irqtimings->handler)
> > > + __irqtimings->handler(irq, ktime_get(), dev_id);
> > > +}
> >
> > Here too, ktime_get() is daft.
>
> What's the problem? ktime_xxx() itself or just the clock monotonic variant?
>
> On 99.9999% of the platforms ktime_get_mono_fast/raw_fast is not any slower
> than sched_clock(). The only case where sched_clock is faster is if your TSC
> is buggered and the box switches to HPET for timekeeping.
>
> But I wonder, whether this couldn't do with jiffies in the first place. If the
> interrupt comes faster than a jiffie then you hardly go into some interesting
> power state, but I might be wrong as usual :)
Jiffies are not precise enough for some power states, even more so with
HZ = 100 on many platforms.
Nicolas
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
@ 2016-01-20 20:14 ` Nicolas Pitre
2016-01-21 13:04 ` Daniel Lezcano
0 siblings, 1 reply; 63+ messages in thread
From: Nicolas Pitre @ 2016-01-20 20:14 UTC (permalink / raw)
To: Daniel Lezcano
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On Wed, 20 Jan 2016, Daniel Lezcano wrote:
[...]
One more comment:
> + /*
> + * If the mean value is null, just ignore this wakeup
> + * source.
> + */
> + mean = stats_mean(&w->stats);
> + if (!mean)
> + continue;
> +
> + variance = stats_variance(&w->stats, mean);
> + /*
> + * We want to check the last interval is:
> + *
> + * mean - stddev < interval < mean + stddev
> + *
> + * That simplifies to:
> + *
> + * -stddev < interval - mean < stddev
> + *
> + * abs(interval - mean) < stddev
> + *
> + * The standard deviation is the sqrt of the variance:
> + *
> + * abs(interval - mean) < sqrt(variance)
> + *
> + * and we want to prevent to do an sqrt, so we square
> + * the equation:
> + *
> + * (interval - mean)^2 < variance
> + *
> + * So if the latest value of the stats complies with
> + * this condition, then the wakeup source is
> + * considered predictable and can be used to predict
> + * the next event.
> + */
> + interval = w->stats.values[w->stats.w_ptr];
> + if ((u64)((interval - mean) * (interval - mean)) > variance)
> + continue;
> +
> + /*
> + * Let's compute the next event: the wakeup source is
> + * considered predictable, we add the average interval
> + * time added to the latest interruption event time.
> + */
> + next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
You don't need to call stats_mean() again as you have it in the 'mean'
variable already.
Nicolas
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 19:57 ` Thomas Gleixner
2016-01-20 20:04 ` Nicolas Pitre
@ 2016-01-20 20:20 ` Peter Zijlstra
2016-01-20 20:22 ` Thomas Gleixner
2016-01-21 9:50 ` Daniel Lezcano
2 siblings, 1 reply; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 20:20 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Daniel Lezcano, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On Wed, Jan 20, 2016 at 08:57:06PM +0100, Thomas Gleixner wrote:
> > Here too, ktime_get() is daft.
>
> What's the problem? ktime_xxx() itself or just the clock monotonic variant?
>
> On 99.9999% of the platforms ktime_get_mono_fast/raw_fast is not any slower
> than sched_clock(). The only case where sched_clock is faster is if your TSC
> is buggered and the box switches to HPET for timekeeping.
The HPET thing, I just don't want to have to explain why this is so much
more expensive on some random weird machine, esp. since it really
doesn't matter.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 20:20 ` Peter Zijlstra
@ 2016-01-20 20:22 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-20 20:22 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Daniel Lezcano, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On Wed, 20 Jan 2016, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 08:57:06PM +0100, Thomas Gleixner wrote:
> > > Here too, ktime_get() is daft.
> >
> > What's the problem? ktime_xxx() itself or just the clock monotonic variant?
> >
> > On 99.9999% of the platforms ktime_get_mono_fast/raw_fast is not any slower
> > than sched_clock(). The only case where sched_clock is faster is if your TSC
> > is buggered and the box switches to HPET for timekeeping.
>
> The HPET thing, I just don't want to have to explain why this is so much
> more expensive on some random weird machine, esp. since it really
> doesn't matter.
Fair enough.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 19:57 ` Nicolas Pitre
@ 2016-01-20 20:22 ` Peter Zijlstra
0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-20 20:22 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Daniel Lezcano, tglx, rafael, linux-pm, linux-kernel, vincent.guittot
On Wed, Jan 20, 2016 at 02:57:07PM -0500, Nicolas Pitre wrote:
> On Wed, 20 Jan 2016, Peter Zijlstra wrote:
>
> > On Wed, Jan 20, 2016 at 05:00:33PM +0100, Daniel Lezcano wrote:
> > > +static inline u32 stats_mean(struct stats *s)
> > > +{
> > > + /*
> > > + * gcc is smart enough to convert to a bits shift when the
> > > + * divisor is constant and multiple of 2^x.
> > > + *
> > > + * The number of values could have not reached STATS_NR_VALUES
> > > + * yet, but we can consider it acceptable as the situation is
> > > + * only at the beginning of the burst of irqs.
> > > + */
> > > + return s->sum / STATS_NR_VALUES;
> > > +}
> >
> > Note that ->sum is u64, so you're very prone to truncation.
>
> It won't. It is the sum of u32 values, so the mean of those values
> can't exceed 32 bits.
Ah indeed!
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 17:55 ` Thomas Gleixner
@ 2016-01-21 9:25 ` Daniel Lezcano
2016-01-21 10:27 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 9:25 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/20/2016 06:55 PM, Thomas Gleixner wrote:
> On Wed, 20 Jan 2016, Daniel Lezcano wrote:
>> +#ifdef CONFIG_IRQ_TIMINGS
>> +/**
>> + * struct irqt_ops - structure to be used by the subsystem to track
>> + * irq timings
>> + * @alloc: called when an irqdesc is allocated
>> + * @free: called when an irqdesc is free
>> + * @setup: called when an irq is setup, this is called under lock
>> + * @remove: called when an irq is removed
>> + * @handler: called when an interrupt is handled
>> + */
>> +struct irqtimings_ops {
>> + int (*alloc)(unsigned int);
>> + void (*free)(unsigned int);
>> + int (*setup)(unsigned int, struct irqaction *act);
>> + void (*remove)(unsigned int, void *dev_id);
>> + irqt_handler_t handler;
>> +};
>> +
>> +/**
>> + * This macro *must* be used by the subsystem interested by the irq
>> + * timing information.
>> + */
>> +#define DECLARE_IRQ_TIMINGS(__ops) \
>> + const struct irqtimings_ops *__irqtimings = __ops;
>> +#endif
>
>> @@ -20,6 +20,49 @@ extern bool noirqdebug;
>>
>> extern struct irqaction chained_action;
>>
>> +#ifdef CONFIG_IRQ_TIMINGS
>> +
>> +extern const struct irqtimings_ops *__irqtimings;
>> +
>> +static inline int alloc_irqtiming(unsigned int irq)
>> +{
>> + if (__irqtimings->alloc)
>> + return __irqtimings->alloc(irq);
>
> I really have a hard time to understand that indirection. __irqtimings is
> statically allocated and compiled in. There can be only one user for this in
> the system ever and that user has all callbacks populated.
>
> Why can't you spare all that pointer muck and simply have:
>
> #ifdef CONFIG_IRQ_TIMINGS
> int irqtiming_alloc(usigned int irq);
> ....
> #else
> static int irqtiming_alloc(usigned int irq) { return 0; }
> ...
> #endif
>
> and implement those functions in your idle thingy?
Hi Thomas,
yes sure, I can do something simpler.
Just to be sure, do you suggest to put the function declaration in
kernel/irq/internal.h and the function definition in
kernel/sched/idle-sched.c ?
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 19:07 ` Peter Zijlstra
2016-01-20 19:57 ` Thomas Gleixner
@ 2016-01-21 9:26 ` Daniel Lezcano
1 sibling, 0 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 9:26 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/20/2016 08:07 PM, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 05:00:32PM +0100, Daniel Lezcano wrote:
>> +++ b/kernel/irq/handle.c
>> @@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
>> /* Fall through to add to randomness */
>> case IRQ_HANDLED:
>> flags |= action->flags;
>> + handle_irqtiming(irq, action->dev_id);
>> break;
>>
>> default:
>
>> +++ b/kernel/irq/internals.h
>
>> +static inline void handle_irqtiming(unsigned int irq, void *dev_id)
>> +{
>> + if (__irqtimings->handler)
>> + __irqtimings->handler(irq, ktime_get(), dev_id);
>> +}
>
> Here too, ktime_get() is daft.
>
> Also, you really want to take the timestamp _before_ we call the
> handlers, not after, otherwise you mix in whatever variance exist in the
> handler duration.
Indeed.
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 19:57 ` Thomas Gleixner
2016-01-20 20:04 ` Nicolas Pitre
2016-01-20 20:20 ` Peter Zijlstra
@ 2016-01-21 9:50 ` Daniel Lezcano
2016-01-21 10:08 ` Peter Zijlstra
2016-01-21 13:52 ` Thomas Gleixner
2 siblings, 2 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 9:50 UTC (permalink / raw)
To: Thomas Gleixner, Peter Zijlstra
Cc: rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/20/2016 08:57 PM, Thomas Gleixner wrote:
> On Wed, 20 Jan 2016, Peter Zijlstra wrote:
>
>> On Wed, Jan 20, 2016 at 05:00:32PM +0100, Daniel Lezcano wrote:
>>> +++ b/kernel/irq/handle.c
>>> @@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
>>> /* Fall through to add to randomness */
>>> case IRQ_HANDLED:
>>> flags |= action->flags;
>>> + handle_irqtiming(irq, action->dev_id);
>>> break;
>>>
>>> default:
>>
>>> +++ b/kernel/irq/internals.h
>>
>>> +static inline void handle_irqtiming(unsigned int irq, void *dev_id)
>>> +{
>>> + if (__irqtimings->handler)
>>> + __irqtimings->handler(irq, ktime_get(), dev_id);
>>> +}
>>
>> Here too, ktime_get() is daft.
>
> What's the problem? ktime_xxx() itself or just the clock monotonic variant?
>
> On 99.9999% of the platforms ktime_get_mono_fast/raw_fast is not any slower
> than sched_clock(). The only case where sched_clock is faster is if your TSC
> is buggered and the box switches to HPET for timekeeping.
>
> But I wonder, whether this couldn't do with jiffies in the first place. If the
> interrupt comes faster than a jiffie then you hardly go into some interesting
> power state, but I might be wrong as usual :)
>
>> Also, you really want to take the timestamp _before_ we call the
>> handlers, not after, otherwise you mix in whatever variance exist in the
>> handler duration.
>
> That and we don't want to call it for each handler which returned handled. The
> called code would do two samples in a row for the same interrupt in case of
> two shared handlers which get raised at the same time. Not very likely, but
> possible.
Actually, the handle passes dev_id in order to let the irqtimings to
sort out a shared interrupt and prevent double sampling. In other words,
for shared interrupts, statistics should be per t-uple(irq , dev_id) but
that is something I did not implemented ATM.
IMO, the handler is at the right place. The prediction code does not
take care of the shared interrupts yet.
I tried to find a platform with shared interrupts in the ones I have
available around me but I did not find any. Are the shared interrupts
something used nowadays or coming from legacy hardware ? What is the
priority to handle the shared interrupts in the prediction code ?
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-20 19:28 ` Peter Zijlstra
@ 2016-01-21 9:53 ` Daniel Lezcano
0 siblings, 0 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 9:53 UTC (permalink / raw)
To: Peter Zijlstra
Cc: tglx, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/20/2016 08:28 PM, Peter Zijlstra wrote:
> On Wed, Jan 20, 2016 at 05:00:32PM +0100, Daniel Lezcano wrote:
>> +++ b/kernel/irq/handle.c
>> @@ -165,6 +165,7 @@ irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
>> /* Fall through to add to randomness */
>> case IRQ_HANDLED:
>> flags |= action->flags;
>> + handle_irqtiming(irq, action->dev_id);
>> break;
>
> This also looks completely busted for shared interrupts.
Hi Peter,
As explained in an answer to Thomas, in case of shared interrupts, it is
up to the prediction code to handle a tuple (irq, dev_id). The handler
itself is at the right place IMO.
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 17:46 ` Nicolas Pitre
2016-01-20 18:44 ` Peter Zijlstra
@ 2016-01-21 10:03 ` Daniel Lezcano
1 sibling, 0 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 10:03 UTC (permalink / raw)
To: Nicolas Pitre
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
Hi Nico,
On 01/20/2016 06:46 PM, Nicolas Pitre wrote:
> On Wed, 20 Jan 2016, Daniel Lezcano wrote:
>
>> Many IRQs are quiet most of the time, or they tend to come in bursts of
>> fairly equal time intervals within each burst. It is therefore possible
>> to detect those IRQs with stable intervals and guestimate when the next
>> IRQ event is most likely to happen.
>>
>> Examples of such IRQs may include audio related IRQs where the FIFO size
>> and/or DMA descriptor size with the sample rate create stable intervals,
>> block devices during large data transfers, etc. Even network streaming
>> of multimedia content creates patterns of periodic network interface IRQs
>> in some cases.
>>
>> This patch adds code to track the mean interval and variance for each IRQ
>> over a window of time intervals between IRQ events. Those statistics can
>> be used to assist cpuidle in selecting the most appropriate sleep state
>> by predicting the most likely time for the next interrupt.
>>
>> Because the stats are gathered in interrupt context, the core computation
>> is as light as possible.
>>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
>> ---
[ ... ]
>> +struct stats {
>> + u64 sum; /* sum of values */
>> + u32 values[STATS_NR_VALUES]; /* array of values */
>> + unsigned char w_ptr; /* current window pointer */
>
> Why did you change this from an unsigned int?
>
> This won't provide any memory space saving given that the structure has
> to be padded up to the next 64-bit boundary.
Ok, I will change it back to unsigned int.
[ ... ]
>> + for (i = 0; i < STATS_NR_VALUES; i++) {
>> + s64 diff = s->values[i] - mean;
>> + variance += (u64)diff * diff;
>> + }
>
> This is completely wrong. Even more wrong than it used to be. I must
> have expressed myself badly about this last time.
>
> To avoid any confusion, here's what the code should be:
>
> int i;
> u64 variance = 0;
>
> for (i = 0; i < STATS_NR_VALUES; i++) {
> s32 diff = s->values[i] - mean;
> variance += (s64)diff * diff;
> }
>
> [...]
Aah, ok :)
[ ... ]
>> + if (diff > (1 << 20)) {
>
> You could use the USEC_PER_SEC constant here. It is already widely used
> and would make the code even more obvious.
Indeed.
[ ... ]
>> + /*
>> + * There is no point attempting predictions on interrupts more
>> + * than 1 second apart. This has no benefit for sleep state
>> + * selection and increases the risk of overflowing our variance
>> + * computation. Reset all stats in that case.
>> + */
>
> This comment is wrong. It is relevant in sched_irq_timing_handler() but
> not here. Instead this should be something like:
>
> /*
> * This interrupt last triggered more than a second ago.
> * It is definitely not predictable for our purpose anymore.
> */
Ok.
[ ... ]
>> + interval = w->stats.values[w->stats.w_ptr];
>> + if ((u64)((interval - mean) * (interval - mean)) > variance)
>
> s/u64/s64/ please.
Noted.
Thanks Nico for the review.
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-21 9:50 ` Daniel Lezcano
@ 2016-01-21 10:08 ` Peter Zijlstra
2016-01-21 12:38 ` Daniel Lezcano
2016-01-21 20:27 ` Thomas Gleixner
2016-01-21 13:52 ` Thomas Gleixner
1 sibling, 2 replies; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-21 10:08 UTC (permalink / raw)
To: Daniel Lezcano
Cc: Thomas Gleixner, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On Thu, Jan 21, 2016 at 10:50:27AM +0100, Daniel Lezcano wrote:
> Actually, the handle passes dev_id in order to let the irqtimings to sort
> out a shared interrupt and prevent double sampling. In other words, for
> shared interrupts, statistics should be per t-uple(irq , dev_id) but that is
> something I did not implemented ATM.
>
> IMO, the handler is at the right place. The prediction code does not take
> care of the shared interrupts yet.
That certainly added to the confusion. But if you want per dev_id stats,
the whole alloc framework is 'broken' too, for it allocates the stuff
per irq.
> I tried to find a platform with shared interrupts in the ones I have
> available around me but I did not find any. Are the shared interrupts
> something used nowadays or coming from legacy hardware ? What is the
> priority to handle the shared interrupts in the prediction code ?
They're less common (thankfully) than they used to be, but I still have
them:
root@ivb-ep:~# cat /proc/interrupts | grep ","
59: 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 0
0 0 0 0 0 0 IO-APIC
5-fasteoi i801_smbus, i801_smbus
root@wsm-ep:~# cat /proc/interrupts | grep ","
18: 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC 18-fasteoi ehci_hcd:usb1, uhci_hcd:usb6
19: 9695230 19577242 13205011 3970578 740376 1138693 0 0 0 0 0 0 IO-APIC 19-fasteoi uhci_hcd:usb5, ata_piix
23: 3 0 0 0 927 0 0 0 0 0 0 0 IO-APIC 23-fasteoi ehci_hcd:usb2, uhci_hcd:usb4
root@snb:~# cat /proc/interrupts | grep ","
19: 11058485 0 0 0 0 0 0 0 IO-APIC 19-fasteoi ata_piix, ata_piix
Also there's a whole host of SOCs that has them.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-21 9:25 ` Daniel Lezcano
@ 2016-01-21 10:27 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-21 10:27 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Thu, 21 Jan 2016, Daniel Lezcano wrote:
> Just to be sure, do you suggest to put the function declaration in
> kernel/irq/internal.h and the function definition in kernel/sched/idle-sched.c
> ?
Well obviously the declarations and stub functions should be in a header which
is accessible for both.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-21 10:08 ` Peter Zijlstra
@ 2016-01-21 12:38 ` Daniel Lezcano
2016-01-21 20:27 ` Thomas Gleixner
1 sibling, 0 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 12:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Thomas Gleixner, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On 01/21/2016 11:08 AM, Peter Zijlstra wrote:
> On Thu, Jan 21, 2016 at 10:50:27AM +0100, Daniel Lezcano wrote:
>
>> Actually, the handle passes dev_id in order to let the irqtimings to sort
>> out a shared interrupt and prevent double sampling. In other words, for
>> shared interrupts, statistics should be per t-uple(irq , dev_id) but that is
>> something I did not implemented ATM.
>>
>> IMO, the handler is at the right place. The prediction code does not take
>> care of the shared interrupts yet.
>
> That certainly added to the confusion. But if you want per dev_id stats,
> the whole alloc framework is 'broken' too, for it allocates the stuff
> per irq.
Yep, that's correct. I was planning to re-work it later by handling the
shared interrupts, assuming they were not so common, but regarding the
examples below, that's wrong.
>> I tried to find a platform with shared interrupts in the ones I have
>> available around me but I did not find any. Are the shared interrupts
>> something used nowadays or coming from legacy hardware ? What is the
>> priority to handle the shared interrupts in the prediction code ?
>
> They're less common (thankfully) than they used to be, but I still have
> them:
>
> root@ivb-ep:~# cat /proc/interrupts | grep ","
> 59: 0 0 0 0 0 0
> 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0
> 0 0 0 0 0 0 0
> 0 0 0 0 0 0 IO-APIC
> 5-fasteoi i801_smbus, i801_smbus
>
> root@wsm-ep:~# cat /proc/interrupts | grep ","
> 18: 0 0 0 0 0 0 0 0 0 0 0 0 IO-APIC 18-fasteoi ehci_hcd:usb1, uhci_hcd:usb6
> 19: 9695230 19577242 13205011 3970578 740376 1138693 0 0 0 0 0 0 IO-APIC 19-fasteoi uhci_hcd:usb5, ata_piix
> 23: 3 0 0 0 927 0 0 0 0 0 0 0 IO-APIC 23-fasteoi ehci_hcd:usb2, uhci_hcd:usb4
>
> root@snb:~# cat /proc/interrupts | grep ","
> 19: 11058485 0 0 0 0 0 0 0 IO-APIC 19-fasteoi ata_piix, ata_piix
>
>
> Also there's a whole host of SOCs that has them.
Ah, I see. Thank you very much for these examples.
Sounds like, I have to handle the shared interrupts sooner than what I
was expecting ... :)
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 20:14 ` Nicolas Pitre
@ 2016-01-21 13:04 ` Daniel Lezcano
0 siblings, 0 replies; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 13:04 UTC (permalink / raw)
To: Nicolas Pitre
Cc: tglx, peterz, rafael, linux-pm, linux-kernel, vincent.guittot
On 01/20/2016 09:14 PM, Nicolas Pitre wrote:
> On Wed, 20 Jan 2016, Daniel Lezcano wrote:
> [...]
>
> One more comment:
>
>> + /*
>> + * If the mean value is null, just ignore this wakeup
>> + * source.
>> + */
>> + mean = stats_mean(&w->stats);
>> + if (!mean)
>> + continue;
>> +
>> + variance = stats_variance(&w->stats, mean);
>> + /*
>> + * We want to check the last interval is:
>> + *
>> + * mean - stddev < interval < mean + stddev
>> + *
>> + * That simplifies to:
>> + *
>> + * -stddev < interval - mean < stddev
>> + *
>> + * abs(interval - mean) < stddev
>> + *
>> + * The standard deviation is the sqrt of the variance:
>> + *
>> + * abs(interval - mean) < sqrt(variance)
>> + *
>> + * and we want to prevent to do an sqrt, so we square
>> + * the equation:
>> + *
>> + * (interval - mean)^2 < variance
>> + *
>> + * So if the latest value of the stats complies with
>> + * this condition, then the wakeup source is
>> + * considered predictable and can be used to predict
>> + * the next event.
>> + */
>> + interval = w->stats.values[w->stats.w_ptr];
>> + if ((u64)((interval - mean) * (interval - mean)) > variance)
>> + continue;
>> +
>> + /*
>> + * Let's compute the next event: the wakeup source is
>> + * considered predictable, we add the average interval
>> + * time added to the latest interruption event time.
>> + */
>> + next = ktime_add_us(w->timestamp, stats_mean(&w->stats));
>
> You don't need to call stats_mean() again as you have it in the 'mean'
> variable already.
Good point.
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-21 9:50 ` Daniel Lezcano
2016-01-21 10:08 ` Peter Zijlstra
@ 2016-01-21 13:52 ` Thomas Gleixner
2016-01-21 14:19 ` Daniel Lezcano
1 sibling, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-21 13:52 UTC (permalink / raw)
To: Daniel Lezcano
Cc: Peter Zijlstra, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On Thu, 21 Jan 2016, Daniel Lezcano wrote:
> On 01/20/2016 08:57 PM, Thomas Gleixner wrote:
> > That and we don't want to call it for each handler which returned handled.
> > The
> > called code would do two samples in a row for the same interrupt in case of
> > two shared handlers which get raised at the same time. Not very likely, but
> > possible.
>
> Actually, the handle passes dev_id in order to let the irqtimings to sort out
> a shared interrupt and prevent double sampling. In other words, for shared
> interrupts, statistics should be per t-uple(irq , dev_id) but that is
> something I did not implemented ATM.
So my comment about double sampling applies.
> IMO, the handler is at the right place. The prediction code does not take care
> of the shared interrupts yet.
>
> I tried to find a platform with shared interrupts in the ones I have available
> around me but I did not find any. Are the shared interrupts something used
> nowadays or coming from legacy hardware ? What is the priority to handle the
> shared interrupts in the prediction code ?
And why would that thing care about shared interruts at all? It's a legacy
burden and I really don't see a reason why that new thing which is targeted on
modern hardware should deal with them. Just treat them as a single interrupt
for now and be done with it.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-20 19:49 ` Thomas Gleixner
@ 2016-01-21 13:54 ` Daniel Lezcano
2016-01-21 14:12 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 13:54 UTC (permalink / raw)
To: Thomas Gleixner
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On 01/20/2016 08:49 PM, Thomas Gleixner wrote:
[ ... ]
Thanks for all your comments. I agree with them.
One question below.
>> +static void sched_irq_timing_free(unsigned int irq)
>> +{
>> + struct wakeup *w;
>> + int cpu;
>> +
>> + for_each_possible_cpu(cpu) {
>> +
>> + w = per_cpu(wakeups[irq], cpu);
>> + if (!w)
>> + continue;
>> +
>> + per_cpu(wakeups[irq], cpu) = NULL;
>> + kfree(w);
>
> Your simple array does not work. You need a radix_tree to handle SPARSE_IRQ
> and you need proper protection against teardown.
>
> So we can avoid all that stuff and simply stick that data into irqdesc and let
> the core handle it. That allows us to use proper percpu allocations and avoid
> that for_each_possible_cpu() sillyness.
>
> That leaves the iterator, but that's a solvable problem. We simply can have an
> iterator function in the irq core, which gives you the next sample
> structure. Something like this:
>
> struct irqtiming_sample *irqtiming_get_next(int *irq)
> {
> struct irq_desc *desc;
> int next;
>
> /* Do a racy lookup of the next allocated irq */
> next = irq_get_next_irq(*irq);
> if (next >= nr_irqs)
> return NULL;
>
> *irq = next + 1;
>
> /* Now lookup the descriptor. It's RCU protected. */
> desc = irq_to_desc(next);
> if (!desc || !desc->irqtimings || !(desc->istate & IRQS_TIMING))
> return NULL;
>
> return this_cpu_ptr(&desc->irqtimings);
> }
>
> And that needs to be called rcu protected;
>
> next = 0;
> rcu_read_lock();
> sample = irqtiming_get_next(&next);
> while (sample) {
> ....
> sample = irqtiming_get_next(&next);
> }
> rcu_read_unlock();
>
> So the interrupt part becomes:
>
> if (desc->istate & IRQS_TIMING)
> irqtimings_handle(__this_cpu_ptr(&desc->irqtimings));
>
> So now for the allocation/free of that data. We simply allocate/free it along
> with the irq descriptor. That IRQS_TIMING bit gets set in __setup_irq() except
> for timer interrupts. That's simple and avoid _all_ the issues.
Indeed, making this as part of the irq code makes everything much more
simple and self contained. For the shared interrupts, shouldn't we put
the timings samples into the irqaction structure instead of the irqdesc
structure ?
eg.
#define IRQT_MAX_VALUES 4
struct irqaction {
...
#ifdef CONFIG_IRQ_TIMINGS
u32 irqtimings_samples[IRQT_MAX_VALUES];
#endif
...
};
So we don't have to deal with the allocation/free under locks. The
drawback is the array won't be used in the case of the timers.
Does it make sense ?
Thanks Thomas for your help, your time and your suggestions.
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period
2016-01-21 13:54 ` Daniel Lezcano
@ 2016-01-21 14:12 ` Thomas Gleixner
0 siblings, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-21 14:12 UTC (permalink / raw)
To: Daniel Lezcano
Cc: peterz, rafael, linux-pm, linux-kernel, nicolas.pitre, vincent.guittot
On Thu, 21 Jan 2016, Daniel Lezcano wrote:
> On 01/20/2016 08:49 PM, Thomas Gleixner wrote:
> > So now for the allocation/free of that data. We simply allocate/free it
> > along
> > with the irq descriptor. That IRQS_TIMING bit gets set in __setup_irq()
> > except
> > for timer interrupts. That's simple and avoid _all_ the issues.
>
> Indeed, making this as part of the irq code makes everything much more simple
> and self contained. For the shared interrupts, shouldn't we put the timings
> samples into the irqaction structure instead of the irqdesc structure ?
>
> eg.
>
> #define IRQT_MAX_VALUES 4
>
> struct irqaction {
> ...
> #ifdef CONFIG_IRQ_TIMINGS
> u32 irqtimings_samples[IRQT_MAX_VALUES];
> #endif
> ...
> };
>
> So we don't have to deal with the allocation/free under locks. The drawback is
> the array won't be used in the case of the timers.
I still have to ask the question whether a per device information is useful at
all. I don't see the value, but I might miss something.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-21 13:52 ` Thomas Gleixner
@ 2016-01-21 14:19 ` Daniel Lezcano
2016-01-21 18:56 ` Thomas Gleixner
0 siblings, 1 reply; 63+ messages in thread
From: Daniel Lezcano @ 2016-01-21 14:19 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Peter Zijlstra, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On 01/21/2016 02:52 PM, Thomas Gleixner wrote:
> On Thu, 21 Jan 2016, Daniel Lezcano wrote:
>> On 01/20/2016 08:57 PM, Thomas Gleixner wrote:
>>> That and we don't want to call it for each handler which returned handled.
>>> The
>>> called code would do two samples in a row for the same interrupt in case of
>>> two shared handlers which get raised at the same time. Not very likely, but
>>> possible.
>>
>> Actually, the handle passes dev_id in order to let the irqtimings to sort out
>> a shared interrupt and prevent double sampling. In other words, for shared
>> interrupts, statistics should be per t-uple(irq , dev_id) but that is
>> something I did not implemented ATM.
>
> So my comment about double sampling applies.
>
>> IMO, the handler is at the right place. The prediction code does not take care
>> of the shared interrupts yet.
>>
>> I tried to find a platform with shared interrupts in the ones I have available
>> around me but I did not find any. Are the shared interrupts something used
>> nowadays or coming from legacy hardware ? What is the priority to handle the
>> shared interrupts in the prediction code ?
>
> And why would that thing care about shared interruts at all? It's a legacy
> burden and I really don't see a reason why that new thing which is targeted on
> modern hardware should deal with them. Just treat them as a single interrupt
> for now and be done with it.
I just sent an email about how handling them :)
If the shared interrupts are only related to old hardware, these ones
shouldn't have cpuidle, hence there is no need to enable the irq
timings. So you are right in this case and we can keep the feature simple.
On a other hand, Peter sent three examples of /proc/interrupts with
shared interrupts. I don't know how old are the platforms and what are
they, but it seems the shared irq are still used.
At this point I have two contradictory information.
For the best of my knowledge, I am inclined to agree with you.
Peter can you give your opinion ?
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-21 14:19 ` Daniel Lezcano
@ 2016-01-21 18:56 ` Thomas Gleixner
2016-01-22 10:15 ` Peter Zijlstra
0 siblings, 1 reply; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-21 18:56 UTC (permalink / raw)
To: Daniel Lezcano
Cc: Peter Zijlstra, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On Thu, 21 Jan 2016, Daniel Lezcano wrote:
> On 01/21/2016 02:52 PM, Thomas Gleixner wrote:
> > And why would that thing care about shared interruts at all? It's a legacy
> > burden and I really don't see a reason why that new thing which is
> > targeted on modern hardware should deal with them. Just treat them as a
> > single interrupt for now and be done with it.
>
> If the shared interrupts are only related to old hardware, these ones
> shouldn't have cpuidle, hence there is no need to enable the irq timings. So
> you are right in this case and we can keep the feature simple.
>
> On a other hand, Peter sent three examples of /proc/interrupts with shared
> interrupts. I don't know how old are the platforms and what are they, but it
> seems the shared irq are still used.
Well, its still there on x86 but slowly on the way out. What I meant with
legacy burden is, that the HW people finally got the idea that shared
interrupts are a horrible concept.
So we are seing them go away. My laptop (not the newest thingy) doesn't have
them anymore. Most devices use MSI now, except for the holdouts:
0: 20 0 0 0 IO-APIC-edge timer
1: 1 1 5 3 IO-APIC-edge i8042
8: 5 6 0 4 IO-APIC-edge rtc0
9: 294 1231 118 318 IO-APIC-fasteoi acpi
12: 96 1748 55 76 IO-APIC-edge i8042
18: 0 0 0 0 IO-APIC 18-fasteoi i801_smbus
23: 11 12 8 51 IO-APIC 23-fasteoi ehci_hcd:usb1
My latest server toy still has one shared entry:
18-fasteoi ehci_hcd:usb1, ehci_hcd:usb2, i801_smbus
which is just a complete braindamage on the hardware side. There are a
gazillion of free interrupt lines on that beast and of course they must route
3 devices to the same line.
We really should ignore that sillyness and if people complain, make them
complain to their HW vendor. That's the only way this crap will go away.
If we just keep on supporting this completely pointless nonsense the HW folks
will just not fix it.
We've been successful in the past to 'educate' hw people by making features
not available for mindless designs.
In this case we still support the feature, but it might be suboptimal. The
real interesting ports on that platform are MSI anyway, so I really couldn't
care less.
I have no idea how wide spread the shared nonsense is on the relevant ARM
platforms, but you might be able to figure that out faster than me.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-21 10:08 ` Peter Zijlstra
2016-01-21 12:38 ` Daniel Lezcano
@ 2016-01-21 20:27 ` Thomas Gleixner
1 sibling, 0 replies; 63+ messages in thread
From: Thomas Gleixner @ 2016-01-21 20:27 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Daniel Lezcano, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On Thu, 21 Jan 2016, Peter Zijlstra wrote:
> On Thu, Jan 21, 2016 at 10:50:27AM +0100, Daniel Lezcano wrote:
>
> > Actually, the handle passes dev_id in order to let the irqtimings to sort
> > out a shared interrupt and prevent double sampling. In other words, for
> > shared interrupts, statistics should be per t-uple(irq , dev_id) but that is
> > something I did not implemented ATM.
> >
> > IMO, the handler is at the right place. The prediction code does not take
> > care of the shared interrupts yet.
>
> That certainly added to the confusion. But if you want per dev_id stats,
> the whole alloc framework is 'broken' too, for it allocates the stuff
> per irq.
>
> > I tried to find a platform with shared interrupts in the ones I have
> > available around me but I did not find any. Are the shared interrupts
> > something used nowadays or coming from legacy hardware ? What is the
> > priority to handle the shared interrupts in the prediction code ?
>
> They're less common (thankfully) than they used to be, but I still have
> them:
>
> root@ivb-ep:~# cat /proc/interrupts | grep ","
> 0 IO-APIC 5-fasteoi i801_smbus, i801_smbus
Hardly something which is worth to add the extra complexity.
> root@wsm-ep:~# cat /proc/interrupts | grep ","
> 18: 0 0 IO-APIC 18-fasteoi ehci_hcd:usb1, uhci_hcd:usb6
> 23: 3 927 IO-APIC 23-fasteoi ehci_hcd:usb2, uhci_hcd:usb4
Stick the USB thingy into the XHCI port :)
> 19: 9695230 19577242 .... IO-APIC 19-fasteoi uhci_hcd:usb5, ata_piix
> root@snb:~# cat /proc/interrupts | grep ","
> 19: 11058485 IO-APIC 19-fasteoi ata_piix, ata_piix
Go to the BIOS and enable AHCI mode. Both the snb and the wsm-ep chipsets can
be switched between legacy piix and ahci mode :)
You seem to have a faible for last century hardware.
> Also there's a whole host of SOCs that has them.
And one of the reasons is, that a lot of drivers do not support msi, while
these embedded beasts support MSI on almost every peripheral, at least the
newer ones.
Thanks,
tglx
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC V2 1/2] irq: Add a framework to measure interrupt timings
2016-01-21 18:56 ` Thomas Gleixner
@ 2016-01-22 10:15 ` Peter Zijlstra
0 siblings, 0 replies; 63+ messages in thread
From: Peter Zijlstra @ 2016-01-22 10:15 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Daniel Lezcano, rafael, linux-pm, linux-kernel, nicolas.pitre,
vincent.guittot
On Thu, Jan 21, 2016 at 07:56:36PM +0100, Thomas Gleixner wrote:
> We really should ignore that sillyness and if people complain, make them
> complain to their HW vendor. That's the only way this crap will go away.
>
> If we just keep on supporting this completely pointless nonsense the HW folks
> will just not fix it.
>
> We've been successful in the past to 'educate' hw people by making features
> not available for mindless designs.
>
> In this case we still support the feature, but it might be suboptimal. The
> real interesting ports on that platform are MSI anyway, so I really couldn't
> care less.
I'm fine with not supporting shared interrupts, as long as the thing is
consistent.
^ permalink raw reply [flat|nested] 63+ messages in thread
end of thread, other threads:[~2016-01-22 10:15 UTC | newest]
Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-06 15:22 [RFC PATCH 0/2] IRQ based next prediction Daniel Lezcano
2016-01-06 15:22 ` [RFC PATCH 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
2016-01-08 15:31 ` Thomas Gleixner
2016-01-12 11:42 ` Daniel Lezcano
2016-01-06 15:22 ` [RFC PATCH 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
2016-01-06 17:40 ` Nicolas Pitre
2016-01-07 15:42 ` Daniel Lezcano
2016-01-12 19:27 ` Nicolas Pitre
2016-01-10 22:37 ` Daniel Lezcano
2016-01-10 22:46 ` Nicolas Pitre
2016-01-10 22:58 ` Daniel Lezcano
2016-01-10 23:13 ` Nicolas Pitre
2016-01-08 15:43 ` Thomas Gleixner
2016-01-12 12:41 ` Daniel Lezcano
2016-01-12 13:42 ` Thomas Gleixner
2016-01-12 14:16 ` Daniel Lezcano
2016-01-12 14:26 ` Thomas Gleixner
2016-01-12 14:52 ` Daniel Lezcano
2016-01-12 15:12 ` Thomas Gleixner
2016-01-12 16:04 ` Daniel Lezcano
2016-01-13 9:17 ` Thomas Gleixner
2016-01-18 13:21 ` Daniel Lezcano
2016-01-20 15:41 ` Thomas Gleixner
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
2016-01-20 17:55 ` Thomas Gleixner
2016-01-21 9:25 ` Daniel Lezcano
2016-01-21 10:27 ` Thomas Gleixner
2016-01-20 19:07 ` Peter Zijlstra
2016-01-20 19:57 ` Thomas Gleixner
2016-01-20 20:04 ` Nicolas Pitre
2016-01-20 20:20 ` Peter Zijlstra
2016-01-20 20:22 ` Thomas Gleixner
2016-01-21 9:50 ` Daniel Lezcano
2016-01-21 10:08 ` Peter Zijlstra
2016-01-21 12:38 ` Daniel Lezcano
2016-01-21 20:27 ` Thomas Gleixner
2016-01-21 13:52 ` Thomas Gleixner
2016-01-21 14:19 ` Daniel Lezcano
2016-01-21 18:56 ` Thomas Gleixner
2016-01-22 10:15 ` Peter Zijlstra
2016-01-21 9:26 ` Daniel Lezcano
2016-01-20 19:28 ` Peter Zijlstra
2016-01-21 9:53 ` Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
2016-01-20 17:46 ` Nicolas Pitre
2016-01-20 18:44 ` Peter Zijlstra
2016-01-21 10:03 ` Daniel Lezcano
2016-01-20 19:02 ` Peter Zijlstra
2016-01-20 19:17 ` Nicolas Pitre
2016-01-20 19:29 ` Peter Zijlstra
2016-01-20 19:34 ` Peter Zijlstra
2016-01-20 19:40 ` Peter Zijlstra
2016-01-20 19:57 ` Nicolas Pitre
2016-01-20 20:22 ` Peter Zijlstra
2016-01-20 19:49 ` Thomas Gleixner
2016-01-21 13:54 ` Daniel Lezcano
2016-01-21 14:12 ` Thomas Gleixner
2016-01-20 16:00 ` [RFC V2 0/2] IRQ based next prediction Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 1/2] irq: Add a framework to measure interrupt timings Daniel Lezcano
2016-01-20 16:00 ` [RFC V2 2/2] sched: idle: IRQ based next prediction for idle period Daniel Lezcano
2016-01-20 20:14 ` Nicolas Pitre
2016-01-21 13:04 ` Daniel Lezcano
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.